---
library_name: transformers
base_model: deberta-v3-xsmall-quality-pretrain
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-quality
  results: []
license: mit
datasets:
- agentlans/text-quality
- allenai/c4
- HuggingFaceFW/fineweb-edu
- monology/pile-uncopyrighted
- agentlans/common-crawl-sample
- agentlans/wikipedia-paragraphs
language:
- en
pipeline_tag: text-classification
---

# English Text Quality Classifier

The **deberta-v3-xsmall-quality** model scores the quality of English text with a single composite score that combines the results of multiple classifiers. This gives a more thorough assessment than traditional educational-value metrics, which makes the model suitable for a variety of NLP and AI applications.

## Intended Uses & Limitations

**Intended Uses**:
- Quality assessment of text across various domains.
- Enhancing NLP applications by providing a robust measure of text quality.
- Supporting research and development in AI by offering insights into text quality metrics.

**Limitations**:
- The model's performance may vary depending on the specific characteristics of the input text.
- The model is a black box, so it is hard to explain why one text is rated as higher quality than another.
- It is essential to consider the context in which the model is applied, as different domains may have unique quality requirements.
- The model may still be biased towards non-fiction and educational genres.

## Training and Evaluation Data

The model was trained on the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset, which comprises **100,000 sentences** sourced from five distinct datasets, with **20,000 sentences** drawn from each of the following:

1. **allenai/c4**
2. **HuggingFaceFW/fineweb-edu**
3. **monology/pile-uncopyrighted**
4. **agentlans/common-crawl-sample**
5. **agentlans/wikipedia-paragraphs**

This diverse dataset enables the model to generalize well across different text types and domains.

90% of the rows were used for training and the remaining 10% for evaluation.

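
The 90/10 split described above can be illustrated with a minimal sketch. The card does not say how the split was made, so the shuffling step, the `split_rows` helper, and the use of seed 42 for the split are all assumptions made for illustration only.

```python
import random


def split_rows(rows, eval_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and split them into (train, eval) lists.

    Hypothetical helper: the actual split procedure is not documented.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for the 100,000 dataset rows
train_rows, eval_rows = split_rows(range(100_000))
print(len(train_rows), len(eval_rows))  # 90000 10000
```
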
## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-xsmall-quality"

# Load the tokenizer and model, then put the model on GPU if available, else CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def quality(text):
    """Processes the text using the model and returns its logits.
    In this case, it's interpreted as the combined quality score for that text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        # A batch of several texts yields a list of floats; a single text yields one float
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
text = [
    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
    "Page 1 2 3 4 5 Next Last>>",
    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!",
    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe."]

result = quality(text)
[round(x, 2) for x in result]
# Estimated quality for each text:
# [-0.89, -0.76, -0.7, 0.3, 1.64]
```

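
One possible way to use these scores is to keep only the texts above a cutoff. The sketch below reuses the example scores listed above so that it runs without downloading the model. The `filter_by_quality` helper and the threshold of `0.0` are illustrative assumptions, not values recommended by the model card; a suitable cutoff likely depends on the domain.

```python
def filter_by_quality(texts, scores, threshold=0.0):
    """Return the texts whose quality score is at least `threshold`.

    Hypothetical helper; the threshold is an arbitrary illustrative value.
    """
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    return [t for t, s in zip(texts, scores) if s >= threshold]


# Short labels standing in for the five example texts, with their example scores
labels = ["gift card spam", "pagination", "phishing", "tree planting post", "climate sentence"]
scores = [-0.89, -0.76, -0.7, 0.3, 1.64]

kept = filter_by_quality(labels, scores, threshold=0.0)
print(kept)  # ['tree planting post', 'climate sentence']
```

In practice, `scores` would come from calling `quality(...)` on a list of texts.
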
## Training Procedure

<details>
<summary>Training hyperparameters, results, framework</summary>

### Training Hyperparameters

The following hyperparameters were used during training:
- **Learning Rate**: 5e-05
- **Training Batch Size**: 8
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler Type**: Linear
- **Number of Epochs**: 3.0

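
For readers who want to reproduce a similar setup, the hyperparameters above map onto keyword arguments of `transformers.TrainingArguments` roughly as follows. This is a sketch rather than the author's training script: the `build_training_arguments` helper and the `output_dir` value are hypothetical, and the listed batch sizes are assumed to be per-device sizes.

```python
# Listed hyperparameters expressed as TrainingArguments keyword arguments
hyperparameters = {
    "learning_rate": 5e-05,
    "per_device_train_batch_size": 8,  # assumption: listed size is per device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def build_training_arguments(output_dir="quality-model"):
    """Hypothetical helper: build TrainingArguments from the values above."""
    # Imported lazily so the dictionary can be inspected without transformers installed
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **hyperparameters)
```
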
### Training Results

- **Loss**: 0.0924
- **MSE**: 0.0924
- **Num Input Tokens Seen**: 34560000

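
A few quantities can be derived from the reported numbers. The arithmetic below assumes a single device with no gradient accumulation and that every training row was seen once per epoch; none of this is stated in the card. Under those assumptions the token count works out to 128 tokens per row per epoch, which would be consistent with inputs padded or truncated to 128 tokens, though the card does not confirm a maximum length.

```python
import math

total_rows = 100_000
epochs = 3
batch_size = 8             # assumption: one device, no gradient accumulation
tokens_seen = 34_560_000
mse = 0.0924

train_rows = total_rows * 9 // 10                      # 90% of the rows
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
tokens_per_row = tokens_seen / (train_rows * epochs)
rmse = math.sqrt(mse)                                  # error in score units

print(steps_per_epoch, total_steps)  # 11250 33750
print(tokens_per_row)                # 128.0
print(round(rmse, 3))                # 0.304
```

An RMSE of about 0.30 gives a rough sense of scale against the example scores above, which range from about -0.9 to 1.6.
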
### Framework Versions

The model was developed using the following frameworks and libraries:
- Transformers 4.45.1
- PyTorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.20.0

</details>