
	---
	library_name: transformers
	base_model: deberta-v3-xsmall-quality-pretrain
	tags:
	- generated_from_trainer
	model-index:
	- name: deberta-v3-xsmall-quality
	  results: []
	license: mit
	datasets:
	- agentlans/text-quality
	- allenai/c4
	- HuggingFaceFW/fineweb-edu
	- monology/pile-uncopyrighted
	- agentlans/common-crawl-sample
	- agentlans/wikipedia-paragraphs
	language:
	- en
	pipeline_tag: text-classification
	---

	# English Text Quality Classifier

	The deberta-v3-xsmall-quality model scores the quality of English text with a single composite score that combines the results of multiple classifiers. The composite is intended to give a broader assessment than educational-value metrics alone, which makes it useful across a range of NLP and AI applications.

	## Intended Uses & Limitations

	Intended Uses:
	- Quality assessment of text across various domains.
	- Enhancing NLP applications by providing a robust measure of text quality.
	- Supporting research and development in AI by offering insights into text quality metrics.

	Limitations:
	- The model's performance may vary depending on the specific characteristics of the input text.
	- The model is a black box: it is hard to explain why one text is scored as higher quality than another.
	- It is essential to consider the context in which the model is applied, as different domains may have unique quality requirements.
	- The model may still be biased towards non-fiction and educational genres.

	## Training and Evaluation Data

	The model was trained on the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset comprising 100,000 sentences sourced from five distinct datasets, with 20,000 sentences drawn from each of the following:

	1. allenai/c4
	2. HuggingFaceFW/fineweb-edu
	3. monology/pile-uncopyrighted
	4. agentlans/common-crawl-sample
	5. agentlans/wikipedia-paragraphs

	This mix of sources is intended to help the model generalize across different text types and domains.

	90% of the rows were used for training and the remaining 10% for evaluation.
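
A 90/10 split like the one described above can be sketched on row indices. This is a minimal illustration, not the author's actual preprocessing: the `split_indices` helper, the shuffling step, and the seed value are all assumptions made for the example.

```python
import random

def split_indices(n_rows, eval_fraction=0.1, seed=42):
    """Shuffle row indices and split them into (train, eval) lists.

    The helper name, the shuffle, and the seed are hypothetical; the model
    card only states that 90% of rows were used for training.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = round(n_rows * eval_fraction)
    return indices[n_eval:], indices[:n_eval]

# 100,000 rows in total, as stated for the agentlans/text-quality dataset
train_idx, eval_idx = split_indices(100_000)
print(len(train_idx), len(eval_idx))
```

With 100,000 rows this leaves 90,000 rows for training and 10,000 for evaluation.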

	## How to use

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "agentlans/deberta-v3-xsmall-quality"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Put the model on the GPU if one is available, otherwise on the CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model = model.to(device)

	def quality(text):
	    """Run the model on the text and return its logits.
	    Each logit is interpreted as the combined quality score for that text.
	    Returns a list of floats for a list of several texts, or a single float for one text."""
	    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
	    with torch.no_grad():
	        logits = model(**inputs).logits.squeeze().cpu()
	    return logits.tolist()

	# Example usage
	text = [
	    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
	    "Page 1 2 3 4 5 Next Last>>",
	    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
	    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!",
	    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe.",
	]

	result = quality(text)
	print([round(x, 2) for x in result])  # Estimated quality for each text: [-0.89, -0.76, -0.7, 0.3, 1.64]
	```
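
Because higher scores correspond to higher estimated quality in the example above, one possible use is to filter a corpus by a score cutoff. The sketch below is only an illustration: `keep_high_quality` and the cutoff of 0.0 are hypothetical choices that do not come from this model card, and the short labels stand in for the five example texts, paired with their rounded scores so the snippet runs without loading the model.

```python
def keep_high_quality(texts, scores, threshold=0.0):
    """Return (text, score) pairs scoring above the threshold, best first.

    The helper name and the default threshold are assumptions for this
    sketch; choose a cutoff that suits your own data.
    """
    pairs = [(t, s) for t, s in zip(texts, scores) if s > threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Placeholder labels for the five example texts and their rounded scores
texts = ["prize spam", "pagination", "phishing", "tree planting", "climate"]
scores = [-0.89, -0.76, -0.7, 0.3, 1.64]

kept = keep_high_quality(texts, scores)
print(kept)
```

In practice `scores` would come from `quality(texts)`, and the cutoff is best chosen by inspecting scored samples from the target domain.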

	## Training Procedure

	<details>
	<summary>Training hyperparameters, results, framework</summary>

	### Training Hyperparameters

	The following hyperparameters were used during training:
	- Learning Rate: 5e-05
	- Training Batch Size: 8
	- Evaluation Batch Size: 8
	- Seed: 42
	- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
	- Learning Rate Scheduler Type: Linear
	- Number of Epochs: 3.0
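
To make the linear schedule concrete, the sketch below computes the number of optimizer steps and the learning rate at a few points. It rests on assumptions the card does not state: 90,000 training rows (90% of 100,000), a single device, no gradient accumulation, and no warmup. The `linear_lr` helper is hypothetical and only mirrors a plain linear decay to zero.

```python
import math

TRAIN_ROWS = 90_000   # assumption: 90% of the 100,000 rows
BATCH_SIZE = 8
EPOCHS = 3
PEAK_LR = 5e-5

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def linear_lr(step):
    """Learning rate after `step` updates under linear decay with no warmup."""
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / total_steps

print(total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Under these assumptions training runs for 33,750 steps, with the learning rate falling from 5e-05 to about 2.5e-05 at the midpoint and to zero at the end.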

	### Training Results

	- Loss: 0.0924
	- MSE: 0.0924
	- Number of Input Tokens Seen: 34,560,000
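
Two quick derived figures can help when reading these numbers. The sketch below converts the reported MSE into a root-mean-square error on the score scale and estimates the average number of tokens seen per row per epoch. The row count of 90,000 is an assumption (90% of 100,000 rows), and the per-row figure is only an inference from the totals, not something the card states.

```python
import math

mse = 0.0924
tokens_seen = 34_560_000
train_rows = 90_000   # assumption: 90% of the 100,000 rows
epochs = 3

# RMSE is in the same units as the quality score itself
rmse = math.sqrt(mse)

# Average tokens processed per row in each epoch
tokens_per_row = tokens_seen / (train_rows * epochs)

print(round(rmse, 3), tokens_per_row)
```

An RMSE of roughly 0.3 can be compared directly with the spread of example scores shown earlier (about -0.9 to 1.6). The figure of 128 tokens per row would be consistent with inputs padded or truncated to a fixed length of 128, though that setting is not confirmed here.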

	### Framework Versions

	The model was developed using the following frameworks and libraries:
	- Transformers 4.45.1
	- PyTorch 2.4.1+cu121
	- Datasets 3.0.1
	- Tokenizers 0.20.0
	</details>