---
datasets:
- ExponentialScience/DLT-Sentiment-News
language:
- en
base_model:
- ExponentialScience/LedgerBERT
---
# LedgerBERT-Market-Sentiment
## Model Description
### Model Summary
LedgerBERT-Market-Sentiment is a fine-tuned version of LedgerBERT (https://huggingface.co/ExponentialScience/LedgerBERT) specialized for sentiment analysis of cryptocurrency and DLT-related content. The model classifies text into three market direction sentiment categories: **bullish** (positive market outlook), **bearish** (negative market outlook), and **neutral** (balanced or unclear market direction).
This model is particularly effective for analyzing cryptocurrency news headlines, social media posts, and other DLT-related content where understanding market sentiment is important.
- **Model type:** BERT-base encoder for sequence classification
- **Language:** English
- **License:** Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0)
- **Base model:** LedgerBERT (ExponentialScience/LedgerBERT)
- **Fine-tuning dataset:** DLT-Sentiment-News (23,301 examples)
- **Task:** Multi-class sentiment classification (3 classes)
### Model Architecture
- **Architecture:** BERT-base for sequence classification
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens
- **Output:** 3-class logits (bullish, bearish, neutral)
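
As a rough check on the 110 million figure, the parameter count can be reproduced from the dimensions listed above. The sketch below assumes the standard BERT-base layout, which this card does not spell out: a feed-forward size of 3,072, two token-type embeddings, a pooler layer, and a single linear 3-class head. The function name and its arguments are illustrative only.

```python
def bert_classifier_params(vocab=30522, hidden=768, layers=12, max_pos=512,
                           ffn=3072, type_vocab=2, num_labels=3):
    """Count parameters of a BERT-style encoder with a linear classification head.

    ffn, type_vocab, and the pooler are assumed standard BERT-base values.
    """
    layer_norm = 2 * hidden  # scale and bias vectors
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * per_layer + pooler + classifier


total = bert_classifier_params()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the total is about 109.5 million, consistent with the rounded figure above. The number of attention heads does not affect the count, because the heads split the same 768-dimensional projections.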
## Intended Uses
### Primary Use Cases
This model is designed for sentiment analysis tasks in the cryptocurrency and DLT domain:
- **Market sentiment analysis**: Analyzing sentiment in cryptocurrency news articles, headlines, and market commentary
- **Social media monitoring**: Understanding market direction sentiment in tweets, Reddit posts, and forum discussions
- **News aggregation**: Automatically categorizing cryptocurrency news by market sentiment
- **Research applications**: Studying sentiment trends and their relationship to market dynamics
- **Content filtering**: Organizing DLT content based on market outlook
### Example Applications
```text
# Analyzing news headlines
"Bitcoin surges to new all-time high" → Bullish
"Ethereum faces regulatory scrutiny" → Bearish
"Stablecoin market remains stable" → Neutral
# Social media sentiment
"To the moon! 🚀" → Bullish
"Another crypto winter incoming" → Bearish
"Waiting for clear market direction" → Neutral
```
### Out-of-Scope Uses
- **Investment decisions**: This model should NOT be used as the sole basis for making investment or trading decisions
- **Financial advice**: Not suitable for providing personalized financial or investment recommendations
- **Real-time trading**: Should not be used for automated high-frequency trading systems
- **Market manipulation**: Must not be used to coordinate or facilitate market manipulation
- **General sentiment analysis**: Optimized for market direction sentiment; may not perform well on general emotional sentiment
## Training Details
### Training Data
The model was fine-tuned on the **DLT-Sentiment-News dataset**, which contains:
- **Size:** 23,301 examples
- **Tokens:** 1.85 million tokens (average 79.51 tokens per example)
- **Temporal coverage:** January 2021 to May 2025
- **Source:** CryptoPanic platform cryptocurrency news headlines and descriptions
- **Labels:** Crowdsourced votes from active cryptocurrency community members
- **Classification method:** Percentile-based labeling (25th and 75th percentiles as boundaries)
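
To illustrate what percentile-based labeling with 25th and 75th percentile boundaries could look like, here is a minimal sketch. It is not the dataset's actual labeling code: the per-item score (a hypothetical net vote count), the function name, the interpolation method, and the handling of ties at the boundaries are all assumptions.

```python
from statistics import quantiles


def label_by_percentile(scores):
    """Label each score using the 25th and 75th percentiles of `scores` as boundaries."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    labels = []
    for score in scores:
        if score <= q1:
            labels.append("bearish")   # bottom quarter (assumed direction)
        elif score >= q3:
            labels.append("bullish")   # top quarter (assumed direction)
        else:
            labels.append("neutral")   # middle half
    return labels


# Hypothetical net vote scores (bullish votes minus bearish votes) per headline
scores = [-12, -3, 0, 1, 2, 4, 9, 20]
print(label_by_percentile(scores))
```

With this scheme roughly half of the examples fall into the neutral class by construction, which is one reason the boundaries matter for the resulting class balance.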
**Sentiment dimension used for this model:**
- **Market Direction:** bullish, bearish, neutral
The dataset provides domain expertise through crowdsourced annotations from cryptocurrency users, making the labels more relevant than general crowdworker annotations.
**Note:** News articles are absent from the DLT-Corpus used to pre-train LedgerBERT, so fine-tuning on news headlines tests how well the base model's language understanding generalizes out of domain.
For more details on the dataset used for fine-tuning, see: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
### Training Procedure
**Fine-tuning hyperparameters:**
- **Epochs:** 3
- **Learning rate:** 2×10⁻⁵
- **Warmup steps:** 500
- **Batch size:** 8 per device (training and evaluation)
- **Train/test split:** 90% training, 10% testing
- **Optimizer:** AdamW with fused operations
- **Precision:** bfloat16
- **Max sequence length:** 512 tokens (tokenizer default)
- **Truncation:** Enabled
- **Padding:** Enabled
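
To put the hyperparameters above in context, the following sketch works out the approximate number of optimizer steps and the share of training spent in warmup. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; none of these are stated in this card. The dictionary keys mirror parameter names of `transformers.TrainingArguments`, and mapping "AdamW with fused operations" to `adamw_torch_fused` is likewise an assumption.

```python
import math

# Values taken from this card
num_examples = 23_301
train_fraction = 0.9
epochs = 3
batch_size = 8
warmup_steps = 500

# Assumptions: single device, no gradient accumulation, last partial batch kept
train_examples = int(num_examples * train_fraction)
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_share = warmup_steps / total_steps

# Keyword arguments one might pass to transformers.TrainingArguments (assumed mapping)
training_kwargs = {
    "num_train_epochs": epochs,
    "learning_rate": 2e-5,
    "warmup_steps": warmup_steps,
    "per_device_train_batch_size": batch_size,
    "per_device_eval_batch_size": batch_size,
    "optim": "adamw_torch_fused",
    "bf16": True,
}

print(f"train examples: {train_examples}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warmup share: {warmup_share:.1%}")
```

Under these assumptions, training runs for 7,866 steps, so the 500 warmup steps cover roughly the first 6% of training.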
## Limitations and Biases
### Known Limitations
- **Temporal lag**: Not suitable for real-time sentiment analysis; trained on historical data (2021-2025)
- **Context dependency**: Headlines and descriptions lack full article context, which may affect sentiment interpretation
- **Language coverage**: English only; does not support other languages
- **Sarcasm and irony**: May struggle with nuanced language common in cryptocurrency discourse (e.g., "HFSP" - Have Fun Staying Poor)
- **Evolving terminology**: Cryptocurrency memes and terminology evolve rapidly; may not capture newest slang
- **Market volatility**: Sentiment can change rapidly after news publication; static predictions may become outdated quickly
### Potential Biases
The model may reflect biases present in the training data:
- **Platform bias**: Data from CryptoPanic users only; may not represent broader market sentiment
- **User bias**: Active crypto community members may have different perspectives than general investors
- **Temporal bias**: Training data spans 2021-2025, reflecting specific market conditions (bull markets, bear markets, crypto winters)
- **Source bias**: Certain news sources or cryptocurrencies may be over-represented in the training data
- **Geographic bias**: English-language news sources are over-represented
- **Market condition bias**: Dataset reflects specific market cycles that may not generalize to all conditions
### Data Collection Biases
- **Vote manipulation**: Despite quality filters, coordinated voting on the source platform cannot be completely ruled out
- **Minimum vote threshold**: Filtering by median votes may exclude less popular but valid sentiment signals
- **Percentile-based labeling**: Classification boundaries (25th/75th percentiles) are somewhat arbitrary
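
The effect of a median-based vote threshold can be seen in a small sketch. This is one possible reading of the filter, not the dataset's actual preprocessing: the `votes` field, the function name, and the choice to keep items at or above the median are hypothetical.

```python
from statistics import median


def filter_by_median_votes(items):
    """Keep only items whose vote count is at or above the median vote count."""
    threshold = median(item["votes"] for item in items)
    return [item for item in items if item["votes"] >= threshold]


# Hypothetical headlines with total vote counts
items = [
    {"title": "Headline A", "votes": 1},
    {"title": "Headline B", "votes": 5},
    {"title": "Headline C", "votes": 10},
    {"title": "Headline D", "votes": 50},
]
kept = filter_by_median_votes(items)
print([item["title"] for item in kept])
```

Here the two least-voted headlines are dropped regardless of how clear their sentiment is, which is the kind of exclusion the minimum vote threshold bullet describes.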
## How to Use
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ExponentialScience/LedgerBERT-Market-Sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Map class indices to labels. The order below may vary; check
# model.config.id2label for the mapping stored with the checkpoint.
labels = ["bearish", "bullish", "neutral"]

# Example texts
texts = [
    "Bitcoin reaches new all-time high amid institutional adoption",
    "SEC announces crackdown on cryptocurrency exchanges",
    "Ethereum network upgrade proceeding as planned",
]

# Classify sentiment
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    # Convert logits to class probabilities
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = predictions.argmax(dim=-1).item()
    sentiment = labels[predicted_class]
    confidence = predictions[0][predicted_class].item()
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.3f})\n")
```
### Batch Processing
```python
from transformers import pipeline

# Create sentiment analysis pipeline
classifier = pipeline(
    "text-classification",
    model="ExponentialScience/LedgerBERT-Market-Sentiment",
    tokenizer="ExponentialScience/LedgerBERT-Market-Sentiment",
)

# Process multiple texts
texts = [
    "DeFi protocol launches new staking mechanism",
    "Major cryptocurrency exchange faces liquidity crisis",
    "Blockchain adoption continues in enterprise sector",
]

# The pipeline returns one {"label", "score"} dict per input text
results = classifier(texts, truncation=True, max_length=512)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (score: {result['score']:.3f})\n")
```
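
For news aggregation, the per-text results can be tallied into a summary per label. The sketch below works on a list of `{"label", "score"}` dicts in the shape the pipeline returns; the sample values, the `summarize` helper, and its `min_score` cutoff are hypothetical, and the exact label strings depend on the checkpoint's configuration.

```python
from collections import Counter, defaultdict


def summarize(results, min_score=0.0):
    """Count predictions per label and average their scores, skipping low-score results."""
    counts = Counter()
    score_sums = defaultdict(float)
    for result in results:
        if result["score"] < min_score:
            continue  # ignore predictions below the confidence cutoff
        counts[result["label"]] += 1
        score_sums[result["label"]] += result["score"]
    return {
        label: {"count": n, "mean_score": score_sums[label] / n}
        for label, n in counts.items()
    }


# Hypothetical pipeline output for three headlines
results = [
    {"label": "bullish", "score": 0.91},
    {"label": "bearish", "score": 0.84},
    {"label": "bullish", "score": 0.67},
]
summary = summarize(results)
for label, stats in summary.items():
    print(f"{label}: {stats['count']} items, mean score {stats['mean_score']:.2f}")
```

Raising `min_score` trades coverage for precision: fewer headlines are counted, but the ones that remain are those the model scored more confidently.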
### Integration with News Feeds
```python
import feedparser
from transformers import pipeline

# Initialize classifier
classifier = pipeline(
    "text-classification",
    model="ExponentialScience/LedgerBERT-Market-Sentiment",
)

# Example: Analyze cryptocurrency news feed (placeholder URL; replace with a real RSS feed)
feed_url = "https://example-crypto-news.com/rss"
feed = feedparser.parse(feed_url)

for entry in feed.entries[:5]:  # Process first 5 entries
    title = entry.title
    result = classifier(title, truncation=True, max_length=512)[0]
    print(f"Headline: {title}")
    print(f"Market Sentiment: {result['label']} ({result['score']:.2%})")
    print(f"Link: {entry.link}\n")
```
## Citation
If you use LedgerBERT-Market-Sentiment in your research, please cite:
```bibtex
@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}
```
## Related Resources
- **Base Model (LedgerBERT)**: https://huggingface.co/ExponentialScience/LedgerBERT
- **Training Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
### Additional Fine-tuned Models
LedgerBERT can also be fine-tuned for other sentiment dimensions available in the DLT-Sentiment-News dataset (https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News):
- **Content Characteristics** (liked, disliked, neutral)
- **Engagement Quality** (important, lol, neutral)
## Model Card Contact
For questions or feedback about LedgerBERT-Market-Sentiment, please open an issue on the GitHub repository: https://github.com/dlt-science/DLT-Corpus
---
**⚠️ Important Disclaimer:** This model is provided for research and educational purposes only. It should not be used as financial advice or as the sole basis for investment decisions. Cryptocurrency markets are highly volatile and unpredictable. Always conduct your own research and consult with qualified financial advisors before making investment decisions.