---
datasets:
- ExponentialScience/DLT-Tweets
- ExponentialScience/DLT-Patents
- ExponentialScience/DLT-Scientific-Literature
language:
- en
base_model:
- allenai/scibert_scivocab_cased
---
# LedgerBERT
## Model Description
### Model Summary
LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.
LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.
- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
- **Model type:** BERT-base encoder (bidirectional transformer)
- **Language:** English
- **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
- **Training corpus:** DLT-Corpus (2.98 billion tokens)
### Model Architecture
- **Architecture:** BERT-base
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens
## Intended Uses
### Primary Use Cases
LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:
- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle trees, hashing)
- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
- **Document Retrieval**: Building search systems for DLT content
- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics
### Out-of-Scope Uses
- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
- **General-purpose NLP**: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review
## Training Details
### Training Data
LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:
- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media**: 22.03M documents, 1,120M tokens (2013-mid 2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
**Total:** 22.12 million documents, 2.98 billion tokens
For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
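The corpus components are published as separate datasets on the Hugging Face Hub. The snippet below is a minimal sketch of loading them with the `datasets` library; the split name (`"train"`) and column layout are assumptions, so check each dataset card for the actual configuration.
```python
from datasets import load_dataset

# Load the three DLT-Corpus components from the Hugging Face Hub.
# The split name is an assumption; see the individual dataset cards.
literature = load_dataset("ExponentialScience/DLT-Scientific-Literature", split="train")
patents = load_dataset("ExponentialScience/DLT-Patents", split="train")
tweets = load_dataset("ExponentialScience/DLT-Tweets", split="train")

print(literature)  # inspect the available columns and number of rows
```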
### Training Procedure
**Continual Pre-training:**
Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.
**Training hyperparameters:**
- **Epochs:** 3
- **Learning rate:** 5×10⁻⁵ with linear decay schedule
- **MLM probability:** 0.15 (standard BERT masking)
- **Warmup ratio:** 0.10
- **Batch size:** 12 per device
- **Sequence length:** 512 tokens
- **Weight decay:** 0.01
- **Optimizer:** Stable AdamW
- **Precision:** bfloat16
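The snippet below is a minimal sketch of how such a continual pre-training run can be set up with the Hugging Face `Trainer` and the hyperparameters listed above. It is not the authors' exact training script: the corpus loading, the `"text"` column name, and the optimizer (the standard AdamW used by `Trainer`, rather than Stable AdamW) are assumptions.
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Start from SciBERT and adapt it to the DLT domain with masked language modelling.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# Assumed corpus loading; the real run used the full 2.98B-token DLT-Corpus.
corpus = load_dataset("ExponentialScience/DLT-Scientific-Literature", split="train")

def tokenize(batch):
    # "text" is an assumed column name; adjust it to the dataset schema.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

# 15% dynamic masking, as in standard BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./ledgerbert-mlm",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```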
## Limitations and Biases
### Known Limitations
- **Language coverage**: English only; does not support other languages
- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking
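For documents longer than 512 tokens, one simple option is the fast tokenizer's built-in sliding-window chunking. The snippet below is a sketch of that approach; the stride value is an arbitrary example, and how the per-window outputs are aggregated is left to the downstream task.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # a DLT document longer than 512 tokens

# Split into overlapping 512-token windows; stride controls the overlap.
chunks = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(
        input_ids=chunks["input_ids"],
        attention_mask=chunks["attention_mask"],
    )

# One token-level representation per window; aggregate as needed.
print(outputs.last_hidden_state.shape)  # (num_windows, 512, 768)
```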
### Potential Biases
The model may reflect biases present in the training data:
- **Geographic bias**: English-language sources may over-represent certain regions
- **Platform bias**: Social media data only from Twitter/X; other platforms not represented
- **Temporal bias**: More recent DLT developments are more heavily represented
- **Market bias**: Training during periods of market volatility may influence sentiment understanding
- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are more discussed than others
### Ethical Considerations
- **Market manipulation risk**: Could potentially be misused for analyzing or generating content for market manipulation
- **Investment decisions**: Should not be used as sole basis for financial decisions without proper risk disclaimers
- **Misinformation**: May reproduce or fail to identify false claims present in training data
- **Privacy**: While usernames were removed from social media data, care should be taken not to re-identify individuals
## How to Use
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModel
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")
# Example text
text = "Ethereum uses the Proof of Stake consensus mechanism for transaction validation."
# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```
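`last_hidden_state` holds one vector per token. If a single fixed-size vector per document is needed (for example, for the document-retrieval use case listed above), one common approach, not prescribed by this model card, is attention-mask-weighted mean pooling. Continuing from the snippet above:
```python
import torch

# Mean-pool token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()           # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (batch_size, 768)

# Normalised embeddings can be compared with a dot product to rank documents.
normalized = torch.nn.functional.normalize(sentence_embedding, dim=-1)
```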
### Fine-tuning for NER
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
num_labels = 9  # example value: set this to the number of NER label classes in your dataset
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels,
)
# Fine-tune on your dataset
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized training split with aligned labels
    eval_dataset=eval_dataset,    # your tokenized evaluation split
)
trainer.train()
```
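The `train_dataset` and `eval_dataset` above must contain tokenized inputs with token-level labels aligned to LedgerBERT's wordpieces. One common way to build them (an illustrative sketch, not a prescription of this model card) is to align word-level NER tags using the fast tokenizer's `word_ids()`:
```python
def tokenize_and_align_labels(examples):
    """Tokenize pre-split words and align word-level NER tags to wordpieces.

    Assumes each example has "tokens" (a list of words) and "ner_tags"
    (a list of label ids); these field names are illustrative.
    """
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=512,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous_word = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)        # special tokens are ignored by the loss
            elif word_id != previous_word:
                labels.append(word_labels[word_id])
            else:
                labels.append(-100)        # label only the first wordpiece of each word
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

# Example: train_dataset = raw_dataset["train"].map(tokenize_and_align_labels, batched=True)
```
In practice you would also pass `DataCollatorForTokenClassification(tokenizer)` to the `Trainer` so that batches are padded dynamically.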
### Fine-tuning for Sentiment Analysis
A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```
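To turn the predicted class index into a human-readable label, the fine-tuned model's `id2label` mapping can be used; the exact label names depend on that model's configuration. Continuing from the snippet above:
```python
# Map the predicted class index to its label name from the model config.
predicted_label = model.config.id2label[predictions.item()]
print(predicted_label)
```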
## Citation
If you use LedgerBERT in your research, please cite:
```bibtex
@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}
```
## Related Resources
- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment
## Model Card Contact
For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus