---
datasets:
- ExponentialScience/DLT-Tweets
- ExponentialScience/DLT-Patents
- ExponentialScience/DLT-Scientific-Literature
language:
- en
base_model:
- allenai/scibert_scivocab_cased
---
# LedgerBERT

## Model Description

### Model Summary

LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.

LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.

- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
- **Model type:** BERT-base encoder (bidirectional transformer)
- **Language:** English
- **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
- **Training corpus:** DLT-Corpus (2.98 billion tokens)

### Model Architecture

- **Architecture:** BERT-base
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens
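
These figures can be checked directly against the published configuration. A minimal sketch using the standard `transformers` config API (no weights are downloaded):

```python
from transformers import AutoConfig

# Load only the model configuration from the Hub
config = AutoConfig.from_pretrained("ExponentialScience/LedgerBERT")

print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.vocab_size)               # SciBERT vocabulary size
print(config.max_position_embeddings)  # 512
```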

## Intended Uses

### Primary Use Cases

LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:

- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle trees, hashing)
- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
- **Document Retrieval**: Building search systems for DLT content
- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics
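
As a quick qualitative check of the domain adaptation behind these use cases, a fill-mask probe can be run with the `transformers` pipeline. This is a sketch that assumes the released checkpoint retains the masked-language-modelling head from pre-training:

```python
from transformers import pipeline

# Assumption: the checkpoint ships with its MLM head, so "fill-mask" loads directly
fill_mask = pipeline("fill-mask", model="ExponentialScience/LedgerBERT")

# [MASK] is the BERT-style mask token used by the SciBERT tokenizer
for prediction in fill_mask("Ethereum transitioned from Proof of Work to Proof of [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```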

### Out-of-Scope Uses

- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
- **General-purpose NLP**: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review

## Training Details

### Training Data

LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:

- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media**: 22.03M documents, 1,120M tokens (2013 to mid-2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets

**Total:** 22.12 million documents, 2.98 billion tokens

For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
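
The three corpus components can be loaded with the `datasets` library. A minimal sketch, assuming each repository exposes a default configuration with a `train` split:

```python
from datasets import load_dataset

# Assumption: each dataset provides a default configuration with a "train" split
literature = load_dataset("ExponentialScience/DLT-Scientific-Literature", split="train")
patents = load_dataset("ExponentialScience/DLT-Patents", split="train")
tweets = load_dataset("ExponentialScience/DLT-Tweets", split="train")

print(len(literature), len(patents), len(tweets))
```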

### Training Procedure

**Continual Pre-training:**

Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.

**Training hyperparameters** (a configuration sketch follows this list):
- **Epochs:** 3
- **Learning rate:** 5×10⁻⁵ with a linear decay schedule
- **MLM probability:** 0.15 (standard BERT masking)
- **Warmup ratio:** 0.10
- **Batch size:** 12 per device
- **Sequence length:** 512 tokens
- **Weight decay:** 0.01
- **Optimizer:** Stable AdamW
- **Precision:** bfloat16
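
The sketch below shows how such a continual MLM pre-training run could be configured with the `transformers` Trainer. It is not the authors' training script: corpus preprocessing is omitted (`tokenized_corpus` is a placeholder for the tokenized DLT-Corpus), and it falls back to the default AdamW optimizer, whereas the reported runs used Stable AdamW.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continual pre-training starts from the SciBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# Standard BERT masking: 15% of tokens are masked for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./ledgerbert-continual",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_corpus,  # placeholder: DLT-Corpus tokenized to 512-token sequences
)
trainer.train()
```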


## Limitations and Biases

### Known Limitations

- **Language coverage**: English only; does not support other languages
- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking (see the sketch after this list)
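
For documents longer than 512 tokens, a common workaround is to split the text into overlapping windows; the tokenizer's built-in striding supports this directly (the 128-token stride below is an illustrative choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # placeholder for a document longer than 512 tokens

# Produce overlapping 512-token windows with a 128-token overlap between chunks
encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # (number_of_chunks, 512)
```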

### Potential Biases

The model may reflect biases present in the training data:

- **Geographic bias**: English-language sources may over-represent certain regions
- **Platform bias**: Social media data only from Twitter/X; other platforms not represented
- **Temporal bias**: More recent DLT developments are more heavily represented
- **Market bias**: Training data drawn from periods of high market volatility may skew the model's sentiment associations
- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are more discussed than others

### Ethical Considerations

- **Market manipulation risk**: Could potentially be misused for analyzing or generating content for market manipulation
- **Investment decisions**: Should not be used as sole basis for financial decisions without proper risk disclaimers
- **Misinformation**: May reproduce or fail to identify false claims present in training data
- **Privacy**: While usernames were removed from social media data, care should be taken not to re-identify individuals

## How to Use

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

# Example text
text = "Ethereum uses Proof of Stake consensus mechanism for transaction validation."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```
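
If a single vector per text is needed (e.g., for retrieval or clustering), a common choice is attention-mask-weighted mean pooling over the token embeddings. A minimal sketch, continuing from the variables above:

```python
import torch

# Mean-pool token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).to(embeddings.dtype)  # (batch, seq_len, 1)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```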

### Fine-tuning for NER

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels  # Set based on your NER task
)

# Fine-tune on your dataset
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized NER training split
    eval_dataset=eval_dataset     # your tokenized NER evaluation split
)

trainer.train()
```
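
The example above assumes `train_dataset` and `eval_dataset` already contain tokenized inputs with labels aligned to wordpieces. A minimal sketch of that alignment step for a single example, continuing from the tokenizer loaded above (requires a fast tokenizer; the words and label ids are hypothetical):

```python
# Hypothetical word-level example; label ids follow your own tag scheme
words = ["Ethereum", "uses", "Proof", "of", "Stake"]
word_labels = [1, 0, 2, 2, 2]  # e.g., 1 = platform entity, 2 = consensus-mechanism entity

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

labels = []
for word_id in encoding.word_ids():
    if word_id is None:
        labels.append(-100)  # special tokens ([CLS], [SEP]) are ignored by the loss
    else:
        labels.append(word_labels[word_id])  # every sub-token inherits its word's label
encoding["labels"] = labels
```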

### Fine-tuning for Sentiment Analysis

A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")

text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```

## Citation

If you use LedgerBERT in your research, please cite:

```bibtex
@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}
```

## Related Resources

- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

## Model Card Contact

For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus