---
datasets:
- ExponentialScience/DLT-Tweets
- ExponentialScience/DLT-Patents
- ExponentialScience/DLT-Scientific-Literature
language:
- en
license: cc-by-nc-4.0
base_model:
- allenai/scibert_scivocab_cased
---
# LedgerBERT

## Model Description

### Model Summary

LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.

LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.

- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
- **Model type:** BERT-base encoder (bidirectional transformer)
- **Language:** English
- **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
- **Training corpus:** DLT-Corpus (2.98 billion tokens)

### Model Architecture

- **Architecture:** BERT-base
- **Parameters:** 110 million
- **Hidden size:** 768
- **Number of layers:** 12
- **Attention heads:** 12
- **Vocabulary size:** 30,522 (SciBERT vocabulary)
- **Max sequence length:** 512 tokens

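As a quick sanity check, these values can be read directly from the published configuration without downloading the weights (a minimal sketch; the printed values should match the list above):

```python
from transformers import AutoConfig

# Load only the configuration file for the checkpoint
config = AutoConfig.from_pretrained("ExponentialScience/LedgerBERT")

print(config.model_type)               # expected: "bert"
print(config.hidden_size)              # expected: 768
print(config.num_hidden_layers)        # expected: 12
print(config.num_attention_heads)      # expected: 12
print(config.vocab_size)               # vocabulary size of the tokenizer/embeddings
print(config.max_position_embeddings)  # expected: 512
```
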
## Intended Uses

### Primary Use Cases

LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:

- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), cryptographic concepts (e.g., Merkle tree, hashing)
- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
- **Document Retrieval**: Building search systems for DLT content
- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics

### Out-of-Scope Uses

- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
- **General-purpose NLP**: While LedgerBERT maintains general language understanding, it is optimized for DLT-specific tasks
- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review

## Training Details

### Training Data

LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:

- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media**: 22.03M documents, 1,120M tokens (2013 to mid-2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets

**Total:** 22.12 million documents, 2.98 billion tokens

For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402

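The three components are published as separate datasets on the Hub, so they can be inspected with the `datasets` library. A minimal sketch, streaming a few examples rather than downloading everything (the `train` split name and the column layout are assumptions; check each dataset card):

```python
from itertools import islice

from datasets import load_dataset

# Stream rather than download the full 22M-document tweets component;
# the "train" split name is an assumption - check the dataset card.
tweets = load_dataset("ExponentialScience/DLT-Tweets", split="train", streaming=True)

for example in islice(tweets, 3):
    print(example)  # inspect the actual column names before further processing
```
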
### Training Procedure

**Continual Pre-training:**

Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.

**Training hyperparameters:**

- **Epochs:** 3
- **Learning rate:** 5×10⁻⁵ with linear decay schedule
- **MLM probability:** 0.15 (standard BERT masking)
- **Warmup ratio:** 0.10
- **Batch size:** 12 per device
- **Sequence length:** 512 tokens
- **Weight decay:** 0.01
- **Optimizer:** Stable AdamW
- **Precision:** bfloat16

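For illustration, a rough sketch of what this continual pre-training setup looks like with the `transformers` Trainer, using the hyperparameters listed above. This is not the authors' exact training script: the optimizer below is plain AdamW rather than Stable AdamW, and `lm_dataset` stands in for a tokenized, 512-token-chunked version of the DLT-Corpus.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from SciBERT and keep its vocabulary
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_cased")

# 15% random masking, as in standard BERT MLM pre-training
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./ledgerbert-mlm",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.10,
    per_device_train_batch_size=12,
    weight_decay=0.01,
    bf16=True,  # bfloat16 precision
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_dataset,  # tokenized DLT-Corpus (512-token sequences), not shown here
    data_collator=collator,
)
trainer.train()
```
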
## Limitations and Biases

### Known Limitations

- **Language coverage**: English only; does not support other languages
- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking (see the sketch below)

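For the 512-token limit, one workaround is to split long inputs into overlapping windows at tokenization time and aggregate the per-window representations. A minimal sketch (the 64-token stride and the [CLS] averaging are arbitrary choices, not a recommendation from the authors):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

long_text = "..."  # a document longer than 512 tokens

# Split into overlapping 512-token windows with a 64-token stride
enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    # Pass only the tensors the model expects (the tokenizer also returns
    # an overflow_to_sample_mapping entry that the model does not accept)
    outputs = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

# One [CLS] vector per window; average them for a document-level representation
doc_embedding = outputs.last_hidden_state[:, 0, :].mean(dim=0)
```
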
### Potential Biases

The model may reflect biases present in the training data:

- **Geographic bias**: English-language sources may over-represent certain regions
- **Platform bias**: Social media data comes only from Twitter/X; other platforms are not represented
- **Temporal bias**: More recent DLT developments are more heavily represented
- **Market bias**: Training during periods of market volatility may influence sentiment understanding
- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are discussed more than others

### Ethical Considerations

- **Market manipulation risk**: Could potentially be misused for analyzing or generating content for market manipulation
- **Investment decisions**: Should not be used as the sole basis for financial decisions without proper risk disclaimers
- **Misinformation**: May reproduce or fail to identify false claims present in the training data
- **Privacy**: Although usernames were removed from the social media data, care should be taken not to re-identify individuals

## How to Use

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

# Example text
text = "Ethereum uses the Proof of Stake consensus mechanism for transaction validation."

# Tokenize and encode
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# Get embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state
```

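For retrieval-style use cases, a common recipe is to mean-pool the token embeddings over the attention mask and compare texts by cosine similarity. A minimal sketch along those lines; note that LedgerBERT was not trained as a sentence-embedding model, so treat this as a starting point rather than a tuned retriever:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")

def embed(texts):
    """Mean-pool token embeddings over the attention mask."""
    inputs = tokenizer(texts, return_tensors="pt", max_length=512,
                       truncation=True, padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (batch, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)

docs = [
    "Proof of Stake replaces energy-intensive mining with validator staking.",
    "A Merkle tree allows efficient verification of transaction inclusion.",
]
query = embed(["How does staking-based consensus work?"])
scores = F.cosine_similarity(query, embed(docs))
print(scores)  # higher score = more similar document
```
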
### Fine-tuning for NER

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# Load for token classification
tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
model = AutoModelForTokenClassification.from_pretrained(
    "ExponentialScience/LedgerBERT",
    num_labels=num_labels  # Set based on your NER task
)

# Fine-tune on your dataset
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    num_train_epochs=20,
    warmup_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
```

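The snippet above assumes `num_labels`, `train_dataset` and `eval_dataset` already exist as a tokenized token-classification dataset. A minimal sketch of the usual wordpiece/label alignment step; the label names and the example sentence are illustrative only, not an official LedgerBERT tag set:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")

# Illustrative label set, word-level tokens and per-word label IDs
labels_list = ["O", "B-PLATFORM", "I-PLATFORM", "B-CONSENSUS", "I-CONSENSUS"]
words = ["Ethereum", "switched", "to", "Proof", "of", "Stake", "."]
word_labels = [1, 0, 0, 3, 4, 4, 0]

encoding = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)

# Align word-level labels to wordpieces; special tokens and
# continuation pieces get -100 so the loss ignores them
aligned = []
previous_word = None
for word_id in encoding.word_ids():
    if word_id is None or word_id == previous_word:
        aligned.append(-100)
    else:
        aligned.append(word_labels[word_id])
    previous_word = word_id

encoding["labels"] = aligned
# Apply the same alignment over your full corpus (e.g. with datasets.Dataset.map)
# to build train_dataset / eval_dataset, and set num_labels = len(labels_list).
```
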
### Fine-tuning for Sentiment Analysis

A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")

text = "Bitcoin reaches new all-time high amid institutional adoption"
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
```

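The predicted class index can be mapped back to a label name via the model configuration. A short follow-up to the snippet above (the actual label names depend on how the sentiment checkpoint was configured):

```python
# Map the predicted class index to the label name stored in the model config
label = model.config.id2label[predictions.item()]
print(label)
```
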
## Citation

If you use LedgerBERT in your research, please cite:

```bibtex
@article{hernandez2025dlt-corpus,
  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
  year={2025}
}
```

## Related Resources

- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment

## Model Card Contact

For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus