+---
+datasets:
+- ExponentialScience/DLT-Tweets
+- ExponentialScience/DLT-Patents
+- ExponentialScience/DLT-Scientific-Literature
+language:
+- en
+base_model:
+- allenai/scibert_scivocab_cased
+---
+# LedgerBERT
+## Model Description
+### Model Summary
+LedgerBERT is a domain-adapted language model specialized for the Distributed Ledger Technology (DLT) field. It was created through continual pre-training of SciBERT on the DLT-Corpus, a comprehensive collection of 2.98 billion tokens from scientific literature, patents, and social media focused on blockchain, cryptocurrencies, and distributed ledger systems.
+LedgerBERT captures DLT-specific terminology and concepts, making it particularly effective for NLP tasks involving blockchain technologies, cryptocurrency discourse, smart contracts, consensus mechanisms, and related domain-specific content.
+- **Developed by:** Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
+- **Model type:** BERT-base encoder (bidirectional transformer)
+- **Language:** English
+- **License:** CC-BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
+- **Base model:** SciBERT (allenai/scibert_scivocab_cased)
+- **Training corpus:** DLT-Corpus (2.98 billion tokens)
+### Model Architecture
+- **Architecture:** BERT-base
+- **Parameters:** 110 million
+- **Hidden size:** 768
+- **Number of layers:** 12
+- **Attention heads:** 12
+- **Vocabulary size:** 31,116 (SciBERT cased vocabulary, scivocab)
+- **Max sequence length:** 512 tokens
+## Intended Uses
+### Primary Use Cases
+LedgerBERT is designed for NLP tasks in the DLT domain, including, but not limited to:
+- **Named Entity Recognition (NER)**: Identifying DLT-specific entities such as consensus mechanisms (e.g., Proof of Stake), blockchain platforms (e.g., Ethereum, Hedera), and cryptographic concepts (e.g., Merkle tree, hashing)
+- **Text Classification**: Categorizing DLT-related documents, patents, or social media posts
+- **Sentiment Analysis**: Analyzing sentiment in cryptocurrency news and social media
+- **Information Extraction**: Extracting technical concepts and relationships from DLT literature
+- **Document Retrieval**: Building search systems for DLT content
+- **Question Answering (QA)**: Creating QA systems for blockchain and cryptocurrency topics
+### Out-of-Scope Uses
+- **Real-time trading systems**: LedgerBERT should not be used as the sole basis for automated trading decisions
+- **Investment advice**: Not suitable for providing financial or investment recommendations without proper disclaimers
+- **General-purpose NLP**: LedgerBERT retains general language understanding but is optimized for DLT-specific tasks; it is not intended as a replacement for general-domain models
+- **Legal or regulatory compliance**: Should not be used for legal interpretation without expert review
+## Training Details
+### Training Data
+LedgerBERT was continually pre-trained on the **DLT-Corpus**, consisting of:
+- **Scientific Literature**: 37,440 documents, 564M tokens (1978-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
+- **Patents**: 49,023 documents, 1,296M tokens (1990-2025). See https://huggingface.co/datasets/ExponentialScience/DLT-Patents
+- **Social Media**: 22.03M documents, 1,120M tokens (2013 to mid-2023). See https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
+**Total:** 22.12 million documents, 2.98 billion tokens
+For more details, see: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
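The per-source figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers listed in this card: the social media document count is the rounded 22.03M value, so the document total is approximate.

```python
# Figures copied from the Training Data list in this card.
# Tokens are in millions; the social media document count is the
# rounded "22.03M" value, so the document total is approximate.
corpus = {
    "scientific_literature": {"documents": 37_440, "tokens_m": 564},
    "patents": {"documents": 49_023, "tokens_m": 1_296},
    "social_media": {"documents": 22_030_000, "tokens_m": 1_120},
}

total_documents = sum(part["documents"] for part in corpus.values())
total_tokens_m = sum(part["tokens_m"] for part in corpus.values())

# Share of the total token count contributed by each source.
token_share = {
    name: part["tokens_m"] / total_tokens_m for name, part in corpus.items()
}

print(f"documents: {total_documents / 1e6:.2f} million")
print(f"tokens: {total_tokens_m / 1e3:.2f} billion")
for name, share in token_share.items():
    print(f"{name}: {share:.1%} of tokens")
```

Patents contribute the largest share of tokens, even though social media posts account for almost all of the documents.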
+### Training Procedure
+**Continual Pre-training:**
+Starting from SciBERT (which already captures multidisciplinary scientific content), LedgerBERT was trained using Masked Language Modeling (MLM) on the DLT-Corpus to adapt the model to DLT-specific terminology and concepts.
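A minimal sketch of the masking step behind MLM, for illustration only. It assumes the usual BERT recipe (of the selected tokens, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged) together with the 0.15 masking probability listed in the hyperparameters. The card only says "standard BERT masking", so the 80/10/10 split is an assumption. The function name, the toy vocabulary size, and the token ids are hypothetical; actual training would normally rely on a library data collator rather than this code.

```python
import random

IGNORE_INDEX = -100  # label for positions that do not contribute to the loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(token_ids):
        # Special tokens are never selected for prediction.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random vocabulary id
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Toy example with made-up ids: 2 = [CLS], 3 = [SEP], 4 = [MASK].
example_ids = [2, 17, 42, 58, 23, 91, 64, 3]
masked, labels = mask_tokens(
    example_ids, mask_id=4, vocab_size=100, special_ids={2, 3},
    rng=random.Random(0),
)
print(masked)
print(labels)
```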
+**Training hyperparameters:**
+- **Epochs:** 3
+- **Learning rate:** 5×10⁻⁵ with linear decay schedule
+- **MLM probability:** 0.15 (standard BERT masking)
+- **Warmup ratio:** 0.10
+- **Batch size:** 12 per device
+- **Sequence length:** 512 tokens
+- **Weight decay:** 0.01
+- **Optimizer:** Stable AdamW
+- **Precision:** bfloat16
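The learning-rate settings above can be read as the following schedule. This is a sketch under assumptions: linear warmup from zero over the first 10% of steps, then linear decay to zero, which is how linear schedules with warmup are commonly implemented. The card does not state the total number of optimizer steps, so `total_steps` below is a placeholder, and the function name is hypothetical.

```python
def learning_rate_at(step, total_steps, peak_lr=5e-5, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1_000  # placeholder; the real step count is not given in this card
for step in (0, 50, 100, 550, 1_000):
    print(step, learning_rate_at(step, total_steps))
```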
+## Limitations and Biases
+### Known Limitations
+- **Language coverage**: English only; does not support other languages
+- **Temporal coverage**: Training data extends to mid-2023 for social media; may not capture very recent terminology
+- **Domain specificity**: Optimized for DLT tasks; may underperform on general-purpose benchmarks compared to models like RoBERTa
+- **Context length**: Limited to 512 tokens; longer documents require truncation or chunking
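One way to work within the 512-token limit is to split a long token sequence into overlapping windows and encode each window separately. The sketch below operates on a plain list of token ids. The function name and the 64-token overlap are illustrative assumptions, not settings from this card. It reserves two positions per window on the assumption that `[CLS]` and `[SEP]` are added afterwards.

```python
def chunk_token_ids(token_ids, max_length=512, overlap=64, num_special=2):
    """Split token ids into overlapping windows that fit the model's limit."""
    window = max_length - num_special  # room left after [CLS] and [SEP]
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if not token_ids:
        return []
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += step
    return chunks


# Toy example: 1,200 consecutive integers stand in for real token ids.
long_document = list(range(1_200))
chunks = chunk_token_ids(long_document)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Per-chunk outputs then need to be combined for the task at hand, for example by averaging chunk embeddings or by merging token-level predictions in the overlapping regions.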
+### Potential Biases
+The model may reflect biases present in the training data:
+- **Geographic bias**: English-language sources may over-represent certain regions
+- **Platform bias**: Social media data only from Twitter/X; other platforms not represented
+- **Temporal bias**: More recent DLT developments are more heavily represented
+- **Market bias**: Training during periods of market volatility may influence sentiment understanding
+- **Source bias**: Certain cryptocurrencies (e.g., Bitcoin, Ethereum) are more discussed than others
+### Ethical Considerations
+- **Market manipulation risk**: Could be misused to analyze or generate content intended for market manipulation
+- **Investment decisions**: Should not be used as the sole basis for financial decisions without proper risk disclaimers
+- **Misinformation**: May reproduce or fail to identify false claims present in training data
+- **Privacy**: While usernames were removed from social media data, care should be taken not to re-identify individuals
+## How to Use
+### Basic Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
+model = AutoModel.from_pretrained("ExponentialScience/LedgerBERT")
+model.eval()  # disable dropout for inference
+# Example text
+text = "Ethereum uses a Proof of Stake consensus mechanism for transaction validation."
+# Tokenize and encode (inputs longer than 512 tokens are truncated)
+inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
+# Get contextual token embeddings without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+embeddings = outputs.last_hidden_state  # shape: (batch_size, sequence_length, 768)
+```
+### Fine-tuning for NER
+```python
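+from transformers import (
+    AutoModelForTokenClassification,
+    AutoTokenizer,
+    DataCollatorForTokenClassification,
+    Trainer,
+    TrainingArguments,
+)
+# Placeholder label set; replace with the tag scheme of your NER dataset
+label_list = ["O", "B-ENT", "I-ENT"]
+num_labels = len(label_list)
+# Load for token classification
+tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT")
+model = AutoModelForTokenClassification.from_pretrained(
+    "ExponentialScience/LedgerBERT",
+    num_labels=num_labels,  # Set based on your NER task
+)
+# Fine-tune on your dataset
+training_args = TrainingArguments(
+    output_dir="./results",
+    learning_rate=1e-5,
+    per_device_train_batch_size=16,
+    num_train_epochs=20,
+    warmup_steps=500,
+)
+# train_dataset and eval_dataset are not defined here: supply your own
+# tokenized datasets whose "labels" are aligned to the subword tokens
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    data_collator=DataCollatorForTokenClassification(tokenizer),  # pads inputs and labels per batch
+)
+trainer.train()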
+```
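Token-classification datasets are usually labeled per word, while the tokenizer may split a word into several subword tokens. A common convention, sketched below, is to label only the first subword of each word and set the remaining positions to -100 so that the loss ignores them. The `word_ids` list here is hand-written for illustration; with a fast tokenizer it would typically come from the `word_ids()` method of the tokenizer output. The function name and the label ids are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss


def align_labels_with_tokens(word_ids, word_labels):
    """Expand per-word labels to per-token labels for subword tokenization."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] map to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are ignored by the loss.
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Three words, where the second word was split into two subword tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 2]  # made-up label ids, one per word
token_labels = align_labels_with_tokens(word_ids, word_labels)
print(token_labels)
```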
+### Fine-tuning for Sentiment Analysis
+A fine-tuned version for market sentiment is available at: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+tokenizer = AutoTokenizer.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
+model = AutoModelForSequenceClassification.from_pretrained("ExponentialScience/LedgerBERT-Market-Sentiment")
+model.eval()  # disable dropout for inference
+text = "Bitcoin reaches new all-time high amid institutional adoption"
+inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
+with torch.no_grad():
+    outputs = model(**inputs)
+# Index of the highest-scoring class
+predictions = outputs.logits.argmax(dim=-1)
+# Map the class index to the label name stored in the model config
+label = model.config.id2label[predictions.item()]
+```
+## Citation
+If you use LedgerBERT in your research, please cite:
+```bibtex
+@article{hernandez2025dlt-corpus,
+  title={DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain},
+  author={Hernandez Cruz, Walter and Devine, Peter and Vadgama, Nikhil and Tasca, Paolo and Xu, Jiahua},
+  year={2025}
+}
+```
+## Related Resources
+- **DLT-Corpus Collection**: https://huggingface.co/collections/ExponentialScience/dlt-corpus-68e44e40d4e7a3bd7a224402
+- **Scientific Literature Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Scientific-Literature
+- **Patents Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Patents
+- **Social Media Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Tweets
+- **Sentiment Analysis Dataset**: https://huggingface.co/datasets/ExponentialScience/DLT-Sentiment-News
+- **Fine-tuned Market Sentiment Model**: https://huggingface.co/ExponentialScience/LedgerBERT-Market-Sentiment
+## Model Card Contact
+For questions or feedback about LedgerBERT, please open an issue on the model repository or contact the authors through the DLT-Corpus GitHub repository: https://github.com/dlt-science/DLT-Corpus