VirtualInsight committed (verified) · commit 0185573 · parent 1501752

Update README.md

Files changed (1): README.md (+434 / -26)

README.md (updated):

---
language: en
license: mit
library_name: pytorch
tags:
- transformer
- gpt
- language-model
- from-scratch
- educational
---

# Model Card for LumenBase

<!-- Provide a quick summary of what the model is/does. -->

LumenBase is a 128M parameter GPT-style transformer language model built entirely from scratch for educational and research purposes. The model features modern architectural components including Grouped Multi-Query Attention (GQA), SwiGLU activation, RMSNorm, and Rotary Position Embeddings (RoPE).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

LumenBase is a foundational language model created entirely from scratch to explore every step of modern LLM development, from data preprocessing and tokenization to architecture design, training, and evaluation. This project implements a decoder-only transformer architecture with several modern optimizations (two of which are sketched below):

- **Grouped Multi-Query Attention (GQA)**: 12 query heads share 4 key-value heads (3 query heads per KV head), shrinking the KV cache
- **SwiGLU Activation**: Gated feed-forward activation combining Swish with a linear gating branch
- **RMSNorm**: Root-mean-square normalization for improved training stability
- **Rotary Position Embeddings (RoPE)**: Relative position encoding applied as rotations of query/key vectors
- **Weight Tying**: Shared weights between the token embedding and the output projection

The model was trained on custom datasets using mixed precision training (FP16/BF16) with gradient accumulation, a cosine annealing scheduler with linear warmup, and automatic checkpointing.
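
For readers new to these components, here is a minimal, self-contained PyTorch sketch of RMSNorm and a SwiGLU feed-forward block at the dimensions used by LumenBase (hidden size 768, intermediate size 3072). It is illustrative only and may differ in detail from the `ModelArchitecture` implementation in the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale features by their root mean square; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: down_proj( SiLU(gate_proj(x)) * up_proj(x) )."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072, bias: bool = False):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=bias)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=bias)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check on a dummy batch (batch=2, seq=16, hidden=768).
x = torch.randn(2, 16, 768)
print(SwiGLUFeedForward()(RMSNorm(768)(x)).shape)  # torch.Size([2, 16, 768])
```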

- **Developed by:** Hariom Jangra (HariomJangra)
- **Model type:** Decoder-only Transformer Language Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model [optional]:** N/A (trained from scratch)

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/HariomJangra/project-lumen
- **Paper [optional]:** N/A
- **Demo [optional]:** N/A

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

LumenBase can be used directly for:
- Text generation and completion tasks
- Educational purposes to understand transformer architecture and training
- Research on language model behavior and capabilities
- Baseline for fine-tuning on specific downstream tasks
- Understanding modern LLM architectural components

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

The model can be fine-tuned for:
- Instruction following
- Chat-based applications
- Domain-specific text generation
- Task-specific adaptations
- Further research on specialized applications

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model is **not suitable** for:
- Production deployments requiring high-quality generation
- Safety-critical applications
- Applications requiring factual accuracy without verification
- Generation of harmful, hateful, or biased content
- Large-scale commercial applications without proper evaluation

This is an educational/research implementation. For production use, consider established frameworks like Hugging Face Transformers.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

**Technical Limitations:**
- Limited model size (128M parameters) compared to larger production models
- Benchmark performance is well below that of state-of-the-art models
- May generate incoherent or nonsensical text for complex prompts
- Limited context window (2048 tokens)

**Bias and Social Limitations:**
- The model may perpetuate biases present in the training data
- Not evaluated for fairness across different demographic groups
- May generate inappropriate or offensive content
- Should not be relied upon for factual information without verification

**Research/Educational Nature:**
- This is a learning project, not optimized for production use
- Training data sources and quality may vary
- Limited testing across diverse use cases

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should:
- Be aware that this is an educational model with limited capabilities
- Not use it for safety-critical or production applications
- Verify all generated content before use
- Implement appropriate content filtering for downstream applications
- Consider the model's limitations when interpreting results
- Use established production-ready models for commercial applications

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from safetensors.torch import load_file
from ModelArchitecture import Transformer, ModelConfig, generate
from tokenizers import Tokenizer

# Load model configuration
config = ModelConfig(
    vocab_size=32000,
    hidden_size=768,
    n_heads=12,
    n_kv_heads=4,
    n_kv_groups=3,
    head_dim=64,
    n_layers=12,
    attention_bias=False,
    intermediate_size=3072,
    mlp_bias=False,
    eps=1e-5,
    dropout=0.0,
    max_position_embeddings=2048,
    pre_norm=True,
    tie_weights=True,
    max_seq_len=2048
)

# Initialize model
model = Transformer(config)

# Load trained weights (.safetensors files are loaded with safetensors, not torch.load)
state_dict = load_file('LumenBase.safetensors', device='cpu')
model.load_state_dict(state_dict)
model.eval()

# Load tokenizer
tokenizer = Tokenizer.from_file('LumenTokenizer.json')

# Generate text
prompt = "Once upon a time"
input_ids = torch.tensor([tokenizer.encode(prompt).ids])

output = generate(
    model=model,
    input_ids=input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    do_sample=True
)

generated_text = tokenizer.decode(output[0].tolist())
print(generated_text)
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was trained on custom datasets prepared specifically for this project. The training pipeline included:

1. **Dataset Preparation**: Text data collection and preprocessing
2. **Tokenization**: Custom BPE (Byte Pair Encoding) tokenizer trained with a vocabulary size of 32,000 tokens
3. **Data Processing**: Text tokenization and conversion to token IDs stored as NumPy arrays
   - TokenizedDataSet1.npy
   - TokenizedDataSet2.npy
   - TokenizedDataset3.npy

The training data was tokenized using a custom-trained tokenizer (LumenTokenizer) optimized for the target domain.
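
To make the tokenization steps concrete, the snippet below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library and dumps token IDs to a NumPy array. The file names (`corpus.txt`, the output `.npy`) and the special tokens are illustrative assumptions, not the project's exact recipe; only the 32,000-token vocabulary and the tokenizer/NumPy formats come from this card.

```python
import numpy as np
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE tokenizer with a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<bos>", "<eos>"],  # assumed special tokens
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("LumenTokenizer.json")

# Encode the corpus into a flat stream of token IDs and store it as a NumPy array.
with open("corpus.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids
np.save("TokenizedDataSet1.npy", np.array(ids, dtype=np.uint16))  # 32k vocab fits in uint16
```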

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

- Text cleaning and normalization
- BPE tokenization with 32K vocabulary
- Sequence chunking into 2048-token context windows
- Data stored in efficient NumPy format for fast loading

#### Training Hyperparameters

- **Optimizer**: AdamW (lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
- **Scheduler**: Linear warmup (2,000 steps) followed by cosine annealing
- **Batch Size**: 12 sequences per batch
- **Gradient Accumulation Steps**: 4 (effective batch size: 48)
- **Sequence Length**: 2048 tokens
- **Dropout**: 0.1 during training, 0.0 during inference
- **Gradient Clipping**: Max norm 1.0
- **Training regime**: Mixed precision (automatic FP16/BF16/FP32 based on hardware); a schematic training step is sketched below
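
The sketch below shows how these hyperparameters fit together in a single PyTorch training step: AdamW, linear warmup into cosine annealing, gradient accumulation, gradient clipping, and automatic mixed precision. The `model`, `data_loader`, and `total_steps` names are placeholders, and the repository's actual training loop may differ in structure.

```python
import math
import torch
import torch.nn.functional as F

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=2000):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
    def lr_lambda(step):
        if step < warmup_steps:                                  # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))        # cosine decay
    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train(model, data_loader, total_steps, accum_steps=4, device="cuda"):
    optimizer, scheduler = build_optimizer_and_scheduler(model, total_steps)
    scaler = torch.cuda.amp.GradScaler()                         # loss scaling for FP16
    model.train()
    for step, (inputs, targets) in enumerate(data_loader):
        with torch.autocast(device_type=device, dtype=torch.float16):
            logits = model(inputs.to(device))
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.to(device).view(-1))
        scaler.scale(loss / accum_steps).backward()              # accumulate gradients
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
```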

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- **Model Parameters**: 128M (128 million)
- **Model Size**: ~512 MB (FP32), ~256 MB (FP16)
- **Checkpoint Frequency**: Every N steps, with the best model saved automatically
- **Monitoring**: Training and validation loss curves
- **Final checkpoint**: best_model_params_110k.pt

![Training Loss Curve](PreTraining/images/training_loss_curve.png)

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

The model was evaluated on three standard NLP benchmarks:

1. **ARC-Easy** (AI2 Reasoning Challenge - Easy): 2,376 questions
2. **ARC-Challenge** (AI2 Reasoning Challenge - Challenge): 1,172 questions
3. **HellaSwag**: 1,024 examples for commonsense reasoning

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

The model was evaluated on:
- Multiple-choice question answering
- Commonsense reasoning
- Scientific reasoning
- Reading comprehension

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

**Accuracy**: The primary metric used for all benchmarks, measuring the percentage of correctly answered questions.
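
For a base language model with no classification head, multiple-choice accuracy is typically computed by scoring each candidate answer with the model's log-likelihood and picking the highest-scoring option. The sketch below illustrates that general scheme; it is an assumption, not the project's actual benchmark code in `PreTraining/Benchmark/`, and it assumes the model returns logits of shape (batch, seq, vocab) and that the prompt's token IDs are a prefix of the prompt-plus-choice encoding.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens, given the prompt."""
    prompt_ids = tokenizer.encode(prompt).ids
    full_ids = tokenizer.encode(prompt + " " + choice).ids
    logits = model(torch.tensor([full_ids]))              # (1, T, vocab)
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)      # position t predicts token t+1
    targets = torch.tensor(full_ids[1:])
    token_lp = logprobs[torch.arange(len(targets)), targets]
    return token_lp[len(prompt_ids) - 1:].sum().item()    # keep only the choice's tokens

def multiple_choice_accuracy(model, tokenizer, examples) -> float:
    """examples: iterable of (prompt, choices, correct_index) triples."""
    correct = 0
    for prompt, choices, answer in examples:
        scores = [choice_logprob(model, tokenizer, prompt, c) for c in choices]
        correct += int(max(range(len(choices)), key=scores.__getitem__) == answer)
    return correct / len(examples)
```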

### Results

| Benchmark | Accuracy | Correct | Total |
|-----------|----------|---------|-------|
| **ARC-Easy** | 39.48% | 938 | 2,376 |
| **ARC-Challenge** | 23.55% | 276 | 1,172 |
| **HellaSwag** | 32.62% | 334 | 1,024 |

#### Summary

The LumenBase model demonstrates baseline performance on standard NLP benchmarks. As expected for a 128M parameter model trained from scratch for educational purposes:

- **ARC-Easy**: Achieves ~39% accuracy, showing some capability on easier scientific reasoning tasks
- **ARC-Challenge**: Scores ~24% on the more difficult split, indicating room for improvement on complex reasoning
- **HellaSwag**: Reaches ~33% on commonsense reasoning, several points above random chance (25% for 4-choice questions)

These results are consistent with a small-scale educational model and provide a baseline for future improvements through:
- Additional training data
- Longer training duration
- Model scaling
- Fine-tuning on specific tasks
- Improved training techniques

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

**Architecture Details:**
- **Attention Mechanism**: Grouped Multi-Query Attention reduces KV cache size while maintaining performance
- **Activation Function**: SwiGLU provides better gradient flow than traditional ReLU
- **Normalization**: RMSNorm (Root Mean Square Layer Normalization) for improved stability
- **Position Encoding**: RoPE (Rotary Position Embeddings) for better handling of relative positions
- **Weight Tying**: Embedding and output layer share weights, reducing parameter count

**Key Design Choices:**
- Decoder-only architecture following GPT design principles
- Pre-normalization for better training stability
- Efficient attention with 12 query heads and 4 KV heads, each KV head shared by 3 query heads (see the sketch below)
- Intermediate FFN size of 3072 (4x hidden size)

**Implementation Highlights:**
- Custom implementation from scratch using PyTorch
- Supports various sampling strategies: greedy, top-k, top-p (nucleus), temperature scaling
- Gradient accumulation for a larger effective batch size
- Automatic mixed precision training support
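
In tensor terms, the grouped attention pattern amounts to expanding the 4 key/value heads to match the 12 query heads before the usual scaled dot-product attention. A minimal shape-level sketch (illustrative only; the repository's attention module will differ in details such as RoPE application and KV caching):

```python
import torch
import torch.nn.functional as F

batch, seq, n_heads, n_kv_heads, head_dim = 2, 16, 12, 4, 64
group = n_heads // n_kv_heads  # 3 query heads share each KV head

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Repeat each KV head 'group' times along the head axis: (B, 4, T, D) -> (B, 12, T, D)
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

# Causal scaled dot-product attention over the expanded heads.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 12, 16, 64])
```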

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Consumer-grade GPU (specific hardware varies)
- **Hours used:** Not formally tracked (educational project)
- **Cloud Provider:** N/A (local training)
- **Compute Region:** N/A
- **Carbon Emitted:** Not formally measured

**Note**: Because this is an educational project, formal carbon footprint tracking was not implemented. Future iterations could benefit from tracking environmental impact.

## Technical Specifications [optional]

### Model Architecture and Objective

**Architecture**: Decoder-only Transformer (GPT-style)

**Configuration:**
```yaml
vocab_size: 32000
hidden_size: 768
n_heads: 12
n_kv_heads: 4
n_kv_groups: 3
head_dim: 64
n_layers: 12
intermediate_size: 3072
max_position_embeddings: 2048
dropout: 0.1 (training) / 0.0 (inference)
```

**Key Components:**
- **Grouped Multi-Query Attention**: 12 query heads, 4 key-value heads
- **Feed-Forward Network**: SwiGLU activation with 3072 intermediate dimensions
- **Layer Normalization**: RMSNorm (epsilon=1e-5)
- **Position Encoding**: Rotary Position Embeddings (RoPE)
- **Weight Tying**: Shared embedding and output projection weights

**Training Objective**: Causal language modeling with cross-entropy loss
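
As a sanity check on the stated size, a rough parameter count from this configuration (assuming a three-matrix SwiGLU FFN, no attention/MLP biases, two RMSNorms per layer plus a final one, and tied input/output embeddings) comes out very close to 128M:

```python
vocab, hidden, n_layers = 32000, 768, 12
kv_dim, inter = 4 * 64, 3072              # 4 KV heads x 64 head_dim; FFN width

embed = vocab * hidden                                    # tied with the output projection
attn  = 2 * hidden * hidden + 2 * hidden * kv_dim         # Wq, Wo (full) + Wk, Wv (reduced)
ffn   = 3 * hidden * inter                                # gate, up, down projections
norms = 2 * hidden                                        # two RMSNorms per layer
total = embed + n_layers * (attn + ffn + norms) + hidden  # plus the final RMSNorm

print(f"{total:,} parameters")  # 128,404,224 -> ~128M, i.e. ~512 MB in FP32
```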

### Compute Infrastructure

Educational project trained on consumer hardware.

#### Hardware

- Consumer-grade GPU (specific configuration varies)
- Training performed locally, not on cloud infrastructure

#### Software

```
Python 3.13
PyTorch (latest)
NumPy
Tokenizers (Hugging Face)
tqdm (progress tracking)
matplotlib (visualization)
```

**Custom Implementation**: All model components implemented from scratch in PyTorch without using high-level transformer libraries.

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@misc{lumenbase2024,
  author       = {Jangra, Hariom},
  title        = {LumenBase: A 128M Parameter Language Model Built from Scratch},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/HariomJangra/project-lumen}}
}
```

**APA:**

Jangra, H. (2024). *LumenBase: A 128M Parameter Language Model Built from Scratch* [Computer software]. GitHub. https://github.com/HariomJangra/project-lumen

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

**Terms:**

- **BPE (Byte Pair Encoding)**: Tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs
- **GQA (Grouped Multi-Query Attention)**: Attention mechanism where multiple query heads share fewer key-value heads, reducing memory and computation
- **RMSNorm**: Root Mean Square Layer Normalization, a simplified normalization that only rescales using RMS statistics
- **RoPE (Rotary Position Embeddings)**: Position encoding that encodes absolute positions with rotation matrices and naturally incorporates relative position information
- **SwiGLU**: Activation function combining the Swish activation with a Gated Linear Unit for improved model performance
- **Weight Tying**: Technique where the embedding and output layers share parameters to reduce model size

**Sampling Strategies** (combined in the sketch below):
- **Greedy Decoding**: Always select the token with the highest probability
- **Top-k Sampling**: Sample from the k most likely tokens
- **Top-p (Nucleus) Sampling**: Sample from the smallest set of tokens whose cumulative probability exceeds p
- **Temperature Scaling**: Adjust the sharpness of the probability distribution (lower = more deterministic, higher = more random)
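
A compact sketch of how these strategies combine when picking the next token from raw logits (illustrative only; the repository's `generate` function has its own implementation):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature=0.8, top_k=50, top_p=0.9) -> int:
    """Pick the next token id from a 1-D logits tensor using temperature, top-k, and top-p."""
    if temperature <= 0:                            # greedy decoding
        return int(torch.argmax(logits))
    logits = logits / temperature                   # temperature scaling

    if top_k is not None:                           # keep only the k most likely tokens
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    if top_p is not None:                           # nucleus filtering
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cut = cumprobs > top_p
        cut[1:] = cut[:-1].clone()                  # always keep the most likely token
        cut[0] = False
        logits[sorted_idx[cut]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Example with random "logits" over the 32,000-token vocabulary.
print(sample_next_token(torch.randn(32000)))
```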

## More Information [optional]

**Project Structure:**
- `PreTraining/Implementation/`: Training scripts and data preparation notebooks
- `PreTraining/Benchmark/`: Evaluation scripts and results
- `PreTraining/Inference/`: Text generation and inference code
- `PreTraining/Models/`: Saved model checkpoints
- `PreTraining/DataSets/`: Tokenized training data

**Future Work:**
- Fine-tuning for instruction following
- Chat model adaptation
- Task-specific fine-tuning
- Scaling to larger model sizes
- Improved training data curation
- Advanced sampling techniques

**Learning Resources:**
This project serves as a comprehensive educational resource covering:
1. Dataset preparation and cleaning
2. Custom tokenizer training
3. Transformer architecture implementation
4. Training loop with modern optimizations
5. Evaluation on standard benchmarks
6. Text generation with various sampling strategies

For detailed implementation and usage, please refer to the [GitHub repository](https://github.com/HariomJangra/project-lumen).

## Model Card Authors [optional]

Hariom Jangra ([@HariomJangra](https://github.com/HariomJangra))

## Model Card Contact

For questions or feedback about this model, please open an issue on the [GitHub repository](https://github.com/HariomJangra/project-lumen).