Papers
Beyond Language Models: Byte Models are Digital World Simulators • arXiv:2402.19155 • 53 upvotes
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models • arXiv:2402.19427 • 56 upvotes
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks • arXiv:2403.00522 • 46 upvotes
Resonance RoPE: Improving Context Length Generalization of Large Language Models • arXiv:2403.00071 • 24 upvotes
Learning and Leveraging World Models in Visual Representation Learning • arXiv:2403.00504 • 33 upvotes
AtP*: An efficient and scalable method for localizing LLM behaviour to components • arXiv:2403.00745 • 14 upvotes
Learning to Decode Collaboratively with Multiple Language Models • arXiv:2403.03870 • 21 upvotes
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect • arXiv:2403.03853 • 66 upvotes
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection • arXiv:2403.03507 • 189 upvotes
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment • arXiv:2403.05135 • 45 upvotes
DeepSeek-VL: Towards Real-World Vision-Language Understanding • arXiv:2403.05525 • 49 upvotes
Stealing Part of a Production Language Model • arXiv:2403.06634 • 91 upvotes
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • 77 upvotes
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM • arXiv:2403.07816 • 44 upvotes
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings • arXiv:2403.07750 • 23 upvotes
Chronos: Learning the Language of Time Series • arXiv:2403.07815 • 48 upvotes
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training • arXiv:2403.09611 • 129 upvotes
Veagle: Advancements in Multimodal Representation Learning • arXiv:2403.08773 • 10 upvotes
Qwen Technical Report • arXiv:2309.16609 • 38 upvotes
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • 11 upvotes
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer • arXiv:2403.10301 • 54 upvotes
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models • arXiv:2403.13372 • 180 upvotes
The Unreasonable Ineffectiveness of the Deeper Layers • arXiv:2403.17887 • 82 upvotes
InternLM2 Technical Report • arXiv:2403.17297 • 34 upvotes
Jamba: A Hybrid Transformer-Mamba Language Model • arXiv:2403.19887 • 112 upvotes
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs • arXiv:2403.20041 • 34 upvotes
Localizing Paragraph Memorization in Language Models • arXiv:2403.19851 • 15 upvotes
DiJiang: Efficient Large Language Models through Compact Kernelization • arXiv:2403.19928 • 12 upvotes
Long-form factuality in large language models • arXiv:2403.18802 • 26 upvotes
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • arXiv:2404.02258 • 107 upvotes
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention • arXiv:2404.07143 • 111 upvotes
Pre-training Small Base LMs with Fewer Tokens • arXiv:2404.08634 • 36 upvotes
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • 66 upvotes
SnapKV: LLM Knows What You are Looking for Before Generation • arXiv:2404.14469 • 27 upvotes
FlowMind: Automatic Workflow Generation with LLMs • arXiv:2404.13050 • 34 upvotes
Qwen2.5 Technical Report • arXiv:2412.15115 • 377 upvotes
YuLan-Mini: An Open Data-efficient Language Model • arXiv:2412.17743 • 66 upvotes
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs • arXiv:2412.18925 • 107 upvotes
Token-Budget-Aware LLM Reasoning • arXiv:2412.18547 • 46 upvotes
DeepSeek-V3 Technical Report • arXiv:2412.19437 • 76 upvotes
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313 • 300 upvotes
Evolving Deeper LLM Thinking • arXiv:2501.09891 • 115 upvotes
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model • arXiv:2502.02737 • 255 upvotes
Demystifying Long Chain-of-Thought Reasoning in LLMs • arXiv:2502.03373 • 58 upvotes
LIMO: Less is More for Reasoning • arXiv:2502.03387 • 62 upvotes
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling • arXiv:2502.06703 • 152 upvotes
The Differences Between Direct Alignment Algorithms are a Blur • arXiv:2502.01237 • 113 upvotes
s1: Simple test-time scaling • arXiv:2501.19393 • 124 upvotes
Qwen2.5-1M Technical Report • arXiv:2501.15383 • 72 upvotes
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning • arXiv:2501.12948 • 441 upvotes
Kimi k1.5: Scaling Reinforcement Learning with LLMs • arXiv:2501.12599 • 126 upvotes
Transformers without Normalization • arXiv:2503.10622 • 170 upvotes
A Survey on Post-training of Large Language Models • arXiv:2503.06072 • 11 upvotes
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models • arXiv:2503.16419 • 77 upvotes
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't • arXiv:2503.16219 • 52 upvotes
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax • arXiv:2504.20966 • 31 upvotes