OLDI and friends Collection This collection groups the datasets that have been featured as part of WMT’s Open Language Data Initiative shared task. • 4 items • Updated Oct 6 • 4
Insights from the ICLR Peer Review and Rebuttal Process Paper • 2511.15462 • Published 18 days ago • 6
mmBERT: a modern multilingual encoder Collection mmBERT is trained on 3T tokens from over 1800 languages, showing SoTA scores on benchmarks and exceptional low-resource performance • 16 items • Updated Sep 9 • 48
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs Paper • 2510.09871 • Published Oct 10 • 2
Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs Paper • 2508.10142 • Published Aug 13 • 3
Article Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders Jul 9 • 722
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 75
How Programming Concepts and Neurons Are Shared in Code Language Models Paper • 2506.01074 • Published Jun 1 • 3
Tracing Multilingual Factual Knowledge Acquisition in Pretraining Paper • 2505.14824 • Published May 20 • 4
Qwen2.5 Collection Qwen2.5 language models, including pretrained and instruction-tuned models in 7 sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated Jul 21 • 666
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 249
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems Paper • 2504.01990 • Published Mar 31 • 299