SLLHF

community
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

alvarobarttΒ 
posted an update 2 months ago
view post
Post
3712
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! πŸ’₯

> πŸ•’ 60-minute single-pass processing, no chunking or stitching
> πŸ‘€ Customized hotwords to guide recognition on domain-specific content
> πŸ“ Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> πŸ€— Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr
alvarobarttΒ 
posted an update 3 months ago
view post
Post
3252
πŸ’₯ hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.

πŸ’‘ Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (Γ  la vLLM) manually if preferred.
  • 1 reply
Β·
pcuenqΒ 
posted an update 4 months ago
view post
Post
4809
πŸ‘‰ What happened in AI in 2025? πŸ‘ˆ

We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!

Play with it here:
2025-ai-timeline/2025-ai-timeline

Here's my personal quarterly TL;DR:

1️⃣ Q1 β€” Learning to Reason
Deepseek not only releases a top-notch reasoning model, but shows how to train them and compete with closed frontier models. OpenAI debuts Deep Research.

Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)

2️⃣ Q2 β€” Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.

Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4

3️⃣ Q3 β€” "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in Math olympiads and hard benchmarks. OpenAI releases strong open source models and Google releases the much anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.

Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5

4️⃣ Q4 β€” Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!

Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1 🀯

Credits
πŸ™ NHLOCAL for the source data https://github.com/NHLOCAL/AiTimeline

🫑 @reach-vb for the original idea, design and recipe

πŸ™Œ @ariG23498 and yours truly for compiling and verifying the 2025 edition

πŸ₯³ Here's to 2026, wishing it becomes the best year ever for open releases and on-device-first use-cases! πŸ₯‚
  • 3 replies
Β·
m-ricΒ 
posted an update 7 months ago
view post
Post
1733
Tokenization is one of the most important processes in AI - yet many would like to kill it πŸ’€

What's tokenization? The neural networks inside LLMs actually only process numbers, not text: tokenization is the process that makes text readable for them, by converting sentences into lists of numbers.

➑️ For instance, "This is tokenization" would be split into "This | is | token | ization", then each of the parts (tokens) are converted to IDs according to a predefined mapping: for instance "ization" could map to id 2438.
Thus "This is tokenization" can become 1335 | 135 | 2980 | 2438 => now the model can process the sentence!

Most tokenizers today use pre-specified mappings called "vocabularies", generally built about the compression algorithme Byte-Pair Encoding (BPE) that learns from a big corpuses of texts an optimized split to efficiently encode any text from the same distribution into a list token IDs.

🀨 Now, these current tokenizers have flaws.
For instance, the rigidity of their mapping creates losses ; the prime example being that a tokenizer designed for English (thus optimized for tokens like "has", "been", "clock", etc) will not have the right tokens to approach Burmese, thus being terribly inefficient at it.

Many alternative approaches have emerged as a result: for instance "tokenizer-free tokenizers". One that I really liked was "entropy-based": it monitors the stream of text, and trigger a split whenever the entropy increases too much, i.e. when something "surprising" happens.

But this great article argues that tokenizers are a lesser evil. Read and decide for yourself!
https://huggingface.co/blog/catherinearnett/in-defense-of-tokenizers
m-ricΒ 
posted an update 7 months ago
view post
Post
5047
STOP EVERYTHING NOW - we might finally have a radical architecture improvement over Transformers!!! 🚨

A lone scientist just proposed Tiny Recursive Model (TRM), and it is literally the most impressive model that I've seen this year.

➑️ Tiny Recursive Model is 7M parameters
➑️ On ARC-AGI, it beats flagship models like Gemini-2.5-pro

Consider how wild this is: Gemini-2.5-pro must be over 10,000x bigger
and had 1,000 as many authors πŸ˜‚ (Alexia is alone on the paper)

What's this sorcery?
In short: it's a very tiny Transformers, but it loops over itself at two different frequencies, updating two latent variables: one for the proposed answer and one for the reasoning.

@AlexiaJM started from the paper Hierarchical Reasoning Model, published a few months ago, that already showed breakthrough improvement on AGI for its small size (27M)

Hierarchical Reasoning Model had introduced one main feature:
πŸ”Ž Deep supervision
In their model, one part (here one layer) would run at high frequency, and another would be lower frequency, running only every n steps.

They had used a recurrent architecture, where these layers would repeat many times ; but to make it work they had to do many approximations, including not fully backpropagating the loss through all layers.

Alexia studied what was useful and what wasn't, and cleaned the architecture as follows :
Why use a recurrent architecture, when you can just make it a loop?
➑️ She made the network recursive, looping over itself

Why use 2 latent variables ?
➑️ She provides a crystal clear explanation : the one that changes frequently is the reasoning, the one that changes at low frequency is the proposed answer.
➑️ She runs ablation studies to validate that 2 is indeed optimal.

This new setup is a much more elegant way to process reasoning than generating huge chains of tokens as all flagship models currently do.

This might be the breakthrough we've been awaiting for so long!
  • 4 replies
Β·
lysandreΒ 
posted an update 8 months ago
view post
Post
8800
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez !

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
  • 6 replies
Β·