Martins Udris PRO
Please check: https://huggingface.co/blog/RakshitAralimatti/streaming-data-rag
I Built a RAG System That Listens to Live BBC News and Answers Questions About "What Happened 10 Minutes Ago"
A real-time streaming-data-to-RAG system that listens to live radio, transcribes it on the fly, and lets you query across TIME.
Not just "what was discussed" – but "what happened in the last 10 minutes on channel 0?" or "at 9 AM, what was the breaking news?" This is RAG that understands temporal context.
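This is not the linked project's code, just a minimal sketch of the temporal-retrieval idea it describes: keep each transcribed chunk with its wall-clock timestamp, filter by time window first, then hand the surviving text to retrieval and the LLM. All names here (TranscriptChunk, TemporalTranscriptStore) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TranscriptChunk:
    channel: int
    timestamp: datetime
    text: str

class TemporalTranscriptStore:
    """Hypothetical in-memory store of timestamped transcript chunks."""

    def __init__(self) -> None:
        self.chunks: list[TranscriptChunk] = []

    def add(self, channel: int, text: str, timestamp: datetime | None = None) -> None:
        # In a real pipeline this would be fed by a streaming ASR model.
        self.chunks.append(TranscriptChunk(channel, timestamp or datetime.now(), text))

    def window(self, channel: int, minutes: int) -> list[TranscriptChunk]:
        # Temporal filter first; semantic retrieval would run afterwards.
        cutoff = datetime.now() - timedelta(minutes=minutes)
        return [c for c in self.chunks if c.channel == channel and c.timestamp >= cutoff]

# "What happened in the last 10 minutes on channel 0?"
store = TemporalTranscriptStore()
store.add(0, "Breaking: central bank holds interest rates steady.")
recent = store.window(channel=0, minutes=10)
context = "\n".join(f"[{c.timestamp:%H:%M}] {c.text}" for c in recent)
print(context)  # this context block would then be passed to the LLM to answer the question
```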
Is there a possibility to add RAG via a simple SERP in the Space? I think at the moment most integrators think in these terms: how does this model play with RAG-like tasks? How well does it focus and work with a context window populated with grounded knowledge? Can it detect which docs are off-topic and should be ignored? Google SERP results that return not only hits but also snippets of text (like short docs) are a good addition. SERP is really cheap and blazing fast.
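As a rough illustration of what that could look like (a sketch only: fetch_serp_snippets is a hypothetical placeholder for whatever SERP API you use, and the prompt wording is just an example):

```python
from dataclasses import dataclass

@dataclass
class SerpSnippet:
    title: str
    url: str
    snippet: str

def fetch_serp_snippets(query: str) -> list[SerpSnippet]:
    # Hypothetical placeholder: in practice this would call a SERP API
    # and return title/url/snippet triples for the query.
    return [
        SerpSnippet("Example result", "https://example.com",
                    "A short snippet acting as a mini-document."),
    ]

def build_grounded_prompt(question: str) -> str:
    # Populate the context window with SERP snippets as short grounded "docs";
    # the model is explicitly allowed to discard snippets that look off-topic.
    docs = fetch_serp_snippets(question)
    context = "\n\n".join(f"[{i + 1}] {d.title} ({d.url})\n{d.snippet}" for i, d in enumerate(docs))
    return (
        "Answer using the snippets below. Ignore snippets that look irrelevant "
        "or contradict the rest, and say so if none of them help.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("What is the current state of EU AI regulation?"))
```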
Ultimately, we will see various "flavors" of the same large models, each reflecting distinct "worldviews"—much like Google Maps displays different country names and borders based on a user’s geolocation. This approach will serve as a legally compliant solution to regional regulatory requirements. Currently, such adaptations are implemented crudely—often through rigid guardrails that suppress sensitive topics or force outputs toward regionally "approved" responses. In the future, however, MLOps systems could dynamically select the appropriate model variant for each user, mirroring Google Maps’ long-standing practice of geo-localized content delivery.
But this is not only about factual content: if, say, a model is trained to generate predictions that are more "rational", that makes it a suboptimal fantasy LARP generator, even with excessive prompting.
And if an MLOps team is stuck with its model choice and must wrestle the model into adhering to something the model resists, oh, that's a bad situation.
How POTUS Completely Broke My Flash 2.5-Based Guardrail
I did quite a bit of deep research on this one, since IMHO it matters. At first I used this story to amuse fellow MLOps guys, but then I went deeper and was surprised.
For those who don't want to read too much, in plain English: when you give the model a high-stakes statement that clashes with what it "knows" about the world, it gets more brittle, sometimes to the point of being unusable.
Or an even shorter version: do not clash with the model's given worldview—it will degrade to some extent.
And in practice, it means that in lower-resource languages like Latvian and Finnish (and probably others), Flash 2.5 is an unreliable guardrail model when something clashes with the model's general "worldview".
However, I'm sure this degradation applies to other languages and models as well to varying extents.
In one totally normal week of MLOps, my news summarization pipeline started failing intermittently. Nothing was changed. No deploys. No prompt edits. No model version bump (as far as I could tell). Yet the guardrail would suddenly turn into a grumpy judge and reject outputs for reasons that felt random, sometimes even contradicting itself between runs. It was the worst kind of failure: silent, flaky, and impossible to reproduce on demand.
Then I noticed the pattern: it started when one specific named entity appeared in the text — Donald Trump (and later in tests, Bernie Sanders too).
And then down the rabbit hole I went.
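If you want to poke at this yourself, here is a minimal sketch of the kind of entity-swap A/B test that exposes the flakiness (guardrail_allows is a hypothetical stand-in for the real guardrail call; the random return value only simulates non-determinism):

```python
import random
from collections import Counter

def guardrail_allows(text: str) -> bool:
    # Hypothetical stand-in for the real guardrail model call
    # (e.g. a "is this news summary safe to publish? yes/no" prompt).
    # Replace with your actual API call; the randomness only simulates flakiness.
    return random.random() > 0.2

def rejection_rate(summary_template: str, entity: str, runs: int = 20) -> float:
    # Same summary, same prompt, only the named entity changes; repeat to expose non-determinism.
    text = summary_template.format(entity=entity)
    verdicts = Counter(guardrail_allows(text) for _ in range(runs))
    return verdicts[False] / runs

TEMPLATE = "{entity} announced a new policy on energy prices today, officials said."

for entity in ["a local council member", "Donald Trump", "Bernie Sanders"]:
    print(f"{entity:>22}: rejected {rejection_rate(TEMPLATE, entity):.0%} of runs")
```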
One Politically Salient Entity Broke My Guardrail Pipeline (Flash 2.5 "Trump/Sanders" case study)
We fixed the chat template, so performance should be much better now!
24B: unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF
123B: unsloth/Devstral-2-123B-Instruct-2512-GGUF
🧡 Step-by-step Guide: https://docs.unsloth.ai/models/devstral-2
So I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. Took me a bit to figure out what was going on.
Turns out I had used the fast tokenizer for fine-tuning, but the base model was trained with the slow one. Silent failure.
Well... long story short: TGI uses (forces) the fast tokenizer, no questions asked. And you get agile's kryptonite: silent failure. If the model was trained on the slow tokenizer, it's a silent disaster.
I got curious and wrote a quick script to check how common this is. Ran it on 6,014 LLM HF models overnight.
Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal — like chat template markers inflating from 1 token to 3, silently wrecking context windows and making the model act weird.
This wasn't rigorous research, but the drift is real. And the worst part? 968 models (among those with 500+ downloads) have both fast and slow tokenizers present, yet they still produce different outputs. No missing files, no errors — just silent degradation.
TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation. Output looks fine; the model just performs worse. Sometimes really worse. You'd never know.
If the model was trained on the fast tokenizer, it's fine, but how do you know?
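A quick way to check is to load both tokenizers and compare their output (a minimal sketch using the standard transformers AutoTokenizer API; model_id and the sample string are placeholders, and the slow load only works if the repo actually ships slow tokenizer files):

```python
from transformers import AutoTokenizer

model_id = "your-org/your-model"  # placeholder: put the repo you actually serve here
sample = "Hello, pasaule! How are you today?"  # also test with your chat template markers

# use_fast=True mirrors what TGI / AutoTokenizer pick by default;
# use_fast=False loads the slow (legacy/SentencePiece) tokenizer if the repo ships one.
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)

fast_ids = fast.encode(sample)
slow_ids = slow.encode(sample)

if fast_ids != slow_ids:
    print("MISMATCH: the two tokenizers disagree on this input")
    print("fast:", fast.convert_ids_to_tokens(fast_ids))
    print("slow:", slow.convert_ids_to_tokens(slow_ids))
else:
    print("OK: identical token ids for this sample (test more samples to be sure)")
```

If the ids differ on your chat template markers, the model will see different inputs in serving than it did during training.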
The root cause? Either model authors run HF conversion and upload both without verifying, or users run TGI, which always forces (converts to) fast.
The result of this fight with tokenizers is martinsu/tildeopen-30b-mu-instruct
It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancy—just a proper instruction fine-tune where I didn't mess up the tokenizer this time.
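A minimal usage sketch, assuming the standard transformers loading path and that the repo ships a chat template (check the model card for the recommended tokenizer settings before relying on this):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "martinsu/tildeopen-30b-mu-instruct"

# Standard transformers loading; a 30B model needs a large GPU or quantization.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Pastāsti īsu faktu par Rīgu."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```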
Full article: https://github.com/martins-u/tokenmagedon