Chocolatine-2-4B-Instruct-DPO-v2.1
Chocolatine-2-4B-Instruct-DPO-v2.1 is a post-trained version of Qwen/Qwen3-4B-Instruct-2507, designed to improve instruction-following, reasoning, and overall performance in French, while preserving strong multilingual capabilities.
In my evaluation setup, it delivers consistent gains across the tested French benchmarks, pointing to a broad improvement in French capabilities.
Although the post-training pipeline focuses on French preference data, no degradation is observed on English tasks, and slight improvements are sometimes seen, suggesting positive cross-lingual transfer.
Optimized variants (MLX, GGUF) are also available, making the model particularly suitable for local inference.
Model Overview
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Parameters: 4.0B
- Context Length: 262,144 natively
- Post training methods: DPO + Model Merging
Note: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its outputs.
This design is consistent with the goals of the post-training setup, which favors a compact dense instruct model focused on direct generation efficiency and practical downstream use.
For use cases requiring explicit reasoning traces or structured thinking outputs, Qwen/Qwen3-4B-Thinking-2507 (thinking mode) is recommended.
Model Variants
- Chocolatine-2-4B-Instruct-DPO-v2.1 (this repo): the original weights in BF16 format, suitable for further fine-tuning
- Quantized GGUF versions: Q4_K_M / Q8_0 and more, provided by mradermacher
- MLX (optimized for Apple silicon): 4-bit / 6-bit
- Ollama: in addition to the Hugging Face release, quantized 4-bit and 8-bit variants are available on Ollama for convenient local inference
Benchmarks
The results indicate a consistent improvement across the tested French benchmarks, covering several capability types. This suggests a broad gain in French performance, while English results remain overall stable.
| Benchmark fr | Qwen3-4B-Instruct-2507 (base) | Chocolatine-2-4B-Instruct-DPO-v2.1 |
|---|---|---|
| gpqa-fr:diamond | 28.93 | 32.49 |
| french_bench_arc_challenge | 47.13 | 49.79 |
| french_bench_grammar | 70.59 | 72.27 |
| french_bench_boolqa | 88.76 | 89.89 |
| french_bench_hellaswag | 56.99 | 58.03 |
| global_mmlu_fr | 63.75 | 64.75 |
| xwinograd_fr | 66.27 | 67.47 |
| fr_mt_bench | 6.22 | 6.44 |
FR-MT-Bench evaluation is performed on MT-Bench-French, using multilingual-mt-bench with OpenAI/GPT-5 as the LLM judge.
global_mmlu_fr, xwinograd_fr and french_bench results were obtained using EleutherAI LM Eval Harness in a 0-shot evaluation setting.
gpqa-fr:diamond results were obtained with LightEval/vLLM, following the kurakurai/Luth evaluation process.
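For reference, the per-benchmark gains in the French table above can be computed directly from the reported scores (fr_mt_bench is left out since it is a 0-10 judge score rather than a percentage):

```python
# Per-benchmark deltas (Chocolatine-2 minus base) from the French table above.
base = {
    "gpqa-fr:diamond": 28.93,
    "french_bench_arc_challenge": 47.13,
    "french_bench_grammar": 70.59,
    "french_bench_boolqa": 88.76,
    "french_bench_hellaswag": 56.99,
    "global_mmlu_fr": 63.75,
    "xwinograd_fr": 66.27,
}
tuned = {
    "gpqa-fr:diamond": 32.49,
    "french_bench_arc_challenge": 49.79,
    "french_bench_grammar": 72.27,
    "french_bench_boolqa": 89.89,
    "french_bench_hellaswag": 58.03,
    "global_mmlu_fr": 64.75,
    "xwinograd_fr": 67.47,
}
deltas = {k: round(tuned[k] - base[k], 2) for k in base}
avg_gain = sum(deltas.values()) / len(deltas)
print(deltas)
print(f"average gain: {avg_gain:.2f} points")
```

Every delta is positive, with an average gain of roughly 1.75 points across the seven percentage-scored benchmarks.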
| Benchmark eng | Qwen3-4B-Instruct-2507 (base) | Chocolatine-2-4B-Instruct-DPO-v2.1 |
|---|---|---|
| arc_challenge | 58.79 | 58.45 |
| hellaswag | 69.08 | 70.16 |
| boolq | 84.80 | 85.32 |
| gpqa_diamond_zeroshot | 38.89 | 38.38 |
English benchmark results were obtained using EleutherAI LM Eval Harness in a 0-shot evaluation setting.
Training & Alignment Pipeline
Chocolatine-2-4B-Instruct-DPO-v2.1 is derived from Qwen/Qwen3-4B-Instruct-2507 using a multi-step post-training pipeline:
Stage 1 – DPO (Compar:IA adaptation)
Direct Preference Optimization (DPO) on a DPO-adapted version of Compar:IA data, derived from the preference dataset comparia-votes, part of a public initiative led by the French Ministry of Culture. Previous iterations of the Chocolatine model series were also selected as part of this initiative.
I constructed an original DPO dataset from these votes by transforming them into preference pairs (chosen / rejected), with additional filtering and formatting steps to make them suitable for DPO fine-tuning.
Two dataset variants were created (6k and 13k preference pairs).
The 6k variant was used for the DPO training reported in this release.
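The vote-to-pair transformation described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code; the record fields (`prompt`, `response_a`, `response_b`, `vote`) are hypothetical and the real comparia-votes schema may differ.

```python
# Sketch: turn pairwise votes into DPO (chosen / rejected) preference pairs,
# with basic filtering. Field names are illustrative assumptions.
def votes_to_dpo_pairs(votes, min_len=1):
    pairs = []
    for v in votes:
        if v["vote"] not in ("a", "b"):
            continue  # drop ties and invalid votes
        chosen = v["response_a"] if v["vote"] == "a" else v["response_b"]
        rejected = v["response_b"] if v["vote"] == "a" else v["response_a"]
        # filtering: skip degenerate or identical responses
        if len(chosen.strip()) < min_len or chosen == rejected:
            continue
        pairs.append({"prompt": v["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs

votes = [
    {"prompt": "Capitale de la France ?", "response_a": "Paris.", "response_b": "Lyon.", "vote": "a"},
    {"prompt": "2+2 ?", "response_a": "4", "response_b": "4", "vote": "b"},        # identical -> filtered
    {"prompt": "Bonjour", "response_a": "Salut !", "response_b": "Hello!", "vote": "tie"},  # tie -> dropped
]
pairs = votes_to_dpo_pairs(votes)
print(pairs)  # only the first vote yields a usable pair
```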
Stage 2 – DPO (French-ORCA pairs)
A second DPO stage using a French version of ORCA preference pairs, based on the dataset jpacifico/french-orca-dpo-pairs-revised, commonly used in the Chocolatine training pipeline.
This stage further improves general instruction alignment, robustness across tasks, and cross-lingual capabilities.
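Both DPO stages optimize the standard DPO objective, which scores a (chosen, rejected) pair by the policy's log-probability margin over a frozen reference model. A minimal numerical sketch (not the actual training code; the example log-probabilities are made up):

```python
import math

# Minimal numeric sketch of the DPO loss for one preference pair.
# logp_* are summed token log-probs under the policy; ref_* under the frozen reference.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the chosen answer more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
print(round(loss, 4))
```

At a zero margin (policy and reference agree exactly) the loss equals log(2), and it decreases as the policy widens its preference for the chosen response.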
Stage 3 – Model Merging (MergeKit + TIES)
The resulting checkpoints were merged using MergeKit with the TIES method.
TIES merging selects task-relevant parameter updates, reduces destructive interference between the merged models, and preserves base-model stability.
MergeKit configuration:
```yaml
# ties2 recipe
models:
  - model: jpacifico/Qwen3-4B-Instruct-DPO-test2
    parameters:
      density: 0.5
      weight: 0.5
  - model: jpacifico/Qwen3-4B-Instruct-DPO-test-b3
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: Qwen/Qwen3-4B-Instruct-2507
parameters:
  normalize: false
  int8_mask: true
dtype: bfloat16
```
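The TIES procedure behind this recipe can be illustrated on toy 1-D parameter vectors. This is a dependency-free sketch of the three TIES steps (trim, elect sign, disjoint merge), not MergeKit's actual implementation:

```python
# Toy TIES merge on 1-D parameter vectors (pure Python, illustrative only).
# Steps: (1) trim each task vector to its top-`density` fraction by magnitude,
# (2) elect a per-parameter sign from the summed trimmed deltas,
# (3) average only the deltas agreeing with the elected sign,
# then add the merged delta back onto the base weights.

def trim(delta, density):
    k = max(1, int(len(delta) * density))
    keep = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:k]
    return [delta[i] if i in keep else 0.0 for i in range(len(delta))]

def ties_merge(base, models, density=0.5, weight=0.5):
    deltas = [trim([m[i] - base[i] for i in range(len(base))], density) for m in models]
    merged = []
    for i in range(len(base)):
        total = sum(d[i] for d in deltas)
        sign = 1.0 if total >= 0 else -1.0
        agreeing = [d[i] for d in deltas if d[i] != 0.0 and d[i] * sign > 0]
        step = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base[i] + weight * step)
    return merged

base = [0.0, 1.0, -1.0, 0.5]
m1 = [0.2, 1.4, -1.1, 0.5]   # deltas: [ 0.2, 0.4, -0.1, 0.0]
m2 = [-0.1, 1.6, -1.0, 0.6]  # deltas: [-0.1, 0.6,  0.0, 0.1]
print(ties_merge(base, [m1, m2]))
```

With `density: 0.5` and `weight: 0.5` as in the recipe above, parameters where the two checkpoints disagree in sign take only the majority-sign update, which is what limits destructive interference.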
Usage
The following code snippet illustrates how to use the model to generate content from a given input.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jpacifico/Chocolatine-2-4B-Instruct-DPO-v2.1"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# generate the completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```
Limitations
The Chocolatine-2 model series is a quick demonstration that a base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanism.
Developed by: Jonathan Pacifico, 2026
Model type: LLM
Language(s) (NLP): French, English
License: Apache-2.0
Made with ❤️ in France