Spaces: meetkai/modelchorus-evals

brycemeetkai committed 11 days ago

Commit a540a5c (verified) · 1 parent: 00d42e1

Mirror evals/ from 4ac99d72af66

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)

evals/README.md +0 -2
evals/albanian/README.md +136 -0
evals/albanian/albanian.yaml +21 -0
evals/{igbo/classification/igbo_classification.yaml → albanian/classification/albanian_classification.yaml} +2 -2
evals/{igbo/classification/igbo_sib200.yaml → albanian/classification/albanian_sib200.yaml} +10 -3
evals/{yoruba → albanian}/classification/utils.py +3 -6
evals/{igbo → albanian}/mcq/_default_mcq_yaml +2 -1
evals/{igbo/mcq/igbo_belebele.yaml → albanian/mcq/albanian_belebele.yaml} +3 -3
evals/albanian/mcq/albanian_global_mmlu.yaml +11 -0
evals/{igbo/mcq/igbo_mcq.yaml → albanian/mcq/albanian_mcq.yaml} +4 -4
evals/{yoruba → albanian}/mcq/utils.py +39 -49
evals/albanian/open_generation/_default_open_generation_yaml +18 -0
evals/albanian/open_generation/albanian_aya.yaml +14 -0
evals/albanian/open_generation/albanian_open_generation.yaml +17 -0
evals/albanian/open_generation/albanian_polywrite.yaml +15 -0
evals/albanian/open_generation/utils.py +172 -0
evals/albanian/summarization/_default_summarization_yaml +18 -0
evals/albanian/summarization/albanian_massivesumm_long.yaml +14 -0
evals/albanian/summarization/albanian_massivesumm_short.yaml +9 -0
evals/albanian/summarization/albanian_summarization.yaml +11 -0
evals/albanian/summarization/utils.py +111 -0
evals/arabic/classification/arabic_sib200.yaml +7 -0
evals/arabic/qa/arabic_qa.yaml +3 -0
evals/cost_core.py +0 -1
evals/english/english.yaml +0 -1
evals/eval_config.toml +34 -38
evals/f1_utils.py +162 -17
evals/french/classification/french_sib200.yaml +7 -0
evals/french/qa/french_qa.yaml +3 -0
evals/hausa/classification/hausa_sib200.yaml +7 -0
evals/hausa/hausa.yaml +0 -1
evals/hausa/nli/hausa_afrixnli.yaml +7 -0
evals/hausa/nli/utils.py +2 -0
evals/hausa/qa/hausa_qa.yaml +3 -0
evals/hausa/sentiment/utils.py +0 -26
evals/igbo/afrimgsm/igbo_afrimgsm.yaml +0 -28
evals/igbo/igbo.yaml +0 -10
evals/igbo/mcq/igbo_afrimmlu.yaml +0 -9
evals/igbo/nli/utils.py +0 -26
evals/igbo/qa/utils.py +0 -61
evals/igbo/sentiment/igbo_sentiment.yaml +0 -9
evals/igbo/sentiment/utils.py +0 -26
evals/portuguese/README.md +131 -0
evals/{igbo/nli/igbo_afrixnli.yaml → portuguese/classification/_default_classification_yaml} +1 -8
evals/portuguese/classification/portuguese_classification.yaml +12 -0
evals/portuguese/classification/portuguese_hate_speech.yaml +17 -0
evals/portuguese/classification/portuguese_hatebr.yaml +17 -0
evals/portuguese/classification/portuguese_tweetsentbr.yaml +16 -0
evals/portuguese/classification/utils.py +112 -0
evals/{swahili/afrimgsm/swahili_afrimgsm.yaml → portuguese/mcq/_default_mcq_yaml} +4 -11

evals/README.md CHANGED

@@ -111,8 +111,6 @@ name = "my_task_group"
 | French   | SIB-200, Belebele, MGSM            | `french_classification`, `french_mcq`, `french_math`             |
 | Arabic   | SIB-200, Belebele                  | `arabic_classification`, `arabic_mcq`                            |
 | Hausa    | SIB-200, AfriMMLU, Belebele        | `hausa_classification`, `hausa_mcq`                              |
-| Yoruba   | SIB-200, AfriMMLU, Belebele        | `yoruba_classification`, `yoruba_mcq`                            |
-| Igbo     | SIB-200, AfriMMLU, Belebele        | `igbo_classification`, `igbo_mcq`                                |
 ## Adding a New Language/Benchmark

 | French   | SIB-200, Belebele, MGSM            | `french_classification`, `french_mcq`, `french_math`             |
 | Arabic   | SIB-200, Belebele                  | `arabic_classification`, `arabic_mcq`                            |
 | Hausa    | SIB-200, AfriMMLU, Belebele        | `hausa_classification`, `hausa_mcq`                              |
 ## Adding a New Language/Benchmark

evals/albanian/README.md ADDED

	@@ -0,0 +1,136 @@

+# Albanian – lm-eval Tasks
+Albanian (Tosk, `als_Latn` / macro `sq`) evaluation suite for the
+`lm-evaluation-harness` framework.
+## Overview
+### Custom Tasks (require `--include_path`)
+| #   | Task Name                    | Category        | Dataset (HuggingFace)                                                                   | Metric             |
+| --- | ---------------------------- | --------------- | --------------------------------------------------------------------------------------- | ------------------ |
+| 1   | `albanian_sib200`            | Classification  | `Davlan/sib200` (`als_Latn`)                                                            | f1_macro           |
+| 2   | `albanian_belebele`          | MCQ             | `facebook/belebele` (`als_Latn`)                                                        | f1_macro           |
+| 3   | `albanian_global_mmlu`       | MCQ             | `CohereLabs/Global-MMLU-Lite` (`sq`, v2)                                                | f1_macro           |
+| 4   | `albanian_massivesumm_short` | Summarization   | `MaLA-LM/MassiveSumm_short` (filtered `language=sqi`)                                   | rouge_l            |
+| 5   | `albanian_massivesumm_long`  | Summarization   | `MaLA-LM/MassiveSumm_long` (filtered `language=sqi`)                                    | rouge_l            |
+| 6   | `albanian_aya`               | Open generation | `CohereLabs/aya_evaluation_suite` (`dolly_machine_translated`, filtered `language=sqi`) | llm_judge_score    |
+| 7   | `albanian_polywrite`         | Open generation | `MaLA-LM/PolyWrite` (filtered `lang_script=sqi_Latn`)                                   | open_quality_score |
+#### Subgroups
+| Group                      | Tasks                               |
+| -------------------------- | ----------------------------------- |
+| `albanian_classification`  | sib200                              |
+| `albanian_mcq`             | belebele, global_mmlu               |
+| `albanian_summarization`   | massivesumm_short, massivesumm_long |
+| `albanian_open_generation` | aya, polywrite                      |
+## Setup
+```bash
+pip install lm-eval
+```
+## Running Tasks
+All commands must be run from the `multilingual_bench/` directory:
+```bash
+cd /path/to/functionary_internal/evaluation/multilingual_bench
+```
+### Run the Entire Albanian Suite (all 7 tasks)
+```bash
+OPENAI_API_KEY="$OPENROUTER_API_KEY" \
+lm_eval \
+  --include_path lm_eval_tasks \
+  --tasks albanian \
+  --model local-chat-completions \
+  --model_args model=openai/gpt-5-mini,base_url=https://openrouter.ai/api/v1/chat/completions,num_concurrent=5 \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --log_samples \
+  --output_path output/albanian_results
+```
+### Run via the project runner
+```bash
+cd lm_eval_tasks
+export OPENROUTER_API_KEY="sk-or-..."
+python run_eval.py --models gpt-5-mini --tasks albanian
+```
+### Run a Single Category
+```bash
+lm_eval --include_path lm_eval_tasks --tasks albanian_classification ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_mcq ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_summarization ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_open_generation ...
+```
+### Run a Single Task
+```bash
+lm_eval --include_path lm_eval_tasks --tasks albanian_sib200 ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_belebele ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_global_mmlu ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_short ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_massivesumm_long ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_aya ...
+lm_eval --include_path lm_eval_tasks --tasks albanian_polywrite ...
+```
+## Output
+With `--log_samples`, the output directory contains:
+- `results.json` – aggregate scores per task
+- `samples_<task_name>.jsonl` – per-example model outputs for debugging
+## Dataset Sources
+| Dataset           | Source                            | Config                                             | Notes                                                                |
+| ----------------- | --------------------------------- | -------------------------------------------------- | -------------------------------------------------------------------- |
+| SIB-200           | `Davlan/sib200`                   | `als_Latn`                                         | text + ClassLabel `category` (7 topics)                              |
+| Belebele          | `facebook/belebele`               | `als_Latn`                                         | flores_passage + question + 4 mc_answers, `correct_answer_num` 1-4   |
+| Global-MMLU-Lite  | `CohereLabs/Global-MMLU-Lite`     | `sq`                                               | question + `option_a..d` + `answer` letter (400 samples, CS+CA)      |
+| MassiveSumm short | `MaLA-LM/MassiveSumm_short`       | — (filter `language=sqi`)                          | `text`, `summary`, `language`; gated                                 |
+| MassiveSumm long  | `MaLA-LM/MassiveSumm_long`        | — (filter `language=sqi`)                          | same schema; longer articles                                         |
+| Aya Eval          | `CohereLabs/aya_evaluation_suite` | `dolly_machine_translated` (filter `language=sqi`) | `inputs`, `targets`, `language`, `script`                            |
+| PolyWrite         | `MaLA-LM/PolyWrite`               | — (filter `lang_script=sqi_Latn`)                  | `prompt_translated`, `category`, `lang_script` (no reference answer) |
+### Gated datasets
+Several upstream datasets are gated on Hugging Face. Accept the terms (once) and export an HF token before running:
+- Aya Eval: <https://huggingface.co/datasets/CohereLabs/aya_evaluation_suite>
+- MassiveSumm short / long: <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short> and <https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long>
+```bash
+export HF_TOKEN="hf_..."
+huggingface-cli login   # one-time, optional if HF_TOKEN is exported
+```
+### LLM-judge tasks
+`albanian_aya` and `albanian_polywrite` use an LLM judge (default `openai/gpt-5-mini` via OpenRouter) for scoring; this consumes additional API credits per sample. Override:
+```bash
+export JUDGE_MODEL="openai/gpt-5-mini"
+export JUDGE_BASE_URL="https://openrouter.ai/api/v1"
+export JUDGE_CONCURRENCY=32
+```
+### Tasks not (yet) included
+GlotEval also lists IN22, AmericasNLP, MAFAND, XLSum, MMMLU, NTEU, MMHB, BenchMAX (Math/Science/Rule-based), TICO-19, NTREX-128, Taxi-1500, PBC, MaLA, and UD-UPOS. These are intentionally **not** included because:
+- No Albanian coverage upstream: AmericasNLP, IN22, MAFAND, XLSum, MMMLU, NTEU, MMHB, BenchMAX (all subsets).
+- No HuggingFace dataset (only ship as local files in GlotEval): TICO-19, NTREX-128, Taxi-1500.
+- Not meaningful via chat-completions: PBC, MaLA (intrinsic NLL), UD-UPOS / WikiANN-style token-level tagging.
+Add any of these later if you ship the local data and want a corresponding task config.
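
Reviewer sketch (not part of the commit): one way the `samples_<task_name>.jsonl` files described under Output could be post-processed. It assumes each row is a JSON object that carries the per-sample metric under the metric's own name, which is what the `process_results_polywrite` docstring in this commit describes for `open_quality_score`. The helper name `mean_metric` and the sample rows are hypothetical.

```python
import json
from typing import Iterable, Optional


def mean_metric(lines: Iterable[str], metric: str) -> Optional[float]:
    """Average a numeric per-sample metric over JSONL rows.

    Blank rows and rows where the metric is missing or non-numeric are
    skipped. Returns None when no row has a usable value.
    """
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        value = row.get(metric)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    return sum(values) / len(values) if values else None


# Hypothetical rows standing in for a samples_albanian_polywrite.jsonl file.
rows = [
    '{"doc_id": 0, "open_quality_score": 0.75}',
    '{"doc_id": 1, "open_quality_score": 0.25}',
    '{"doc_id": 2}',
]
print(mean_metric(rows, "open_quality_score"))
```

In practice `lines` would be an open file handle; passing an iterable of strings keeps the helper easy to test without touching disk.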

evals/albanian/albanian.yaml ADDED

	@@ -0,0 +1,21 @@

+# Albanian – top-level benchmark group
+# Usage:
+#   cd multilingual_bench
+#   lm_eval --include_path lm_eval_tasks \
+#           --tasks albanian \
+#           --model local-chat-completions \
+#           --model_args model=your-model,base_url=http://your-endpoint/v1/chat/completions \
+#           --num_fewshot 0
+#
+# Metrics:
+#   classification & mcq → f1_macro                            (per sub-group)
+#   summarization        → rouge_l                             (per sub-group)
+#   open_generation      → llm_judge_score / open_quality_score (per sub-group)
+group: albanian
+task:
+  - albanian_classification
+  - albanian_mcq
+  - albanian_summarization
+  - albanian_open_generation
+metadata:
+  version: 1.0

evals/{igbo/classification/igbo_classification.yaml → albanian/classification/albanian_classification.yaml} RENAMED

@@ -1,7 +1,7 @@
 # Topic Classification subgroup (SIB-200)
-group: igbo_classification
 task:
-  - igbo_sib200
 aggregate_metric_list:
   - metric: f1_macro
     aggregation: mean

 # Topic Classification subgroup (SIB-200)
+group: albanian_classification
 task:
+  - albanian_sib200
 aggregate_metric_list:
   - metric: f1_macro
     aggregation: mean

evals/{igbo/classification/igbo_sib200.yaml → albanian/classification/albanian_sib200.yaml} RENAMED

@@ -1,7 +1,7 @@
-task: igbo_sib200
 task_alias: sib200
 dataset_path: Davlan/sib200
-dataset_name: ibo_Latn
 test_split: test
 output_type: generate_until
 generation_kwargs:
@@ -10,8 +10,15 @@ generation_kwargs:
   until:
     - "<|endoftext|>"
 process_docs: !function utils.process_sib200_docs
-doc_to_text: "You are a topic classification system.\nChoose the single best label for the following Igbo text.\n\nAllowed labels: {{labels_str}}\n\nInstruction: Reply with ONE label only from the allowed labels. Do not write anything else.\n\nText:\n{{text}}\n\nLabel:"
 doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

+task: albanian_sib200
 task_alias: sib200
 dataset_path: Davlan/sib200
+dataset_name: als_Latn
 test_split: test
 output_type: generate_until
 generation_kwargs:
   until:
     - "<|endoftext|>"
 process_docs: !function utils.process_sib200_docs
+doc_to_text: "Ti je një sistem klasifikimi temash.\nZgjidh një etiketë të vetme që e përshkruan më mirë tekstin e mëposhtëm.\n\nEtiketat e lejuara: {{labels_str}}\n\nUdhëzim: Përgjigju vetëm me NJË etiketë nga lista e lejuar. Mos shkruaj asnjë fjalë tjetër.\n\nTeksti:\n{{text}}\n\nEtiketa:"
 doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_label_set"
+        labels_field: "labels_str"
+      - function: "take_first"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

evals/{yoruba → albanian}/classification/utils.py RENAMED

@@ -1,10 +1,10 @@
-"""Utility helpers for Yoruba classification tasks (SIB-200).
-Uses macro-averaged F1 scoring (matching Swahili pattern).
 """
 import os as _os, sys as _sys  # noqa: E401
-_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..","..",)))
 import datasets
@@ -40,9 +40,6 @@ def process_sib200_docs(dataset: datasets.Dataset) -> datasets.Dataset:
     return dataset.map(_process)
-# -- Result processing ----------------------------------------------------
 def process_results(doc, results):
     """Return (pred, gold) tuple for macro-F1 aggregation."""
     return process_results_f1(doc, results)

+"""Utility helpers for Albanian classification tasks (SIB-200).
+Uses macro-averaged F1 scoring (matching the Swahili / Urdu pattern).
 """
 import os as _os, sys as _sys  # noqa: E401
+_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..",)))
 import datasets
     return dataset.map(_process)
 def process_results(doc, results):
     """Return (pred, gold) tuple for macro-F1 aggregation."""
     return process_results_f1(doc, results)
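
Reviewer sketch (not part of the commit): the `_sys.path.insert(...)` line climbs two directories up from the module's own folder, presumably so that shared helpers at the top of `evals/` become importable from each task's `utils.py`. The snippet below reproduces only the path arithmetic. It uses `posixpath` so the result is the same on every platform, and the checkout location `/repo` and the helper name `shared_root` are hypothetical.

```python
import posixpath


def shared_root(module_dir: str) -> str:
    """Resolve module_dir/../.. the same way the utils.py path tweak does."""
    return posixpath.normpath(posixpath.join(module_dir, "..", ".."))


# A task module living in evals/albanian/classification/ resolves to evals/.
print(shared_root("/repo/evals/albanian/classification"))
```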

evals/{igbo → albanian}/mcq/_default_mcq_yaml RENAMED

@@ -1,4 +1,4 @@
-# Shared config for Igbo MCQ tasks (generative A/B/C/D).
 output_type: generate_until
 generation_kwargs:
   do_sample: false
@@ -8,6 +8,7 @@ generation_kwargs:
 filter_list:
   - name: "get_answer"
     filter:
       - function: "regex"
         regex_pattern: "([ABCD])"
         group_select: 0

+# Shared config for Albanian MCQ tasks (generative A/B/C/D).
 output_type: generate_until
 generation_kwargs:
   do_sample: false
 filter_list:
   - name: "get_answer"
     filter:
+      - function: "strip_think_recover"
       - function: "regex"
         regex_pattern: "([ABCD])"
         group_select: 0
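
Reviewer sketch (not part of the commit): what a first-match extraction with the pattern `([ABCD])` does to typical responses. This assumes the harness's `regex` filter behaves like `re.findall` followed by picking the match at index `group_select`, with `"[invalid]"` as the fallback; the custom `strip_think_recover` step is not modelled, and the helper name `first_choice` is hypothetical.

```python
import re

# Same pattern as in the shared MCQ config.
CHOICE_PATTERN = re.compile(r"([ABCD])")


def first_choice(response: str, fallback: str = "[invalid]") -> str:
    """Return the first capital A-D found in the response, else the fallback."""
    matches = CHOICE_PATTERN.findall(response)
    return matches[0] if matches else fallback


for text in ["B", "Përgjigja: C", "Answer: B", "nuk e di"]:
    print(repr(text), "->", first_choice(text))
```

Because the pattern is unanchored, any earlier capital A, B, C, or D wins: `"Answer: B"` yields `A`, not `B`. The Albanian prompts ask for a single letter only, which keeps this from mattering when the model complies.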

evals/{igbo/mcq/igbo_belebele.yaml → albanian/mcq/albanian_belebele.yaml} RENAMED

@@ -1,9 +1,9 @@
-task: igbo_belebele
 task_alias: belebele
 dataset_path: facebook/belebele
-dataset_name: ibo_Latn
 test_split: test
 include: _default_mcq_yaml
 process_docs: !function utils.process_belebele_docs
-doc_to_text: "P: {{flores_passage}}\nQ: {{question}}\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nInstruction: Reply with EXACTLY one letter: A, B, C, or D. No other text.\nAnswer:"
 doc_to_target: "{{gold_letter}}"

+task: albanian_belebele
 task_alias: belebele
 dataset_path: facebook/belebele
+dataset_name: als_Latn
 test_split: test
 include: _default_mcq_yaml
 process_docs: !function utils.process_belebele_docs
+doc_to_text: "P: {{flores_passage}}\nQ: {{question}}\nA: {{mc_answer1}}\nB: {{mc_answer2}}\nC: {{mc_answer3}}\nD: {{mc_answer4}}\nUdhëzim: Përgjigju me VETËM një shkronjë: A, B, C ose D. Asnjë tekst tjetër.\nPërgjigja:"
 doc_to_target: "{{gold_letter}}"

evals/albanian/mcq/albanian_global_mmlu.yaml ADDED

	@@ -0,0 +1,11 @@

+task: albanian_global_mmlu
+task_alias: global_mmlu_lite
+# Global-MMLU-Lite v2 added Albanian (sq); the full Global-MMLU dataset
+# does not currently ship Albanian. 400 samples total (200 CS + 200 CA).
+dataset_path: CohereLabs/Global-MMLU-Lite
+dataset_name: sq
+test_split: test
+include: _default_mcq_yaml
+process_docs: !function utils.process_global_mmlu_docs
+doc_to_text: "Ti je një AI me njohuri të gjera që u përgjigjet pyetjeve me zgjedhje të shumëfishta për lëndën '{{subject_field}}'.\n\nPyetja:\n{{question}}\n\nMundësitë:\nA: {{choice_a}}\nB: {{choice_b}}\nC: {{choice_c}}\nD: {{choice_d}}\n\nUdhëzim: Përgjigju me VETËM një shkronjë: A, B, C ose D. Asnjë tekst tjetër.\n\nPërgjigja:"
+doc_to_target: "{{gold_letter}}"

evals/{igbo/mcq/igbo_mcq.yaml → albanian/mcq/albanian_mcq.yaml} RENAMED

@@ -1,8 +1,8 @@
-# Multiple-Choice QA subgroup (AfriMMLU + Belebele)
-group: igbo_mcq
 task:
-  - igbo_afrimmlu
-  - igbo_belebele
 aggregate_metric_list:
   - metric: f1_macro
     aggregation: mean

+# Multiple-Choice QA subgroup (Belebele + Global-MMLU)
+group: albanian_mcq
 task:
+  - albanian_belebele
+  - albanian_global_mmlu
 aggregate_metric_list:
   - metric: f1_macro
     aggregation: mean

evals/{yoruba → albanian}/mcq/utils.py RENAMED

@@ -1,7 +1,7 @@
-"""Utility helpers for Yoruba MCQ tasks (AfriMMLU, Belebele)."""
 import os as _os, sys as _sys  # noqa: E401
-_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..","..",)))
 import ast
@@ -18,40 +18,48 @@ def _safe_str(x):
     return "" if x is None else str(x)
-# -- AfriMMLU --------------------------------------------------------------
-def _normalize_choices(doc):
-    """Robustly extract 4 choices from AfriMMLU documents.
-    The HF dataset uses several different field layouts depending on the
-    upload revision; this mirrors the logic in ``run_mcq.py``.
     """
-    # Try mc_answer1-4
-    mc = [
-        _safe_str(doc.get("mc_answer1", "")).strip(),
-        _safe_str(doc.get("mc_answer2", "")).strip(),
-        _safe_str(doc.get("mc_answer3", "")).strip(),
-        _safe_str(doc.get("mc_answer4", "")).strip(),
-    ]
-    if all(mc):
-        return mc
-    # Try A/B/C/D keys
     if all(k in doc for k in CHOICE_LETTERS):
         return [_safe_str(doc[k]).strip() for k in CHOICE_LETTERS]
-    # Try 'choices' field
     choices = doc.get("choices")
     if isinstance(choices, list) and len(choices) >= 4:
         return [_safe_str(x).strip() for x in choices[:4]]
     if isinstance(choices, dict):
         upper = {str(k).upper(): k for k in choices}
         if all(k in upper for k in CHOICE_LETTERS):
             return [_safe_str(choices[upper[k]]).strip() for k in CHOICE_LETTERS]
     if isinstance(choices, str):
         for parser in (json.loads, ast.literal_eval):
             try:
@@ -64,60 +72,42 @@ def _normalize_choices(doc):
     return ["", "", "", ""]
-def _gold_letter(doc):
-    """Resolve the gold answer to a letter (A/B/C/D)."""
     answer = doc.get("answer")
-    # Direct letter
     if isinstance(answer, str) and answer.strip().upper() in set(CHOICE_LETTERS):
         return answer.strip().upper()
-    # Integer index (0-based)
     try:
         idx = int(answer)
         if 0 <= idx <= 3:
             return CHOICE_LETTERS[idx]
     except (ValueError, TypeError):
         pass
     return ""
-def process_afrimmlu_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-    """Normalise AfriMMLU fields for the YAML template."""
     def _process(doc):
-        choices = _normalize_choices(doc)
         doc["choice_a"] = choices[0]
         doc["choice_b"] = choices[1]
         doc["choice_c"] = choices[2]
         doc["choice_d"] = choices[3]
-        doc["gold_letter"] = _gold_letter(doc)
         doc["subject_field"] = doc.get("subject", "unknown")
         return doc
     return dataset.map(_process)
-# -- Belebele --------------------------------------------------------------
-def process_belebele_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-    """Resolve correct_answer_num (1-4) -> gold_letter (A-D)."""
-    def _process(doc):
-        num = doc.get("correct_answer_num")
-        try:
-            n = int(str(num).strip())
-            doc["gold_letter"] = CHOICE_LETTERS[n - 1] if 1 <= n <= 4 else ""
-        except (ValueError, TypeError):
-            doc["gold_letter"] = ""
-        return doc
-    return dataset.map(_process)
-# -- Result processing ----------------------------------------------------
 def process_results(doc, results):

+"""Utility helpers for Albanian MCQ tasks (Belebele, Global-MMLU)."""
 import os as _os, sys as _sys  # noqa: E401
+_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..",)))
 import ast
     return "" if x is None else str(x)
+# ── Belebele ──────────────────────────────────────────────────────────
+def process_belebele_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    """Resolve correct_answer_num (1-4) → gold_letter (A-D)."""
+    def _process(doc):
+        num = doc.get("correct_answer_num")
+        try:
+            n = int(str(num).strip())
+            doc["gold_letter"] = CHOICE_LETTERS[n - 1] if 1 <= n <= 4 else ""
+        except (ValueError, TypeError):
+            doc["gold_letter"] = ""
+        return doc
+    return dataset.map(_process)
+# ── Global-MMLU ───────────────────────────────────────────────────────
+def _global_mmlu_choices(doc):
+    """Robustly extract 4 choices from a Global-MMLU document.
+    Global-MMLU exposes the four options under ``option_a``..``option_d``
+    keys (lower-case). Older revisions used ``A``..``D`` or a ``choices``
+    list, which we accept as fallbacks.
     """
+    lower_keys = ["option_a", "option_b", "option_c", "option_d"]
+    if all(k in doc for k in lower_keys):
+        return [_safe_str(doc[k]).strip() for k in lower_keys]
     if all(k in doc for k in CHOICE_LETTERS):
         return [_safe_str(doc[k]).strip() for k in CHOICE_LETTERS]
     choices = doc.get("choices")
     if isinstance(choices, list) and len(choices) >= 4:
         return [_safe_str(x).strip() for x in choices[:4]]
     if isinstance(choices, dict):
         upper = {str(k).upper(): k for k in choices}
         if all(k in upper for k in CHOICE_LETTERS):
             return [_safe_str(choices[upper[k]]).strip() for k in CHOICE_LETTERS]
     if isinstance(choices, str):
         for parser in (json.loads, ast.literal_eval):
             try:
     return ["", "", "", ""]
+def _global_mmlu_gold_letter(doc):
+    """Resolve the Global-MMLU gold answer to a letter (A/B/C/D)."""
     answer = doc.get("answer")
     if isinstance(answer, str) and answer.strip().upper() in set(CHOICE_LETTERS):
         return answer.strip().upper()
     try:
         idx = int(answer)
         if 0 <= idx <= 3:
             return CHOICE_LETTERS[idx]
+        if 1 <= idx <= 4:
+            return CHOICE_LETTERS[idx - 1]
     except (ValueError, TypeError):
         pass
     return ""
+def process_global_mmlu_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    """Normalise Global-MMLU fields for the YAML template."""
     def _process(doc):
+        choices = _global_mmlu_choices(doc)
         doc["choice_a"] = choices[0]
         doc["choice_b"] = choices[1]
         doc["choice_c"] = choices[2]
         doc["choice_d"] = choices[3]
+        doc["gold_letter"] = _global_mmlu_gold_letter(doc)
         doc["subject_field"] = doc.get("subject", "unknown")
         return doc
     return dataset.map(_process)
+# ── Result processing ────────────────────────────────────────────────
 def process_results(doc, results):
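
Reviewer sketch (not part of the commit): the branch order in `_global_mmlu_gold_letter`, reproduced as a standalone function so its behaviour on integer answers is easy to check. `CHOICE_LETTERS` is defined outside the hunks shown here, so the value below is an assumption, and the function takes the answer directly instead of a document.

```python
CHOICE_LETTERS = ["A", "B", "C", "D"]  # assumed; defined outside the hunks shown


def gold_letter(answer) -> str:
    """Mirror the resolution order used for the Global-MMLU gold answer."""
    # 1) A letter (any case, surrounding whitespace allowed) is returned as-is.
    if isinstance(answer, str) and answer.strip().upper() in set(CHOICE_LETTERS):
        return answer.strip().upper()
    try:
        idx = int(answer)
        # 2) 0..3 is read as a 0-based index.
        if 0 <= idx <= 3:
            return CHOICE_LETTERS[idx]
        # 3) Only reached when idx == 4, which is then read as 1-based.
        if 1 <= idx <= 4:
            return CHOICE_LETTERS[idx - 1]
    except (ValueError, TypeError):
        pass
    return ""


for value in [" c ", 0, 1, "2", 4, None, 7]:
    print(repr(value), "->", repr(gold_letter(value)))
```

Since the 0-based check runs first, integers 1 to 3 are always interpreted as 0-based (1 gives `B`), and the 1-based branch only ever fires for 4. That is harmless for Global-MMLU-Lite if its `answer` field is a letter, as the README table states, but it would mislabel a dataset that stores 1-based integers.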

evals/albanian/open_generation/_default_open_generation_yaml ADDED

	@@ -0,0 +1,18 @@

+# Shared config for Albanian open-ended generation tasks (Aya, PolyWrite).
+# Both datasets are highly multilingual single-table dumps. Filtering to
+# Albanian rows is done in process_docs.
+#
+# Required env vars:
+#   OPENAI_API_KEY   – OpenRouter / OpenAI key
+# Optional env vars:
+#   JUDGE_MODEL      – judge model name        (default: openai/gpt-5-mini)
+#   JUDGE_BASE_URL   – judge API endpoint      (default: https://openrouter.ai/api/v1)
+#   JUDGE_CONCURRENCY – parallel judge calls   (default: 32)
+output_type: generate_until
+generation_kwargs:
+  do_sample: false
+  max_gen_toks: 8192
+  until:
+    - "<|endoftext|>"
+metadata:
+  version: 1.0

evals/albanian/open_generation/albanian_aya.yaml ADDED

	@@ -0,0 +1,14 @@

+task: albanian_aya
+task_alias: aya
+dataset_path: CohereLabs/aya_evaluation_suite
+dataset_name: dolly_machine_translated
+test_split: test
+include: _default_open_generation_yaml
+process_docs: !function utils.process_aya_docs
+doc_to_text: "{{inputs}}"
+doc_to_target: "{{targets}}"
+process_results: !function utils.process_results_aya
+metric_list:
+  - metric: llm_judge_score
+    aggregation: !function utils.llm_judge_agg
+    higher_is_better: true

evals/albanian/open_generation/albanian_open_generation.yaml ADDED

	@@ -0,0 +1,17 @@

+# Open-ended generation subgroup (Aya + PolyWrite)
+# Note: Aya uses reference-based llm_judge_score; PolyWrite uses the
+# reference-free open_quality_score (1-5 rubric, normalised to 0-1).
+# Both metrics are on the same 0-1 scale so their mean is meaningful.
+group: albanian_open_generation
+task:
+  - albanian_aya
+  - albanian_polywrite
+aggregate_metric_list:
+  - metric: llm_judge_score
+    aggregation: mean
+    weight_by_size: true
+  - metric: open_quality_score
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
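
Reviewer sketch (not part of the commit): what a size-weighted mean computes in general, as a reading aid for `weight_by_size: true`. It does not reproduce the harness's internal aggregation, and the scores and sample counts are made up. Note that in this group each metric name is emitted by only one of the two subtasks.

```python
from typing import Sequence


def size_weighted_mean(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Mean of per-task scores, each weighted by its number of samples."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("at least one task must have samples")
    return sum(score * n for score, n in zip(scores, sizes)) / total


# Hypothetical: one task scores 0.5 on 300 samples, another 1.0 on 100.
print(size_weighted_mean([0.5, 1.0], [300, 100]))
```

The unweighted mean of the same two scores would be 0.75; weighting by size pulls the result toward the larger task.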

evals/albanian/open_generation/albanian_polywrite.yaml ADDED

	@@ -0,0 +1,15 @@

+task: albanian_polywrite
+task_alias: polywrite
+dataset_path: MaLA-LM/PolyWrite
+test_split: train
+include: _default_open_generation_yaml
+process_docs: !function utils.process_polywrite_docs
+doc_to_text: "{{prompt}}"
+# PolyWrite has no reference answer; doc_to_target is a placeholder so
+# lm-eval is happy. The reference-free judge ignores it.
+doc_to_target: ""
+process_results: !function utils.process_results_polywrite
+metric_list:
+  - metric: open_quality_score
+    aggregation: mean
+    higher_is_better: true

evals/albanian/open_generation/utils.py ADDED

	@@ -0,0 +1,172 @@

+"""Utility helpers for Albanian open-ended generation tasks.
+Two datasets are supported:
+* **Aya Evaluation Suite** (``CohereLabs/aya_evaluation_suite``,
+  config ``dolly_machine_translated``) – has a translated reference
+  ``targets`` for every prompt, so we can use the standard reference-based
+  judge from ``judge_utils``.
+* **PolyWrite** (``MaLA-LM/PolyWrite``) – open-ended creative writing
+  prompts with **no** reference answer. We use a separate "open quality"
+  judge that scores 1-5 against an Albanian-fluency + relevance rubric.
+Both datasets are highly multilingual single-table dumps; we filter to
+Albanian inside ``process_docs``.
+"""
+from __future__ import annotations
+import json
+import os as _os
+import sys as _sys
+import time
+_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..",)))
+import datasets
+from judge_utils import (  # noqa: F401  (re-exported for yaml use)
+    llm_judge_agg,
+    strip_think_tags,
+    _get_judge_client,
+)
+# ── Albanian filtering ───────────────────────────────────────────────
+_ALBANIAN_LANG_CODES = {"sqi", "als", "aln"}
+_ALBANIAN_LANG_SCRIPTS = {"sqi_Latn", "als_Latn", "aln_Latn"}
+def process_aya_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    """Filter Aya rows to Albanian and project the columns we need."""
+    def _is_albanian(row):
+        return str(row.get("language", "")).lower() in _ALBANIAN_LANG_CODES
+    filtered = dataset.filter(_is_albanian)
+    def _project(row):
+        return {
+            "inputs": (row.get("inputs") or "").strip(),
+            "targets": (row.get("targets") or "").strip(),
+        }
+    cols_to_drop = [c for c in filtered.column_names if c not in ("inputs", "targets")]
+    return filtered.map(_project, remove_columns=cols_to_drop)
+def process_polywrite_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    """Filter PolyWrite rows to Albanian and surface the translated prompt."""
+    def _is_albanian(row):
+        return str(row.get("lang_script", "")) in _ALBANIAN_LANG_SCRIPTS
+    filtered = dataset.filter(_is_albanian)
+    def _project(row):
+        return {
+            "prompt": (row.get("prompt_translated") or "").strip(),
+            "category": (row.get("category") or "").strip(),
+            "name": (row.get("name") or "").strip(),
+        }
+    cols_to_drop = [c for c in filtered.column_names if c not in ("prompt", "category", "name")]
+    return filtered.map(_project, remove_columns=cols_to_drop)
+# ── Aya: reference-based judge ──────────────────────────────────────
+def process_results_aya(doc, results):
+    """Return (question, gold, pred, raw_response) for the standard llm_judge."""
+    raw_response = results[0].strip() if results and results[0] else ""
+    pred = strip_think_tags(raw_response)
+    question = str(doc.get("inputs", ""))
+    gold = str(doc.get("targets", ""))
+    return {"llm_judge_score": (question, gold, pred, raw_response)}
+# ── PolyWrite: reference-free open-quality judge ────────────────────
+_OPEN_QUALITY_PROMPT = """\
+You are an expert evaluator of open-ended writing in Albanian (gjuha shqipe).
+You will be given a writing PROMPT in Albanian and a MODEL_ANSWER produced \
+by a language model. The prompt has NO reference answer; judge the answer \
+on its own merits.
+Score the answer from 1 (very poor) to 5 (excellent) on the following \
+combined rubric:
+1. **Relevance** – does the answer address the prompt and stay on-topic?
+2. **Fluency** – is the writing in fluent, grammatical Albanian (Tosk \
+   ``sqi`` or ``als`` accepted)?
+3. **Coherence** – is the answer well-structured and internally consistent?
+4. **Quality** – is the content interesting / useful / creative as the \
+   prompt requests?
+Calibration:
+- 5: excellent on all four axes.
+- 4: good, minor issues.
+- 3: acceptable but noticeably flawed (e.g. partly off-topic, awkward Albanian).
+- 2: poor (off-topic, broken Albanian, very short, or generic boilerplate).
+- 1: unusable (refusal, wrong language, gibberish, empty).
+Respond ONLY with a single compact JSON object with exactly these keys:
+- "score": integer 1-5
+- "justification": one short sentence (Albanian or English).
+PROMPT:
+{prompt}
+MODEL_ANSWER:
+{pred}"""
+def _call_open_quality_judge(prompt: str, pred: str, max_retries: int = 3) -> float:
+    client = _get_judge_client()
+    judge_model = _os.getenv("JUDGE_MODEL", "openai/gpt-5-mini")
+    judge_text = _OPEN_QUALITY_PROMPT.format(prompt=prompt.strip(), pred=pred.strip())
+    for attempt in range(max_retries):
+        try:
+            resp = client.chat.completions.create(
+                model=judge_model,
+                temperature=0,
+                messages=[{"role": "user", "content": judge_text}],
+                response_format={"type": "json_object"},
+            )
+            data = json.loads(resp.choices[0].message.content)
+            raw = data.get("score", 0)
+            try:
+                score_int = int(raw)
+            except (TypeError, ValueError):
+                score_int = 0
+            score_int = max(1, min(5, score_int)) if score_int else 0
+            return (score_int - 1) / 4.0 if score_int else 0.0
+        except Exception as e:
+            if attempt < max_retries - 1:
+                time.sleep(2 ** attempt)
+            else:
+                print(f"[open-judge] Failed after {max_retries} retries: {e}")
+                return 0.0
+    return 0.0
+def process_results_polywrite(doc, results):
+    """Score each sample with the open-quality judge and return a per-sample float.
+    Returning a numeric score (instead of a tuple consumed by a custom aggregator)
+    means the score is written to every JSONL sample row by lm-eval, so downstream
+    tools (e.g. the dashboard) can recompute the average from samples alone.
+    Aggregation is plain ``mean`` in the YAML.
+    """
+    raw_response = results[0].strip() if results and results[0] else ""
+    pred = strip_think_tags(raw_response)
+    prompt = str(doc.get("prompt", ""))
+    score = _call_open_quality_judge(prompt, pred)
+    return {"open_quality_score": score}
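Reviewer note: the score handling inside `_call_open_quality_judge` can be read in isolation as a pure function. The sketch below mirrors the parse, clamp, and rescale steps shown in the diff. The name `normalise_judge_score` is hypothetical and not part of the commit.

```python
def normalise_judge_score(raw) -> float:
    """Map a judge rating on the 1-5 scale onto [0, 1].

    Anything that cannot be parsed as an integer, or that parses to 0,
    scores 0.0, as in the diff above.
    """
    try:
        score_int = int(raw)
    except (TypeError, ValueError):
        score_int = 0
    if not score_int:
        return 0.0
    # Clamp out-of-range ratings into 1..5 before rescaling.
    score_int = max(1, min(5, score_int))
    return (score_int - 1) / 4.0


if __name__ == "__main__":
    for raw in (5, "3", 1, 9, None, "n/a"):
        print(repr(raw), normalise_judge_score(raw))
```

A rating of 1 and a failed or unparseable judge reply both map to 0.0, so the two cases cannot be told apart from the per-sample score alone.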

evals/albanian/summarization/_default_summarization_yaml ADDED

	@@ -0,0 +1,18 @@

+# Shared config for Albanian summarization tasks (MassiveSumm).
+# The Albanian-only filter is applied in process_docs (the upstream
+# datasets are highly multilingual single-table dumps, no per-language
+# config). Scoring is sentence-level ROUGE-L F1.
+output_type: generate_until
+test_split: train
+generation_kwargs:
+  do_sample: false
+  max_gen_toks: 8192
+  until:
+    - "<|endoftext|>"
+process_results: !function utils.process_results
+metric_list:
+  - metric: rouge_l
+    aggregation: !function utils.rouge_l_agg
+    higher_is_better: true
+metadata:
+  version: 1.0
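Reviewer note: the header comment says the Albanian-only filter runs in `process_docs`. A minimal sketch of that row filter on plain dictionaries, assuming each row carries a `language` field. The commit itself filters a `datasets.Dataset`; the helper name `keep_albanian` and the sample rows are made up for illustration.

```python
# Language codes taken from the commit's summarization utils.py.
ALBANIAN_LANG_CODES = {"sqi", "als", "aln"}


def keep_albanian(rows):
    """Return only the rows whose ``language`` value is an Albanian code."""
    return [
        row
        for row in rows
        if str(row.get("language", "")).lower() in ALBANIAN_LANG_CODES
    ]


# Hypothetical rows; the real dataset is a single multilingual table.
rows = [
    {"language": "sqi", "text": "a"},
    {"language": "ALS", "text": "b"},  # comparison is case-insensitive
    {"language": "eng", "text": "c"},
    {"text": "d"},                     # no language field: dropped
]
kept = keep_albanian(rows)
print([row["text"] for row in kept])
```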

evals/albanian/summarization/albanian_massivesumm_long.yaml ADDED

	@@ -0,0 +1,14 @@

+task: albanian_massivesumm_long
+task_alias: massivesumm_long
+# Gated dataset: accept terms at
+# https://huggingface.co/datasets/MaLA-LM/MassiveSumm_long and export HF_TOKEN.
+dataset_path: MaLA-LM/MassiveSumm_long
+include: _default_summarization_yaml
+process_docs: !function utils.process_docs
+generation_kwargs:
+  do_sample: false
+  max_gen_toks: 8192
+  until:
+    - "<|endoftext|>"
+doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një paragraf të shkurtër (3-5 fjali) në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
+doc_to_target: "{{summary}}"

evals/albanian/summarization/albanian_massivesumm_short.yaml ADDED

	@@ -0,0 +1,9 @@

+task: albanian_massivesumm_short
+task_alias: massivesumm_short
+# Gated dataset: accept terms at
+# https://huggingface.co/datasets/MaLA-LM/MassiveSumm_short and export HF_TOKEN.
+dataset_path: MaLA-LM/MassiveSumm_short
+include: _default_summarization_yaml
+process_docs: !function utils.process_docs
+doc_to_text: "Ti je një sistem përmbledhjeje lajmesh.\nPërmblidh artikullin e mëposhtëm në një ose dy fjali të shkurtra në gjuhën shqipe. Mos shto komente.\n\nArtikulli:\n{{text}}\n\nPërmbledhja:"
+doc_to_target: "{{summary}}"

evals/albanian/summarization/albanian_summarization.yaml ADDED

	@@ -0,0 +1,11 @@

+# Summarization subgroup (MassiveSumm short + long)
+group: albanian_summarization
+task:
+  - albanian_massivesumm_short
+  - albanian_massivesumm_long
+aggregate_metric_list:
+  - metric: rouge_l
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
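Reviewer note: a worked example of what `weight_by_size: true` implies for the group score, assuming lm-evaluation-harness weights each subtask's mean by its number of documents. The scores and sizes below are invented for illustration, and `size_weighted_mean` is a hypothetical helper, not a harness function.

```python
def size_weighted_mean(scores, sizes):
    """Average per-task scores, weighting each by its document count."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(score * n for score, n in zip(scores, sizes)) / total


# Hypothetical numbers: short subset scores 0.20 on 300 docs,
# long subset scores 0.30 on 100 docs.
group_score = size_weighted_mean([0.20, 0.30], [300, 100])
unweighted = (0.20 + 0.30) / 2
print(group_score, unweighted)
```

With these numbers the weighted score sits closer to the larger subset (about 0.225) than the plain average of the two task scores (0.25).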

evals/albanian/summarization/utils.py ADDED

	@@ -0,0 +1,111 @@

+"""Utility helpers for Albanian summarization tasks (MassiveSumm short/long).
+Both MassiveSumm subsets are highly multilingual single-table datasets
+(one ``train`` split, no language configs). We filter to Albanian rows
+inside ``process_docs``. The HF dataset is **gated** — accept the terms
+on the dataset page once and export ``HF_TOKEN`` before running.
+Scoring uses a self-contained ROUGE-L F1 (token-level LCS, no stemming)
+defined in this module, so no external scoring package is imported.
+"""
+from __future__ import annotations
+import re
+import string
+import datasets
+_ALBANIAN_LANG_CODES = {"sqi", "als", "aln"}
+def _strip_think_tags(text: str) -> str:
+    """Strip <think>...</think> reasoning wrapper (e.g. Qwen thinking models)."""
+    if "</think>" in text:
+        return text.split("</think>")[-1].strip()
+    return text
+def _filter_albanian(dataset: datasets.Dataset) -> datasets.Dataset:
+    """Keep rows whose ``language`` field is an Albanian code; no-op if the column is absent."""
+    if "language" not in dataset.column_names:
+        return dataset
+    return dataset.filter(lambda row: str(row.get("language", "")).lower() in _ALBANIAN_LANG_CODES)
+def _normalise_doc(doc):
+    """Project the columns we actually need."""
+    text = (doc.get("text") or "").strip()
+    summary = (doc.get("summary") or "").strip()
+    title = (doc.get("title") or "").strip()
+    return {
+        "text": text,
+        "summary": summary,
+        "title": title,
+    }
+def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
+    filtered = _filter_albanian(dataset)
+    return filtered.map(_normalise_doc, remove_columns=[
+        c for c in filtered.column_names if c not in ("text", "summary", "title")
+    ])
+# ── ROUGE-L scoring ──────────────────────────────────────────────────
+_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
+_WHITESPACE_RE = re.compile(r"\s+")
+def _normalise(text: str) -> str:
+    text = text.translate(_PUNCT_TABLE)
+    text = _WHITESPACE_RE.sub(" ", text)
+    return text.strip().lower()
+def _lcs_length(a, b):
+    m, n = len(a), len(b)
+    if m == 0 or n == 0:
+        return 0
+    dp = [0] * (n + 1)
+    for i in range(1, m + 1):
+        prev = 0
+        for j in range(1, n + 1):
+            tmp = dp[j]
+            if a[i - 1] == b[j - 1]:
+                dp[j] = prev + 1
+            else:
+                dp[j] = max(dp[j], dp[j - 1])
+            prev = tmp
+    return dp[n]
+def _rouge_l_f1(pred: str, gold: str) -> float:
+    """Compute sentence-level ROUGE-L F1 (no stemming) between pred and gold."""
+    pred_tokens = _normalise(pred).split()
+    gold_tokens = _normalise(gold).split()
+    if not pred_tokens or not gold_tokens:
+        return 0.0
+    lcs = _lcs_length(pred_tokens, gold_tokens)
+    if lcs == 0:
+        return 0.0
+    precision = lcs / len(pred_tokens)
+    recall = lcs / len(gold_tokens)
+    return 2 * precision * recall / (precision + recall)
+def process_results(doc, results):
+    raw_response = results[0].strip() if results and results[0] else ""
+    pred = _strip_think_tags(raw_response)
+    gold = (doc.get("summary") or "").strip()
+    return {"rouge_l": (gold, pred)}
+def rouge_l_agg(items):
+    if not items:
+        return 0.0
+    scores = [_rouge_l_f1(pred, gold) for gold, pred in items]
+    return sum(scores) / len(scores)
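Reviewer note: a worked example of the sentence-level ROUGE-L F1 described in this file. This is a standalone restatement, not the commit's code: it uses a full two-dimensional LCS table instead of the rolling row above, and the function names are hypothetical.

```python
import re
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def tokens(text):
    """Strip ASCII punctuation, collapse whitespace, lowercase, split."""
    text = text.translate(_PUNCT)
    return re.sub(r"\s+", " ", text).strip().lower().split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(pred, gold):
    """Harmonic mean of LCS precision and LCS recall."""
    p, g = tokens(pred), tokens(gold)
    if not p or not g:
        return 0.0
    lcs = lcs_length(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


# LCS is "the cat on the mat" (5 tokens); both sides have 6 tokens.
score = rouge_l_f1("The cat sat on the mat.", "the cat is on the mat")
print(round(score, 4))
```

Here precision and recall are both 5/6, so the F1 is also 5/6.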

evals/arabic/classification/arabic_sib200.yaml CHANGED

@@ -12,6 +12,13 @@ generation_kwargs:
 process_docs: !function utils.process_sib200_docs
 doc_to_text: "أنت نظام تصنيف مواضيع.\nاختر التصنيف الأنسب للنص التالي.\n\nالتصنيفات المسموحة: {{labels_str}}\n\nالتعليمات: أجب بتصنيف واحد فقط من التصنيفات المسموحة. لا تكتب أي شيء آخر.\n\nالنص:\n{{text}}\n\nالتصنيف:"
 doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

 process_docs: !function utils.process_sib200_docs
 doc_to_text: "أنت نظام تصنيف مواضيع.\nاختر التصنيف الأنسب للنص التالي.\n\nالتصنيفات المسموحة: {{labels_str}}\n\nالتعليمات: أجب بتصنيف واحد فقط من التصنيفات المسموحة. لا تكتب أي شيء آخر.\n\nالنص:\n{{text}}\n\nالتصنيف:"
 doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_label_set"
+        labels_field: "labels_str"
+      - function: "take_first"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro
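Reviewer note: the `filter_list` added here chains `strip_think_recover`, `regex_label_set`, and `take_first`. A minimal standalone sketch of the label-picking step, assuming it behaves like `RegexLabelSetFilter` in `evals/f1_utils.py`. The function name `pick_label` and the label string are illustrative and not from the commit.

```python
import re


def pick_label(response, labels_str, separator=",", fallback="[invalid]"):
    """Return the last allowed label found in ``response``, else ``fallback``."""
    labels = [lbl.strip() for lbl in labels_str.split(separator) if lbl.strip()]
    if not labels:
        return fallback
    # Longest-first so "science/technology" is tried before "science".
    ordered = sorted(set(labels), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(lbl) for lbl in ordered))
    matches = pattern.findall(response)
    return matches[-1] if matches else fallback


# Hypothetical label set containing a substring collision.
LABELS = "science/technology, travel, science, health"

print(pick_label("Could be travel, but the answer is science/technology", LABELS))
print(pick_label("no idea", LABELS))
print(pick_label("Health", LABELS))
```

Matching is case-sensitive in this sketch, as it is in the compiled pattern in the diff, so `Health` does not match `health` and falls back to `[invalid]`.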

evals/arabic/qa/arabic_qa.yaml CHANGED

@@ -2,6 +2,9 @@ group: arabic_qa
 task:
   - arabic_tydiqa
 aggregate_metric_list:
   - metric: f1
     aggregation: mean
     weight_by_size: true

 task:
   - arabic_tydiqa
 aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
   - metric: f1
     aggregation: mean
     weight_by_size: true

evals/cost_core.py CHANGED

@@ -53,7 +53,6 @@ _OUTPUT_HEURISTIC_BY_KEYWORD = [
     ("afrimgsm", 1024),
     ("mgsm", 1024),
     ("gsm8k", 1024),
-    ("ifeval", 512),
     # Specific QA tasks first (some end in "qa" as a substring).
     ("tydiqa", 256),
     ("aquas", 384),

     ("afrimgsm", 1024),
     ("mgsm", 1024),
     ("gsm8k", 1024),
     # Specific QA tasks first (some end in "qa" as a substring).
     ("tydiqa", 256),
     ("aquas", 384),

evals/english/english.yaml CHANGED

@@ -2,7 +2,6 @@ group: english
 task:
   - english_mcq
   - english_math
-  - ifeval
   - english_mmlu_pro
 metadata:
   version: 1.0

 task:
   - english_mcq
   - english_math
   - english_mmlu_pro
 metadata:
   version: 1.0

evals/eval_config.toml CHANGED

@@ -141,9 +141,6 @@ name = "english_mcq"
 [[tasks]]
 name = "english_math"
-[[tasks]]
-name = "ifeval"
 [[tasks]]
 name = "english_mmlu_pro"
@@ -153,95 +150,94 @@ name = "english_mmlu_pro"
 [[tasks]]
 name = "spanish_xquad_es"
-[[tasks]]
-name = "spanish_mcq"
 # Swahili
 [[tasks]]
 name = "swahili_classification"
 [[tasks]]
-name = "swahili_afrimgsm"
 [[tasks]]
-name = "swahili_nli"
-# French
 [[tasks]]
-name = "french_classification"
 [[tasks]]
-name = "french_mcq"
 [[tasks]]
-name = "french_math"
 [[tasks]]
-name = "french_qa"
-# Arabic
 [[tasks]]
-name = "arabic_classification"
 [[tasks]]
-name = "arabic_mcq"
 [[tasks]]
-name = "arabic_qa"
-# Hausa
 [[tasks]]
-name = "hausa_classification"
 [[tasks]]
-name = "hausa_mcq"
 [[tasks]]
-name = "hausa_afrimgsm"
 [[tasks]]
-name = "hausa_nli"
 [[tasks]]
-name = "hausa_qa"
 [[tasks]]
-name = "hausa_sentiment"
-# Yoruba
 [[tasks]]
-name = "yoruba_classification"
 [[tasks]]
-name = "yoruba_mcq"
 [[tasks]]
-name = "yoruba_afrimgsm"
 [[tasks]]
-name = "yoruba_nli"
 [[tasks]]
-name = "yoruba_qa"
 [[tasks]]
-name = "yoruba_sentiment"
-# Igbo
 [[tasks]]
-name = "igbo_classification"
 [[tasks]]
-name = "igbo_mcq"
 [[tasks]]
-name = "igbo_afrimgsm"
 [[tasks]]
-name = "igbo_nli"
 [[tasks]]
-name = "igbo_qa"
 [[tasks]]
-name = "igbo_sentiment"

 [[tasks]]
 name = "english_math"
 [[tasks]]
 name = "english_mmlu_pro"
 [[tasks]]
 name = "spanish_xquad_es"
 # Swahili
 [[tasks]]
 name = "swahili_classification"
+# Albanian
 [[tasks]]
+name = "albanian_classification"
 [[tasks]]
+name = "albanian_mcq"
 [[tasks]]
+name = "albanian_summarization"
 [[tasks]]
+name = "albanian_open_generation"
+# Portuguese
 [[tasks]]
+name = "portuguese_mcq"
 [[tasks]]
+name = "portuguese_classification"
 [[tasks]]
+name = "portuguese_nli"
+# Ukrainian
 [[tasks]]
+name = "ukrainian_classification"
 [[tasks]]
+name = "ukrainian_mcq"
 [[tasks]]
+name = "ukrainian_qa"
 [[tasks]]
+name = "ukrainian_summarization"
 [[tasks]]
+name = "ukrainian_open_generation"
+# Urdu
 [[tasks]]
+name = "urdu_claim"
 [[tasks]]
+name = "urdu_classification"
 [[tasks]]
+name = "urdu_qa"
+# French
 [[tasks]]
+name = "french_classification"
 [[tasks]]
+name = "french_mcq"
 [[tasks]]
+name = "french_math"
 [[tasks]]
+name = "french_qa"
+# Arabic
 [[tasks]]
+name = "arabic_classification"
 [[tasks]]
+name = "arabic_mcq"
 [[tasks]]
+name = "arabic_qa"
+# Hausa
 [[tasks]]
+name = "hausa_classification"
 [[tasks]]
+name = "hausa_mcq"
 [[tasks]]
+name = "hausa_afrimgsm"
 [[tasks]]
+name = "hausa_nli"
 [[tasks]]
+name = "hausa_qa"

evals/f1_utils.py CHANGED

@@ -3,9 +3,25 @@
 Provides the macro-averaged F1 aggregation and a common process_results
 helper used across all language-specific classification utils modules.
-Registers ``regex_last``: like lm_eval's ``regex`` filter, but picks a match
-from ``findall`` using ``group_select``; default ``group_select=-1`` is the
-**last** match (needed when CoT/reasoning mentions labels before the answer).
 """
 import re
@@ -21,6 +37,41 @@ def _strip_think_tags(text: str) -> str:
     return text
 def macro_f1_agg(items):
     """Compute macro-averaged F1 over all class labels.
@@ -93,29 +144,123 @@ class RegexLastFilter(Filter):
         return list(map(filter_set, resps))
-def _normalize_label(s: str) -> str:
-    """Light normalization for classification labels: lowercase, strip
-    surrounding whitespace and trailing punctuation (e.g. ``entailment.``).
     """
-    if not s:
-        return ""
-    cleaned = s.strip().lower()
-    # Strip trailing punctuation commonly emitted by chat models
-    cleaned = cleaned.rstrip(".,;:!?\"'`")
-    return cleaned.strip()
 def process_results_f1(doc, results, *, gold_key="target"):
     """Return ``(pred, gold)`` for macro-F1 aggregation.
-    ``pred`` is the label after stripping think wrappers and light
-    normalization (lowercasing, trailing punctuation). Full raw generation
-    is logged as ``resps`` / ``reasoning_content`` when using ``run_eval.py``.
     Most tasks use ``gold_key="target"``; override for tasks that store
     the gold label under a different field name.
     """
     raw_response = results[0].strip() if results[0] else ""
-    pred = _normalize_label(_strip_think_tags(raw_response))
-    gold = _normalize_label(doc.get(gold_key, ""))
     return {"f1_macro": (pred, gold)}

 Provides the macro-averaged F1 aggregation and a common process_results
 helper used across all language-specific classification utils modules.
+Registers:
+- ``regex_last``: like lm_eval's ``regex`` filter, but picks a match from
+  ``findall`` using ``group_select``; default ``group_select=-1`` is the
+  **last** match (needed when CoT/reasoning mentions labels before the answer).
+- ``strip_think_recover``: drop ``<think>…</think>`` so
+  downstream ``regex`` sees only the final answer channel; if that tail is empty
+  (e.g. stop at ``\\n\\n`` before content), fall back to the last non-empty line
+  of the reasoning block (see ``run_eval.py`` merge format).
+- ``regex_label_set``: pick the last occurrence of any allowed label from a
+  per-doc field (e.g. ``labels_str`` for SIB-200, ``intents_str`` for InjongoIntent).
+  Robust to channel-marker leak (e.g. a leaked ``<|channel|>`` header before the
+  answer), models that say "the answer is X", and substring collisions
+  (``science/technology`` vs ``science``) -- labels are matched longest-first.
+- ``strip_channel_header``: drop a Harmony-style channel-marker prefix
+  (``<channel|>`` / ``<|channel|>`` and optional trailing ``<|message|>``) from
+  the start of the response. Useful for open-text generation tasks
+  (summarization / QA / open generation) where the actual answer is correct
+  but the chat template leaks tokens at the start. No-op when no marker found.
 """
 import re
     return text
+@register_filter("strip_think_recover")
+class StripThinkRecoverFilter(Filter):
+    """Remove think wrapper so MCQ ``regex`` runs on the answer tail only.
+    When ``run_eval.py`` merges API ``reasoning`` + ``content``, the built-in
+    ``regex`` ``([ABCD])`` / ``([ABCDE])`` filter would otherwise match the
+    **first** letter inside the reasoning block. This step keeps only text after
+    ``</think>`` when non-empty; if that tail is empty, uses the
+    last non-empty line inside the reasoning (common when generation stops early).
+    """
+    def __init__(self) -> None:
+        pass
+    def apply(self, resps, docs):
+        def strip_set(inst):
+            stripped = []
+            for resp in inst:
+                if not isinstance(resp, str):
+                    resp = ""
+                content = _strip_think_tags(resp)
+                if not content and "</think>" in resp:
+                    reasoning = resp.split("</think>")[0]
+                    if "<think>" in reasoning:
+                        reasoning = reasoning.split("<think>", 1)[1]
+                    lines = [
+                        ln.strip() for ln in reasoning.strip().splitlines() if ln.strip()
+                    ]
+                    content = lines[-1] if lines else ""
+                stripped.append(content)
+            return stripped
+        return list(map(strip_set, resps))
 def macro_f1_agg(items):
     """Compute macro-averaged F1 over all class labels.
         return list(map(filter_set, resps))
+@register_filter("regex_label_set")
+class RegexLabelSetFilter(Filter):
+    """Pick the LAST occurrence of any allowed label from the response.
+    The allowed-label list is read from a per-doc field (default ``labels_str``,
+    e.g. ``"entertainment, geography, ..., science/technology, ..."``). Labels
+    are matched **longest-first** so multi-segment labels like
+    ``science/technology`` win over substring collisions like ``science``.
+    Robust to:
+    - ``<think>...</think>`` reasoning leak -- typically chain ``strip_think_recover``
+      first to drop the reasoning block, then this filter on the answer tail.
+    - Harmony / channel-marker leak (e.g. a leaked ``<|channel|>`` header followed
+      by the actual label) -- the regex still finds the trailing label substring.
+    - "The answer is X" / "Final: X" patterns -- the LAST occurrence wins.
+    If the doc field is missing, empty, or no label matches, returns
+    ``fallback`` (default ``"[invalid]"``) so the row counts as wrong in F1.
     """
+    def __init__(
+        self,
+        labels_field: str = "labels_str",
+        separator: str = ",",
+        fallback: str = "[invalid]",
+    ) -> None:
+        self.labels_field = labels_field
+        self.separator = separator
+        self.fallback = fallback
+    def apply(self, resps: list[list[str]], docs: list[dict]) -> list[list[str]]:
+        out: list[list[str]] = []
+        for resp_set, doc in zip(resps, docs):
+            raw_labels = str((doc or {}).get(self.labels_field, "") or "")
+            labels = [
+                lbl.strip() for lbl in raw_labels.split(self.separator) if lbl.strip()
+            ]
+            # Sort longest-first so e.g. "science/technology" matches before
+            # the bare "science" substring inside it.
+            labels_sorted = sorted(set(labels), key=len, reverse=True)
+            pattern = (
+                re.compile("(" + "|".join(re.escape(lbl) for lbl in labels_sorted) + ")")
+                if labels_sorted
+                else None
+            )
+            filtered: list[str] = []
+            for resp in resp_set:
+                if not isinstance(resp, str):
+                    resp = ""
+                if pattern is None:
+                    filtered.append(self.fallback)
+                    continue
+                matches = pattern.findall(resp)
+                filtered.append(matches[-1].strip() if matches else self.fallback)
+            out.append(filtered)
+        return out
+@register_filter("strip_channel_header")
+class StripChannelHeaderFilter(Filter):
+    """Strip Harmony-style channel/message header leaks from the response.
+    Some providers (notably deepinfra fp8 Gemma) leak chat template tokens like
+    ``<|channel|>final<|message|>`` -- or partial fragments such as
+    ``s.<channel|>`` (where ``s.`` is the tail of the previous token) -- into
+    the assistant ``content``. For open-text tasks (summarization / QA / open
+    generation) this hurts every metric (ROUGE / BLEU / SAS-encoder / LLM judge)
+    because the prefix garbage drags down the score even when the actual answer
+    that follows is correct.
+    Strategy: anchored at the start of the response, match up to ``max_prefix_chars``
+    (default 80) characters of any text followed by a ``<channel|>`` /
+    ``<|channel|>`` marker, optionally followed by a Harmony ``<|message|>`` /
+    ``<message|>`` marker (and any text in between like ``final``). Drop everything
+    matched. No-op when no marker is found near the start, so safe for clean responses.
+    Order tip: chain after ``strip_think_recover`` so reasoning is dropped first
+    and this filter operates on the answer tail only.
+    """
+    def __init__(self, max_prefix_chars: int = 80) -> None:
+        self.max_prefix_chars = int(max_prefix_chars)
+        # ^                  - anchored at start
+        # .{0,N}?            - up to N chars of garbage prefix (non-greedy)
+        # <\|?channel\|?>    - matches both <channel|> and <|channel|>
+        # (?:[^<]{0,40}<\|?message\|?>)? - optional Harmony tail, e.g. "final<|message|>"
+        # \s*                - eat trailing whitespace
+        self._pattern = re.compile(
+            r"^.{0," + str(self.max_prefix_chars) + r"}?<\|?channel\|?>"
+            r"(?:[^<]{0,40}<\|?message\|?>)?\s*",
+            re.DOTALL,
+        )
+    def apply(self, resps: list[list[str]], docs: list[dict]) -> list[list[str]]:
+        out: list[list[str]] = []
+        for resp_set in resps:
+            stripped: list[str] = []
+            for resp in resp_set:
+                if not isinstance(resp, str):
+                    resp = ""
+                cleaned = self._pattern.sub("", resp, count=1)
+                stripped.append(cleaned)
+            out.append(stripped)
+        return out
 def process_results_f1(doc, results, *, gold_key="target"):
     """Return ``(pred, gold)`` for macro-F1 aggregation.
+    ``pred`` is the label after stripping think wrappers. Full reasoning
+    is logged as ``reasoning_content`` when using ``run_eval.py``.
     Most tasks use ``gold_key="target"``; override for tasks that store
     the gold label under a different field name.
     """
     raw_response = results[0].strip() if results[0] else ""
+    pred = _strip_think_tags(raw_response)
+    gold = doc.get(gold_key, "").strip()
     return {"f1_macro": (pred, gold)}
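Reviewer note: the regular expression in `StripChannelHeaderFilter` can be exercised on its own. The sketch below rebuilds the same pattern outside the lm-eval `Filter` class. The factory name `make_header_stripper` and the sample strings are hypothetical.

```python
import re


def make_header_stripper(max_prefix_chars=80):
    """Build a function that removes a leaked channel header near the start."""
    pattern = re.compile(
        r"^.{0," + str(int(max_prefix_chars)) + r"}?<\|?channel\|?>"
        r"(?:[^<]{0,40}<\|?message\|?>)?\s*",
        re.DOTALL,
    )
    return lambda text: pattern.sub("", text, count=1)


strip_header = make_header_stripper()

# Full Harmony-style header, then the answer.
print(strip_header("<|channel|>final<|message|>Tirana is the capital."))
# Partial fragment with a stray prefix from the previous token.
print(strip_header("s.<channel|>Answer"))
# Clean response: left untouched.
print(strip_header("Clean answer"))
```

Because the pattern is anchored at the start and the prefix is capped, a marker that appears further into the response than `max_prefix_chars` is left in place.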

evals/french/classification/french_sib200.yaml CHANGED

@@ -12,6 +12,13 @@ generation_kwargs:
 process_docs: !function utils.process_sib200_docs
 doc_to_text: "Vous etes un systeme de classification de sujets.\nChoisissez la meilleure etiquette pour le texte suivant.\n\nEtiquettes autorisees: {{labels_str}}\n\nInstruction: Repondez avec UNE SEULE etiquette parmi les etiquettes autorisees. N'ecrivez rien d'autre.\n\nTexte:\n{{text}}\n\nEtiquette:"
 doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

 process_docs: !function utils.process_sib200_docs
 doc_to_text: "Vous etes un systeme de classification de sujets.\nChoisissez la meilleure etiquette pour le texte suivant.\n\nEtiquettes autorisees: {{labels_str}}\n\nInstruction: Repondez avec UNE SEULE etiquette parmi les etiquettes autorisees. N'ecrivez rien d'autre.\n\nTexte:\n{{text}}\n\nEtiquette:"
 doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_label_set"
+        labels_field: "labels_str"
+      - function: "take_first"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

evals/french/qa/french_qa.yaml CHANGED

@@ -2,6 +2,9 @@ group: french_qa
 task:
   - french_fquad
 aggregate_metric_list:
   - metric: f1
     aggregation: mean
     weight_by_size: true

 task:
   - french_fquad
 aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
   - metric: f1
     aggregation: mean
     weight_by_size: true

evals/hausa/classification/hausa_sib200.yaml CHANGED

@@ -12,6 +12,13 @@ generation_kwargs:
 process_docs: !function utils.process_sib200_docs
 doc_to_text: "You are a topic classification system.\nChoose the single best label for the following Hausa text.\n\nAllowed labels: {{labels_str}}\n\nInstruction: Reply with ONE label only from the allowed labels. Do not write anything else.\n\nText:\n{{text}}\n\nLabel:"
 doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

 process_docs: !function utils.process_sib200_docs
 doc_to_text: "You are a topic classification system.\nChoose the single best label for the following Hausa text.\n\nAllowed labels: {{labels_str}}\n\nInstruction: Reply with ONE label only from the allowed labels. Do not write anything else.\n\nText:\n{{text}}\n\nLabel:"
 doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_label_set"
+        labels_field: "labels_str"
+      - function: "take_first"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

evals/hausa/hausa.yaml CHANGED

@@ -5,6 +5,5 @@ task:
   - hausa_afrimgsm
   - hausa_nli
   - hausa_qa
-  - hausa_sentiment
 metadata:
   version: 1.0

   - hausa_afrimgsm
   - hausa_nli
   - hausa_qa
 metadata:
   version: 1.0

evals/hausa/nli/hausa_afrixnli.yaml CHANGED

@@ -12,6 +12,13 @@ generation_kwargs:
 process_docs: !function utils.process_afrixnli_docs
 doc_to_text: "Premise: {{premise}}\nHypothesis: {{hypothesis}}\n\nDoes the hypothesis follow from the premise?\nAllowed answers: entailment, neutral, contradiction\n\nInstruction: Reply with ONE word only from the allowed answers.\n\nAnswer:"
 doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

 process_docs: !function utils.process_afrixnli_docs
 doc_to_text: "Premise: {{premise}}\nHypothesis: {{hypothesis}}\n\nDoes the hypothesis follow from the premise?\nAllowed answers: entailment, neutral, contradiction\n\nInstruction: Reply with ONE word only from the allowed answers.\n\nAnswer:"
 doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_label_set"
+        labels_field: "labels_str"
+      - function: "take_first"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

evals/hausa/nli/utils.py CHANGED

@@ -6,6 +6,7 @@ import datasets
 from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
 LABELS = ["entailment", "neutral", "contradiction"]
 def process_afrixnli_docs(dataset: datasets.Dataset) -> datasets.Dataset:
@@ -18,6 +19,7 @@ def process_afrixnli_docs(dataset: datasets.Dataset) -> datasets.Dataset:
             doc["target"] = LABELS[lbl]
         else:
             doc["target"] = str(lbl)
         return doc
     return dataset.map(_process)

 from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
 LABELS = ["entailment", "neutral", "contradiction"]
+LABELS_STR = ", ".join(LABELS)
 def process_afrixnli_docs(dataset: datasets.Dataset) -> datasets.Dataset:
             doc["target"] = LABELS[lbl]
         else:
             doc["target"] = str(lbl)
+        doc["labels_str"] = LABELS_STR
         return doc
     return dataset.map(_process)
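Reviewer note: `labels_str` is added to each doc so the `strip_think_recover` then `regex_label_set` chain can run on NLI outputs. A minimal sketch of the first step, assuming it follows the fallback logic of `StripThinkRecoverFilter` in `evals/f1_utils.py`. The standalone function and the sample responses are illustrative only.

```python
def strip_think_recover(resp):
    """Keep the text after ``</think>``; if empty, recover the last reasoning line."""
    if not isinstance(resp, str):
        return ""
    if "</think>" not in resp:
        return resp
    tail = resp.split("</think>")[-1].strip()
    if tail:
        return tail
    # Answer channel is empty: fall back to the reasoning block.
    reasoning = resp.split("</think>")[0]
    if "<think>" in reasoning:
        reasoning = reasoning.split("<think>", 1)[1]
    lines = [ln.strip() for ln in reasoning.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


# Normal case: the answer follows the reasoning block.
print(strip_think_recover("<think>premise implies it</think>entailment"))
# Generation stopped right after the reasoning: recover its last line.
print(strip_think_recover("<think>Let me check.\nSo the answer is neutral\n</think>"))
```

In the second case the recovered line still contains extra words, which is why a label-matching step runs afterwards to extract `neutral` from it.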

evals/hausa/qa/hausa_qa.yaml CHANGED

@@ -2,6 +2,9 @@ group: hausa_qa
 task:
   - hausa_afriqa
 aggregate_metric_list:
   - metric: f1
     aggregation: mean
     weight_by_size: true

 task:
   - hausa_afriqa
 aggregate_metric_list:
+  - metric: exact_match
+    aggregation: mean
+    weight_by_size: true
   - metric: f1
     aggregation: mean
     weight_by_size: true

evals/hausa/sentiment/utils.py DELETED

@@ -1,26 +0,0 @@
-"""Sentiment utils."""
-import os as _os, sys as _sys  # noqa: E401
-_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..")))
-import datasets
-from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
-LABELS = ["positive", "negative", "neutral"]
-def process_naijasenti_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-    feat = dataset.features.get("label")
-    def _process(doc):
-        lbl = doc.get("label")
-        if isinstance(lbl, int) and feat is not None and hasattr(feat, "names") and feat.names:
-            doc["target"] = feat.names[lbl]
-        elif isinstance(lbl, int) and 0 <= lbl < len(LABELS):
-            doc["target"] = LABELS[lbl]
-        else:
-            doc["target"] = str(lbl).lower()
-        return doc
-    return dataset.map(_process)
-def process_results(doc, results):
-    return process_results_f1(doc, results)

evals/igbo/afrimgsm/igbo_afrimgsm.yaml DELETED

@@ -1,28 +0,0 @@
-task: igbo_afrimgsm
-task_alias: afrimgsm
-dataset_path: masakhane/afrimgsm
-dataset_name: ibo
-test_split: test
-output_type: generate_until
-generation_kwargs:
-  do_sample: false
-  max_gen_toks: 8192
-  until:
-    - "<|endoftext|>"
-doc_to_text: "Question: {{question}}\nAnswer:"
-doc_to_target: "{{answer_number|string}}"
-filter_list:
-  - name: "get_answer"
-    filter:
-      - function: "regex"
-        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
-        group_select: -1
-      - function: "take_first"
-metric_list:
-  - metric: exact_match
-    aggregation: mean
-    higher_is_better: true
-    ignore_case: true
-    ignore_punctuation: true
-metadata:
-  version: 1.0

evals/igbo/igbo.yaml DELETED

@@ -1,10 +0,0 @@
-group: igbo
-task:
-  - igbo_classification
-  - igbo_mcq
-  - igbo_afrimgsm
-  - igbo_nli
-  - igbo_qa
-  - igbo_sentiment
-metadata:
-  version: 1.0

evals/igbo/mcq/igbo_afrimmlu.yaml DELETED

@@ -1,9 +0,0 @@
-task: igbo_afrimmlu
-task_alias: afrimmlu
-dataset_path: masakhane/afrimmlu
-dataset_name: ibo
-test_split: test
-include: _default_mcq_yaml
-process_docs: !function utils.process_afrimmlu_docs
-doc_to_text: "You are a highly knowledgeable AI that answers multiple-choice questions about '{{subject_field}}'.\n\nQuestion:\n{{question}}\n\nChoices:\nA: {{choice_a}}\nB: {{choice_b}}\nC: {{choice_c}}\nD: {{choice_d}}\n\nInstruction: Reply with EXACTLY one letter: A, B, C, or D. No other text.\n\nAnswer:"
-doc_to_target: "{{gold_letter}}"

evals/igbo/nli/utils.py DELETED

@@ -1,26 +0,0 @@
-"""NLI utils."""
-import os as _os, sys as _sys  # noqa: E401
-_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..")))
-import datasets
-from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
-LABELS = ["entailment", "neutral", "contradiction"]
-def process_afrixnli_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-    feat = dataset.features.get("label")
-    def _process(doc):
-        lbl = doc.get("label")
-        if isinstance(lbl, int) and feat is not None and hasattr(feat, "names") and feat.names:
-            doc["target"] = feat.names[lbl]
-        elif isinstance(lbl, int) and 0 <= lbl < len(LABELS):
-            doc["target"] = LABELS[lbl]
-        else:
-            doc["target"] = str(lbl)
-        return doc
-    return dataset.map(_process)
-def process_results(doc, results):
-    return process_results_f1(doc, results)

evals/igbo/qa/utils.py DELETED

@@ -1,61 +0,0 @@
-"""QA utils."""
-import re
-import string
-import unicodedata
-import json
-import datasets
-def _get_gold_answers(doc):
-    """AfriQA stores answers in a nested structure that may be a dict or stringified list."""
-    ap = doc.get("answer_pivot") or doc.get("answers") or {}
-    if isinstance(ap, str):
-        try:
-            ap = json.loads(ap)
-        except Exception:
-            return [ap]
-    if isinstance(ap, dict):
-        a = ap.get("answers") or ap.get("text") or []
-        if isinstance(a, str):
-            try:
-                a = json.loads(a)
-            except Exception:
-                return [a]
-        return a if isinstance(a, list) else [str(a)]
-    if isinstance(ap, list):
-        return [str(x) for x in ap]
-    return [str(ap)]
-def _normalize(s: str) -> str:
-    s = unicodedata.normalize("NFKC", s).lower()
-    s = "".join(c for c in s if c not in string.punctuation)
-    s = " ".join(s.split())
-    return s
-def _f1(pred: str, gold: str) -> float:
-    pred_toks = _normalize(pred).split()
-    gold_toks = _normalize(gold).split()
-    if not pred_toks or not gold_toks:
-        return float(pred_toks == gold_toks)
-    common = set(pred_toks) & set(gold_toks)
-    num_same = sum(min(pred_toks.count(t), gold_toks.count(t)) for t in common)
-    if num_same == 0:
-        return 0.0
-    p = num_same / len(pred_toks)
-    r = num_same / len(gold_toks)
-    return 2 * p * r / (p + r)
-def process_results_qa(doc, results):
-    pred = results[0].strip() if results[0] else ""
-    if "</think>" in pred:
-        pred = pred.split("</think>")[-1].strip()
-    golds = _get_gold_answers(doc)
-    if not golds:
-        return {"exact_match": 0.0, "f1": 0.0}
-    em = max(1.0 if _normalize(pred) == _normalize(g) else 0.0 for g in golds)
-    f1 = max(_f1(pred, g) for g in golds)
-    return {"exact_match": em, "f1": f1}
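
Review note: the deleted QA helper scores each prediction against every gold answer and keeps the best exact-match and token-F1 values. The sketch below restates that logic in a self-contained form so the arithmetic is easy to check. It swaps the original `count`-based overlap for an equivalent `collections.Counter` intersection; the names `normalize`, `token_f1` and `best_scores` are illustrative and do not appear in the repository.

```python
import string
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, drop ASCII punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def token_f1(pred: str, gold: str) -> float:
    """Bag-of-tokens F1 between one prediction and one gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_toks == gold_toks)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def best_scores(pred: str, golds: list[str]) -> dict[str, float]:
    """Take the maximum exact match and F1 over all gold answers."""
    if not golds:
        return {"exact_match": 0.0, "f1": 0.0}
    em = max(float(normalize(pred) == normalize(g)) for g in golds)
    f1 = max(token_f1(pred, g) for g in golds)
    return {"exact_match": em, "f1": f1}
```

For example, `token_f1("The cat sat.", "cat sat down")` shares two of three tokens on each side, so precision and recall are both 2/3 and the F1 is 2/3.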

evals/igbo/sentiment/igbo_sentiment.yaml DELETED

@@ -1,9 +0,0 @@
-group: igbo_sentiment
-task:
-  - igbo_naijasenti
-aggregate_metric_list:
-  - metric: f1_macro
-    aggregation: mean
-    weight_by_size: true
-metadata:
-  version: 1.0

evals/igbo/sentiment/utils.py DELETED

@@ -1,26 +0,0 @@
-"""Sentiment utils."""
-import os as _os, sys as _sys  # noqa: E401
-_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..")))
-import datasets
-from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
-LABELS = ["positive", "negative", "neutral"]
-def process_naijasenti_docs(dataset: datasets.Dataset) -> datasets.Dataset:
-    feat = dataset.features.get("label")
-    def _process(doc):
-        lbl = doc.get("label")
-        if isinstance(lbl, int) and feat is not None and hasattr(feat, "names") and feat.names:
-            doc["target"] = feat.names[lbl]
-        elif isinstance(lbl, int) and 0 <= lbl < len(LABELS):
-            doc["target"] = LABELS[lbl]
-        else:
-            doc["target"] = str(lbl).lower()
-        return doc
-    return dataset.map(_process)
-def process_results(doc, results):
-    return process_results_f1(doc, results)

evals/portuguese/README.md ADDED

	@@ -0,0 +1,131 @@

+# Portuguese – lm-eval Tasks
+Portuguese (PT-BR) evaluation suite for the `lm-evaluation-harness` framework.
+## Overview
+| #   | Task Name                | Category       | Dataset (HuggingFace)                           | Metric      |
+| --- | ------------------------ | -------------- | ----------------------------------------------- | ----------- |
+| 1   | `portuguese_enem`        | MCQ            | `eduagarcia/enem_challenge`                     | exact_match |
+| 2   | `portuguese_bluex`       | MCQ            | `eduagarcia-temp/BLUEX_without_images`          | exact_match |
+| 3   | `portuguese_oab_exams`   | MCQ            | `eduagarcia/oab_exams`                          | exact_match |
+| 4   | `portuguese_hatebr`      | Classification | `eduagarcia/portuguese_benchmark` (HateBR)      | f1_macro    |
+| 5   | `portuguese_hate_speech` | Classification | `eduagarcia/portuguese_benchmark` (Hate Speech) | f1_macro    |
+| 6   | `portuguese_tweetsentbr` | Classification | `eduagarcia/tweetsentbr_fewshot`                | f1_macro    |
+| 7   | `portuguese_assin2_rte`  | NLI            | `assin2`                                        | f1_macro    |
+| 8   | `portuguese_faquad_nli`  | NLI            | `ruanchaves/faquad-nli`                         | f1_macro    |
+| 9   | `portuguese_assin2_sts`  | NLI            | `assin2`                                        | pearson     |
+### Subgroups
+| Group                       | Tasks                              |
+| --------------------------- | ---------------------------------- |
+| `portuguese_mcq`            | enem, bluex, oab_exams             |
+| `portuguese_classification` | hatebr, hate_speech, tweetsentbr   |
+| `portuguese_nli`            | assin2_rte, faquad_nli, assin2_sts |
+## Setup
+```bash
+pip install lm-eval
+```
+## Running Tasks
+First, `cd` into the `lm_eval_tasks` directory and set the include path:
+```bash
+cd functionary_internal/evaluation/multilingual_bench/lm_eval_tasks
+export INCLUDE_PATH="$(pwd)"
+```
+### Run the Entire Portuguese Suite (all 9 tasks)
+```bash
+OPENAI_API_KEY="your-key" \
+lm_eval \
+  --include_path $INCLUDE_PATH \
+  --tasks portuguese \
+  --model local-chat-completions \
+  --model_args model=your-model,base_url=https://openrouter.ai/api/v1/chat/completions,num_concurrent=5 \
+  --apply_chat_template \
+  --num_fewshot 0 \
+  --log_samples \
+  --output_path output/portuguese_results \
+  --gen_kwargs '{"temperature":0.6,"top_p":0.95,"provider":{"order":["alibaba"]}}'
+```
+### Run a Single Category
+```bash
+# Multiple-choice exams (ENEM + BLUEX + OAB)
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_mcq ...
+# Classification (hate speech, sentiment)
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_classification ...
+# Natural Language Inference (ASSIN2 RTE + FaQuAD NLI + ASSIN2 STS)
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_nli ...
+```
+### Run a Single Task
+```bash
+# ENEM exam
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem ...
+# BLUEX vestibular
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_bluex ...
+# OAB bar exam
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_oab_exams ...
+# Hate speech detection
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_hatebr ...
+# Sentiment analysis
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_tweetsentbr ...
+# Textual entailment
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_assin2_rte ...
+```
+### Run with a Local HuggingFace Model
+```bash
+lm_eval \
+  --include_path $INCLUDE_PATH \
+  --tasks portuguese \
+  --model hf \
+  --model_args pretrained=your-org/your-model \
+  --num_fewshot 0 \
+  --log_samples \
+  --output_path output/portuguese_results
+```
+### Mix and Match
+```bash
+# Run ENEM + ASSIN2 RTE only
+lm_eval --include_path $INCLUDE_PATH --tasks portuguese_enem,portuguese_assin2_rte ...
+```
+## Output
+With `--log_samples`, the output directory contains:
+- `results.json` – aggregate scores per task
+- `samples_<task_name>.jsonl` – per-example model outputs for debugging
+## Dataset Sources
+| Dataset                | Source                                 | Config                          | Fields                                                      |
+| ---------------------- | -------------------------------------- | ------------------------------- | ----------------------------------------------------------- |
+| ENEM                   | `eduagarcia/enem_challenge`            | —                               | question, choices, answerKey                                |
+| BLUEX                  | `eduagarcia-temp/BLUEX_without_images` | —                               | question, choices, answerKey                                |
+| OAB Exams              | `eduagarcia/oab_exams`                 | —                               | question, choices, answerKey                                |
+| HateBR                 | `eduagarcia/portuguese_benchmark`      | `HateBR_offensive_binary`       | sentence, label                                             |
+| Portuguese Hate Speech | `eduagarcia/portuguese_benchmark`      | `Portuguese_Hate_Speech_binary` | sentence, label                                             |
+| TweetSentBR            | `eduagarcia/tweetsentbr_fewshot`       | —                               | sentence, label                                             |
+| ASSIN2                 | `assin2`                               | —                               | premise, hypothesis, entailment_judgment, relatedness_score |
+| FaQuAD-NLI             | `ruanchaves/faquad-nli`                | —                               | question, answer, label                                     |

evals/{igbo/nli/igbo_afrixnli.yaml → portuguese/classification/_default_classification_yaml} RENAMED

@@ -1,17 +1,10 @@
-task: igbo_afrixnli
-task_alias: afrixnli
-dataset_path: masakhane/afrixnli
-dataset_name: ibo
-test_split: test
 output_type: generate_until
 generation_kwargs:
   do_sample: false
   max_gen_toks: 8192
   until:
     - "<|endoftext|>"
-process_docs: !function utils.process_afrixnli_docs
-doc_to_text: "Premise: {{premise}}\nHypothesis: {{hypothesis}}\n\nDoes the hypothesis follow from the premise?\nAllowed answers: entailment, neutral, contradiction\n\nInstruction: Reply with ONE word only from the allowed answers.\n\nAnswer:"
-doc_to_target: "{{target}}"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

+# Shared config for Portuguese classification tasks (generative).
 output_type: generate_until
 generation_kwargs:
   do_sample: false
   max_gen_toks: 8192
   until:
     - "<|endoftext|>"
 process_results: !function utils.process_results
 metric_list:
   - metric: f1_macro

evals/portuguese/classification/portuguese_classification.yaml ADDED

	@@ -0,0 +1,12 @@

+# Classification subgroup (hate speech, sentiment)
+group: portuguese_classification
+task:
+  - portuguese_hatebr
+  - portuguese_hate_speech
+  - portuguese_tweetsentbr
+aggregate_metric_list:
+  - metric: f1_macro
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
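
Review note: with `aggregation: mean` and `weight_by_size: true`, the group score is expected to be a mean of the subtask scores weighted by each subtask's number of examples, rather than a plain average of three numbers. The sketch below works through that arithmetic. The scores and sizes are made-up placeholders, not results from this suite, and `size_weighted_mean` is an illustrative name.

```python
def size_weighted_mean(scores: dict[str, float], sizes: dict[str, int]) -> float:
    """Average subtask scores, weighting each by its number of examples."""
    total = sum(sizes[name] for name in scores)
    return sum(scores[name] * sizes[name] for name in scores) / total


# Hypothetical per-task macro-F1 values and test-set sizes.
scores = {
    "portuguese_hatebr": 0.80,
    "portuguese_hate_speech": 0.60,
    "portuguese_tweetsentbr": 0.70,
}
sizes = {
    "portuguese_hatebr": 1000,
    "portuguese_hate_speech": 500,
    "portuguese_tweetsentbr": 500,
}

weighted = size_weighted_mean(scores, sizes)        # (800 + 300 + 350) / 2000
unweighted = sum(scores.values()) / len(scores)     # plain mean of the three
print(round(weighted, 3), round(unweighted, 3))
```

With these placeholder numbers the weighted score is 0.725 against an unweighted 0.7, because the largest subtask pulls the group score toward its own result.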

evals/portuguese/classification/portuguese_hate_speech.yaml ADDED

	@@ -0,0 +1,17 @@

+task: portuguese_hate_speech
+task_alias: portuguese_hate_speech
+dataset_path: eduagarcia/portuguese_benchmark
+dataset_name: Portuguese_Hate_Speech_binary
+test_split: test
+include: _default_classification_yaml
+process_docs: !function utils.process_binary_docs
+doc_to_text: "Classifique se o texto a seguir contém discurso de ódio ou não. Responda apenas com \"Sim\" ou \"Não\".\n\nTexto: {{sentence}}\nPergunta: O texto contém discurso de ódio?\nResposta:"
+doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_last"
+        regex_pattern: "(Não|Sim)"
+        group_select: -1
+      - function: "take_first"
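
Review note: the `filter_list` above chains three steps: `strip_think_recover`, `regex_last`, and `take_first`. The first two are not defined anywhere in this part of the diff, so the sketch below is an assumption about their intent, not their implementation. It assumes that reasoning text up to the last `</think>` is discarded (mirroring the deleted QA helper) and that the last regex match wins. The `"[invalid]"` fallback and the function names are hypothetical.

```python
import re

# Same pattern as the YAML; note that it is case-sensitive.
LABEL_PATTERN = re.compile(r"(Não|Sim)")


def strip_think(text: str) -> str:
    """Assumed behavior: keep only what follows the last </think> tag."""
    if "</think>" in text:
        return text.split("</think>")[-1].strip()
    return text.strip()


def last_label(text: str, fallback: str = "[invalid]") -> str:
    """Assumed behavior: return the final label mentioned in the answer."""
    matches = LABEL_PATTERN.findall(strip_think(text))
    return matches[-1] if matches else fallback


print(last_label("<think>Sim? Talvez Não...</think> Sim"))
print(last_label("Não. Quer dizer, Sim"))
print(last_label("sim"))
```

Under these assumptions a model that answers in lowercase (`sim`) would not match the pattern and would be scored as an invalid prediction, which may be worth checking against real sample logs.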

evals/portuguese/classification/portuguese_hatebr.yaml ADDED

	@@ -0,0 +1,17 @@

+task: portuguese_hatebr
+task_alias: hatebr_offensive
+dataset_path: eduagarcia/portuguese_benchmark
+dataset_name: HateBR_offensive_binary
+test_split: test
+include: _default_classification_yaml
+process_docs: !function utils.process_binary_docs
+doc_to_text: "Classifique se o texto a seguir é ofensivo ou não. Responda apenas com \"Sim\" ou \"Não\".\n\nTexto: {{sentence}}\nPergunta: O texto é ofensivo?\nResposta:"
+doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_last"
+        regex_pattern: "(Não|Sim)"
+        group_select: -1
+      - function: "take_first"

evals/portuguese/classification/portuguese_tweetsentbr.yaml ADDED

	@@ -0,0 +1,16 @@

+task: portuguese_tweetsentbr
+task_alias: tweetsentbr
+dataset_path: eduagarcia/tweetsentbr_fewshot
+test_split: test
+include: _default_classification_yaml
+process_docs: !function utils.process_sentiment_docs
+doc_to_text: "Classifique o sentimento do texto a seguir. Responda apenas com \"Positivo\", \"Neutro\" ou \"Negativo\".\n\nTexto: {{sentence}}\nPergunta: O sentimento do texto é Positivo, Neutro ou Negativo?\nResposta:"
+doc_to_target: "{{target}}"
+filter_list:
+  - name: "get_label"
+    filter:
+      - function: "strip_think_recover"
+      - function: "regex_last"
+        regex_pattern: "(Negativo|Neutro|Positivo)"
+        group_select: -1
+      - function: "take_first"

evals/portuguese/classification/utils.py ADDED

	@@ -0,0 +1,112 @@

+"""Utility helpers for Portuguese classification tasks (generative mode).
+Each process_docs function adds a ``target`` field with the expected
+Portuguese label. process_results + macro_f1_agg compute macro-averaged F1.
+"""
+import os as _os, sys as _sys  # noqa: E401
+_sys.path.insert(0, _os.path.normpath(_os.path.join(_os.path.dirname(__file__), "..", "..")))
+from f1_utils import macro_f1_agg, process_results_f1  # noqa: F401
+# ── Emotion label mapping (English → Portuguese) ────────────────────
+EMOTION_LABEL_MAP = {
+    "Admiration": "Admiração",
+    "Amusement": "Diversão",
+    "Anger": "Raiva",
+    "Annoyance": "Aborrecimento",
+    "Approval": "Aprovação",
+    "Compassion": "Compaixão",
+    "Confusion": "Confusão",
+    "Curiosity": "Curiosidade",
+    "Desire": "Desejo",
+    "Disappointment": "Decepção",
+    "Disapproval": "Desaprovação",
+    "Disgust": "Nojo",
+    "Embarrassment": "Vergonha",
+    "Envy": "Inveja",
+    "Excitement": "Entusiasmo",
+    "Fear": "Medo",
+    "Gratitude": "Gratidão",
+    "Grief": "Luto",
+    "Joy": "Alegria",
+    "Longing": "Saudade",
+    "Love": "Amor",
+    "Nervousness": "Nervosismo",
+    "Optimism": "Otimismo",
+    "Pride": "Orgulho",
+    "Relief": "Alívio",
+    "Remorse": "Remorso",
+    "Sadness": "Tristeza",
+    "Surprise": "Surpresa",
+}
+SENTIMENT_LABEL_MAP = {
+    "Positive": "Positivo",
+    "Negative": "Negativo",
+    "Neutral": "Neutro",
+}
+# ── Document pre-processing ─────────────────────────────────────────
+def process_binary_docs(dataset):
+    """Map 0/1 label → Não/Sim (for hatebr, portuguese_hate_speech)."""
+    def _map(doc):
+        doc["target"] = "Sim" if doc["label"] == 1 else "Não"
+        return doc
+    return dataset.map(_map)
+def process_sentiment_docs(dataset):
+    """Map Positive/Negative/Neutral → Positivo/Negativo/Neutro (tweetsentbr)."""
+    def _map(doc):
+        doc["target"] = SENTIMENT_LABEL_MAP.get(doc["label"], "Neutro")
+        return doc
+    return dataset.map(_map)
+def process_sparrow_sentiment_docs(dataset):
+    """Map sparrow sentiment labels → Portuguese."""
+    def _map(doc):
+        doc["target"] = SENTIMENT_LABEL_MAP.get(doc["label"], "Neutro")
+        return doc
+    return dataset.map(_map)
+def process_sparrow_emotion_docs(dataset):
+    """Map English emotion labels → Portuguese."""
+    def _map(doc):
+        doc["target"] = EMOTION_LABEL_MAP.get(doc["label"], doc["label"])
+        return doc
+    return dataset.map(_map)
+def process_sparrow_hate_docs(dataset):
+    """Map Hate/NotHate → Sim/Não."""
+    def _map(doc):
+        doc["target"] = "Sim" if doc["label"] == "Hate" else "Não"
+        return doc
+    return dataset.map(_map)
+# ── Result processing ────────────────────────────────────────────────
+def process_results(doc, results):
+    """Return (pred, gold) tuple for macro-F1 aggregation."""
+    return process_results_f1(doc, results)
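
Review note: the `process_*_docs` helpers above add a `target` field by mapping the raw label, and each one has a silent fallback. The sketch below replays the binary and sentiment mappings on plain dictionaries, without the `datasets` library, to show what the fallbacks do. The sample rows are invented, and whether the real datasets ever contain such labels is not confirmed by this diff.

```python
SENTIMENT_LABEL_MAP = {
    "Positive": "Positivo",
    "Negative": "Negativo",
    "Neutral": "Neutro",
}


def binary_target(doc: dict) -> dict:
    """Same rule as process_binary_docs: only the integer 1 maps to Sim."""
    return {**doc, "target": "Sim" if doc["label"] == 1 else "Não"}


def sentiment_target(doc: dict) -> dict:
    """Same rule as process_sentiment_docs: unknown labels become Neutro."""
    return {**doc, "target": SENTIMENT_LABEL_MAP.get(doc["label"], "Neutro")}


# Invented rows that exercise the fallbacks.
print(binary_target({"sentence": "...", "label": 1})["target"])
print(binary_target({"sentence": "...", "label": "1"})["target"])
print(sentiment_target({"sentence": "...", "label": "Negative"})["target"])
print(sentiment_target({"sentence": "...", "label": "negative"})["target"])
```

In this sketch a string label `"1"` becomes `Não` and a lowercase `"negative"` becomes `Neutro`, so a label-format mismatch would shift gold targets without raising an error.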

evals/{swahili/afrimgsm/swahili_afrimgsm.yaml → portuguese/mcq/_default_mcq_yaml} RENAMED

@@ -1,28 +1,21 @@
-task: swahili_afrimgsm
-task_alias: afrimgsm
-dataset_path: masakhane/afrimgsm
-dataset_name: swa
-test_split: test
 output_type: generate_until
 generation_kwargs:
   do_sample: false
   max_gen_toks: 8192
   until:
     - "<|endoftext|>"
-doc_to_text: "Swali: {{question}}\nJibu:"
-doc_to_target: "{{answer_number|string}}"
 filter_list:
   - name: "get_answer"
     filter:
       - function: "regex"
-        regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)"
-        group_select: -1
       - function: "take_first"
 metric_list:
   - metric: exact_match
     aggregation: mean
     higher_is_better: true
-    ignore_case: true
-    ignore_punctuation: true
 metadata:
   version: 1.0

+# Shared config for Portuguese MCQ tasks (generative A/B/C/D/E).
 output_type: generate_until
 generation_kwargs:
   do_sample: false
   max_gen_toks: 8192
   until:
     - "<|endoftext|>"
 filter_list:
   - name: "get_answer"
     filter:
+      - function: "strip_think_recover"
       - function: "regex"
+        regex_pattern: "([ABCDE])"
+        group_select: 0
       - function: "take_first"
 metric_list:
   - metric: exact_match
     aggregation: mean
     higher_is_better: true
 metadata:
   version: 1.0
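
Review note: the new MCQ filter uses `regex_pattern: "([ABCDE])"` with `group_select: 0`. Assuming `group_select: 0` keeps the first match found in the response, any capital A to E in the answer text is a candidate, including the first letter of an ordinary word. The sketch below shows that effect with plain `re.findall`. The `first_choice` helper and the `"[invalid]"` fallback are illustrative, and the preceding `strip_think_recover` step is not modeled.

```python
import re

CHOICE_PATTERN = re.compile(r"([ABCDE])")


def first_choice(text: str, fallback: str = "[invalid]") -> str:
    """Return the first capital A-E found in the text, if any."""
    matches = CHOICE_PATTERN.findall(text)
    return matches[0] if matches else fallback


print(first_choice("C"))
print(first_choice("Resposta: B"))
print(first_choice("A resposta é C"))
```

The last call returns `A`, not `C`, because the article "A" at the start of the sentence matches first. If sample logs show models answering in full sentences, a pattern anchored on word boundaries or on the end of the response may be safer.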