Update README.md
README.md
CHANGED
@@ -27,20 +27,20 @@ This version remains in non‑thinking mode—built for consistent and bias‑aw
 > - Do Sample: True
 > - Max New Tokens: 4096
 
-| Category | Winner | Reason |
-
-| CS (RAM vs. ROM) | KTO | KTO is clearer, more structured, and avoids inaccuracies like BASE’s claim about excessive RAM. |
-| ENGINEERING (Water Filtration) | KTO | KTO provides a practical, scientifically grounded system; BASE is confusing and impractical. |
-| MATH (Mean, Median, Mode) | KTO | KTO’s structured, concise explanation outperforms BASE’s wordy but accurate response. |
-| SCIENCE (Osmosis vs. Diffusion) | KTO | KTO is more detailed and accurate despite a minor error; BASE oversimplifies and has vague examples. |
-| WRITING (Lost Dog Story) | BASE | BASE focuses on the dog and partially meets the prompt; KTO is off-topic and incoherent. |
-| CODING (Vowel Counting) | BASE | BASE’s program is more robust (handles uppercase/lowercase) and includes test cases; KTO misses uppercase vowels. |
-| MATH SOLVING (Train Speed) | KTO | Both are accurate, but KTO is more concise, delivering the result with less verbosity. |
-| COMMON SENSE LOGIC (Ice Melting) | KTO | KTO accurately describes melting; BASE’s sublimation claim is incorrect. |
-| SOFT REASONING (Dog Barking) | BASE | BASE provides a clearer affirmation despite flaws; KTO overcomplicates and undermines the premise. |
-| RIDDLE (Keys and Locks) | Neither | Both fail to identify the correct answer (piano) and provide irrelevant explanations. |
-| GENERAL CHAT (Hobby) | BASE | BASE’s detailed, engaging piano description outperforms KTO’s brief, shallow list. |
-| REWRITING (Formal Sentence) | KTO | KTO’s rewrite is concise and equally formal; BASE is wordy with unnecessary alternatives. |
-| SUMMARIZATION (Tortoise and Hare) | KTO | KTO is accurate and concise; BASE has factual errors (e.g., ten-day race). |
-| INSTRUCTION FOLLOWING (Vegetable Soup) | KTO | KTO adheres closely to the prompt with clear, healthy steps; BASE misinterprets and lacks clarity. |
-| **Overall** | **KTO** | KTO wins 9 categories vs. BASE’s 4, showing greater accuracy, clarity, and adherence to prompts. |
+| Category | Prompt | Winner | Reason |
+|-----------------------|----------------------------------------------------------------------|--------|----------------------------------------------------------------------|
+| CS (RAM vs. ROM) | | KTO | KTO is clearer, more structured, and avoids inaccuracies like BASE’s claim about excessive RAM. |
+| ENGINEERING (Water Filtration) | | KTO | KTO provides a practical, scientifically grounded system; BASE is confusing and impractical. |
+| MATH (Mean, Median, Mode) | | KTO | KTO’s structured, concise explanation outperforms BASE’s wordy but accurate response. |
+| SCIENCE (Osmosis vs. Diffusion) | | KTO | KTO is more detailed and accurate despite a minor error; BASE oversimplifies and has vague examples. |
+| WRITING (Lost Dog Story) | Write a short story about a lost dog finding its way home. | BASE | BASE focuses on the dog and partially meets the prompt; KTO is off-topic and incoherent. |
+| CODING (Vowel Counting) | Create a simple program that counts the number of vowels in a sentence. | BASE | BASE’s program is more robust (handles uppercase/lowercase) and includes test cases; KTO misses uppercase vowels. |
+| MATH SOLVING (Train Speed) | If a train travels 60 miles in 1.5 hours, what is its average speed? | KTO | Both are accurate, but KTO is more concise, delivering the result with less verbosity. |
+| COMMON SENSE LOGIC (Ice Melting) | If you leave ice outside on a hot day, what happens to it? | KTO | KTO accurately describes melting; BASE’s sublimation claim is incorrect. |
+| SOFT REASONING (Dog Barking) | If all dogs bark and Rex is a dog, does Rex bark? Why? | BASE | BASE provides a clearer affirmation despite flaws; KTO overcomplicates and undermines the premise. |
+| RIDDLE (Keys and Locks) | What has keys but can’t open locks? | Neither | Both fail to identify the correct answer (piano) and provide irrelevant explanations. |
+| GENERAL CHAT (Hobby) | Tell me about a hobby you enjoy. | BASE | BASE’s detailed, engaging piano description outperforms KTO’s brief, shallow list. |
+| REWRITING (Formal Sentence) | Make this sentence more formal: “Can you fix the problem soon?” | KTO | KTO’s rewrite is concise and equally formal; BASE is wordy with unnecessary alternatives. |
+| SUMMARIZATION (Tortoise and Hare) | Summarize the story of “The Tortoise and Hare” in two sentences. | KTO | KTO is accurate and concise; BASE has factual errors (e.g., ten-day race). |
+| INSTRUCTION FOLLOWING (Vegetable Soup) | Explain how to prepare a simple vegetable soup that meets the following conditions: Use at least 3 different vegetables. The cooking time must not exceed 30 minutes. Include steps to make the soup both flavorful and healthy. Mention any kitchen tools needed. Provide alternatives if a vegetable is not available. Include tips to serve the soup nicely. | KTO | KTO adheres closely to the prompt with clear, healthy steps; BASE misinterprets and lacks clarity. |
+| **Overall** | | **KTO** | KTO wins 9 categories vs. BASE’s 4, showing greater accuracy, clarity, and adherence to prompts. |
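
The CODING row judges robustness by whether uppercase and lowercase vowels are both counted. As a rough illustration of that criterion only, here is a minimal Python sketch. It is not the output of either BASE or KTO, and the function name `count_vowels` is a hypothetical choice made for this example.

```python
def count_vowels(sentence: str) -> int:
    """Count vowels in a sentence, treating upper and lower case alike."""
    vowels = set("aeiou")
    # Lowercase each character before the membership check so that
    # 'A' and 'a' are both counted.
    return sum(1 for ch in sentence if ch.lower() in vowels)


if __name__ == "__main__":
    print(count_vowels("Hello World"))
    print(count_vowels("AEIOU aeiou"))
```

A version that checks membership against `"aeiou"` without lowercasing first would skip the five uppercase vowels in the second call, which is the kind of gap the table attributes to KTO.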