NewstaR/Newstar-Qwen3-0.6B-KTO

@@ -27,20 +27,20 @@ This version remains in non‑thinking mode—built for consistent and bias‑aw
 > - Do Sample: True
 > - Max New Tokens: 4096
-| Category              | Winner | Reason                                                                 |
-|-----------------------|--------|----------------------------------------------------------------------|
-| CS (RAM vs. ROM)      | KTO    | KTO is clearer, more structured, and avoids inaccuracies like BASE’s claim about excessive RAM. |
-| ENGINEERING (Water Filtration) | KTO    | KTO provides a practical, scientifically grounded system; BASE is confusing and impractical. |
-| MATH (Mean, Median, Mode) | KTO    | KTO’s structured, concise explanation outperforms BASE’s wordy but accurate response. |
-| SCIENCE (Osmosis vs. Diffusion) | KTO    | KTO is more detailed and accurate despite a minor error; BASE oversimplifies and has vague examples. |
-| WRITING (Lost Dog Story) | BASE   | BASE focuses on the dog and partially meets the prompt; KTO is off-topic and incoherent. |
-| CODING (Vowel Counting) | BASE   | BASE’s program is more robust (handles uppercase/lowercase) and includes test cases; KTO misses uppercase vowels. |
-| MATH SOLVING (Train Speed) | KTO    | Both are accurate, but KTO is more concise, delivering the result with less verbosity. |
-| COMMON SENSE LOGIC (Ice Melting) | KTO    | KTO accurately describes melting; BASE’s sublimation claim is incorrect. |
-| SOFT REASONING (Dog Barking) | BASE   | BASE provides a clearer affirmation despite flaws; KTO overcomplicates and undermines the premise. |
-| RIDDLE (Keys and Locks) | Neither | Both fail to identify the correct answer (piano) and provide irrelevant explanations. |
-| GENERAL CHAT (Hobby)  | BASE   | BASE’s detailed, engaging piano description outperforms KTO’s brief, shallow list. |
-| REWRITING (Formal Sentence) | KTO    | KTO’s rewrite is concise and equally formal; BASE is wordy with unnecessary alternatives. |
-| SUMMARIZATION (Tortoise and Hare) | KTO    | KTO is accurate and concise; BASE has factual errors (e.g., ten-day race). |
-| INSTRUCTION FOLLOWING (Vegetable Soup) | KTO    | KTO adheres closely to the prompt with clear, healthy steps; BASE misinterprets and lacks clarity. |
-| **Overall**           | **KTO** | KTO wins 9 categories vs. BASE’s 4, showing greater accuracy, clarity, and adherence to prompts. |

+| Category              | Prompt                                                                 | Winner | Reason                                                                 |
+|-----------------------|----------------------------------------------------------------------|--------|----------------------------------------------------------------------|
+| CS (RAM vs. ROM)      |                                                                      | KTO    | KTO is clearer, more structured, and avoids inaccuracies like BASE’s claim about excessive RAM. |
+| ENGINEERING (Water Filtration) |  | KTO    | KTO provides a practical, scientifically grounded system; BASE is confusing and impractical. |
+| MATH (Mean, Median, Mode) |                                                                      | KTO    | KTO’s structured, concise explanation outperforms BASE’s wordy but accurate response. |
+| SCIENCE (Osmosis vs. Diffusion) |                                                                      | KTO    | KTO is more detailed and accurate despite a minor error; BASE oversimplifies and has vague examples. |
+| WRITING (Lost Dog Story) | Write a short story about a lost dog finding its way home. | BASE   | BASE focuses on the dog and partially meets the prompt; KTO is off-topic and incoherent. |
+| CODING (Vowel Counting) | Create a simple program that counts the number of vowels in a sentence. | BASE   | BASE’s program is more robust (handles uppercase/lowercase) and includes test cases; KTO misses uppercase vowels. |
+| MATH SOLVING (Train Speed) | If a train travels 60 miles in 1.5 hours, what is its average speed? | KTO    | Both are accurate, but KTO is more concise, delivering the result with less verbosity. |
+| COMMON SENSE LOGIC (Ice Melting) | If you leave ice outside on a hot day, what happens to it? | KTO    | KTO accurately describes melting; BASE’s sublimation claim is incorrect. |
+| SOFT REASONING (Dog Barking) | If all dogs bark and Rex is a dog, does Rex bark? Why? | BASE   | BASE provides a clearer affirmation despite flaws; KTO overcomplicates and undermines the premise. |
+| RIDDLE (Keys and Locks) | What has keys but can’t open locks? | Neither | Both fail to identify the correct answer (piano) and provide irrelevant explanations. |
+| GENERAL CHAT (Hobby)  | Tell me about a hobby you enjoy. | BASE   | BASE’s detailed, engaging piano description outperforms KTO’s brief, shallow list. |
+| REWRITING (Formal Sentence) | Make this sentence more formal: “Can you fix the problem soon?” | KTO    | KTO’s rewrite is concise and equally formal; BASE is wordy with unnecessary alternatives. |
+| SUMMARIZATION (Tortoise and Hare) | Summarize the story of “The Tortoise and Hare” in two sentences. | KTO    | KTO is accurate and concise; BASE has factual errors (e.g., ten-day race). |
+| INSTRUCTION FOLLOWING (Vegetable Soup) | Explain how to prepare a simple vegetable soup that meets the following conditions: Use at least 3 different vegetables. The cooking time must not exceed 30 minutes. Include steps to make the soup both flavorful and healthy. Mention any kitchen tools needed. Provide alternatives if a vegetable is not available. Include tips to serve the soup nicely. | KTO    | KTO adheres closely to the prompt with clear, healthy steps; BASE misinterprets and lacks clarity. |
+| **Overall**           |                                                                      | **KTO** | KTO wins 9 of the 14 categories to BASE’s 4, with neither winning the riddle, showing greater accuracy, clarity, and adherence to prompts. |
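
Two of the prompts in the table have answers that can be checked mechanically: the vowel-counting program and the train-speed question. The sketch below is a minimal reference for both. It is not the output of either model, and the function names `count_vowels` and `average_speed` are illustrative assumptions. It treats only a, e, i, o, u as vowels and lowercases the input first, which is the uppercase handling the CODING row credits BASE with and says KTO lacks.

```python
def count_vowels(sentence: str) -> int:
    """Count the vowels a, e, i, o, u in a sentence, ignoring case."""
    vowels = set("aeiou")
    # Lowercasing first means 'A' and 'a' are counted alike.
    return sum(1 for ch in sentence.lower() if ch in vowels)


def average_speed(distance_miles: float, hours: float) -> float:
    """Average speed in miles per hour: distance divided by time."""
    return distance_miles / hours


if __name__ == "__main__":
    print(count_vowels("An Apple A Day"))  # 5
    print(average_speed(60, 1.5))          # 40.0
```

For the MATH SOLVING prompt, 60 miles in 1.5 hours gives 60 / 1.5 = 40 miles per hour, which is the result both models are reported to reach.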