Thanks for the input, but these all happened when temp = 0.0.
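For anyone following along: temp = 0.0 effectively means greedy decoding, so the output is deterministic and sampling noise can't explain the behavior. A minimal sketch of the equivalent setup with the `transformers` generate API (model name and prompt are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoModelForCausalLM  # noqa: placeholder comment removed below
tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The model said", return_tensors="pt")

# do_sample=False is the greedy equivalent of temperature 0:
# the argmax token is picked at every step, so repeated runs
# on the same prompt give the same output.
out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```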
My guess is that, since I mostly use datasets generated from voice, the models behave one way when they're talking like a human in day-to-day life, but the complete opposite when they're acting like a scientist producing a long text.
That's actually interesting. As Harari put it, "information connects two points," and human interaction mostly doesn't convey facts the way a textbook does; it largely serves to build social status and probe other human beings. So if a model is trained on casual human speech, one might conclude it's being trained on noise. From that perspective it's irrelevant what's actually more nutritious in "water cooler talk"; perhaps it's better to watch "who is with whom" instead - one day it's "yolk," the next day "white."
Either way, models are predictors optimized on their training data: noise in, noise out. With SFT you can certainly nudge a model toward predictions that are more grounded and stable, meaning it anchors them in prior knowledge rather than just generating text that sounds right.
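For reference, a minimal SFT sketch with TRL's `SFTTrainer` (this assumes a recent TRL version; the tiny inline dataset, the "gpt2" placeholder model, and the hyperparameters are all illustrative, not what I actually used):

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical toy dataset: SFTTrainer expects a "text" field by default.
# In practice this is where grounded, stable completions would go.
data = Dataset.from_list([
    {"text": "### Question:\nWhat is 2 + 2?\n### Answer:\n4"},
    {"text": "### Question:\nCapital of France?\n### Answer:\nParis"},
])

config = SFTConfig(
    output_dir="sft-out",
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="gpt2",        # placeholder; swap in your own base model id
    train_dataset=data,
    args=config,
)
trainer.train()
```

The point of the example is only the mechanism: the loss pushes the model toward the curated completions, so the quality of what goes into `data` is what decides whether you nudge it toward grounding or toward more noise.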
In fact, I recently fine-tuned an interesting corpus-pretrained raw model; look up my datasets and work: https://huggingface.co/martinsu/tildeopen-30b-mu-instruct