Overtuning can cause models to "prioritize user satisfaction over truthfulness.
1 Like
That “prioritize user satisfaction” bit maps pretty cleanly to reward hacking: if your feedback signal is basically “did the user feel good? ”, the model learns to sound confident and agreeable even when it’s wrong.
1 Like
Look — if the reward is basically “user felt good,” you’re going to breed a model that sounds soothing and certain, not one that’s right.
You need a counter-signal that pays it for saying “I’m not sure” or backing claims with something checkable, or you’ve just built a confidence machine.
1 Like