Keywords: Poisoning Attacks, Reward Model Learning, Preference Label Manipulation
Abstract: Learning reward models from pairwise comparisons is crucial in domains like autonomous control, conversational agents, and recommendation systems to align automated decisions with user preferences. However, the anonymity and subjectivity of preferences make them vulnerable to malicious manipulation. We study attackers who flip a small subset of preference labels to promote or demote target outcomes, and we propose two attack approaches: a gradient-based framework and a rank-by-distance method. Evaluations across three domains reveal high success rates, with attacks achieving up to 100\% success by poisoning just 0.3\% of the data. Finally, we show that state-of-the-art defenses against other classes of poisoning attacks exhibit limited efficacy in our setting.
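To make the threat model concrete, here is a minimal, hypothetical sketch of a rank-by-distance style label-flipping attack on a linear Bradley-Terry reward model. The abstract does not specify the attack details, so everything here is an illustrative assumption: the linear reward, the distance heuristic (flip the labels of the pairs whose preferred item is nearest the target), the `budget` fraction, and the helper `fit_reward_model` are all hypothetical, not the paper's actual method.

```python
# Hypothetical sketch: rank-by-distance label flipping against a
# Bradley-Terry reward model. All names and heuristics are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def fit_reward_model(X_a, X_b, labels, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w @ x under the Bradley-Terry model:
    P(a preferred over b) = sigmoid(r(a) - r(b))."""
    w = np.zeros(X_a.shape[1])
    for _ in range(epochs):
        margin = (X_a - X_b) @ w
        p = 1.0 / (1.0 + np.exp(-margin))               # P(label == 1)
        grad = (X_a - X_b).T @ (p - labels) / len(labels)
        w -= lr * grad
    return w

# Synthetic preference data: labels come from a ground-truth reward w_true.
d, n = 5, 400
w_true = rng.normal(size=d)
X_a, X_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))
labels = ((X_a - X_b) @ w_true > 0).astype(float)       # 1 iff a preferred

# Attacker goal (assumed): demote a target outcome x_target.
x_target = rng.normal(size=d)

# Rank-by-distance heuristic (assumption): within a small poisoning
# budget, flip the labels of the pairs whose preferred item lies
# closest to the target in feature space.
budget = int(0.05 * n)
winners = np.where(labels[:, None] == 1, X_a, X_b)
dists = np.linalg.norm(winners - x_target, axis=1)
flip_idx = np.argsort(dists)[:budget]

poisoned = labels.copy()
poisoned[flip_idx] = 1.0 - poisoned[flip_idx]

w_clean = fit_reward_model(X_a, X_b, labels)
w_poisoned = fit_reward_model(X_a, X_b, poisoned)
print("target reward, clean model:   ", x_target @ w_clean)
print("target reward, poisoned model:", x_target @ w_poisoned)
```

The sketch only illustrates why such attacks are hard to detect: each flipped label is an individually plausible preference, so the poisoned dataset looks like ordinary noisy human feedback while the learned reward near the target shifts in the attacker's favor.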
Submission Number: 3