Keywords: Poisoning Attacks, Reward Model Learning, Preference Label Manipulation
Abstract: Learning reward models from pairwise comparisons is crucial in domains like autonomous control, conversational agents, and recommendation systems to align automated decisions with user preferences. However, the anonymity and subjectivity of preferences make them vulnerable to malicious manipulation. We study attackers who flip a small subset of preference labels to promote or demote target outcomes, and we propose two attack approaches: a gradient-based framework and a rank-by-distance method. Evaluations across three domains reveal high success rates, with attacks achieving up to 100\% success by poisoning just 0.3\% of the data. Finally, we show that state-of-the-art defenses against other classes of poisoning attacks exhibit limited efficacy in our setting.
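To make the threat model concrete, here is a minimal, hypothetical sketch of a rank-by-distance style label-flipping attack on a linear Bradley-Terry reward model. The abstract does not specify the attack details, so everything here is an illustrative assumption: the linear reward, the distance heuristic (flip the labels of the pairs whose preferred item is nearest the target), the `budget` fraction, and the helper `fit_reward_model` are all hypothetical, not the paper's actual method.

```python
# Hypothetical sketch: rank-by-distance label flipping against a
# Bradley-Terry reward model. All names and heuristics are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def fit_reward_model(X_a, X_b, labels, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w @ x under the Bradley-Terry model:
    P(a preferred over b) = sigmoid(r(a) - r(b))."""
    w = np.zeros(X_a.shape[1])
    for _ in range(epochs):
        margin = (X_a - X_b) @ w
        p = 1.0 / (1.0 + np.exp(-margin))               # P(label == 1)
        grad = (X_a - X_b).T @ (p - labels) / len(labels)
        w -= lr * grad
    return w

# Synthetic preference data: labels come from a ground-truth reward w_true.
d, n = 5, 400
w_true = rng.normal(size=d)
X_a, X_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))
labels = ((X_a - X_b) @ w_true > 0).astype(float)       # 1 iff a preferred

# Attacker goal (assumed): demote a target outcome x_target.
x_target = rng.normal(size=d)

# Rank-by-distance heuristic (assumption): within a small poisoning
# budget, flip the labels of the pairs whose preferred item lies
# closest to the target in feature space.
budget = int(0.05 * n)
winners = np.where(labels[:, None] == 1, X_a, X_b)
dists = np.linalg.norm(winners - x_target, axis=1)
flip_idx = np.argsort(dists)[:budget]

poisoned = labels.copy()
poisoned[flip_idx] = 1.0 - poisoned[flip_idx]

w_clean = fit_reward_model(X_a, X_b, labels)
w_poisoned = fit_reward_model(X_a, X_b, poisoned)
print("target reward, clean model:   ", x_target @ w_clean)
print("target reward, poisoned model:", x_target @ w_poisoned)
```

The sketch only illustrates why such attacks are hard to detect: each flipped label is an individually plausible preference, so the poisoned dataset looks like ordinary noisy human feedback while the learned reward near the target shifts in the attacker's favor.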
Submission Number: 3