Robust Abstractions of Human Preferences

Abstracting human preferences into computational objectives is essential for aligning AI systems, yet it is fundamentally challenging because human values are complex and context-dependent. This talk examines how preferences are captured through human annotation and translated into reward models for reinforcement learning from human feedback (RLHF). While reward models have enabled state-of-the-art chatbots, I'll present evidence that they exhibit novel systematic biases and discuss approaches to mitigating them. Finally, I'll explore alternative methods for learning from preferences and outline key directions for future research.
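To make the reward-modeling step concrete, below is a minimal sketch of the Bradley-Terry pairwise objective commonly used to fit reward models to preference annotations; the function name and the example reward values are illustrative assumptions, not details from the talk.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # -log sigmoid(m) = log(1 + exp(-m)), split by sign for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical example: the annotator preferred response A (reward 1.2) over B (reward 0.3)
print(pairwise_reward_loss(1.2, 0.3))  # small loss: model agrees with the label
print(pairwise_reward_loss(0.3, 1.2))  # larger loss: model disagrees with the label
```

Minimizing this loss over many annotated pairs pushes the reward model to score preferred responses above rejected ones, which is the sense in which human preferences are "translated" into a reward signal for RLHF.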
Speaker: Max Lamparth, Hoover Institution
Monday, February 9, 2026
Cost: Free
