
Robust Abstractions of Human Preferences

Max Lamparth

Abstracting human preferences into computational objectives is essential for aligning AI systems, yet fundamentally challenging due to the complexity and context-dependence of human values. This talk examines how preferences are captured through human annotation and translated into reward models for reinforcement learning from human feedback (RLHF). Although these reward models enable state-of-the-art chatbots, I'll present evidence that they exhibit novel systematic biases and discuss mitigation approaches. Finally, I'll explore alternative methods for learning from preferences and outline key directions for future research.
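As background for the abstract's pipeline (preference annotations → reward model → RLHF), the sketch below illustrates the standard Bradley-Terry preference loss commonly used to train reward models. Everything here is illustrative and not from the talk: a toy linear reward model r(x) = w·x is fit on synthetic (chosen, rejected) feature pairs so that preferred responses score higher.

```python
import numpy as np

# Illustrative sketch (not the speaker's method): fitting a linear
# reward model with the Bradley-Terry preference loss
#   L = -log sigmoid(r(chosen) - r(rejected))
# on synthetic data, via plain gradient descent.

rng = np.random.default_rng(0)

# Synthetic features for (chosen, rejected) response pairs; chosen
# examples are shifted along a hypothetical "true preference" direction.
true_w = np.array([2.0, -1.0, 0.5])
chosen = rng.normal(size=(256, 3)) + 0.5 * true_w
rejected = rng.normal(size=(256, 3)) - 0.5 * true_w

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)          # reward model parameters
lr = 0.1
for _ in range(200):
    margin = (chosen - rejected) @ w   # r(chosen) - r(rejected)
    p = sigmoid(margin)                # model's P(chosen preferred)
    # Gradient of the mean negative log-likelihood -log sigmoid(margin)
    grad = -((1.0 - p)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# Fraction of pairs where the learned reward ranks chosen above rejected.
accuracy = float((((chosen - rejected) @ w) > 0).mean())
print(f"preference accuracy: {accuracy:.2f}")
```

The same loss underlies real RLHF reward-model training, where r is a large neural network scoring full prompt-response pairs rather than a linear map over toy features.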

Speaker: Max Lamparth, Hoover Institution

Monday, February 9, 2026


Cost: Free


Computing and Data Science Building (CoDA)

Room E160
Stanford, CA 94305