Samuel Marks
Member of Technical Staff, Anthropic
-
Samuel leads the cognitive oversight subteam of Anthropic's alignment science team. The subteam's goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there is anything suspicious about the cognitive processes underlying that behavior. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's impossible to tell based solely on input/output" (such as when a model knows a piece of private information and is lying about it). The team is interested in both white-box techniques (e.g. interpretability-based methods) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations).
-
I will mentor an empirical alignment research project. My main research interests are (1) overseeing models on tasks where we don't have access to a reliable ground-truth supervision signal and (2) downstream applications of interpretability; this blog post gives a sense of the flavor of my research.
-
I'm looking for candidates who:
Are comfortable driving a research project with only ~30 mins/week of direct supervision
Have strong coding skills
Ideally, have experience with empirical ML research