Adria Garriga-Alonso
Research Scientist, FAR AI
-
My goal is to ensure AI is beneficial to society. To this end, I am researching how neural networks work internally, including:
How can we evaluate the accuracy of an interpretability explanation?
How can we find explanations of the algorithms NNs implement at lower labor and compute cost?
What explains the behavior of agent-like AIs? What do they want?
Previously, I worked at Redwood Research on interpretability and software development.
I hold a PhD in machine learning from the University of Cambridge, where I was advised by Prof. Carl Rasmussen. My research focused on improving uncertainty quantification in neural networks (NNs) using Bayesian principles.
-
Measuring the limitations of post-training using robust lie-detection models
SAEs: fixing the absorption and hedging problems. Studying feature splitting: in which models does it occur? Do LLMs have a finite number of SAE latents?
LLM psychology: what can we measure about LLMs by talking to them, and how can we make that legible?
Extending my work on planning in Sokoban models to LLMs.
-
ML knowledge: basic algorithms (SGD, logistic regression, L1 regularization, …) and their properties; the transformer architecture
Enthusiasm for the subject!
Ability to try things until they work
Decent writing skills
Decent software engineering skills
Ability to use AI to improve one's own writing and software