Joseph Bloom

  • Joseph Bloom is Head of White Box Evaluations at the UK AI Security Institute (AISI). His team has expertise in mechanistic interpretability and works on developing mitigations for emerging risks. They are currently focused on addressing evaluation sabotage, in particular sandbagging (deliberate underperformance).

    Joseph previously co-founded Decode Research / Neuronpedia and completed the MATS program under the mentorship of Neel Nanda. He wrote SAE Lens, the most popular open-source library for SAE training and analysis. Before moving into technical AI safety, Joseph worked at a computational biology start-up for several years. He holds a double degree in computational biology and statistics/stochastic processes.

  • I'm interested in mentoring projects at the intersection of interpretability and control. In particular, I'd like to develop blue-team strategies (probes, CoT monitors, and combinations of the two) as well as red-team strategies for training or prompting deceptive model organisms. I'm especially interested in building gears-level models of why blue-team strategies work or fail, and of why model organisms are representative (i.e., what gives us confidence our methods will work in different scenarios).

  • I'm looking for mentees who:

    Are very comfortable with standard experimental techniques in technical AI safety / interpretability (i.e. could complete a small MATS-style project or reproduce simple papers with little to no supervision).

    Are both agentic and happy to operate independently.

    Prefer being disciplined / structured in their communication and work organisation.

Head of White Box Evaluations, UK AI Security Institute