The UK AI Security Institute - Red Team

  • Interventions that secure a system from abuse by bad actors or misaligned AI systems will grow in importance as AI systems become more capable, autonomous, and integrated into society. The UK AI Security Institute’s Red Team researches these interventions across three sub-teams (misuse, alignment, and control): we evaluate the protections on current frontier AI systems and research what measures could better secure them in the future. We share our findings with frontier AI companies, key UK officials, and other governments in order to inform their respective deployment, research, and policy decision-making.

  • Xander Davies is a Member of the Technical Staff at the UK AI Security Institute, where he leads the Red Team. He is also a PhD student at the University of Oxford, supervised by Dr. Yarin Gal. He previously studied computer science at Harvard, where he founded and led the Harvard AI Safety Team.

    Robert Kirk is a research scientist and the acting lead of the Alignment Red Team at UK AISI. The team's focus is on stress-testing model alignment to detect and understand model propensities relevant to loss-of-control risks. Before that, he worked on misuse research at AISI, focusing on evaluations of safeguards against misuse and mitigations for misuse risk, particularly in open-weight systems. He completed his PhD at University College London in January 2025, on generalisation in LLM fine-tuning and RL agents.

    Alex Souly is a researcher on the Alignment Red Team at the UK AI Security Institute. She has contributed to pre-deployment evaluations and red-teaming of misuse safeguards and alignment (see the Anthropic and OpenAI blog posts), and has worked on open-source evals such as StrongReject and AgentHarm. Previously, she studied Maths at Cambridge and Machine Learning at UCL as part of the UCL Dark lab, interned at CHAI, and in another life worked as a software engineer at Microsoft.

  • For CBAI mentoring, we are focused on projects in alignment red teaming: stress-testing model alignment to detect and understand model propensities relevant to loss-of-control risks. Example projects could include:

    • Improving automated auditing tools such as Petri and Bloom to be more realistic and controllable.

    • Adding further affordances to Petri or Bloom that are useful for propensity evaluations.

    • Designing techniques for measuring and mitigating evaluation awareness.

    • Designing methods for automatically searching for misalignment in frontier models while maintaining environment realism.

  • You may be a good fit if you have:

    • Hands-on research experience with large language models (LLMs), such as training, fine-tuning, evaluation, or safety research.

    • Ability and experience writing clean, documented research code for machine learning experiments, including experience with ML frameworks like PyTorch or evaluation frameworks like Inspect.

    • A sense of mission, urgency, and responsibility for success.

    • An ability to bring your own research ideas and work in a self-directed way.

    Strong fellows would also have:

    • Experience working on adversarial robustness, other areas of AI security, or red teaming against any kind of system.

    • Experience working on AI alignment or AI control.

    • Extensive experience writing production-quality code.

    • Experience designing, shipping, and maintaining complex technical products at pace.

Xander Davies, Robert Kirk, and Alex Souly