The UK AI Security Institute - Red Team
-
Interventions that secure a system from abuse by bad actors or misaligned AI systems will grow in importance as AI systems become more capable, autonomous, and integrated into society. The AI Security Institute’s Red Team researches these interventions across three sub-teams (misuse, alignment, and control): we evaluate the protections on current frontier AI systems and research what measures could better secure them in the future. We share our findings with frontier AI companies, key UK officials, and other governments in order to inform their respective deployment, research, and policy decision-making.
-
Xander Davies is a Member of the Technical Staff at the UK AI Security Institute, where he leads the Red Teaming group, which uses adversarial ML techniques to understand, attack, and mitigate frontier AI safeguards. He is also a PhD student at the University of Oxford, supervised by Dr. Yarin Gal. He previously studied computer science at Harvard, where he founded and led the Harvard AI Safety Team.
Robert Kirk is a Research Scientist in the UK AI Security Institute’s Red Teaming group. Before that, he was a PhD student in the Foundational AI CDT at UCL, researching large language model safety and alignment. He is motivated by reducing the chance of catastrophic or existential risks from advanced AI systems.
Alex Souly is a researcher on the Red Team at the UK AI Security Institute, where she works on the safety and security of frontier LLMs. She has contributed to pre-deployment evaluations and red-teaming of misuse safeguards and alignment (see the Anthropic and OpenAI blogposts), and has worked on open-source evals such as StrongReject and AgentHarm. Previously, she studied Maths at Cambridge and Machine Learning at UCL as part of the UCL Dark lab, interned at CHAI, and in another life worked as a SWE at Microsoft.
-
Representative projects you might work on with us:
Designing, building, running, and evaluating methods for automatically attacking and evaluating safeguards, such as LLM-automated attacks and direct optimisation approaches.
Designing and running experiments that test measures to keep AI systems under human control even when they might be misaligned.
Building a benchmark for asynchronous monitoring for signs of misuse and jailbreak development across multiple model interactions.
Investigating novel attacks and defences for data poisoning of LLMs, whether to insert backdoors or to achieve other attacker goals.
Performing adversarial testing of frontier AI system safeguards and producing reports that are impactful and action-guiding for safeguard developers.
-
You may be a good fit if you have:
Hands-on research experience with large language models (LLMs), such as training, fine-tuning, evaluation, or safety research.
Ability and experience writing clean, documented research code for machine learning experiments, including experience with ML frameworks like PyTorch or evaluation frameworks like Inspect.
A sense of mission, urgency, and responsibility for success.
An ability to bring your own research ideas and work in a self-directed way, while also collaborating effectively and prioritizing team efforts over extensive solo work.
Strong candidates may also have:
Experience working on adversarial robustness, other areas of AI security, or red teaming against any kind of system.
Experience working on AI alignment or AI control.
Extensive experience writing production-quality code.
A desire to improve our team through mentoring and feedback, and experience doing so.
Experience designing, shipping, and maintaining complex technical products.
Xander Davies, Robert Kirk, and Alex Souly