Anthropic

Staff Software Engineer, AI Reliability Engineering

Posted on 10 Jun 2026

About the role

💼 What you will do

• Partner with teams across Anthropic to improve reliability across the most critical serving paths: every hop from the SDK through the network, API layers, serving infrastructure, and accelerators, and back.
• Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity.
• Design and implement monitoring and observability systems across the token path and lead incident response for critical AI services.
• Assist in the design of high-availability serving infrastructure across multiple regions and cloud providers.
• Support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

📋 Job Requirements

• Have a strong distributed systems, infrastructure, or reliability background; this role is for reliability-minded software engineers and SREs.
• Be curious and brave, comfortable jumping into unfamiliar systems during an incident and helping drive resolution without deep prior expertise.
• Think holistically about how systems compose and where the seams are.
• Be able to build lasting relationships across teams; the engagement model depends on being welcomed as a teammate, not an outsider with opinions.
• Care about users and feel ownership over outcomes even for systems you don't own.
• Have excellent communication and collaboration skills for company-wide partnering.

🌟 Nice-to-have

• Have experience as an SRE, Production Engineer, or in similar reliability-focused roles on large-scale systems.
• Have experience operating large-scale model serving or training infrastructure at a scale of more than 1,000 GPUs.
• Have experience with ML hardware accelerators such as GPUs, TPUs, or Trainium.
• Understand ML-specific networking optimisations like RDMA and InfiniBand.
• Have expertise in AI-specific observability tools and frameworks.
• Have experience with chaos engineering and systematic resilience testing.
• Have contributed to open-source infrastructure or ML tooling.

🎯 Responsibilities

• Develop SLOs for large language model serving systems balancing availability, latency, and development velocity.
• Design and implement monitoring and observability systems across the token path.
• Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
• Assist in designing high-availability serving infrastructure across multiple regions and cloud providers.
• Support reliability of safeguard model serving for both site reliability and safety commitments.
• Zoom out across team boundaries to identify systemic reliability risks and improvements.

About Anthropic

😃 What Anthropic offers

• Earn £325,000–£390,000 per year.
• Receive visa sponsorship: Anthropic retains an immigration lawyer and makes every reasonable effort to support visa applications.
• Access optional equity donation matching, generous vacation and parental leave, and flexible working hours.
• Work across the full serving stack at one of the world's most important AI companies, with unique cross-cutting exposure to the systems that matter most.

💖 What makes Anthropic unique

Anthropic is a public benefit corporation headquartered in San Francisco, with a mission to create reliable, interpretable, and steerable AI systems. AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across the most critical serving paths. The team looks holistically at reliability as an emergent phenomenon across the entire system, so that Claude remains reliable for everyone who depends on it.
