Wayve

Staff Cloud Site Reliability Engineer

Posted on 20 April 2026

About the role

💼 What you will do

• Build and scale the reliability foundations of the AI cloud platform in a founding Cloud SRE role.
• Define the frameworks, automation, and operational standards that ensure model development infrastructure and large compute clusters operate predictably and at scale.
• Work at the intersection of AI research, large-scale cloud infrastructure, and production operations.

📋 Job Requirements

• Have proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
• Have strong Kubernetes experience, including operating production clusters.
• Have hands-on experience running production workloads in AWS, GCP, or Azure.
• Have experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
• Have experience working with large compute clusters; exposure to AI/ML training or inference workloads is strongly preferred.
• Have strong Linux fundamentals and proficiency in at least one scripting or systems language such as Python, Go, or C++, with a bias toward automation.
• Have deep troubleshooting skills across networking, storage, distributed systems, and performance at scale.
• Have experience designing and operating observability stacks such as Datadog, Prometheus, Grafana, or OpenTelemetry.
• Have clear communication skills, including leading incidents, writing postmortems, and influencing teams to prioritise reliability improvements.

🌟 Nice-to-have

• Have experience operating GPU-backed environments or large-scale ML infrastructure.
• Have experience running model training or inference pipelines in production.
• Have familiarity with infrastructure-as-code such as Terraform and secure cloud production environments.
• Have experience defining and running SLOs and SLIs and building reliability programmes across multiple teams.
• Have experience as an early or founding SRE hire establishing processes from scratch.
• Have interest in helping shape and grow a Cloud SRE function with potential to take on leadership responsibilities over time.

🎯 Responsibilities

• Own the reliability, availability, and performance of the Model Dev Platform and GPU Compute environments.
• Define and operationalise SLOs, SLIs, and error budgets across platform services.
• Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters.
• Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
• Lead incident triage, escalation, communications, and root cause analysis.
• Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
• Build automation for cluster operations, training workflows, remediation, and scaling tasks.
• Implement self-healing patterns and resilient recovery workflows.
• Harden CI/CD and release processes to improve deployment safety and velocity.

About Wayve

🚀 Why Join - Our Take

Wayve is one of the most exciting AI companies in the UK right now. They are tackling one of the hardest problems in technology, teaching machines to drive, and they are doing it with an approach that the rest of the industry is now converging towards. Backed by SoftBank, Microsoft, NVIDIA, Uber, Mercedes-Benz, Nissan, and Stellantis, Wayve has raised $2.8 billion in total funding and reached a valuation of $8.6 billion. With over 1,000 employees across London, Silicon Valley, Vancouver, Leonberg, Herzliya, and Tokyo, Wayve is scaling fast while keeping its London HQ at the centre.

What stands out on Glassdoor (4.4/5 from 112+ reviews) is how consistently employees praise the culture, the calibre of colleagues, and the quality of the technical work. People describe it as some of the most interesting work of their careers. Wayve also offers an on-site chef, private healthcare, competitive pay with equity, and a genuine learning environment where you work alongside world-class ML researchers and engineers. That said, some reviews flag that the pace can be intense and that working across global time zones can stretch working hours.

If you are an engineer, researcher, or operator who wants to work on genuinely frontier technology with real-world impact, and you thrive in fast-paced, mission-driven environments, Wayve is a rare opportunity.

😃 What Wayve offers

• Join the leading developer of Embodied AI technology founded in 2017.
• Work on cutting-edge AI and autonomous driving technology.
• Benefit from a hybrid working policy combining office and remote work.
• Work in a diverse and inclusive environment that values new perspectives.

💖 What makes Wayve unique

Founded in 2017, Wayve is the leading developer of Embodied AI technology. Their advanced AI software and foundation models enable vehicles to perceive, understand, and navigate any complex environment, enhancing the usability and safety of automated driving systems. Their vision is to create autonomy that propels the world forward with intelligent, mapless, and hardware-agnostic AI products designed for automakers.

💬 What employees say

"There’s a huge variety of people with different roles across different levels that I engage with almost daily at Wayve. Everyone is treated equally, and everyone’s opinion is valued."

Engineering Manager
Current Employee

