• Lead reliability engineering for critical user-facing systems at internet scale — APIs, content delivery, feed generation, search, messaging, and real-time experiences.
• Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load.
• Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure, and drive engineering improvements.
• Lead complex incident response efforts across engineering teams and drive blameless postmortems with sustainable long-term fixes.
• Define and champion best practices around SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.
• Mentor and provide technical leadership to engineers across SRE and software engineering teams.
📋 Job Requirements
• Have 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
• Have strong experience supporting high-traffic, user-facing production environments.
• Have a deep understanding of one or more of: distributed systems, networking, Linux systems, or cloud-native architectures.
• Have strong programming skills in Go, Python, or similar languages.
• Have a strong understanding of observability systems including metrics, logging, tracing, and alerting.
• Have experience improving reliability through SLOs, automation, incident management, and performance optimisation.
• Have strong collaboration and communication skills with the ability to influence technical direction across teams.
🌟 Nice-to-have
• Have experience running systems at internet-scale traffic volumes.
• Have experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms.
• Have familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, or Redis.
• Have experience with CDN optimisation, edge reliability, traffic engineering, or global infrastructure.
• Have contributed to open source software or participated in technical communities.
• Have experience leading large-scale incident response and operational transformation initiatives.
🎯 Responsibilities
• Drive reliability, scalability, and operational excellence for critical user-facing systems and services.
• Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
• Build proactive risk mitigation strategies and drive engineering improvements that reduce incidents.
• Eliminate repetitive operational work through automation and improved tooling.
• Lead incident response and ensure sustainable long-term fixes are implemented.
• Influence engineering standards and reliability culture across the organisation.
• Mentor engineers across SRE and software engineering teams.
About Reddit
😃 What Reddit offers
• Work fully remotely from the UK.
• Access global benefit programmes covering workspace, professional development, and caregiving support.
• Receive family planning support and gender-affirming care.
• Access mental health and coaching benefits.
• Receive a group personal pension scheme with employer match.
• Access private medical and dental schemes.
• Take flexible vacation and paid volunteer time off.
• Receive generous paid parental leave.
💖 What makes Reddit unique
Reddit is a community of communities built on shared interests, passion, and trust — home to the most open and authentic conversations on the internet. With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the internet's largest sources of information. The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience — ensuring every interaction across Reddit's web, mobile, APIs, feeds, and real-time systems is fast, reliable, and resilient.