SRE – Site Reliability Engineering
Introduce SRE principles and practices so teams can build and operate reliable, scalable systems through automation, monitoring, and incident response.
Audience: DevOps, admins, developers, managers
Duration: 3–4 days
Format: Lectures and hands-on labs (50% lecture, 50% lab)
Overview
Learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient. These lessons apply directly to your organisation.
What You Will Learn
- SRE introduction, tenets, and Google's approach
- Risk & error budgets, SLOs/SLIs
- Eliminating toil, monitoring, automation
- Release engineering, simplicity, alerting, on-call
- Incident response, postmortems, testing for reliability
- Capacity planning, load balancing, overload & cascading failure handling
- Consensus, cron, pipelines, data integrity, launches, interrupts, engagement models
Course Details
Prerequisites: IT background
Setup: Cloud-based lab • Laptop with Internet • Chrome
Detailed Outline
- Sysadmin vs SRE
- Risk, SLOs
- Toil, monitoring, automation, releases, simplicity
- On-call, troubleshooting, incidents, postmortems
- Testing, software engineering in SRE
- Load balancing, overload, cascading failures
- Consensus, cron, pipelines
- Data integrity
- Launch coordination, interrupts, engagement models
- Lessons from other industries
Ready to Get Started?
Contact us to learn more about this course and schedule your training.