SRE – Site Reliability Engineering

Introduce SRE principles and practices so teams can build and operate reliable, scalable systems through automation, monitoring, and incident response.

Audience: DevOps, admins, developers, managers

Duration: 3–4 days

Format: Lectures and hands-on labs (50% lecture, 50% lab)

Overview

Learn the principles and practices that Google engineers use to make systems more scalable, reliable, and efficient, and see how those lessons apply directly to your organisation.

What You Will Learn

  • SRE introduction, tenets, and Google's approach
  • Risk & error budgets, SLOs/SLIs
  • Eliminating toil, monitoring, automation
  • Release engineering, simplicity, alerting, on-call
  • Incident response, postmortems, testing for reliability
  • Capacity planning, load balancing, overload & cascading failure handling
  • Consensus, cron, pipelines, data integrity, launches, interrupts, engagement models

Course Details

Prerequisites: IT background

Setup: Cloud-based lab • Laptop with Internet • Chrome

Detailed Outline

  • Sysadmin vs SRE
  • Risk, SLOs
  • Toil, monitoring, automation, releases, simplicity
  • On-call, troubleshooting, incidents, postmortems
  • Testing, software engineering in SRE
  • Load balancing, overload, cascading failures
  • Consensus, cron, pipelines
  • Data integrity
  • Launch coordination, interrupts, engagement models
  • Lessons from other industries

Ready to Get Started?

Contact us to learn more about this course and schedule your training.