SRE Team Management: Managing Operational Loads

SRE | Intermediate
  • 17 videos | 54m 39s
  • Includes Assessment
  • Earns a Badge
Rating 4.6 of 32 users (32)
To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational load at the team level, including the use of support ticketing systems and service level objectives (SLOs). Next, you'll investigate "toil," a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between the crucial metrics used to manage them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.

WHAT YOU WILL LEARN

  • Discover the key concepts covered in this course
  • Describe what is meant by operational load and outline the three general categories of operational load
  • Outline how on-call engineers depend on pages to respond to incidents and outages
  • Outline the steps involved in responding to emergency incidents
  • Outline the purpose of customer request support tickets and provide examples of simple and complex tickets
  • Describe the essential components of a typical ticketing system
  • Recognize how to use service level objectives (SLOs) to ensure timely responses and resolutions
  • Describe what is meant by toil and provide examples of toil, such as applying schema changes to a database
  • Differentiate between types of toil, including automated, manual, repetitive, and tactical
  • Outline steps to track and identify toil and describe why less toil is better
  • Describe how to measure and calculate toil
  • Outline steps to minimize or eliminate toil completely
  • Differentiate between toil and complexity and describe approaches to address complexity
  • Describe how toil can negatively affect staff, including through low morale and confusion among SREs
  • List key metrics used for managing interrupts, such as the severity of the interrupt
  • Outline human element factors to consider when dealing with interrupts, such as distractibility
  • Summarize the key concepts covered in this course

IN THIS COURSE

  • 1m 44s
  • 3m 35s
    After completing this video, you will be able to describe what is meant by operational load and outline the three general categories of operational load.
  • 3.  Incidents and Outages
    2m 53s
    In this video, you will outline how on-call engineers depend on being paged to respond to incidents and outages.
  • 4.  Responding to Incidents
    3m 29s
    In this video, learn how to outline the steps involved in responding to emergency incidents.
  • 5.  Support Tickets
    3m 29s
    Learn how to outline the purpose of customer request support tickets and provide examples of simple and complex tickets.
  • 6.  Ticketing Systems
    4m 36s
    After completing this video, you will be able to describe the essential components of a typical ticketing system.
  • 7.  Response and Resolution Timeframes
    3m 25s
    After completing this video, you will be able to recognize how to use service level objectives (SLOs) to ensure timely responses and resolutions.
  • 8.  Toil in SRE
    3m 8s
    Upon completion of this video, you will be able to describe what is meant by toil and provide examples of toil.
  • 9.  Types of Toil
    3m 41s
    In this video, you will differentiate between types of toil, including automated, manual, repetitive, and tactical.
  • 10.  Identifying Toil
    3m 7s
    Find out how to outline steps to track and identify toil and describe why having less toil is better.
  • 11.  Calculating Toil
    3m 15s
    After completing this video, you will be able to describe how to measure and calculate toil.
  • 12.  Eliminating Toil
    3m 21s
    Find out how to outline steps to minimize or eliminate toil.
  • 13.  Addressing Complexity
    2m 52s
    In this video, learn how to differentiate between toil and complexity and describe approaches to address complexity.
  • 14.  Negative Effects of Toil
    3m 18s
    Upon completion of this video, you will be able to describe how toil can negatively affect staff, including through low morale and confusion among SREs.
  • 15.  Working with Interrupts
    3m 29s
    Upon completion of this video, you will be able to list key metrics used for managing interrupts, such as the severity of the interrupt.
  • 16.  Human Element Factors with Interrupts
    4m 4s
    In this video, learn how to outline human element factors to consider when dealing with interrupts, such as distractibility.
  • 17.  Course Summary
    1m 13s

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft gives you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.
