Site Reliability Engineer: Managing Overloads

SRE | Intermediate

20 videos | 1h 10m 51s
Includes Assessment
Earns a Badge


Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also include forms of occupational stress on the people who run those systems, which in turn harms the organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate overloads and prioritize work before recognizing the specific consequences of overloads. Next, you'll describe how to manage client-side traffic using per-customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.

WHAT YOU WILL LEARN

Discover the key concepts covered in this course

Define what is meant by operational loads, list their types, and describe how they relate to optimal performance

Outline the purpose of pages and how to manage them

Recognize the benefits of using tickets

Outline the activities involved in ongoing operational responsibilities

Identify how operational overload occurs and name considerations related to operational threshold

Outline steps to mitigate overloads

List the potential consequences of overloads, including serious illness to staff

Recognize the importance of prioritizing work and tasks

Recognize the pitfalls of the queries per second metric

Name capacity options, such as per-customer limitations

Recognize the benefits of client-side throttling

Define the concept of criticality, name four criticality values, and identify the purpose of criticality and each value

Describe the purpose and characteristics of utilization signals

Outline processes for working with overload errors

Describe mechanisms available to avoid retrying requests, such as per-request retry budget and per-client retry budget

Outline how counters can help prevent overloads

Describe how loads from connections can help recognize and prevent overloads

Identify potential problems caused by new connection bursts

Summarize the key concepts covered in this course

IN THIS COURSE

1m 23s

7m 46s

3. Operational Load Types: Pages

5m 55s

4. Operational Load Types: Tickets

2m 55s

5. Ongoing Operational Responsibilities

4m 26s

In this video, you'll learn more about SREs' ongoing operational responsibilities, sometimes described as toil or as kicking the can down the road. Ongoing operations are the work an SRE must do to keep a system running day to day. No matter how much of it you do, there's always more, so it takes continuous effort to avoid being overloaded with unplanned items.
6. Operational Overload

5m 41s

In this video, you'll learn how to identify operational overload and name considerations related to the operational threshold. For an SRE team to run smoothly, it needs a predictable workload. However, work items are consistently inconsistent. For example, a single ticket might look very simple, and maybe it is, but it can also turn out to be extremely complex and require a massive investigation.
7. Mitigating Overload

5m 59s

In this video, you'll learn more about the symptoms of an operational overload. The good news is that the symptoms are easy to identify. One symptom is demoralization: people start to complain or rant about their work and about the business as a whole, and the environment turns toxic. Another is an unhealthy task queue, which can be identified by its size, missed deadlines, and old items.
8. Consequences of Overloads

3m 22s

In this video, you'll learn more about the psychosocial risks associated with an operational overload. The severity of the impact varies from person to person. This video outlines some of the ways that an operational overload can cause people to break down.
9. Prioritizing Work

3m 43s

In this video, you'll learn more about what happens when a team falls into an operational overload, meaning it has more work than it can handle. When that happens, the team can't make progress. It is battling a tidal wave of work that prevents it from managing its workload effectively, so it can't make any headway on priority items.
10. Queries Per Second

3m 54s

Queries per second (QPS) is a metric used to measure the rate of traffic going through a system. It can be an end-to-end metric that takes into account all the network hops in between. It can also be used alongside information about how your teams are functioning to assess their ability to handle scale. A good alternative to a single QPS figure is measuring per individual service, which can help you understand system health at a finer granularity.
11. Per Customer Limitations

3m 33s

In this video, you'll learn more about overloads. You'll learn that overloads aren't limited to site reliability engineers; they can hit many teams across the organization with a high volume of work. A global overload results in an all-hands-on-deck situation where everyone has to get involved to handle a massive influx of tickets, pages, and even ongoing operations. You'll learn how to avoid normal overload situations and how global overloads happen.
12. Client-side Throttling

2m 53s

Client-side throttling is a technique used by hosting providers to protect their systems from excessive load. When customers continuously make requests for large amounts of data, the system can become bogged down. By setting up quotas on the customer's account, the provider can limit the amount of data that the customer can download in a day. If the customer exceeds the quota, the provider can then throttle the customer's requests.
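The quota check described above might be sketched as follows. This is a minimal illustration, not the course's implementation: the `DailyQuota` class, its method names, and the byte limit are all hypothetical, and a real system would also need persistence and a scheduled daily reset.

```python
from dataclasses import dataclass, field


@dataclass
class DailyQuota:
    """Tracks bytes served per customer against a fixed daily limit (hypothetical)."""

    limit_bytes: int
    used: dict = field(default_factory=dict)  # customer id -> bytes served today

    def allow(self, customer: str, size_bytes: int) -> bool:
        """Return True and record the usage if the request fits in today's quota."""
        spent = self.used.get(customer, 0)
        if spent + size_bytes > self.limit_bytes:
            return False  # over quota: throttle this request
        self.used[customer] = spent + size_bytes
        return True

    def reset(self) -> None:
        """Call once per day to start a fresh accounting period."""
        self.used.clear()


quota = DailyQuota(limit_bytes=1000)
first = quota.allow("acme", 600)    # fits within the limit
second = quota.allow("acme", 600)   # would exceed the limit, so it is throttled
other = quota.allow("other", 600)   # a different customer has its own allowance
```

Keeping the accounting per customer means one heavy user running out of quota does not affect anyone else's requests.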
13. Criticality and Criticality Values

2m 57s

In this video, you will learn how to define the concept of criticality and name four criticality values. You will learn that forcing a client to back off once it reaches its quota helps ensure that other clients can still use the system without being affected. But what happens if the client needs to run a truly critical query? Should all requests be treated the same? The answer depends on your specific situation.
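One way to avoid treating all requests the same is to tag each request with a criticality and shed the least important traffic first. The description above does not list the four values, so this sketch assumes the names used in Google's Site Reliability Engineering book (CRITICAL_PLUS, CRITICAL, SHEDDABLE_PLUS, SHEDDABLE). The utilization thresholds and function names are purely illustrative.

```python
from enum import IntEnum


class Criticality(IntEnum):
    # Higher value = more important. Names assumed from Google's SRE book.
    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3


def minimum_accepted(utilization: float) -> Criticality:
    """Map utilization (0.0 to 1.0) to the lowest criticality still served.

    The thresholds are illustrative assumptions, not values from the course.
    """
    if utilization < 0.70:
        return Criticality.SHEDDABLE
    if utilization < 0.80:
        return Criticality.SHEDDABLE_PLUS
    if utilization < 0.90:
        return Criticality.CRITICAL
    return Criticality.CRITICAL_PLUS


def should_serve(request_criticality: Criticality, utilization: float) -> bool:
    """Serve the request only if it is at least as important as the current floor."""
    return request_criticality >= minimum_accepted(utilization)
```

With these thresholds, a SHEDDABLE request is rejected once utilization reaches 70%, while CRITICAL_PLUS requests keep being served even when the system is nearly saturated.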
14. Utilization Signals

2m 57s

Utilization signals are metrics that can be used to prevent system overload. These signals can be based on local task states or on the load of the task process itself. By monitoring these signals over time, we can determine the health of the task and the system it is running on.
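As a rough sketch of monitoring such a signal over time, the class below smooths raw utilization samples with an exponential moving average and flags overload when the smoothed value crosses a threshold. The class name, the choice of smoothing, and the parameter values are assumptions made for illustration; the course does not prescribe a particular formula.

```python
class UtilizationSignal:
    """Smoothed utilization of a single task (hypothetical helper)."""

    def __init__(self, threshold: float, alpha: float = 0.5):
        self.threshold = threshold  # smoothed value above which the task counts as overloaded
        self.alpha = alpha          # weight given to the newest sample
        self.value = 0.0            # current smoothed utilization

    def observe(self, sample: float) -> float:
        """Fold a new raw sample (for example, CPU use from 0.0 to 1.0) into the signal."""
        self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

    def overloaded(self) -> bool:
        return self.value > self.threshold


signal = UtilizationSignal(threshold=0.8)
for sample in (1.0, 1.0):
    signal.observe(sample)
overloaded_early = signal.overloaded()  # smoothing has not yet crossed the threshold
signal.observe(1.0)
overloaded_late = signal.overloaded()   # sustained load now pushes the signal over
```

Smoothing keeps a single short spike from triggering rejection, at the cost of reacting a little later to sustained load.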
15. Working with Overload Errors

3m 19s

In this session, Sven Batalla will discuss how to handle overload errors in a data center. First, he will discuss the two types of overload errors that can occur: large and small. Second, he will go over the steps that should be taken in each situation. Finally, he will discuss load balancing strategies that can be employed to prevent overload.
16. Retrying Requests

2m 39s

Retrying failed requests is a common strategy, but uncontrolled retries can add to a system overload, so retries need limits. A per-request retry budget allows the caller to make two more attempts before getting an error message, while a per-client retry budget limits how many requests the client can send in a given time frame.
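The two budgets can be combined in a small client-side helper. This is a sketch under stated assumptions: the per-request budget is modeled as at most three attempts in total (the original try plus two more), and the per-client budget is modeled as a cap on the ratio of retries to requests, which is one common formulation rather than the only one. The 10% ratio and all names are hypothetical.

```python
class RetryBudget:
    """Per-client budget: retries may not exceed `ratio` of all requests sent."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0  # original requests issued by this client
        self.retries = 0   # retries issued by this client

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        return self.retries < self.ratio * self.requests

    def record_retry(self) -> None:
        self.retries += 1


def call_with_retries(send, budget: RetryBudget, max_attempts: int = 3) -> bool:
    """Call `send` (a callable returning True on success) within both budgets."""
    budget.record_request()
    for attempt in range(1, max_attempts + 1):
        if send():
            return True
        # Stop when the per-request budget or the per-client budget is spent.
        if attempt == max_attempts or not budget.can_retry():
            break
        budget.record_retry()
    return False
```

A client that has sent many successful requests can afford the full three attempts, while a client whose traffic is already mostly retries gives up after the first failure instead of amplifying the overload.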
17. Overloads and Counters

1m 58s

Counting requests to the back end can help prevent system overloads. Retries can make an overload worse, so implementing a counter can help. The counter tells the back end how many times a specific request has already been tried, and the back end can use this information to decide whether or not to allow the request. This strategy relies on the client being honest, but if you control both the client and the back end, that is a reasonable assumption.
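A minimal sketch of such a counter follows. The `Request` type, the field name `attempts`, and the policy of rejecting requests that have already been tried twice while the back end is overloaded are all assumptions for illustration; the description above does not specify the exact decision rule.

```python
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    attempts: int = 0  # times the client reports having already tried this request


def backend_accepts(request: Request, overloaded: bool, max_prior_attempts: int = 2) -> bool:
    """Decide whether the back end should process the request.

    Assumed policy: when healthy, accept everything; when overloaded, turn away
    requests that have already been retried too often, since further retries
    only add load.
    """
    if not overloaded:
        return True
    return request.attempts < max_prior_attempts


def next_attempt(request: Request) -> Request:
    """Client side: build the retry, honestly incrementing the counter."""
    return Request(payload=request.payload, attempts=request.attempts + 1)
```

Because the counter travels with the request, the back end needs no per-client state to recognize traffic that is mostly retries.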
18. Connection Loads

2m 1s

Connection loads can be a factor in system overloads. Maintaining connections consumes CPU and memory and can slow down requests. Monitoring the health of your connections is recommended to avoid overloads.
19. New Connection Bursts

2m 24s

In this session, Sven Batalla will discuss how new connection bursts can occur and how to prevent them from overloading a system. He will also discuss two strategies for handling these bursts. The first is cross-datacenter load balancing, which can help distribute the load when a single data center becomes overloaded. The second is the use of proxy batch jobs, which can buffer requests and then forward them to the tasks that actually handle them.
20. Course Summary

1m 9s


