Site Reliability Engineer: Managing Overloads
SRE | Intermediate
- 20 videos | 1h 10m 51s
- Includes Assessment
- Earns a Badge
Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also include forms of occupational stress, which invariably harm an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate overloads and prioritize work before recognizing the specific consequences of overloads. Next, you'll describe how to manage client traffic using per-customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.
WHAT YOU WILL LEARN
- Discover the key concepts covered in this course
- Define what is meant by operational loads, list their types, and describe how they relate to optimal performance
- Outline the purpose of pages and how to manage them
- Recognize the benefits of using tickets
- Outline the activities involved in ongoing operational responsibilities
- Identify how operational overload occurs and name considerations related to operational thresholds
- Outline steps to mitigate overloads
- List the potential consequences of overloads, including serious illness among staff
- Recognize the importance of prioritizing work and tasks
- Recognize the pitfalls of the queries per second metric
- Name capacity options, such as per-customer limitations
- Recognize the benefits of client-side throttling
- Define the concept of criticality, name four criticality values, and identify the purpose of criticality and each value
- Describe the purpose and characteristics of utilization signals
- Outline processes for working with overload errors
- Describe mechanisms available to avoid retrying requests, such as per-request retry budget and per-client retry budget
- Outline how counters can help prevent overloads
- Describe how loads from connections can help recognize and prevent overloads
- Identify potential problems caused by new connection bursts
- Summarize the key concepts covered in this course
IN THIS COURSE
-
1m 23s
-
7m 46s
-
5m 55s
-
2m 55s
-
4m 26s
In this video, you'll learn more about SREs' ongoing operational responsibilities, also known as toil or, less formally, kicking the can down the road. Ongoing ops is the work an SRE needs to do to maintain the everyday operations of a system. No matter how much of it you do, there's always more. This continuous effort is needed to avoid being overloaded with unplanned items. FREE ACCESS
-
5m 41s
In this video, you'll learn more about how to identify operational overload and name considerations related to operational thresholds. You'll discover that for an SRE team to run smoothly, it needs a predictable workload. However, work items are consistently inconsistent. For example, suppose a single ticket comes in. It might look very simple, and maybe it is. But it can also turn out to be extremely complex and require a massive investigation. FREE ACCESS
-
5m 59s
In this video, you'll learn more about the symptoms of an operational overload. The good news is that the symptoms are easy to identify. One symptom is demoralization: people start to complain or rant about their work and about the business as a whole, and things start to get toxic. Another is an unhealthy task queue, which can be identified by its size, missed deadlines, and old items. FREE ACCESS
-
3m 22s
In this video, you'll learn more about the psychosocial risks associated with an operational overload. The severity of the impact varies from person to person. This video outlines some of the ways that an operational overload can cause people to break down. FREE ACCESS
-
3m 43s
In this video, you'll learn more about what happens when a team gets into an operational overload, meaning it has more work than it can handle. When that happens, the team can't make any progress. It is battling a tidal wave of work that prevents it from managing its workload effectively, so it can't make any headway on priority items. FREE ACCESS
-
3m 54s
Queries per second (QPS) is a metric used to measure the rate of traffic going through a system. This can be an end-to-end metric, taking into account all the network hops in between. It can also be used in conjunction with how your teams are functioning to assess their ability to handle scale. A good alternative to QPS is measuring per individual service, which can help you to understand system health more granularly. FREE ACCESS
-
3m 33s
In this video, you'll learn more about overloads. You'll learn that overloads aren't limited to site reliability engineers: many teams across the organization can be hit by a high volume of work. A global overload results in an all-hands-on-deck situation where everyone has to get involved to handle a massive influx of tickets, pages, and even ongoing operations. You'll learn how to avoid normal overload situations and how global overloads happen. FREE ACCESS
-
2m 53s
Client-side throttling is a technique used by hosting providers to protect their systems from excessive load. When customers continuously make requests for large amounts of data, the system can become bogged down. By setting up quotas on the customer's account, the provider can limit the amount of data that the customer can download in a day. If the customer exceeds the quota, the provider can then throttle the customer's requests. FREE ACCESS
-
2m 57s
In this video, you'll learn how to define the concept of criticality and name four criticality values. You'll learn that forcing a client to back off once it reaches its quota helps make sure other clients can still use the system without being impacted. But what happens if the client needs to run a really critical query? Should all requests be treated the same? The answer depends on your specific situation. FREE ACCESS
-
2m 57s
Utilization signals are metrics that can be used to prevent system overload. These signals can be based on local task states or on the load of the task process itself. By monitoring these signals over time, we can determine the health of the task and the system it is running on. FREE ACCESS
-
3m 19s
In this session, Sven Batalla will discuss how to handle overload errors in a data center. First, he will discuss the two types of overload errors that can occur: large and small. Second, he will go over the steps that should be taken in each situation. Finally, he will discuss load balancing strategies that can be employed to prevent overload. FREE ACCESS
-
2m 39s
Limiting retried requests is a common strategy for preventing system overload. A per-request retry budget allows the caller to make two more attempts before getting an error message, while a per-client retry budget limits how many retries the client can send in a given time frame. FREE ACCESS
-
1m 58s
Counting requests to the back end can help prevent system overloads. Retries can make an overload worse, so implementing a counter can help. The counter tells the back end how many times a specific request has already been tried, and the back end can use this information to decide whether to allow the request. This strategy relies on the client being honest, but if you control both the client and the back end, that is a reasonable assumption. FREE ACCESS
-
2m 1s
Connection loads can be a factor in system overloads. Loads from connections can slow down requests and take up more CPU and memory. Monitoring the health of your connections is recommended to avoid overloads. FREE ACCESS
-
2m 24s
In this session, Sven Batalla will discuss how new connection bursts can occur and how to prevent them from overloading the system. He will also discuss two strategies for handling these bursts. The first is cross-datacenter load balancing, which can help distribute the load when a single data center becomes overloaded. The second is the use of proxy batch jobs, which can buffer requests and then forward them to the actual request management tasks. FREE ACCESS
-
1m 9s
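The quota-based throttling described in the outline above (a daily download limit per customer account) can be sketched roughly as follows. This is a minimal illustration rather than any provider's actual implementation: the class `DailyQuota`, its `reset` method, and the byte limits are all hypothetical names and values.

```python
from dataclasses import dataclass, field


@dataclass
class DailyQuota:
    """Tracks bytes served per customer against a daily cap (hypothetical helper)."""

    limit_bytes: int
    used: dict[str, int] = field(default_factory=dict)

    def allow(self, customer: str, size_bytes: int) -> bool:
        """Return True and record usage if the request fits in today's quota."""
        spent = self.used.get(customer, 0)
        if spent + size_bytes > self.limit_bytes:
            return False  # over quota: the provider throttles this request
        self.used[customer] = spent + size_bytes
        return True

    def reset(self) -> None:
        """Clear all usage; assumed to be called once per day by a scheduler."""
        self.used.clear()


quota = DailyQuota(limit_bytes=1_000)
first = quota.allow("acme", 600)    # fits within the quota
second = quota.allow("acme", 600)   # would exceed 1,000 bytes, so throttled
other = quota.allow("globex", 600)  # other customers are unaffected
```

Tracking usage per customer is what keeps one heavy user from degrading service for everyone else.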
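Criticality values and utilization signals, both covered in the outline above, are often combined: as utilization rises, the least critical requests are shed first. The sketch below assumes the four criticality names used in Google's SRE book (the outline itself does not name them), and the utilization thresholds are invented purely for illustration.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Four criticality levels, lowest first (names assumed from the SRE book)."""

    SHEDDABLE = 0
    SHEDDABLE_PLUS = 1
    CRITICAL = 2
    CRITICAL_PLUS = 3


def lowest_admitted(utilization: float) -> Criticality:
    """Map a utilization signal (0.0 to 1.0) to the lowest level still served.

    The thresholds are hypothetical; real values depend on the service.
    """
    if utilization < 0.70:
        return Criticality.SHEDDABLE
    if utilization < 0.85:
        return Criticality.SHEDDABLE_PLUS
    if utilization < 0.95:
        return Criticality.CRITICAL
    return Criticality.CRITICAL_PLUS


def admit(level: Criticality, utilization: float) -> bool:
    """Serve a request only if its criticality meets the current floor."""
    return level >= lowest_admitted(utilization)
```

For example, at 90% utilization `admit(Criticality.CRITICAL, 0.90)` is accepted while `admit(Criticality.SHEDDABLE_PLUS, 0.90)` is rejected.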
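The two retry budgets mentioned in the outline above can be sketched together. The per-request budget here allows the initial try plus two more attempts, as the outline describes. The per-client budget is modeled as a fixed number of retries per time window, which is an assumption; `RetryPolicy`, `start_new_window`, and the default values are hypothetical.

```python
from typing import Callable


class RetryPolicy:
    """Combines a per-request attempt cap with a per-client retry allowance."""

    def __init__(self, max_attempts: int = 3, client_retries_per_window: int = 3):
        self.max_attempts = max_attempts  # initial try plus two retries
        self._window_size = client_retries_per_window
        self.client_retries_left = client_retries_per_window

    def start_new_window(self) -> None:
        """Refill the per-client budget; assumed to run on a timer."""
        self.client_retries_left = self._window_size

    def call(self, send: Callable[[], bool]) -> tuple[bool, int]:
        """Try send() until it succeeds or a budget is spent.

        Returns (succeeded, attempts_made).
        """
        attempts = 0
        while True:
            attempts += 1
            if send():
                return True, attempts
            if attempts >= self.max_attempts:  # per-request budget spent
                return False, attempts
            if self.client_retries_left == 0:  # per-client budget spent
                return False, attempts
            self.client_retries_left -= 1


policy = RetryPolicy()
# A back end that always fails: the client backs off sooner on each call.
results = [policy.call(lambda: False) for _ in range(3)]
```

Against a failing back end, the three calls make three, two, and one attempts respectively, so the client's retry traffic shrinks instead of piling onto an overloaded service.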
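The retry counter described in the outline above, where the client tells the back end how many times a request has already been tried, might look like the following sketch. The `Request` type, the `admit` function, and the policy of shedding retries first under load are illustrative assumptions, not a documented protocol.

```python
from dataclasses import dataclass


@dataclass
class Request:
    payload: str
    attempt: int = 0  # number of earlier tries, reported by the client


def admit(request: Request, overloaded: bool, max_prior_attempts: int = 2) -> bool:
    """Back-end admission check based on the client-supplied attempt counter."""
    if request.attempt > max_prior_attempts:
        return False  # already tried too often; signal the client to stop
    if overloaded and request.attempt > 0:
        return False  # under load, shed retries and keep first attempts
    return True
```

Because the counter is supplied by the client, this check only makes sense when the client can be trusted, for example when the same team controls both sides.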
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.
Digital badges are yours to keep, forever.