Site Reliability Engineer: Managing Cascading Failures
SRE | Intermediate
- 21 videos | 1h 11m 9s
- Includes Assessment
- Earns a Badge
Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.
WHAT YOU WILL LEARN
- Discover the key concepts covered in this course
- Define what is meant by cascading failures and identify situations in which this term is used
- Describe how server overloads can lead to cascading failures
- Define what is meant by resource exhaustion and describe its consequences
- List CPU considerations as they relate to failures and overutilization
- List factors that can contribute to memory exhaustion
- Recognize how file descriptors and threads can directly lead to failures
- Recognize how resource exhaustion can travel from one resource to another
- Recognize how resource exhaustion can lead to service unavailability
- Outline how to prevent server overloads
- Outline steps to ensure efficient queue management
- Differentiate between load shedding and graceful degradation
- Define what is meant by code retries and recognize why they are relevant to the topic of cascading failures
- Recognize the benefits of setting deadlines
- Recognize how propagating cancellations can reduce unneeded work
- Define what is meant by latency considerations, including bimodal latency, and describe how to address this class of problems
- Outline the steps involved in managing slow startups and working with cold caching
- Differentiate between the various cascading failure triggers
- Outline how to test cascading failures
- List steps to immediately address cascading failures
- Summarize the key concepts covered in this course
IN THIS COURSE
-
1m 30s
-
1m 55s
-
3m 34s
-
3m 4s
In this video, you'll learn that one of the potential causes for a cascading failure is resource exhaustion. You'll learn what this means and how it relates to system-level resources like memory, CPU, disk space, and so on. For example, imagine having 64 gigabytes of memory and running at or near full capacity. Or imagine running at 100% CPU, or even running completely out of disk space.
-
4m 52s
In this video, you'll learn more about resource exhaustion, or system overload, in the CPU of an individual server. You'll learn that if the CPU is starved, processes start to run much more slowly. The host will outline the symptoms of a starved CPU and how they cascade into other areas like memory, total number of active threads, and on-screen text.
-
2m 50s
In this video, you'll learn that system resource exhaustion can happen in many different areas. One common area is memory consumption. Memory exhaustion can come from processes taking lots of memory, but it can also happen as the CPU request queue starts to back up. A common symptom of memory exhaustion is dying tasks. When memory demand exceeds available memory, a task might be killed or evicted by the system.
-
2m 14s
-
1m 58s
-
4m 45s
-
4m 47s
-
2m 59s
In this video, you'll outline steps to ensure efficient queue management when systems are overloaded. You'll learn that one of the most commonly exhausted resources is the CPU. When the CPU is exhausted, it can no longer handle all the requests coming at it. This means the queue of requests grows larger and larger until the system eventually runs out of memory.
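The unbounded queue growth described here can be illustrated with a small sketch. This is not taken from the course; it is a minimal example, assuming a single-process Python service, of capping queue length so that excess requests are rejected early instead of accumulating in memory. The names `BoundedRequestQueue` and `try_enqueue` and the capacity of 3 are hypothetical.

```python
import queue
from typing import Tuple


class BoundedRequestQueue:
    """Hypothetical wrapper that rejects new work once the queue is full."""

    def __init__(self, capacity: int) -> None:
        # A fixed maxsize keeps memory use bounded even under overload.
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def try_enqueue(self, request: str) -> bool:
        """Enqueue without blocking; return False if the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            # Failing fast lets the caller see the overload immediately.
            self.rejected += 1
            return False

    def size(self) -> int:
        return self._queue.qsize()


def demo_bounded_queue() -> Tuple[int, int]:
    """Offer five requests to a queue of capacity three."""
    q = BoundedRequestQueue(capacity=3)
    for i in range(5):
        q.try_enqueue(f"request-{i}")
    return q.size(), q.rejected
```

In this sketch the queue holds at most three requests and the remaining two are rejected. The right capacity for a real service depends on its request rate and latency targets.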
-
6m 32s
In this video, you'll learn how to differentiate between load shedding and graceful degradation. The idea behind load shedding is that some less important requests will be dropped as the server approaches an overload condition. You'll learn that the goal of load shedding is to spare the server from executing less important requests, which helps keep the system from running out of CPU or memory and from failing health checks.
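One possible shape for the load shedding idea is a policy that drops lower-priority requests as utilization rises. The sketch below is illustrative only: the `Request` type, the priority scale, and the 0.80 and 0.95 thresholds are all assumptions chosen for the example, not values from the course.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int  # hypothetical scale: 0 = critical, larger = less important


def should_shed(request: Request, utilization: float) -> bool:
    """Return True if the request should be dropped.

    `utilization` is a fraction between 0.0 and 1.0. The thresholds are
    assumptions for illustration, not recommended values.
    """
    if utilization >= 0.95:
        # Near overload: keep only critical requests.
        return request.priority > 0
    if utilization >= 0.80:
        # Approaching overload: drop the least important requests.
        return request.priority > 1
    # Normal operation: serve everything.
    return False
```

With this policy, a priority-2 request is dropped at 85% utilization while priority-0 and priority-1 requests are still served. At 95% and above, only priority-0 requests get through.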
-
2m 34s
In this video, you'll learn more about what happens when systems are overloaded with requests. You'll discover that retries can make the problem worse. But what is a retry? A retry is a repeated attempt made after a previous failure. Whatever the reason for the failure, a retry only makes sense if the condition that caused it no longer applies.
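A common way to keep retries from amplifying an overload is to limit the number of attempts and to wait a randomized, exponentially growing delay between them. The following is a minimal sketch of that technique, not the approach shown in the course. The function name `call_with_retries` and its default values are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on failure with capped backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Attempts exhausted: surface the failure to the caller.
                raise
            # The delay ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter spreads out retries from many clients.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the delay can be replaced in tests. A production version would usually also retry only on errors known to be transient, which this sketch does not attempt to classify.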
-
3m 56s
-
1m 33s
-
2m 20s
-
3m 49s
-
4m 50s
-
4m 30s
-
5m 30s
-
1m 8s
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.
Digital badges are yours to keep, forever.