Site Reliability Engineer: Managing Cascading Failures

SRE | Intermediate

21 videos | 1h 11m 9s
Includes Assessment
Earns a Badge

(231)

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.

WHAT YOU WILL LEARN

Discover the key concepts covered in this course

Define what is meant by cascading failures and identify situations in which this term is used

Describe how server overloads can lead to cascading failures

Define what is meant by resource exhaustion and describe its consequences

List CPU considerations as they relate to failures and overutilization

List factors that can contribute to memory exhaustion

Recognize how file descriptors and threads can directly lead to failures

Recognize how resource exhaustion can travel from one resource to another

Recognize how resource exhaustion can lead to service unavailability

Outline how to prevent server overloads

Outline steps to ensure efficient queue management

Differentiate between load shedding and graceful degradation

Define what is meant by code retries and recognize why it is relevant to the topic of cascading failures

Recognize the benefits of setting deadlines

Recognize how propagating cancellations can reduce unneeded work

Define what is meant by latency considerations, including bimodal latency, and describe how to address this class of problems

Outline the steps involved in managing slow startups and working with cold caching

Differentiate between the various cascading failure triggers

Outline how to test cascading failures

List steps to immediately address cascading failures

Summarize the key concepts covered in this course

IN THIS COURSE

1m 30s

FREE ACCESS
1m 55s

FREE ACCESS
3. Server Overloads

3m 34s

FREE ACCESS
4. Resource Exhaustion

3m 4s

In this video, you'll learn that one potential cause of a cascading failure is resource exhaustion. You'll learn what this means and how it relates to system-level resources such as memory, CPU, and disk space. For example, imagine having 64 gigabytes of memory and running at or near full capacity, running at 100% CPU, or running completely out of disk space.

FREE ACCESS
5. CPU Resources

4m 52s

In this video, you'll learn more about resource exhaustion, or overload, of the CPU on an individual server. You'll learn that when the CPU is starved, processes start to run much more slowly. The host will outline the symptoms of a starved CPU and how they cascade into other areas, such as memory and the total number of active threads.

FREE ACCESS
6. Memory Resources

2m 50s

In this video, you'll learn that system resource exhaustion can happen in many different areas. One common area is memory consumption. Memory can be exhausted by processes that consume a lot of it, but also when the CPU request queue starts to back up. A common symptom of memory exhaustion is dying tasks: when memory demand exceeds the memory available, a task might be killed or evicted by the system.

FREE ACCESS
7. File Descriptors and Threads

2m 14s

FREE ACCESS
8. Resource Dependencies

1m 58s

FREE ACCESS
9. Unavailable Services

4m 45s

FREE ACCESS
10. Preventing Overloads

4m 47s

FREE ACCESS
11. Queueing Requests

2m 59s

In this video, you'll outline steps to ensure efficient queue management when systems are overloaded. You'll learn that one of the most commonly exhausted resources is the CPU. When the CPU is exhausted, it can no longer handle all the requests coming at it. This means the queue of requests grows larger and larger until the system eventually runs out of memory.

FREE ACCESS
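One common way to keep a request queue from growing until memory runs out is to give it a fixed upper bound and reject new work once it is full. The following is a minimal sketch of that idea, not the approach taught in the video; the names MAX_QUEUE_LENGTH and enqueue_or_reject are hypothetical, and the capacity of 3 is chosen only to keep the example small.

```python
import queue

# Hypothetical capacity; a real limit would be derived from measured service capacity.
MAX_QUEUE_LENGTH = 3


def enqueue_or_reject(request_queue, request):
    """Try to enqueue a request; return False instead of growing past the bound."""
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        # Rejecting early keeps memory use bounded and lets the caller fail fast.
        return False


request_queue = queue.Queue(maxsize=MAX_QUEUE_LENGTH)
results = [enqueue_or_reject(request_queue, f"req-{i}") for i in range(5)]
print(results)  # [True, True, True, False, False]
```

With the bound in place, the fourth and fifth requests are turned away immediately rather than waiting in an ever-longer queue.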
12. Load Shedding

6m 32s

In this video, you'll learn how to differentiate between load shedding and graceful degradation. The idea behind load shedding is that some less important requests are dropped as the server approaches an overload condition. You'll learn that the goal of load shedding is to spare the server from executing less important requests, which helps keep the system from running out of CPU or memory and from failing health checks.

FREE ACCESS
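The load shedding idea described above can be sketched as a simple admission check that drops less important requests as utilization rises. This is an illustrative sketch only: the Request class, the priority labels, and both thresholds are assumptions made for the example, and graceful degradation (returning a cheaper, lower-quality response instead of dropping the request) is not shown.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SHED_LOW_PRIORITY_AT = 0.80   # start dropping low-priority requests
SHED_NONCRITICAL_AT = 0.95    # drop everything except critical requests


@dataclass
class Request:
    name: str
    priority: str  # assumed labels: "critical", "normal", or "low"


def should_shed(request, utilization):
    """Return True if the request should be dropped at this utilization (0.0 to 1.0)."""
    if request.priority == "critical":
        return False
    if utilization >= SHED_NONCRITICAL_AT:
        return True
    if utilization >= SHED_LOW_PRIORITY_AT:
        return request.priority == "low"
    return False


report = Request("nightly-report", "low")
checkout = Request("checkout", "normal")
print(should_shed(report, 0.85), should_shed(checkout, 0.85))  # True False
```

Checking priority before doing any real work is the point of the design: a shed request should cost the server almost nothing.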
13. Code Retries

2m 34s

In this video, you'll learn more about what happens when systems are overloaded with requests, and you'll discover that retries can make the problem worse. But what is a retry? A retry is a repeated attempt made after a previous failure. Whatever the reason for that failure, a retry is performed in the expectation that the condition that caused it no longer applies.

FREE ACCESS
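Because retries add load to a server that may already be struggling, client code commonly limits the number of attempts and spaces them out with randomized exponential backoff. The sketch below shows one way to do that; it is not taken from the course. TransientError, call_with_retries, and all default values are hypothetical names and numbers introduced for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that may succeed on a later attempt."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up so retries cannot pile up indefinitely
            # The delay ceiling doubles each attempt, capped at max_delay; the
            # random jitter keeps many clients from retrying at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("server busy")
    return "ok"


result = call_with_retries(flaky, sleep=lambda delay: None)  # skip real sleeping here
print(result, len(calls))  # ok 3
```

Only errors marked as transient are retried; any other exception propagates on the first attempt, which avoids repeating requests that cannot succeed.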
14. Implementing Deadlines

3m 56s

FREE ACCESS
15. Propagating Cancellations

1m 33s

FREE ACCESS
16. Dealing with Latency

2m 20s

FREE ACCESS
17. Working with Slow Startups

3m 49s

FREE ACCESS
18. Cascading Failure Triggers

4m 50s

FREE ACCESS
19. Testing Cascading Failures

4m 30s

FREE ACCESS
20. Addressing Cascading Failures

5m 30s

FREE ACCESS
21. Course Summary

1m 8s

FREE ACCESS

EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft gives you the opportunity to earn a digital badge upon successful completion of some of our courses. Badges can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

YOU MIGHT ALSO LIKE

Course Technical Program Management: Solving Complex Problems in Technical Programs

(1)

Journey Performance Engineering Journey

(3)

Course Performance Engineering: Potential Performance Issues in Software Development

(25)

PEOPLE WHO VIEWED THIS ALSO VIEWED THESE

Course Build & Release Engineering Best Practices: Release Management

(249)

Course Best Practices for the SRE: Automation

(251)

Course Site Reliability Engineer: Managing Overloads

(247)
