SRE Troubleshooting Processes
SRE | Intermediate
- 18 videos | 1h 2m 34s
- Includes Assessment
- Earns a Badge
Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.
WHAT YOU WILL LEARN
- Discover the key concepts covered in this course
- Describe how engineers think differently to "novices" when it comes to troubleshooting
- Outline best practices and approaches to troubleshooting and how to keep those skills sharp
- Outline an idealized troubleshooting model (e.g., report, triage, examine, diagnose, test/treat, and cure)
- List potential pitfalls to avoid, such as looking for symptoms that are not relevant
- Outline how to manage operational loads
- Recognize the importance of an adequate initial problem report
- Recognize the importance of triaging problems from the onset
- Recognize the importance of examining each component of a system to understand whether it is functioning properly
- Identify the steps and approaches used to diagnose issues
- Describe methods for testing and treating possible causes to identify actual problems
- Recognize how to simplify and reduce troubleshooting using techniques such as dividing and conquering
- Describe the "what, why, where" technique and how it can be used to diagnose a malfunctioning system
- Interpret how determining who last touched a system can be helpful when identifying what is going on with a system
- Define what is meant by "negative results"
- Recognize that systems are complex and that often you can only identify probable cause factors to document what went wrong with a system
- Outline steps to make troubleshooting easier
- Summarize the key concepts covered in this course
IN THIS COURSE
-
1m 34s
-
1m 23s
-
2m 15s
-
3m 16s
-
4m 31s
-
5m 1s
-
4m 11s
-
2m 21s
-
5m 14s
In this video, you'll learn more about the importance of examining each component of a system to understand whether it is functioning properly. You'll learn that when you're troubleshooting a problem, there are some system metrics you might want to consult. These can be anything from memory and CPU consumption to system logs or even audit and change logs. Having system metrics in your tool belt helps you find correlations in the behavior you're seeing.
-
2m 15s
-
6m 5s
In this video, you'll learn how to troubleshoot problems. To do this, you'll need to come up with a list of possible causes. You'll learn that, through experimentation, you can rule out various causes. This means you'll run tests and treat the system, and then see whether that resolves the problem or at least affects it in some way.
-
2m 43s
In this video, you'll learn how to simplify and reduce troubleshooting using techniques such as dividing and conquering. You'll discover how to simplify the problem and identify the connections between components. This allows you to divide and conquer, which is a very useful general-purpose technique for troubleshooting and finding the solution to a problem.
-
3m 9s
In this video, you'll learn more about the six essential questions to ask when doing any investigation. These include who, what, when, where, why, and how. You'll discover that these questions can lead you to the solution as well as future prevention.
-
2m 9s
-
6m 26s
-
4m 21s
-
4m 27s
In this video, you'll learn more about how to make troubleshooting easier. You'll discover there are many ways to do this, such as building observability into the system. Logging, insights, status pages, and other outputs can help someone who's performing active troubleshooting gauge the health of any specific component. You'll also learn about consistency and information availability. All components should have well-designed interfaces that are observable.
-
1m 13s
In this video, you'll summarize what you've learned in the course. You've discovered that a site reliability engineer must be able to perform effective and efficient troubleshooting of malfunctioning systems.
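As an illustration of the divide-and-conquer technique mentioned in the video descriptions above, here is a minimal Python sketch. It is not taken from the course: the function name `first_failing_stage`, the stage names, and the `healthy_through` probe are all hypothetical. It assumes a linear chain of components in which output is good up to some stage and bad from that stage onward, so each probe can halve the search space.

```python
from typing import Callable, Optional, Sequence


def first_failing_stage(
    stages: Sequence[str],
    healthy_through: Callable[[int], bool],
) -> Optional[int]:
    """Return the index of the first stage whose output looks wrong.

    Assumption (hypothetical setup): healthy_through(i) is True when the
    output observed after stage i is still correct, and once a stage
    breaks the output, every later stage also looks broken.
    Returns None if every stage looks healthy.
    """
    lo, hi = 0, len(stages)
    while lo < hi:
        mid = (lo + hi) // 2
        if healthy_through(mid):
            lo = mid + 1  # fault, if any, lies after the midpoint
        else:
            hi = mid      # fault is at the midpoint or before it
    return lo if lo < len(stages) else None


# Hypothetical request path; the probe pretends the cache corrupts data.
stages = ["load balancer", "web server", "cache", "database"]
faulty = first_failing_stage(stages, lambda i: i < 2)
print(stages[faulty] if faulty is not None else "no fault found")
```

With four stages this needs at most two probes instead of four; the saving grows with the length of the chain, which is why bisecting is often preferable to checking components one by one.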
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.
Digital badges are yours to keep, forever.