SKILL BENCHMARK
SRE Competency (Intermediate Level)
- 25m
- 42 questions
The SRE Competency benchmark measures whether a learner has project-level exposure to SRE technologies, practices, and principles across multiple platforms. A learner who scores high on this benchmark demonstrates professional competency in all of the major areas of SRE operations, across a variety of platforms and deployments.
Topics covered
- define what is meant by a process-induced emergency, describe the effects of such emergencies, and outline how to respond to them
- describe common tools used for packaging and releasing services
- describe how automation processes can vary
- describe the characteristics and purpose of blackbox monitoring
- describe the characteristics and purpose of whitebox monitoring
- describe the path that the evolution of automation follows
- describe the value of automation, including consistency, platform, repairs, and time savings
- describe what is meant by each of the 'three Cs' of incident management (coordinate, communicate, and control)
- describe why it is vital to keep a history of outages and mistakes and outline best practices when doing so
- describe why SREs might carry out reliability testing
- determine which factors are the root cause of a problem
- differentiate between tools used to automate functions
- differentiate between SRE and DevOps
- differentiate between tools used for creation such as GitHub and Subversion
- list common Google SRE use cases for automation
- list standard factors that can influence software reliability
- list the core tenets of SRE
- name and describe some common SRE metrics
- name the causes and outcomes of change-induced emergencies and outline how to respond to these emergencies
- outline the fundamental emergency response principles SREs need to be familiar with and recognize the critical steps to take when a system breaks
- outline the process and purpose of logging and name the benefits of text logs
- outline what constitutes a private cloud, recognize which cloud service models can be delivered in one, describe ways to use it, and distinguish the advantages and disadvantages of its use
- outline what's involved in reliability testing and describe testing techniques, such as unit, integration, system, production, stress, and rollouts entangle tests
- provide an overview of automation classes and describe the path the evolution of automation follows
- provide an overview of common pitfalls associated with troubleshooting systems
- provide an overview of planning tools such as JIRA and Pivotal Tracker
- provide an overview of Service Level Agreements
- provide an overview of Service Level Objectives
- provide an overview of Site Reliability Engineering
- provide an overview of the primary goals of a post-mortem philosophy
- provide an overview of tools used to monitor applications and infrastructure
- provide an overview of use cases for automation
- provide an overview of Service Level Indicators
- recognize how to embrace and manage risk in an environment
- recognize how to measure service risk using metrics such as time-based availability and aggregate availability
- recognize how to use PowerShell for automation tasks in Windows
- recognize the advantages and considerations when automating all the things
- recognize the benefits of performing test-induced emergencies and outline what this involves
- recognize the importance of incident response planning and the characteristics of incident response plans
- recognize the nine principles of Site Reliability Engineering
- restate the duties of the prominent job roles involved in incident response (Incident Commander, Communications Lead, and Operations Lead) as well as those of other, supporting roles
- summarize the requirements, goals, best practices, job roles, and tools involved in managing and responding to incidents