Site Reliability Engineering: SRE intermediate

https://www.skillsoft.com/channel/site-reliability-engineering-d7da949f-03da-41f4-8238-1dbb3e24896c?technologyandversion=3338066&expertiselevel=3338067 https://www.skillsoft.com/channel/site-reliability-engineering-d7da949f-03da-41f4-8238-1dbb3e24896c?technologyandversion=3338068&expertiselevel=3338067

2 Courses | 2h 12m

51 Courses | 56h 6m 36s
3 Books | 14h 9m

Explore Site Reliability Engineering (SRE), where software engineering aspects are applied to infrastructure and operation tasks.

COURSES INCLUDED

Build & Release Engineering Best Practices: Release Engineering

It's important to know why the roles, philosophy, and principles behind release engineering - a relatively new discipline of software engineering - are used for building and delivering software. In this course, you'll learn about the automated release system called Rapid, and how it can be used to provide a framework for delivering reliable software builds and releases. You'll also learn about configuration management and the importance of collaboration between release engineers and site reliability engineers.

15 videos | 59m Assessment Badge

Build & Release Engineering Best Practices: Release Management

Release management can guide your software development efforts from planning to deployment, resulting in better customer satisfaction with the end product. In this course, you'll learn about the benefits of using a release management process to manage and improve the development of a software build. You'll then move on to explore key concepts and principles that apply to release management, as well as common considerations and potential challenges to be aware of. Lastly, you'll learn about common toolsets used by release engineers and best practices related to continuous integration and release deployment.

15 videos | 1h 12m Assessment Badge

COURSES INCLUDED

Site Reliability: Engineering

Site Reliability Engineers are often considered the link between software development and operations. In this course, you'll explore the principles of site reliability engineering as well as common concerns such as measuring and managing risk, and risk tolerance. You'll also learn how to ensure a satisfactory level of service by implementing Service Level Objectives, Service Level Agreements, and Service Level Indicators.

13 videos | 1h 5m Assessment Badge
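As a rough illustration of how the Service Level Objectives and Service Level Indicators named in the course above relate, here is a minimal sketch of an error-budget calculation for a request-based SLO. The function name, the field names, and the choice of "successful requests divided by total requests" as the indicator are assumptions made for this example and are not taken from the course.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based SLO (hypothetical helper, for illustration only)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the measured fraction of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # A 99.9% target over one million requests tolerates about 1,000 failures.
    print(error_budget(0.999, 1_000_000, 400))
```

A negative "budget_remaining" value would indicate that the window has used more failures than the objective allows.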

Site Reliability: Tools & Automation

There are numerous tools available to Site Reliability Engineers to help with planning, managing, deploying, automating, and monitoring services and infrastructure. In this course, you'll explore these tools as well as some of the benefits of automation and the automation process. You'll also discover common pitfalls and failures, as well as how to manage post-mortems for incidents.

14 videos | 52m Assessment Badge

Best Practices for the SRE: Automation

It has been proven that the automation of processes and systems commonly results in higher production rates and increased productivity. In this course, you'll learn the basics of automation, including benefits such as consistency, efficiency, problem-solving, and cost-savings. You'll examine the potential challenges of automation, including integration, complexity, and security. Lastly, you'll learn the value of automation for a Site Reliability Engineer and how SREs are using automation to improve daily operations and overcome obstacles.

15 videos | 1h Assessment Badge

Best Practices for the SRE: Use Cases for Automation

Site Reliability Engineers often use automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient. In this course, you'll learn about common use cases for automating systems and processes. You'll examine PowerShell capabilities that can be used to automate a variety of Windows administrative tasks including user creation, patching and updating, bulk enrollment, and software installations. Lastly, you'll learn about cluster turnup automation, reliability, and enabling failure at scale.

16 videos | 1h 9m Assessment Badge

Backup & Recovery: Business Continuity & Disaster Recovery

Disasters can occur at any time and to any sized organization, so administrators should invest the time and resources to properly plan for business continuity and disaster recovery. In this course, you'll learn how to plan for business continuity, assess risk, and perform business impact assessments. You'll also learn about system resilience, sensitive data types, and data classifications. Lastly, you'll see a comparison of Recovery Time Objective and Recovery Point Objective, and examine what to include when preparing a disaster recovery training plan.

15 videos | 1h 21m Assessment Badge

Backup & Recovery: Enterprise Backup Strategies

Critical information must be backed up and protected for a company's survival. In this course, you'll learn about onsite and offsite backup and recovery solutions. You'll examine the three main cloud providers - Amazon Web Services, Microsoft Azure, and Google. You'll then learn about considerations for local backup and bring-your-own-device backups. Finally, you'll explore the cultural impact involved in moving to the cloud and how employee communication and inclusion could be vital to a successful migration.

11 videos | 45m Assessment Badge

Backup & Recovery: Windows Client Backup and Recovery Tools

For the vitality of any company, data protection solutions are essential. There are numerous types of built-in backup and recovery tools available in the Windows 10 operating system. In this course, you'll learn about features such as File History, System Image Backups, and OneDrive and how they can be used to keep data safe and secure. Next, you'll examine how to repair a Windows 10 PC using the Advanced Startup options, enable volume shadow copies, and create a recovery drive for access to the Advanced Startup options. Finally, you'll learn about the various restore features, such as System Restore, which can be used to restore a system to a previously known working version.

14 videos | 1h 11m Assessment Badge

Describing Distributed Systems

A distributed system involves numerous computers that work together but appear to the operator as a single computer. In this course, you'll learn how distributed systems can provide numerous benefits, including performance, availability, and autonomy. You'll also explore distributed systems in greater detail, and learn strategies and best practices for monitoring them.

13 videos | 42m Assessment Badge

Monitoring Distributed Systems

Principles and techniques are key in building a successful monitoring and alerting system. In this course, you'll explore the 'four golden signals' of monitoring while learning how to differentiate between symptoms and causes. You'll also learn about the guidelines for designing a monitoring system, questions to ask when creating rules for monitoring, and how to monitor for the long term.

14 videos | 30m Assessment Badge

Site Reliability Engineering: Scenario Planning

Scenario planning helps site reliability engineers strategically prepare for uncertainties that may disrupt or negatively affect services. In this course, you'll explore scenario planning use cases and the strategies utilized to prepare for disasters. You'll examine the functions of Disaster Recovery Testing (DiRT) and Customer Reliability Engineering teams, which help manage the impact of a disaster or disruption. Next, you'll identify disaster recovery testing events and recognize how to plan and design tests for DiRT. You'll move on to describe the production incident lifecycle and how to minimize production incidents. You'll identify unmanaged responses, how to rectify untrained responses, and the activities used to train response teams. Finally, you'll examine how to test people and how they self-organize and interact using various role-playing and test scenarios.

21 videos | 1h 11m Assessment Badge

SRE Simplicity: Software System Complexity

Simple systems and software are proven to be easier to develop, understand, maintain, and test. For site reliability engineers, simplicity should be an end-to-end goal and cover all aspects of the software life cycle. In this course, you'll explore the importance of simple systems and software code. You'll identify the different types of software complexity, such as structural complexity, organizational complexity, complexity of use, and theoretical complexity, and learn how to differentiate between complex and complicated code. You'll move on to recognize how to measure complexity using various metrics, such as cyclomatic complexity, the Halstead metric, and the maintainability index. Lastly, you'll examine class coupling, using NPATH to measure the complexity of a piece of code, and prioritizing the simplification of projects and resources.

18 videos | 1h 16m Assessment Badge
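To make the cyclomatic complexity metric mentioned in the course above more concrete, the following sketch approximates it for Python source using the standard `ast` module. It counts one path plus one for each branching construct; the set of node types counted here is an assumption chosen for illustration, and dedicated tools may count additional constructs.

```python
import ast

# Constructs treated as decision points in this approximation (an assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1  # a straight-line program has exactly one path
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2 == 0:
            continue
    return "done"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

For the sample function, the count is one base path, two `if` statements, one `for` loop, and one `and` operator.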

SRE Simplicity: Simple Software Systems

When creating a simple software system, it is essential to identify and remove any unwanted complexity, whether accidental or essential. By eliminating complexity, site reliability engineers can ensure the final software product is more stable and reliable. In this course, you'll learn to differentiate between agility and stability and explore the importance of stability testing. You'll learn about key metrics and methods, such as production analysis and agile process metrics, which can be used by software development teams to ensure business goals are met. Lastly, you'll learn how to avoid introducing potential defects and bugs by limiting the number of negative lines of code in a project.

15 videos | 1h 8m Assessment Badge

SRE Postmortems: Blameless Postmortem Culture Creation

There are various frequently used premortem and postmortem techniques adopted by site reliability engineers (SREs) to diagnose issues and come up with problem resolution ideas and alternative approaches. To do this effectively, SREs need to account for several factors at play, including the workplace culture and work collaboration. In this course, you'll learn how to promote a blameless culture - one without finger-pointing and animated language. You'll explore the key characteristics of good and bad postmortems, and discover the benefits of reviewing postmortems, sharing knowledge, giving feedback, and rewarding positive behavior. You'll then learn how to respond to postmortem culture implementation failure. Lastly, you'll discover how using the right postmortem templates and postmortem management tools can improve how you write postmortems and manage their associated data.

22 videos | 1h 11m Assessment Badge

Cloud and Containers for the SRE: Containers

Containers in cloud computing are a form of operating system virtualization that allows users or administrators to deploy and run applications without the need for virtual machines. Containers can be deployed and run virtually anywhere, and support Linux, Windows, and Mac operating systems. In this course, you'll explore the various types of container solutions, including Kubernetes, Docker, and AWS. You'll outline how containers enable a more efficient continuous integration and delivery system and why they're needed for SRE. You'll also examine container storage, security, and migration. You'll list the high-availability solutions available for containers and investigate the Containers as a Service concept. Lastly, you'll recognize how the container ecosystem is revolutionizing software delivery, and identify the role of Docker and Kubernetes in container orchestration.

20 videos | 1h 21m Assessment Badge

Cloud and Containers for the SRE: Implementing Container Solutions

Although containerization technologies such as Docker and Kubernetes can function independently, they can also benefit significantly from one another. Furthermore, open source automation tools such as Jenkins can be used to increase resource utilization and efficiency through pipelines. In this course, you'll explore the many benefits of pipelines, and learn how to use them to build code. You'll outline the benefits of Git and GitHub for revision control and identify the distributed version control tools that can be used to manage source code history. You'll then work with Jenkinsfiles to write pipeline-as-code and code to use at the build stage, after the build and test stages, and for recording failures. Next, you'll use the Jenkins Pipeline to set the environment variables and outline the key steps and factors needed in your code review. Lastly, you'll learn how to use Kubernetes to deploy applications with high availability, scalability, and resilience.

14 videos | 1h 7m Assessment Badge

SRE Troubleshooting Processes

Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues. In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.

18 videos | 1h 2m Assessment Badge

SRE Troubleshooting: Tools

Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use for resolving system issues. You'll start by examining the techniques of logging and whitebox and blackbox monitoring used to monitor system events. You'll then work with the various built-in Windows troubleshooting tools, namely the Event Viewer, Resource Monitor, and System Information tools. Next, you'll use Google Cloud Dataflow to process logs, before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems, and the open standards tool, Prometheus, for instrumenting software and exposing metrics.

13 videos | 41m Assessment Badge

Site Reliability Engineer: Managing Overloads

Site reliability engineers (SREs) are typically responsible for preventing and managing overloads. A common misconception is that overloads only affect computer systems. However, overloads also comprise types of occupational stress, which invariably negatively affect an organization. In this course, you'll explore the fundamental concepts and methods involved in managing overloads. You'll start by identifying operational load types and how they relate to performance. You'll then outline how to mitigate workloads and prioritize work before recognizing the specific consequences of overloads. You'll then describe how to manage client-side traffic using per customer limitations and client-side throttling. You'll examine tools such as criticality values and utilization signals. Finally, you'll explore approaches used for handling overload errors and learn how to identify issues caused by loads associated with connections.

20 videos | 1h 10m Assessment Badge
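The client-side throttling topic in the course above can be illustrated with the adaptive throttling formula published in Google's Site Reliability Engineering book, where a client rejects a request locally with probability max(0, (requests - K * accepts) / (requests + 1)). The sketch below is a minimal rendering of that formula; the function name and the default multiplier of 2 are assumptions for this example, and the course itself may present the technique differently.

```python
def rejection_probability(requests: int, accepts: int, k: float = 2.0) -> float:
    """Probability that a client should drop a request locally.

    requests: requests the client attempted in the recent window
    accepts:  requests the backend accepted in the same window
    k:        multiplier; larger values make the client less aggressive
    """
    if requests < 0 or accepts < 0 or accepts > requests:
        raise ValueError("expected 0 <= accepts <= requests")
    return max(0.0, (requests - k * accepts) / (requests + 1))


if __name__ == "__main__":
    # Healthy backend: everything is accepted, so nothing is throttled.
    print(rejection_probability(100, 100))
    # Overloaded backend: only 20 of 100 requests were accepted.
    print(rejection_probability(100, 20))
```

While the backend accepts at least 1/K of the traffic, the probability stays at zero; it rises toward one as the accepted share shrinks.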

Site Reliability Engineer: Managing Cascading Failures

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time. In this course, you'll examine the various cascading failure triggers, such as overloads, CPU, and memory issues. You'll also explore the resource exhaustion issues resulting from cascading failures and the adverse effects on overall performance and stability. You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.

21 videos | 1h 11m Assessment Badge
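The retry and deadline ideas named in the course above can be sketched as a small helper that bounds both the number of attempts and the total time spent, so that retries do not add load indefinitely during an overload. This is an illustrative sketch only: the function name, the parameters, and the jittered exponential backoff policy are assumptions, not material from the course.

```python
import random
import time


def call_with_retries(operation, deadline_s, max_attempts=3, base_delay_s=0.1,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation() with bounded retries and an overall deadline.

    sleep and clock are injectable so the helper can be tested without waiting.
    """
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        if clock() - start >= deadline_s:
            break  # deadline exceeded: stop doing work nobody is waiting for
        try:
            return operation()
        except Exception as exc:  # a real client would retry only retryable errors
            last_error = exc
        if attempt == max_attempts - 1:
            break  # no point in sleeping after the final attempt
        # Exponential backoff with jitter spreads retries out over time.
        delay = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0)
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        sleep(min(delay, remaining))
    raise RuntimeError("operation failed within its retry and deadline budget") from last_error
```

Capping attempts per request is only one safeguard; a fuller design would also budget retries per client or per server.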

SRE Emergency & Incident Response: Responding to Emergencies

Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs. In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of and how to respond to test-induced, change-induced, and process-induced emergencies and what's involved in proactive approaches to emergency testing and planning. You'll then outline the critical steps to correctly documenting emergencies, including the history of outages and mistakes. You'll then differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.

18 videos | 1h 12m Assessment Badge

SRE Emergency & Incident Response: Incident Response

A well-prepared and organized approach is key to addressing and managing the aftermath of a system failure, security breach, or cyberattack. In this course, you'll explore the fundamental principles an SRE needs to be familiar with when responding to and managing incidents. You'll identify the goals, requirements, best practices, and key players involved in incident management. You'll learn how to deal with managed and unmanaged incidents and what's involved in an incident response plan. You'll identify incident response roles and responsibilities, and how to use incident metrics to manage incidents at scale. You'll outline what's involved in establishing a computer security incident response team (CSIRT), including each key team member's roles and responsibilities. Lastly, you'll examine what goes into an incident response policy.

17 videos | 1h 24m Assessment Badge

Distributed Reliability: SRE Critical State Management

Anticipating failures that will affect your company's systems is a crucial site reliability engineer duty. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential in minimizing the likelihood of failures. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. You'll also investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.

14 videos | 1h 13m Assessment Badge

Distributed Reliability: SRE Distributed Periodic Scheduling

A distributed system requires constant maintenance to ensure failures don't interfere with its reliability and availability. Using periodic scheduling and replication, site reliability engineers can minimize the effect failures may have on a system's performance. One way to automate this process is to utilize the system daemon, cron. In this course, you'll explore how to use cron for task scheduling, the purpose, components, and operators involved in cron jobs, and the format and characters of cron syntax. You'll outline how cron works with distributed periodic scheduling and idempotency, and in large-scale deployments. Next, you'll review the Paxos distributed consensus algorithm, best practices for its use, and how it applies to distributed replication. Lastly, you'll practice scheduling a cron job and using cron syntax generators.

14 videos | 57m Assessment Badge
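As a small illustration of the cron syntax covered in the course above, the sketch below expands a standard five-field cron expression (minute, hour, day of month, month, day of week) into the concrete values each field matches. It supports `*`, comma lists, ranges, and `/` steps only; the function names are hypothetical, and real cron implementations accept extras (such as month names or `7` for Sunday) that this example deliberately omits.

```python
# (field name, lowest allowed value, highest allowed value)
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_field(field: str, low: int, high: int) -> list:
    """Expand one cron field such as '*/15' or '1-5,7' into sorted integers."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if step <= 0 or not low <= start <= end <= high:
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_cron(expression: str) -> dict:
    """Parse a five-field cron expression into {field name: matching values}."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected exactly five fields")
    return {
        name: expand_field(text, low, high)
        for text, (name, low, high) in zip(parts, FIELDS)
    }


if __name__ == "__main__":
    # Every 15 minutes during the 02:00 hour, Monday through Friday.
    print(parse_cron("*/15 2 * * 1-5"))
```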

SRE Load Balancing Techniques: Front-end Load Balancing

Today's distributed systems can consist of hundreds or even thousands of servers, and getting them to work together efficiently is a challenge. Load balancing is a multifaceted concept whose many techniques can help SREs face this challenge. In this course, you'll explore how front-end load balancing works and its associated techniques, concepts, and capabilities. You'll examine the characteristics of load balancers, their use in application delivery and security, and the use of DNS load balancers. You'll outline strategies for virtual IP load balancing, cloud load balancing, and handling overload. Finally, you'll learn how the Google Front End Service, Andromeda virtualization stack, Maglev network load balancing service, and the Envoy edge and service proxy are used for load balancing-related tasks.

14 videos | 1h 5m Assessment Badge

SRE Load Balancing Techniques: Data Center Load Balancing

A Site Reliability Engineer (SRE) must know how to perform load balancing within the data center, both internally and externally. In this course, you'll learn about load balancing, including various methods for balancing loads in the data center. You'll begin by examining what data center load balancing is and its importance to performance, as well as load balancing policies. You'll then learn how to deal with unhealthy tasks using flow control, and tips and tricks for optimizing load balancing. Next, you'll examine methods for limiting connection pools with subsetting, and the various load balancing components. Lastly, you'll learn how to balance loads internally and externally using HTTPS and TCP/UDP, and how to balance loads using SSL and TCP proxy load balancing.

14 videos | 1h 3m Assessment Badge

SRE Products at Scale: Product Launches

Site Reliability Engineers (SREs) often contribute to the launch of new products and features. These launches can occur in rapid iterations and at scale, so SREs need to be prepared to help them succeed. In this course, you'll examine launch coordination engineering to build and release reliable and fast products. You'll identify the criteria for a successful product launch and how to develop and use launch checklists to reduce failure and ensure consistency and completeness. Next, you'll outline the techniques used for reliable launches and how launch coordination engineers can help mitigate the repetition of launch mistakes. You'll investigate the production readiness review model used to identify a service's reliability needs. Lastly, you'll outline the characteristics of SRE engagement and early engagement models, as well as SRE engagement frameworks.

16 videos | 1h 17m Assessment Badge

Cloud and Containers for the SRE: Cloud Architectures & Solutions

When deploying a medium- to large-sized cloud solution, there are many factors to consider, such as the numerous cloud environments to choose from and the different levels of management and security they each require. In this course, you'll explore these environments in detail, with a specific focus on their application in SRE. You'll examine the features, purpose, benefits, and potential drawbacks of services such as Software as a Service (SaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Anything as a Service (XaaS). You'll then investigate private, public, hybrid, and community clouds and on and off-premises software. Moving on, you'll delve into cloud architecture-related topics, such as orchestration, automation, elasticity, and cloud bursting. Lastly, you'll study cloud payment models, resource allocation, and on-demand self-service.

24 videos | 1h Assessment Badge

SRE Testing Tasks: Software Reliability & Testing

Site reliability engineers (SREs) can use various testing techniques to ensure software operations are as failure-free as possible for a specified time in a specified environment. In this course, you'll explore multiple testing techniques, their purposes, and the tasks involved in their execution. You'll start by examining traditional software testing approaches, such as unit tests, integration tests, and system tests. Next, you'll investigate the components and use cases of various reliability metrics applied to SRE testing, including mean time to failure (MTTF), mean time to recover (MTTR), and mean time between failures (MTBF). Lastly, you'll outline several software testing approaches, such as stress, configuration, integration, acceptance, production, and canary testing, among others. You'll identify when, how, and by whom each of these testing types is carried out.

18 videos | 1h 22m Assessment Badge
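The reliability metrics listed in the course above (MTTF, MTTR, and MTBF) can be related with a short calculation. The sketch below assumes one common convention for repairable systems, in which MTBF = MTTF + MTTR and steady-state availability = MTTF / (MTTF + MTTR); definitions vary between sources, so treat the formulas and the function name as assumptions for this example rather than the course's own definitions.

```python
def reliability_metrics(uptime_hours: list, repair_hours: list) -> dict:
    """Compute MTTF, MTTR, MTBF, and availability from failure history.

    uptime_hours: running time before each failure
    repair_hours: time taken to recover from each corresponding failure
    """
    if not uptime_hours or len(uptime_hours) != len(repair_hours):
        raise ValueError("need one repair duration per failure")

    mttf = sum(uptime_hours) / len(uptime_hours)   # mean time to failure
    mttr = sum(repair_hours) / len(repair_hours)   # mean time to recover
    mtbf = mttf + mttr                             # assumed convention, see lead-in
    return {
        "mttf": mttf,
        "mttr": mttr,
        "mtbf": mtbf,
        "availability": mttf / mtbf,
    }


if __name__ == "__main__":
    # Three failures after 100, 200, and 300 hours, repaired in 1, 2, and 3 hours.
    print(reliability_metrics([100, 200, 300], [1, 2, 3]))
```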

SRE Testing Tasks: Testing Considerations

Site reliability engineers (SREs) need to create a healthy test and build environment to ensure that products being distributed integrate and function as expected. In this course, you'll explore the fundamentals of creating a robust SRE test and build environment, looking at the standard tools and techniques available for testing at scale. You'll examine disaster and statistical testing, and learn about working with deadlines and production configurations. You'll investigate the topic of test failures, identifying why an SRE should expect specific tests to fail and how results for test failures can help maximize knowledge about operations and end-users. Lastly, you'll look at the why and how of incorporating break glass procedures, integration testing configuration files, and fake back-end versions into your testing procedures.

14 videos | 1h 2m Assessment Badge

SRE Team Management: Scaling the Team

When adding a new site reliability engineer (SRE) to your team, it's important that the new member not only has the required skills but also receives the proper training. This allows the new SRE to fit into the team and get up to speed as quickly as possible. In this course, you'll learn about the best practices for onboarding a new SRE team member, including methods and tools that can be used during the onboarding process. Next, you'll explore the technical skills that an SRE requires, including the ability to reverse engineer an application to determine the root cause of a problem. Finally, you'll examine the skills and knowledge an SRE requires when on-call, including those needed to provide support and manage support issues.

14 videos | 1h 3m Assessment Badge

SRE Data Pipelines & Integrity: Data Pipelines

Site reliability engineers often find data processing complex as demands for faster, more reliable, and more cost-effective results continue to evolve. In this course, you'll explore techniques and best practices for managing a data pipeline. You'll start by examining the various pipeline application models and their recommended uses. You'll then learn how to define and measure service level objectives, plan for dependency failures, and create and maintain pipeline documentation. Next, you'll outline the phases of a pipeline development lifecycle's typical release flow before investigating more challenging topics such as managing data processing pipelines, using big data with simple data pipelines, and using periodic pipeline patterns. Lastly, you'll delve into the components of Google Workflow and recognize how to work with this system.

21 videos | 1h 11m Assessment Badge

SRE Data Pipelines & Integrity: Pipeline Design

Site reliability engineers (SREs) encounter numerous and varied pipeline technologies and frameworks in their work. When building a pipeline, SREs need to invest considerable time during the design phase to ensure the results work best for the specific case. In this course, you'll explore the numerous features of a pipeline, such as latency, high availability, development, and operations. You'll also examine the two different pipeline mutations: idempotent and two-phase, as well as the checkpointing technique and various code patterns. You'll then investigate the five core characteristics of the pipeline maturity matrix and outline how they should be used to design the pipeline technology. You'll then identify potential failure modes, outage causes, and different prevention and response techniques. Finally, you'll outline event delivery system design and operations and how to plan for customer integration and support.

17 videos | 1h 2m Assessment Badge

SRE Data Pipelines & Integrity: Data Integrity

Data integrity is vital as it ensures end-user data accuracy and consistency in conjunction with an adequate level of service and availability. In this course, you'll learn how to choose a strategy for data integrity, including how to account for any potential upsides and tradeoffs. You'll explore various types of failures that lead to data loss and the existence of the many data failure modes. You'll also identify data integrity challenges. Next, you'll examine in detail the soft deletion, backup and recovery, and early detection layers of defense-in-depth, before investigating the data integrity challenges a cloud developer may encounter in high-velocity environments. Finally, you'll outline considerations for implementing out-of-band data validation and successful data recovery and identify how the primary SRE principles apply to data integrity.

16 videos | 1h 6m Assessment Badge

SRE Team Management: Managing Operational Loads

To ensure and maintain a system's functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage a system's operational load, which generally falls into three categories: ongoing operation activities, tickets, and pages. In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level and using support ticketing systems and service level objectives. Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team. Additionally, you'll outline how to work with interrupts and distinguish between crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect.

17 videos | 54m Assessment Badge

SRE Team Management: Operational Overload

Site reliability engineers (SREs) are responsible for many administrative tasks, often splitting their time between reactive ops work and special projects. To ensure teams do not become overloaded, SREs may be transferred to a team in order to prevent or help mitigate overload. In this course, you will learn how to deal with operational overload. You'll start by examining ops mode, which is an approach used to ensure services are properly maintained and optimized. You'll discover factors that contribute to team morale and stress. In addition, you will outline emergency planning strategies and best practices, as well as learn how to categorize emergencies and prepare detailed emergency plans. Next, you'll explore how knowledge sharing relates to emergency preparedness, the key to writing successful postmortems, the importance of service level objectives, and how an appropriate level of detail is required to properly explain your findings. Lastly, you'll discover the key factors and attributes of successful teams. You'll examine a team-first approach and differentiate between questioning techniques such as open/closed, funnel, probing, and leading.

14 videos | 55m Assessment Badge

SRE Metric Management: Software Reliability Metrics

To improve the chances of creating, monitoring, and maintaining a successful software development project, site reliability engineers and all team members must be aware of which metrics to measure. They also need a working knowledge of both automated and manual testing methods. In this course, you'll learn how to manage and select SRE metrics and how various testing methods work. You'll begin by learning what metrics need to be measured for project management, software development, and APIs - examining in detail CI/CD, cloud API, and software project metrics, to name a few. Next, you'll compare both manual and automated testing methods and the goals of each. Lastly, you'll investigate automated testing frameworks and platforms, test cases and types, and best practices and pitfalls to consider.

17 videos | 1h 23m Assessment Badge

SRE Metric Management: Software Reliability Monitoring and Reporting

Once SRE metrics have been identified, site reliability engineers (SREs) must know how to perform fault analysis on a system, classify defects, and monitor and report data. In this course, you'll explore the tools and best practices for carrying out these procedures. You'll begin by identifying various fault analysis methods and tools. You'll then classify software defects and bugs with a focus on severity and priority. Next, you'll investigate strategies for monitoring APIs and explore some tools used for this task. You'll then examine in detail several tools for collecting, analyzing, and reporting metric data using a customizable dashboard, including those that comprise the ELK Stack - Elasticsearch, Logstash, and Kibana. Furthermore, you'll explore the data collection tool Beats and the beneficial use cases for Elasticsearch notifications.

17 videos | 1h 17m Assessment Badge

Core Skills for Site Reliability Engineers: SRE Collaboration & Communication

Collaboration is key to getting the most out of your team and ensuring your clients receive their desired service. In this course, you'll learn to collaborate and communicate effectively as an SRE. You'll learn how to run traditional and virtual meetings to ensure maximum effectiveness and productivity, whether it's with customers, internal or external team members, or distributed teams. You'll examine how to plan, carry out, and post-analyze meetings using best practices and sufficient preparation, tailoring these methods to suit the participants and the end goal. You'll delve into the unique characteristics of different meeting types, such as those for problem-solving or innovation. You'll explore the advantages and challenges of SRE pair programming. You'll then end the course by investigating some helpful collaboration and communication tools.

14 videos | 1h 5m Assessment Badge

SRE Engagement: Production Readiness Review

Production Readiness Review (PRR), the standard first step of SRE engagement, and its phases are used to identify a service's reliability needs. The concept of "early engagement" is then used to evolve the Simple PRR model. In this course, you'll investigate SRE engagement, early engagement, and Production Readiness Review. You'll start by delving into each phase of the SRE Production Readiness Review (PRR) model, namely, engagement, analysis, refactoring, training, onboarding, and continuous improvement. Next, you'll learn how early engagement can be used to evolve the Simple PRR model. You'll then examine how SRE platforms and frameworks can provide structural solutions. Finally, you'll learn how to use the SRE engagement model to manage software projects, comparing it to the traditional System Development Life Cycle (SDLC) model.

14 videos | 1h Assessment Badge

SRE Engagement: The SRE Engagement Model

The SRE engagement model and SRE service lifecycle have noteworthy similarities to and differences from the traditional software development life cycle. In this course, you'll explore these differences and investigate the SRE engagement model's components and how to work with it in various circumstances. You'll learn the steps for setting up and building SRE service relationships and establishing a roadmap for sprints and communication. You'll examine how to measure the impact of SRE engagement, set ground rules for SRE teams, and sustain effective relationships with other SREs and developers. Next, you'll study the steps to take for scaling SRE to larger environments and for ending an engagement. Lastly, you'll review case studies to see how others have used the SRE engagement model in real life.

14 videos | 1h 3m Assessment Badge

Introduction to SRE and Essential Tools

Site reliability engineering (SRE) is based on a set of principles and practices used to monitor and observe software reliability in a production environment. In this course, you will dive into the fundamentals of SRE and the evolution of SRE over the years. Next, you will examine the site reliability engineering role and find out how to suitably find, place, bootstrap, and distribute site reliability engineers. You will discover the SRE principles that organizations should strive for, key SRE metrics, the importance of error budgeting, and the essential tools used in SRE. Then you will compare and contrast SRE to traditional IT operations, explore the SRE lifecycle from planning to operation, and investigate the process of incident response and postmortem analysis. Finally, you will focus on the cultural impacts of SRE within an organization, set up and configure a basic monitoring tool, and create a simple dashboard using Grafana.

16 videos | 1h 34m Assessment Badge

Implementing SRE Best Practices with Tools

Site Reliability Engineering (SRE) tools can help engineers monitor critical systems, automate incident response, collaborate on issues, and detect abnormal behaviors in the software. In this course, you'll learn best practices for effective monitoring and alerting, as well as different types of automation tools used in SRE. You will explore the process of establishing and revising service-level objectives (SLOs) and service-level indicators (SLIs) and discover methods for integrating SRE practices into existing workflows. Next, you will look at approaches for capacity planning and resource allocation and the process for creating effective SLIs. You will also explore the use of feedback loops for continuous improvement and discover the benefits of using simulations for incident response exercises. Lastly, you will see how to automate a routine maintenance task using a common SRE tool.

12 videos | 1h 7m Assessment Badge

Site Reliability Engineering Network Optimization

Network performance optimization is an important component of site reliability engineering (SRE). It allows IT systems to deliver a better user experience by improving their speed, responsiveness, and efficiency. In this course, examine network optimization principles for SRE, network bottlenecks and solutions, and how to measure network performance and latency. Next, discover techniques for optimizing bandwidth and reducing latency, the impact of network design on service reliability, and implementation strategies for redundant network pathways. Finally, learn about network troubleshooting and diagnostics, network monitoring and management tools, load balancing and traffic management approaches, and methods for securing network communications in SRE practices. At the end of this course, you'll be able to identify key elements of network optimization in site reliability engineering.

14 videos | 1h 26m Assessment Badge

Site Reliability Engineering Observability

Observability plays an important role in systems engineering because it enables real-time detection and diagnosis of potential issues, allowing for proactive problem-solving and enhanced performance. In this course, you will take a deep dive into site reliability engineering (SRE) observability, including the three pillars of observability: logs, metrics, and traces. Then you will explore the tools and technologies used for achieving observability and the methods for performing observability in distributed systems. Next, you will discover strategies for log management and analysis, methods for collecting and analyzing metrics, and effective trace analysis methods. You will examine observability tool use cases and methods for setting up observability-related alerts and for performing root cause analysis using observability data. Finally, you will learn how to set up a logging framework for a small application, create and configure alerts, and perform a network trace analysis using Microsoft Network Analyzer.

15 videos | 1h 31m Assessment Badge

Final Exam: Chaos Engineer

Final Exam: Chaos Engineer will test your knowledge and application of the topics presented throughout the Chaos Engineer track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

1 video | 32s Assessment Badge

Final Exam: Network Admin

Final Exam: Network Admin will test your knowledge and application of the topics presented throughout the Network Admin track of the Skillsoft Aspire Network Admin to Site Reliability Engineer Journey.

1 video | 32s Assessment Badge

SRE Incident Management: Fundamentals & Best Practices

Site reliability engineering (SRE) incident management focuses on managing and responding to incidents effectively, including best practices for incident response, postmortems, and continuous improvement processes. In this course, explore the basics of incident management and its importance in IT operations. Next, examine the key roles and responsibilities of an incident management team and the steps for detecting, responding to, and resolving incidents. Finally, discover the key techniques used for effective communication and documentation during an incident and strategies for post-incident review and continuous improvement. After completing this course, you will be able to outline the procedures of SRE incident management and implement incident response methods.

13 videos | 1h 20m Assessment Badge

SRE Incident Management: Deep Dives, Postmortems, & Continuous Improvement

Site reliability engineering (SRE) incident management focuses on managing and responding to incidents effectively, including implementing best practices for incident response, postmortems, and continuous improvement processes. In this course, explore advanced techniques for incident analysis and root cause identification, including best practices for conducting effective and blameless postmortems. Next, discover methods for translating postmortem findings into actionable improvements and how to implement strategies for fostering a culture of transparency and continuous learning. Finally, learn about approaches for measuring and tracking the effectiveness of improvements. After completing this course, you will be able to implement advanced incident analysis and root cause identification methods.

1h 42m Assessment Badge

Comprehensive Monitoring with Prometheus

Prometheus is a widely used open-source monitoring and alerting toolkit, essential for site reliability engineers. In this course, explore Prometheus' features, data models, characteristics and components, basic configuration, and metric types. Next, learn about Prometheus jobs and instances, the internal label structure, key considerations for consoles and dashboards, and long-term storage options. Finally, examine storage and performance optimization, scaling monitoring in large deployments, strategies for scaling Prometheus effectively, and how to download, install, and configure Prometheus to always run in the background at boot time. After completing this course, you will be able to install Prometheus, leverage advanced monitoring features, and create dynamic and interactive dashboards in Grafana.

15 videos | 1h 22m Assessment Badge

Comprehensive Monitoring with Grafana

Grafana is a multi-platform tool for visualizing and analyzing data. Grafana dashboards combine data from various sources into visual panels. In this course, explore how to install and configure Grafana, connect Prometheus as a data source, and use the query editor to design dashboards. Next, discover how to create interactive dashboards, configure alerts, and examine best practices for dashboard design and enhancing user experience. Finally, learn how to leverage the Grafana API, implement annotations to highlight data events, and configure and test Grafana alerts and notifications. After completing this course, you will be able to install Grafana, utilize advanced monitoring features, and create dynamic and interactive dashboards.

12 videos | 1h 10m Badge

SRE: Capacity Planning & Load Testing Essentials

Capacity planning and load testing are vital in site reliability engineering (SRE) to maintain system stability and performance. In this course, explore key processes for forecasting resource needs and evaluating system performance, system usage, and performance and capacity metrics. Next, learn how to set up and execute load tests using popular tools, simulate user behavior, and analyze different load scenarios. Finally, discover how to identify performance bottlenecks, implement load testing strategies, and evaluate scaling strategies to optimize system performance. After completing this course, you will be able to plan for resource capacity, execute load tests effectively, and ensure systems remain resilient under stress.

11 videos | 1h 1m Assessment Badge

EARN A DIGITAL BADGE WHEN YOU COMPLETE THESE COURSES

Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.

BOOKS INCLUDED

Book

Chaos Engineering

By reading this book you will learn how Chaos Engineering enables your organization to navigate complexity.

7h 17m By Casey Rosenthal

Book

The Art of Site Reliability Engineering (SRE) with Azure: Building and Deploying Applications That Endure

After reading this book, you will understand the underlying concepts of SRE and its implementation using Azure public cloud.

3h 10m By Unai Huete Beloki

Book

High Performance SRE: Automation, error budgeting, RPAs, SLOs, and SLAs with site reliability engineering

This book caters to students, application developers, software engineers, system administrators, and anyone who wishes to understand how to have a rewarding career in the field of SRE.

3h 42m By Anchal Arora Mishra

SKILL BENCHMARKS INCLUDED

IT Operations Awareness (Entry Level)

The IT Operations Awareness benchmark will measure your ability to recognize key IT Operations terms and concepts. You will be evaluated on different cloud services, data backup, 5G concepts, and key networking concepts. A learner who scores high on this benchmark demonstrates the skills related to understanding key IT Operations terminology and concepts.

20m | 15 questions

SRE Awareness (Entry Level)

The SRE Awareness benchmark measures whether a learner has some understanding of SRE technologies, practices, and principles. A learner who scores high on this benchmark demonstrates a general awareness of SRE operations.

20m | 11 questions

SRE Competency (Intermediate Level)

The SRE Competency benchmark measures whether a learner has project-level exposure to SRE technologies, practices, and principles across multiple platforms. A learner who scores high on this benchmark demonstrates professional competency in all of the major areas of SRE operations, across a variety of different platforms and deployments.

25m | 42 questions

SRE Literacy (Beginner Level)

The SRE Literacy benchmark measures whether a learner has had some working exposure to SRE technologies, practices, and principles across multiple platforms. A learner who scores high on this benchmark demonstrates professional literacy in most areas of SRE operations, across a variety of different platforms and deployments.

21m | 21 questions

SRE Proficiency (Advanced Level)

The SRE Proficiency benchmark measures whether a learner has had extensive exposure to SRE technologies, practices, and principles across multiple platforms. A learner who scores high on this benchmark demonstrates professional proficiency in all of the major areas of SRE operations, across a variety of different platforms and deployments.

32m | 32 questions
