Distributed Reliability: SRE Critical State Management

SRE | Intermediate

14 videos | 1h 13m 47s
Includes Assessment
Earns a Badge

Anticipating failures that will affect your company's systems is a crucial duty of a site reliability engineer. These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential to minimizing their likelihood. In this course, you'll explore both critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults. You'll then outline how each of these benefits distributed system management. After that, you'll investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.

WHAT YOU WILL LEARN

Discover the key concepts covered in this course

Describe critical state management and how it applies to distributed systems and affects reliability

Define the CAP theorem and describe how it relates to distributed systems

Outline how to handle coordination failures in distributed systems

Differentiate deterministic and nondeterministic algorithms and how they relate to distributed systems

Describe the system models that can be used with distributed systems

Define the concept of distributed consensus and list the stages of validation

Define the concept of a Byzantine fault and describe how it applies to distributed systems

Describe the distributed consensus architecture patterns used in distributed systems

Describe best practices and techniques for increasing the performance of distributed systems

Define the Multi-Paxos protocol and describe how it relates to distributed systems

Outline how to deploy distributed consensus-based systems and name some key considerations

Name and describe the key considerations when monitoring distributed consensus systems

Summarize the key concepts covered in this course

IN THIS COURSE

1m 20s

5m 41s

Site reliability engineering (SRE) is the practice of allowing software developers to run, manage, and maintain the ongoing daily operations of their applications and services so that they are available for users to consume. Critical state management is a key part of SRE, as it allows for anticipating and planning for system failures. Distributed consensus is needed for building highly available and robust systems, which leads to the use of distributed locking.
3. CAP Theorem

5m 19s

In this video, you'll learn about the properties of typical DBMS transactions, known as ACID: atomic, consistent, isolated, and durable. The idea is that every transaction performed against a DBMS abides by these characteristics. You'll learn how to define these terms and how they relate to distributed systems.
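
As a rough illustration of the "atomic" property, the sketch below uses Python's built-in sqlite3 module. The `accounts` table and the `make_db`, `transfer`, and `balances` helpers are hypothetical names chosen for this example, not part of the course. It assumes a single local database, so it shows atomicity only, not how the property is achieved across distributed nodes.

```python
import sqlite3

def make_db():
    """Create a hypothetical in-memory accounts table (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one transaction; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back too: both updates happen or neither does.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account returns False and leaves both balances unchanged, because the earlier credit is undone along with the failed debit.
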
4. Distributed Systems Coordination Failure

7m 22s

In this video, you will outline the primary job of an SRE. You will learn that, like all systems, distributed systems can sometimes fail. The objective is always to restore a system to full operation. This means that when a failure happens, we need to figure out what is going on and then resolve the issue.
5. Deterministic vs. Nondeterministic

7m 21s

In this video, you'll learn the difference between deterministic and nondeterministic algorithms. The objective of an algorithm is to get an answer. However, not every algorithm can give you a specific answer. This leads to the discussion of deterministic versus nondeterministic algorithms. Deterministic algorithms work through the same states every time to produce an answer. Meanwhile, nondeterministic algorithms might go through completely different states every time they execute.
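
The contrast between "the same states every time" and "different states every time" can be sketched with two searches over a list. This is a minimal illustration, not taken from the course: the function names are hypothetical, and a randomized search stands in for nondeterminism. Each function returns the index it found plus a trace of the indices it visited.

```python
import random

def find_deterministic(items, target):
    """Linear scan: visits indices 0, 1, 2, ... in the same order on every run."""
    trace = []
    for i, value in enumerate(items):
        trace.append(i)
        if value == target:
            return i, trace
    return None, trace

def find_nondeterministic(items, target, rng=None):
    """Random probing: visits indices in an order drawn from a random source."""
    rng = rng or random.Random()
    order = list(range(len(items)))
    rng.shuffle(order)  # the visiting order can differ between runs
    trace = []
    for i in order:
        trace.append(i)
        if items[i] == target:
            return i, trace
    return None, trace
```

Both functions find the same index when the target is unique, but only the first is guaranteed to produce an identical trace on every call.
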
6. Distributed System Models

5m 51s

In this video, you will learn about different kinds of distributed systems. You will discover that there are several different categories of distribution, including synchronous and asynchronous models. You will also learn about architectural models and the fundamental models.
7. Distributed Consensus

5m 9s

In this video, you will learn more about distributed systems and how to achieve reliability in a system when dealing with faulty processes. You will learn that solving this problem requires these distributed processes to agree on which data values will be committed to a database. You will learn there are many ways to achieve distributed consensus, including a two-phase commit process and a three-phase commit process.
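
The two-phase commit process mentioned above can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions: the `Participant` class and `two_phase_commit` function are hypothetical names, there is no network, and timeouts, coordinator failure, and durable logging are all omitted.

```python
class Participant:
    """A hypothetical participant that stages a value, then commits or aborts."""

    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.staged = None
        self.committed = None
        self.state = "init"

    def prepare(self, value):
        """Phase 1: stage the value and vote yes, or refuse and vote no."""
        if self.will_vote_yes:
            self.staged = value
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self):
        """Phase 2 (all voted yes): make the staged value permanent."""
        self.committed = self.staged
        self.state = "committed"

    def abort(self):
        """Phase 2 (any vote was no): discard the staged value."""
        self.staged = None
        self.state = "aborted"

def two_phase_commit(participants, value):
    """Coordinator: collect every vote, then commit only if all voted yes."""
    votes = [p.prepare(value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A single "no" vote aborts the transaction on every participant, which is what keeps the participants in agreement about the committed value.
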
8. Byzantine Fault

4m 51s

9. Distributed Consensus Architecture Patterns

6m 50s

In this video, you'll learn about distributed consensus algorithms, which allow nodes to agree on information. On their own they're low-level primitives, so practical functionality comes from the higher-level components built on them, such as datastores, configuration stores, queues, locking, and leader election services.
10. Distributed Consensus Performance

5m 40s

Distributed consensus can be quite slow and costly, but if it is implemented correctly, it can still function effectively. To improve performance, throughput, latency, and data replication, a number of strategies can be employed.
11. Multi-Paxos Detailed Message Flow

4m 50s

In this video, you'll learn how to define the Multi-Paxos protocol and describe how it relates to distributed systems. You'll learn that Paxos operates as a sequence of proposals. These proposals are accepted or denied by a majority of the processes in the system. If accepted, the proposals are executed. This means that Paxos makes sure all the operations are performed in a strict order.
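
The majority rule described above can be sketched as follows. This is deliberately not a Paxos implementation: it omits ballot numbers, leader election, and recovery of earlier proposals. The `Acceptor` class and the `propose` and `run` functions are hypothetical names. The sketch only shows that a numbered proposal is executed once a majority acknowledges it, and that proposals are executed in sequence order.

```python
class Acceptor:
    """A hypothetical process that records the proposals it accepts."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.log = {}  # sequence number -> accepted operation

    def accept(self, seq, op):
        """Acknowledge a proposal unless this process is unreachable."""
        if not self.reachable:
            return False
        self.log[seq] = op
        return True

def propose(acceptors, seq, op):
    """Return True if a strict majority of acceptors acknowledged (seq, op)."""
    acks = sum(1 for a in acceptors if a.accept(seq, op))
    return acks > len(acceptors) // 2

def run(acceptors, operations):
    """Propose operations in order and execute each one the majority accepts."""
    executed = []
    for seq, op in enumerate(operations):
        if not propose(acceptors, seq, op):
            break  # strict ordering: never skip past a slot that was not accepted
        executed.append(op)
    return executed
```

With five acceptors, the sequence still makes progress when two are unreachable, but stops as soon as three are, because no majority remains.
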
12. Distributed Consensus-based System Deployment

7m 33s

When deploying a distributed consensus-based system, you need to take a number of factors into account, such as the number of replicas needed, the load each system can handle, and the quorum composition. The best way to determine how many replicas you need is to ask questions such as how important reliability is, how often you'll perform planned maintenance, and what level of risk you're willing to accept.
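
The replica-count question comes down to simple arithmetic once two assumptions are fixed: the system uses majority quorums, and replicas fail by crashing rather than behaving in a Byzantine way. Under those assumptions, a minimal sketch with hypothetical helper names looks like this:

```python
def quorum_size(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas):
    """How many replicas can be down while a majority is still available."""
    return (replicas - 1) // 2

def replicas_needed(unplanned_failures, planned_outages=0):
    """Replicas required to keep a majority with this many replicas down at once."""
    down = unplanned_failures + planned_outages
    return 2 * down + 1
```

For example, `replicas_needed(1)` gives 3, while `replicas_needed(1, planned_outages=1)` gives 5, which is why a system that must survive a failure during planned maintenance needs five replicas. Note that `tolerated_failures(4)` equals `tolerated_failures(3)`: an even replica count adds cost without adding fault tolerance.
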
13. Distributed Consensus System Monitoring

4m 54s

In this video, you will learn more about distributed consensus-based systems. These are great to have because they solve a lot of problems and reduce risk while reliably serving customers. But as an SRE, your main responsibility is keeping that system up and running. Therefore, one thing you want to have is the ability to monitor these systems. You will learn that to monitor a distributed system, you need system metrics and log data collected and stored in a searchable format.
14. Course Summary

1m 7s


EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE

Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.

Digital badges are yours to keep, forever.
