Distributed Reliability: SRE Critical State Management
SRE | Intermediate
- 14 videos | 1h 13m 47s
- Includes Assessment
- Earns a Badge
Anticipating failures that will affect your company's systems is a crucial duty of a site reliability engineer (SRE). These failures are especially significant when they affect distributed systems, which is why efficient algorithms and strategies are essential to minimizing the likelihood of failure. In this course, you'll explore critical state management and the CAP theorem, identifying how both concepts relate to distributed systems. Next, you'll examine several distributed system management algorithms and strategies, including deterministic and nondeterministic algorithms, distributed system models, and Byzantine faults, and outline how each benefits distributed system management. You'll then investigate the Multi-Paxos message flow protocol and how it works with distributed systems. Finally, you'll describe what's involved in deploying and monitoring a consensus-based system to increase distributed system performance.
WHAT YOU WILL LEARN
- Discover the key concepts covered in this course
- Describe critical state management and how it applies to distributed systems and affects reliability
- Define the CAP theorem and describe how it relates to distributed systems
- Outline how to coordinate system failures on distributed systems
- Differentiate deterministic and nondeterministic algorithms and how they relate to distributed systems
- Describe the system models that can be used with distributed systems
- Define the concept of distributed consensus and list the stages of validation
- Define the concept of a Byzantine fault and describe how it applies to distributed systems
- Describe the distributed consensus architecture patterns used in distributed systems
- Describe best practices and tricks for increasing the performance of distributed systems
- Define the Multi-Paxos protocol and describe how it relates to distributed systems
- Outline how to deploy distributed consensus-based systems and name some key considerations
- Name and describe the key considerations when monitoring distributed consensus systems
- Summarize the key concepts covered in this course
IN THIS COURSE
-
1m 20s
-
5m 41s
Site reliability engineering (SRE) is the practice of allowing software developers to run, manage, and maintain the ongoing daily operations of their applications and services so that they are available for users to consume. Critical state management is a key part of SRE, as it allows for anticipating and planning for system failures. Distributed consensus is needed for building highly available and robust systems, which leads to the use of distributed locking.
-
5m 19s
In this video, you'll learn more about the properties of typical DBMS transactions. These are known as ACID, which stands for atomic, consistent, isolated, and durable. The idea is that every transaction performed against a DBMS abides by these characteristics. You'll learn how to define these terms and how they relate to distributed systems.
-
7m 22s
In this video, you will outline the primary job of an SRE. You will learn that distributed systems, like all systems, can fail. The objective is always to restore a system to full operation. This means that when a failure happens, we need to figure out what is going on and then resolve the issue.
-
7m 21s
In this video, you'll learn the difference between deterministic and nondeterministic algorithms. The objective of an algorithm is to get an answer. However, not every algorithm can give you a specific answer, which leads to the discussion of deterministic versus nondeterministic algorithms. Deterministic algorithms work through the same states every time to produce an answer. Nondeterministic algorithms, meanwhile, might go through completely different states every time they execute.
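As a rough illustration of the distinction described above (not taken from the course itself), the Python sketch below contrasts a deterministic scan with a randomized one. The function names `deterministic_max` and `randomized_max` are hypothetical, and a randomized visiting order is used here only as a simple stand-in for nondeterministic behavior.

```python
import random


def deterministic_max(values):
    """Scan left to right: the same input always passes through the same states."""
    best = values[0]
    trace = [best]  # running maximum after each step
    for v in values[1:]:
        if v > best:
            best = v
        trace.append(best)
    return best, trace


def randomized_max(values, rng):
    """Visit items in a random order: the final answer is the same, but the
    intermediate states can differ from one run to the next."""
    order = list(values)
    rng.shuffle(order)
    best = order[0]
    trace = [best]
    for v in order[1:]:
        if v > best:
            best = v
        trace.append(best)
    return best, trace


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(deterministic_max(data))                # identical on every run
    print(randomized_max(data, random.Random()))  # trace may vary between runs
```

Both functions return the same maximum; only the sequence of intermediate states differs, which is the property the video description highlights.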
-
5m 51s
In this video, you will learn about different kinds of distributed systems. You will discover that there are several categories of distribution, including synchronous and asynchronous models. You will also learn about architectural and fundamental models.
-
5m 9s
In this video, you will learn more about distributed systems and how to achieve reliability in a system when dealing with faulty processes. You will learn that solving this problem requires the distributed processes to agree on which data values will be committed to a database. You will learn there are many ways to achieve distributed consensus, including a two-phase commit process and a three-phase commit process.
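The two-phase commit process mentioned above can be sketched in a few lines of Python. This is a minimal, in-memory illustration under simplifying assumptions (no network, no timeouts, no coordinator failure); the `Participant` class and `two_phase_commit` function are hypothetical names, not part of the course material.

```python
class Participant:
    """A hypothetical process that votes on whether a value can be committed."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"
        self.value = None

    def prepare(self, value):
        # Phase 1: vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, value):
        self.state = "committed"
        self.value = value

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants, value):
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare(value) for p in participants]
    # Phase 2 (completion): commit only on a unanimous yes, otherwise abort all.
    if all(votes):
        for p in participants:
            p.commit(value)
        return True
    for p in participants:
        p.abort()
    return False


if __name__ == "__main__":
    nodes = [Participant("a"), Participant("b"), Participant("c", can_commit=False)]
    print(two_phase_commit(nodes, "x=1"), [p.state for p in nodes])
```

A single "no" vote aborts the transaction on every participant, so the processes never disagree about whether the value was committed.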
-
4m 51s
-
6m 50s
In this video, you'll learn about distributed consensus algorithms. You'll learn that these algorithms allow nodes to agree on information. They're low level and primitive, but distributed consensus algorithms provide a good foundation for practical functionality. You'll also learn that higher-level components such as datastores, configuration stores, queues, locking, and leader election services build on consensus algorithms to provide that functionality.
-
5m 40s
Distributed consensus can be quite slow and costly, but when implemented correctly it can still perform well. A number of strategies can be employed to improve throughput, latency, and data replication.
-
4m 50s
In this video, you'll learn how to define the Multi-Paxos protocol and describe how it relates to distributed systems. You'll learn that Paxos operates as a sequence of proposals. These proposals are accepted or rejected by a majority of the processes in the system. If accepted, the proposals are executed. This means that Paxos makes sure all operations are performed in a strict order.
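The idea of numbered proposals that take effect only when a majority accepts them can be sketched as follows. This is a heavily simplified illustration, not the Multi-Paxos protocol itself: it omits the prepare/promise phase, leader election, and failure recovery, and the `Acceptor` class and `propose` function are hypothetical names.

```python
class Acceptor:
    """A hypothetical process that votes on numbered proposals."""

    def __init__(self):
        self.highest_seen = -1  # highest proposal number accepted so far

    def accept(self, number):
        # Reject stale proposals so that accepted numbers only move forward.
        if number <= self.highest_seen:
            return False
        self.highest_seen = number
        return True


def propose(acceptors, number, operation, log):
    """Append `operation` to `log` only if a strict majority of acceptors accept."""
    votes = sum(1 for a in acceptors if a.accept(number))
    if votes > len(acceptors) // 2:
        log.append((number, operation))
        return True
    return False


if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(3)]
    log = []
    propose(acceptors, 1, "set x=1", log)
    propose(acceptors, 1, "set x=9", log)  # reuses a stale number, so it is rejected
    propose(acceptors, 2, "set y=2", log)
    print(log)
```

Because acceptors refuse any proposal number they have already passed, the log only grows in increasing proposal order, which mirrors the strict ordering described above.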
-
7m 33s
When deploying a distributed consensus-based system, you need to take a number of factors into account, such as the number of replicas needed, the load each system can handle, and the quorum composition. The best way to determine how many replicas you need is to ask questions such as how important reliability is, how often you'll perform planned maintenance, and what level of risk you're willing to accept.
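The replica-count question above comes down to simple arithmetic for majority-quorum systems. The sketch below assumes crash failures only (not Byzantine faults) and a simple majority quorum; the function names are hypothetical.

```python
def majority_quorum(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """Replicas that can be down while a majority is still reachable."""
    return (replicas - 1) // 2


def replicas_needed(failures):
    """Replicas required to keep a majority with `failures` replicas down."""
    return 2 * failures + 1


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority_quorum(n), tolerated_failures(n))
```

Under these assumptions, four replicas tolerate no more failures than three, which is why odd replica counts are common. If one replica may be down for planned maintenance and the system should still survive one unplanned failure, `replicas_needed(2)` gives five replicas.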
-
4m 54s
In this video, you will learn more about distributed consensus-based systems. These systems are valuable because they solve many problems and reduce risk while reliably serving customers. As an SRE, however, your main responsibility is keeping the system up and running, so you need the ability to monitor it. You will learn that monitoring a distributed system requires system metrics and log data that are collected and stored in a searchable format.
-
1m 7s
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.
Digital badges are yours to keep, forever.