educative.io

False definition of reliability

“By definition, reliability is the probability a system will fail in a given period.”

This is wrong. It’s 1 minus that number.


Course: Grokking the System Design Interview - Learn Interactively
Lesson: Key Characteristics of Distributed Systems - Grokking the System Design Interview

Hi @Sam, thank you for contacting us.

Reliability is the ability of a system to function without failure. In simple words, a reliable system performs its intended tasks without failing or collapsing. The higher the probability that a system will fail at a given time or under a given condition, the less reliable it is; the higher the probability that it will perform all of its intended tasks successfully, the more reliable it is. The statement in the lesson is not meant as a mathematical definition, which is why the author tried to explain it in simple words.

I believe that’s what OP was saying. The lesson states, “By definition, reliability is the probability a system will fail in a given period.” Read literally, that means a highly reliable system has a high probability of failing in a given time period, which is incorrect. I would expect the definition to be something like “By definition, reliability is the probability a system will not fail in a given period.” In other words, if something is highly reliable, its likelihood of failure is low. High reliability == high probability of not failing ==> low probability of failing.
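
To make the relationship concrete, here is a minimal Python sketch. The function names are my own and do not come from the lesson. The second function additionally assumes a constant failure rate (the common exponential model), which the lesson does not state; it is included only to show that reliability falls as the failure rate or the time period grows.

```python
import math


def reliability_from_failure_probability(p_fail: float) -> float:
    """Reliability over a period = 1 - probability of failing in that period."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability between 0 and 1")
    return 1.0 - p_fail


def exponential_reliability(failure_rate: float, duration: float) -> float:
    """Probability of surviving `duration` with no failure.

    Assumption (not from the lesson): failures occur at a constant rate,
    so R(t) = exp(-failure_rate * t).
    """
    if failure_rate < 0 or duration < 0:
        raise ValueError("failure_rate and duration must be non-negative")
    return math.exp(-failure_rate * duration)


if __name__ == "__main__":
    # A system with a 5% chance of failing in the period is 95% reliable,
    # not 5% reliable as the lesson's wording would imply.
    print(reliability_from_failure_probability(0.05))
    # Under the assumed exponential model, a higher failure rate
    # gives a lower reliability over the same period.
    print(exponential_reliability(0.001, 100.0))
    print(exponential_reliability(0.01, 100.0))
```

Either way, the failure probability and the reliability always sum to 1, which is why the lesson's sentence needs the word "not".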

Hi @unybble
In simple terms, a distributed system is considered reliable if it keeps delivering its services even when one or several of its software or hardware components fail. Thus, reliability represents one of the main characteristics of any distributed system. Any failing machine can always be replaced by another healthy one in such systems, ensuring the completion of the requested task.
Reliability is the ability of a system to remain available over a period of time. Reliable systems are those that can continuously perform their core functions without service disruptions, errors, or significant reductions in performance. However, there are many different ways a system can fail, especially as it becomes larger, more dynamic, and more complex. Our systems, and the people operating them, must be able to recover from these failures. This recoverability is called resilience. To maximize availability, systems must be both reliable and resilient.

I hope this clears up your confusion. Happy learning!

Hi, that is exactly what we are saying. We are not confused about the definitions; it appears that the lesson has a typo. That is all.

@unybble
Thanks for pointing this out. We’ll look into it.
