Promote membership in the ring to detect failures

Ravish_Aanand_4-Yr_B · August 30, 2023, 12:23pm

How does a gossip-based protocol help in detecting faults? I didn’t understand this clearly.

Course: Grokking Modern System Design Interview for Engineers & Managers - Learn Interactively
Lesson: Enable Fault Tolerance and Failure Detection - Grokking Modern System Design Interview for Engineers & Managers

Umer_Draz · October 10, 2023, 7:02am

Hi Ravish.

Gossip protocols are sometimes called “epidemic protocols” because they operate like the spread of an infectious disease. Each node exchanges information with the nodes in its token set, which then pass it on to the nodes in their own token sets, and so on until the information has spread throughout the network.
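
To make the spreading pattern concrete, here is a minimal Python sketch. It is an illustration only: the five-node ring, the token sets in `TOKEN_SETS`, and the function name `spread` are all assumptions made up for this example, and real systems usually pick gossip peers differently.

```python
# Hypothetical token sets for a five-node ring (assumed for illustration).
TOKEN_SETS = {
    "A": ["B", "E"],
    "B": ["C", "D"],
    "C": ["D", "A"],
    "D": ["C", "E"],
    "E": ["A", "B"],
}


def spread(token_sets, origin):
    """Count the gossip rounds needed for news from `origin` to reach everyone.

    In each round, every node that already has the news tells the nodes
    in its token set.
    """
    informed = {origin}
    rounds = 0
    while len(informed) < len(token_sets):
        newly_informed = {
            peer for node in informed for peer in token_sets[node]
        } - informed
        if not newly_informed:
            break  # the remaining nodes are unreachable from the origin
        informed |= newly_informed
        rounds += 1
    return rounds, informed


rounds, informed = spread(TOKEN_SETS, "A")
print(rounds, sorted(informed))
```

No node talks to every other node directly, yet the news still reaches all of them after a few rounds.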

When a node cannot reach another node even after repeated tries, it suspects that the unreachable node has failed. The detecting node can then propagate this suspicion to the other live nodes. If the failure persists, it is reported to the system admin.
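
The detection step can be sketched roughly like this. It is a simplified illustration under several assumptions: the `Node` class, `gossip_round`, `simulate`, the token sets, and the `max_retries` threshold are hypothetical names and values, and the sketch ignores node recovery and the report to the admin.

```python
# Hypothetical token sets for a five-node ring (assumed for illustration).
TOKEN_SETS = {
    "A": ["B", "E"],
    "B": ["C", "D"],
    "C": ["D", "A"],
    "D": ["C", "E"],
    "E": ["A", "B"],
}


class Node:
    def __init__(self, name, token_set):
        self.name = name
        self.token_set = token_set
        self.alive = True
        # Consecutive failed contact attempts per peer in the token set.
        self.missed = {peer: 0 for peer in token_set}
        # Names of nodes this node currently believes have failed.
        self.suspected = set()


def gossip_round(nodes, max_retries):
    """Each live node contacts its token set once and shares its suspicions."""
    for node in nodes.values():
        if not node.alive:
            continue
        for peer_name in node.token_set:
            peer = nodes[peer_name]
            if peer.alive:
                node.missed[peer_name] = 0
                # Pass known suspicions on to the reachable peer.
                peer.suspected |= node.suspected - {peer.name}
            else:
                node.missed[peer_name] += 1
                if node.missed[peer_name] >= max_retries:
                    node.suspected.add(peer_name)


def simulate(rounds, failed="C", max_retries=3):
    """Fail one node, run some gossip rounds, return each live node's suspicions."""
    nodes = {name: Node(name, peers) for name, peers in TOKEN_SETS.items()}
    nodes[failed].alive = False
    for _ in range(rounds):
        gossip_round(nodes, max_retries)
    return {n.name: sorted(n.suspected) for n in nodes.values() if n.alive}


print(simulate(2))  # too few failed tries, so nobody suspects C yet
print(simulate(5))  # every live node has learned that C is suspected
```

Only B and D have C in their token sets here, so only they notice the failure first-hand. A and E learn about it purely through gossip.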

I hope that clears up the concern. If you have any further questions, feel free to reach out to us.

Thank you.

Nxa · December 16, 2023, 3:20am

From the text: “Now, node A handles a request that results in a change, so it communicates this to B and E. Another node, D, has C and E in its token set. It makes a change and tells C and E.”

What does “handles a request” mean? A write operation? So, apart from replicating the data (possibly to different nodes), will the coordinator node also inform the nodes in its token set, in addition to those on the preference list? And what exactly is communicated: that a write operation was performed?

Ali_Hassan · February 26, 2024, 9:17am

Hi Nxa,

“handles a request” typically refers to processing a read or write operation on a specific key. When a node, such as node A, handles a request that results in a change (e.g., a write operation), it not only updates its local data but also communicates this change to other nodes in its token set or replica set.

In this context, after a coordinator node performs a write operation on a key, it is responsible for propagating the update to the other nodes that either store replicas of the data or are part of the token set for that key. This communication ensures that the latest version of the data is replicated across multiple nodes for fault tolerance and high availability.

Therefore, in the scenario described, when node A handles a request that results in a change (e.g., a write operation), it communicates this update to nodes B and E. Similarly, node D communicates its changes to nodes C and E. This communication mechanism helps in ensuring data consistency and availability across multiple nodes in the system.
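
As a rough sketch of that scenario in Python: the token sets for A and D are taken from the lesson's example, while `handle_write`, the `version` field, the key names, and the message contents are assumptions for illustration. The lesson does not spell out the exact message format.

```python
# Token sets from the lesson's example: A tells B and E, D tells C and E.
TOKEN_SETS = {"A": ["B", "E"], "D": ["C", "E"]}


def handle_write(stores, coordinator, key, value, version):
    """Apply a write at the coordinator, then push it to its token set.

    Each store maps key -> (version, value). A peer accepts the update
    only if it is newer than what it already holds (assumed rule).
    """
    update = (version, value)
    stores[coordinator][key] = update
    for peer in TOKEN_SETS[coordinator]:
        current = stores[peer].get(key)
        if current is None or current[0] < version:
            stores[peer][key] = update


def demo():
    stores = {name: {} for name in "ABCDE"}
    handle_write(stores, "A", "cart", ["book"], version=1)  # A tells B and E
    handle_write(stores, "D", "profile", "v2", version=1)   # D tells C and E
    return stores


for name, store in demo().items():
    print(name, store)
```

E sits in both token sets, so it ends up knowing about both changes, while B and C each know about one.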

We hope that clears up the confusion. Feel free to reach out to us if you have any further questions.
Thank you.