Range handler redundancy

Alexander_Ryabets · November 1, 2023, 3:52pm

I have a question about the range handler microservice. You have stated that:

We use a microservice called range handler that keeps a record of all the taken and available ranges. The status of each range can determine if a range is available or not. The state—that is, which server has what range assigned to it—can be saved on a replicated storage.

This microservice can become a single point of failure, but a failover server acts as the savior in that case. The failover server hands out ranges when the main server is down. We can recover the state of available and unavailable ranges from the latest checkpoint of the replicated store.

My question is: is this microservice always deployed as a single instance? What about redundancy? Also, if we have a geo-distributed system with, let’s say, one client in the EU and another in the US, there could potentially be very high latency to the range handler if it is deployed as a single instance.

If there could be many instances of this service, how could it be sure that the range is not taken, since there is no way to obtain a lock?

Alexander_Ryabets · November 1, 2023, 3:54pm

And if the service is geo-distributed and the replicated store is as well, then how can transactional behavior be achieved? In other words, when the range handler writes a range to the store, how can it be sure that the same range hasn’t just been written by another replica in another location but simply not yet propagated to all nodes?

Muhammad_Usman_Malik · January 1, 2024, 5:59am

Hi Alexander

No. We should see the range handler as a managed service that provides a simple API for its clients to get ranges and that can tolerate different kinds of failures. If we use this service at a geo-distributed scale, we need extra care (for example, because of the CAP theorem, we have to decide what to do when a network partition happens). One possible design point is as follows:

We can have a hierarchy of range handlers, where a second-tier range handler (say, in one region) asks an upper-tier range handler for a relatively long range (possibly using a distributed transaction). Once that range is acquired, the second-tier range handler can hand out sequences locally. We can add another level of range handlers to avoid a single point of failure within a region as well.
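A rough sketch of the two-tier idea, just to make it concrete. The class names, block sizes, and the in-process lock are all assumptions for illustration; in a real deployment the upper tier would sit behind a distributed transaction or a consensus-backed store rather than a local lock.

```python
import threading


class UpperTierRangeHandler:
    """Hypothetical top tier: hands out large, non-overlapping blocks."""

    def __init__(self, block_size):
        self._block_size = block_size
        self._next_start = 0
        # Stand-in for a distributed transaction; only safe within one process.
        self._lock = threading.Lock()

    def allocate_block(self):
        with self._lock:
            start = self._next_start
            self._next_start += self._block_size
        return start, start + self._block_size  # half-open [start, end)


class RegionalRangeHandler:
    """Hypothetical second tier: carves a large block into local ranges."""

    def __init__(self, parent, sub_size):
        self._parent = parent
        self._sub_size = sub_size
        self._cursor, self._end = parent.allocate_block()
        self._lock = threading.Lock()

    def allocate_range(self):
        with self._lock:
            # Ask the upper tier again only when the local block is used up.
            # Any leftover smaller than sub_size is simply discarded.
            if self._cursor + self._sub_size > self._end:
                self._cursor, self._end = self._parent.allocate_block()
            start = self._cursor
            self._cursor += self._sub_size
        return start, start + self._sub_size


upper = UpperTierRangeHandler(block_size=1000)
eu = RegionalRangeHandler(upper, sub_size=100)
us = RegionalRangeHandler(upper, sub_size=100)
eu_first = eu.allocate_range()
us_first = us.allocate_range()
```

Because each region owns a disjoint block, the ranges handed out by `eu` and `us` never overlap, and the cross-region call happens once per block instead of once per range.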

So the important point in the above design is that getting a range is not on the critical path, meaning we don’t do it when a client asks for a unique number. We try to get a sufficient range ahead of time.
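One way to picture "ahead of time" is a local allocator that requests a standby range once the current one runs low. This is only a sketch under assumptions: `fetch_range` is a hypothetical callable standing in for the call to the range handler, `low_watermark` is an invented tuning knob, and the refill is done inline here, whereas a real service would likely do it in the background.

```python
class PrefetchingIdAllocator:
    """Hands out unique numbers locally and refills before running out."""

    def __init__(self, fetch_range, low_watermark):
        # fetch_range: hypothetical callable returning a half-open (start, end).
        self._fetch_range = fetch_range
        self._low_watermark = low_watermark
        self._current = fetch_range()
        self._cursor = self._current[0]
        self._standby = None

    def _remaining(self):
        return self._current[1] - self._cursor

    def next_id(self):
        if self._remaining() == 0:
            # Normally the standby range is already here; fetching at this
            # point is only a fallback if the prefetch never happened.
            if self._standby is None:
                self._standby = self._fetch_range()
            self._current, self._standby = self._standby, None
            self._cursor = self._current[0]
        value = self._cursor
        self._cursor += 1
        # Prefetch the next range while there are still numbers left locally.
        if self._standby is None and self._remaining() <= self._low_watermark:
            self._standby = self._fetch_range()
        return value


calls = []


def fake_fetch():
    # Fake range handler for the example: returns consecutive ranges of 5.
    start = len(calls) * 5
    calls.append(start)
    return start, start + 5


alloc = PrefetchingIdAllocator(fake_fetch, low_watermark=2)
ids = [alloc.next_id() for _ in range(12)]
```

With this shape, a client request for a unique number is served from local state, and the slow call to the range handler has normally finished before the current range is exhausted.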

Thank you