educative.io

Performing regular checkpointing and storing FIFO queues to disks

All our crawling servers will be performing regular checkpointing and storing their FIFO queues to disks. If a server goes down, we can replace it. Meanwhile, consistent hashing should shift the load to other servers.
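To make the two mechanisms in this passage concrete, here is a minimal sketch, not the course's actual design. It assumes each crawling server keeps its pending URLs in an in-memory FIFO queue, that a checkpoint is simply a snapshot of that queue written atomically to a local file, and that hostnames are assigned to servers through a consistent-hash ring. The names `HashRing`, `checkpoint`, and `restore`, the use of MD5, and the JSON file format are all hypothetical choices made for illustration.

```python
import bisect
import hashlib
import json
import os
import tempfile
from collections import deque


def _hash(key: str) -> int:
    # MD5 is used only as a stable, well-spread hash (an assumption, not a requirement).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring mapping hostnames to crawl servers."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) virtual nodes
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        # Dropping a server's virtual nodes hands its keys to the next nodes on the ring.
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, hostname):
        # First virtual node clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (_hash(hostname), ""))
        return self._ring[idx % len(self._ring)][1]


def checkpoint(queue, path):
    """Write a snapshot of the FIFO queue to disk, replacing the old one atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(list(queue), f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash leaves either the old or the new snapshot


def restore(path):
    """Rebuild the FIFO queue from the last snapshot, or start empty if none exists."""
    try:
        with open(path) as f:
            return deque(json.load(f))
    except FileNotFoundError:
        return deque()


if __name__ == "__main__":
    ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
    owner = ring.lookup("example.com")
    ring.remove(owner)  # simulate that server going down
    print(owner, "->", ring.lookup("example.com"))
```

In this sketch, removing a failed server from the ring reassigns only the hostnames that server owned, and a replacement server can call `restore` on the last snapshot to pick up its queue in the original order. Anything enqueued after the last checkpoint would be lost, which is the usual trade-off of periodic snapshots.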

How is checkpointing different from storing the FIFO queues to disk? How do the FIFO queues work? Are we saying that all URLs from a single domain will go into a single queue? Would that lead to unbalanced queues if some domains have many more URLs than others? What process populates the queues? Is that done in the URL frontier?