educative.io

Politeness constraint

To implement this politeness constraint, our crawler can have a collection of distinct FIFO sub-queues on each server. Each worker thread will have its own sub-queue, from which it removes URLs for crawling. When a new URL needs to be added, the FIFO sub-queue in which it is placed will be determined by the URL’s canonical hostname. Our hash function can map each hostname to a thread number. Together, these two points imply that at most one worker thread will download documents from a given web server, and because that thread works through its FIFO queue, it will not overload the web server.
How will the above make sure that a web server does not get overloaded? Is it forcing URLs with the same hostname to go to the same queue?

Hi @Dewey_Munoz!

Yes. As per the explanation, the hash function maps each hostname to one worker thread, and each worker thread has its own sub-queue. So, when a URL arrives, it is placed in the sub-queue of the thread assigned to its hostname, meaning that all URLs for a given host go to a single thread instead of possibly being spread across multiple threads. Moreover, a thread removes URLs from its FIFO queue one at a time, so it processes only one URL at a time. Since only that thread ever contacts the host, the web server receives at most one document request at a time from the crawler, which keeps it from being overloaded.
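
To make the mapping concrete, here is a rough sketch in Python. The names `NUM_WORKERS`, `sub_queues`, `worker_for`, and `enqueue` are my own for illustration and do not come from the course, and SHA-256 is just one possible stable hash. It only shows how URLs are routed to sub-queues, not the crawling itself.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_WORKERS = 4  # assumed number of worker threads on this server

# One FIFO sub-queue per worker thread (hypothetical layout).
sub_queues = [deque() for _ in range(NUM_WORKERS)]


def worker_for(url: str) -> int:
    """Map a URL's hostname to a worker index using a stable hash."""
    # urlsplit(...).hostname is already lowercased; fall back to "" if missing.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_WORKERS


def enqueue(url: str) -> int:
    """Append the URL to the sub-queue owned by its hostname's worker."""
    idx = worker_for(url)
    sub_queues[idx].append(url)
    return idx
```

Because `worker_for` depends only on the hostname, every URL from the same host lands in the same sub-queue, and the worker that owns it takes URLs from the left end (`popleft()`) one at a time. Different hostnames can still hash to the same worker, which is fine: the guarantee is one thread per host, not one host per thread.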

I hope this helps.