educative.io

URL Frontier - Load Distribution

Let’s also assume that our hash function maps each URL to a server which will be responsible for crawling it.

Given the above statement, does it mean there will be a separate dispatcher component that dispatches each URL to be crawled using the hash function? What would that hash function look like?

Once a URL (e.g., URL A) is dispatched to a particular server (e.g., Server A), it goes to a particular thread (e.g., Thread A) that handles crawling for a particular host (e.g., Host A).

What happens if a link inside the document at URL A points to another host (e.g., Host B)?
=> Which queue will Thread A put that URL into?

@Design_Gurus could you please help me understand this?

Thread A’s job is to crawl the web, not to put URLs into any queue. From what I understand, this is what will happen: when Thread A crawls URL A, its content, including URL B, is extracted by the extractor. After the URL dedup test and filtering, URL B is placed back in the URL frontier for further crawling. At that point, the dispatcher hashes URL B and can place it on any of the servers based on the hash generated.
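To make the dispatching step concrete, here is a minimal sketch of what such a hash function could look like. This is an assumption, not the course's actual implementation: it hashes the URL's hostname (rather than the full URL) so that all URLs from the same host map to the same server, which also makes per-host politeness easier to enforce.

```python
import hashlib
from urllib.parse import urlparse

def pick_server(url: str, num_servers: int) -> int:
    """Map a URL to a crawler server by hashing its hostname.

    Hashing the hostname (not the full URL) keeps every URL from
    the same host on the same server.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

# All URLs from the same host land on the same server index:
s1 = pick_server("https://walmart.com/item/1", 4)
s2 = pick_server("https://walmart.com/item/2", 4)
assert s1 == s2
```

With this scheme, when the extractor finds URL B (a different host), the dispatcher simply recomputes `pick_server` for it, which may route it to a different server than the one that crawled URL A.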

Hope this makes sense!


[URL frontier]

  • keeps a queue of URLs to be visited; there is also a hash map that stores hostname-to-crawler-server mappings

[Server A (or B or C)]

  • each server deals with a list of specific hostnames (e.g., all URLs from walmart.com go to Server B according to the URL frontier and its hash map)
  • each server has a FIFO queue that stores what needs to be processed (assume it is assigned from the URL frontier)
  • each server then works through its FIFO queue and assigns tasks to its crawler threads
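The per-server part of the notes above can be sketched as follows. This is a simplified assumption of the design, not the course's code: one FIFO of URLs (as if assigned by the frontier) drained by a small pool of crawler threads, where the hypothetical `crawl` step stands in for the real fetch-and-extract work.

```python
import queue
import threading

def run_server(assigned_urls, num_threads=3):
    """Sketch of one crawler server: drain a FIFO of assigned URLs
    with a pool of crawler threads."""
    fifo = queue.Queue()
    for url in assigned_urls:
        fifo.put(url)

    crawled = []
    lock = threading.Lock()

    def crawl_worker():
        while True:
            try:
                url = fifo.get_nowait()
            except queue.Empty:
                return  # FIFO drained; thread exits
            # Stand-in for fetch + extract; a real crawler would
            # download the page and hand links to the extractor.
            with lock:
                crawled.append(url)
            fifo.task_done()

    threads = [threading.Thread(target=crawl_worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return crawled
```

Note that in this sketch the threads share one server-level FIFO; a design that pins each host to a dedicated thread would instead keep one small queue per host inside the server.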

Would you mind further clarifying the following two questions:

  1. To fill the FIFO on each server, does the server pull from the URL frontier, or does the URL frontier push items to each server?
  2. Does the concept of a “crawling worker” in the tutorial map to a server, or to a thread within a server (which may have multiple threads)?