educative.io

Worker passes that URL to the relevant protocol module, which initializes the DIS from a network connection to contain the document’s contents

Each worker thread has an associated DIS, which it reuses from document to document. After extracting a URL from the frontier, the worker passes that URL to the relevant protocol module, which initializes the DIS from a network connection to contain the document’s contents. The worker then passes the DIS to all relevant processing modules.
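To make the quoted flow concrete, here is a minimal single-process sketch in Python. It assumes the DIS is essentially a reusable, rewindable in-memory buffer; the course text does not spell out its internals, so `DocumentInputStream`, `fake_http_module`, `PAGES`, and the two processing modules are hypothetical names made up for illustration. The protocol module is a stand-in that returns canned bytes instead of opening a network connection.

```python
import io
import re
from collections import deque

# Canned "web" so the sketch runs without a network (illustrative data only).
PAGES = {
    "http://example.com/a": b'<a href="http://example.com/b">b</a>',
    "http://example.com/b": b"no links",
}


class DocumentInputStream:
    """Hypothetical DIS: one reusable, rewindable buffer holding one document."""

    def __init__(self):
        self._buf = io.BytesIO()

    def load(self, data: bytes) -> None:
        # Discard the previous document and reuse the same buffer object.
        self._buf.seek(0)
        self._buf.truncate()
        self._buf.write(data)
        self._buf.seek(0)

    def rewind(self) -> None:
        self._buf.seek(0)

    def read(self) -> bytes:
        return self._buf.read()


def fake_http_module(url: str, dis: DocumentInputStream) -> None:
    # Stand-in protocol module: a real one would fill the DIS from a socket.
    dis.load(PAGES.get(url, b""))


def extract_links(dis: DocumentInputStream) -> list:
    # Processing module 1: rewind, then scan the whole document for hrefs.
    dis.rewind()
    return re.findall(rb'href="([^"]+)"', dis.read())


def document_size(dis: DocumentInputStream) -> int:
    # Processing module 2: re-reads the same document independently.
    dis.rewind()
    return len(dis.read())


def run_worker(frontier: deque, protocol_module, processors) -> dict:
    dis = DocumentInputStream()  # one DIS per worker, reused for every URL
    results = {}
    while frontier:
        url = frontier.popleft()
        protocol_module(url, dis)  # protocol module initializes the DIS
        results[url] = [process(dis) for process in processors]
    return results


if __name__ == "__main__":
    frontier = deque(["http://example.com/a", "http://example.com/b"])
    print(run_worker(frontier, fake_http_module, [extract_links, document_size]))
```

In this sketch the DIS lives inside one worker, and each processing module rewinds it before reading, so several modules can consume the same fetched document without fetching it again.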

Is the DIS just an object that encapsulates the HTML from fetching?

Hi @Shaheryaar_Kamal / @Design_Gurus, this is my question as well.

For example, if you have 1000 workers (machines) whose sole job is to fetch and 150 other machines whose sole job is to parse, how is this DIS going to work? What shape does it take?

[fetcher] -> [DIS?] -> [parser]