How do you estimate the number of web crawlers required, considering the number of connections a node can maintain, bandwidth, and other factors?
As usual, you should state some assumptions first. Here is what I would do:
- We are using a Gigabit connection (~125 MB/s), so downloading 1 GB takes about 8 seconds; call it 9 seconds to allow for protocol overhead;
- A single page averages about 100 KB of HTML, so 15 billion pages come to roughly 1,500 TB (1.5 PB) of data;
Let's do the math for a single node downloading all of this data:

1,500 TB = 1,500,000 GB; 1,500,000 GB × 9 sec = 13,500,000 sec ≈ 3,750 hours ≈ 156 days

A single node with a single connection would take about 156 days to download all this data. Remember that we are considering only HTML text.

The problem states that the crawl must finish in two weeks, i.e. 14 days.

156 / 14 ≈ 11.2, so we need about 12 nodes, each with a Gigabit connection and a single download stream.
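The estimate above can be sketched as a small script, which makes it easy to swap in different assumptions (page size, bandwidth, deadline). All the constants here are the illustrative values from this answer, not measured figures:

```python
import math

# Assumed inputs (from the reasoning above) -- adjust as needed.
PAGES = 15_000_000_000          # 15 billion pages
PAGE_SIZE_BYTES = 100 * 1_000   # ~100 KB of HTML per page
SECONDS_PER_GB = 9              # ~1 Gbps link, with some protocol overhead
DEADLINE_DAYS = 14              # two-week crawl window

def nodes_required(pages=PAGES, page_size=PAGE_SIZE_BYTES,
                   sec_per_gb=SECONDS_PER_GB, deadline_days=DEADLINE_DAYS):
    """Return (total_tb, single_node_days, nodes) for the crawl."""
    total_bytes = pages * page_size
    total_gb = total_bytes / 1e9
    single_node_sec = total_gb * sec_per_gb
    single_node_days = single_node_sec / 86_400
    nodes = math.ceil(single_node_days / deadline_days)
    return total_bytes / 1e12, single_node_days, nodes

tb, days, nodes = nodes_required()
print(f"{tb:,.0f} TB total, {days:.0f} days on one node, {nodes} nodes needed")
```

Note that this is a pure bandwidth estimate; a real deployment would also need headroom for politeness delays, DNS lookups, retries, and parsing, so the actual node count would be somewhat higher.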
Hope this helps.