educative.io

Separate the indexing and search

I don’t understand how a search node knows which index file to pick up. If we have multiple indexer nodes, and each of them uploads a small index file (created by that particular indexer) to distributed storage, how do we avoid the scenario where two or more search nodes analyze the same index file? Since we have N search nodes that are supposed to work in parallel, each of them would somehow need to know which files have already been picked up by another node. I don’t see how we can implement that.

The search nodes are not going to pick up the index files themselves. A cluster manager distributes the index files to all the search nodes in the cluster, and it uses a distributed and parallel programming model like MapReduce to perform this task.

Thanks for the reply! I still don’t clearly understand how the cluster manager distributes files to search nodes. For example, if we have N different index files (since each indexer node creates a different index file), does the cluster manager distribute all N index files to each search node? That is, does one search node get ALL N index files, another search node get ALL N files too, and so on? If not, how does the cluster manager know which index file to send to a particular search node?

If there are N index files, and S number of search nodes, the cluster manager can split the N index files into S partitions; one partition per search node.
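
As a rough illustration of that split, here is a minimal Python sketch. It assumes a simple round-robin rule, which is only one possible way to form the partitions; the function name `partition_index_files` and the file names are hypothetical and not taken from the lesson.

```python
def partition_index_files(index_files, num_search_nodes):
    """Split a list of index files into one partition per search node.

    Round-robin is an assumption made for this sketch; a real cluster
    manager may use a different partitioning rule.
    """
    if num_search_nodes <= 0:
        raise ValueError("at least one search node is required")

    # One (initially empty) partition per search node.
    partitions = [[] for _ in range(num_search_nodes)]

    # File i goes to partition i mod S, so no file lands in two partitions.
    for i, index_file in enumerate(index_files):
        partitions[i % num_search_nodes].append(index_file)

    return partitions


# Hypothetical example: N = 5 index files, S = 2 search nodes.
files = ["idx-a", "idx-b", "idx-c", "idx-d", "idx-e"]
partitions = partition_index_files(files, 2)
```

Because every index file is placed in exactly one partition, no two search nodes end up working on the same file, and no search node needs to hold all N files.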

Yes, I understand that. But let’s assume we have S search nodes (numbered 1 to S) and S partitions of index files (numbered 1 to S). How does the cluster manager know that it needs to distribute partition 1 to search node 1, partition 2 to search node 2, and so on? And what if we add a new indexer node, so that we have S+1 indexer nodes and S search nodes? In that case, the cluster manager should distribute the (S+1)st partition of index files to one of the S search nodes, right? Which algorithm is used to achieve this?

The cluster manager uses the MapReduce programming model to distribute the data (index files) to the available search nodes in the cluster. The allocation of index files’ partitions to the search nodes can be done randomly or based on each search node’s computation and storage resources.
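
To make the two allocation options concrete, here is a small Python sketch. It is not the actual cluster manager logic. The greedy "most free capacity first" rule, the function names, and the sizes and capacities are all assumptions chosen for illustration.

```python
import heapq
import random


def assign_randomly(partitions, search_nodes, seed=None):
    """Option 1: give each partition to a randomly chosen search node."""
    rng = random.Random(seed)
    return {partition: rng.choice(search_nodes) for partition in partitions}


def assign_by_capacity(partition_sizes, node_capacity):
    """Option 2: give each partition to the node with the most free capacity.

    partition_sizes: {partition name: size}, node_capacity: {node name: capacity}.
    The greedy rule is an assumption of this sketch, not a prescribed algorithm.
    """
    # heapq is a min-heap, so store negated free capacity to pop the largest.
    heap = [(-capacity, node) for node, capacity in node_capacity.items()]
    heapq.heapify(heap)

    assignment = {}
    # Place the largest partitions first so they are most likely to fit.
    for partition, size in sorted(partition_sizes.items(), key=lambda kv: -kv[1]):
        negative_free, node = heapq.heappop(heap)
        free = -negative_free
        if size > free:
            raise ValueError(f"no search node has room for {partition}")
        assignment[partition] = node
        heapq.heappush(heap, (-(free - size), node))

    return assignment


# Hypothetical example: three partitions, two search nodes (S+1 partitions, S nodes).
sizes = {"p1": 50, "p2": 30, "p3": 20}
capacity = {"s1": 100, "s2": 40}
assignment = assign_by_capacity(sizes, capacity)
```

With a rule like this, the number of partitions does not have to match the number of search nodes. An extra partition produced by a new indexer node simply goes to whichever search node has the most room left.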