MapReduce part to process log files

So why not just asynchronously update a DB (PK: search phrase, col2: frequency) for each search query to maintain the frequency count? The DB can easily be distributed across multiple machines by some hash function. This seems like the easiest approach to me :confused:
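To make that concrete, here is a minimal sketch of the upsert-per-query idea. The `term_counts` table, the four-shard setup, and the in-memory SQLite stand-ins for the distributed DB are my own assumptions, not anything from the lesson:

```python
import hashlib
import sqlite3

NUM_SHARDS = 4  # assumed shard count; real shards would be separate machines

# One in-memory SQLite DB per "machine", just so the sketch runs end to end.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE term_counts (phrase TEXT PRIMARY KEY, frequency INTEGER)")

def shard_for(phrase: str) -> int:
    """Route a search phrase to a shard by hashing it."""
    return int(hashlib.md5(phrase.encode()).hexdigest(), 16) % NUM_SHARDS

def record_search(phrase: str) -> None:
    """Upsert the frequency count (PK: search phrase, col2: frequency).
    In the real system this write would be queued and applied asynchronously."""
    db = shards[shard_for(phrase)]
    db.execute(
        "INSERT INTO term_counts VALUES (?, 1) "
        "ON CONFLICT(phrase) DO UPDATE SET frequency = frequency + 1",
        (phrase,),
    )
    db.commit()

record_search("system design")
record_search("system design")  # frequency for this phrase is now 2
```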

I think the reason could be the size of the data we are handling.
In the estimation we assumed that only 20% of the requests result in a unique search term; that gives us 10K unique search terms per second.

If you take 30 bytes to store each term, that requires storing around 300 KB/s, which comes to about 1 GB per hour, or roughly 26 GB per day. We can't keep this much data in a single file and simply update the count of each string; we would have to store it in a distributed manner, which makes it very difficult to keep the counts up to date. So I think that's why they introduced HDFS and MapReduce to do the count processing afterwards.
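Spelling that arithmetic out (just re-deriving the figures above):

```python
# Back-of-envelope check of the storage numbers quoted above.
unique_terms_per_sec = 10_000   # 20% of requests yield a new unique term
bytes_per_term = 30             # assumed average size of a stored term

per_sec  = unique_terms_per_sec * bytes_per_term   # 300,000 B = 300 KB/s
per_hour = per_sec * 3_600                         # ~1.08 GB/hour
per_day  = per_sec * 86_400                        # ~25.9 GB/day
print(f"{per_sec / 1e3:.0f} KB/s, {per_hour / 1e9:.2f} GB/hour, "
      f"{per_day / 1e9:.1f} GB/day")
```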

But I think we could also replace MapReduce/HDFS with a distributed cache such as Redis or Memcached and keep the key-value pairs there. That way we can update the count for a key instantaneously, and after one hour we can flush all the key-values into the trie structure.
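Roughly what I have in mind, sketched with redis-py; the `count:` key prefix and the `apply_to_trie` callback are placeholders, not anything from the lesson:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis

def record_search(phrase: str) -> None:
    # INCR is atomic, so many app servers can bump the same key concurrently.
    r.incr(f"count:{phrase}")

def hourly_flush(apply_to_trie) -> None:
    # Walk all counters, hand each (phrase, count) to the trie, then reset.
    # A production flush should use GETDEL or a MULTI/EXEC pipeline so that
    # increments arriving between get() and delete() are not lost.
    for key in r.scan_iter(match="count:*"):
        count = int(r.get(key))
        apply_to_trie(key.decode().removeprefix("count:"), count)
        r.delete(key)
```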

The question with the first approach is how it will work with partitioned data.

  1. Would every single key-value pair be sent through the LB to identify the right server and then update its secondary server?
  2. Or would MapReduce sort and group the results based on the partitioning strategy and flush them in batches to every server? If so, can MapReduce do this grouping by itself, and how would it learn the latest partitioning strategy in the case of maximum-capacity-based partitioning? (See the sketch after this list.)
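For (2), here is a rough sketch of what that grouping could look like. In Hadoop this would be a custom `Partitioner`; the `PARTITION_MAP` ranges and server names below are made up, and how the job learns the current boundaries (e.g. from ZooKeeper or a config service) is exactly the open question:

```python
# Assumed capacity-based ranges: phrases up to each upper bound (inclusive)
# belong to the listed trie server.
PARTITION_MAP = [
    ("f", "trie-server-1"),
    ("p", "trie-server-2"),
    ("~", "trie-server-3"),  # "~" sorts after all lowercase letters
]

def partition(phrase: str) -> str:
    """Map a key to the server owning its range, like a Hadoop Partitioner."""
    for upper, server in PARTITION_MAP:
        if phrase[:1] <= upper:
            return server
    return PARTITION_MAP[-1][1]

def group_for_flush(counts: dict[str, int]) -> dict[str, list[tuple[str, int]]]:
    """Group reduced (phrase, count) pairs per server for a batched flush."""
    batches: dict[str, list[tuple[str, int]]] = {}
    for phrase, count in counts.items():
        batches.setdefault(partition(phrase), []).append((phrase, count))
    return batches

print(group_for_flush({"apple": 42, "query": 7, "hadoop": 3}))
```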

@Design_Gurus Please throw some light on this.