Aggregating the phrases count

Amr1 · February 27, 2024, 9:30pm

I’m a bit confused about the overall structure. As I understand it, we write the phrase log to distributed storage, which is later processed with MapReduce to generate the latest view of the phrase frequencies and store it in Cassandra. We then build tries from this mapping and store them in MongoDB.

So here are my questions:

What format do we use to write the phrase log?
Do we aggregate older counts in the MapReduce stage, or only newer phrases?
Why do we use both Cassandra and MongoDB instead of just one of them?
How do we handle failures in the MapReduce stage, where a term that was already processed gets processed again?
How do we store the trie in MongoDB?

Ali_Hassan · March 27, 2024, 10:48am

Hi Amr,

Here are the answers to your questions:

The phrase log can be written as JSON or CSV, since both are easy for MapReduce jobs to process.
The MapReduce stage aggregates the frequencies of all phrases, new and existing, over a specific time interval, not just the new ones.
The system uses both Cassandra (for storing the aggregated phrase frequencies from the MapReduce jobs) and MongoDB (for storing the actual trie data structure) because of their distinct features: Cassandra is a wide-column store suited to write-heavy workloads, while MongoDB is a document store that maps naturally onto node-like records.
In case of failures, MapReduce frameworks typically offer features for retries and job restarts. Additionally, you can implement mechanisms to track processed data to avoid re-processing the same phrases if a job fails and restarts.
You can represent the trie nodes as documents with fields for the character stored in the node, child nodes (references to other documents), and potentially the frequency count associated with that node (if storing frequencies within the trie itself).
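
To make the points about aggregation, retries, and the trie layout a bit more concrete, here is a minimal Python sketch. It is only an illustration under several assumptions: the log is JSON Lines with a hypothetical "phrase" field, a plain dict stands in for the Cassandra frequency table, a set of batch ids stands in for the "already processed" tracking, and the trie documents are plain dicts keyed by prefix rather than real MongoDB documents. All function and field names are made up for the example.

```python
import json
from collections import Counter


def map_reduce_counts(log_lines):
    """Map each JSON log line to (phrase, 1) and reduce by summing per phrase."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        # "phrase" is an assumed field name for the logged search phrase.
        counts[record["phrase"].strip().lower()] += 1
    return counts


def apply_batch(store, processed_batches, batch_id, log_lines):
    """Merge one batch into the running totals, at most once per batch id.

    `store` stands in for the Cassandra table and `processed_batches` for a
    durable record of finished batches. Returns False if the batch was
    already applied (for example, after a job retry).
    """
    if batch_id in processed_batches:
        return False
    for phrase, n in map_reduce_counts(log_lines).items():
        # Existing counts are kept and the new interval is added on top.
        store[phrase] = store.get(phrase, 0) + n
    processed_batches.add(batch_id)
    return True


def build_trie_documents(frequencies):
    """Flatten a trie into one document per node, using the prefix as the id."""
    docs = {"": {"_id": "", "char": None, "children": [], "frequency": 0}}
    for phrase, freq in frequencies.items():
        for i in range(1, len(phrase) + 1):
            prefix = phrase[:i]
            if prefix not in docs:
                docs[prefix] = {
                    "_id": prefix,
                    "char": prefix[-1],
                    "children": [],  # ids of child documents
                    "frequency": 0,
                }
                docs[prefix[:-1]]["children"].append(prefix)
        # Only nodes that end a complete phrase carry a non-zero frequency.
        docs[phrase]["frequency"] = freq
    return list(docs.values())


if __name__ == "__main__":
    log = [
        '{"phrase": "car"}',
        '{"phrase": "cat"}',
        '{"phrase": "car"}',
    ]
    store, seen = {}, set()
    apply_batch(store, seen, "batch-1", log)
    apply_batch(store, seen, "batch-1", log)  # simulated retry, ignored
    for doc in build_trie_documents(store):
        print(doc)
```

In a real deployment, the count update and the batch-id record would need to be committed together (or the update made idempotent in some other way), otherwise a crash between the two steps could still lead to double counting.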

Thank you,