I’m a bit confused about the overall structure. As I understand it, we write the phrase log to distributed storage, which a MapReduce job later uses to generate the latest view of each word's frequency and store it in Cassandra. We then build tries from this mapping and store them in Mongo.
So here are my questions:
- What format are we using to write the phrase log?
- In the MapReduce stage, do we aggregate older counts as well, or only newer phrases?
- Why are we using both Cassandra and Mongo? Why not use only one of them?
- How will we handle failures in the MapReduce stage, where a term that was already processed gets reprocessed?
- How will we store the trie in Mongo?