Tries in memory why do we need cache?

Franklin · July 27, 2020, 2:29pm

I assume usually cache is in memory and is much faster than data in the disk. However, tries are already in memory then why do we still need cache which is the same hardware memory? fetching data speed from memory (cache or tries data) should be the same.

Design_Gurus · June 13, 2019, 6:43pm

You are right that the Tries are already in memory, the cache we are suggesting is without the trie. We can simply store the query and their top (say 10) terms in cache so that we don’t have to ask trie servers. Fetching from Trie servers could be slow, as we might have to query from multiple servers (based on the partitioning scheme we use), combine results, etc.

Noe_Brito · August 10, 2019, 8:04pm

Thanks for the reply. This was my assumption, that our cache would store “searchTerm” -> [list of top 10 suggestions]. We would not have to query the Trie and aggregate the top 10 suggestions.

rahul9 · February 14, 2021, 12:01am

whatever we are storing in trie can be easily store in hashtable also @Design_Gurus. For each trie node it will be acting as key to table and values would be list of k strings acting as top k items.Since out memory footprint esimated is 25 gb we can have atmax 25*k which is still in GB, it would be better than trie. Memory is cheap and it can be pushed to NO-sql with top words in cache.I guess this is more good design as i saw someone suggesting it somewhere.

Design_Gurus · February 14, 2021, 2:41pm

Did you mean, in addition to trie, we can store all “words->Top K string” in memory? This can be done given enough memory. The only thing we have to worry about would be the cache eviction policy whenever there is an update in the trie.

rahul9 · February 14, 2021, 2:59pm

I might have confused you.Let me explain in a bit detail :
I think it should be better if we directly use prefix hash table,It can be build directly,like if i have “Hello”, we can hash to H,HE,HEl,HELL,HELLO. I dont see usage of trie. Since we are storing top10-20 suggestions with each word.It should be fine.
[H] => { List of k suggestions}
[HE] => { List of k suggestions}
[HEL] => { List of k suggestions}
[HELL] => { List of k suggestions}
…
…
I am not in favour of trie because:

Difficult to distribute
Difficult to update ,what if we add new keys, we have to update all the nodes, same can be done in hash.

We can store top words of prefix hash table in redis and set expire to say 10 mins,and also back data in No-SQL or my sql at back end as persistent store.Each node score will be updated at back end and if new data is coming it will reflect in next 10 mins in redis cache or we can set it to 1 hour.Its too easy to update latest results that using trie and making it complex.
Searching in trie is too complex thing to do given the scale of search queries we are getting and the SLA we have on latency.The reason of expiry is that if some new words become popular at backend they will be picked up. Or alternatively we can update hash in similar way periodically the way you do trie .

Let me know your thoughts.

Show less

REPLY

Design_Gurus · February 15, 2021, 4:00am

This approach could work too. There are pros and cons; which need detailed discussion. For example, for this trie example given in the chapter, can you estimate how much space will we be using with this approach:

The trie approach will need space for 15 characters. Plus whatever we decide to cache.

Your approach will require space for every prefix and space for the top 10 terms. The prefixes are:

CA
CAT
CAP
CAPITAL
CAPT
CAPTION
CAPTAIN

Prefix takes more than 2x space. Now if we add the top 10 terms, this will be more than 10-20x. With the trie approach, we can decide how much data we should cache, with your approach we can’t, we’ve to store the top 10 terms with every prefix.

Practically, the trie can easily fit on one server (even with the 20% cache). We don’t need to distribute data. Though, for better read performance, we can have replica servers. The replica servers can follow updates from the primary and serve the read requests.

Your approach would require a lot more memory but will be quite fast to query compared to trie. And the data distribution seems straight forward too. As mentioned previously, the cache eviction policy needs to be taken care of. If the service is OK with a little bit of stale data (the way you have mentioned), it should work. In contrast, Trie can easily be updated on the fly.

rahul9 · February 17, 2021, 6:34pm

Thank you for detailed reply.Its a space time trade off we can discuss with interviewer.

Also, updates are not problem, we can update all prefixes on the fly.Like in trie also we need to update from child to parent, here we can update all the prefixes on the fly