Storage calculation doesn't make sense

Dmitrii_Murzin · July 27, 2020, 2:29pm

Hi,

This calculation doesn’t make sense:

Storage Estimation: If, on average, each query consists of 3 words and the average length of a word is 5 characters, this gives an average query size of 15 characters. Assuming we need 2 bytes to store a character, we will need 30 bytes to store an average query.

Why do you calculate the raw size of the queries, when what you actually need to store is the trie, with its node data, references to parents, and references to terminal nodes?

Could you please explain how this calculation relates to the proposed design?

Vladislav · April 27, 2020, 9:57pm

How to store trie in a file so that we can rebuild our trie easily - this will be needed when a machine restarts? We can take a snapshot of our trie periodically and store it in a file. This will enable us to rebuild a trie if the server goes down. To store, we can start with the root node and save the trie level-by-level. With each node, we can store what character it contains and how many children it has. Right after each node, we should put all of its children.
If we store this trie in a file with the above-mentioned scheme, we will have: “C2,A2,R1,T,P,O1,D”. From this, we can easily rebuild our trie.

Serialization includes only the plain text of the node (a character or a part of the word) and the number of children. All references are recomputed when the trie is rebuilt.
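
To make that concrete, here is a minimal Python sketch of the scheme as I read it. It is not the course's implementation. `Node`, `dump`, and `load` are names I made up, and it assumes one character per node and that characters are never digits or commas. Each token is the node's character followed by its child count (omitted for leaves), and each node's children follow it directly.

```python
class Node:
    """Hypothetical trie node: one character plus an ordered list of children."""
    def __init__(self, char):
        self.char = char
        self.children = []


def dump(root):
    """Serialize a trie: each node's token, immediately followed by its children."""
    tokens = []

    def visit(node):
        count = len(node.children)
        # Leaves are written without a count, as in "T" or "P".
        tokens.append(node.char + (str(count) if count else ""))
        for child in node.children:
            visit(child)

    visit(root)
    return ",".join(tokens)


def load(text):
    """Rebuild the trie; child references are recomputed from the counts."""
    tokens = iter(text.split(","))

    def build():
        token = next(tokens)
        node = Node(token[0])
        count = int(token[1:]) if len(token) > 1 else 0
        for _ in range(count):
            node.children.append(build())
        return node

    return build()
```

Loading "C2,A2,R1,T,P,O1,D" with this sketch gives a root C with children A and O, and dumping it again reproduces the same string. Only characters and counts are written to the file, so no pointers need to be stored.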

If you’ve noticed, we are not storing top suggestions and their counts with each node. It is hard to store this information; as our trie is being stored top down, we don’t have child nodes created before the parent, so there is no easy way to store their references. For this, we have to recalculate all the top terms with counts. This can be done while we are building the trie. Each node will calculate its top suggestions and pass it to its parent. Each parent node will merge results from all of its children to figure out its top suggestions.
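
The bottom-up recalculation described in that paragraph could look roughly like the sketch below. It assumes the per-query counts come from some separate source (here a plain dict), since the snapshot format above does not contain them. `K`, `Node`, `insert`, and `compute_top` are illustrative names, not anything from the course.

```python
import heapq

K = 2  # hypothetical number of suggestions kept per node


class Node:
    """Hypothetical trie node holding its children and its top-K suggestions."""
    def __init__(self):
        self.children = {}   # char -> Node
        self.count = 0       # > 0 only if a query ends at this node
        self.top = []        # list of (count, term), highest count first


def insert(root, term, count):
    """Add a term with its frequency to the trie."""
    node = root
    for ch in term:
        node = node.children.setdefault(ch, Node())
    node.count = count


def compute_top(node, prefix=""):
    """Post-order pass: each node merges its children's top lists with its own term."""
    candidates = [(node.count, prefix)] if node.count else []
    for ch, child in node.children.items():
        candidates.extend(compute_top(child, prefix + ch))
    # Merging the children's top-K lists is enough: the parent's top K
    # can only contain terms that are in some child's top K (or its own term).
    node.top = heapq.nlargest(K, candidates)
    return node.top
```

With counts such as cart=5, cap=9, cod=3 and K=2, the root ends up with cap and cart, while the node for "co" holds only cod. Each node passes at most K entries up to its parent, so the merge stays cheap even for large subtrees.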