I don’t follow what’s being done in the “Memory estimates” section for caching in the URL shortening service.
“Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.”
Previously we estimated that 300 million total URLs will be stored, so I take this to mean the 60 million hottest URLs (20%) account for 80% of the requests. There are ~1.7 billion requests per day, so ~1.36 billion of them (80%) should be cache hits, leaving ~340 million (20%) as cache misses.
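To make my reading explicit, here is the arithmetic I'm doing (a quick sketch; the 300 million, 1.7 billion, and 80-20 figures come from the text, and the hit/miss split is my interpretation):

```python
# Back-of-envelope numbers as I read them from the text.
TOTAL_URLS = 300_000_000          # total URLs stored
REQUESTS_PER_DAY = 1_700_000_000  # ~1.7 billion redirects per day

hot_urls = int(0.20 * TOTAL_URLS)             # 60 million hot URLs to cache
cache_hits = int(0.80 * REQUESTS_PER_DAY)     # ~1.36 billion hits per day
cache_misses = REQUESTS_PER_DAY - cache_hits  # ~340 million misses per day

print(f"hot URLs to cache:  {hot_urls:,}")
print(f"daily cache hits:   {cache_hits:,}")
print(f"daily cache misses: {cache_misses:,}")
```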
However, the next paragraph goes on to say “To cache 20% of these requests, we will need 170GB of memory.” I think this is an error, but maybe not. Why would we cache 20% of requests… where does that number come from? I thought we were caching the top 20% of URLs, which would produce cache hits on 80% of requests.
“To cache 20% of these requests, we will need 170GB of memory.
0.2 * 1.7 billion * 500 bytes = ~170GB”
This calculation loses me further. Isn’t that ~170GB the daily bandwidth spent on the 20% of requests that miss the cache, rather than a memory size? Caching 20% of the total 300 million URLs would require only ~30 GB (300 million * 20% * 500 bytes).
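To be concrete, here are the two calculations side by side as I understand them (assuming a cached entry costs the same 500 bytes the text uses elsewhere for storage):

```python
BYTES_PER_URL = 500  # per-object size used throughout the estimates

# The text's calculation: sizes the cache by daily *requests*.
book_estimate = 0.20 * 1_700_000_000 * BYTES_PER_URL  # 1.7e11 bytes = 170 GB

# My calculation: sizes the cache by unique *URLs*.
my_estimate = 0.20 * 300_000_000 * BYTES_PER_URL      # 3.0e10 bytes = 30 GB

print(f"text (20% of daily requests): {book_estimate / 1e9:.0f} GB")
print(f"mine (20% of stored URLs):    {my_estimate / 1e9:.0f} GB")
```

Unless I'm missing something, a cache holds unique URLs rather than requests, so only the second number looks like a memory size to me.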
Can someone tell me if/where I’ve gone wrong?