Server and clients can calculate a hash (e.g., SHA-256) of each chunk to decide whether the local copy needs updating. On the server, if we already have a chunk with the same hash (even from another user), we don't need to create another copy; we can reference the existing chunk. This is discussed in detail later under Data Deduplication.
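To make the concern concrete, here is a minimal sketch of hash-keyed deduplication, assuming a hypothetical in-memory `chunk_store`; the real system would use persistent block storage, but the lookup logic is the same.

```python
import hashlib

# Hypothetical in-memory chunk store keyed by SHA-256 digest.
chunk_store: dict[str, bytes] = {}

def put_chunk(data: bytes) -> str:
    """Store a chunk; reuse the existing copy if the digest is already known."""
    digest = hashlib.sha256(data).hexdigest()
    if digest not in chunk_store:
        chunk_store[digest] = data  # first upload of this content
    return digest  # file metadata references chunks by digest

def get_chunk(digest: str) -> bytes:
    return chunk_store[digest]

# Two users uploading identical chunks end up sharing one stored copy.
a = put_chunk(b"x" * 1024)
b = put_chunk(b"x" * 1024)
assert a == b and len(chunk_store) == 1
```

The failure mode the question worries about is exactly the `if digest not in chunk_store` branch: two *different* chunks with the same digest would silently share storage.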
Since a 4 MB chunk is far larger than the 256-bit digest, the hash function cannot be collision-free: by the pigeonhole principle, many distinct 4 MB blocks map to the same hash. Reusing a chunk with a matching hash in block storage therefore looks like a quick path to data corruption, especially at the scale of 100 billion files and even more chunks.
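The pigeonhole argument says collisions exist; whether one ever *occurs* is a probability question. A rough birthday-bound estimate, assuming on the order of 10^12 random chunks and a full 256-bit digest (both figures are assumptions for illustration):

```python
# Birthday bound: probability of at least one collision among n random
# values drawn from a b-bit hash is approximately n^2 / 2^(b+1).
def collision_probability(n: int, bits: int = 256) -> float:
    return n * n / 2 ** (bits + 1)

# Assuming ~10^12 chunks (100 billion files, several chunks each):
p = collision_probability(10**12)
print(p)  # on the order of 10^-54
```

This doesn't settle the question by itself (an adversarially crafted collision is a different threat than a random one), but it puts a number on the accidental-collision risk.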
A quick workaround would be to combine several different hash functions, but that only lowers the probability; it doesn't fix the underlying problem.
Is that acceptable for a “high reliability” system?
Course: https://www.educative.io/collection/5668639101419520/5649050225344512
Lesson: https://www.educative.io/collection/page/5668639101419520/5649050225344512/5693417237512192