
How is hashing chunks a reliable approach?

The server and clients can calculate a hash (e.g., SHA-256) of each chunk to decide whether the local copy needs updating. On the server, if a chunk with the same hash already exists (even one uploaded by another user), we don't need to create another copy; we can reuse that chunk. This is discussed in detail later under Data Deduplication.
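For context, here is a minimal sketch of that content-addressed approach, assuming a fixed 4 MB chunk size and using an in-memory dictionary to stand in for block storage (the `chunk_hashes` and `ChunkStore` names are illustrative, not part of the course design):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB fixed-size chunks, as in the lesson


def chunk_hashes(path):
    """Yield (sha256_digest, chunk_bytes) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha256(chunk).hexdigest(), chunk


class ChunkStore:
    """Toy content-addressed store: the digest is the key, so an identical
    chunk (even one uploaded by another user) is stored only once."""

    def __init__(self):
        self._blocks = {}  # digest -> chunk bytes; stands in for block storage

    def put(self, digest, chunk):
        # If the digest already exists, skip the write: this is deduplication.
        if digest not in self._blocks:
            self._blocks[digest] = chunk
        return digest

    def get(self, digest):
        return self._blocks[digest]
```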

Since a 4 MB chunk is far larger than a 256-bit hash, the hash function cannot be collision-free: by the pigeonhole principle, many distinct 4 MB blocks must map to the same hash. Reusing a chunk with the same hash in block storage therefore looks like a quick path to data corruption, especially with 100 billion files and even more chunks.

A quick workaround would be to use a second, different hash function, but that only lowers the probability; it doesn't fix the underlying problem.
Is that acceptable for a “high reliability” system?



Course: https://www.educative.io/collection/5668639101419520/5649050225344512
Lesson: https://www.educative.io/collection/page/5668639101419520/5649050225344512/5693417237512192

2^256 is astronomically large. No SHA-256 collision has ever been found, and it is common practice to assume SHA-256 will not produce a collision within any meaningful time frame.
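To put "astronomically large" into numbers, here is a rough birthday-bound estimate (an approximation that assumes SHA-256 behaves like a uniformly random 256-bit function; the chunk count is taken from the scale in the question):

```python
# Birthday bound: with n uniformly random 256-bit digests, the probability of
# at least one collision is approximately n^2 / 2^257.
n = 10**12                # ~100 billion files, on the order of a trillion chunks
p = n**2 / 2**257
print(f"collision probability ≈ {p:.2e}")   # ≈ 4.3e-54
```

That figure is many orders of magnitude below the rate of undetected hardware or network errors, which is why treating equal SHA-256 digests as identical chunks is considered acceptable even in a high-reliability design.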