Each host will perform checkpointing periodically and dump a snapshot of all the data it is holding onto a remote server. This will ensure that if a server dies down another server can replace it by taking its data from the last snapshot.
what is being snapshot? is it mostly the state of the queues? won’t a durable queue solve the problem? one issue could be items that have been dequeued but are in the middle of processing. but can’t this be solved with a dead letter queue?