I’d say the snapshot space overhead is minimal. Keep in mind we’re talking about storing one value per referenced chunk: my latest snapshot references 26,831 chunks, and all of their upload lengths are stored in a single chunk of 167,372 bytes, which works out to roughly 6 bytes per referenced chunk.
Lengths are currently JSON-encoded in the same way as the other snapshot values. If space is really a concern, they could be stored in a more compression-friendly binary format. Assuming chunk lengths are roughly similar, there should be some good compression opportunities that we’re missing out on by using JSON.
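To put the ~6 bytes per chunk in perspective, here’s a minimal sketch (not the project’s actual encoding) comparing a JSON array of lengths against a delta-encoded varint representation; the chunk count and length distribution below are made up for the example:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"math/rand"
)

func main() {
	// Simulate chunk lengths clustered around a typical chunk size (~4 MiB).
	lengths := make([]uint64, 26831)
	for i := range lengths {
		lengths[i] = 4*1024*1024 + uint64(rand.Intn(1<<20))
	}

	// Current approach: a JSON array of integers.
	jsonBytes, _ := json.Marshal(lengths)

	// Alternative sketch: delta-encoded varints, which stay small (and
	// compress well) when lengths are roughly similar.
	var buf bytes.Buffer
	var prev uint64
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, l := range lengths {
		delta := int64(l) - int64(prev)
		n := binary.PutVarint(tmp, delta)
		buf.Write(tmp[:n])
		prev = l
	}

	fmt.Printf("JSON:   %d bytes (%.1f bytes/chunk)\n",
		len(jsonBytes), float64(len(jsonBytes))/float64(len(lengths)))
	fmt.Printf("varint: %d bytes (%.1f bytes/chunk)\n",
		buf.Len(), float64(buf.Len())/float64(len(lengths)))
}
```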
I’ve also done a bit more searching into how these zero-byte files come to be. From what I understand, ext4 introduced a delayed allocation feature that defers allocating filesystem blocks until data is flushed from the page cache, which can happen well after the file is closed; in the event of a power loss or crash, this can leave behind an empty file. So while a non-zero check would catch this particular case, I’m concerned that there are other cases where incomplete (non-zero) files could exist.
Anyway, I see this length comparison as a stepping stone to also recording the uploaded chunk hash for comparison with the backend. Ideally I’d like to be able to regularly verify remote storage integrity without having to transfer chunk contents, leveraging the remote hashing facilities available on the majority of backend storage services.
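As a rough sketch of what that could look like, assuming a hypothetical backend interface that can report an object’s size and a server-side hash without transferring its contents (the `Backend`, `ChunkRef`, `Stat`, and `Verify` names below are illustrative, not an existing API):

```go
package verify

import "fmt"

// ChunkRef is what the snapshot would record for each referenced chunk:
// the chunk ID, the length recorded at upload time, and eventually the
// hash reported by the backend when the chunk was uploaded.
type ChunkRef struct {
	ID         string
	UploadSize int64
	UploadHash string // whatever digest the storage service exposes
}

// Backend is a hypothetical interface over a storage service that can
// report object metadata without transferring the object's contents.
type Backend interface {
	Stat(id string) (size int64, hash string, err error)
}

// Verify compares the snapshot's recorded size/hash for each chunk
// against what the backend currently reports, without downloading data.
func Verify(b Backend, refs []ChunkRef) []error {
	var problems []error
	for _, ref := range refs {
		size, hash, err := b.Stat(ref.ID)
		if err != nil {
			problems = append(problems, fmt.Errorf("chunk %s: stat failed: %w", ref.ID, err))
			continue
		}
		if size != ref.UploadSize {
			problems = append(problems, fmt.Errorf("chunk %s: size mismatch (recorded %d, remote %d)", ref.ID, ref.UploadSize, size))
		}
		if ref.UploadHash != "" && hash != ref.UploadHash {
			problems = append(problems, fmt.Errorf("chunk %s: hash mismatch", ref.ID))
		}
	}
	return problems
}
```

The hash field would hold whatever digest the service computes server-side, which varies by provider; that’s part of why the length check makes sense as a first step that works everywhere.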