Magic Pocket (MP) stores data in 4MiB blocks. (4MiB was chosen since that was the average size of a smartphone picture at the time.) Larger files are split, and smaller files are tail-packed together until the block reaches 4MiB. This is done in a staging area where the more costly operations are not performed. Blocks are written to multiple spindles across machines to improve durability.
These blocks are compressed and hashed with MD5 and SHA-1, and the hashes are used for deduplication. Then the blocks are put into a larger container and written to disk. The filesystem overlay and metadata are stored separately. The ingestion pipeline has a few other steps to maintain durability without bottlenecking.
I don’t remember if the tail-packed blocks are compressed and hashed as part of the 4MiB block, or if each fragment gets its own.
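To make the split-hash-dedupe idea concrete, here is a toy sketch. It is not Magic Pocket's code: the `DedupStore` class and `store_file` helper are made up for illustration, hashing the uncompressed block is an assumption, and tail-packing of small files is not modelled.

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4MiB, the block size described above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split a byte string into fixed-size blocks; the last one may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class DedupStore:
    """Hypothetical content-addressed store: one compressed copy per unique block."""

    def __init__(self) -> None:
        self.blocks = {}  # hex digest -> compressed block

    def put(self, block: bytes) -> str:
        # Assumption: the key is the SHA-1 of the uncompressed block.
        key = hashlib.sha1(block).hexdigest()
        if key not in self.blocks:  # a block that is already stored costs nothing extra
            self.blocks[key] = zlib.compress(block)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.blocks[key])


def store_file(store: DedupStore, data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Store a file and return the ordered list of block keys that reassemble it."""
    return [store.put(block) for block in split_into_blocks(data, block_size)]
```

A file is then just an ordered list of block keys, which is why uploading many copies of the same content adds metadata but almost no new block data.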
Since the data is deduped at the block level, there’s not much cost (to DBX) if users upload lots of copies of the same thing. The downside is that after a block has gone through the ingestion pipeline, it is only stored on one physical spindle in each of two different geographic regions. There is (or was) no facility for migrating busy blocks from erasure coding to multiple copies to improve read performance. There is some caching in the access pipeline from the user-facing API to the backend, but its hit rate is limited by its small size relative to the volume of data in the system.
There is a migration to cold storage that happens after a period of time, where each block is split and stored in two locations with an XOR of the two pieces in a third. See "How we optimized Magic Pocket for cold storage" on the Dropbox blog.
This is all from memory, but there’s a good write-up, "Inside the Magic Pocket", on the Dropbox blog, although details are glossed over. As of March 2020 they were using LRC erasure coding rather than Reed-Solomon. I think it was 12-2-2: 12 data disks in two groups of 6, where each group has a single local parity disk, plus two global parity disks covering all 12 data disks. There was some investigation into a migration to 14-2-2, but I don’t know if that ever took place.
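The split-plus-XOR layout can be sketched in a few lines. This only illustrates the idea as described above; the region names, the zero padding, and the function names are assumptions, not how the real system lays things out.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_cold(block: bytes) -> dict:
    """Split a block in two and add an XOR fragment, one fragment per region."""
    half = (len(block) + 1) // 2
    a = block[:half]
    b = block[half:].ljust(half, b"\0")  # pad so both halves have equal length
    return {"region1": a, "region2": b, "region3": xor_bytes(a, b)}


def decode_cold(fragments: dict, length: int) -> bytes:
    """Rebuild the block from any two of the three fragments."""
    if len(fragments) < 2:
        raise ValueError("need at least two of the three fragments")
    a = fragments.get("region1")
    b = fragments.get("region2")
    parity = fragments.get("region3")
    if a is None:
        a = xor_bytes(b, parity)  # a = b XOR (a XOR b)
    elif b is None:
        b = xor_bytes(a, parity)  # b = a XOR (a XOR b)
    return (a + b)[:length]  # drop the padding
```

Each region holds half a block, so the total is 1.5x the block size while still surviving the loss of any one region.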
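Here is a rough sketch of why the local groups matter. It assumes each local parity is a plain XOR of its group, which is a simplification, and it leaves out the two global parities entirely since those need finite-field arithmetic. All names are made up for the example.

```python
from functools import reduce
from operator import xor

DATA, LOCAL, GLOBAL = 12, 2, 2  # the 12-2-2 layout recalled above
GROUP_SIZE = DATA // LOCAL      # 6 data fragments per local group


def xor_all(fragments: list) -> bytes:
    """XOR a list of equal-length byte strings together, column by column."""
    return bytes(reduce(xor, column) for column in zip(*fragments))


def local_parities(data: list) -> list:
    """One XOR parity per local group (assumption: local parity is plain XOR)."""
    if len(data) != DATA:
        raise ValueError("expected %d data fragments" % DATA)
    return [xor_all(data[g * GROUP_SIZE:(g + 1) * GROUP_SIZE]) for g in range(LOCAL)]


def repair_one(data: list, parities: list, lost: int) -> bytes:
    """Rebuild one lost data fragment by reading only its own group (6 reads)."""
    group = lost // GROUP_SIZE
    start = group * GROUP_SIZE
    peers = [data[i] for i in range(start, start + GROUP_SIZE) if i != lost]
    return xor_all(peers + [parities[group]])


def storage_overhead(k: int, local: int, global_: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (k + local + global_) / k
```

With these numbers, 12-2-2 stores about 1.33x the data and 14-2-2 about 1.29x, and a single-fragment repair touches 6 fragments instead of 12.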
If you look at the documentation for `/upload_session/append`, you’ll recognize this: concurrent upload sessions must use a 4MiB offset, except for the last piece.
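The alignment rule is easy to satisfy if the chunk plan is computed up front. This is a minimal sketch that only does the arithmetic; it makes no API calls, and `plan_chunks` and its default chunk size are hypothetical.

```python
CHUNK_ALIGN = 4 * 1024 * 1024  # 4MiB, matching the block size


def plan_chunks(file_size: int, chunk_size: int = 2 * CHUNK_ALIGN) -> list:
    """Return (offset, length) pairs where every offset is a multiple of 4MiB.

    Only the last chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0 or chunk_size % CHUNK_ALIGN != 0:
        raise ValueError("chunk_size must be a positive multiple of 4MiB")
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]
```

For a 10MiB file with 4MiB chunks this gives three pieces at offsets 0, 4MiB and 8MiB, with only the last one short.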
the dropbox API that is exposed to the user is not a block storage api
It is an object store under the covers, and if you take the implementation into account you can use it as such; you just need to keep your object size <= 4MiB. I was trying to build an object store interface for internal use but didn’t get much traction. Due to the deduplication and lack of spindle diversity, it wouldn’t have had acceptable performance.
Or maybe this only applies to business/team account
I had nothing to do with the business side of things. Looking at the site, the Standard and Advanced business accounts get 1 billion API calls per month, but you can pay for more too.
Another useful usecase is speeding up files enumeration
The existing client already uses `/list_folder` and `/list_folder/continue` for reading directories. `/get_file_metadata/batch` might speed up fetching file info, but I think the implementation of `GetFileInfo` will need to be updated to allow for batching. Fetching metadata is probably another significant cause of 429 errors.
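The general shape of the batching change would be something like the sketch below. It is not the client's code: `fetch_batch` stands in for whatever issues one batched request, and the batch size of 100 is a placeholder rather than a documented limit.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def fetch_all(
    paths: Iterable[str],
    fetch_batch: Callable[[List[str]], Dict[str, object]],  # hypothetical: one request per call
    batch_size: int = 100,  # placeholder value, not a documented limit
) -> Dict[str, object]:
    """Fetch metadata for many paths with one call per batch, not one per path."""
    results: Dict[str, object] = {}
    for batch in batched(paths, batch_size):
        results.update(fetch_batch(batch))
    return results
```

The number of requests drops from one per file to one per batch, which is the part that would help with rate limiting.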
The benefit of go is that it’s very learnable
It’s not about learning it. I actually landed diffs in the internal version of `/x/sync/singleflight` while at Google. Specifically, adding the `dups` member to `call`, since I needed to know how many of our calls were being deduplicated. It’s small, but it’s something! That was in 2015 or 2016, before the library was made public.
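For anyone who hasn't used it, the idea is that concurrent callers asking for the same key share one in-flight call, and a counter records how many piggybacked. Below is a rough asyncio analogue, not the Go library's code; the class, the `total_dups` attribute and the method names are made up for the example.

```python
import asyncio


class _Call:
    """One in-flight call: the shared future plus a count of duplicate callers."""

    def __init__(self, future: asyncio.Future) -> None:
        self.future = future
        self.dups = 0


class SingleFlight:
    def __init__(self) -> None:
        self._calls = {}     # key -> _Call currently in flight
        self.total_dups = 0  # hypothetical metric: callers served by someone else's call

    async def do(self, key, fn):
        """Run fn() once per key at a time; returns (result, shared)."""
        call = self._calls.get(key)
        if call is not None:
            # Someone is already doing this work: count ourselves and wait for it.
            call.dups += 1
            self.total_dups += 1
            return await call.future, True

        call = _Call(asyncio.get_running_loop().create_future())
        self._calls[key] = call
        try:
            result = await fn()
        except BaseException as exc:
            call.future.set_exception(exc)
            call.future.exception()  # mark as retrieved in case nobody is waiting
            raise
        else:
            call.future.set_result(result)
            return result, call.dups > 0
        finally:
            del self._calls[key]
```

Five concurrent `do("k", fn)` calls would run `fn` once and leave `total_dups` at 4, which is the kind of number I wanted to see.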
I’ve been using a lot of Python asyncio lately, which isn’t terribly different in concept, but the syntax differences trip me up. I find it easier to switch between Python and C++.