Using Google Drive File Stream with :d:

What is Google Drive File Stream and how do you use it?

:bulb: I will use the abbreviation gdfs from now on when referring to Google Drive File Stream.

:bulb: Google Drive File Stream is a different product from the personal Google Drive Backup and Sync.

It’s a program which mounts the Google Drive contents as a local drive.

It caches the metadata of the files you have online on your local drive, so you can see and access those files in Windows Explorer as if they were stored locally on your disk.


When you open a file, gdfs downloads that file to your computer and stores it for future use. A file is downloaded only once, then cached.

The storage space used by gdfs is generally less than if using the personal Google Drive Backup and Sync, because with Google Drive Backup and Sync one has to download everything offline, while gdfs downloads only the files you open.

:bulb: Of course, you can select whole folders in gdfs for offline access. These folders are stored on disk and fully managed by gdfs.


The relevance to :d: is that gdfs is a much faster alternative to the Google Drive web-api, which is what :d: uses when you init a storage with gcd://some/path.

How to use Google Drive File Stream with :d:

Since gdfs shows up as a normal system drive, you init the storage as a Local Disk:

duplicacy init -e my-repository-gdfs "G:\My Drive\backups\duplicacy"

and then all the commands are run normally.

Can I also backup a repository with web-api to the same storage?

Of course!

duplicacy init -e my-repository-web-api "gcd://backups/duplicacy"

The only difference is the storage path.

Here’s the gcd web repository .duplicacy/preferences file:

[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "gcd://backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]

And the gdfs repository .duplicacy/preferences file:

[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "G:/My Drive/backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]
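
If you want to confirm that claim rather than eyeball it, here is a minimal Python sketch (not part of :d:) that parses the two preferences files quoted above and lists the keys whose values differ. The variable and function names are made up for illustration.

```python
import json

# The two .duplicacy/preferences files quoted above, pasted verbatim.
WEB_API_PREFS = """
[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "gcd://backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]
"""

GDFS_PREFS = """
[
    {
        "name": "default",
        "id": "tbp-pc",
        "storage": "G:/My Drive/backups/duplicacy",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "keys": null
    }
]
"""


def differing_keys(prefs_a: str, prefs_b: str) -> list:
    """Return the keys whose values differ between the first entries of two files."""
    a = json.loads(prefs_a)[0]
    b = json.loads(prefs_b)[0]
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


print(differing_keys(WEB_API_PREFS, GDFS_PREFS))  # prints ['storage']
```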

How fast is gdfs?

That is for you to decide.

Running check command on all repositories

First check run (I believe most chunks are already cached)
PS C:\duplicacy repositories\tbp-v> date; .\.duplicacy\z.exe check -all -tabular; date

Saturday, 13 July, 2019 14:57:40

Storage set to G:/My Drive/backups/duplicacy
Listing all chunks
6 snapshots and 213 revisions
Total chunk size is 2325G in 503308 chunks
All chunks referenced by snapshot macpiu-pro at revision 1 exist
All chunks referenced by snapshot macpiu-pro at revision 4 exist
All chunks referenced by snapshot tbp-bulk at revision 1 exist
[...]
All chunks referenced by snapshot tbp-bulk at revision 56 exist
All chunks referenced by snapshot tbp-nuc at revision 1 exist
[...]
All chunks referenced by snapshot tbp-nuc at revision 879 exist
All chunks referenced by snapshot tbp-pc at revision 3231 exist
All chunks referenced by snapshot tbp-pc at revision 3247 exist
All chunks referenced by snapshot tbp-v at revision 1 exist
[...]
All chunks referenced by snapshot tbp-v at revision 862 exist
All chunks referenced by snapshot nope at revision 155 exist
[...]
All chunks referenced by snapshot nope at revision 602 exist

       snap | rev |                          |  files |    bytes | chunks |    bytes | uniq |    bytes |   new |    bytes |
 macpiu-pro |   4 | @ 2019-06-20 16:43       | 169818 | 152,869M |  29379 | 136,111M |  951 |   3,961M |   953 |   3,972M |
 macpiu-pro | all |                          |        |          |  29532 | 136,538M | 9898 |  46,574M |       |          |

     snap | rev |                          |  files |   bytes | chunks |   bytes |   uniq |    bytes |    new |    bytes |
 tbp-bulk |  56 | @ 2019-07-10 01:02       | 112933 |   1602G | 331293 |   1590G |   8301 |  40,304M |   8301 |  40,304M |
 tbp-bulk | all |                          |        |         | 417670 |   1970G | 399697 |    1890G |        |          |

    snap | rev |                          |  files |   bytes | chunks |   bytes |  uniq |    bytes |  new |     bytes |
 tbp-nuc | 879 | @ 2019-07-13 10:00       |  95159 | 14,035M |   4623 | 15,529M |     5 |   3,533K |    5 |    3,533K |
 tbp-nuc | all |                          |        |         |  23937 | 90,876M | 17240 |  62,703M |      |           |

   snap |  rev |                          |  files |   bytes | chunks |  bytes | uniq |    bytes |  new |    bytes |
 tbp-pc | 3247 | @ 2018-11-30 13:57       | 123147 | 15,904M |   2154 | 6,129M |  143 | 288,246K |  145 | 290,560K |
 tbp-pc |  all |                          |        |         |   2278 | 6,371M | 1870 |   5,315M |      |          |

  snap | rev |                          | files |    bytes | chunks |    bytes |  uniq |     bytes |   new |     bytes |
 tbp-v | 862 | @ 2019-07-13 13:01       | 60505 |  11,196M |   3083 |  10,858M |    17 |   33,447K |    17 |   33,447K |
 tbp-v | all |                          |       |          |  50852 | 222,548M | 46168 |  203,218M |       |           |

 snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
 nope | 602 | @ 2019-07-13 10:16       |   245 | 350,321K |    112 | 392,649K |    0 |        0 |   0 |        0 |
 nope | all |                          |       |          |   1982 |   8,769M | 1977 |   8,757M |     |          |

Saturday, 13 July, 2019 14:59:41

So that is 2 min 1 sec to check 2,325 GB (about 2.3 TB) in 503308 chunks.

And here's the second check run (to show that caching works)
PS C:\duplicacy repositories\tbp-v> date; .\.duplicacy\z.exe check -all -tabular; date

Saturday, 13 July, 2019 15:00:13

Storage set to G:/My Drive/backups/duplicacy
Listing all chunks
6 snapshots and 213 revisions
Total chunk size is 2325G in 503308 chunks
All chunks referenced by snapshot nope at revision 155 exist
[...]
All chunks referenced by snapshot nope at revision 602 exist
All chunks referenced by snapshot macpiu-pro at revision 1 exist
All chunks referenced by snapshot macpiu-pro at revision 4 exist
All chunks referenced by snapshot tbp-bulk at revision 1 exist
[...]
All chunks referenced by snapshot tbp-bulk at revision 56 exist
All chunks referenced by snapshot tbp-nuc at revision 1 exist
[...]
All chunks referenced by snapshot tbp-nuc at revision 879 exist
All chunks referenced by snapshot tbp-pc at revision 3247 exist
All chunks referenced by snapshot tbp-v at revision 1 exist
[...]
All chunks referenced by snapshot tbp-v at revision 862 exist

 snap | rev |                          | files |    bytes | chunks |    bytes | uniq |    bytes | new |    bytes |
 nope | 602 | @ 2019-07-13 10:16       |   245 | 350,321K |    112 | 392,649K |    0 |        0 |   0 |        0 |
 nope | all |                          |       |          |   1982 |   8,769M | 1977 |   8,757M |     |          |

       snap | rev |                          |  files |    bytes | chunks |    bytes | uniq |    bytes |   new |    bytes |
 macpiu-pro |   4 | @ 2019-06-20 16:43       | 169818 | 152,869M |  29379 | 136,111M |  951 |   3,961M |   953 |   3,972M |
 macpiu-pro | all |                          |        |          |  29532 | 136,538M | 9898 |  46,574M |       |          |

     snap | rev |                          |  files |   bytes | chunks |   bytes |   uniq |    bytes |    new |    bytes |
 tbp-bulk |  56 | @ 2019-07-10 01:02       | 112933 |   1602G | 331293 |   1590G |   8301 |  40,304M |   8301 |  40,304M |
 tbp-bulk | all |                          |        |         | 417670 |   1970G | 399697 |    1890G |        |          |

    snap | rev |                          |  files |   bytes | chunks |   bytes |  uniq |    bytes |  new |     bytes |
 tbp-nuc | 879 | @ 2019-07-13 10:00       |  95159 | 14,035M |   4623 | 15,529M |     5 |   3,533K |    5 |    3,533K |
 tbp-nuc | all |                          |        |         |  23937 | 90,876M | 17240 |  62,703M |      |           |

   snap |  rev |                          |  files |   bytes | chunks |  bytes | uniq |    bytes |  new |    bytes |
 tbp-pc | 3247 | @ 2018-11-30 13:57       | 123147 | 15,904M |   2154 | 6,129M |  143 | 288,246K |  145 | 290,560K |
 tbp-pc |  all |                          |        |         |   2278 | 6,371M | 1870 |   5,315M |      |          |

  snap | rev |                          | files |    bytes | chunks |    bytes |  uniq |     bytes |   new |     bytes |
 tbp-v | 862 | @ 2019-07-13 13:01       | 60505 |  11,196M |   3083 |  10,858M |    17 |   33,447K |    17 |   33,447K |
 tbp-v | all |                          |       |          |  50852 | 222,548M | 46168 |  203,218M |       |           |

Saturday, 13 July, 2019 15:02:18

So that is 2 min 5 sec to check 2,325 GB (about 2.3 TB) in 503308 chunks.
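
To put those two runs into a single number, here is a small Python sketch that turns the timestamps printed by `date` into chunks checked per second. It only uses the figures from the logs above. Keep in mind that, as far as I understand it, a plain check only lists the chunks and verifies that they exist; it does not download the 2325G of chunk data.

```python
from datetime import datetime

CHUNKS = 503308  # from "Total chunk size is 2325G in 503308 chunks"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS timestamps taken on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Start and end times copied from the two check runs above.
runs = {
    "first": ("14:57:40", "14:59:41"),
    "second": ("15:00:13", "15:02:18"),
}

for name, (start, end) in runs.items():
    secs = elapsed_seconds(start, end)
    print(f"{name} run: {secs:.0f} s, about {CHUNKS / secs:.0f} chunks/s")
```

Both runs land at roughly four thousand chunks per second, which is why the second run is not noticeably faster than the first: the metadata was already cached locally in both cases.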

Running backup and prune all

Backup (one ~ 1TB repo) + prune (4 repositories ~ 2.3 TB)
== Starting Duplicacy Backup @ C:\duplicacy repositories\tbp-bulk

 = Start time is: 2019-07-10 01:02:28


 = Now executting .duplicacy/z.exe  -log  backup -stats  -limit-rate 100000  -threads 1

SUCCESS! Last lines:
 => 2019-07-10 01:53:23.269 INFO BACKUP_STATS Files: 112933 total, 1602G bytes; 27 new, 40,168M bytes
 => 2019-07-10 01:53:23.270 INFO BACKUP_STATS File chunks: 338729 total, 1608G bytes; 8297 new, 40,168M bytes, 40,287M bytes uploaded
 => 2019-07-10 01:53:23.270 INFO BACKUP_STATS Metadata chunks: 15 total, 68,403K bytes; 4 new, 35,484K bytes, 17,378K bytes uploaded
 => 2019-07-10 01:53:23.270 INFO BACKUP_STATS All chunks: 338744 total, 1608G bytes; 8301 new, 40,202M bytes, 40,304M bytes uploaded
 => 2019-07-10 01:53:23.271 INFO BACKUP_STATS Total running time: 00:50:51
 => 2019-07-10 01:53:23.271 WARN BACKUP_SKIPPED 1 file was not included due to access errors

 = Now executting .duplicacy/z.exe  -log   prune   -keep 0:1825 -keep 30:180 -keep 7:30 -keep 1:7  -threads 4  -all   

SUCCESS! Last lines:
 => 2019-07-10 01:57:34.551 INFO SNAPSHOT_DELETE The snapshot nnnnope at revision 534 has been removed
 => 2019-07-10 01:57:34.557 INFO SNAPSHOT_DELETE The snapshot nnnnope at revision 577 has been removed
 => 2019-07-10 01:57:34.575 INFO SNAPSHOT_DELETE The snapshot nnnnope at revision 579 has been removed
 => 2019-07-10 01:57:34.582 INFO SNAPSHOT_DELETE The snapshot nnnnope at revision 581 has been removed
 => 2019-07-10 01:57:34.587 INFO SNAPSHOT_DELETE The snapshot nnnnope at revision 583 has been removed
 => 2019-07-10 01:57:34.593 INFO SNAPSHOT_DELETE The snapshot tbp-bulk at revision 10 has been removed



 == Finished(SUCCESS) Duplicacy Backup @ C:\duplicacy repositories\tbp-bulk

 = Start time is: 2019-07-10 01:02:28
 = End   time is: 2019-07-10 01:57:37

 = Total runtime: 55 Minutes 9 Seconds

 = logFile is: C:\duplicacy repositories\tbp-bulk\.duplicacy\tbp-logs\2019-07-10 Wednesday\backup-log 2019-07-10 01-02-28_636983173482004646.log

Here’s the full log file: backup-log 2019-07-10 01-02-28_636983173482004646.zip - Google Drive

Notes, Gotchas

GDFS can fill your HDD in corner cases

Since the chunks :d: creates are copied to the gdfs drive, they will be stored on your real drive in the gdfs cache until they are uploaded, after which they are cleaned up.

This can lead to the case where, if your initial backup is big enough (e.g. repo size = 600GB, free space = 100GB), :d: will happily back up the data but your computer will run out of space, since syncing to the internet is much slower than writing to the cache.

Of course you can just restart the backup, and :d: will resume the incomplete backup. But if you want to avoid this issue altogether, you can filter some folders out of the initial backups and add them back one by one. Since :d: will not upload the already backed up files again, only the new ones, you won’t run out of disk space.
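
Here is a rough Python sketch of how you could plan those filtering rounds. It groups folders, in the order given, into batches that each fit in the free space of the cache drive. The folder names and sizes are hypothetical (they just add up to the 600GB example above), and the sketch assumes gdfs has finished uploading one batch before you start the next one.

```python
def plan_batches(folder_sizes_gb: dict, free_space_gb: float):
    """Group folders into batches that fit in the free cache space.

    Returns (batches, oversized). A folder bigger than the free space gets a
    batch of its own and is also listed in `oversized`, because it will still
    need a restart or a rate limit to get through.
    """
    batches, oversized = [], []
    current, used = [], 0
    for folder, size in folder_sizes_gb.items():
        if size > free_space_gb:
            oversized.append(folder)
        # Start a new batch when this folder would overflow the current one.
        if current and used + size > free_space_gb:
            batches.append(current)
            current, used = [], 0
        current.append(folder)
        used += size
    if current:
        batches.append(current)
    return batches, oversized


# Hypothetical repository of 600GB with 100GB of free space on the cache drive.
sizes = {"photos": 90, "videos": 250, "documents": 40, "music": 60, "projects": 160}
batches, oversized = plan_batches(sizes, free_space_gb=100)
for number, batch in enumerate(batches, start=1):
    print(f"backup round {number}: include {', '.join(batch)}")
print("larger than the free space on their own:", oversized)
```

Folders that end up in the oversized list are the ones where you would split further by subfolder, or accept a restart.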

After you’ve uploaded the whole repository and completed the “initial backups” (by repeating the filtering step above as many times as it takes), you can just relax, since from now on your backups (and especially prunes and checks) will be much faster than over the web-api!

totally offtopic to OP, sorry

For Google Business accounts, I always use Google Drive File Stream instead of web-api or normal Google Drive.

The speed it provides is almost like backing up to an external-disk connected via USB, so pretty darn fast.

Currently using it with 3 computers (2 win, 1 mac) and it works ok.
The mac implementation feels a bit slower than the windows one though.

See here for another discussion: Duplicacy 'check' command can be very slow · Issue #397 · gilbertchen/duplicacy · GitHub.

Btw: do you think I should make a #how-to on how to use GDFS with :d:?

Awesome, I remember you mentioned this some time ago but I could not find that post at all, nor could I stumble upon this on my own, even though I knew what I was looking for. Thank you for bringing this up again.

Isn’t this just because you are interfacing with a local cache that then slowly gets synced up to the cloud? How fast is it when you have to enumerate thousands of files in the datastore for the first time, before the local cache has been warmed up?

Next question: when I tried that a year back on a Mac, it was buggy to the point of being unusable. It hung for no reason almost all the time, and I uninstalled it shortly after. Do you use it on Windows? Did you have any issues?

Yes, I think this would be useful.

Even now (after more than a year) macOS GDFS feels just a little bit slower than the Windows counterpart, but I find it very usable nonetheless (for more than just backup: I also store about 4TB of personal files in there on my own account).

Just tag and ask me, or open a support topic :stuck_out_tongue:.

FAST! All the files are already cached locally the first time you install it. But only their metadata (created date, edited date, properties, size, etc.). The files are not downloaded to the disk unless you explicitly ask it to, so it consumes very little space.

In terms of speed, I think I can compare GDFS to using a laptop mechanical HDD. It’s definitely not slow, but it’s not SSD-like speed either, even though I have it cache its stuff on my main SSD.


All in all I think I can recommend it, just like I did a few months ago in that GitHub thread.

I have updated the OP with a guide on how to use gdfs and :d: together.
Keep the questions and feedback coming!

First, thanks for putting this together.

I also found that GDFS (Google Drive File Stream) is faster than the API access to Google Drive, so I switched to using it.

However, I’ve hit a mild gotcha with it. I had limited GDFS to uploading at 16MB/s to avoid completely saturating my upstream. :d: will happily create the backup files in excess of 100MB/s, which eventually filled up the cache, causing the initial backup to fail.

My workaround thus far has been to let GDFS upload more of the chunks, therefore clearing the on-disk cache, before restarting the backup. A more permanent workaround may be to limit :d: to a similar 16MB/s to avoid filling the cache faster than GDFS can upload.
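
To get a feel for the numbers, here is a back-of-the-envelope Python sketch. It estimates how long the cache survives when :d: writes chunks faster than GDFS uploads them, and converts an upload cap in MB/s into a value for the -limit-rate option. Assumptions on my part: the 100GB of free cache space is a made-up figure, 1MB is counted as 1024KB, and -limit-rate is taken to be in kilobytes per second, so double-check that against the :d: docs before relying on it.

```python
def seconds_until_cache_full(free_gb: float, write_mb_s: float, upload_mb_s: float):
    """Seconds until the cache drive is full, or None if uploads keep up."""
    net_mb_s = write_mb_s - upload_mb_s  # how fast the backlog grows
    if net_mb_s <= 0:
        return None
    return free_gb * 1024 / net_mb_s


def limit_rate_kb_s(mb_s: float) -> int:
    """Convert MB/s to KB/s (assumed unit of -limit-rate, 1 MB = 1024 KB)."""
    return int(mb_s * 1024)


FREE_CACHE_GB = 100  # hypothetical free space on the cache drive
WRITE_MB_S = 100     # ":d: will happily create the backup files in excess of 100MB/s"
UPLOAD_MB_S = 16     # the GDFS upload cap mentioned above

secs = seconds_until_cache_full(FREE_CACHE_GB, WRITE_MB_S, UPLOAD_MB_S)
print(f"cache full after about {secs / 60:.0f} minutes")
print(f"matching rate limit: -limit-rate {limit_rate_kb_s(UPLOAD_MB_S)}")
```

With :d: limited to the same rate as the GDFS upload, the backlog stops growing and the function returns None, which is the "more permanent workaround" described above.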

I’m not sure how viable this will be for completely trouble free backups, but as long as a backup isn’t creating more chunks than the available space on the cache drive, it shouldn’t fail. GDFS also seems to have a limit on the size of the “virtual” disk it can create to 1TiB, so this could potentially be a problem for big backup sets.

Added note about this.

I don’t think this is true. I have 5+ TB in my gdrive and 1TB in the virtual disk, but I think the only limits for the virtual drive are your own drive space and maybe the FAT32 maximum disk size (which afaik is bigger than 1TB).

Do you have a test case where GDFS is residing on a disk/volume larger than 1TiB total on a cache drive? Because all the disk management programs are showing the virtual disk as being 1TiB in size.

Again, I’m not sure if this is actually important. The main thing is to keep the cache from getting full, but I could see a case where there’s a >1TiB backup taking place, the cache doesn’t get emptied as quickly as :d: can push to it and it goes over some 1TiB size and breaks down even though the cache drive isn’t full.

I’ve been using Drive File Stream with the cache on a pooled drive. Seems to be working well enough so far. My upload speed is only 30-40mbps so it has had upload queues of 300-500GB when I’ve backed up bigger repositories. Loads into cache quickly then gradually uploads over the course of a few days. I suppose the downside is that the new chunks aren’t reflected in the Drive storage until the upload is complete, but since I’m only uploading from a single server, it doesn’t affect me from what I can tell.

I did have a weird issue with File Stream where it popped up with a message saying one chunk couldn’t be uploaded and was moved to “Lost & Found”. See screenshot:

Not sure why, and Google doesn’t come up with much when I search for it.

Is there any way to figure out where this chunk is supposed to go? I did a check -all on the storage which didn’t come up with any missing chunks but I’m not sure if it’s causing problems I’m not aware of.

I have never heard nor had any lost and found issues with GDFS.

However, what you can try, to be 100% sure all your chunks exist, is to init a new repository using a gcd:// path (i.e. a Google Drive API one), then run check -all on that storage.

By doing this (which will take sooo much time, better run duplicacy.exe -d -log check -all to have the debug output and see what :d: is doing), you will see exactly what is uploaded to gdrive and eliminate the risk that your chunk may only be cached locally due to some GDFS bug.

Check -all on Drive API didn’t show any missing chunks, although I got a few instances of “chunk _______ is confirmed to exist” which I haven’t seen before.

I did a search on the lost & found chunk filename and it does appear to be available on gdrive in File Stream and the web interface. However, on the web interface, the file says that it’s 0 bytes, whereas on File Stream in Windows Explorer it shows 4.04MB.

What’s the best way to verify the integrity of this chunk? Should I overwrite it with the file in the lost & found?

What I would do is copy the 4MB chunk directly off of the File Stream with Windows Explorer to a temporary location on a local drive and binary compare or compare its hash with the chunk in your ‘Lost & Found’. If they’re the same, delete the chunk on File Stream and copy back your temporary copy (or the one in Lost & Found).
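
If you don’t have a binary compare tool at hand, a few lines of Python do the hash comparison. This is only a sketch: the file names in the demo are placeholders written to a temporary folder, so point the two paths at your temporary copy and at the chunk in ‘Lost & Found’ instead.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, block_size: int = 1024 * 1024) -> str:
    """SHA-256 of a file, read in blocks so large chunks don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(path_a, path_b) -> bool:
    """True when both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with two placeholder files; replace these with your real chunk paths.
with tempfile.TemporaryDirectory() as tmp:
    from_file_stream = Path(tmp) / "chunk_copied_from_file_stream"
    from_lost_and_found = Path(tmp) / "chunk_from_lost_and_found"
    from_file_stream.write_bytes(b"")  # e.g. an empty, badly uploaded chunk
    from_lost_and_found.write_bytes(b"example payload")
    print("identical:", same_content(from_file_stream, from_lost_and_found))
```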

To be honest, though, the only way to be sure of the integrity of the backup is to do a check -files although I advise only to check the latest revision as otherwise it’ll check all the revisions one by one.

If your backup size is quite large, you might want to test a restore instead, because you can resume an aborted restore.

To save even more bandwidth - because of deduplication - and only if you have the disk space, doing a copy to a new local storage would be more efficient, and then you could check -files or restore directly off of the local copy.

Seems like they’re different; the lost & found chunk shows as 0.2MB larger size on disk.

The chunk in question doesn’t exist on my local backup storage, so I assume it’s from one of the repositories I backup directly to Google Drive. I’ll do as you suggested and copy them to a new local storage and do some checks from there.

As expected, the chunk in question wasn’t uploaded properly and caused the restore operation to fail. After I replaced it with the one from lost & found, I was able to resume the restore operation, which completed successfully.

Hope that helps anyone else who has the same issue in the future.

I’m curious why this happened? could it be something that :d: did which GDFS didn’t like, or is this a real bug in GDFS?

I have had some weird bugs with GDFS and talked to google support folks but I believe they were all fixed after I managed to repro them consistently.

No idea, it only happened to that one chunk. Hasn’t happened again in the few days since then for any of my backups. I don’t think it would have been anything on the part of Duplicacy, since it was just writing to the local cache. I had a lot of data stored in the cache queued for upload, probably a couple hundred gigabytes. Evidently, there was nothing wrong with the chunk itself since I was able to restore the backup after I manually replaced the empty chunk.

I did change the GDFS cache directory to a Drivepool of SSDs to expand the amount of data I could upload without filling the cache drive completely. Perhaps that had something to do with it? It only happened once in >1TB of uploads, though. I even rebalanced the pool a few times mid-upload to see if it would throw it off, but didn’t have any issues.

I would have no idea how to reproduce the error since I don’t recall doing anything in particular at the time that it occurred. I can’t even say for sure whether Duplicacy was running at the time, or if GDFS was just working through the upload queue (iirc it reached around 300,000 files at some point).

I don’t think this huge size was an issue. With my last computer, I had more than 500GB and a few million chunks to upload via GDFS in a single revision at one point, and they were all stored in the cache and uploaded slowly.

After some time they were all uploaded successfully ¯\_(ツ)_/¯ afaict. I will however run a restore for all my storages, just to make really sure that shit didn’t hit the fan.

@gchen wdyt, is there a way to make either check or backup to a local drive even smarter than they currently are, so that they find chunks which are 0 sized? (even though I’m asking this, I’m not sure implementing such a check will help in this particular case :-?)

I only noticed today that there was a ‘notifications’ tab on the GDFS interface, which contained the following message:

[image: screenshot of the GDFS notification]

M: being the mounted GDFS drive and P:\DriveFS… being the GDFS cache directory.

At least with this particular issue, it was easy enough to resolve since:
a) There was a clear error message detailing which file wasn’t uploaded
b) It uploaded an empty chunk with the correct filename which made it possible to locate the correct directory to place the chunk from lost & found

I suppose it would be much worse if it happened for thousands of chunks. I did restore a few other snapshots that didn’t use the missing chunk and haven’t run into errors yet, so fingers crossed it’s just that one isolated event.

For now, I’ve added a Windows Defender exclusion to the GDFS drive and cache as recommended by the error message. I noticed the Defender process uses up quite a bit of resources when GDFS is running so maybe it will help in that regard as well.

I think this would be a smart idea actually…

I’ve encountered a handful of situations where I’ve seen missing chunks, 0-byte chunks, and even one chunk that had totally the wrong content*.

Then again, in these instances I am using Duplicacy in a fairly heavy-duty way - copying large Vertical Backup storages, with lots of fixed sized chunks, to another sftp storage. I’ve had no issues with my own personal Duplicacy backups, though.

But yep, I’ve witnessed a few instances where I’ve had 0-byte chunks on the destination storage and had to fix it by removing the chunk and removing the revisions in \snapshots and re-copy that particular revision. It should be a very simple additional check on its file size.
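
Until something like that exists in :d: itself, a standalone scan is easy to sketch in Python. The function below walks the chunks folder of a storage that is reachable as a normal directory and reports every 0-byte file. It assumes the usual layout with a chunks folder at the storage root, and it can only catch what the file system reports: in the GDFS case above, Windows Explorer showed 4.04MB for a chunk that was 0 bytes on the web, so a scan through the File Stream drive might well have missed it.

```python
import os
import tempfile


def find_empty_chunks(storage_root: str) -> list:
    """Return storage-relative paths of all 0-byte files under <root>/chunks."""
    chunks_dir = os.path.join(storage_root, "chunks")
    empty = []
    for dirpath, _dirnames, filenames in os.walk(chunks_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.getsize(full_path) == 0:
                empty.append(os.path.relpath(full_path, storage_root))
    return sorted(empty)


# Demo on a throwaway storage with one good and one empty chunk (made-up names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "chunks", "ab"))
    with open(os.path.join(root, "chunks", "ab", "good-chunk"), "wb") as f:
        f.write(b"some chunk data")
    with open(os.path.join(root, "chunks", "ab", "empty-chunk"), "wb") as f:
        pass  # leave it at 0 bytes
    print(find_empty_chunks(root))
```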

I think I suggested a possible improvement before, whereby the chunk file names could be appended with extra metadata, such as the hash of the chunk. (Since this is the method by which Duplicacy has a database-less architecture.) If not a hash, then maybe chunk size in bytes? A check for 0-byte chunks is certainly useful, but an exact match may be more important given that @Akirus’s experience has shown a 0.2MB discrepancy.

*Edit: So yea… I actually had a chunk where the binary contents were completely different from those on the original storage I copied it from. Yet the header had the ‘duplicacy<NUL>’ string at the beginning, so it wasn’t totally corrupted. Had to re-copy the chunk manually. Very puzzled by that!
