Backblaze B2 optimum number of threads and chunk size?

Richard · 18 September 2017 18:41

Any thoughts on Backblaze B2 optimum number of threads and chunk size?

gchen · 19 September 2017 01:01

I normally get 1-2 MB/s from a single upload thread, and since each thread gets a different B2 upload server, I would suggest using enough threads to max out your upload bandwidth.

The default 4 MB chunk size should work. You can also try reducing that to, say, 1MB, in order to get better deduplication, then use more threads to overcome the increased overhead associated with smaller chunks.
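
As a rough illustration of the sizing suggestion above, here is a minimal sketch that estimates how many threads would fill an uplink, given a per-thread rate. The function name and the 100 Mbit/s uplink are assumptions for the example, not anything Duplicacy provides, and real throughput will not scale perfectly linearly.

```python
import math

def threads_to_saturate(uplink_mbit_per_s: float, per_thread_mbyte_per_s: float) -> int:
    """Rough number of upload threads needed to fill the uplink.

    Assumes throughput scales linearly with thread count, which is only
    an approximation.
    """
    if uplink_mbit_per_s <= 0 or per_thread_mbyte_per_s <= 0:
        raise ValueError("rates must be positive")
    uplink_mbyte_per_s = uplink_mbit_per_s / 8  # 8 bits per byte
    return max(1, math.ceil(uplink_mbyte_per_s / per_thread_mbyte_per_s))

# Hypothetical 100 Mbit/s uplink with the 1-2 MB/s per-thread range quoted above.
fewest = threads_to_saturate(100, 2.0)  # fast threads: fewer needed
most = threads_to_saturate(100, 1.0)    # slow threads: more needed
print(fewest, most)
```

The result is only a starting point for the -threads value; measuring on your own connection is still the deciding step.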

JarnoP · 22 July 2018 11:49

I was struggling with dismal backup speeds from Northern Europe to B2. The server has 100 Mbps fibre and I can easily get 2-3 MByte/sec from it over SSH to Asia, but I got only 20 KB/sec to B2. Then I found the -threads option, tried it with 5, and suddenly I am getting 3-5 MByte/sec to B2. I don’t know why the one-thread backup was so slow - it could have been due to ISP throttling or something similar. Quite mysterious.

Anyway, @gchen, how about having e.g. 4 threads as the default for B2? I think many users would appreciate a faster default setting.

saspus · 22 July 2018 15:45

I’m not him but I disagree. Default behavior shall be consistent across backends.

Have you found the root cause of the issue? 20 KB/sec × 5 threads does not add up to 5000 KB/sec.
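
To spell out that arithmetic, here is a small worked calculation using only the figures reported above (20 KB/sec with one thread, 3-5 MByte/sec with five). It assumes 1 MByte = 1000 KB for simplicity.

```python
single_thread_kb_s = 20        # reported single-thread rate
threads = 5                    # value passed to -threads
observed_kb_s = (3000, 5000)   # reported 3-5 MByte/sec, taking 1 MByte = 1000 KB

# What linear scaling from the single-thread rate would predict.
expected_kb_s = single_thread_kb_s * threads

# How far the observed rates exceed the single-thread rate and the linear prediction.
speedup = tuple(rate / single_thread_kb_s for rate in observed_kb_s)
excess = tuple(rate / expected_kb_s for rate in observed_kb_s)

print(expected_kb_s)  # 100
print(speedup)        # (150.0, 250.0)
print(excess)         # (30.0, 50.0)
```

The observed rate is 30 to 50 times what five equally slow threads would deliver, so something other than thread count alone must explain the original 20 KB/sec.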

JarnoP · 23 July 2018 02:36

Hi,

No, I was unable to find out the root cause (and happy with increased performance of the multiple threads).

Your argument rests on consistency of behavior in a technical sense; my argument is based on the optimal default user experience. Since, from a black-box perspective, threads will just increase performance, does it really make sense to drag all the platforms down to the lowest common denominator, or would it be better to provide good defaults for each storage type depending on its capabilities? I think this is a question where the devs need to think about how to get the balance right.

TheBestPessimist · 23 July 2018 05:33

I have a problem with this, in the sense that if you are running the backup on a NAS, where you have limited memory but more time to spare, you may want to use only 1 thread.

There is also the factor of latency: those in Europe/Australia will have much more latency and will therefore need more upload threads compared with those in the US.

And you could also see it this way: if you have a slow DSL connection (again, think US), increasing the number of threads will only waste resources, since you’re also killing the bandwidth for any other devices in your household.

All in all, I think that doing a few trials when you start your backups, to test how many threads work best for your upload speed and all the other network conditions, is not a very difficult task.
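
One way to turn such trials into a decision is sketched below. It assumes you have already measured throughput by hand at a few thread counts (for example by running a backup with different -threads values); the function name, the 10% threshold, and the sample numbers are all hypothetical.

```python
from typing import Mapping

def pick_thread_count(measured: Mapping[int, float], min_gain: float = 0.10) -> int:
    """Pick a thread count from measured throughputs (thread count -> MB/s).

    Walks up the thread counts and only moves to a higher count when it
    beats the current pick by at least min_gain (10% by default), so extra
    threads that barely help are not chosen.
    """
    if not measured:
        raise ValueError("no measurements")
    counts = sorted(measured)
    best = counts[0]
    for n in counts[1:]:
        if measured[n] >= measured[best] * (1 + min_gain):
            best = n
    return best

# Hypothetical trial results in MB/s; replace with your own measurements.
trials = {1: 0.02, 2: 1.5, 4: 3.0, 8: 3.1, 16: 3.2}
print(pick_thread_count(trials))
```

With these sample numbers the gains flatten out after 4 threads, which is where the sketch stops. On a memory-limited NAS or a shared DSL line you might raise min_gain so that extra threads have to earn their cost.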

JarnoP · 23 July 2018 06:09

Thanks, these are fair points.

gchen · 23 July 2018 17:17

The performance of individual B2 upload servers isn’t reliable. If you’re unlucky you can get an overloaded server that causes the upload to be very slow. Fortunately B2 always forces you to frequently change the upload server(s) by sending back HTTP errors (or even closing the connections), so you’ll never get stuck on a slow server for long.

With regard to the default number of threads for B2, you’re not the first one to request this change, but as @saspus and @TheBestPessimist pointed out there are downsides of doing this so I would rather not change it now.

Christoph · 23 July 2018 20:32

Perhaps this “optimal default user experience” could be something for the GUI version? Though perhaps not entirely hidden but as a setting that is on by default?

Also, I guess it would be helpful to provide recommended thread settings for various backends (where applicable) here on the forum. If anyone has something to contribute regarding this, please start a topic in #how-to!

Droolio · 23 July 2018 21:07

As a wild suggestion - for the CLI version, could not each storage’s thread count (and, come to think of it, limit-rate) be established with the set command, so it doesn’t have to be specified every time?

saspus · 24 July 2018 01:36

This seems like a great idea at first. I even wanted to suggest doing that for any command line modifier - to avoid specifying the same command line parameter over and over; in a way, creating your own default behaviors.

But then thinking about it a bit more it starts to seem less and less appealing. We already have a way to accomplish that via shell scripts. It would make sense to only implement the basic minimum required set of features and let the users go wild in the shell or GUI, the Unix way.
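
The script-based approach could look roughly like the sketch below, written in Python rather than shell. It only builds the command line and does not run anything. The storage names and default values are placeholders, and the use of -storage and -limit-rate on the backup command (with the rate assumed to be in KB/s) is an assumption to check against the Duplicacy documentation; -threads is the option discussed above.

```python
from typing import Dict, List, Optional

# Hypothetical per-storage defaults; storage names and values are placeholders.
STORAGE_DEFAULTS: Dict[str, Dict[str, Optional[int]]] = {
    "b2": {"threads": 4, "limit_rate": None},
    "nas": {"threads": 1, "limit_rate": 2048},
}

def backup_command(storage: str,
                   threads: Optional[int] = None,
                   limit_rate: Optional[int] = None) -> List[str]:
    """Build a duplicacy backup command line with per-storage defaults."""
    defaults = STORAGE_DEFAULTS.get(storage, {})
    # An explicit argument always wins over the stored default.
    if threads is None:
        threads = defaults.get("threads")
    if limit_rate is None:
        limit_rate = defaults.get("limit_rate")

    cmd = ["duplicacy", "backup", "-storage", storage]
    if threads is not None:
        cmd += ["-threads", str(threads)]
    if limit_rate is not None:
        cmd += ["-limit-rate", str(limit_rate)]  # assumed unit: KB/s
    return cmd

print(" ".join(backup_command("b2")))
print(" ".join(backup_command("nas", threads=2)))
```

The resulting list could be handed to subprocess.run once the options have been verified for your Duplicacy version.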

JarnoP · 24 July 2018 07:19

saspus, your argument can be used to reject almost any feature, since you can always point a user to Python libraries she can use to “script” around the problem. But in reality that scripting around is not very practical, so following your guideline does not in itself make it the right decision. If stopping feature creep were the primary goal of duplicacy development, why is there even a set command if we can “let users go wild in the shell”? Feature decisions are always about the balance of pros and cons.

@Droolio’s proposal seems very sensible to me as an end user looking for a practical solution rather than dogmatic implementation guidelines.

saspus · 24 July 2018 14:21

When you put it this way, yes, that proposal does make some sense if we think of it not as “another way to specify command line arguments to save time typing” but instead as “treating the number of threads as part of the storage backend configuration”. Whether it is the right thing to do I’m still not sure. You might want to alter the number of threads depending on your connection (LAN vs LTE), and therefore it should not be part of the storage configuration; it should be a function of the environment. The current implementation as a command line argument is therefore the most appropriate.

However, I still disagree that we should be adding redundant features just because they seem to simplify one particular use case: this is what front ends are for. We have the GUI for that. The command line utility is a backend, and its interface should be clean, logical, and unambiguous, without 20 different ways to accomplish the same things. Redundancy adds unnecessary complexity, ambiguity, and bloat, and increases the possibility of misconfiguring things.

JarnoP · 24 July 2018 14:58

Well, the CLI is the UI for many servers that do not even have an X server installed. There are no “20 different ways to accomplish the same things”, but two (2): a command line option or a setting in the preferences file.

The implementation and code base management aspect is for gchen to assess. For an end user, adding support for thread settings in the preferences file would be a nice addition. Or at least has two votes so far.

saspus · 24 July 2018 15:22

It’s one too many.

Well, CLI is the UI for many servers that do not have X server even installed

You know what I meant. GUI is an example of a frontend. Frontend does not have to be graphical.

Or at least has two votes so far.

Those hearts are not votes for the feature. They mean that the reader thinks the comment contributed to the discussion. One of these two “votes” is mine, by the way. If you tap on the number you can see who voted.

JarnoP · 24 July 2018 15:29

I tried to politely tell you that there are two different opinions here. Let’s agree to disagree.

I don’t think you meant that gchen should remove the preferences file support completely from duplicacy, did you?

Droolio · 24 July 2018 19:22

To clarify, the set command doesn’t save these options on the storage backend - they’re for each repository id on the client side. A single repository can have multiple storages so you might want to have a way to remember each environment internally, just as -no-backup is remembered on our fallible behalf.

I’m with you on the Unix way, but I don’t think it precludes allowing defaults to be overridden with a command line switch, especially in exceptional situations as above. And if you continually switch connections, you might simply choose not to ‘set’ anything at all.

Since the set command exists already for similar purposes, feature creep is minimal, and my proposal was strictly for thread counts and bandwidth limits.

But one problem with the idea I can think of already is deciding which thread count and bandwidth limit to use for the copy command. We have source and destination threads, and upstream and downstream values for each. The lesser of the two applicable values? What if one isn’t set? Could it make a more intelligent choice? It might seem straightforward, yet not necessarily so. Would Duplicacy benefit from having a different thread count for source and destination? If not, why does copy already have both -download-limit-rate and -upload-limit-rate options?
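
To make the “lesser of the two, unless one isn’t set” option concrete, here is a minimal sketch of one possible resolution rule. It is purely illustrative: the function, the stored values, and the fallback of 1 are assumptions, not anything Duplicacy implements.

```python
from typing import Optional

def resolve_copy_setting(source: Optional[int],
                         destination: Optional[int],
                         fallback: int = 1) -> int:
    """Resolve one copy-time setting (thread count or rate limit).

    Rule sketched here: use the lesser of the two stored values; if only
    one storage has a value, use that one; if neither does, use fallback.
    """
    values = [v for v in (source, destination) if v is not None]
    return min(values) if values else fallback

print(resolve_copy_setting(8, 4))        # both set: lesser value
print(resolve_copy_setting(None, 4))     # only destination set
print(resolve_copy_setting(None, None))  # neither set: fallback
```

This rule is conservative by design. It would not cover the case raised above where source and destination genuinely benefit from different thread counts, which would need two separate values rather than one resolved number.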

BTW I’m not overly attached to this idea… I just threw it out there as a suitable enough improvement - a compromise to having hard-coded defaults for storage backends. I definitely don’t suggest the init command pre-populate these settings based on recommended values. That’s definitely up to the user to define.