Borgmatic to manage parallel support? #227

Closed
opened 2019-10-20 11:05:40 +00:00 by decentral1se · 6 comments
Contributor

What I'm trying to do and why

Backups take a while but if you have a powerful machine, you could run them in parallel! This is possible to manage externally from borgmatic but it would also be nice to have a `-n N` flag to signal some cores and have borgmatic run what it can in parallel. The user would have to make sure that what is being backed up can be done in parallel.

This is probably a can of worms but would be nice to think through.
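
A minimal sketch of what such a flag might drive internally, assuming each borg command is given as an argv list. The function name and shape are hypothetical, not borgmatic's actual code, and the repo URLs in the comment are placeholders:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_commands(commands, max_workers=2):
    """Run each command in its own process, at most max_workers at a time.

    Threads suffice here: each worker just blocks on subprocess.run(),
    so the GIL is not a bottleneck. Returns the commands' exit codes
    in the same order they were given."""
    def run(command):
        return subprocess.run(command, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))

# e.g. one "borg create" per repository (hypothetical repo URLs):
# run_commands([
#     ["borg", "create", "ssh://us.example.com/./repo::{now}", "/srv/data"],
#     ["borg", "create", "ssh://eu.example.com/./repo::{now}", "/srv/data"],
# ], max_workers=2)
```

Here `max_workers` would be the value of the proposed `-n N` flag.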

Owner

Interesting. What might you want to be performed in parallel, exactly? Separate Borg invocations, say if you're backing up to two separate repositories on different remote hosts? Is the goal here to improve performance to reduce total runtime? If so, why? (I can imagine why, but I'm interested in your thinking here.)

Do you have reason to believe that running Borg invocations in parallel would actually improve performance? Put another way, what's the bottleneck? Is it disk I/O, or network I/O, or CPU (you mentioned cores)? And would layering on a parallel run help?

You don't need answers to all of these now, but this is the sort of exploration that would help define/scope this feature.

witten added the
waiting for response
label 2019-11-18 01:26:36 +00:00
Owner

Related ticket: [witten/borgmatic#291](https://projects.torsion.org/witten/borgmatic/issues/291)
Owner

Closing due to inactivity. However, please feel free to reopen if you're still interested in this!

witten removed the
waiting for response
label 2021-08-31 23:42:49 +00:00

I think one possible scenario is: multiple slow remote repositories (where the bottleneck is either the repository's network or its storage).

You might use blazing-fast SSDs for the production server and several last-decade servers with 5400 rpm laptop HDDs, found at the bottom of the rack, as borg repositories. The production server will still be far from any kind of bottleneck, while the repository servers will be choking trying to write all the data they receive.

Owner

Those are great points. If anyone has scenarios like those—or others—feel free to chime in here so we can gauge interest.


Multiple remote repositories, where the bottleneck is *their* bandwidth, is the most obvious use case. For example, borgbase has datacenters in both the US and EU. If, like me, you understand the value of both having redundant backups and of spreading your backups across multiple geolocations, you'll create two repos for every one you actually intend to use. Sure, you could just use something like rsync to copy a repo from one place to another, but [the official borg docs have some reasoning why they don't recommend that](https://borgbackup.readthedocs.io/en/stable/faq.html#can-i-copy-or-synchronize-my-repo-to-another-location), and I'll take them at their word. Nonetheless, I'm writing the exact same data to two different repos, and doing them in sequence (as borgmatic wants to do) means more than doubling the amount of time spent on the backup.

Now, I think (but am not sure, as I've not tried yet) that it's _possible_ to run `borg create` targeting the same directory in concurrent processes; if it is, then it seems to me that borgmatic would benefit greatly from having this be part of its declarative config rather than something you have to orchestrate yourself (which has the extra pain of forcing you to specify YAML overrides).
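
For illustration only, a hypothetical config shape for this. The `parallelism` option does not exist in borgmatic; only the `location`/`storage` sections and `repositories` list mirror its real config layout, and the repo URLs are placeholders:

```yaml
location:
    repositories:
        - ssh://us.example.com/./repo
        - ssh://eu.example.com/./repo

storage:
    # Hypothetical knob: how many repositories to back up at once.
    parallelism: 2
```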

Another thing that having borgmatic take care of, in a parallel world, would help a lot is the `extract` integrity check: I reckon `borg extract` isn't safe to run concurrently when targeting the same directory. So borgmatic could handle this logic internally by running all the integrity checks _except_ the `extract` dry-run in parallel, and then running that `extract` dry-run for each repository in series. That would be a huge boon to speed while also protecting users from potentially shooting themselves in the foot. I'm pretty sure you just need to make sure that another `borg create` isn't still going when you get to the point where you want to run the `extract` dry-run, and again, having borgmatic take care of this instead of orchestrating it yourself would be a huge help.
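
That ordering could look something like this two-phase helper (a sketch, not borgmatic code; the command lists are placeholders for per-repository check and extract-dry-run invocations):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def two_phase(parallel_commands, serial_commands, max_workers=4):
    """Run the first batch of commands concurrently, then the second
    batch strictly one at a time.

    Per-repository integrity checks would go in parallel_commands,
    while the extract dry-runs (possibly unsafe to run concurrently
    against the same target directory) go in serial_commands."""
    def run(command):
        return subprocess.run(command, capture_output=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parallel_codes = list(pool.map(run, parallel_commands))

    serial_codes = [run(command) for command in serial_commands]
    return parallel_codes, serial_codes
```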

There's also the matter of using hooks correctly to prevent running backups on live data. Right now I have my `before` hooks touch a file and my `after` hooks `rm` it. If I were to run separate `borgmatic` processes at once, I'd have race conditions unless I also had YAML overrides for those hook settings (and had my app check globs before it starts). But if borgmatic were to take care of the parallelism itself, we could have things like `before_any_backup` (run this command before any backup happens, and only once, e.g. `touch` a file) and `after_all_backup` (run this command after all backups are done, e.g. `rm` a file) without users having to do external orchestration or a ton of YAML overrides/config duplication.
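
Sketching that hook idea as hypothetical config keys: neither `before_any_backup` nor `after_all_backup` exists in borgmatic today; only the `hooks` section name mirrors its real config, and the lock-file path is made up:

```yaml
hooks:
    # Hypothetical: run once, before the first parallel backup starts.
    before_any_backup:
        - touch /var/run/backup-in-progress
    # Hypothetical: run once, after every parallel backup has finished.
    after_all_backup:
        - rm /var/run/backup-in-progress
```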

Of course, I do understand this will complicate logging. Depending on how much effort you want to put in, you could either go all-in and write code to properly handle concurrent logs nicely, something like in [this blog post](https://scribe.nixnet.services/python3-logging-with-multiprocessing-f51f460b8778), or you could just let chaos reign and tell users that if they engage the parallelism, their logs will be interspersed and potentially garbled. Caveat emptor. I personally would accept log garbling as a tradeoff for speed.
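
The "handle concurrent logs nicely" approach usually means the stdlib's queue-based handlers: workers push records onto a shared queue and a single listener writes them out, so lines never interleave mid-record. A sketch under that assumption (the logger name is made up):

```python
import logging
import logging.handlers
import multiprocessing

def make_worker_logger(queue):
    """Give a worker a logger that sends records to a shared queue
    instead of writing directly to the output stream."""
    logger = logging.getLogger("backup-worker")
    logger.setLevel(logging.INFO)
    logger.handlers = [logging.handlers.QueueHandler(queue)]
    return logger

def main():
    queue = multiprocessing.Queue()
    # One listener in the main process drains the queue to the real handler,
    # serializing output from all concurrent backups.
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler())
    listener.start()
    make_worker_logger(queue).info("repo 1 done")
    listener.stop()
```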

Reference: borgmatic-collective/borgmatic#227