Backupping a large disk #898

Closed
opened 2024-07-11 11:56:23 +00:00 by kimheino · 8 comments
Contributor

What I'd like to do and why

I have a large glusterfs disk (6TB) with a lot of small files (56M files). Backing it up takes three days. These files are mostly stable. I even know which directories are modified between backup runs.

Any ideas on how to speed up the backup?

I'm already dividing this data set into 40 yaml files and backing them up separately. One solution I came up with is to back up subdirs 1-5 on Monday, 6-10 on Tuesday, and so on. This would change the "daily" (3 daily) backup to a weekly backup, which is fine.

This would require cron-like schedule support in yaml/borgmatic. Is there such a feature? At least I didn't find any.

Or wildcard support for the "borgmatic create --repository REPOSITORY" repository name? I could name them Mon_something, Tue_something, etc. and then run `borgmatic create --repository $(date +%a)_*`

I'm open to all ideas for easily backing up this huge set of files.

Other notes / implementation ideas

No response

Owner

Wow, that's a lot of files! Do you know which step of the backups takes so long? Running borgmatic with `--verbosity 2` should give you more information as each step is run. If the main slow step is specifically the `borg create` command, then this is likely a Borg performance question rather than something in borgmatic, and you might consider filing a [Borg issue](https://github.com/borgbackup/borg/issues) or searching for an existing one. Also see [these comments](https://www.reddit.com/r/selfhosted/comments/dzmbt2/what_does_your_borg_backup_performance_look_like/) for some ideas, like tweaking the compression or checking disk IO.

If on the other hand the slow step is `borg check`, there are a number of [things you can do in borgmatic](https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/) to speed that up.

And if the slow step is (in full or in part) something borgmatic itself is doing, then I'd be happy to take a look at it. borgmatic itself doesn't put timestamps in its logs, but if you log to something like [systemd](https://torsion.org/borgmatic/docs/how-to/inspect-your-backups/#logging), that will include timestamps on each log entry so you can see how long things take.
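For a one-off manual run without systemd, the same effect can be approximated by timestamping borgmatic's output in the shell. This is only a minimal sketch: `stamp_lines` is a hypothetical helper name, not part of borgmatic.

```shell
#!/bin/sh
# stamp_lines (hypothetical helper): prefix every line read from stdin
# with the current wall-clock time, so gaps between log lines show
# how long each step took.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}
```

Usage would look something like `borgmatic create --verbosity 2 2>&1 | stamp_lines`. The `2>&1` is there so that output is stamped whether it arrives on stdout or stderr.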

> This would require cron-like schedule support on yaml/borgmatic. Is there such a feature? At least I didn't find any.

borgmatic has some basic ["scheduling"](https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/#check-frequency) for checks, but not for making backups themselves. For that, I recommend using a real scheduler like cron or systemd. The main downside is that you wouldn't be able to just run borgmatic once and rely on it to consume forty configuration files in sequence. Instead, you'd have to individually schedule borgmatic invocations (e.g. with `--config`) to run individual configuration files or run them in batches.

But before going down that road, you may want to see if the performance issues can be resolved.

> Or wildcard support for "borgmatic create --repository REPOSITORY" repository name? I could name them Mon_something ,Tue_something, etc and then run borgmatic create --repository $(date +%a)_*

Interesting idea. I think you could achieve that today though by grouping your configuration files. E.g., put all of your Monday files into a `monday/` directory and then point borgmatic at it on Monday: `borgmatic --config monday/` and so on.
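The per-weekday directory idea could be driven from a single daily cron or systemd job with a small wrapper. A rough sketch, assuming a hypothetical layout of one lowercase directory per weekday (`monday/`, `tuesday/`, ...) under a base directory of your choosing:

```shell
#!/bin/sh
# weekday_config_dir (hypothetical helper): print BASE_DIR/<weekday>,
# e.g. /etc/borgmatic.d/monday. LC_ALL=C keeps the day name in English
# regardless of the system locale.
weekday_config_dir() {
    base_dir=$1
    day=$(LC_ALL=C date +%A | tr '[:upper:]' '[:lower:]')
    printf '%s/%s\n' "$base_dir" "$day"
}
```

A daily job could then run `borgmatic --config "$(weekday_config_dir /etc/borgmatic.d)" create`, where `/etc/borgmatic.d` is just an example base path.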

witten added the question / support label 2024-07-12 05:07:09 +00:00
Author
Contributor

> Do you know which step of the backups take so long?

It's scanning the files on the glusterfs disk, so the slowness isn't really related to borgmatic or borg. I'm already testing different glusterfs tuning options and have already made it faster. By faster I mean "3 days", not "5 days".

When comparing glusterfs to XFS:

opendir() / readdir() / stat() are slow
read() is fast
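
One rough way to see the metadata cost in isolation is to walk a directory tree without reading any file contents and time that walk on each filesystem. This is a sketch only; `count_files` is a hypothetical helper, and how many stat() calls `find` actually issues depends on the implementation and filesystem.

```shell
#!/bin/sh
# count_files (hypothetical helper): count regular files under a
# directory. The walk lists directories and looks up file metadata
# but never reads file contents.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

In bash, `time count_files /path/on/glusterfs` versus `time count_files /path/on/xfs` (both paths are placeholders) gives a crude comparison of scan time alone.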

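> Also see [these comments](https://www.reddit.com/r/selfhosted/comments/dzmbt2/what_does_your_borg_backup_performance_look_like/) for some ideas, like tweaking the compression or checking disk IO.

Thanks. I'll have to test `--noflags` next, although it probably only has an effect when adding more files to the backup, not when scanning for changed files.

> If on the other hand the slow step is `borg check`, there are a number of [things you can do in borgmatic](https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/) to speed that up.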

I've already disabled checks as they took too long. Here's a snippet from my yaml:

compression: zstd,4

files_cache: ctime,size

keep_daily: 7
keep_weekly: 8

skip_actions:
    - check

Side note: Originally I had all the files in a single repository, but mounting it took so long that I had to split it up into multiple repositories.

> > This would require cron-like schedule support on yaml/borgmatic. Is there such a feature? At least I didn't find any.

> borgmatic has some various basic "scheduling" for checks, but not for making backups themselves.

I'm already using those on my other systems. For this I would need something similar (absolute `Monday`, not relative `frequency: 1 week`) for creating backups too.

> > Or wildcard support for "borgmatic create --repository REPOSITORY" repository name? I could name them Mon_something ,Tue_something, etc and then run borgmatic create --repository $(date +%a)_*

> Interesting idea.. I think you could achieve that today though by grouping your configuration files. E.g., put all of your Monday files into a monday/ directory and then point borgmatic at it on Monday: borgmatic --config monday/ and so on.

Yes, I got the same idea after opening this issue. There is one downside to this: for `borgmatic mount --repository something --mount-point /foo` I need to find the correct `--config` directory.

I can do all this with systemd timers/cron and subdirs, but it would be handy to have this in borgmatic too. Two options:

  1. `--repository` would allow wildcards, so I could use `something_Mon` or `Mon_something` for my Monday needs.
  2. A new keyword, for example `tags` or `matchers`. Then in yaml I could use `tags: Monday, foo, bar` (multiple tags), and the command `borgmatic create --tag Monday` would match only the yaml files with the `Monday` tag.

I would prefer option 2.
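
Until something like option 1 exists in borgmatic, the wildcard matching can be emulated in the shell and fed to `--repository` one label at a time. A minimal sketch, assuming you keep your own list of repository labels (the `Mon_home` style names below are made up for illustration):

```shell
#!/bin/sh
# match_labels (hypothetical helper): print each label from the
# remaining arguments that matches the glob pattern in the first one.
match_labels() {
    pattern=$1
    shift
    for label in "$@"; do
        # An unquoted $pattern in a case arm is matched as a glob.
        case "$label" in
            $pattern) printf '%s\n' "$label" ;;
        esac
    done
}
```

For example, `for label in $(match_labels 'Mon_*' Mon_home Tue_www Mon_mail); do borgmatic create --repository "$label"; done` would run only the two Monday repositories.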

Author
Contributor

I did an XFS vs glusterfs comparison with a small (400GB, 4M files) file set.

Backup from glusterfs took 7.5 hours.
Backup from XFS took 4 hours.

This is a VM on Hetzner. The backup is stored on Hetzner's Storage Box.

Owner

I apologize for the delay in getting back to this. That's interesting about the relative filesystem performance. XFS sounds like a big win there. You might also consult the Borg project for filesystem performance tuning tips; that's well outside my area of expertise. And see these [Borg FAQ entries](https://borgbackup.readthedocs.io/en/stable/faq.html#what-s-the-expected-backup-performance) on the topic.

In terms of borgmatic features to support limiting the set of configuration files "run" upon a create action, there is a comparable feature already for repositories: Each repository can have a label and you can specify that label when using the --repository flag. But it sounds like you'd rather have a tag/label at the whole configuration file level.

Could your repositories and configuration files be 1 to 1, such that there's one repository per configuration file? If so, you could maybe use the existing repository label feature for your use case.

Owner

> > borgmatic has some various basic "scheduling" for checks, but not for making backups themselves.

> I'm already using those on my other systems. For this I would need something similar (absolute Monday, not relative frequency: 1 week) to creating backups too.

borgmatic has that too, but only for `check` right now; see [check days](https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/#check-days). But note that unlike a real scheduler, this is only "best effort" scheduling. In theory it could be generalized to `create` as well.

Author
Contributor

> In terms of borgmatic features to support limiting the set of configuration files "run" upon a `create` action, there is a comparable feature already for repositories: Each repository can have a `label` and you can specify that label when using the `--repository` flag.

Yes, I'm using a unique label for each repository. The feature I'm missing is wildcard support:

(12:20) root@risa:~$ borgmatic list --repository baz-jemma
baz-jemma: Listing archives
...
(12:20) root@risa:~$ borgmatic list --repository \*jemma\*
Repository "*jemma*" not found in configuration files

With such wildcard support I could do my own scheduling very easily. Currently I'm doing it not-so-easily, as discussed above.

> Could your repositories and configuration files be 1 to 1, such that there's one repository per configuration file? If so, you could maybe use the existing repository label feature for your use case.

Yes, I have one subdirectory per repository and one configuration file per repository. The problem is matching multiple repositories easily with a wildcard, not splitting them into multiple configuration directories. Having them all in one configuration directory makes other borg(matic) tasks easier.

Anyway, I have a workaround for this now and this issue can be closed. If you decide to add wildcard support, it would make my life easier.

Owner

Okay, the --repository flag now has glob support in main. It'll be part of the next release. Thanks for the suggestion!

Owner

Released in borgmatic 1.8.14!
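
With glob support in `--repository`, the weekday naming scheme proposed earlier in the thread could be sketched roughly as follows. The `Mon_something` label convention is the issue author's idea rather than anything borgmatic requires, and `weekday_repo_glob` is a hypothetical helper name.

```shell
#!/bin/sh
# weekday_repo_glob (hypothetical helper): build a glob such as "Mon_*"
# from today's abbreviated weekday name. LC_ALL=C keeps the
# abbreviation in English regardless of the system locale.
weekday_repo_glob() {
    printf '%s_*\n' "$(LC_ALL=C date +%a)"
}
```

A daily job could then call `borgmatic create --repository "$(weekday_repo_glob)"`. The quotes matter: unquoted, the shell may expand `Mon_*` against files in the current directory before borgmatic ever sees the pattern.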

Reference: borgmatic-collective/borgmatic#898