Support different prefixes for database hooks to allow for multiple stdin streams #296

AriosThePhoenix · 2020-02-28T23:50:52Z

AriosThePhoenix commented

2020-02-28 23:50:52 +00:00

(I originally wrote this post as a comment for this thread (https://projects.torsion.org/witten/borgmatic/issues/258), but I feel like it makes more sense as its own separate issue)

Problem Description

Borgmatic does not have proper support for reading from stdin right now (well okay, it kinda does, see https://projects.torsion.org/witten/borgmatic/issues/42), which is an issue when dumping large databases. Reading a DB from disk, then writing the dump to potentially the same disk, then reading that dump again and sending it to the repo creates a lot more I/O overhead than a simple pipe between pg_dump/mysqldump and borg. However, piping a dump into borgmatic as described in the issue above limits one to just one dump. It also feels a little hacky when compared with borgmatic's otherwise very explicit and clean configuration-based approach.

Proposal

One option for getting around this "one stream per archive" issue would be to allow the user to specify one prefix for each database, with borgmatic then creating several archive "chains", with one chain for every db backup + one chain for the regular filesystem backups.

Right now, it seems as if borgmatic uses one naming scheme (defined in the storage section) for all archives that it creates. If the user could specify a different prefix for each database hook, then multiple db backups via stdin could be run very elegantly simply by calling borg once for every db, with a different prefix every time.

Essentially, I'm imagining something like this:

location:
    source_directories:
        - /some/dir
    repositories:
        - some_repo
    prefix: "filesystem" # could also just be a constant/hardcoded value
    
hooks:
    postgresql_databases:
        - name: foo
          prefix: "db_foo"
          ...
        - name: bar
          prefix: "db_bar"
          ...
    ...
          
    
retention:
    prefixes:
        - "filesystem"
        - "db_foo"
        - "db_bar"
    ...

consistency:
    prefixes:
        - "filesystem"
        - "db_foo"
        - "db_bar"

(Alternatively, there could also be a "use_prefixes" option in the hooks section and borgmatic could automatically generate the prefixes from the db names and use those for pruning and checks)

The resulting repository would then contain several archives with different prefixes, eg:

- filesystem-<today> -> Contains filesystem
- db_foo-<today> -> contains only dump of foo
- db_bar-<today> -> contains only dump of bar
- filesystem-<yesterday> -> ...
- db_foo-<yesterday> -> ...
- db_bar-<yesterday> -> ...

This approach would allow for a lot of flexibility in regards to db backups, especially when combined with stdin stream input. It also doesn't depend on what could be considered "undefined behavior" right now (witten/borgmatic#42); instead, everything is explicitly defined in the configuration file.
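To make the "call borg once for every db" idea a bit more concrete, here is a rough Python sketch of what one such streamed backup could look like. This is not borgmatic code: the function names and the `<database>.dump` file name are made up for illustration, and the pg_dump options are just one plausible choice. It assumes a Borg version that supports reading from stdin via `-` together with `--stdin-name`.

```python
import subprocess


def build_pipeline(repository, database_name, prefix):
    """Return (dump_command, borg_command) for one streamed database backup.

    Hypothetical helper: one archive per database, named "<prefix>-{now}".
    """
    # pg_dump writes the dump to stdout when no output file is given.
    dump_command = ["pg_dump", "--format=custom", database_name]
    borg_command = [
        "borg", "create",
        # Name of the file that the stdin data gets inside the archive.
        "--stdin-name", f"{database_name}.dump",
        # Borg itself expands the {now} placeholder at archive creation time.
        f"{repository}::{prefix}-{{now}}",
        # "-" tells borg create to read the data from stdin.
        "-",
    ]
    return dump_command, borg_command


def run_pipeline(dump_command, borg_command):
    """Pipe the dump straight into Borg without touching the disk."""
    dump = subprocess.Popen(dump_command, stdout=subprocess.PIPE)
    borg = subprocess.Popen(borg_command, stdin=dump.stdout)
    # Close our copy of the pipe so pg_dump sees a broken pipe if Borg exits early.
    dump.stdout.close()
    borg_status = borg.wait()
    dump_status = dump.wait()
    return dump_status == 0 and borg_status == 0
```

Calling `build_pipeline("some_repo", "foo", "db_foo")` and `build_pipeline("some_repo", "bar", "db_bar")` in turn would produce one archive chain per database, which is the layout listed above.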

One consideration to make here is backwards compatibility. If prefixes + streams are introduced, then they would have to be opt-in, so as not to disrupt current setups packing multiple db dumps in a single archive. A simple flag in the config would probably be ideal here.

This is less of an issue and more of a proposal/brainstorm I guess. I'm looking forward to hearing everyone's opinion on this :)

witten commented

2020-04-29 17:44:44 +00:00

Interesting. It certainly sounds like something that could work. But what's the restore picture look like with this in place? For instance, today, you can point borgmatic restore at a single archive, and it'll somewhat atomically restore all databases contained therein. Let's say you've got a total host failure and need to do a restore of multiple databases. How do you go about getting all of the databases back?

Crazy thought: I wonder if there's a reasonable way to stream multiple different databases (e.g. the output of multiple different invocations of pg_dump) into a single borg invocation consuming from stdin. Maybe with some sort of logical file separator to aid in borgmatic restoration. Downside: It would mean you'd always have to restore with borgmatic. You couldn't manually get at the contents of the Borg archive and restore an archive by hand with pg_restore, because there would be multiple dumps therein.

AriosThePhoenix commented

2020-04-29 18:41:09 +00:00

Well, when restoring, running borgmatic list would return a list of all archives, so the user would see something similar to what I described above:

- filesystem-<today> -> Contains filesystem
- db_foo-<today> -> contains only dump of foo
- db_bar-<today> -> contains only dump of bar
- filesystem-<yesterday> -> ...
- db_foo-<yesterday> -> ...
- db_bar-<yesterday> -> ...

Then, a user could choose one of the archives to restore and borgmatic could look at the archive prefix, compare it with the config file and find the appropriate database to overwrite. That looks like a reasonable solution from my perspective, but I agree that it could get a little messy if one has dozens of databases to restore.

In that case, a --prefix switch for the restore command might be a good idea. Then the user could just select the prefixes (i.e. the databases or fs) that they want restored. Perhaps there could also be a * prefix to select all available prefixes, causing borgmatic to restore every prefix. This would also work in tandem with the current --archive switch - a user could run something like borgmatic restore --archive <yesterday> --prefix db1,db2 to only restore the first two database hooks from yesterday's backup instead of the entire system. This would add some more separation between the different hooks and would make partial restores much easier. Another example: borgmatic restore --archive latest --prefix * would then restore the entire system from the most recent archive.

As for piping multiple dumps into a single file - I guess that could work too, but I'm not sure if I would prefer that option. As you mentioned, borgmatic would somehow need to mark the different dumps, potentially in a different way for every type of database hook it supports. And partial restores would be made impossible by this method too, as now all the dumps are within one large file.

witten commented

2020-04-29 20:12:42 +00:00

Okay, so to make sure I understand.. You're suggesting that instead of (or in addition to) taking an archive name, borgmatic restore could accept a partial archive name.. Basically, a common timestamp shared by all archives made during the same borgmatic run. And then --prefix could allow specifying one or more prefixes (database names, really) to select the particular database archives to restore.

One modification I might suggest to that is to replace --prefix with the existing borgmatic restore --database flag. Makes it less about the underlying implementation detail (prefixes) and more about what the user wants: Restore these databases.

One other detail we'd have to account for: Some folks use archive prefixes to distinguish archives from different client machines, or even different applications on a single client machine. So we'd have to not break that, and maybe even allow that feature to interact with multiple per-archive database dumps.

Tangentially related: https://github.com/borgbackup/borg/issues/2300 and https://github.com/borgbackup/borg/issues/3945

AriosThePhoenix commented

2020-04-29 20:40:38 +00:00

Ah, good point about the --database flag, I missed that in the docs. Yea, that makes more sense than adding a new prefix flag then.

I also just realized that I made a wrong assumption about the archive names in my previous post; the individual archives may have slightly different timestamps, so we would indeed need some common timestamp. I guess wildcards would be an option for that (so, --archive "2020-04-20*" --database db1,db2), but yea, that's a lot less pretty than what I originally had in mind. Sorry about that.
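For illustration only, the wildcard idea could be sketched like this in Python. Nothing here is an existing borgmatic feature: the function name is made up, and the archive names assume the prefix-first scheme from the original proposal (`<database>-<timestamp>`).

```python
from fnmatch import fnmatchcase


def match_archives(archive_names, archive_glob, databases):
    """Pick the per-database archives whose timestamp part matches a glob.

    Hypothetical helper: assumes archives are named "<database>-<timestamp>".
    """
    patterns = [f"{database}-{archive_glob}" for database in databases]
    return [
        name
        for name in archive_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


archives = [
    "db1-2020-04-20T01:00:03",
    "db2-2020-04-20T01:00:41",  # slightly later timestamp, same run
    "db3-2020-04-20T01:01:10",
    "db1-2020-04-19T01:00:02",
]
selected = match_archives(archives, "2020-04-20*", ["db1", "db2"])
```

Here `selected` holds the db1 and db2 archives from 2020-04-20 even though their timestamps differ by a few seconds, while db3 and the older db1 archive are left out.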

As for the use case of having multiple borgmatic instances back up to the same repository, one simple option would be to make this behavior opt-in - then users could set up appropriate prefixes (or database archive names really) themselves if they want to utilize this new behavior.

I guess borgmatic could alternatively include some form of machine/instance uuid as sort of a "super-prefix" in front of every archive it creates. That would allow for separation, but it would mean adding yet more complexity on top of archive names. In that case, a full archive name would be something like <instance_prefix>-<database_prefix>-<archive_name>. I'm not sure if that's a great way going forward to be honest. So yea, I think I prefer the former - make it an opt-in option and thus allow existing users to set up appropriate prefixes themselves.

Oh, and apologies if my suggestions are a bit vague in some parts, it's been a while since I opened this issue and it was really more of a brainstorm idea - though one that I would love to see implemented. To elaborate a little, I have several MariaDB and Postgres servers hosting databases that are > 100G in size. With the current borgmatic, my only options are to only pipe one database per run, or dump 100s of Gigabytes to a temporary file every night, which is not practical due to space and performance constraints. That's why I originally suggested this feature, it would fit my use case perfectly.

witten commented

2020-04-29 21:03:37 +00:00

> I also just realized that I made a wrong assumption about the archive names in my previous post; the individual archives may have slightly different timestamps, so we would indeed need some common timestamp. I guess wildcards would be an option for that (so, --archive "2020-04-20*" --database db1,db2), but yea, that's a lot less pretty than what I originally had in mind. Sorry about that.

No worries.. One potential solution would be for borgmatic to use a common timestamp for all archives created in a single borgmatic run. That however would mean departing from the current {hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} format, which relies on Borg filling out the timestamp.. and is therefore different each time Borg is invoked for a database.

In regards to the topic of prefixes, I'm not sure this makes a material difference, but how about suffixes! Hear me out..

Most users will just have archive names that look like:

<archive_timestamp_or_whatever>

Then, users that want to effectively tag archives with hostname:

<instance_prefix>-<archive_timestamp_or_whatever>

Finally, some users will want to dump to per-database archives as well:

<instance_prefix>-<archive_timestamp_or_whatever>-<database_name>

The main reason I suggest a suffix is that the per-database dumps all seem logically tied to the same archive.. They just happen to be spread across multiple archives because.. implementation details.

> To elaborate a little, I have several MariaDB and Postgres servers hosting databases that are > 100G in size. With the current borgmatic, my only options are to only pipe one database per run, or dump 100s of Gigabytes to a temporary file every night, which is not practical due to space and performance constraints. That's why I originally suggested this feature, it would fit my use case perfectly.

Yup, that use case makes sense to me, and I think probably resonates with many other users as well.

AriosThePhoenix commented

2020-04-29 22:07:10 +00:00

You're right, that sounds like a good idea to me as well! It makes more sense structurally and should make selecting archives easier too:

Upon creating a new backup, borgmatic could generate one fixed timestamp that will then be used for all archives. Then the archive names would look something like this:

myhost-2020-04-22T04:05:06-db1
myhost-2020-04-22T04:05:06-db2

From there borgmatic should be able to use Borg's --prefix flag to select the archives a user may request with --archive or --database. This seems like a pretty clean and straightforward implementation to me.
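A small Python sketch of that idea, with the caveat that both helper names are invented here and that the selection is done on a plain list of names rather than through Borg itself:

```python
from datetime import datetime


def database_archive_names(hostname, started_at, database_names):
    """Build one archive name per database, all sharing a single timestamp."""
    # The timestamp is computed once per run, so every archive gets the same one.
    stamp = started_at.strftime("%Y-%m-%dT%H:%M:%S")
    return [f"{hostname}-{stamp}-{name}" for name in database_names]


def select_archives(all_names, run_name, database_names=None):
    """Select the archives of one run, optionally limited to some databases.

    run_name is the shared "<hostname>-<timestamp>" part of the names.
    """
    if database_names is None:
        # Mimics prefix matching: every archive belonging to this run.
        return [name for name in all_names if name.startswith(f"{run_name}-")]
    # Exact names, so that "db1" does not accidentally also match "db10".
    wanted = {f"{run_name}-{name}" for name in database_names}
    return [name for name in all_names if name in wanted]


names = database_archive_names("myhost", datetime(2020, 4, 22, 4, 5, 6), ["db1", "db2"])
only_db1 = select_archives(names, "myhost-2020-04-22T04:05:06", ["db1"])
```

One design note: pure prefix matching works for picking a whole run, but for individual databases a suffix has to be compared exactly, otherwise similarly named databases would get mixed up.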

witten commented

2020-04-29 23:14:43 +00:00

Sounds good! Thanks for filing and brainstorming.

👍 1

witten added the "design finalized" label 2020-04-29 23:14:54 +00:00

witten commented

2020-05-05 21:43:52 +00:00

An update.. I've done a fair amount of prototyping, and I've run into some challenges on the separate-archive-per-database-dump approach. The biggest hurdle is simply in making sure that all archive dumps made at the same time have the same timestamp without breaking the existing archive_name_format feature. Some of the options for that:

1. Before invoking Borg for each archive, effectively "freeze time" with something like the libfaketime module so that all archives get the same timestamp. However, this not only influences the archive name.. it also means that every other call to "get current time" in Borg will return a time that never advances. Seems ill-advised.
2. Or, make borgmatic responsible for interpolating {now} and {utcnow} in archive names instead of Borg, so we can give all archives made at the same time the same timestamp. However, this seems like a lot of duplicated code that we'd have to maintain.
3. Or, use the borg create --timestamp flag. However, it turns out that this doesn't actually impact placeholders like {now} and {utcnow} as one might expect. I could file a ticket on Borg to implement that, but then older versions of Borg will still not have that feature.
4. Or, abuse borg create --comment to give all archives created at the same time the same comment, and somehow correlate that way. Seems hacky though, and won't show up in borgmatic list unless that output format is also changed. Same problems with using borg create --timestamp and then just correlating on archive start time.
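As a rough idea of what the second option (interpolating {now} and {utcnow} outside of Borg) might involve, here is a minimal Python sketch. It is not the prototype mentioned in this comment: the function name is invented, only the two time placeholders are handled, and falling back to ISO format when no format string is given is an assumption about the desired default.

```python
import re
from datetime import datetime

# Matches {now}, {utcnow}, {now:FORMAT} and {utcnow:FORMAT}.
TIME_PLACEHOLDER = re.compile(r"\{(now|utcnow)(?::([^{}]*))?\}")


def freeze_timestamps(archive_name_format, local_now, utc_now):
    """Replace time placeholders with one fixed moment.

    Other placeholders such as {hostname} are left alone for Borg to fill in.
    """

    def substitute(match):
        moment = local_now if match.group(1) == "now" else utc_now
        date_format = match.group(2)
        # Assumption: without an explicit format, fall back to ISO 8601.
        return moment.strftime(date_format) if date_format else moment.isoformat()

    return TIME_PLACEHOLDER.sub(substitute, archive_name_format)


local = datetime(2020, 4, 22, 4, 5, 6)
utc = datetime(2020, 4, 22, 2, 5, 6)
name = freeze_timestamps("{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}", local, utc)
# name is now "{hostname}-2020-04-22T04:05:06.000000"
```

Applying the same `local` and `utc` values to every archive of a run would give them all an identical timestamp, at the cost of duplicating some of Borg's placeholder handling, which is the maintenance concern raised in the list above.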

However, there's good news! I found a way to stuff all of the database dumps into a single archive.. while doing full streaming and avoiding any extra disk storage. Since I have a working prototype of that, I'll elaborate on the approach on #258 and close this particular ticket.

I do want to thank you for filing it however, as playing around with the approaches mentioned above is what led me to this new approach!

witten closed this issue

2020-05-05 21:44:11 +00:00

AriosThePhoenix commented

2020-05-05 21:54:27 +00:00

Good to hear! Looking forward to the finished feature, however it ends up being implemented :)

That said, if we ever go with the approach discussed above, I would personally choose the second option, given that it seems to have the fewest side effects, although it will increase borgmatic's footprint by a fair bit.

And thank you for taking the time to discuss this - I really enjoyed brainstorming this idea to its logical conclusion!

witten referenced this issue

2020-05-05 21:54:48 +00:00

Streaming database dumps/restores without using extra disk space #258

witten commented

2020-05-05 21:57:33 +00:00

Agreed on option two.. I'm just really lazy and apparently would rather write a bunch of multiprocess code than write a bunch of date format parsing code. 😄
