Support different prefixes for database hooks to allow for multiple stdin streams #296
Reference: borgmatic-collective/borgmatic#296
(I originally wrote this post as a comment for this thread, but I feel like it makes more sense as its own separate issue)
Problem Description
Borgmatic does not have proper support for reading from stdin right now (well okay, it kinda does), which is an issue when dumping large databases. Reading a DB from disk, then writing the dump to potentially the same disk, then reading that dump again and sending it to the repo creates a lot more I/O overhead than a simple pipe between pg_dump/mysqldump and borg. However, piping a dump into borgmatic as described in the issue above limits one to just one dump. It also feels a little hacky when compared with borgmatic's otherwise very explicit and clean configuration-based approach.
Proposal
One option for getting around this "one stream per archive" issue would be to allow the user to specify one prefix for each database, with borgmatic then creating several archive "chains", with one chain for every db backup + one chain for the regular filesystem backups.
Right now, it seems as if borgmatic uses one naming scheme (defined in the storage section) for all archives that it creates. If the user could specify a different prefix for each database hook, then multiple db backups via stdin could be run very elegantly simply by calling borg once for every db, with a different prefix every time.
Essentially, I'm imagining something like this:
(Alternatively, there could also be a "use_prefixes" option in the hooks section and borgmatic could automatically generate the prefixes from the db names and use those for pruning and checks)
The resulting repository would then contain several archives with different prefixes, eg:
This approach would allow for a lot of flexibility in regards to db backups, especially when combined with stdin stream input. It also doesn't depend on what could be considered "undefined behavior" right now (witten/borgmatic#42), instead everything is explicitly defined in the configuration file.
One consideration to make here is backwards compatibility. If prefixes + streams are introduced, then they would have to be opt-in, so as not to disrupt current setups packing multiple db dumps in a single archive. A simple flag in the config would probably be ideal here.
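To make the proposal a little more concrete, here is a rough Python sketch of "one borg call per database, each with its own prefix". It only builds the shell pipelines and does not run them. The `prefix` key, the database names, and the repository path are all hypothetical; nothing here is an existing borgmatic option. It relies on `borg create` reading from stdin when the path is `-`.

```python
import shlex

# Hypothetical per-database settings. The "prefix" key is the option proposed
# in this issue; it is an assumption, not an existing borgmatic setting.
DATABASES = [
    {"name": "db1", "prefix": "db1-"},
    {"name": "db2", "prefix": "db2-"},
]


def build_pipeline(repository, database):
    """Return a shell pipeline that streams one dump into its own archive."""
    dump = ["pg_dump", "--format=custom", database["name"]]
    # Borg fills in {now} itself, so each invocation gets its own timestamp.
    archive = f"{repository}::{database['prefix']}{{now}}"
    # A trailing "-" tells "borg create" to read the file contents from stdin.
    create = ["borg", "create", "--stdin-name", database["name"], archive, "-"]
    return " | ".join(shlex.join(command) for command in (dump, create))


for database in DATABASES:
    print(build_pipeline("/path/to/repo", database))
```

Each database ends up in its own archive chain, so pruning and checks could then be run once per prefix.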
This is less of an issue and more of a proposal/brainstorm I guess. I'm looking forward to hearing everyone's opinion on this :)
Interesting. It certainly sounds like something that could work. But what's the restore picture look like with this in place? For instance, today, you can point `borgmatic restore` at a single archive, and it'll somewhat atomically restore all databases contained therein. Let's say you've got a total host failure and need to do a restore of multiple databases. How do you go about getting all of the databases back?
Crazy thought: I wonder if there's a reasonable way to stream multiple different databases (e.g. the output of multiple different invocations of `pg_dump`) into a single `borg` invocation consuming from stdin. Maybe with some sort of logical file separator to aid in borgmatic restoration. Downside: It would mean you'd always have to restore with borgmatic.. You couldn't manually get at the contents of the Borg archive and restore an archive by hand with `pg_restore`, because there would be multiple dumps therein.
Well, when restoring, running `borgmatic list` would return a list of all archives, so the user would see something similar to what I described above. Then, a user could choose one of the archives to restore and borgmatic could look at the archive prefix, compare it with the config file and find the appropriate database to overwrite. That looks like a reasonable solution from my perspective, but I agree that it could get a little messy if one has dozens of databases to restore.
In that case, a `--prefix` switch for the `restore` command might be a good idea. Then the user could just select the prefixes (i.e. the databases or fs) that they want restored. Perhaps there could also be a `*` prefix to select all available prefixes, causing borgmatic to restore every prefix. This would also work in tandem with the current `--archive` switch - a user could run something like `borgmatic restore --archive <yesterday> --prefix db1,db2` to only restore the first two database hooks from yesterday's backup instead of the entire system. This would add some more separation between the different hooks and would make partial restores much easier. Another example: `borgmatic restore --archive latest --prefix *` would then restore the entire system from the most recent archive.
As for piping multiple dumps into a single file - I guess that could work too, but I'm not sure if I would prefer that option. As you mentioned, borgmatic would somehow need to mark the different dumps, potentially in a different way for every type of database hook it supports. And partial restores would be made impossible by this method too, as now all the dumps are within one large file.
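The prefix selection described above could look roughly like the following sketch. The function name, the `<prefix>-<timestamp>` archive layout, and the example archive names are assumptions made for illustration; the `--prefix` switch on `restore` is only a proposal in this thread.

```python
def select_archives(archive_names, known_prefixes, requested):
    """Return the archives that belong to the requested prefixes.

    A requested value of "*" selects every configured prefix.
    """
    if "*" in requested:
        prefixes = list(known_prefixes)
    else:
        prefixes = [prefix for prefix in known_prefixes if prefix in requested]
    # Match on "<prefix>-" so that "db1" does not also select "db10-...".
    return [
        name
        for name in archive_names
        if any(name.startswith(prefix + "-") for prefix in prefixes)
    ]


archives = [
    "db1-2020-04-20T02:00:01",
    "db2-2020-04-20T02:03:12",
    "fs-2020-04-20T02:05:40",
]
print(select_archives(archives, ["db1", "db2", "fs"], ["db1", "db2"]))
print(select_archives(archives, ["db1", "db2", "fs"], ["*"]))
```

The first call keeps only the two database archives; the second keeps all three.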
Okay, so to make sure I understand.. You're suggesting that instead of (or in addition to) taking an archive name, `borgmatic restore` could accept a partial archive name.. Basically, a common timestamp shared by all archives made during the same borgmatic run. And then `--prefix` could allow specifying one or more prefixes (database names, really) to select the particular database archives to restore.
One modification I might suggest to that is to replace `--prefix` with the existing `borgmatic restore --database` flag. Makes it less about the underlying implementation detail (prefixes) and more about what the user wants: Restore these databases.
One other detail we'd have to account for: Some folks use archive prefixes to distinguish archives from different client machines, or even different applications on a single client machine. So we'd have to not break that, and maybe even allow that feature to interact with multiple per-archive database dumps.
Tangentially related: https://github.com/borgbackup/borg/issues/2300 and https://github.com/borgbackup/borg/issues/3945
Ah, good point about the `--database` flag, I missed that in the docs. Yeah, that makes more sense than adding a new prefix flag then.
I also just realized that I made a wrong assumption about the archive names in my previous post; the individual archives may have slightly different timestamps, so we would indeed need some common timestamp. I guess wildcards would be an option for that (so, `--archive "2020-04-20*" --database db1,db2`), but yeah, that's a lot less pretty than what I originally had in mind. Sorry about that.
As for the use case of having multiple borgmatic instances back up to the same repository, one simple option would be to make this behavior opt-in - then users could set up appropriate prefixes (or database archive names really) themselves if they want to utilize this new behavior.
I guess borgmatic could alternatively include some form of machine/instance uuid as sort of a "super-prefix" in front of every archive it creates. That would allow for separation, but it would mean adding yet more complexity on top of archive names. In that case, a full archive name would be something like `<instance_prefix>-<database_prefix>-<archive_name>`. I'm not sure if that's a great way going forward to be honest. So yeah, I think I prefer the former - make it an opt-in option and thus allow existing users to set up appropriate prefixes themselves.
Oh, and apologies if my suggestions are a bit vague in some parts, it's been a while since I opened this issue and it was really more of a brainstorm idea - though one that I would love to see implemented. To elaborate a little, I have several MariaDB and Postgres servers hosting databases that are > 100G in size. With the current borgmatic, my only options are to only pipe one database per run, or dump 100s of gigabytes to a temporary file every night, which is not practical due to space and performance constraints. That's why I originally suggested this feature; it would fit my use case perfectly.
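As a small illustration of the wildcard idea mentioned above (`--archive "2020-04-20*" --database db1,db2`), here is a sketch using Python's `fnmatch`. It assumes archives are named `<database>-<timestamp>`, which is only one of the layouts floated in this thread, and the function name is made up.

```python
from fnmatch import fnmatchcase


def match_archives(archive_names, archive_pattern, databases):
    """Return archives for the given databases whose timestamp part matches
    the shell-style pattern. Assumes "<database>-<timestamp>" archive names."""
    patterns = [f"{database}-{archive_pattern}" for database in databases]
    return [
        name
        for name in archive_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


archives = [
    "db1-2020-04-20T02:00:01",
    "db2-2020-04-20T02:03:12",
    "db1-2020-04-21T02:00:02",
    "db3-2020-04-20T02:04:00",
]
print(match_archives(archives, "2020-04-20*", ["db1", "db2"]))
```

The wildcard papers over the slightly different per-archive timestamps, at the cost of matching every run made on that day.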
No worries.. One potential solution would be for borgmatic to use a common timestamp for all archives created in a single borgmatic run. That however would mean departing from the current `{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}` format, which relies on Borg filling out the timestamp.. and is therefore different each time Borg is invoked for a database.
In regards to the topic of prefixes, I'm not sure this makes a material difference, but how about suffixes! Hear me out..
Most users will just have archive names that look like:
Then, users that want to effectively tag archives with hostname:
Finally, some users will want to dump to per-database archives as well:
The main reason I suggest a suffix is because, logically, the per-database dumps all seem tied to the same archive.. They just happen to be spread across multiple archives because.. implementation details.
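To show why a suffix keeps the per-database archives logically tied together, here is a sketch that groups archive names back into one entry per borgmatic run. The `<base name>-<database>` layout, the example names, and the requirement that all archives of a run share exactly the same base name are assumptions for illustration.

```python
from collections import defaultdict


def group_by_run(archive_names, database_names):
    """Group archives by their shared base name.

    Returns {base_name: {database_name_or_None: archive_name}}, where the
    None key holds the regular (non-database) archive of that run.
    """
    runs = defaultdict(dict)
    for name in archive_names:
        for database in database_names:
            suffix = "-" + database
            if name.endswith(suffix):
                # Strip the suffix to recover the base name shared by the run.
                runs[name[: -len(suffix)]][database] = name
                break
        else:
            # No database suffix: this is the regular filesystem archive.
            runs[name][None] = name
    return dict(runs)


archives = [
    "myhost-2020-04-20T02:00:00",
    "myhost-2020-04-20T02:00:00-db1",
    "myhost-2020-04-20T02:00:00-db2",
]
print(group_by_run(archives, ["db1", "db2"]))
```

With this grouping, restoring "everything from one run" means restoring every archive under a single base name.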
Yup, that use case makes sense to me, and I think probably resonates with many other users as well.
You're right, that sounds like a good idea to me as well! It makes more sense structurally and should make selecting archives easier too:
Upon creating a new backup, borgmatic could generate one fixed timestamp that will then be used for all archives. Then every archive name from that run would carry the same timestamp.
From there borgmatic should be able to use Borg's `--prefix` flag to select the archives a user may request with `--archive` or `--database`. This seems like a pretty clean and straightforward implementation to me.
Sounds good! Thanks for filing and brainstorming.
An update.. I've done a fair amount of prototyping, and I've run into some challenges on the separate-archive-per-database-dump approach. The biggest hurdle is simply in making sure that all archive dumps made at the same time have the same timestamp without breaking the existing `archive_name_format` feature. Some of the options for that:
- Have borgmatic expand `{now}` and `{utcnow}` in archive names instead of Borg, so we can give all archives made at the same time the same timestamp. However, this seems like a lot of duplicated code that we'd have to maintain.
- Use the `borg create --timestamp` flag. However, it turns out that this doesn't actually impact placeholders like `{now}` and `{utcnow}` as one might expect. I could file a ticket on Borg to implement that, but then older versions of Borg will still not have that feature.
- Use `borg create --comment` to give all archives created at the same time the same comment, and somehow correlate that way. Seems hacky though, and won't show up in `borgmatic list` unless that output format is also changed. Same problems with using `borg create --timestamp` and then just correlating on archive start time.
However, there's good news! I found a way to stuff all of the database dumps into a single archive.. while doing full streaming and avoiding any extra disk storage. Since I have a working prototype of that, I'll elaborate on the approach on #258 and close this particular ticket.
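For reference, the idea of filling in `{now}` and `{utcnow}` outside of Borg could be sketched like this. It is not what borgmatic does; the function name and the suffix layout are assumptions. It leans on the fact that Python's `datetime` accepts strftime-style format specs inside `str.format`, and it only knows the three placeholders shown, so any other Borg placeholder in the format would raise a `KeyError`.

```python
from datetime import datetime, timezone


def render_archive_names(name_format, hostname, suffixes, moment=None):
    """Expand the placeholders once, then reuse the result for every archive
    of a run so that they all share the same timestamp."""
    moment = moment or datetime.now(timezone.utc)
    base = name_format.format(
        hostname=hostname,
        now=moment.astimezone(),  # local time, as Borg's {now} would be
        utcnow=moment.astimezone(timezone.utc),
    )
    # One regular archive plus one suffixed archive per database.
    return [base] + [f"{base}-{suffix}" for suffix in suffixes]


names = render_archive_names(
    "{hostname}-{utcnow:%Y-%m-%dT%H:%M:%S}",
    "myhost",
    ["db1", "db2"],
    moment=datetime(2020, 4, 20, 2, 0, 0, tzinfo=timezone.utc),
)
print(names)
```

The maintenance concern raised above still applies: every placeholder Borg supports would have to be mirrored here to stay compatible with existing `archive_name_format` values.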
I do want to thank you for filing it however, as playing around with the approaches mentioned above is what led me to this new approach!
Good to hear! Looking forward to the finished feature, however it ends up being implemented :)
That said, if we ever go with the approach discussed above, I would personally choose the second option, given that it seems to have the fewest side effects, although it will increase borgmatic's footprint by a fair bit.
And thank you for taking the time to discuss this - I really enjoyed brainstorming this idea to its logical conclusion!
Agreed on option two.. I'm just really lazy and apparently would rather write a bunch of multiprocess code than write a bunch of date format parsing code. 😄