Multi-Repository Database Backups should re-use the same DB Dump #846
What I'd like to do and why
Borgmatic version: 1.8.9
Hey Witten, I'm still using this great project all these years after contributing better `pg_dumpall` support!

Context: Borgmatic is configured to back up to two repositories (redundant), and there is a `postgresql_databases` configuration defined.

Under this multi-repo configuration, when `borgmatic create` runs, these actions happen (I'm omitting some irrelevant details):

1. Dump (`dump_data_sources`) the postgres database using `pg_dump` to a file
2. `borg create ...` to the first repository
3. Remove (`remove_data_source_dumps`) the postgres dump created earlier
4. Dump (`dump_data_sources`) the postgres database using `pg_dump` to a file
5. `borg create ...` to the second repository
6. Remove (`remove_data_source_dumps`) the postgres dump created earlier
7. `borgmatic create` finishes

Borgmatic calls `pg_dump` twice for one call of `borgmatic create`: one call of `pg_dump` per repository. What this means is that the results of `pg_dump` can differ from the first call to the second if the database changes while backing up to the two repositories.

What I think should happen:
1. `dump_data_sources`
2. `borg create ...` (first repository)
3. `borg create ...` (second repository)
4. `remove_data_source_dumps`
This way guarantees that the contents of the database dump (e.g. the `pg_dump` output) are the same for both repositories/invocations of `borg create`.

I've contributed to borgmatic in the past, and I'm happy to make this modification and do the PR too. Since this is more than just a bug fix, I wanted to run this idea past you first to see if it's something borgmatic is interested in.
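For concreteness, here's a minimal sketch of that reordering in Python. The `dump_database` helper, the dump directory, and the archive naming are all illustrative stand-ins, not borgmatic's actual `dump_data_sources`/`remove_data_source_dumps` internals:

```python
import subprocess
from pathlib import Path

def dump_database(name: str, dump_dir: Path = Path("/tmp/borgmatic-dumps")) -> Path:
    # Hypothetical stand-in for dump_data_sources: run pg_dump once and
    # write the result to a real file on disk.
    dump_dir.mkdir(parents=True, exist_ok=True)
    dump_path = dump_dir / f"{name}.sql"
    with dump_path.open("wb") as dump_file:
        subprocess.run(["pg_dump", name], stdout=dump_file, check=True)
    return dump_path

def create_backups(repositories: list[str], databases: list[str]) -> None:
    # Proposed ordering: dump once, run "borg create" once per repository
    # against the same dump files, and only then remove the dumps.
    dump_paths = [dump_database(database) for database in databases]
    try:
        for repository in repositories:
            subprocess.run(
                ["borg", "create", f"{repository}::db-{{now}}",
                 *(str(path) for path in dump_paths)],
                check=True,
            )
    finally:
        # remove_data_source_dumps equivalent: clean up once, at the end.
        for path in dump_paths:
            path.unlink(missing_ok=True)
```

Because the dump files survive until the `finally` block, every `borg create` invocation archives byte-identical dump contents.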
I know this might be a significant change in default behaviour; if that's the case, then this behaviour could be implemented behind an optional flag in the data source configurations.
Thanks!
Other notes / implementation ideas
No response
Thanks for taking the time to file this, write up the detailed description, and offer to tackle a PR! The current borgmatic multiple-dump behavior you've pinpointed is actually a side effect of how dumps work right now: each dump is streamed directly from `pg_dump` to Borg without hitting disk. This means it wouldn't be possible to do the "dump once" approach without forgoing that optimization. I realize that dumping twice might seem wasteful, but this approach makes the intentional trade-off of saving disk space, which can be quite important, especially for larger databases where the disk can fill up if an intermediate dump file is used.

Could you maybe talk about why you want/need identical database dump contents in each repository's backup? Even when backing up plain old files on disk, backups to two separate repositories won't be identical, because files can change between the intervening backups.
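For readers unfamiliar with that streaming setup, it looks roughly like the sketch below. This is an assumed simplification rather than borgmatic's actual code; it relies on Borg's `--read-special` flag to read the contents of a named pipe:

```python
import os
import subprocess
import tempfile

def stream_dump_to_borg(database: str, repository: str) -> None:
    # Sketch: pg_dump writes into a named pipe and Borg reads from it,
    # so the dump never lands on disk as a regular file.
    with tempfile.TemporaryDirectory() as tmp_dir:
        pipe_path = os.path.join(tmp_dir, f"{database}.sql")
        os.mkfifo(pipe_path)

        # Start Borg first; --read-special makes it archive the pipe's
        # contents instead of the pipe device itself.
        borg = subprocess.Popen(
            ["borg", "create", "--read-special",
             f"{repository}::{database}-{{now}}", pipe_path],
        )
        # Opening the pipe for writing blocks until Borg opens it for
        # reading; then pg_dump streams directly into it.
        with open(pipe_path, "wb") as pipe:
            subprocess.run(["pg_dump", database], stdout=pipe, check=True)

        borg.wait()
```

The relevance to this issue is visible in the sketch: the pipe's contents are consumed as they're produced, so there's nothing left over to reuse for a second repository.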
My pleasure! Thanks for the great project. I'll start by answering your questions:

1. Consistency: I'd prefer the database dump to be identical in both backups. I'd hate to be in an (albeit very unlikely) scenario where restoring from one backup results in different data than a seemingly identical backup from a different location.
2. I had thought of this too; I didn't know how to mention it in the initial description without getting too deep into possibly irrelevant details, haha.

You're right, and what you described is an unavoidable reality when it comes to backing up plain files on most filesystems. I work around this by taking a `zfs snapshot` in a pre-backup hook and pointing borgmatic at that snapshot; I know many don't have that choice.

Though they are very related behaviours, I think there is room to improve the database dumping to make it atomic for both/all repositories that are configured.
Interesting, I hadn't realized that, makes sense.
Those are all very good points. There are multiple approaches I see here that each provide a solution, and caveats, to my issue.

1: Easier: Optional config

- Add an optional config to dump `pg_dump` to an actual file, so the same dump can be reused for every repository.

2: More complex: make backups to multiple repositories parallel (see the sketch after this list)

- Run one `borg create` per repository, in parallel, with each reading from `pg_dump`, and synchronize the processes
- Still stream `pg_dump` (no intermediate dump file), but to a multiplexer process
- The multiplexer streams the `pg_dump` data to all processes running `borg create`
- I don't know the throttling behaviour of `pg_dump` right now, though I assume that if the `fd` it's writing to slows down, it too slows down.

Happy to discuss this further.
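To make the multiplexer idea concrete, here's a rough sketch under a couple of assumptions: it uses Borg's support for reading an archive member from stdin (the `-` path plus `--stdin-name`), and it relies on plain blocking pipe writes for the synchronization:

```python
import subprocess

CHUNK_SIZE = 64 * 1024

def multiplexed_backup(database: str, repositories: list[str]) -> None:
    # Read pg_dump's output once and fan each chunk out to one borg
    # process per repository.
    dump = subprocess.Popen(["pg_dump", database], stdout=subprocess.PIPE)
    borgs = [
        subprocess.Popen(
            ["borg", "create", "--stdin-name", f"{database}.sql",
             f"{repository}::{database}-{{now}}", "-"],
            stdin=subprocess.PIPE,
        )
        for repository in repositories
    ]
    while chunk := dump.stdout.read(CHUNK_SIZE):
        for borg in borgs:
            # A blocking write: if one borg falls behind, this stalls,
            # which in turn back-pressures pg_dump.
            borg.stdin.write(chunk)
    for borg in borgs:
        borg.stdin.close()
        borg.wait()
    dump.wait()
```

The blocking writes mean the whole pipeline runs at the pace of the slowest repository, which matches the intuition above that `pg_dump` slows down when the `fd` it's writing to does.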
I apologize for the delay in getting back to this. Working through some thoughts on each of the approaches:
Option 1: reusing non-streaming file dumps
There actually already is a code path for non-streaming PostgreSQL database dumps in borgmatic, and that's to support directory-format dumps. Those don't stream directly to Borg because they can't; they are by nature composed of multiple files. So in theory that code path could be generalized to support non-streaming dumps for other use cases such as this one. I could see having a configuration option under `postgresql_databases:` that disables streaming for a particular database.

As for reusing the same database dump file across repositories, that's (perhaps surprisingly) a little more complicated! That's because borgmatic is currently very aggressive about cleaning up the named pipes used for streaming, as Borg has a habit of hanging when trying to read stale ones that get left behind. So the place where database dumps and pipes are removed (right after each repository uses them) would have to be changed to accommodate your use case. It's certainly possible, but perhaps a little annoying to implement.
Option 2: parallelization and multiplexing
You're correct that this would be a pretty significant change to borgmatic! Supporting parallelization is a topic that has come up in the past (#227, #291), and I'm not 100% convinced the complexity trade-off is worth it. But I'm also not against investigating the trade-offs a little further.
The multiplexer idea sounds interesting, and yeah, it would take some experimentation to see whether it's even possible given `pg_dump`'s behavior. An alternative to throttling, though, would be internal buffering in the multiplexer, such that it can support multiple Borg readers consuming data at different rates by keeping a sliding cache of the `pg_dump` data internally. That has the downside, though, of potentially running out of memory if the multiple readers have wildly differing consumption rates. (Perhaps the cache could be file-backed to dodge that particular issue?)

So maybe throttling is the preferred approach, although that might actually eliminate some of the benefits of parallelization, because you're giving up some of your parallel performance gains!
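A quick, entirely hypothetical sketch of the file-backed variant, to make that trade-off concrete. For simplicity it spools the whole dump rather than keeping a true sliding cache, so a real implementation would additionally trim data already consumed by every reader:

```python
import subprocess
import tempfile
import threading
import time

CHUNK_SIZE = 64 * 1024

def buffered_multiplexed_backup(database: str, repositories: list[str]) -> None:
    # pg_dump output is spooled to a temporary file; each Borg reader tails
    # the spool at its own pace, so slow readers no longer throttle pg_dump.
    spool = tempfile.NamedTemporaryFile()
    done = threading.Event()

    def write_spool() -> None:
        dump = subprocess.Popen(["pg_dump", database], stdout=subprocess.PIPE)
        while chunk := dump.stdout.read(CHUNK_SIZE):
            spool.write(chunk)
            spool.flush()
        dump.wait()
        done.set()

    def feed_borg(repository: str) -> None:
        borg = subprocess.Popen(
            ["borg", "create", "--stdin-name", f"{database}.sql",
             f"{repository}::{database}-{{now}}", "-"],
            stdin=subprocess.PIPE,
        )
        offset = 0
        with open(spool.name, "rb") as tail:
            while True:
                tail.seek(offset)
                chunk = tail.read(CHUNK_SIZE)
                if chunk:
                    borg.stdin.write(chunk)
                    offset += len(chunk)
                elif done.is_set():
                    # No more writes can happen; drain whatever remains.
                    borg.stdin.write(tail.read())
                    break
                else:
                    time.sleep(0.05)  # wait for the writer to catch up
        borg.stdin.close()
        borg.wait()

    threads = [threading.Thread(target=write_spool)] + [
        threading.Thread(target=feed_borg, args=(repository,))
        for repository in repositories
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    spool.close()
```

Note that the memory concern becomes a disk-space concern here: the spool holds the full dump until the slowest reader finishes, which starts to resemble option 1's intermediate-file trade-off.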
Anyway, just based on what I know now, option 1 seems more tractable—and certainly doesn't preclude something like option 2 down the road.