Streaming database dumps/restores without using extra disk space #258

Closed
opened 2019-11-28 12:01:45 +00:00 by rabievdm · 16 comments

#### What I'm trying to do and why

Some database backups can be rather large; dumping the backup to a compressed archive would save space and disk I/O.
It would be convenient if we could add an option to dump MySQL/MariaDB/PostgreSQL to a compressed image and then mount that image (with something like FUSE) for Borg to back up and de-duplicate.
Backing up a compressed database archive directly may result in less effective de-duplication, or none at all.

#### Other notes / implementation ideas

#### Environment

**borgmatic version:** 1.4.15

**borgmatic installation method:** pip install

**Borg version:** 1.1.10

**Python version:** 3.6.8

**Database version (if applicable):** mysql Ver 15.1 Distrib 5.5.64-MariaDB, for Linux (x86_64) using readline 5.1

**Operating system and version:** CentOS 7

Owner

For MySQL/MariaDB, have you tried adding this to your borgmatic database configuration?

```yaml
options: "--compress"
```

It sounds like you'd like a top-level borgmatic option for compression though?

Author

Hi,

Thanks for that, but no: that would just compress the data between client and server (over the network, if the client is remote from the server).

Here's an example use case:
Let's assume you have a database server with 100GB of data.
To back that up normally and space-efficiently, you would do something like `mysqldump mydatabase some_options | gzip > mysql_backup.sql.gz`, and then you would get some space and I/O savings (less data to write, less time waiting).

Now, borgmatic does almost the same thing, except it's a straight dump to .borgmatic/mysql/etc/etc/, which is then sucked up by Borg, de-duplicated, and deleted.

What we could do is also pass the mysqldump through gzip (we'd have to work around the issue of multiple DBs specified separately), produce a compressed database backup, and then pass that to Borg. In theory I could do this via the pre/post hooks... but from experience, if you make changes to the data (i.e. changes to the database) and re-compress it, then you are less likely to get a good de-duplication ratio; in some cases it won't de-duplicate at all.

So what I thought might be a good option is to leverage the FUSE capability: write the mysqldump to a compressed file, then mount that image so Borg can back up and de-duplicate the dump as if it weren't compressed. That would mean we save space, save I/O, and score on de-duplication.

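A rough sketch of what I mean, assuming a FUSE tool such as `ratarmount` that can mount a plain `.gz` file as its decompressed contents (paths and database name are just placeholders):

```bash
# Dump straight to a compressed file; only the compressed bytes hit disk:
mysqldump --single-transaction mydatabase | gzip -1 > /backups/mydatabase.sql.gz

# Expose the uncompressed SQL read-only via FUSE, using no extra disk space:
mkdir -p /mnt/dump
ratarmount /backups/mydatabase.sql.gz /mnt/dump

# Borg sees (and de-duplicates) the raw SQL, not the gzip container:
borg create /path/to/repo::mydb-dump /mnt/dump

fusermount -u /mnt/dump
```
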
Owner

Ahh, gotcha. Thanks for clarifying; I misunderstood. And I totally get what you're saying about de-duplication on compressed dumps. However, while it may save on space and I/O, it trades that for a fair amount of CPU to compress and then immediately decompress. And I'm not sure how much I/O would actually be saved if it's both compressing and decompressing, so it may only save on space.

Getting back to the core problem we're trying to solve here: Are you finding that database dumps with borgmatic are consuming too much disk space (or I/O) on your system? Are you typically dumping and backing up a single database per archive, or "all" databases per archive? Also, I wonder if this could be prototyped to see whether the performance gains are actually worth it.

In terms of implementation, I wonder if it could be as "simple" as putting `~/.borgmatic` on a [compressed ZFS volume](https://serverfault.com/questions/617648/transparent-compression-filesystem-in-conjunction-with-ext4). Then everything written to it would get transparently compressed and decompressed without further intervention.
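
For example, a minimal sketch, assuming a ZFS pool named `tank` already exists (dataset name and mountpoint are illustrative):

```bash
# Create a transparently compressed dataset and mount it where borgmatic
# writes its temporary database dumps:
zfs create -o compression=lz4 -o mountpoint=/root/.borgmatic tank/borgmatic
```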


You could pipe the database backup directly to borg via stdin. It would definitely add some complexity, but it would avoid the temporary file altogether.

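As a hedged illustration (repository path, archive name, and compression settings are placeholders; Borg 1.1+ accepts `-` as a path to read from stdin):

```bash
# Stream the dump straight into a Borg archive; no temporary file is written.
# The streamed content is stored as a file named "stdin" inside the archive.
mysqldump --single-transaction mydatabase \
  | borg create --compression auto,zstd /path/to/repo::mydb-{now} -

# Restoring reverses the pipe:
borg extract --stdout /path/to/repo::mydb-2019-11-28 | mysql mydatabase
```
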
Author

@witten: The primary driver is to avoid doubling up on space *just* so we can do a temporary dump for backup; the disk I/O saving is a bonus. To save on CPU you could just use `gzip -1` or similar.
@drewkett: That would be even better, would replicate what most enterprise tools do, and completely negates the space requirement. Agreed it would complicate the process a little, and you would also have to consider how to do the restore, but that would be first prize.

Owner

> You could pipe the database backup directly to borg via stdin. It would definitely add some complexity, but it would avoid the temporary file altogether.

This [works already](https://projects.torsion.org/witten/borgmatic/issues/42), although without the handy-dandy database hooks. Plus, it's undocumented for borgmatic.

However, the caveat with this approach is that you can only pipe in a single database dump per Borg invocation, meaning a single database dump per archive. You'd have to give up the ability to make several separate database dumps and put them all in the same Borg archive.

If that compromise is okay for your respective use cases, then I could see polishing up and documenting that feature as a solution to this ask.

Owner

In any case, please let me know if witten/borgmatic#42 would serve as a solution for your use case, taking into account the caveat about the single dump per archive. Thanks!

witten added the "waiting for response" label 2019-12-10 20:38:39 +00:00

Hello, this may be a useful option for the pg_dumpall tool, because it dumps all databases in plain-text SQL format. In our case we have a ~90GB databases directory, and it just exhausts disk space. Maybe there could be an option like `compress: bz2`:

```yaml
- name: all
  compress: bz2
```
Owner

Thanks for the suggestion! If your main goal is just reducing temporary disk usage for the dumps, have you looked at using a dump format that takes advantage of PostgreSQL's built-in compression by default? For instance, using borgmatic's default PostgreSQL dump format of "custom" should result in PostgreSQL applying compression, and you can raise the compression level with `options: "--compress=9"`, for instance. According to the [docs](https://www.postgresql.org/docs/12/app-pgdump.html), even the plain-text format supports compression like this; it's just off by default.
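
For example, a borgmatic configuration fragment along these lines (the database name is a placeholder):

```yaml
hooks:
    postgresql_databases:
        - name: mydb
          # "custom" is borgmatic's default dump format and compresses by
          # default; --compress=9 raises pg_dump's compression level.
          format: custom
          options: "--compress=9"
```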

Let me know your thoughts.


But pg_dumpall doesn't have a `--compress` option. :(

Owner

Ah, right you are! Do you think the aforementioned dump-to-Borg-stdin approach would work for your use case, rather than compressing the dump? Note that Borg itself can do compression of its own, so it could still be stored compressed inside the Borg repository.


Now I'm using a pre_hook with just `pg_dumpall | pbzip2`, because I need not only database backups but many other files too. Or have I misunderstood how stdin usage works in borgmatic?

Owner

That hook certainly sounds like it would work for now. What I'm suggesting is some sort of option that would tell borgmatic to stream the output of pg_dumpall directly to Borg's stdin. So under the hood, it would in effect be running `pg_dumpall | borg create --compression auto ...` or similar. The benefit would be that the dump wouldn't ever have to be written to a temporary file, so it wouldn't need to get bzip2'd.


IMHO, it should be as simple as specifying `name: all` in the configuration. :)

Owner

Some progress on this, thanks to the discussion with @AriosThePhoenix on #296 and also some tips from @textshell on `#borgbackup` IRC. The general idea is:

**For database dumps:** For each database to dump, create one Unix named pipe on the filesystem in a directory like `~/.borgmatic/postgresql_databases/myhost/`. Then, stream each database dump into its named pipe. Simultaneously, point `borg create` at `~/.borgmatic/postgresql_databases/myhost/` (among other source directories), so it consumes the dumps from those named pipes as they're created.

What this means is there's zero additional disk use when dumping the databases, and the dumps stream directly to Borg. *And* they all end up in the same archive.
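
As a hedged illustration of the technique (paths, repository, and database names are placeholders, and this is not borgmatic's actual code):

```bash
# Create a named pipe (FIFO) where the dump will appear as a file:
mkdir -p ~/.borgmatic/postgresql_databases/myhost
mkfifo ~/.borgmatic/postgresql_databases/myhost/mydb

# Writer: stream the dump into the pipe in the background...
pg_dump --format=custom mydb > ~/.borgmatic/postgresql_databases/myhost/mydb &

# ...while borg reads from it. --read-special makes borg open the FIFO
# and archive its streamed contents as if it were a regular file.
borg create --read-special /path/to/repo::myhost-{now} \
    /etc /home ~/.borgmatic/postgresql_databases/myhost
wait
```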

**For database restores:** For each database to restore, run a separate instance of `borg extract` for just that database's dump file, writing the dump to stdout and piping it directly to the database restore command. Again, this means zero additional disk space is used for restores.
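
And the restore side, again as a sketch with placeholder paths (the in-archive path depends on how the dump directory was archived):

```bash
# Extract just the one dump file to stdout and pipe it into the restore command:
borg extract --stdout /path/to/repo::myhost-2020-05-01 \
    root/.borgmatic/postgresql_databases/myhost/mydb \
  | pg_restore --dbname=mydb
```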

I have a prototype of this working for both PostgreSQL and MySQL, but it'll take more testing (both manual and automated) to see whether it'll work well in practice and get it ready for use.

If anyone's interested in beta-testing this when it's further along, let me know!

witten removed the "waiting for response" label 2020-05-05 21:58:16 +00:00
witten changed title from "Feature Request: Dump DB backups to compressed image" to "Streaming database dumps/restores without using extra disk space" 2020-05-13 04:08:07 +00:00
Owner

Okay, streaming database dumps are implemented and released in borgmatic 1.5.3. Here are the relevant release notes:

> * Stream database dumps and restores directly to/from Borg without using any additional filesystem space. This feature is automatic, and works even on restores from archives made with previous versions of borgmatic.

Please give this a shot and let me know how it works out for you. These are some fairly major internal changes to the database hooks, and borgmatic in general, so I'd like to find out about and fix any issues that may arise.
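
If you installed with pip, one way to try it out (assuming a user-level install):

```bash
pip3 install --user --upgrade borgmatic
borgmatic --version  # should report 1.5.3 or later
```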
