How to manage the created pipes for backing up Postgres dumps? #435

Closed
opened 2021-07-23 10:51:44 +00:00 by ams_tschoening · 3 comments

#### What I'm trying to do and why

I have multiple VMs hosting PostgreSQL and most of those currently use `pg_dump`. Some of them create pretty small dumps and store them locally within the VMs, while others create pretty large ones of hundreds of GiB and store them on an external NAS. For privacy reasons, the large dumps are encrypted within the VM and then written to the NAS using NFS. The NAS implements a retention policy using BTRFS snapshots.

This process has multiple downsides: Storing dumps locally within the VMs increases their I/O, bloats the image files etc. In the case of the NAS, because it only sees encrypted files, the BTRFS snapshots can't be used as efficiently as otherwise possible, because each dump is entirely new data, even though many databases and tables don't change that often at all and mostly only gain new rows.

In theory, the combination of Borg and BorgMatic would address all those issues: Borg takes care of encrypted, compressed and especially de-duplicated storage of backups, which is most likely the most efficient way I could store them at all. BorgMatic, on the other hand, is able to stream the dumps using named pipes, so that I really have dump files in Borg, but without the need to write them temporarily within the VM.

BUT: I would like to set up my VM server ONLY to take care of backups. That's easily possible for backing up the VM images and somewhat easily for the VM contents as well. In the latter case I simply mount the file systems using SSHFS, which is a bit slow but seems to work. The benefit is that I need to set up Borg+BorgMatic on one server only, which takes care of the Borg cache, the keys to access the repos etc. Though, that approach is incompatible with the named pipes created by BorgMatic for database dumps. SSHFS simply doesn't know of those concepts.
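
For reference, the SSHFS part is nothing fancy, just a read-only mount of the guest's file system on the backup host, roughly like this (host name and mount point are only examples):

```bash
# Mount the VM guest's root file system read-only on the backup host;
# host name and mount point are just examples.
sshfs root@vm.example.org:/ /mnt/vm-root -o ro,reconnect
```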

The good news is that I don't need a consistent backup of the whole system including database dumps. Instead, I considered using different repos for both right from the start anyway. The reason simply is that the two kinds of data are totally different, so that in my opinion it doesn't make too much sense to put them into the same repos. Instead, I guess always having the same kind of data per repo increases the chances that things can be de-duplicated etc. Additionally, that makes the setup easier, because I can keep SSHFS for file-based backups and use some different solution for the database dumps.

Of course I would like to keep the streaming approach of BorgMatic. That's why I wondered whether it's possible to make BorgMatic use different named pipes instead of those it seems to maintain on its own. That might allow one to create a named pipe in the VM host that is available in the VM guest, which is then used to forward the database dumps through that pipe and back them up just like BorgMatic/Borg would do if things were running entirely within the VM.

I'm using QEMU, which seems to allow [creating a named pipe from host to guest](https://wiki.qemu.org/Features/ChardevFlowControl#Named_pipe_backend):

```bash
[host]# mkfifo /tmp/serial0
[host]# qemu-kvm -enable-kvm -m 1024 -nographic \
          -device virtio-serial-pci -chardev pipe,id=ch0,path=/tmp/serial0 \
          -device virtserialport,chardev=ch0,name=serial0 \
          -drive file=disk.img,if=virtio,boot=on
[guest]# while :; do sleep 1; date; done &
[guest]# cat /proc/kallsyms > /dev/virtio-ports/serial0
[host]# cat /tmp/serial0
```

It would be great if you could provide some thoughts about how your named-pipe support works right now and whether something like this would be possible at all, under which circumstances etc.

Thanks for your help!

#### Environment

**borgmatic version:** 1.5.15
**borgmatic installation method:** pip, system-wide
**Borg version:** 1.1.16
**Python version:** 3.8.10
**operating system and version:** Ubuntu 20.04

Owner

Sorry for the delay in getting to this one.

> The benefit is that I need to set up Borg+BorgMatic on one server only, which takes care of the Borg cache, the keys to access the repos etc. Though, that approach is incompatible with the named pipes created by BorgMatic for database dumps. SSHFS simply doesn't know of those concepts.

Hmm.. I'm not following why that's the case. borgmatic's database dumping support can be used with either a local or remote database server. So in theory, you could expose the VM database on a port accessible by borgmatic, and then have it dump the database that way, creating any named pipes as necessary on the machine running borgmatic. Granted, this wouldn't have the property of pre-encrypting the database within the VM. But that detail aside, would this work?
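
Under the hood this just relies on `pg_dump` being able to dump a remote server over the network, roughly like the following (host name, port, user and database are only placeholders):

```bash
# pg_dump connecting to a remote PostgreSQL server; all names below are placeholders.
pg_dump --host=vm.example.org --port=5432 --username=backup --format=custom mydb > mydb.pgdump
```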

> Of course I would like to keep the streaming approach of BorgMatic. That's why I wondered whether it's possible to make BorgMatic use different named pipes instead of those it seems to maintain on its own. That might allow one to create a named pipe in the VM host that is available in the VM guest, which is then used to forward the database dumps through that pipe and back them up just like BorgMatic/Borg would do if things were running entirely within the VM.

Interesting approach, but (maybe due to my lack of imagination) I'm not seeing why this is necessary! If you can connect to the VM database from borgmatic, then the existing built-in named-pipe streaming should still work... unless of course I'm missing something about your setup.

> It would be great if you could provide some thoughts about how your named-pipe support works right now

The short of it is that borgmatic invokes PostgreSQL's `pg_dump` locally, which can connect to a local or remote server and (if necessary) stream the dump over the network via an encrypted, authenticated connection. Then borgmatic sets up a local named pipe that's written to on one end by `pg_dump` (spewing database contents) and read on the other end by Borg (consuming the database as part of creating an archive).
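
In rough shell terms it boils down to something like this; the pipe path below is just a placeholder, borgmatic picks its own location for the dump pipes:

```bash
# Rough sketch of what borgmatic does internally; paths and names are placeholders.
mkfifo /tmp/mydb.pgdump

# pg_dump writes into the pipe (connecting locally or to a remote server)...
pg_dump --format=custom mydb > /tmp/mydb.pgdump &

# ...while Borg reads from it; --read-special makes Borg archive the pipe's
# contents instead of the pipe itself as a special file.
borg create --read-special /path/to/repo::mydb-{now} /tmp/mydb.pgdump
```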

Let me know your thoughts!

Author

> Hmm.. I'm not following why that's the case. borgmatic's database dumping support can be used with either a local or remote database server. So in theory, you could expose the VM database on a port accessible by borgmatic [...]

Yes, that would work, but it requires a special network setup between the backup host and the VMs and all the database dumpers installed on the backup host. I hoped I could avoid both by using a more generic approach like the one described using QEMU. Though, I have abandoned that approach in favor of something else already:

As SSH is available already, I simply use that to execute `mysqldump`, `pg_dump` etc. remotely myself. That would have been necessary with the QEMU approach anyway. Those tools simply stream their dumps to STDOUT and SSH forwards that locally. If SSHFS is fast enough, dumping through SSH directly is of course fast enough as well. Afterwards one has two choices of how to deal with that data. The first is to dump it directly into some file which gets backed up by BGM like any other file. The benefit of this approach is simply that no additional I/O happens within the VMs and the host most likely has much more storage available. The downside is that dumps need to be written first and additionally read (and deleted) afterwards, which increases backup times.
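
The file-based variant is simply something like this (host, database and target path are only examples):

```bash
# Dump a remote database over SSH into a plain file on the backup host; names are examples.
ssh root@vm.example.org 'pg_dump --format=custom mydb' > /backups/vm/mydb.pgdump
```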

Though, there's one more optimization possible: instead of dumping the output of SSH into a plain file, one can manually create a named pipe using `mkfifo` in the file system and dump into that. This should be pretty much what BGM is doing with what you have implemented already. The only important thing is that SSH needs to keep running in the background while BGM actually backs up those named pipes, using the option `--read-special` for Borg. Creating the pipes and dumping into them can be implemented using a BEFORE-hook and e.g. `nohup ssh -f -n [...] &`. When BGM starts reading from the pipe, the actual dump starts within the VM, and when that finishes, EOF is read by BGM and the backup for that pipe is finished. One can create many of those pipes beforehand and start SSH dumpers in the background, all of them blocking until a reader like BGM connects. BGM simply gets some directory configured which contains lots of named pipes at backup time.
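
As a sketch, the BEFORE-hook then boils down to something like the following (paths, host and database are only examples), with Borg reading the pipes using `--read-special`:

```bash
# Run from a borgmatic BEFORE-hook; paths, host and database are only examples.
mkdir -p /backups/pipes
mkfifo /backups/pipes/mydb.pgdump

# Opening the pipe for writing blocks in the background until a reader
# (Borg with --read-special) connects; only then does the remote dump start streaming.
nohup ssh -f -n root@vm.example.org 'pg_dump --format=custom mydb' \
    > /backups/pipes/mydb.pgdump &
```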

This is exactly what I had in mind, doesn't require too many changes in BGM and might even be easier to set up than using QEMU. With SSH, one even has more influence over the encryption, compression and so on being used. `mysqldump` for example outputs plain text by default, possibly bloating the transfer, but that can easily be compressed using SSH on the transport channel while keeping easy-to-read plain text in the backups.
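
With `mysqldump`, for example, compressing the transport is just a matter of `ssh -C`, while the dump stored in the backup stays plain text (names are examples again):

```bash
# -C compresses the SSH transport only; the data written to the pipe stays plain text.
nohup ssh -f -n -C root@vm.example.org 'mysqldump --all-databases' \
    > /backups/pipes/all-databases.sql &
```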

There's only one problem left: according to my tests, BGM doesn't return when the main BEFORE-hook script ends. So if you have some time left, I would really appreciate you looking into that problem instead of this one! :-) I'm going to close here as well, as my described "workaround" and your explicit additional mention of accessing databases over the network seem to make good docs for anyone interested.

#522

Owner

Thanks for the detailed explanation! I'll take a look at #522.
