#232 database snapshots

Open
opened 1 month ago by anarcat · 5 comments
anarcat commented 1 month ago

What I’m trying to do and why

A corollary to #231 is *database*-level snapshots. It would be great to be able to do *binary* backups of the database instead of the current "SQL dump" approach. The downside of "SQL dumps" is that they can take a very long time to load for large backups, and take a lot of extra space. By using filesystem snapshots or just locking the database for the backup's duration, you get the binary files directly, and restoring from backups is just like restoring the database from a crash…

Other notes / implementation ideas

Many databases support issuing a command to flush everything to disk and prepare for catastrophe. You have [`FLUSH TABLES WITH READ LOCK` in MySQL/MariaDB](https://dev.mysql.com/doc/refman/8.0/en/flush.html#flush-tables-with-read-lock), which needs to be followed by an `UNLOCK TABLES` when finished with the backup. There's no equivalent that I could find on the other side, but there *is* [`CHECKPOINT` in PostgreSQL](https://www.postgresql.org/docs/9.1/backup-file.html), which flushes all WAL files and allows for a shorter recovery. PostgreSQL would require filesystem snapshots for the backups to be consistent, however.
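A minimal sketch of what issuing those commands might look like, in the same shelling-out spirit as borgmatic's existing `pg_dump` integration. The `mysql`/`psql` binaries and their `--execute`/`--command` flags are the stock CLIs; the helper functions themselves are hypothetical:

```python
# Hypothetical helpers that build the flush/lock command lines.
# Only the SQL statements and the stock CLI flags are real.

def mysql_statement_command(statement):
    """Build a mysql CLI invocation that executes a single statement."""
    return ['mysql', '--execute', statement]

def postgres_checkpoint_command():
    """Build a psql CLI invocation that forces a WAL checkpoint."""
    return ['psql', '--command', 'CHECKPOINT;']

# The MySQL sequence brackets the backup: lock first, unlock after.
lock = mysql_statement_command('FLUSH TABLES WITH READ LOCK;')
unlock = mysql_statement_command('UNLOCK TABLES;')
```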

I know that both of those could be implemented with simple shell scripts, but I’m wondering if you’d be interested in merging this directly in the Python code, which might be easier (although I haven’t looked).

Worst case, this issue can track the shell script implementation too. :)

Thanks again!

witten commented 1 month ago
Owner

I'd be interested in merging this directly in the Python code, although be aware that borgmatic is in many ways itself a glorified shell script… written in Python! By that, I mean that its Borg integration is "just" shelling out to Borg. Its existing PostgreSQL integration is "just" shelling out to `pg_dump`. So I could certainly see adding more shelling out to implement what you're talking about here, unless you're thinking of a more direct library-level integration.

In terms of this ask, can you run through the series of events you foresee happening once borgmatic supports this and is properly configured? Is it something like this?

  1. Run flush and/or checkpoint on the database.
  2. Create filesystem snapshot of internal database storage.
  3. Run backups with Borg.
  4. Remove filesystem snapshot.

Let me know if I’m missing something.

anarcat commented 1 month ago
Poster

> So I could certainly see adding more shelling out to implement what you're talking about here… Unless you're thinking of a more direct library-level integration.

There are probably database libs that could help us do things here, but I’m not sure it would be worth the extra dependencies.

> In terms of this ask, can you run through the series of events you foresee happening once borgmatic supports this and is properly configured? Is it something like this?
>
> 1. Run flush and/or checkpoint on the database.
> 2. Create filesystem snapshot of internal database storage.
> 3. Run backups with Borg.
> 4. Remove filesystem snapshot.
>
> Let me know if I'm missing something.

I think that's pretty much it. As with the `Snapshot` class, I would suggest drafting an abstract interface for this, because, for example, MySQL can work without filesystem-level snapshots: `FLUSH TABLES WITH READ LOCK` provides a consistent, unchanging view of the database as long as the lock is held. It does mean the database is basically read-only the entire time, so we *usually* use snapshots there as well in production environments, but it's not absolutely mandatory.

You *will* need to *hold* that database lock, so borgmatic will need to keep a MySQL client connection alive for the duration of the backup. That might be annoying to do with the command-line client, as you'd need to open some sort of "dialog" with a `mysql` pipe, which could easily go wrong. Using a client library might make sense in this case, if we really don't want to depend on filesystem-level snapshots for this database type.
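A self-contained sketch of that long-lived "dialog": open a client subprocess, write the lock statement, keep the pipe (and thus the lock) open while the backup runs, then write the unlock. `cat` stands in for the real `mysql` client here so the sketch actually runs; against a real server you would obviously swap in `mysql` plus credentials:

```python
import subprocess

# 'cat' is a stand-in for the 'mysql' command-line client: the real point
# is that the process (and its stdin pipe) stays open for the whole
# backup, which is what keeps the read lock held.
session = subprocess.Popen(
    ['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
)
session.stdin.write('FLUSH TABLES WITH READ LOCK;\n')
session.stdin.flush()

# ... the filesystem snapshot and/or Borg backup would happen here,
# while the session (and the lock) stays alive ...

session.stdin.write('UNLOCK TABLES;\n')
output, _ = session.communicate()
```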

PostgreSQL, on the contrary, assumes the backup tool has an atomic snapshot of the filesystem, and there's no way to "lock" the database in the same way, which is probably a good design decision anyway, now that I think of it. So you *need* a frozen filesystem for this, *or* you need to shut down the database, run the backups, and start it again, which is probably unacceptable for most production scenarios.

So yeah:

  1. flush/checkpoint the database, just to make sure the disk has some sort of state copy, keeping the lock in some cases (e.g. MySQL)
  2. create a filesystem snapshot, if available
  3. release the lock if a snapshot is available, or if not, shut down the database in some cases (e.g. PostgreSQL without FS snapshots)
  4. run backups with Borg
  5. release the lock or start the database, if a snapshot is unavailable
  6. remove the filesystem snapshot, if available
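That sequence could be orchestrated along these lines. Every hook name here (`flush`, `lock`, `snapshot`, …) is hypothetical; the part worth keeping is the `try`/`finally` pairing, so a lock or snapshot is always released even when Borg fails:

```python
# Hedged sketch: the hooks object is hypothetical; only the ordering of
# the six steps above is taken from the discussion.

def backup_database(hooks, run_borg_create):
    hooks.flush()                        # 1. flush/checkpoint
    hooks.lock()                         # 1. keep the lock (MySQL case)
    snapshot = None
    try:
        snapshot = hooks.snapshot()      # 2. FS snapshot, if available
        if snapshot:
            hooks.unlock()               # 3. lock no longer needed
        run_borg_create(snapshot)        # 4. run backups with Borg
    finally:
        if snapshot:
            hooks.remove_snapshot(snapshot)  # 6. drop the snapshot
        else:
            hooks.unlock()               # 5. release the held lock
```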

Also note that PostgreSQL has this concept of "clusters", which is a bit alien to the MySQL world. You can basically have multiple PostgreSQL "servers" on the same machine, each with its own namespace for databases, users, processes, locking, and so on. Those can (and should) be independently backed up, and that can be done in parallel.
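Backing the clusters up in parallel could be as simple as fanning one per-cluster backup function out over a thread pool. Both the cluster names and `backup_cluster` are assumptions; only the fan-out pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fan-out: run one backup per PostgreSQL cluster in
# parallel and collect the results in input order.
def backup_clusters(cluster_names, backup_cluster):
    with ThreadPoolExecutor() as executor:
        return list(executor.map(backup_cluster, cluster_names))
```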

So maybe the abstract class would look something like this:

```python
class DatabaseBackup:
    def flush(self):
        # flush the database data to ensure the latest is on disk
        raise NotImplementedError()

    def lock(self):
        # lock the database so no more changes occur; would just be a
        # no-op for pgsql
        raise NotImplementedError()

    def unlock(self):
        # release the lock, if implemented; silent otherwise
        raise NotImplementedError()

    def snapshot(self):
        # snapshot the filesystem, if available - maybe mix in Snapshot here?
        raise NotImplementedError()
```

(Kind of embarrassing, that “on the fly class design”… There’s probably a better way to do this, but it’s late and I just want to brain dump that stuff. ;)

I hope that helps!

witten commented 1 month ago
Owner

Thank you for the detailed answer. The series of steps you laid out makes sense to me, although I don’t know that supporting offline backups (taking the database down) would be useful to that many users! Online backups via snapshots, certainly.

On the topic of PostgreSQL binary backups, have you seen [`pg_basebackup`](https://www.postgresql.org/docs/12/app-pgbasebackup.html)? (Also, a less reference-docs-y [overview](https://www.opsdash.com/blog/postgresql-backup-restore.html#backup-using-pg_basebackup).) That appears to use internal Postgres checkpointing ("[backup mode](https://serverfault.com/questions/543674/what-is-backup-mode-in-postgresql)") to get a consistent snapshot of the data directory. I can only imagine that that's preferable to using external filesystem checkpointing to accomplish the same thing, especially since it can be executed remotely. But I don't know what folks typically do in the wild, or what the differences are in I/O impact for each method.
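A sketch of what shelling out to `pg_basebackup` might look like, matching the way borgmatic already shells out to `pg_dump`. The `--pgdata`, `--format`, and `--host` flags are the stock CLI's; the wrapper function is hypothetical:

```python
# Hypothetical command builder; only the pg_basebackup flags are real.
def pg_basebackup_command(destination, host=None):
    command = ['pg_basebackup', '--pgdata', destination, '--format', 'tar']
    if host:
        # with a host, the base backup runs remotely over the wire
        command.extend(['--host', host])
    return command
```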

Also, I wonder if there's a [comparable approach](https://mariadb.com/kb/en/library/full-backup-and-restore-with-mariabackup/) in MariaDB land, or whether most folks use the [filesystem snapshot approach](https://mariadb.com/kb/en/library/backup-and-restore-overview/#filesystem-snapshots).

anarcat commented 1 month ago
Poster

> Thank you for the detailed answer. The series of steps you laid out makes sense to me, although I don't know that supporting *offline* backups (taking the database down) would be useful to that many users! Online backups via snapshots, certainly.

True, it might be worth ignoring that use case completely.

> On the topic of PostgreSQL binary backups, have you seen [`pg_basebackup`](https://www.postgresql.org/docs/12/app-pgbasebackup.html)? (Also, a less reference-docs-y [overview](https://www.opsdash.com/blog/postgresql-backup-restore.html#backup-using-pg_basebackup).) That appears to use internal Postgres checkpointing ("[backup mode](https://serverfault.com/questions/543674/what-is-backup-mode-in-postgresql)") to get a consistent snapshot of the data directory. I can only imagine that that's preferable to using external filesystem checkpointing to accomplish the same thing, especially since it can be executed remotely. But I don't know what folks typically do in the wild. And what the differences are in I/O impact for each method.

We actually use this as part of the custom backup system here at tor, so it’s for real, and could be used here. I’m just not very familiar with it so I didn’t think of bringing it up.

Main problem with that, IMHO, is that it creates local files, so it duplicates storage space requirements. Maybe there's a way that it could be fed directly into borg as an object instead?
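One possible shape for that, assuming nothing beyond the stock CLIs: `pg_basebackup` can write its tar-format output to stdout (`--pgdata -`, single tablespace only), and `borg create` can read an archive member from stdin when given `-` as the path. This pipeline is untested here, so the invocation is left commented out; only the command lines are constructed:

```python
import subprocess

# Sketch: stream pg_basebackup's tar output straight into borg, so no
# intermediate files hit the disk. Repo/archive names are placeholders.
basebackup = ['pg_basebackup', '--format', 'tar', '--pgdata', '-']
borg = ['borg', 'create', 'repo::postgresql-base', '-']

# Untested wiring of the two, left commented out:
# producer = subprocess.Popen(basebackup, stdout=subprocess.PIPE)
# subprocess.run(borg, stdin=producer.stdout, check=True)
# producer.wait()
```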

> Also, I wonder if there's a [comparable approach](https://mariadb.com/kb/en/library/full-backup-and-restore-with-mariabackup/) in MariaDB land, or whether most folks use the [filesystem snapshot approach](https://mariadb.com/kb/en/library/backup-and-restore-overview/#filesystem-snapshots).

`mariabackup` looks like something that was called `mysqlhotcopy`, if my memory is correct, and yes, it serves a similar purpose.

witten commented 1 month ago
Owner

> Main problem with that, IMHO, is that it creates local files, so duplicates storage space requirements. Maybe there's a way that it could be fed directly into borg as an object instead?

Yup, [there is](https://projects.torsion.org/witten/borgmatic/issues/42)! But that'd be an optimization. (Maybe a necessary one for certain use cases.)
