"Spot check" of source directories consistency check #656

Closed
opened 2023-03-21 22:16:56 +00:00 by witten · 5 comments
Owner

#### What I'm trying to do and why

Today, borgmatic supports a number of [consistency check types](https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/#consistency-check-configuration), most of them implemented by Borg. They all have various trade-offs around speed and thoroughness, but one thing none of them do is check the contents of the backed up archives against the original source files—which is arguably one important way to ensure your archives contain the files you'll ultimately want to restore in the case of catastrophe (or just an accidentally deleted file). Because if you happen to misconfigure borgmatic such that you're not backing up the files you *think* you're backing up, every existing consistency check will still pass with flying colors—and you won't discover this problem until you go to restore.

However, automatically and exhaustively checking an archive's contents against the contents of the source directories on disk has two main problems:

  • It'd be slow: Depending on the source directories, this sort of exhaustive check would be prohibitively slow to run on a regular basis.
  • It'd yield false negatives: Files change over time. (If they didn't, you probably wouldn't need to use Borg for backups at all.) With changing files, an exhaustive consistency check of source directory contents would fail whenever a source directory diverges from a backup archive—rendering this check pretty useless.

So the proposed solution is to run a "spot check" of source directories. The code implementing such a feature would go something like this (a rough sketch follows the list):

  1. Pick a random file from within the configured source directories.
  2. Check that the file exists within the selected backup archive and has the same contents in both places.
  3. Repeat the whole process for some percentage of total source directory files.
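
To make that concrete, here's a minimal sketch of that loop in Python. This is purely illustrative rather than an actual implementation: `source_files` is assumed to be a list of paths gathered from the configured source directories, and `read_file_from_archive()` is a hypothetical helper that returns an archived file's contents (or `None` if the file is missing from the archive).

```
import random


def spot_check(source_files, read_file_from_archive, sample_percentage, tolerance_percentage):
    '''Randomly sample files from the source directories and compare each one
    to its counterpart in the backup archive, tolerating a configured
    percentage of failures. Sketch only; assumes files fit in memory.'''
    if not source_files:
        return

    sample_size = max(1, len(source_files) * sample_percentage // 100)
    failures = 0

    for path in random.sample(source_files, sample_size):
        archived_contents = read_file_from_archive(path)

        with open(path, 'rb') as source_file:
            if archived_contents is None or archived_contents != source_file.read():
                failures += 1

    # Per the proposal, tolerance is measured against the total file count.
    if failures * 100 / len(source_files) > tolerance_percentage:
        raise ValueError(
            f'Spot check failed: {failures} of {sample_size} sampled files differ'
        )
```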

This approach has the benefit of being fast and hopefully not yielding too many false negatives. The main downside is it's probabilistic; it won't catch 100% of source vs. archive consistency problems on any given run. But that might be good enough given the value that it provides over time.

Additionally, to make the tradeoff between false negatives and thoroughness of the check tunable for different source data, there could be borgmatic configuration options for the check:

  • `sample_percentage`: The percentage of total files in the source directories to randomly sample and compare to their corresponding files in the backup archive.
  • `tolerance_percentage`: The percentage of total files in the source directories that can fail a sample comparison without failing the entire consistency check.

Example:

```
consistency:
    checks:
        - name: spot
          frequency: 2 weeks
          sample_percentage: 5
          tolerance_percentage: 2
```

#### Open questions

  • Would this proposed consistency check provide enough value over the existing Borg/borgmatic consistency checks?
  • How feasible would this proposed consistency check be to implement with Borg? It would require effectively diffing random files within archives against their corresponding files on disk. (And no, `borg diff` won't do that.) Is it easy to do that in a performant-enough way? (One possible approach is sketched after this list.)
  • Is this feature just too complex, either to use as an end-user or to implement/maintain as a developer?
  • Should this proposed consistency check only check against the most recent archive in a repository, like the existing `extract` check does?
  • Rather than checking file contents, would it be better to simply check for existence of source directory files within the backup archive? That'd be faster, simpler, and presumably wouldn't have as many false negatives. It'd also still handle the use case of catching a borgmatic include/exclude misconfiguration.
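
On the feasibility question above, one conceivable approach is sketched below: use `borg extract --stdout` to stream a single archived file's contents and compare them against the file on disk. The repository and archive names are placeholders, and a production version would want to stream and hash rather than buffer whole files in memory.

```
import subprocess


def archived_file_matches_disk(repository, archive, path):
    '''Return whether the file at the given path has identical contents on
    disk and in the given Borg archive. Sketch only; buffers the whole file.'''
    # "borg extract --stdout" writes the archived file's contents to stdout.
    # Borg stores paths without a leading slash, hence the lstrip().
    result = subprocess.run(
        ('borg', 'extract', '--stdout', f'{repository}::{archive}', path.lstrip('/')),
        capture_output=True,
    )

    if result.returncode != 0:  # E.g. the path isn't present in the archive.
        return False

    with open(path, 'rb') as source_file:
        return result.stdout == source_file.read()
```

An existence-only variant, as in the last bullet above, could skip the content comparison entirely and instead compare the output of `borg list` for the archive against the sampled source paths.
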
witten added the new feature area label 2023-06-28 18:39:03 +00:00
Author
Owner

Implemented in main and will be part of the next release! See the documentation for actual configuration options: https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/#spot-check

Author
Owner

Released in 1.8.10!

witten referenced this issue from a commit 2024-04-16 17:50:51 +00:00
witten referenced this issue from a commit 2024-04-16 17:53:06 +00:00

Great news that this is progressing.

I'm currently using #760 and have also added a script to check/verify a random set of files, reading up to a set amount of data (2 GB in my usage) with a minimum number of files (I've set it to 5 for now).

I figured that if I read a large file, like a video file, it would trip the 2GB threshold and check only 1 file.

I plan to check this new spot check feature out today.
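
For what it's worth, the sampling policy described above might look something like this in Python. This is just my reading of the comment, not the actual script: keep picking random files until a data budget is spent (2 GB here), but always verify at least a minimum number of files (5 here).

```
import random

DATA_BUDGET_BYTES = 2 * 1024 ** 3  # Stop sampling once ~2 GB has been picked.
MINIMUM_FILES = 5  # ... but always verify at least this many files.


def pick_files_to_verify(paths_and_sizes):
    '''Given (path, size in bytes) tuples, pick a random subset to verify,
    bounded by a data budget but with a minimum file count.'''
    shuffled = random.sample(paths_and_sizes, len(paths_and_sizes))
    picked = []
    picked_bytes = 0

    for path, size in shuffled:
        if picked_bytes >= DATA_BUDGET_BYTES and len(picked) >= MINIMUM_FILES:
            break

        picked.append(path)
        picked_bytes += size

    return picked
```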

Author
Owner

Ah, yeah, the spot check feature doesn't currently have a threshold for file sizes, but that would be an interesting thing to add. And I'd love to hear about any feedback you have as you try out the feature!


Hi, I wanted to give it a few tries before giving feedback.

My usage for spot check is verifying immediately after backup. I ran some backups with the following settings:
count_tolerance_percentage: 0
data_sample_percentage: 1
data_tolerance_percentage: 0

I have verbosity/logging set to 1 (-v 1).

For me, this made sure that all of the files are backed up (file count) and that none of the compared files differ (file compare). It's pretty much what my current verify scripts do: compare the file count, then compare a random set of files (mine is set to compare 1.5 GB of files).

  1. It caught some Borg files that change during the backup, which caused the spot check to fail. For me, this was a success, as I hadn't seen this before (my other verify scripts caught it too). This change may have occurred after a Borg update. I excluded these files and ran the backup again, and this time the spot check passed.

  2. When a spot check fails, it would be helpful to know the file counts (backup and repository) and also which files failed the compare check. I had a small number of files fail the check during a backup; it reported 0.00%, and I couldn't tell where the problem was or how many files were affected.

  3. The spot check failed once and then retried 5 more times. Is this the same retry setting? Does it need to retry at all, given that it would presumably get the same result each time? All the other checks passed, but they were rerun each time. Do I need to change settings?

  4. It works on both backup and check. If I run check, it sometimes fails on the first try (only on one of my repositories) and then passes on the second try. I couldn't find what made it fail (same issue as number 2).

  5. I think it would be helpful to know how much data has been verified, as well as the number of files, e.g. Verified: 1% (2000 files, 1.5 GB).

  6. If the spot check fails during backup, does the backup fail and get deleted, or does something else happen?

This is working really well.
