"Spot check" of source directories consistency check #656

Closed
opened 2023-03-21 22:16:56 +00:00 by witten · 8 comments
Owner

What I'm trying to do and why

Today, borgmatic supports a number of consistency check types, most of them implemented by Borg. They all have various trade-offs around speed and thoroughness, but one thing none of them do is check the contents of the backed up archives against the original source files—which is arguably one important way to ensure your archives contain the files you'll ultimately want to restore in the case of catastrophe (or just an accidentally deleted file). Because if you happen to misconfigure borgmatic such that you're not backing up the files you think you're backing up, every existing consistency check will still pass with flying colors—and you won't discover this problem until you go to restore.

However, automatically and exhaustively checking an archive's contents against the contents of the source directories on disk has two main problems:

  • It'd be slow: Depending on the source directories, this sort of exhaustive check would be prohibitively slow to run on a regular basis.
  • It'd yield false negatives: Files change over time. (If they didn't, you probably wouldn't need to use Borg for backups at all.) With changing files, an exhaustive consistency check of source directory contents would fail whenever a source directory changes relative to a backup archive—rendering the check pretty useless.

So the proposed solution is to run a "spot check" of source directories. The code implementing such a feature would go something like this:

  1. Pick a random file from within the configured source directories.
  2. Check that the file exists within the selected backup archive and has the same contents in both places.
  3. Repeat the whole process for some percentage of total source directory files.

This approach has the benefit of being fast and hopefully not yielding too many false negatives. The main downside is it's probabilistic; it won't catch 100% of source vs. archive consistency problems on any given run. But that might be good enough given the value that it provides over time.
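As a rough illustration (not the eventual borgmatic implementation), here's a minimal Python sketch of that algorithm. It assumes a `repo::archive` spec and uses `borg extract --stdout` to stream each sampled file's archived contents; the helper name and signature are made up for this sketch:

```
import hashlib
import os
import random
import subprocess

def spot_check(source_directories, archive, sample_percentage, tolerance_percentage):
    '''Compare a random sample of source files against an archive's copies.

    "archive" is a Borg archive spec like "/path/to/repo::archivename".
    Hypothetical sketch only; assumes at least one source file exists.
    '''
    # Enumerate all regular files within the configured source directories.
    all_files = [
        os.path.join(dirpath, filename)
        for directory in source_directories
        for dirpath, _, filenames in os.walk(directory)
        for filename in filenames
    ]

    sample_size = max(1, len(all_files) * sample_percentage // 100)
    failures = 0

    for path in random.sample(all_files, sample_size):
        with open(path, 'rb') as source_file:
            source_hash = hashlib.sha256(source_file.read()).hexdigest()

        # Stream the archived copy of the same file and hash it. Borg stores
        # paths without the leading slash.
        result = subprocess.run(
            ('borg', 'extract', '--stdout', archive, path.lstrip('/')),
            capture_output=True,
        )
        if result.returncode != 0:
            failures += 1  # Missing from the archive (or an extract error).
        elif hashlib.sha256(result.stdout).hexdigest() != source_hash:
            failures += 1  # Contents differ between disk and archive.

    # Per the proposed tolerance_percentage semantics, failures are measured
    # against the total source file count rather than the sample size.
    return failures * 100 / len(all_files) <= tolerance_percentage
```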

Additionally, to make the tradeoff between false negatives and thoroughness of the check tunable for different source data, there could be borgmatic configuration options for the check:

  • sample_percentage: The percentage of total files in the source directories to randomly sample and compare to their corresponding files in the backup archive.
  • tolerance_percentage: The percentage of total files in the source directories that can fail a sample comparison without failing the entire consistency check.

Example:

```
consistency:
    checks:
        - name: spot
          frequency: 2 weeks
          sample_percentage: 5
          tolerance_percentage: 2
```

Open questions

  • Would this proposed consistency check provide enough value over the existing Borg/borgmatic consistency checks?
  • How feasible would this proposed consistency check be to implement with Borg? It would require effectively diffing random files within archives against their corresponding files on disk. (And no, borg diff won't do that.) Is it easy to do that in a performant-enough way?
  • Is this feature just too complex, either to use as an end-user or to implement/maintain as a developer?
  • Should this proposed consistency check only check against the most recent archive in a repository, like the existing extract check does?
  • Rather than checking file contents, would it be better to simply check for existence of source directory files within the backup archive? That'd be faster, simpler, and presumably wouldn't have as many false negatives. It'd also still handle the use case of catching a borgmatic include/exclude misconfiguration. (A rough sketch of this variant follows below.)
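To make that last question concrete, here's a hypothetical sketch of the existence-only variant, assuming `borg list --format '{path}{NL}'` to enumerate archive paths (those format placeholders exist in Borg; the function itself is made up):

```
import os
import subprocess

def existence_check(source_directories, archive):
    '''Verify each source file appears in the archive at all, without
    comparing contents. Hypothetical sketch, not borgmatic code.'''
    # Source paths, minus the leading slash to match Borg's stored form.
    source_paths = {
        os.path.join(dirpath, filename).lstrip('/')
        for directory in source_directories
        for dirpath, _, filenames in os.walk(directory)
        for filename in filenames
    }

    # Enumerate every path stored in the archive.
    output = subprocess.check_output(
        ('borg', 'list', '--format', '{path}{NL}', archive), text=True
    )
    archive_paths = set(output.splitlines())

    # Any source path not present in the archive is a failure.
    return source_paths - archive_paths
```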
witten added the new feature area label 2023-06-28 18:39:03 +00:00
Author
Owner

Implemented in main and will be part of the next release! See the documentation for actual configuration options: https://torsion.org/borgmatic/docs/how-to/deal-with-very-large-backups/#spot-check
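For illustration only, using the shipped option names as they appear in the feedback below (values here are arbitrary; the linked documentation is authoritative), a spot check configuration might look something like:

```
checks:
    - name: spot
      count_tolerance_percentage: 10
      data_sample_percentage: 1
      data_tolerance_percentage: 0.5
```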

Author
Owner

Released in 1.8.10!

witten referenced this issue from a commit 2024-04-16 17:50:51 +00:00
witten referenced this issue from a commit 2024-04-16 17:53:06 +00:00

Great news this is progressing.

I'm currently using #760 and have also added a script to check/verify a random set of files, reading up to a set amount of data (2GB in my usage) with a minimum number of files (I've set it to 5 for now).

I figured that if I read a large file, like a video file, it would trip the 2GB threshold and check only 1 file.
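In rough terms, the selection logic is something like this sketch (one plausible reading of the 2GB / 5-file scheme described above; names are made up, and this isn't the actual script):

```
import os
import random

def select_files_to_verify(paths, byte_budget=2 * 1024**3, minimum_files=5):
    # Draw files in random order, accumulating their sizes. Stop once the
    # byte budget is exceeded, but keep selecting until a minimum number of
    # files is reached, so that a single huge file (like a video) tripping
    # the budget doesn't mean only one file gets checked.
    selected, total_bytes = [], 0
    for path in random.sample(paths, len(paths)):
        selected.append(path)
        total_bytes += os.path.getsize(path)
        if total_bytes >= byte_budget and len(selected) >= minimum_files:
            break
    return selected
```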

I plan to check this new spot check feature out today.

Author
Owner

Ah, yeah, the spot check feature doesn't currently have a threshold for file sizes, but that would be an interesting thing to add. And I'd love to hear about any feedback you have as you try out the feature!


Hi, I wanted to give it a few tries before giving feedback.

My use of the spot check is for verifying immediately after a backup. I ran some backups with the following settings:
count_tolerance_percentage: 0
data_sample_percentage: 1
data_tolerance_percentage: 0

I have the verbosity/logging set to 1 (-v 1).

For me, this made sure that all of the files are backed up (file count) and that none of the compared files differ (file compare). It's pretty much what my current verify scripts do: compare the file count and compare a random set of files (mine is set to compare 1.5GB of files).

  1. It caught some Borg files that change during the backup, which caused the spot check to fail. For me, this was a success, as I hadn't seen this before (my other verify scripts caught this too). This change may have occurred after a Borg update. I excluded these files and ran the backup again, and this time the spot check passed.

  2. When a spot check fails, it would be helpful to know the file counts (backup and repository) as well as which files failed the compare check. I had a small number of files fail the check during a backup; it reported 0.00%, and I couldn't tell where the problem was or how many files were affected.

  3. The spot check failed one time, then retried 5 more times. Is this the same retry setting? Does it need to retry? Wouldn't it get the same result each time? All the other checks passed, yet they were rerun each time. Do I need to change settings?

  4. It works on backup and check. If I run check alone, it sometimes fails on the first try (only on one of my repositories), then passes on the second try. I couldn't find what made it fail (as in point 2).

  5. I think it would be helpful to know how much data has been verified as well as the number of files, e.g. Verified: 1% (2000 files, 1.5GB).

  6. If the spot check fails during backup, does the backup fail and get deleted, or does something else happen?

This is working really well.


Hi @witten

Just wondering, did you see my last feedback and questions above?

Would it be better to put them into a new ticket next time rather than adding to this one?

Author
Owner

Thank you for the nudge, and my apologies for the delay in getting back to this! I do really appreciate all your testing and feedback on this feature. It's invaluable on something new like this.

Some initial responses:

  • It's interesting that your use case involves running the spot check immediately after the backup. That's not how I'd originally imagined this being used, but I'm glad to hear it's also useful for that use case, mostly working well, and even caught at least one legitimate problem.
  • Adding the file counts and the list of failing files to the output should be doable. Maybe as a series of debug level (--verbosity 2) log entries? Say, one per failing file and maybe another for the counts? The existing error message is already pretty crowded.
  • As for retries, they're currently run (if configured) for all borgmatic actions including checks. And it wouldn't necessarily get the same result each time as the spot check itself is probabilistic; the set of sampled files can change from run to run. I can see an argument though for not retrying checks even if other actions are retried. (Unfortunately the current code structure would not be conducive to that as all actions are run together.)
  • It would make sense that running backup and check together would result in fewer check failures than running check by itself. That's because the spot check effectively detects drift in file contents/count since the last backup. So the longer since the last backup, the more likely that your other processes have been monkeying around with files on disk, thereby resulting in spot check hashes failing for those files.
  • Knowing how much data has been verified is probably possible, but file size isn't something that's currently gathered or tallied. (It could be.) Probably belongs in a debug log entry?
  • If the spot check fails after a backup run in the same borgmatic invocation, the backup / archive creation does not fail even if the overall borgmatic command does. create and check are run independently.
  • And yes, filing a new ticket for these work items would be helpful. (We can continue the discussion here if you want, but a separate ticket is easier for tracking to-do items.)

Thanks for your reply. Sorry it's taken me time; I wanted to do some more tests first and got sidetracked.

Having thought about it, I think listing the failing files could be an overload of output. For my usage, it may be a couple of files. For the usage you describe, it could be hundreds or even thousands of files. That's a lot.

I'll go through my suggested output again and see where it could possibly fit (verbosity 1 or 2, etc.). I tend to run at verbosity 1; I'll run at different levels and try it out.

I will add any further feedback to a new ticket.

Rob
