no activity during borgmatic check #920
What I'm trying to do and why
I have a regular borgmatic config run twice a day. I have the following in the config:
Generally this works out fine, but sometimes the check hangs and I'm not sure why (the logs don't seem to help). It hangs long enough to run concurrently with the next scheduled backup.
Steps to reproduce
Here's my command:
I don't specify any actions, so it should be the default four.
I also back up locally and to a server over SSH.
Actual behavior
A bunch of hung processes (this is after two cycles, which I think is why there are two "sets"):
and on the server:
Expected behavior
For the checks to finish or exit
Other notes / implementation ideas
No response
borgmatic version
1.8.12
borgmatic installation method
pip install
Borg version
borg 1.4.0
Python version
Python 3.11.2
Database version (if applicable)
No response
Operating system and version
Debian GNU/Linux 12 (bookworm)
My guess is that it's the "spot" check that's hanging, probably trying to access a source file to hash. The check is probabilistic (meaning it randomly selects the files to read), so that could explain why it only sometimes happens.
To verify this, I'd recommend upping the verbosity level to 2. That should show which check is being run when it hangs. But it'll also produce more debugging output about each check. Specifically, for the spot check, it'll show you a (truncated) list of the source files it's sampling. That should give you an idea whether there's anything that maybe should be excluded, e.g. if a special file like /dev/random is somehow getting erroneously included.

Lastly, the spot check can just be really slow depending on how it's configured and the specifics of your data. I see you have a data_sample_percentage of 1. That means 1% of your source files get sampled in each check. But if borgmatic happens to select a really large file, that could take much longer than when it selects a small one.

Feel free to post any verbose logs you get here, and I'd be happy to help decipher them.
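For instance, something like this (a sketch; it assumes your configuration is in one of borgmatic's default locations, otherwise add --config):

```
# Run the checks with debug-level output; during the spot check this logs a
# truncated list of the files being sampled.
borgmatic check --verbosity 2
```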
If it was a large file, shouldn't we expect some CPU usage?
Presumably the logging would be for /future/ checks? If so how do I safely abort this run? Just kill the borg processes?
And then how do I force this spot check? Given that it's hung twice in a row now it seems repeatable?
Either that or disk I/O.
Yes, the logs for what checks are to be run occur before the checks themselves. However there is some intermediate debug logging during the spot check that happens while it's running.
Yes, and borgmatic as well if necessary. That would actually be an interesting data point to help debug this—does the Borg process command-line show anything useful before you kill it?
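If it helps, the command lines of the hung processes can be captured with standard tools before killing anything, e.g.:

```
# Show running borg/borgmatic processes with their full command lines.
pgrep -af borg
# Alternatively, with elapsed run time:
ps -eo pid,etime,args | grep '[b]org'
```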
Also, you may have to run borgmatic break-lock afterwards if a Borg lock file is left behind. (You'll know because you'll run Borg/borgmatic and it'll complain about a lock file.)

You can always run borgmatic check --force to force checks to run. There's also the --only spot flag if you don't want to run other checks.

Possibly, yeah. Spot is still probabilistic though, so it could vary from run to run. I guess you'll find out!
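Put together, the recovery steps might look like this (a sketch, assuming the default config location):

```
# Clear any stale Borg repository lock left behind by the killed processes.
borgmatic break-lock

# Force only the spot check to run, ignoring the configured check frequency.
borgmatic check --force --only spot
```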
FWIW the third scheduled backup did "stack up". Killing them cleaned everything up (including the processes on the server).
I then ran the scheduled backup manually (no force, no spot). This quits with an error, which I guess is better than the hanging? Although it seems to have passed the check against the first repo (I back up to two).

There are things in the syslog - forgive me, but what's the best way to share these?
Pasting like this into a ticket comment or attaching it as a file to a ticket is fine.
Based on those error logs though, it looks like Borg's own checks found some legitimate repository errors. You might try check --repair, first making yourself aware of the warnings here. Also perhaps look into the hardware this is running on and whether anything there is failing (disk, RAM, etc.). Occasional Borg repository/archive check failures happen—and can usually be repaired—but if it's happening frequently, I would suspect the hardware.
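That would be roughly the following (a sketch; read Borg's documentation on repair first, since repair can discard data it considers corrupt):

```
# Run the repository/archive checks in repair mode; Borg asks for confirmation.
borgmatic check --repair
```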
I went ahead with the repair and all is running smoothly now. This happens more than I would like, and I suspect I only notice it when the spot checks are performed. Which leaves me with two questions:

Not generally, no. But if for instance your disk is failing, that could cause hangs related to disk access.
The two places I would look to start are your system logs (for any signs of disk errors) and smartctl, which can look for bad blocks.

If that doesn't turn up anything, then maybe your disks are fine—in which case we should look at verbose borgmatic logs when the spot check is failing.
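For the disk side, something along these lines (the device name is a placeholder, and the commands need root):

```
# Print the drive's SMART health summary and error log.
smartctl -a /dev/sdX

# Start an extended self-test; results show up later in the -a output.
smartctl -t long /dev/sdX
```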
FWIW we do a SMART scan periodically and the disks have been healthy for a while. I'll look into the deeper options and will check the logs.
Okay, it's possible that the Borg error has nothing to do with hardware and was just a Borg bug. Similarly, the spot check hang could be a borgmatic bug rather than a hardware problem.
Sounds good.
So, smartctl extended tests came back OK, and I can't see anything obvious in the syslog.
I tried the following:
And although the second completed fine, the first errored out. Unfortunately the console output overflowed with a huge list of file paths (presumably those being spot checked), but I spotted a Python stack trace - perhaps an "invalid character" error? I think I remember an issue about a strange file path causing issues with borg. Either way, I'm not sure if it's even related to my original issue!
FWIW I've been having a lot of these kinds of check issues since upgrading to the new-database-format version of borg (just as a data point!).
Okay, I'm glad to hear your disk isn't bad!
I'm happy to look at the traceback if you've got it, but it's sounding a whole lot like #899, which is fixed in a newer version of borgmatic than you're using. So it might be time to upgrade! But yeah, I'm not sure it would be related to your original hang issue.
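Since the installation method here is pip, the upgrade might look like this (adjust if you're using a virtualenv or pipx):

```
# Upgrade borgmatic to the latest release and confirm the version.
pip install --upgrade borgmatic
borgmatic --version
```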
You're speaking of Borg 1.4.0? If the check issues are with the spot check, then it's likely a borgmatic problem rather than anything to do with Borg. But if you're getting more integrity errors, that's almost certainly a Borg (or hardware) issue, and you might consider filing a ticket with the Borg project.
#899 was mine! I've upgraded now.
My suspicion is that something is happening in the daily backups (all of which end in a logged success), which only becomes an issue during the less frequent spot check.
Is there a way to keep every backup run as a timestamped file for future investigation?
You can try the --log-file command-line flag, which does include timestamps in the log. But you'd have to either use a different file name each time or rotate the log files with an external tool (e.g. logrotate)—or else each borgmatic run will overwrite the log file.

Another option is to just use the default system logging and run journalctl to view logs from the desired time span.
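A sketch of the first option (the log directory and date format are just examples):

```
# Write each run's log to its own timestamped file so nothing is overwritten.
borgmatic --verbosity 1 --log-file "/var/log/borgmatic/run-$(date +%F_%H%M).log"

# Or query the system journal for the relevant time window instead.
journalctl --since "06:00" --until "09:00" | grep -i borgmatic
```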
Got a log for a failed backup! Strangely it seems to occur while pruning after a successful backup:

Running it again won't even get past backup:
Side question (maybe) - when trying to run:
The command fails because borgmatic seems to execute a borg info first, which fails because of the segment error. Is that something that check always does?

Using borg directly for the check:

Which are different to the segments being complained about during the other actions. A further check lists different ones still:
I won't repair this in case there's something I can dig into while it's in this state - the other repo is backing up okay for now.
Yeah, this is another integrity error within the Borg repository itself. You can try another check --repair, but the fact that this keeps happening makes me suspect the hardware again (or maybe even a Borg bug).

Yes. borgmatic check first runs borg info under the hood to get the Borg repository ID. This is used to record which checks have been run and when. That way checks can be run at the desired frequency instead of every time borgmatic runs.
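If you want to see what that underlying call reports, you can run the equivalent Borg command yourself (the repository path is a placeholder):

```
# Show repository details, including the repository ID that borgmatic records.
borg info /path/to/repo
```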
Yeah, that's not looking good. As per the Borg docs, you might consider making a copy of the repo (maybe even on another machine) and trying the repair there.
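A sketch of that approach (the paths are placeholders; make sure there's enough free space for the copy):

```
# Work on a copy so the original repository stays untouched.
cp -a /path/to/repo /path/to/repo-copy

# Attempt the repair against the copy; Borg will prompt for confirmation.
borg check --repair /path/to/repo-copy
```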
It may also be worth filing a ticket with the Borg project. They'll have much more insight into what to do with a corrupted Borg repository than I do!
Yes, agreed - I've raised it here:
https://github.com/borgbackup/borg/issues/8473
Okay, I'll close this ticket for now. Hopefully you get some resolution to this problem with that Borg ticket!
Seems like it was a pretty obscure hardware issue, only caught by borgmatic's scheduled checks. So there's value there!
Cool! I'm glad to hear you've made some progress on the issue, even if you might be down a hard drive now.
Just to round this off, the issue was down to the RAM I was using, even though it passed memtest. So something I learned there too.
For anyone interested I managed to make the error repeatable - more info in the borg issue at: https://github.com/borgbackup/borg/issues/8473
I'm glad to hear you've got this resolved now, and thanks for circling back with the root cause. Great detective work!