'Not a directory' errors when running consistency check #713
Labels
No Label
bug
data loss
design finalized
good first issue
new feature area
question / support
security
waiting for response
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: borgmatic-collective/borgmatic#713
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What I'm trying to do and why
Last night, during my normal backup, I got an error saying something along these lines:
Looking into it, it seems that the
data
part of that path is a file, not a directory.I believe this is the data consistency check that's having this failure, which I have configured to run once a month, and this is the first run since after upgrading to the latest borgmatic release.
I tried downgrading to the previous version I was using (1.7.12), and it made this error go away, however it messed with the planned date for the check to run. Now it thinks it needs to run in 3 weeks from now. I have no idea why.
Steps to reproduce (if a bug)
To be honest, I doubt this can be easily reproduced, since it seems like something that would be caught by tests or that would have been reported by now since the release was over 2 weeks ago. But, here's a sanitized config:
Also, I usually invoke borgmatic in an Automator app that runs this script (using whatever
zsh
Automator provides, so, I guess the default macOS version):Of course, I also tried running this just in my terminal, and made sure to try verbosity level 2 both in Automator and in the terminal because I needed the verbose logs for this report, but FWIW none of these alternate run scenarios fixed the problem. The only thing that "fixes" it is reverting to an older version of borgmatic, but, like I said before, that messed with the planned date for the consistency check to run.
Actual behavior (if a bug)
It does appear that basically everything else works, however, the data consistency check fails, and so does the overall run.
Here's a portion of the log output from a recent run, obviously if I gave the whole log it would be overwhelming but hopefully this shows enough:
Expected behavior (if a bug)
Well, I expected the data checks to run like normal.
Other notes / implementation ideas
I upgraded borgmatic yesterday right before this data check was configured to run. This was totally by coincidence; I wasn't even thinking about when the next consistency check would run, and I didn't even bother looking at the changelog, because I was simply doing
brew update && brew upgrade
. I don't remember which version of borgmatic I had last month when the data check ran last, but it was no greater than 1.7.12, because that's the version I upgraded from last night.I don't believe the issue is related to which user borgmatic is running as, since I have it run via a repeating Calendar event and have shown that that makes it run as my user; furthermore, the file paths in question are owned by me just fine.
Environment
borgmatic version: 1.7.14
borgmatic installation method: homebrew
Borg version: 1.2.4
Python version: 3.11.1 or 3.11.4, not sure
Note: I'm actually not sure which version of python is being used. If you want to know what my default python version is, well, it's 3.11.1 right now due to my pyenv settings, but since I installed borgmatic via homebrew, I can't say for sure if it's using the same python version as my default, since the borgmatic formula has an explicit dependency on the
python@3.11
formula, and that formula's using 3.11.4. This information is apparently not logged even with verbose logging. (I think it might be a good idea to have verbose logs show that info!)Database version (if applicable): N/A
operating system and version: macOS Ventura 13.4, Intel
Thank you for the detailed bug report!
This is almost certainly fallout from #688, which was introduced in the lucky 1.7.13 release. As part of that change, some of the check data that's recorded on disk was moved to a new location, and specifically one of the files was replaced with a directory. There is code in place to upgrade the old file format transparently. It just sounds like it's not working as intended on your particular system.
So just to be clear: You had run the
data
check at some point before upgrading borgmatic?It's because your check files (the record of what checks had been run when) had been upgraded to the new 1.7.13+ format by running the newer version of borgmatic before downgrading. The older version of borgmatic can't read that format.
You'd be surprised. People don't always report stuff right away (or at all). It's also possible this is triggered by an edge case that most users aren't experiencing.
Looking at your borgmatic logs, I definitely see the failure occurring when reading the check time from disk. And based on your
ls
output above, the check format upgrade has not successfully run on the file in question—and I don't see evidence it's being attempted anywhere in these logs.For purposes of this ticket, it hopefully shouldn't matter.
Yup, that makes sense to me. PRs welcome. :D
Anyway, given that this is such a mysterious issue—it shouldn't be possible that it happens—how comfortable would you be running a version of borgmatic that's instrumented with extra debugging logs in the area of the code responsible for upgrading check files on disk? The way that would work is you'd upgrade back to 1.7.14 and then I'd give you a single modified
check.py
file to overwrite the existing 1.7.14 file wherever it exists on your system. Then you'd run borgmatic and collect the debug logs to paste here.Let me know your thoughts.
@witten
Is it because I went from 1.7.12 to 1.7.14 directly? Is it possible to just delete the
data
file to resolve this? I'm not saying I'll do that, since I know you'll want to devise a more holistic and transparent solution, but I am curious if that can be my plan B.Yup, I've been a user for over 8 months, and I've been doing data checks on at least a monthly cadence ever since (it used to be more frequently, but at some point I switched it to be monthly.) I can guarantee that they have run successfully for me in the past. In fact, I opened an issue here a while ago related to them and
check_last
, the logs from which are proof that the data checks ran, so it's definitely not something along the lines of me being mistaken and the data checks have just never worked for me.I can do that! Just give me the name of a git branch, and I'll install via pip with a git URL instead. I think that's easier for both of us than coordinating a file handoff and me having to manually overwrite stuff on my system.
I might do just that. For now, I'll put it in as a separate issue for tracking purposes, then set a reminder for myself to do it when I have some free time. If someone else tackles it before me, then all the better.
Nope, that should be fine.
Yeah, it should work if the problem is indeed a check time file upgrade not working properly. But if you do, please make a copy of your "broken" check times so we can continue working on this.
Sounds good. I've got a branch for you with tons of check time upgrade instrumentation: https://projects.torsion.org/borgmatic-collective/borgmatic/src/branch/713-check-time-upgrade-debugging
This branch is based off of main (so 1.7.15.dev0), but if that turns out to be a problem, I can make it based off of an actual release. When you run it, you'll see new debugging messages prefix with
#713:
. Hopefully that should be enough to figure out what's going wrong with the check time upgrade.Thanks!
Well, in playing around with this branch, I may have found the problem at the root of this ticket. It looks like there's an edge case where the
archives
check times are already upgraded but thedata
check times aren't. When that happens, thedata
check time upgrade can get skipped. I just pushed a potential fix for that to this branch, so it may actually make the problem go away on your machine.@witten that's great that you found the potential problem. I will switch to the branch, set my data check to run nightly to force it to happen soon, and let you know how it goes.
Any luck with that most recent nightly run? Thanks.
@witten Well, sort of. The actual nightly run failed because there's some regression in the arg parsing. Automator is really bad at telling you what went wrong, so I had to try it manually. That's why my answer is so late, I had to figure out what had gone wrong. Anyway, here's what I mean:
With that said, if I take out the
--verbosity
flag and run it then, the data checks seem to run properly. I usedalways
for the check frequency, though, so hopefully that doesn't invalidate anything. I think your fix works.Ugh, it looks like that error is a new one introduced in main but not present in the last release. I've managed to repro it and filed it separately here: #716
And thanks for validating the fix! I'll call this done and hopefully release soon, although I'm going to have a look at #716 first.
@witten thanks, also I forgot to mention that all the output text is purple when running it without the verbosity flag, which is weird and I'm not sure it's normal, may or may not help you out with figuring out what the other issue is. Regardless, I appreciate you looking into a fix here. Can I downgrade to 1.7.14 from here safely, or do I have to keep using this in-development 1.7.15?
Thanks. Purple output is just from a particular log level, as each log level gets its own color. I don't believe that has changed recently.
I've just rebased the branch on 1.7.14 and backported the fix, so you should be able to use that until 1.7.15 is released.
Just released in borgmatic 1.7.15.