1.9.2/1.9.3 spinning at exit #947
What I'm trying to do and why
Running borgmatic on OpenBSD -current with borg 1.4.0. With 1.9.2/1.9.3 I'm seeing the process spin at high CPU just as it's about to exit, after a backup has completed. (With ktrace enabled for the process, it is making wait4 syscalls very frequently.) I haven't seen this happen with 1.9.0/1.9.1. I suspect it may only happen when doing database backups (at least it didn't happen in one test after commenting them out).
Running under pdb and breaking while it spins I have the trace below.
Any suggestions please?
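For illustration, the symptom described above (very frequent wait4 syscalls at 100% CPU) is what a polling loop produces when its exit condition is never satisfied. A minimal sketch, not borgmatic's actual code:

```python
import subprocess

# Hypothetical sketch (not borgmatic's implementation): monitoring several
# child processes at once. Each poll() issues a wait4(2) syscall under the
# hood, so if the loop's exit condition were never met, the process would
# spin at 100% CPU making frequent wait4 calls, matching the ktrace
# observation above.
def wait_for_all(processes):
    while any(process.poll() is None for process in processes):
        pass  # busy-wait: no sleep between polls

processes = [subprocess.Popen(['sleep', '0.1']) for _ in range(3)]
wait_for_all(processes)
```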
Steps to reproduce
No response
Actual behavior
No response
Expected behavior
No response
Other notes / implementation ideas
No response
borgmatic version
1.9.2
borgmatic installation method
via OpenBSD ports tree (tweaked to use different versions)
Borg version
borg 1.4.0
Python version
Python 3.11.10
Database version (if applicable)
psql (PostgreSQL) 17.2 / mysql from 11.4.4-MariaDB, client 15.2 for OpenBSD (amd64) using readline 4.3
Operating system and version
OpenBSD 7.6-current (GENERIC.MP) #445: Sun Nov 24 23:22:18 MST 2024 deraadt@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
Thanks for the detailed ticket. The main item that changed in borgmatic 1.9.2 that could in theory impact database dump streaming (and therefore cause hangs) is this item from the changelog:
So looking at those docs and working through the documented runtime directory logic:
- `user_runtime_directory` isn't configured.
- The `XDG_RUNTIME_DIR` environment variable isn't set on your system because you're on OpenBSD, not Linux.
- `RUNTIME_DIRECTORY` isn't set, because you're probably not using systemd on OpenBSD.
- `TMPDIR` is set, and probably pointing to something like `/tmp`. However this would've also been the case with borgmatic 1.9.1, so I'm not sure how an upgrade to 1.9.2 would impact things.

When you look at your borgmatic logs, do you see a line that says something like the following early in the logs?
If so, what is the runtime directory that's listed there? If you run borgmatic again, does a different runtime directory get used? And if possible, can you downgrade to 1.9.1 and run borgmatic again to see its runtime directory as well? Does the hang problem go away with 1.9.1?
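The documented fallback logic above amounts to roughly this sketch (simplified and hypothetical; borgmatic's real implementation creates subdirectories and handles more cases):

```python
import os

def base_runtime_directory(user_runtime_directory=None):
    # Simplified sketch of the runtime directory fallback: the explicit
    # option first, then environment variables, then /run/user/$UID.
    if user_runtime_directory:
        return user_runtime_directory
    for variable in ('XDG_RUNTIME_DIR', 'RUNTIME_DIRECTORY', 'TMPDIR', 'TMP'):
        value = os.environ.get(variable)
        if value:
            return value
    return f'/run/user/{os.getuid()}'
```

On a typical Linux desktop this resolves to `XDG_RUNTIME_DIR`; on a system with none of those variables set, it falls through to `/run/user/<uid>`.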
Correct, `user_runtime_directory` wasn't configured, and `XDG_RUNTIME_DIR` and `RUNTIME_DIRECTORY` aren't set; neither are `TMPDIR` or `TMP`. Apologies, I forgot to mention I have a patch in 1.9.1 to use `~/.cache` instead of `/run/user/{os.getuid()}` (though /run/user would have failed anyway as it doesn't exist), so it was using `~/.cache/borgmatic` as the runtime dir.

With 1.9.2, it is correctly using a new directory in `/tmp` each time: `Using runtime directory /tmp/borgmatic-chznt6pq/borgmatic`, `/tmp/borgmatic-pjfqt01x/borgmatic`, etc.

I have done further tests with 1.9.2:
I've only been able to get it to hang with databases and a reasonably large file set. Tests with just one or two databases and /etc are not hanging. (I was hoping I could save some time with tests.)
I've now tried specifying `user_runtime_directory: /root/.borgmatic` and that seems to be working reliably, no hangs (tested with both 1.9.2 and 1.9.3).

Given those results (and I know exactly what dir it was using), I haven't retested with 1.9.1 again now - I'm pretty confident that it's not hanging there.
So the problem is only occurring when the randomly-generated directory name under /tmp is used.
It's possible that it's hanging on one of the files in your reasonably large file set, because enabling databases implicitly enables `read_special`, which can cause hangs if Borg encounters a special file. One way to verify that theory would be to enable your large file set, disable databases, but then enable the `read_special` option explicitly. If it hangs with `read_special` but not without, then it's pretty clearly encountering a special file that's causing problems.

Another thing to try would be to enable just databases but omit your large file set. If that hangs, then the problem is almost certainly related to the temporary database named pipe. If it doesn't hang, then that indicates the databases (on their own) are not the problem.
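The first experiment could look something like this in borgmatic configuration (paths hypothetical; `read_special` and `postgresql_databases` are real borgmatic options):

```yaml
# Test 1: the large file set, databases disabled, but read_special
# forced on, to see whether a special file alone triggers the hang.
source_directories:
    - /home/large/file/set
read_special: true

# Test 2 (a separate run): databases only, no large file set.
# source_directories: []
# postgresql_databases:
#     - name: all
```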
It would be really odd if one temporary directory caused hangs though and the other one didn't. Are `/tmp` and `/root` on different filesystems or even different devices? If so, any ideas about the relative characteristics of them that might impact this?

I have the same issue (I think) on Arch Linux.
It just spins to 100% cpu and stays there.
I tried setting `user_runtime_directory` and/or `temporary_directory`, but for me it didn't do anything. That might be because I used tmpfs for both directories.
I tried disabling the database backup (Postgres) and it worked like a charm again. I've had the same issue, and it seems to have started around the same time or version of borgmatic.
Fun fact:
even if borgmatic spins up to 100% CPU with database backup enabled, the backup is complete; just everything afterwards like prune, compact, and check is missing. In the log (-v2) you see it exits with exit code 0 when database backup is enabled, but the next prompt (Calling bootstrap hook function remove_data_source_dumps) never comes.
Just another update: I have this issue only when borg itself exits with return code 0 when creating a backup.
I just got a return code 100 from borg (because a file changed during the backup) and borgmatic completed gracefully afterwards.
So whatever it is that is causing the issue, it's only occurring when the return code is 0.
@nachfuellbar Thanks for weighing in. What runtime directory is borgmatic actually using when you don't specify one? You can see it in the log entry containing `Using runtime directory` in borgmatic's verbose output. Does that directory still exist on the filesystem during the hang? And when you ctrl-C the hang, do you get a similar traceback to what @sthen is reporting above? (You might not need `pdb` to get it.)

The fact that this is hanging after Borg exits makes me think it's not the usual problem of Borg hanging on a special file (e.g. a database named pipe). But it is interesting that the hang only happens when databases are enabled. One difference when databases are enabled is that there are multiple processes involved: the database dump process, streaming via a named pipe to Borg's process. I wonder if there's somehow a new deadlock between them.
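To illustrate the multi-process setup being described: the dump process writes into a named pipe and Borg reads from it, so a writer whose output is never fully consumed blocks forever. A toy sketch of that plumbing (not borgmatic's code; names hypothetical):

```python
import os
import tempfile
import threading

def stream_through_fifo(data):
    """Stream bytes through a named pipe, like a dump process feeding Borg."""
    fifo_path = os.path.join(tempfile.mkdtemp(), 'dump')
    os.mkfifo(fifo_path)
    result = {}

    def writer():
        # Stands in for pg_dump: open() blocks until a reader appears,
        # and write() can block until the reader consumes the data.
        with open(fifo_path, 'wb') as pipe:
            pipe.write(data)

    thread = threading.Thread(target=writer)
    thread.start()
    with open(fifo_path, 'rb') as pipe:  # stands in for Borg's read side
        result['received'] = pipe.read()
    thread.join()
    return result['received']
```

If the reader side exits without draining the pipe, or never opens it at all, the writer blocks indefinitely, which is analogous to the suspected deadlock.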
Also of note is that the file where the hang is ostensibly occurring, `execute.py`, hasn't changed since April!

It's using /run/user/0 as a runtime directory.
The runtime directory still exists when it hangs.
After ctrl-C, the runtime directory continues to exist (including the named pipe).
Here's the error when I ctrl-C the process. I don't know if the issue is the same, because it seems similar but not identical.
That might be because I'm already on 1.9.4 right now.
Thanks for those details. The fact that your runtime directory isn't in a temporary directory allows us to eliminate the possibility of the hang stemming from temporary directory removal issues, at least on your machine. So that's something. And I'd expect the named pipe to still be there, given that the ctrl-C prevents cleanup steps from running.
And yeah, that traceback isn't exactly the same, but it could indicate that on both machines borgmatic is stuck in a loop waiting for a condition that will never happen, since both of the final borgmatic lines in your respective tracebacks occur in the same loop. One way to test that theory would be to run your repro and then ctrl-C it a few more times. If it stops on different lines in that same loop, that would support the theory. If it always stops on the same line, that wouldn't necessarily eliminate the theory though.
Seems like it's really stuck in that while loop and doesn't see when the process has exited.
After some retries I could reproduce borgmatic running fine when borg returns code 100 because of a file which was changed throughout the backup.
Which is kinda funny, since that's kind of a warning and not "really" successful.
It may actually be seeing when Borg exits just fine, but that's not the only process borgmatic is monitoring—there's also the database dump process that's feeding data to Borg via a named pipe. If that dump process isn't exiting cleanly, for instance if it's expecting a last bit of data to get consumed by Borg that's never consumed, then that could explain a hang. I'm just giving an example though; there could be any number of failure modes here.
Anyway, if you're willing, could you try temporarily downgrading versions of borgmatic one by one until the hang disappears? For instance, if you're on 1.9.4 now, try a repro on 1.9.3, then 1.9.2, and so on until there's no more hang. That would at least give me a direction to go in pinpointing the problem.
Additionally, if I gave you a heavily instrumented copy of execute.py (probably just with a bunch of print statements), is that something you could easily drop in to your copy of borgmatic and try out? Or would that be difficult to do on your machine?
Thanks for your patience here.
This is interesting. Presumably because it's only a warning from Borg's (and borgmatic's) perspective, Borg would run and exit in exactly the same way, but just with a different exit code. So I'm not sure why this wouldn't cause a hang. It is perhaps a clue though.
It does make me think: If you comment out your `source_directories:` entirely and leave database backups on, does the hang go away or not?

Still hangs after trying it without source_directories.
For trying with older versions etc i'd need some more time ;)
Had to go back more versions than I thought I needed to.
It happens since 1.9.0; with 1.8.14 it worked without issues.
So I'm not exactly sure if it's the same issue as the issue creator has, but maybe it just didn't get noticed before then since the backups are still created.
If you have any versions of execute.py with more prints, I could test it now.
Interesting! And maybe a little unfortunate in terms of pinning this down, just because so much changed in 1.9.0.
Okay, I've instrumented `log_outputs()` in `execute.py` to hell and back with print debugging. You can either clone the whole branch here or, if it's easier, just replace the single `execute.py` file. This branch is based off of 1.9.4 (well, main, which is very similar right now), so if you replace the single file, it's probably good to switch back to 1.9.4 first.

Thanks for being willing to do this!
Hey, I'm more than happy to help if a developer is this nice - unfortunately that's not my usual experience ;)
So I'm pretty sure there is the issue.
The Postgres backup gets no return code, and the /run/user directories, which aren't in my source_directories, don't exist in my completed backup. (borgmatic/bootstrap is still found in the completed backup under /root/.borgmatic/bootstrap <- forget about this, the manifest timestamp is from when I tried 1.8.14.)

Manually executing the pg_dumpall command from borgmatic works though:

`PGPASSWORD=XXX pg_dumpall --no-password --clean --if-exists --host XXXX --username XXX`
Here's the output of the loop after the backup, which continues over 10 times a second.
And here's the backup. As you can see, there is no run directory; I don't know if this is intentional, but I don't think so.
Yeah, it's at least the proximate cause if not the root cause. Because if the `pg_dump` command is never exiting, then it makes some sense that there's a hang.

Database dumps should show up in a `borgmatic/...` directory within the backup archive (assuming Borg 1.4+ is used). But in any case, it does appear to be missing from the backup, which might indicate the dump never completed. And that would be consistent with the dump process still running as indicated above.

Anyway, here are my current thoughts on next steps:
- The archive listing above includes `/run/user/0/./borgmatic/postgresql_databases`. But I wonder if the `pg_dump` command is actually writing to a named pipe there. So I tweaked `execute.py` to log the full command line that was run so you can see if it's actually dumping to, say, `/run/user/0/borgmatic/postgresql_databases/all`. No need to send me the output and laboriously redact it again. Just please let me know if the dump location lines up with the `borg create` paths. If not, that could explain the hang behavior you're seeing.
- Could you try setting `user_runtime_directory` somewhere else to see if the hang goes away?
- It's interesting that your runtime directory is `/run/user/0`. I wouldn't expect that to be the case if you're running as root, because IIRC, root doesn't have that directory created by Linux PAM. So do you happen to know how it's getting set? See my first comment above about the logic behind that in borgmatic.
- Do you have a `working_directory` configured in your borgmatic configuration?

The named pipe is in the path you named (/run/user/0/./borgmatic/postgresql_databases/ip_address/all).
And this path is backed up, like you saw in the previous log.
The named pipe itself seems to work, too. I tried cat on it in one of my test runs just now and I get the pg_dumpall output.
I set it to /root/borgmatic and the backup worked just fine, and the Postgres backup is in this path (/root/borgmatic/borgmatic/postgresql_databases/ip_address/all).
I tried setting another tmpfs like /tmp and it looped again, with a backup without a Postgres backup coming out of it.
I honestly don't know when it's created, but there is at least gnupg, p11-kit, and systemd using /run/user/0, so I think it's at least not unusual for it to exist in this way.
I've never set this variable. The only "path variable" I set was user_state_directory, because it filled up my rootfs - but I've just checked, and I haven't set it, at least not since I last updated the config (mid-November).
Okay, then it sounds like at least pg_dump is probably writing to a location that Borg is reading from. So we can eliminate that.
Super odd! Does this happen consistently over the course of several runs? I know I've successfully run borgmatic database dumps on tmpfs locations, so I wonder if the fact that you're defaulting to a tmpfs path has anything to do with the hang or if there's another correlated (or even random) factor at play.
Another thing I wonder about: Are your tmpfs directories mounted with the `nodev` option? You can see this by running `mount | grep tmpfs`, looking for the tmpfs filesystem in question (e.g. `/tmp` and `/run/user/0` in your case) and seeing if `nodev` is specified among the parenthesized mount options. My guess is that `nodev` is there, and in theory it could prevent named pipes from working.

The weird thing is I have `nodev` on my tmpfs filesystems, and borgmatic database backups work fine here! So here's something to try to see if named pipes work on your tmpfs. In a root shell, try:

And then in a separate root shell:
That second shell should print "hi" and the first shell should exit.
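The two elided shell snippets presumably amounted to something like the following (pipe path hypothetical; the writer is backgrounded here only so the sketch runs as a single script):

```shell
# First root shell: create a named pipe on the tmpfs in question and
# write to it. The echo blocks until a reader opens the pipe.
mkfifo /tmp/test-pipe
echo hi > /tmp/test-pipe &

# Second root shell: read from the pipe. This prints "hi", and the
# blocked writer above then exits.
cat /tmp/test-pipe

wait
rm /tmp/test-pipe
```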
Okay.
What Linux distro are you using BTW?
My tmpfs has the following options, which come via fstab's default option.
The named pipe is working, and it's working for the dump too - even executing a cat on the pipe for the dump while doing the backup works (which I tried on one of my test runs).
Your test with mkfifo works, too.
I can persistently recreate the issue when the pipe is on any tmpfs.
And just as persistently I can recreate it working on a "real" disk.
I thought I wrote it already, but maybe not - it's Arch Linux, so it's pretty barebones with no weird changes like Ubuntu.
Okay, thanks for trying that out. I saw conflicting information on whether named pipes work on nodev filesystems, but it sounds like they do on both your machine and mine.
Nope, you did. Looks like I just missed it.
So I'm a little at a loss on how to proceed from here. We could do the binary search thing on 1.9.0 changesets, but that's where the use of tmpfs directories began, and so it would probably just tell us what we already know—that on your machine borgmatic hangs when dumping on a tmpfs but not when dumping on a disk-backed filesystem.
Here's something that stands out to me though. You mentioned your non-tmpfs dump file listing above:

That strikes me as wrong, because `all` should be a named pipe (that line should start with "p"). But here it's just a regular file. However, above when you're using `/run/user/0` (presumably tmpfs), the dump location is an actual named pipe. That discrepancy is pretty weird, and I wonder if it could be related to the underlying problem here. You don't have `format` set in your `postgresql_databases:` entries, do you? If you try dumping a single database instead of `all`, does the hang behavior change?

I think I'm stupid.
Honestly, I'm sorry I used so much of your time, because I hadn't thought about my exclude_patterns.
Seems like they excluded the paths I tried with tmpfs. After removing the exclude, everything works again...
The exclude patterns have been there for years, at least, because I have a lot of things which utilize tmpfs like /tmp and /run temporarily.
But yeah, it might be worth implementing a warning for people who either don't think about their exclude patterns anymore or just don't expect them to interfere with the database dumps.
Or an "exclude from exclude" for the directories borgmatic itself needs in order to function.
I'm so glad you figured it out! I think this one is on me, because I really should've asked for your config file up front, and then we probably would've caught this sooner.
Good idea about the warning, although I'm not sure how easy that would be to implement. I'll leave this ticket open to investigate it. At the very least I can document the potential conflict.
@sthen You might try checking your configured excludes as well to see if it's the same issue.
EDIT: FWIW, I have a repro now of the hang by putting the user runtime directory path into `exclude_patterns`!

Okay, I ended up erroring instead of just warning in the case where the configured excludes overlap with the runtime directory. That's because it's a misconfiguration and basically guarantees a hang. I figured a moderately user-friendly error is preferable to hanging mysteriously.
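The new validation presumably amounts to checking whether any configured exclude matches the runtime directory or one of its parents. A rough sketch of that idea, not borgmatic's actual implementation (real Borg pattern styles like `sh:` and `fm:` are more involved than plain fnmatch):

```python
import fnmatch
import os

def check_excludes(exclude_patterns, runtime_directory):
    # Collect the runtime directory and all of its parent directories.
    path = os.path.normpath(runtime_directory)
    candidates = [path]
    while path != os.sep:
        path = os.path.dirname(path)
        candidates.append(path)

    # Error out on any exclude that would swallow the runtime directory,
    # since that basically guarantees a Borg hang.
    for pattern in exclude_patterns:
        for candidate in candidates:
            if fnmatch.fnmatch(candidate, pattern):
                raise ValueError(
                    f'Exclude pattern {pattern!r} would exclude the runtime '
                    f'directory {runtime_directory!r}'
                )
```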
This change is in main and will be part of the next release. Calling the ticket done for now. Thanks again for all your patience (and assistance) here!
Released in borgmatic 1.9.5!
Hello, my backups just started failing after updating to 1.9.5; however, I don't use `exclude_patterns` at all!

However, I use `patterns`, and of course the runtime directory was not present in my patterns before, because it didn't need to be.

So to make it work now, do I have to add a custom `user_runtime_directory` and add it under my `patterns`?
@LelouBil It depends on exactly what's in your `patterns`. The point of the error is that the runtime directory can no longer be excluded via either excludes or patterns, because that can cause Borg hangs. But it might not have to be explicitly included as long as you aren't excluding it. For instance, the following patterns would trigger the new error message if the runtime directory exists at `/run/user/1234`:

That's because the parent directory of the runtime directory is getting excluded. To fix that, you could remove the exclude:

... or you could add an explicit include to override the exclude:
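The three elided pattern snippets were presumably along these lines (paths hypothetical, using Borg's `R`/`+`/`-` pattern prefixes, where the first matching pattern wins):

```yaml
# Would trigger the error: the parent of /run/user/1234 is excluded.
patterns:
    - R /
    - '- /run'
---
# Fix 1: remove the exclude entirely.
patterns:
    - R /
---
# Fix 2: explicitly include the runtime directory before the exclude.
patterns:
    - R /
    - '+ /run/user/1234'
    - '- /run'
```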
This is just one particular example, of course, and your runtime directory may be at a different location. See the `-v 2` borgmatic output for the path. And you shouldn't need to set `user_runtime_directory` as long as you're fine with the implicitly selected path (and can avoid excluding it).

I'm using docker-borgmatic, and my patterns element is a bunch of lines like these (because it is auto-generated):
All paths begin with /mnt/source, and no path contains "/tmp"; however, it fails even when borgmatic chooses a directory under /tmp.
Maybe docker-borgmatic adds some configuration that I'm not finding?
Hmm. I don't know of any configuration that docker-borgmatic adds. Can I get a look at your (redacted) configuration file and the verbose output of your borgmatic run that results in the error? Thanks.
I tried reproducing it with a minimal configuration without luck; however, after reading the logs more carefully, I found out that it failed with this error during a spot check.
A normal backup creation has no issues, so this seems to be dependent on the paths of my last backups?
Here's the relevant part of the logs. I added a print to show the paths_output variable here.
Does this seem normal to you?
If I understand the logic in `collect_special_file_paths` correctly, the fact that no directory in `paths_output` is a parent of the runtime directory means it was excluded, or else it would have been in the list?

Since you error out if there are no paths to skip.

Should the runtime directory be added to the paths by default?
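The parent-directory check being described could look roughly like this (a sketch of the idea only, not the actual `collect_special_file_paths` code):

```python
import os

def runtime_directory_covered(paths, runtime_directory):
    # True if any path Borg reports it would back up is the runtime
    # directory itself or one of its parents. If nothing covers it,
    # the runtime directory was presumably excluded by patterns or
    # excludes, which is the misconfiguration the error flags.
    runtime = os.path.normpath(runtime_directory)
    for path in paths:
        candidate = os.path.normpath(path)
        if runtime == candidate or runtime.startswith(candidate + os.sep):
            return True
    return False
```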
Thank you! That error is not actually from the spot check. It comes after, during the `create` action, when special file paths are detected and excluded to prevent Borg hangs.

I assume you've added a debugging print for `paths_output`, which is helpful here. So to answer your question, yes, this seems "normal" in the sense that the code is apparently working as expected. But it's catching an "incorrect" configuration.

That's right!
It is, but it's also possible with patterns to inadvertently "undo" that addition and exclude the runtime directory. That's what this error is ostensibly catching. So can I get a look at your configured patterns (redacted as necessary)?
What makes you say that? What's the difference between a "normal" backup creation and one that results in this error?
Disabling the spot check makes the backup work without any issues.
Adding other debug prints, I found out the error happens during this function.
During `collect_spot_check_source_paths`, I mean - I forgot the specific line I tested yesterday.

That's really interesting that it only happens during the spot check. Collecting the spot check source paths does go through the code path that can ultimately lead to that error message, so it makes some amount of sense. So can I get a look at your configured patterns? That will help debug any potential interaction between those patterns and the borgmatic runtime directory.
I tested it with this single pattern and it errored.
With spot checks enabled
Thanks. I think I might have a repro now and an idea about what's going on! So what I did was:
Does that sound like it lines up with your configuration?
Anyway, I'll file a separate ticket to cover fixing this, and I'll link it here.
#965.