1.9.2/1.9.3 spinning at exit #947

Closed
opened 2024-11-29 14:16:53 +00:00 by sthen · 36 comments

What I'm trying to do and why

Running borgmatic on OpenBSD -current with borg 1.4.0. With 1.9.2/1.9.3, I'm seeing the process spin at high CPU when it's just about to exit, after a backup has completed. (With ktrace enabled for the process, it is making wait4 syscalls very frequently.) I haven't seen this happen with 1.9.0/1.9.1. I suspect it may only happen when doing database backups (at least it didn't happen in one test after commenting them out).

Running under pdb and breaking in while it spins, I get the trace below.

Any suggestions please?

Remote: check_free_space: required bytes 1231894848, free bytes 875741470720
security: saving state for 44c9b0cdcd6aa17e58b6976ad71d6761f77d340ff82e9170fe8277bc021c3258 to /root/.config/borg/security/44c9b0cdcd6aa17e58b6976ad71d6761f77d340ff82e9170fe8277bc021c3258
security: current location ssh://XXX@XXX/bak/XXX
security: key type 5
security: manifest timestamp 2024-11-29T13:06:01.031120
RemoteRepository: 13.79 MB bytes sent, 5.32 kB bytes received, 107 messages sent
terminating with success status, rc 0
^CTraceback (most recent call last):
File "/usr/local/lib/python3.11/pdb.py", line 1777, in main
pdb._run(target)
File "/usr/local/lib/python3.11/pdb.py", line 1643, in _run
self.run(target.code)
File "/usr/local/lib/python3.11/bdb.py", line 600, in run
exec(cmd, globals, locals)
File "", line 1, in
File "/usr/local/bin/borgmatic", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/commands/borgmatic.py", line 942, in main
or list(collect_configuration_run_summary_logs(configs, config_paths, arguments))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/commands/borgmatic.py", line 824, in collect_configuration_run_summary_logs
results = list(run_configuration(config_filename, config, config_paths, arguments))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/commands/borgmatic.py", line 146, in run_configuration
yield from run_actions(
File "/usr/local/lib/python3.11/site-packages/borgmatic/commands/borgmatic.py", line 331, in run_actions
yield from borgmatic.actions.create.run_create(
File "/usr/local/lib/python3.11/site-packages/borgmatic/actions/create.py", line 107, in run_create
json_output = borgmatic.borg.create.create_archive(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/borg/create.py", line 573, in create_archive
return execute_command_with_processes(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/execute.py", line 424, in execute_command_with_processes
captured_outputs = log_outputs(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/execute.py", line 146, in log_outputs
other_process.poll() is None
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/subprocess.py", line 1236, in poll
return self._internal_poll()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/subprocess.py", line 1990, in _internal_poll
pid, sts = _waitpid(self.pid, _WNOHANG)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/borgmatic/signals.py", line 21, in handle_signal
os.killpg(os.getpgrp(), signal_number)
File "/usr/local/lib/python3.11/site-packages/borgmatic/signals.py", line 27, in handle_signal
raise KeyboardInterrupt()
KeyboardInterrupt
Uncaught exception. Entering post mortem debugging
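
For context on why the traceback above ends inside signals.py: borgmatic installs a signal handler that forwards the signal to its whole process group and then raises KeyboardInterrupt. A minimal sketch of that pattern, reconstructed only from the frames above rather than from borgmatic's actual code:

```python
import os
import signal

def handle_signal(signal_number, frame):
    # Forward the signal to the entire process group so child processes
    # (borg, any database dump commands) are interrupted too, then unwind
    # via KeyboardInterrupt -- which is why the tracebacks end in here.
    # Guard against re-entry, since killpg() also signals this process.
    if getattr(handle_signal, 'handling', False):
        return
    handle_signal.handling = True
    os.killpg(os.getpgrp(), signal_number)
    if signal_number == signal.SIGINT:
        raise KeyboardInterrupt()

signal.signal(signal.SIGINT, handle_signal)
```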

Steps to reproduce

No response

Actual behavior

No response

Expected behavior

No response

Other notes / implementation ideas

No response

borgmatic version

1.9.2

borgmatic installation method

via OpenBSD ports tree (tweaked to use different versions)

Borg version

borg 1.4.0

Python version

Python 3.11.10

Database version (if applicable)

psql (PostgreSQL) 17.2 / mysql from 11.4.4-MariaDB, client 15.2 for OpenBSD (amd64) using readline 4.3

Operating system and version

OpenBSD 7.6-current (GENERIC.MP) #445: Sun Nov 24 23:22:18 MST 2024 deraadt@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Owner

Thanks for the detailed ticket. The main item that changed in borgmatic 1.9.2 that could in theory impact database dump streaming (and therefore cause hangs) is this item from the changelog:

  • #934: Update the logic that probes for the borgmatic streaming database dump, bootstrap metadata, and check state directories to support more platforms and use cases. See the documentation for more information: https://torsion.org/borgmatic/docs/how-to/backup-your-databases/#runtime-directory

So looking at those docs and working through the documented runtime directory logic (a sketch of this fallback order follows the list):

  • I'm guessing you don't have user_runtime_directory configured.
  • I'm guessing the XDG_RUNTIME_DIR environment variable isn't set on your system because you're on OpenBSD, not Linux.
  • I'm guessing RUNTIME_DIRECTORY isn't set, because you're probably not using systemd on OpenBSD.
  • I'm guessing TMPDIR is set, and probably pointing to something like /tmp. However this would've also been the case with borgmatic 1.9.1, so I'm not sure how an upgrade to 1.9.2 would impact things.
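
For reference, that probing order reads roughly like the following sketch (a hypothetical helper, not borgmatic's actual code; the real logic also creates the directory):

```python
import os
import tempfile

def probe_runtime_directory(config):
    # Sketch of the documented fallback order only. When the search falls
    # all the way through to the temporary directory, borgmatic creates a
    # randomly named subdirectory there, e.g. /tmp/borgmatic-i7vxd3xd/borgmatic.
    return (
        config.get('user_runtime_directory')
        or os.environ.get('XDG_RUNTIME_DIR')    # e.g. set by pam_systemd on Linux
        or os.environ.get('RUNTIME_DIRECTORY')  # e.g. set via systemd RuntimeDirectory=
        or os.environ.get('TMPDIR')
        or os.environ.get('TMP')
        or tempfile.gettempdir()                # typically /tmp
    )
```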

When you look at your borgmatic logs, do you see a line that says something like the following early in the logs?

test: Using runtime directory /tmp/borgmatic-i7vxd3xd/borgmatic

If so, what is the runtime directory that's listed there? If you run borgmatic again, does a different runtime directory get used? And if possible, can you downgrade to 1.9.1 and run borgmatic again to see its runtime directory as well? Does the hang problem go away with 1.9.1?

witten added the
question / support
label 2024-11-29 16:51:51 +00:00
Author

Correct: user_runtime_directory wasn't configured, and XDG_RUNTIME_DIR and RUNTIME_DIRECTORY weren't set; neither were TMPDIR or TMP. Apologies, I forgot to mention that I have a patch in 1.9.1 to use ~/.cache instead of /run/user/{os.getuid()} (though /run/user would have failed anyway, as it doesn't exist), so it was using ~/.cache/borgmatic as the runtime dir.

With 1.9.2, it is correctly using a new directory in /tmp each time: Using runtime directory /tmp/borgmatic-chznt6pq/borgmatic, /tmp/borgmatic-pjfqt01x/borgmatic etc.

I have done further tests with 1.9.2:

  • I've only been able to get it to hang with databases and a reasonably large file set. Tests with just one or two databases and /etc are not hanging. (I was hoping I could save some time with tests.)

  • I've now tried specifying user_runtime_directory: /root/.borgmatic and that seems to be working reliably, no hangs (tested both 1.9.2 and 1.9.3).

Given those results (and since I know exactly what directory it was using), I haven't retested with 1.9.1 now - I'm pretty confident it's not hanging there.

So the problem is only occurring when the randomly-generated directory name under /tmp is used.

Owner

It's possible that it's hanging on one of the files in your reasonably large file set, because enabling databases implicitly enables read_special—which can cause hangs if Borg encounters a special file. One way to verify that theory would be to enable your large file set, disable databases, but then enable the read_special option explicitly. If it hangs with read_special but not without, then it's pretty clearly encountering a special file that's causing problems.

Another thing to try would be to enable just databases but omit your large file set. If that hangs, then the problem is almost certainly related to the temporary database named pipe. If it doesn't hang, then that indicates the databases (on their own) are not the problem.

It would be really odd if one temporary directory caused hangs though and the other one didn't. Are /tmp and /root on different filesystems or even different devices? If so, any ideas about the relative characteristics of them that might impact this?

SSH command line: ['ssh', 'XXX@XXXX', 'XXX/borg', 'serve', '--debug']
Remote: using builtin fallback logging configuration
Remote: 33 self tests completed in 0.51 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'/./backups'
Remote: Resolved repository path to 'XXX/backups'
Remote: Verified integrity of XXX/backups/index.55128
TAM-verified manifest
security: read previous location 'ssh://XXX@XXXX/./backups'
security: read manifest timestamp '2024-12-17T20:11:19.427763'
security: determined newest manifest timestamp as 2024-12-17T20:11:19.427763
security: repository checks ok, allowing access
Creating archive at "ssh://XXX@XXXX/./backups::nuc-2024-12-17T21:13:44.908444"
Verified integrity of /tmp/borg/59eb74739ec68bdf2a2ff83ba634f22f727d36452371e0b37c899284530023ce/chunks
Reading files cache ...
Verified integrity of /tmp/borg/59eb74739ec68bdf2a2ff83ba634f22f727d36452371e0b37c899284530023ce/files
security: read previous location 'ssh://XXXX@XXXX/./backups'
security: read manifest timestamp '2024-12-17T20:11:19.427763'
security: determined newest manifest timestamp as 2024-12-17T20:11:19.427763
security: repository checks ok, allowing access
Processing files ...
Remote: Cleaned up 0 uncommitted segment files (== everything after segment 55128).
Remote: Verified integrity of XXXX/backups/hints.55128
Remote: check_free_space: required bytes 895591458, free bytes 5314181431296
security: saving state for 59eb74739ec68bdf2a2ff83ba634f22f727d36452371e0b37c899284530023ce to /root/.config/borg/security/59eb74739ec68bdf2a2ff83ba634f22f727d36452371e0b37c899284530023ce
security: current location   ssh://XXXX@XXXX/./backups
security: key type           3
security: manifest timestamp 2024-12-17T20:16:39.184113
RemoteRepository: 864.13 kB bytes sent, 10.73 kB bytes received, 33 messages sent
terminating with success status, rc 0

I have the same issue (I think) on Arch Linux.
It just spins to 100% CPU and stays there.
I tried setting user_runtime_directory and/or temporary_directory, but for me that didn't change anything - might be because I used tmpfs for both directories.

I tried disabling database backup (Postgres) and it worked like a charm again. I've had this issue since around the same time or borgmatic version.

Fun fact:
even if borgmatic spins up to 100% CPU with database backup enabled, the backup is complete; just everything afterwards, like prune, compact, and check, is missing. In the log (-v2) you see it exit with code 0 when database backup is enabled, but the next prompt (Calling bootstrap hook function remove_data_source_dumps) never comes.


Just another update - I have this issue only when borg itself exits with return code 0 when creating a backup.
I just got a return code 100 from borg (because a file changed during the backup) and borgmatic completed gracefully afterwards.
So whatever is causing the issue, it only occurs when the return code is 0.

Owner

@nachfuellbar Thanks for weighing in. What runtime directory is borgmatic actually using when you don't specify one? You can see it in the log entry containing Using runtime directory in borgmatic's verbose output. Does that directory still exist on the filesystem during the hang? And when you ctrl-C the hang, do you get a similar traceback to what @sthen is reporting above? (You might not need pdb to get it.)

The fact that this is hanging after Borg exits makes me think it's not the usual problem of Borg hanging on a special file (e.g. a database named pipe). But it is interesting that the hang only happens when databases are enabled. One difference when databases are enabled is that there are multiple processes involved: The database dump process, streaming via a named pipe to Borg's process. I wonder if there's somehow a new deadlock between them.

Also of note is that the file where the hang is ostensibly occurring, execute.py, hasn't changed since April!
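
To make those moving parts concrete, here's a stripped-down, runnable sketch of that execution model, using harmless stand-ins (echo for the dump command, cat for Borg) rather than borgmatic's actual execute.py code:

```python
import os
import subprocess
import tempfile

# Stand-in for the runtime directory (normally e.g. /run/user/0/borgmatic).
runtime_dir = tempfile.mkdtemp(prefix='borgmatic-demo-')
pipe_path = os.path.join(runtime_dir, 'all')
os.mkfifo(pipe_path)  # the named pipe the database dump streams into

# Stand-in for the dump process (e.g. pg_dumpall): blocks until a reader
# opens the pipe, and again whenever the pipe buffer is full.
dump = subprocess.Popen(['sh', '-c', f'echo dump-data > {pipe_path}'])

# Stand-in for `borg create --read-special`, which consumes the pipe.
borg = subprocess.Popen(['cat', pipe_path], stdout=subprocess.DEVNULL)

# Simplified monitoring loop: keep polling until *every* process has exited.
# If one process never exits (say, a dump still blocked on its pipe), poll()
# returns None forever and the loop spins without blocking -- consistent
# with the frequent wait4() calls and 100% CPU reported in this ticket.
processes = [dump, borg]
while any(process.poll() is None for process in processes):
    pass

print('exit codes:', [process.returncode for process in processes])
```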


It's using /run/user/0 as a runtime directory

The runtime directory still exists when it hangs.
After ctrl-C, the runtime directory continues to exist (including the named pipe):

ll /run/user/0/borgmatic/*/*
-rw-r--r-- 1 root root 78 Dec 17 23:26 /run/user/0/borgmatic/bootstrap/manifest.json

/run/user/0/borgmatic/postgresql_databases/XXXX:
total 0
drwx------ 2 root root 60 Dec 17 23:26 .
drwxr-xr-x 3 root root 60 Dec 17 23:26 ..
prw------- 1 root root  0 Dec 17 23:26 all

Here's the error when I ctrl-C the process - I don't know if the issue is exactly the same, because it seems similar but not identical.
Might be because I'm already on 1.9.4 right now:

Traceback (most recent call last):
  File "/usr/bin/borgmatic", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 943, in main
    or list(collect_configuration_run_summary_logs(configs, config_paths, arguments))
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 825, in collect_configuration_run_summary_logs
    results = list(run_configuration(config_filename, config, config_paths, arguments))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 147, in run_configuration
    yield from run_actions(
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 332, in run_actions
    yield from borgmatic.actions.create.run_create(
  File "/usr/lib/python3.12/site-packages/borgmatic/actions/create.py", line 210, in run_create
    json_output = borgmatic.borg.create.create_archive(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/borg/create.py", line 432, in create_archive
    return execute_command_with_processes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 424, in execute_command_with_processes
    captured_outputs = log_outputs(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 154, in log_outputs
    line = ready_buffer.readline().rstrip().decode()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 21, in handle_signal
    os.killpg(os.getpgrp(), signal_number)
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 27, in handle_signal
    raise KeyboardInterrupt()
KeyboardInterrupt
Owner

Thanks for those details. The fact that your runtime directory isn't in a temporary directory allows us to eliminate the possibility of the hang stemming from temporary directory removal issues, at least on your machine. So that's something. And I'd expect the named pipe to still be there, given that the ctrl-C prevents cleanup steps from running.

And yeah, that traceback isn't exactly the same, but it could indicate that on both machines borgmatic is stuck in a loop waiting for a condition that will never happen, since both of the final borgmatic lines in your respective tracebacks occur in the same loop. One way to test that theory would be to run your repro and then ctrl-C it a few more times. If it stops on different lines in that same loop, that would support the theory. If it always stops on the same line, that wouldn't necessarily eliminate the theory though.

Traceback (most recent call last):
  File "/usr/bin/borgmatic", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 943, in main
    or list(collect_configuration_run_summary_logs(configs, config_paths, arguments))
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 825, in collect_configuration_run_summary_logs
    results = list(run_configuration(config_filename, config, config_paths, arguments))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 147, in run_configuration
    yield from run_actions(
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 332, in run_actions
    yield from borgmatic.actions.create.run_create(
  File "/usr/lib/python3.12/site-packages/borgmatic/actions/create.py", line 210, in run_create
    json_output = borgmatic.borg.create.create_archive(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/borg/create.py", line 432, in create_archive
    return execute_command_with_processes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 424, in execute_command_with_processes
    captured_outputs = log_outputs(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 173, in log_outputs
    exit_code = process.poll() if output_buffers else process.wait()
                ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1236, in poll
    return self._internal_poll()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1990, in _internal_poll
    pid, sts = _waitpid(self.pid, _WNOHANG)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 21, in handle_signal
    os.killpg(os.getpgrp(), signal_number)
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 27, in handle_signal
    raise KeyboardInterrupt()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/bin/borgmatic", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 943, in main
    or list(collect_configuration_run_summary_logs(configs, config_paths, arguments))
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 825, in collect_configuration_run_summary_logs
    results = list(run_configuration(config_filename, config, config_paths, arguments))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 147, in run_configuration
    yield from run_actions(
  File "/usr/lib/python3.12/site-packages/borgmatic/commands/borgmatic.py", line 332, in run_actions
    yield from borgmatic.actions.create.run_create(
  File "/usr/lib/python3.12/site-packages/borgmatic/actions/create.py", line 210, in run_create
    json_output = borgmatic.borg.create.create_archive(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/borg/create.py", line 432, in create_archive
    return execute_command_with_processes(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 424, in execute_command_with_processes
    captured_outputs = log_outputs(
                       ^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/execute.py", line 146, in log_outputs
    other_process.poll() is None
    ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1236, in poll
    return self._internal_poll()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1983, in _internal_poll
    if not self._waitpid_lock.acquire(False):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 21, in handle_signal
    os.killpg(os.getpgrp(), signal_number)
  File "/usr/lib/python3.12/site-packages/borgmatic/signals.py", line 27, in handle_signal
    raise KeyboardInterrupt()
KeyboardInterrupt

Seems like it's really stuck in that while loop and doesn't see when the process has exited.
After some retries I could reproduce borgmatic running fine when borg returns code 100 because of a file that changed during the backup.
Which is kinda funny, since that's a warning and not "really" successful.

Owner

Seems like it's really stuck in that while loop and doesn't see when the process has exited.

It may actually be seeing when Borg exits just fine, but that's not the only process borgmatic is monitoring—there's also the database dump process that's feeding data to Borg via a named pipe. If that dump process isn't exiting cleanly, for instance if it's expecting a last bit of data to get consumed by Borg that's never consumed, then that could explain a hang. I'm just giving an example though; there could be any number of failure modes here.
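
As a toy illustration of that particular failure mode (not borgmatic code): a writer to a named pipe blocks in write() once the kernel pipe buffer fills, and it stays blocked for as long as the reader keeps the pipe open without draining it:

```python
import os
import subprocess
import tempfile
import time

pipe_path = os.path.join(tempfile.mkdtemp(), 'demo-pipe')
os.mkfifo(pipe_path)

# Writer: tries to push far more than the kernel pipe buffer (often 64 KiB
# on Linux) through the pipe in one go.
writer = subprocess.Popen(['sh', '-c', f'head -c 1000000 /dev/zero > {pipe_path}'])

# Reader: opens the pipe, consumes only a little, then keeps the descriptor
# open without reading further -- like a consumer that stops draining.
reader_fd = os.open(pipe_path, os.O_RDONLY)
os.read(reader_fd, 1024)

time.sleep(1)
print('writer returncode:', writer.poll())  # -> None: still blocked in write(2)

# Closing the read end unblocks the writer (it exits via SIGPIPE/EPIPE).
os.close(reader_fd)
writer.wait()
```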

Anyway, if you're willing, could you try temporarily downgrading versions of borgmatic one by one until the hang disappears? For instance, if you're on 1.9.4 now, try a repro on 1.9.3, then 1.9.2, and so on until there's no more hang. That would at least give me a direction to go in pinpointing the problem.

Additionally, if I gave you a heavily instrumented copy of execute.py (probably just with a bunch of print statements), is that something you could easily drop into your copy of borgmatic and try out? Or would that be difficult to do on your machine?

Thanks for your patience here.

Owner

After some retries I could reproduce borgmatic running fine when borg returns code 100 because of a file that changed during the backup
Which is kinda funny, since that's a warning and not "really" successful

This is interesting. Presumably because it's only a warning from Borg's (and borgmatic's) perspective, Borg would run and exit in exactly the same way, but just with a different exit code. So I'm not sure why this wouldn't cause a hang. It is perhaps a clue though.

It does make me think: If you comment out your source_directories: entirely and leave database backups on, does the hang go away or not?


Still hangs after trying it without source_directories.

For trying older versions etc., I'd need some more time ;)


Had to go back more versions than I thought I'd need to.
It happens since 1.9.0; with 1.8.14 it worked without issues.
So I'm not exactly sure if it's the same issue as the issue creator has - but maybe it just didn't get noticed before, since the backups are still created.

If you have any versions of execute.py with more prints, I could test them now.

Owner

It happens since 1.9.0; with 1.8.14 it worked without issues.

Interesting! And maybe a little unfortunate in terms of pinning this down, just because so much changed in 1.9.0.

Okay, I've instrumented log_outputs() of execute.py to hell and back with print debugging. You can either clone the whole branch here or if it's easier just replace the single execute.py file. This branch is based off of 1.9.4 (well, main, which is very similar right now), so if you replace the single file, probably good to switch back to 1.9.4 first.

Thanks for being willing to do this!


Hey, I'm more than happy to help if a developer is this nice - unfortunately that's not my usual experience ;)

So I'm pretty sure that's where the issue is.
The Postgres backup gets no return code, and the /run/user directories, which aren't in my source_directories, don't exist in my completed backup (borgmatic/bootstrap is still found in the completed backup under /root/.borgmatic/bootstrap <- forget about this, the manifest timestamp is from when I tried 1.8.14).

Manually executing the pg_dumpall command from borgmatic works, though:

PGPASSWORD=XXX pg_dumpall --no-password --clean --if-exists --host XXXX --username XXX

Here's the output of the loop after the backup, which repeats more than 10 times a second:

*** command: ('borg', 'create', '--exclude-from', '/tmp/tmpyk6inpnz', '--compression', 'zstd', '--one-file-system', '--read-special', '--remote-path', 'XXX/borg', '--debug', '--show-rc', 'ssh://XXX@XXX/./backups::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}', '/', '/boot', '/run/user/0/./borgmatic/bootstrap', '/run/user/0/./borgmatic/postgresql_databases', '/var/lib/containers/storage', '/var/lib/containers/storage/volumes', '/var/lib/machines/XXX')
*** exit_status: ('borg', 'create', '--exclude-from', '/tmp/tmpyk6inpnz', '--compression', 'zstd', '--one-file-system', '--read-special', '--remote-path', 'XXX/borg', '--debug', '--show-rc', 'ssh://XXX@XXX/./backups::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}', '/', '/boot', '/run/user/0/./borgmatic/bootstrap', '/run/user/0/./borgmatic/postgresql_databases', '/var/lib/containers/storage', '/var/lib/containers/storage/volumes', '/var/lib/machines/XXX')
*** top of outermost loop
*** selecting on: [<_io.BufferedReader name=3>, <_io.BufferedReader name=5>]
*** ready_buffers: [<_io.BufferedReader name=5>]
*** top of ready_buffer outer loop: <_io.BufferedReader name=5>
*** ready_process: <Popen: returncode: 0 args: ('borg', 'create', '--exclude-from', '/tmp/tmpyk...>
*** ready_process has exited
*** top of ready_buffer inner loop
*** bailing
*** top of process loop: <Popen: returncode: None args: 'pg_dumpall --no-password --clean --if-exists...>
*** exit_code: None
*** continue
*** top of process loop: <Popen: returncode: 0 args: ('borg', 'create', '--exclude-from', '/tmp/tmpyk...>
*** exit_code: 0

And here's the backup - as you can see, there is no run directory. I don't know if this is intentional, but I don't think so:

ls -lah
total 2.0K
drwxr-xr-x 1 root root 0 Dec 18 20:32 .
drwxr-xr-x 1 root root 0 Dec 18 20:32 .
drwxr-xr-x 1 root root 0 Dec 18 20:39 ..
lrwxrwxrwx 1 root root 7 Nov 21 09:56 bin -> usr/bin
drwxr-xr-x 1 root root 0 Dec 12 19:04 boot
drwxr-xr-x 1 root root 0 Dec 18 18:29 etc
drwxr-xr-x 1 root root 0 Apr 19  2024 home
lrwxrwxrwx 1 root root 7 Nov 21 09:56 lib -> usr/lib
lrwxrwxrwx 1 root root 7 Nov 21 09:56 lib64 -> usr/lib
drwx------ 1 root root 0 Mar  7  2024 lost+found
drwxr-xr-x 1 root root 0 Mar  7  2024 media
drwxr-xr-x 1 root root 0 Jan 19  2024 mnt
-rw-r--r-- 1 root root 0 Nov  8 12:58 oom
drwxr-xr-x 1 root root 0 Jan 19  2024 opt
drwxr-x--- 1 root root 0 Dec 18 18:28 root
lrwxrwxrwx 1 root root 7 Nov 21 09:56 sbin -> usr/bin
drwxr-xr-x 1 root root 0 Mar  7  2024 srv
dr-xr-xr-x 1 root root 0 Dec 17 23:29 sys
drwxr-xr-x 1 root root 0 Dec 18 18:29 usr
drwxr-xr-x 1 root root 0 Dec 16 16:34 var
Owner

So I'm pretty sure that's where the issue is.
The Postgres backup gets no return code, and the /run/user directories, which aren't in my source_directories, don't exist in my completed backup

Yeah, it's at least the proximate cause if not the root cause. Because if the pg_dump command is never exiting, then it makes some sense that there's a hang.

And here's the backup - as you can see, there is no run directory. I don't know if this is intentional, but I don't think so

Database dumps should show up in a borgmatic/... directory within the backup archive (assuming Borg 1.4+ is used). But in any case, it does appear to be missing from the backup, which might indicate the dump never completed. And that would be consistent with the dump process still running as indicated above.

Anyway, here are my current thoughts on next steps:

  • It looks like Borg is reading files from, among other places, /run/user/0/./borgmatic/postgresql_databases. But I wonder if the pg_dump command is actually writing to a named pipe there. So I tweaked execute.py to log the full command line that was run so you can see if it's actually dumping to, say, /run/user/0/borgmatic/postgresql_databases/all. No need to send me the output and laboriously redact it again. Just please let me know if the dump location lines up with the borg create paths. If not, that could explain the hang behavior you're seeing.
  • You mentioned tmpfs. There shouldn't be a problem with the runtime directory being there, but just in case there is, maybe try user_runtime_directory somewhere else to see if the hang goes away?
  • I see that your user runtime directory is currently /run/user/0. I wouldn't expect that to be the case if you're running as root, because IIRC, root doesn't have that directory created by Linux PAM. So do you happen to know how it's getting set? See my first comment above about the logic behind that in borgmatic.
  • Do you happen to have working_directory configured in your borgmatic configuration?
  • If none of the above pans out, then the next step—assuming you're game—might be a binary search through the changesets that make up version 1.9.0 to see where the hang started happening. I'd be happy to assist with that if it comes to it.
  • It looks like Borg is reading files from, among other places, /run/user/0/./borgmatic/postgresql_databases. But I wonder if the pg_dump command is actually writing to a named pipe there. So I tweaked execute.py to log the full command line that was run so you can see if it's actually dumping to, say, /run/user/0/borgmatic/postgresql_databases/all. No need to send me the output and laboriously redact it again. Just please let me know if the dump location lines up with the borg create paths. If not, that could explain the hang behavior you're seeing.

The named pipe is in the path you named (/run/user/0/./borgmatic/postgresql_databases/ip_address/all)
And this path is backed up, as you saw in the previous log.
The named pipe itself seems to work, too. I tried cat on it in one of my test runs just now and I get the pg_dumpall output.

  • You mentioned tmpfs. There shouldn't be a problem with the runtime directory being there, but just in case there is, maybe try user_runtime_directory somewhere else to see if the hang goes away?

I set it to /root/borgmatic and the backup worked just fine, and the Postgres backup is in this path (/root/borgmatic/borgmatic/postgresql_databases/ip_address/all):

drwx------ 1 root root    0 Dec 19 17:07 .
drwxr-xr-x 1 root root    0 Dec 19 17:07 ..
-rw------- 1 root root 212M Dec 19 17:07 all

I tried another tmpfs location like /tmp, and it looped again, producing a backup without the Postgres dump in it.
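
(For anyone following along: the override mentioned above is a single top-level option in borgmatic's configuration. A hypothetical excerpt matching the path used here:)

```yaml
# Relocate borgmatic's runtime directory off tmpfs.
user_runtime_directory: /root/borgmatic
```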

  • I see that your user runtime directory is currently /run/user/0. I wouldn't expect that to be the case if you're running as root, because IIRC, root doesn't have that directory created by Linux PAM. So do you happen to know how it's getting set? See my first comment above about the logic behind that in borgmatic.

I honestly don't know when it's created, but at least gnupg, p11-kit, and systemd are using /run/user/0, so I think it's not unusual for it to exist this way.

  • Do you happen to have working_directory configured in your borgmatic configuration?

I've never set this variable. The only "path variable" I set was user_state_directory, because the default filled up my rootfs - but I've just noticed it hasn't actually been set since at least the last time I updated the config (mid-November).

  • If none of the above pans out, then the next step—assuming you're game—might be a binary search through the changesets that make up version 1.9.0 to see where the hang started happening. I'd be happy to assist with that if it comes to it.
Owner

The named pipe is in the path you named (/run/user/0/./borgmatic/postgresql_databases/ip_address/all)

Okay, then it sounds like at least pg_dump is probably writing to a location that Borg is reading from. So we can eliminate that.

I set it to /root/borgmatic and the backup worked just fine, and the Postgres backup is in this path (/root/borgmatic/borgmatic/postgresql_databases/ip_address/all)

I tried another tmpfs location like /tmp, and it looped again, producing a backup without the Postgres dump in it.

Super odd! Does this happen consistently over the course of several runs? I know I've successfully run borgmatic database dumps on tmpfs locations, so I wonder if the fact that you're defaulting to a tmpfs path has anything to do with the hang or if there's another correlated (or even random) factor at play.

Another thing I wonder about: Are your tmpfs directories mounted with the nodev option? You can see this by running mount | grep tmpfs, looking for the tmpfs filesystem in question (e.g. /tmp and /run/user/0 in your case) and seeing if nodev is specified among the parenthesized mount options. My guess is that nodev is there, and in theory it could prevent named pipes from working (see https://unix.stackexchange.com/questions/533675/meaning-of-nodev-in-mount-options-on-linux).

The weird thing is I have nodev on my tmpfs filesystems, and borgmatic database backups work fine here! So here's something to try to see if named pipes work on your tmpfs. In a root shell, try:

mkfifo /tmp/947
ls -l /tmp/947  # Ensure it shows up with an initial "p" for pipe.
echo "hi" > /tmp/947

And then in a separate root shell:

cat /tmp/947

That second shell should print "hi" and the first shell should exit.

I honestly don't know when it's created, but at least gnupg, p11-kit, and systemd are using /run/user/0, so I think it's not unusual for it to exist this way.

Okay.

What Linux distro are you using BTW?


The named pipe is in the path you named (/run/user/0/./borgmatic/postgresql_databases/ip_address/all)
I set it to /root/borgmatic and the backup worked just fine, and the Postgres backup is in this path (/root/borgmatic/borgmatic/postgresql_databases/ip_address/all)

I tried another tmpfs location like /tmp, and it looped again, producing a backup without the Postgres dump in it.

Super odd! Does this happen consistently over the course of several runs? I know I've successfully run borgmatic database dumps on tmpfs locations, so I wonder if the fact that you're defaulting to a tmpfs path has anything to do with the hang or if there's another correlated (or even random) factor at play.

Another thing I wonder about: Are your tmpfs directories mounted with the nodev option? You can see this by running mount | grep tmpfs, looking for the tmpfs filesystem in question (e.g. /tmp and /run/user/0 in your case) and seeing if nodev is specified among the parenthesized mount options. My guess is that nodev is there, and in theory it could prevent named pipes from working.

The weird thing is I have nodev on my tmpfs filesystems, and borgmatic database backups work fine here! So here's something to try to see if named pipes work on your tmpfs. In a root shell, try:

mkfifo /tmp/947
ls -l /tmp/947  # Ensure it shows up with an initial "p" for pipe.
echo "hi" > /tmp/947

And then in a separate root shell:

cat /tmp/947

That second shell should print "hi" and the first shell should exit.

my tmpfs has the following options, which come via fstab's default options:

rw,nosuid,nodev,nr_inodes=1048576,inode64

The named pipe is working, and it's working for the dump too - even running cat on the dump pipe while the backup runs works (which I tried on one of my test runs).

Your test with mkfifo works, too.

I can reproduce the issue persistently whenever the pipe is on any tmpfs, and just as persistently it works when the pipe is on a "real" disk.

I honestly don't know when it's created, but at least gnupg, p11-kit, and systemd are using /run/user/0, so I think it's not unusual for it to exist this way.

Okay.

What Linux distro are you using BTW?

I thought I wrote it already, but maybe not - it's Arch Linux, so it's pretty barebones with no weird changes like Ubuntu's.

Owner

Okay, thanks for trying that out. I saw conflicting information on whether named pipes work on nodev filesystems, but it sounds like they do on both your machine and mine.

I thought I wrote it already, but maybe not - it's Arch Linux, so it's pretty barebones with no weird changes like Ubuntu's.

Nope, you did. Looks like I just missed it.

So I'm a little at a loss on how to proceed from here. We could do the binary search thing on 1.9.0 changesets, but that's where the use of tmpfs directories began, and so it would probably just tell us what we already know—that on your machine borgmatic hangs when dumping on a tmpfs but not when dumping on a disk-backed filesystem.

Here's something that stands out to me though. You mentioned your non-tmpfs dump file listing above:

drwx------ 1 root root    0 Dec 19 17:07 .
drwxr-xr-x 1 root root    0 Dec 19 17:07 ..
-rw------- 1 root root 212M Dec 19 17:07 all

That strikes me as wrong, because all should be a named pipe (that line should start with "p"). But here it's just a regular file. However above when you're using /run/user/0 (presumably tmpfs), the dump location is an actual named pipe. That discrepancy is pretty weird, and I wonder if it could be related to the underlying problem here. You don't have format set in your postgresql_databases: entries, do you? If you try dumping a single database instead of all, does the hang behavior change?
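
For reference, the option being asked about sits on each database entry. A hypothetical config excerpt - whether format is set here, and whether that changes the pipe-versus-regular-file behavior, is exactly what's being probed:

```yaml
postgresql_databases:
    - name: all
      # format: custom   # (Hypothetical.) When a format is set, the dump
      #                   behavior can differ from the default streamed
      #                   pg_dumpall through a named pipe.
```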


I think I'm stupid.
Honestly, I'm sorry I used up that much of your time, because I hadn't thought about my exclude_patterns.
It seems they excluded the paths I tried with tmpfs - after removing the exclude, everything works again...

The exclude patterns have been there for years at least, because I have a lot of things that use tmpfs paths like /tmp and /run temporarily.
But yeah, it might be worth implementing a warning for people who either don't think about their exclude patterns anymore or just don't expect them to interfere with the database dumps.
Or an "exclude from exclude" for the directories borgmatic itself needs to function.

Owner

I'm so glad you figured it out! I think this one is on me, because I really should've asked for your config file up front, and then we probably would've caught this sooner.

Good idea about the warning, although I'm not sure how easy that would be to implement. I'll leave this ticket open to investigate it. At the very least I can document the potential conflict.

@sthen You might try checking your configured excludes as well to see if it's the same issue.

EDIT: FWIW I have a repro now of the hang by putting the user runtime directory path into exclude_patterns!

Owner

Okay, I ended up erroring instead of just warning in the case where the configured excludes overlap with the runtime directory. That's because it's a misconfiguration and basically guarantees a hang. I figured a moderately user-friendly error is preferable to hanging mysteriously.
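
Not the actual implementation, but the shape of such a guard is roughly this - a sketch with invented names, checking whether any configured exclude covers the runtime directory:

```python
import os


def check_excludes_against_runtime_directory(exclude_paths, runtime_directory):
    '''
    Raise if any configured exclude covers the runtime directory, since that
    would prevent Borg from reading the database dump pipes there. A sketch
    only; borgmatic's real logic also has to account for patterns.
    '''
    runtime = os.path.normpath(runtime_directory)

    for exclude in exclude_paths:
        exclude = os.path.normpath(exclude)

        # True when exclude equals runtime or is one of its parents.
        if os.path.commonpath((exclude, runtime)) == exclude:
            raise ValueError(
                f'The runtime directory {runtime} overlaps with the '
                'configured excludes. Please remove it from excludes '
                'or change the runtime directory.'
            )
```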

This change is in main and will be part of the next release. Calling the ticket done for now. Thanks again for all your patience (and assistance) here!

Owner

Released in borgmatic 1.9.5!


Hello, my backups just started failing after updating to 1.9.5; however, I don't use exclude_patterns at all!

However, I use patterns, and of course the runtime directory was not present in my patterns before, because it didn't need to be.

So to make it work now, do I have to add a custom user_runtime_directory and add it under my patterns?

Owner

@LelouBil It depends on exactly what's in your patterns. The point of the error is that the runtime directory can no longer be excluded via either excludes or patterns, because that can cause Borg hangs. But it might not have to be explicitly included as long as you aren't excluding it. For instance, the following patterns would trigger the new error message if the runtime directory exists at /run/user/1234:

R /
- /dev
- /run

That's because the parent directory of the runtime directory is getting excluded. To fix that, you could remove the exclude:

R /
- /dev

... or you could add an explicit include to override the exclude:

R /
- /dev
- /run
+ /run/user

This is just one particular example, of course, and your runtime directory may be at a different location. See the -v 2 borgmatic output for the path. And you shouldn't need to set user_runtime_directory as long as you're fine with the implicitly selected path (and can avoid excluding it).


I'm using docker-borgmatic, and my patterns element is a bunch of lines like these (because it is auto-generated):

patterns:
  - R /mnt/source/some_service/
  - "+ /mnt/source/some_service//*"

All paths begin with /mnt/source, and no path contains "/tmp"; however, it fails even when borgmatic chooses a directory under /tmp:

The runtime directory /tmp/borgmatic-bm_l33ha/borgmatic overlaps with the configured excludes. Please remove it from excludes or change the runtime directory.

Maybe docker-borgmatic adds some configuration that I'm not finding?

Owner

Hmm. I don't know of any configuration that docker-borgmatic adds. Can I get a look at your (redacted) configuration file and the verbose output of your borgmatic run that results in the error? Thanks.


I tried reproducing it with a minimal configuration without luck; however, after reading the logs more carefully, I found out that it failed with this error during a spot check.

A normal backup creation has no issues, so this seems to depend on the paths in my last backups?

Here's the relevant part of the logs. I added a print to show the paths_output variable at this point in create.py:

https://projects.torsion.org/borgmatic-collective/borgmatic/src/commit/8573660ff05e66fe6995957797df82cbebcf91ca/borgmatic/borg/create.py#L185

Does this seem normal to you?

Finished full repository check, no problems found.
terminating with success status, rc 0
Writing check time at /root/.local/state/borgmatic/checks/bdd12c6ff22e991be56c589b8b9ce00fe86a3fb554f900d232aac12b0d9a9eda/repository
local: Using runtime directory /tmp/borgmatic-pfkel7ih/borgmatic
local: Running spot check
/mnt/borg-repository: Calling bootstrap hook function use_streaming
/mnt/borg-repository: Calling mariadb hook function use_streaming
/mnt/borg-repository: Calling mysql hook function use_streaming
/mnt/borg-repository: Calling postgresql hook function use_streaming
/mnt/borg-repository: Calling sqlite hook function use_streaming
/mnt/borg-repository: Ignoring configured "read_special" value of false, as true is needed for database hooks.
/mnt/borg-repository: Collecting special file paths
BORG_RELOCATED_REPO_ACCESS_IS_OK=*** BORG_UNKNOWN_UNENCRYPTED_REPO_ACCESS_IS_OK=*** BORG_EXIT_CODES=*** borg create --patterns-from /tmp/tmp81qeb9qm --read-special --list --filter AMEx- --dry-run /mnt/borg-repository::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} --dry-run --list
paths_output='- /mnt/source/paperless_data/log/.__paperless.lock\n- /mnt/source/paperless_data/log/paperless.log\n- /mnt/source/paperless_data/log/.__celery.lock\n- /mnt/source/paperless_data/log/celery.log.20\n- /mnt/source/paperless_data/log/celery.log.19\n- /mnt/source/paperless_data/log/celery.log.18\n- /mnt/source/paperless_data/log/celery.log.17\n- /mnt/source/paperless_data/log/celery.log.16\n- /mnt/source/paperless_data/log/celery.log.15\n- /mnt/source/paperless_data/log/celery.log.14\n- /mnt/source/paperless_data/log/celery.log.13\n- /mnt/source/paperless_data/log/celery.log.12\n- /mnt/source/paperless_data/log/celery.log.11\n- /mnt/source/paperless_data/log/celery.log.10\n- /mnt/source/paperless_data/log/celery.log.9\n- /mnt/source/paperless_data/log/celery.log.8\n- /mnt/source/paperless_data/log/celery.log.7\n- /mnt/source/paperless_data/log/celery.log.6\n- /mnt/source/paperless_data/log/celery.log.5\n- /mnt/source/paperless_data/log/celery.log.4\n- /mnt/source/paperless_data/log/celery.log.3\n- /mnt/source/paperless_data/log/celery.log.2\n- /mnt/source/paperless_data/log/celery.log.1\n- /mnt/source/paperless_data/log/celery.log\n- /mnt/source/paperless_data/log\n- /mnt/source/paperless_data/migration_lock\n- /mnt/source/paperless_data/index/MAIN_WRITELOCK\n- /mnt/source/paperless_data/index/_MAIN_65.toc\n- /mnt/source/paperless_data/index\n- /mnt/source/paperless_data/.index_version\n- /mnt/source/paperless_data/celerybeat-schedule.db\n- /mnt/source/paperless_data\n'
local: Error running actions for repository
The runtime directory /tmp/borgmatic-pfkel7ih/borgmatic overlaps with the configured excludes. Please remove it from excludes or change the runtime directory.

If I understand the logic in collect_special_file_paths correctly, the fact that no directory in paths_output is a parent of the runtime directory means it was excluded - or else it would have been in the list?

Since you error out if there are no paths to skip.

Should the runtime directory be added to the paths by default?
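
(As I read that logic, it is roughly the following - a simplified sketch, not borgmatic's real code; the helper name and line parsing are invented:)

```python
import os
import stat
import subprocess


def special_files_under(dry_run_command, runtime_directory):
    '''Sketch of the check described above, not borgmatic's real code.'''
    # "borg create --dry-run --list" prints each path Borg would back up,
    # one per line, prefixed with a status character such as "- ".
    output = subprocess.check_output(dry_run_command, text=True)
    paths = [line[2:] for line in output.splitlines() if line.startswith('- ')]

    def is_special(path):
        mode = os.stat(path, follow_symlinks=False).st_mode
        return stat.S_ISFIFO(mode) or stat.S_ISCHR(mode) or stat.S_ISBLK(mode)

    special = [path for path in paths if is_special(path)]

    # The database dump pipes live under the runtime directory. If no
    # special path is in there, the patterns/excludes must have dropped it,
    # and the real "borg create" would hang reading pipes nobody feeds -
    # so error out instead.
    if not any(path.startswith(runtime_directory) for path in special):
        raise ValueError(
            f'The runtime directory {runtime_directory} overlaps with the '
            'configured excludes.'
        )

    return special
```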

Owner

Thank you! That error is not actually from the spot check. It comes after, during the create action when special file paths are detected and excluded to prevent Borg hangs.

I assume you've added a debugging print for paths_output, which is helpful here. So to answer your question, yes, this seems "normal" in the sense that the code is apparently working as expected. But it's catching an "incorrect" configuration.

If I understand the logic in collect_special_file_paths correctly, the fact that no directory in paths_output is a parent of the runtime directory means it was excluded or else it would have been in the list ?

That's right!

Should the runtime directory be added to the paths by default ?

It is, but it's also possible with patterns to inadvertently "undo" that addition and exclude the runtime directory. That's what this error is ostensibly catching. So can I get a look at your configured patterns (redacted as necessary)?

A normal backup creation has no issues, so this seems to be dependent on the paths of my last backups ?

What makes you say that? What's the difference between a "normal" backup creation and one that results in this error?


Disabling the spot check makes the backup work without any issues.

Adding other debug prints, I found out the error happens during this function:

https://projects.torsion.org/borgmatic-collective/borgmatic/src/commit/8573660ff05e66fe6995957797df82cbebcf91ca/borgmatic/actions/check.py#L371


During collect_spot_check_source_paths, I mean - I forgot the specific line I tested yesterday.

Owner

That's really interesting that it only happens during the spot check. Collecting the spot check source paths does go through the code path that can ultimately lead to that error message, so it makes some amount of sense. So can I get a look at your configured patterns? That will help debug any potential interaction between those patterns and the borgmatic runtime directory.


I tested it with this single pattern and it errored.

patterns:
  - R /mnt/source/paperless_data/
  - "+ /mnt/source/paperless_data//*"

With spot checks enabled

Owner

Thanks. I think I might have a repro now and an idea about what's going on! So what I did was:

  • Enable a database hook. (This triggers the collect special files code path.)
  • Configure one root pattern. Although this also appears to repro with a source directory instead.
  • Configure spot checks.

Does that sound like it lines up with your configuration?

Anyway, I'll file a separate ticket to cover fixing this, and I'll link it here.
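
(Putting those three ingredients together, a minimal repro configuration might look like the following - the paths are hypothetical and the spot check numbers are arbitrary:)

```yaml
# Hypothetical minimal repro: database hook + root pattern + spot check.
repositories:
    - path: /mnt/borg-repository

patterns:
    - R /

postgresql_databases:
    - name: all

checks:
    - name: spot
      count_tolerance_percentage: 10
      data_sample_percentage: 1
      data_tolerance_percentage: 0.5
```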

Owner

#965.
