Borg OOM during backup does not result in borgmatic passing on the error #423

Closed
opened 2021-06-02 11:43:16 +00:00 by cultpony · 5 comments

What I'm trying to do and why

Attempting to perform a backup of considerable size (~2TB, millions of files) with limited memory (500MB)

Steps to reproduce (if a bug)

  1. Create large-ish dataset (2TB, roughly 2 million files)
  2. Run borgmatic in a container with limited memory (512MB)

Actual behavior (if a bug)

Borgmatic exits with error code 0 and pings a success to healthcheck endpoints.

Expected behavior (if a bug)

Borgmatic should exit with an error code and ping relevant healthcheck endpoints with failure. It should report that borg unexpectedly terminated.

Other notes / implementation ideas

I am attempting to run the backup again with more memory (2GB). I haven't enabled any special flags beyond what borgmatic already sets (a -v2 flag was used while debugging the issue).

Borgmatic is running in an LXC container that functions as an NFS storage daemon.

Environment

borgmatic version: 1.5.13

borgmatic installation method: pip

Borg version: 1.1.15

Python version: 3.8.10

Database version (if applicable): Not applicable

operating system and version: Alpine Linux 3.13, latest packages

Owner

Thank you for reporting this one! I have been having a hell of a time reproducing it... although admittedly I don't have 2 TB of free space to play with! I'm running the end-to-end tests via Docker (so `scripts/run-full-dev-tests`) and I've hacked up `execute.py` to set a super low virtual memory limit for any subprocesses executed (including Borg). I can induce memory errors this way, but borgmatic does not interpret them as success.
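
For reference, a minimal sketch of what that kind of subprocess memory cap might look like, assuming `resource.setrlimit` in a `preexec_fn` (this is not the actual `execute.py` change; the limit value and the command line are placeholders):

```
import resource
import subprocess

def limit_virtual_memory():
    # Runs in the child between fork() and exec(): cap the virtual address
    # space so that large allocations fail or the process gets killed.
    limit_bytes = 64 * 1024 * 1024  # arbitrarily low limit, for testing only
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

# POSIX-only: preexec_fn runs in the child process just before the command.
process = subprocess.Popen(['borg', '--version'], preexec_fn=limit_virtual_memory)
process.wait()
print('exit code:', process.returncode)
```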

I assume that you can reproduce the Borg out-of-memory conditions by running the Borg command directly without borgmatic? (You can get the command by running borgmatic with `-v 2` and looking for the `borg` execution.) If it does reproduce with just Borg, do you know what the exit code is when that occurs? (In Bash, you can get that with `echo $?`.) My guess is that it's `1`, which would be consistent with the behavior you're seeing from borgmatic... `1` is a Borg warning rather than an error, and so borgmatic does not consider it a failure. It's also possibly a negative exit code (like `-11`, indicating a segfault).

witten added the bug label 2021-06-07 17:35:52 +00:00
Author

This was rather interesting to try to replicate again, since by this point borg had eventually copied enough data to be able to complete the backup anyway.

But I was able to reproduce the behaviour by turning on all the statistics and progress switches I found in borgmatic.

```
host:~# borgmatic -v2 create --stats --progress --files
Ensuring legacy configuration is upgraded
/etc/borgmatic/config.yaml: No commands to run for pre-everything hook
/etc/borgmatic/config.yaml: Calling healthchecks hook function initialize_monitor
/etc/borgmatic/config.yaml: No commands to run for pre-backup hook
/etc/borgmatic/config.yaml: Calling healthchecks hook function ping_monitor
/etc/borgmatic/config.yaml: Pinging Healthchecks start
/etc/borgmatic/config.yaml: Using Healthchecks ping URL https://<domain>/ping/<uuid>/start
<host>:repo: Creating archive
/usr/bin/borg create --patterns-from /tmp/tmpa_mvi0gx --exclude-from /tmp/tmplq6lmhrv --exclude-caches --exclude-if-present .nobackup --exclude-if-present nobackup --checkpoint-interval 600 --compression auto,zstd --stats --debug --show-rc --progress <host>:repo::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} /tank >
using builtin fallback logging configuration
35 self tests completed in 0.19 seconds
SSH command line: ['ssh', '-y', '-i', '/root/.ssh/id_rsa.db', '<host>', 'borg', 'serve', '--umask=077', '--debug']
Remote: ssh: Failed to connect to agent
Remote: using builtin fallback logging configuration
Remote: 35 self tests completed in 0.10 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'repo'
Remote: Resolved repository path to '/srv/repos/<repo>/repo'
Remote: Verified integrity of /srv/repos/<repo>/repo/index.<number>
TAM-verified manifest
Killed stale lock <host>@<number>.
Removed stale exclusive roster lock for host <host>@<number> pid 20447 thread 0.
Removed stale exclusive roster lock for host <host>@<number> pid 20447 thread 0.
security: read previous location 'ssh://<host>/./repo'
security: read manifest timestamp '2021-06-08T02:18:08.965687'
security: determined newest manifest timestamp as 2021-06-08T02:18:08.965687
security: repository checks ok, allowing access
Creating archive at "<host>:repo::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}"
Verified integrity of /root/.cache/borg/<hash>/chunks
Reading files cache ...
/etc/borgmatic/config.yaml: No commands to run for post-backup hook
/etc/borgmatic/config.yaml: Calling healthchecks hook function ping_monitor
/etc/borgmatic/config.yaml: Pinging Healthchecks finish
/etc/borgmatic/config.yaml: Using Healthchecks ping URL https://<domain>/ping/<uuid>
/etc/borgmatic/config.yaml: Calling healthchecks hook function destroy_monitor
/etc/borgmatic/config.yaml: No commands to run for post-everything hook

summary:
/etc/borgmatic/config.yaml: Successfully ran configuration file
```

Meanwhile on the host:

```
Jun 08 07:10:27 lxc-host kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/202004255,task_memcg=/lxc/202004255/ns,task=borg,pid=22892,uid
Jun 08 07:10:27 lxc-host kernel: Memory cgroup out of memory: Killed process 22892 (borg) total-vm:545712kB, anon-rss:478444kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:1000kB oom_score
Jun 08 07:10:27 lxc-host kernel: oom_reaper: reaped process 22892 (borg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

The healthchecks endpoint reported an "OK" for the backup test-run here.

This was run with 512MB of memory on the 2TB dataset.

Running the command manually, with the temporary pattern/exclude files removed from the command line, yields exit code 137 but also prints the line "Killed" before exiting (though I am unsure whether that comes from the shell or from borg itself).

```
host:~# /usr/bin/borg create --exclude-caches --exclude-if-present .nobackup --exclude-if-present nobackup --checkpoint-interval 600 --compression auto,zstd --stats --debug --show-rc --progress <repo>:repo::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} /tank
using builtin fallback logging configuration
35 self tests completed in 0.19 seconds
SSH command line: ['ssh', '-y', '-i', '/root/.ssh/id_rsa.db', '<repo>', 'borg', 'serve', '--umask=077', '--debug']
Remote: ssh: Failed to connect to agent
Remote: using builtin fallback logging configuration
Remote: 35 self tests completed in 0.08 seconds
Remote: using builtin fallback logging configuration
Remote: Initialized logging system for JSON-based protocol
Remote: Resolving repository path b'repo'
Remote: Resolved repository path to '/srv/repos/<repo>/repo'
Remote: Verified integrity of /srv/repos/<repo>/repo/index.<number>
TAM-verified manifest
Killed stale lock <host>.
Removed stale exclusive roster lock for host <host> pid 20462 thread 0.
Removed stale exclusive roster lock for host <host> pid 20462 thread 0.
security: read previous location 'ssh://<repo>/./repo'
security: read manifest timestamp '2021-06-08T02:18:08.965687'
security: determined newest manifest timestamp as 2021-06-08T02:18:08.965687
security: repository checks ok, allowing access
Creating archive at "<repo>:repo::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}"
Verified integrity of /root/.cache/borg/<hash>/chunks
Reading files cache ...
Killed
host:~# echo $?
137
host:~#
```
Owner

Thanks so much for this detailed repro info! I wasn't able to exactly reproduce it, but I was able to create a representative manual test and the corresponding fix. What I think was going on is that when the OS kills Borg due to an OOM situation, it sends it a SIGKILL (9) signal. In bash, this shows up as a 137 exit status (128 + 9). In Python, however, this shows up as -9, which the borgmatic `execute.py` code was not interpreting as an error. (Borg errors are all >= 2.)

However, with a fix accounting for negative exit statuses, Borg getting killed is now handled properly, which leads me to believe the OOM situation will be handled properly as well.
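
As a rough sketch of the kind of check described here (not borgmatic's actual `execute.py` code): Python's `subprocess` reports a process killed by signal N as a return code of -N, so a SIGKILL from the OOM killer surfaces as -9 rather than the shell's 137, and the error check needs to cover both ranges:

```
import subprocess

BORG_ERROR_EXIT_CODE = 2  # Borg: 0 = success, 1 = warning, >= 2 = error

def exit_code_indicates_error(exit_code):
    # Negative codes mean the process was killed by a signal (e.g. -9 for
    # SIGKILL); treat those as errors alongside Borg's own error codes.
    return exit_code < 0 or exit_code >= BORG_ERROR_EXIT_CODE

result = subprocess.run(['borg', '--version'])  # placeholder invocation
if exit_code_indicates_error(result.returncode):
    raise RuntimeError(f'Borg failed with exit code {result.returncode}')
```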

This fix is in borgmatic 1.5.14, just released! Let me know whether it solves this issue for you.

Author

Verified to be fixed, thanks a lot!

Owner

Awesome, glad to hear it! The fix should also apply to any other situation in which Borg gets killed out from under borgmatic.

Reference: borgmatic-collective/borgmatic#423