Borg OOM during backup does not result in borgmatic passing on the error #423
Reference: borgmatic-collective/borgmatic#423
What I'm trying to do and why
Attempting to perform a backup of considerable size (~2 TB, millions of files) with limited memory (500 MB).
Steps to reproduce (if a bug)
Actual behavior (if a bug)
Borgmatic exits with error code 0 and pings a success to healthcheck endpoints.
Expected behavior (if a bug)
Borgmatic should exit with an error code and ping relevant healthcheck endpoints with failure. It should report that borg unexpectedly terminated.
Other notes / implementation ideas
I am attempting to run the backup again with more memory (2GB). I've not enabled any special flags for borgmatic beyond what borgmatic already sets (and a -v2 flag has been used in debugging the issue).
Borgmatic is running in an LXC container that functions as an NFS storage daemon.
Environment
borgmatic version: 1.5.13
borgmatic installation method: pip
Borg version: 1.1.15
Python version: 3.8.10
Database version (if applicable): Not applicable
operating system and version: Alpine Linux 3.13, latest packages
Thank you for reporting this one! I have been having a hell of a time reproducing it.. although admittedly I don't have 2 TB of free space to play with! I'm running the end-to-end tests via Docker (so `scripts/run-full-dev-tests`) and I've hacked up `execute.py` to set a super low virtual memory limit for any subprocesses executed (including Borg). I can induce memory errors this way, but borgmatic does not interpret them as success.
I assume that you can reproduce the Borg out-of-memory conditions by running the Borg command directly without borgmatic? (You can get the command by running borgmatic with `-v 2` and looking for the `borg` execution.) If it does reproduce with just Borg, do you know what the exit code is when that occurs? (In Bash, you can get that with: `echo $?`.) My guess is that it's `1`, which would be consistent with the behavior you're seeing from borgmatic. `1` is a Borg warning rather than an error, and so borgmatic does not consider it a failure. It's also possibly a negative exit code (like `-11`, indicating a segfault).
This was rather interesting to try to replicate again, since at this point, borg has eventually copied enough data to be able to complete the backup anyway.
But, I was able to reproduce the behaviour by turning on all the statistics and progress switches I found in borgmatic.
Meanwhile on the host:
The healthchecks endpoint reported an "OK" for the backup test-run here.
This was run with 512 MB of memory on the 2 TB dataset.
Running the command manually, removing the temporary files used, yields the exit code 137 but also prints the line "Killed" before exiting (though I am unsure if this is due to the shell or borg itself).
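For anyone trying to reproduce this without a 2 TB dataset, the approach mentioned earlier in the thread (capping virtual memory for subprocesses) could be sketched roughly as below. This is not the actual change made to `execute.py`: the names `MEMORY_LIMIT_BYTES`, `limit_virtual_memory` and `run_with_memory_limit` are hypothetical, the 512 MiB cap is arbitrary, and a Python one-liner stands in for the real Borg command. It assumes a POSIX system such as Linux.

```python
import resource
import subprocess
import sys

# Hypothetical cap, chosen only for illustration.
MEMORY_LIMIT_BYTES = 512 * 1024 * 1024


def limit_virtual_memory():
    # Runs in the child process between fork and exec (POSIX only),
    # so the limit applies to the command but not to the parent.
    resource.setrlimit(
        resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES)
    )


def run_with_memory_limit(command):
    # Hypothetical helper: run a command under the cap and return its exit status.
    return subprocess.run(command, preexec_fn=limit_virtual_memory).returncode


# Stand-in for the Borg command: try to allocate roughly 1 GB at once.
exit_code = run_with_memory_limit(
    [sys.executable, "-c", "data = bytearray(10**9)"]
)
print(exit_code)  # non-zero, because the allocation fails under the cap
```

Note that a limit like this makes allocations fail inside the process, which is different from the kernel killing the process from outside, so it may not trigger the exact behaviour reported here.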
Thanks so much for this detailed repro info! I wasn't able to exactly reproduce it, but I was able to create a representative manual test and the corresponding fix. What I think was going on is that when the OS kills Borg due to an OOM situation, it sends it a SIGKILL (9) signal. In bash, this shows up as a 137 exit status (128 + 9). In Python, however, this shows up as -9, which the borgmatic `execute.py` code was not interpreting as an error. (Borg errors are all >= 2.)
However, now with a fix accounting for negative exit statuses, Borg getting killed is handled properly, which leads me to believe the OOM situation will be handled properly too.
This fix is in borgmatic 1.5.14, just released! Let me know whether it solves this issue for you.
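The difference between the two exit statuses can be seen in a small experiment. This is only a sketch under assumptions: a child `sh` that sends SIGKILL to itself stands in for Borg being killed by the OOM killer, and `is_error` is a hypothetical helper illustrating the rule described above (negative or >= 2 counts as failure), not borgmatic's actual code. It assumes a POSIX system with `sh` available.

```python
import signal
import subprocess

# A child shell that sends SIGKILL to itself, standing in for a killed Borg.
killed = subprocess.run(["sh", "-c", "kill -9 $$"])

# The same kill, but observed by a parent shell through $?.
via_shell = subprocess.run(
    ["sh", "-c", "sh -c 'kill -9 $$'; echo $?"],
    capture_output=True,
    text=True,
)
shell_status = int(via_shell.stdout.strip())


def is_error(exit_code):
    # Hypothetical check: Borg errors are >= 2, and a negative value means
    # the process was terminated by a signal. 1 is a Borg warning.
    return exit_code < 0 or exit_code >= 2


print(killed.returncode)  # -9, i.e. -signal.SIGKILL as reported by subprocess
print(shell_status)       # 137, i.e. 128 + 9 as reported by the shell
print(is_error(killed.returncode), is_error(1), is_error(0))
```

A check that only tests for `exit_code >= 2` would treat -9 as success, which matches the behaviour reported in this issue. Since SIGKILL cannot be caught, the killed process itself cannot print anything, so the "Killed" line seen when running the command manually comes from the shell.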
Verified to be fixed, thanks a lot!
Awesome, glad to hear it! The fix should also apply to any other situation in which Borg gets killed out from under borgmatic.