BorgMatic doesn't return when a BEFORE-hook script keeps background processes alive? #522
Reference: borgmatic-collective/borgmatic#522
What I'm trying to do and why
I have one server hosting multiple VMs which contain different databases. The host already uses BorgMatic and SSHFS to back up files within the VMs, and I would like to enhance that to back up individual database dumps as well. The important point is that I do not want to set up BorgMatic within the VMs, and I don't want to store the dumps as files within the VMs either, e.g. because those VMs simply lack the necessary storage for additional full dumps. I'm aware of what BorgMatic already provides to support that use-case, but can't use it, because it lacks the flexibility I need.
Instead, what I'm doing right now is retrieving dumps using SSH and STDOUT: create an SSH connection, execute the dump tool of interest within the VM, and return the dump via STDOUT through SSH. And here's the important part: the VM host additionally creates named pipes and attaches the output of SSH to those named pipes. BorgMatic is configured to read those named pipes using `--read-special`. The overall setup seems to work: after starting SSH, I can read the data from the named pipes, e.g. using `cat` or `vi`.

Steps to reproduce (if a bug)
Here's the second important part: what I described above is implemented using a BEFORE-hook that starts SSH processes in the background using `nohup`. So the expectation is: BGM starts the hook, I set up my environment, the hook finishes, and afterwards BGM starts to read from the named pipes because of `--read-special`. This seems to somewhat work: I can see multiple subshells hosting the SSH processes and such. OTOH, the bash instance running the hook itself seems to end successfully: after printing its own PID as its last statement, I can see that this PID is really gone. I can't manually `kill -9` it or alike; `kill` tells me that the process is not available.

Actual behavior (if a bug)
The problem is that BGM seems to keep waiting for subshells of the hook to end. The BGM process keeps waiting without any further log output until I either terminate it or kill all of the subshells. Of course, after killing the subshells, reading from the named pipes doesn't work anymore, and BGM waits forever at those pipes for data which won't come anymore.
Expected behavior (if a bug)
I would expect BGM to recognize that ONLY the started process for the hook itself has finished and simply continue its actions, especially processing files. From my point of view it shouldn't care about additional child processes. Though, looking at the code, I couldn't find if/how child processes are taken into account.

But there needs to be some problem in BGM, because when I change my setup to write into files instead of using named pipes, not using background processes anymore, things work as expected. But I would like to avoid writing to files: because of storage consumption on the host, because my databases always need to be stored encrypted, which requires special setup, because the backup itself takes longer with intermediate files, because the server uses SSDs and it makes sense not to write more than necessary, etc.
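For concreteness, the hook setup described above might look roughly like the following minimal sketch. All names here are hypothetical, and a local `seq 3` stands in for the real `ssh vm 'dumpcmd'` producer:

```shell
#!/bin/sh
# Hypothetical sketch of the BEFORE-hook: create a named pipe and start a
# detached background producer, then exit so borgmatic can read the pipe
# with --read-special. In the real setup the producer would be something
# like: nohup ssh vm 'mysqldump mydb' > "$PIPE" 2>/dev/null &
PIPE=/tmp/db-dump.pipe
rm -f "$PIPE"
mkfifo "$PIPE"
nohup sh -c "seq 3 > '$PIPE'" >/dev/null 2>&1 &
echo "hook finished, PID $$"
```

After the hook itself has exited, a reader such as `cat /tmp/db-dump.pipe` still receives the producer's output through the pipe.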
Do you have any explanation for what I see?
Thanks!
Environment
borgmatic version: 1.5.15
borgmatic installation method: PIP, system wide
Borg version: 1.1.16
Python version: 3.8.10
operating system and version: Ubuntu 20.04
What I observe really seems to be some Python-specific behaviour; others have run into it as well:
https://stackoverflow.com/questions/65356187/have-subprocess-popen-only-wait-on-its-child-process-to-return-but-not-any-gran
Though, they claim that passing the special argument `shell=True` fixed it for them, but BGM seems to do that already: 7bd6374751/borgmatic/hooks/command.py (L70)
There are some other references about the relationship of first- and additional-level processes; those mention process groups and some sort of session. Though, that seems to be disabled by default in Python.
https://alexandra-zaharia.github.io/posts/kill-subprocess-and-its-children-on-timeout-python/
https://docs.python.org/3/library/subprocess.html#popen-constructor
I've updated BGM to 1.5.24 and still see the same behaviour. I have the feeling that either Python provides too many processes to BGM, or BGM already knows about all those children and wrongly expects output from those:
4d1d8d7409/borgmatic/execute.py (L48)
Though, BGM really does need to wait until the hook process itself has properly finished, of course, because only afterwards is my setup ready. It can not just start the process and forget it, as my pipes might not be ready to read yet.
Any hints what I might check for in `log_outputs`?

Trying to ask some other people with more knowledge of Python than me as well... :-)
https://stackoverflow.com/questions/71936728/under-which-circumstances-does-python-wait-for-sub-processes-of-a-started-proces
Based on some testing locally, I think what's happening here is that since borgmatic tries to read output from the `before_backup` child process (for purposes of logging it), that's preventing borgmatic from returning and continuing on with the backup. So one way you could work around it is per the following example. Say you've got a `before_backup` hook whose command starts a five-second sleep in the background: when you run borgmatic, it hangs on that `before_backup` step for five seconds while it's waiting for that command's output. But if you change the command so that its output goes to `/dev/null`, then the child process' stdout/stderr file descriptors are redirected to `/dev/null` instead of borgmatic, and there's nothing for borgmatic to wait on. When I tried that work-around, borgmatic no longer waited for the sleep and happily proceeded on to the backup.

Perhaps you can try something similar with your hook?
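The original example hooks did not survive in this copy of the thread, but the mechanism they illustrated can be shown with a self-contained sketch. Here `cat` plays borgmatic's role of reading the hook's output until EOF, and all commands are illustrative:

```shell
#!/bin/sh
# Variant 1: the backgrounded child inherits the hook's stdout, so the
# reader only sees EOF once the child exits (about 2 seconds later).
start=$(date +%s)
sh -c 'sleep 2 &' | cat >/dev/null
echo "inherited fds: waited $(( $(date +%s) - start ))s"

# Variant 2: the child's stdout/stderr are redirected to /dev/null, so
# nothing keeps the pipe open and the reader returns immediately.
start=$(date +%s)
sh -c 'sleep 2 >/dev/null 2>&1 &' | cat >/dev/null
echo "redirected fds: waited $(( $(date +%s) - start ))s"
```

The difference is purely about who still holds the write end of the pipe open, not about whether the child process is still running: the `sleep` keeps running in both variants.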
Waiting on the started hook script itself is perfectly fine and expected. That hook script is responsible for setting up the pipes etc. which need to be backed up afterwards. That might take 1 second or 2 or ... One can't know, so BGM waiting on that first level, directly started hook script for output seems correct to me.
The problem is that this hook script finishes at some point, the PID really is gone, and BGM still keeps waiting for something. I expected it would return, but it keeps waiting for the started background children, and that is wrong in my opinion. When I `killall` those children, BGM continues its work, but those children need to stay alive in my use-case.

I don't have enough knowledge of Python, but there might be two things happening. Either the returned process object doesn't only contain the first-level hook script, but all of its children as well for some reason; in that case BGM would simply wait for all of those processes, while it should only wait for the first one.

Or the process object really only contains the main hook script, but when trying to log process output, the output of the child processes somehow leaks in as well. When you look at the logging code, there are two loops: one over processes and one over buffers. One of those objects most likely wrongly contains the child processes, but I couldn't debug this on my own yet.
As well, I'm somewhat sure that I already properly detach my subprocesses, like in your example.
That is just a simple one-level example. The same principle of silencing stdout/stderr can be applied to an application called by your top-level hook script.
So normally, borgmatic waits for output on the top-level script's stdout/stderr file descriptors (via `select()`). However, when that script forks another application, my guess based on some research is that the application inherits the script's stdout/stderr file descriptors, and therefore borgmatic ends up waiting on that application's output too (unless you do the redirection trick).

Having said that, you are using `nohup` and doing some `/dev/null` redirection of your own, so that should be taking care of the file descriptors and prevent the wait. In fact, when I call a similar command from a top-level script, it doesn't wait at all. So I'm not sure why that's not working for you. Maybe try stripping it down to a simple command like a bare `sleep`, confirm that it works, and then build it back up to your full command until you find out what's causing it to fail?

Made some progress... The following are the running processes when the backup software doesn't move forward. All of the bash processes at the bottom most likely contain my started SSH instances, though I'm wondering why they are still associated with their parent bash. OTOH, all of those bash instances seem properly released from my hook shell script, which should be the one zombie process mentioned. I guess that zombie simply needs to stop to get this fixed...
Tested with a reduced sleep and recognized the following: without proper detach of STDIN/STDOUT etc., pretty much the same happens as with my SSH command. The only difference is that I really see individual `sleep` instances running, but BGM waits forever again. When redirecting STDOUT etc. like in the third command, I can see `sleep` instances running, but BGM DOES NOT wait forever on its own hook script anymore! :-)

So like with your tests, things work in general, and my problem is that some input/output channel is not properly detached from the hook script. But of course, for my use case I need STDOUT redirected to some file/pipe and available that way... :-/

Just debugged this further by redirecting STDOUT to `/dev/null` as well, and again BGM DOES NOT wait for its own hook script anymore. Of course that setup is of not much use... :-) I need to find a way to redirect the output and detach from the hook script at the same time. Though, some progress at last...
The above is wrong if one wants a totally independent background process, because of the redirection of SSH's output into a file. Who does that redirection in the end? It's the bash process starting the actual command, which in my case is the hook script of the backup software, and that makes the hook script a zombie in the end. That zombie tries to take care of the redirection, and that is the reason why all of those `bash` instances are still visible in my output of `ps axf`. When the redirection to the file is changed to `/dev/null`, the hook script really terminates entirely, the backup software continues, and the SSH instances are visible as such in `ps axf`.

Of course, in my concrete use-case I need the output redirected to the file; that's what this is all about. Though, I "simply" need it done differently: the shell executing my hook script needs to start a process taking care of the redirection on its own, while the starting shell is fully detached from that started process. That would be really easy if SSH were able to write to a file on its own, which doesn't seem to be the case; instead, it seems to rely entirely on the shell to take care of redirection. This means I need an additional shell instance: one executing SSH and taking care of the redirection into a file, while at the same time this shell instance is executed in the background of its parent and the parent ignores all input/output.
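The pattern described above can be sketched as follows. The path is hypothetical and `seq 3` again stands in for the real `ssh vm 'dumpcmd'` producer:

```shell
#!/bin/sh
# Hypothetical sketch: the compound command forms its own subshell, which
# owns the redirection into OUT and ignores SIGHUP, while the parent hook
# shell detaches from all of the subshell's inherited channels.
# In the real setup the producer would be: ssh vm 'mysqldump mydb'
OUT=/tmp/db-dump.out
( trap '' HUP; seq 3 > "$OUT" ) >/dev/null 2>&1 </dev/null &
wait    # demo only: the real hook script would simply exit here
cat "$OUT"   # prints 1 2 3, one per line
```

The key design point is that the `> "$OUT"` redirection sits inside the parentheses, so it belongs to the background subshell rather than to the parent, which keeps nothing attached to the parent's own stdout/stderr.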
In fact, I was pretty close to the solution all the time already, but simply didn't fully understand what I was doing. The examples with `nohup` only made things more difficult, because I need an additional shell instance in the end. The following is what I had before already, with the help of some other SO questions:

https://gist.github.com/bluekezza/e511f3f4429939a0f9ecb6447099b3dc
https://stackoverflow.com/a/54688673/2055163

The above executes my SSH command as a compound command, allowing SIGHUP etc. to be ignored. Depending on the config of the current shell, this might be necessary to keep the executed command running in the background when the parent ends.

The important point to understand is that this actually creates an additional subshell already, so exactly what I need! The problem with the above was that the parent shell waited for the output of the subshell, because the redirection of STDOUT was defined for the parent shell, pretty much like with my `nohup` examples. And this is the easy part now: as we have a subshell with its own individual STDOUT, that can simply be redirected into a file on its own! This way the parent shell can be fully detached from all channels, the compound command is executed in the background, the backup software continues to run because the hook script doesn't become a zombie, and it is finally able to read from the pipes as the database dumpers write into those using SSH. :-)

I've looked at the files the backup software created/restored and they look OK. I can see lots of SQL commands and data and stuff. What DOESN'T work is the following:
I've thought to give that a try simply to be safe, because I don't need STDIN and STDERR anyway. But with the above I don't get any content in the pipes at all, really only EOF, because the backup software creates empty files in the end. Additionally, I can see that the database dumpers are not even executed in their hosts, compared to the working solution, for which I can easily see CPU and I/O load in the system. It might have something to do with how I executed SSH using some function or whatever; I don't care too much anymore.
https://superuser.com/questions/1717487/how-to-properly-make-a-ssh-call-locally-async-background-independent
Wow, that's pretty tricky! I'm glad to hear you have a solution or at least a work-around.