BorgMatic doesn't return when a BEFORE-hook script keeps background processes alive? #522

Closed
opened 2022-04-18 19:05:49 +00:00 by ams_tschoening · 10 comments

What I'm trying to do and why

I have one server hosting multiple VMs which contain different databases. The host already uses BorgMatic and SSHFS to back up files within the VMs, and I would like to enhance that to back up individual database dumps as well. The important point is that I do not want to set up BorgMatic within the VMs, and I don't want to store the dumps as files within the VMs either, e.g. because those VMs simply lack the necessary storage for additional full dumps. I'm aware of what BorgMatic provides to support that use case already, but can't use it, because [it lacks the flexibility I need](https://projects.torsion.org/borgmatic-collective/borgmatic/issues/435).

Instead, what I'm doing right now is retrieving dumps using SSH and STDOUT: create an SSH connection, execute the dump tool of interest within the VM, and return the dump through SSH via STDOUT. And here's the important part: the VM host additionally creates named pipes and attaches the output of SSH to those named pipes. BorgMatic is configured to read those named pipes using `--read-special`.

The overall setup seems to work: after starting SSH, I can read the data from the named pipes, e.g. using `cat` or `vi`.
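
To make that concrete, here is a stripped-down sketch of the plumbing (host name, database and paths are placeholders, not my real setup):

```bash
# named pipe on the VM host; borgmatic later reads it via --read-special
mkfifo /backup/pipes/mydb.sql

# run the dump tool inside the VM and stream its STDOUT through SSH into the pipe
ssh vm1 'mysqldump --single-transaction mydb' > /backup/pipes/mydb.sql &

# manual check: the dump is readable from the pipe
head /backup/pipes/mydb.sql
```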

Steps to reproduce (if a bug)

Here's the second important part: what I described above is implemented using a BEFORE-hook that starts the SSH processes in the background using `nohup`. So the expectation is that BGM starts the hook, I set up my environment, the hook finishes, and afterwards BGM starts to read from the named pipes because of `--read-special`.

This seems to somewhat work: I can see multiple subshells hosting the SSH processes and such. OTOH, the bash instance running the hook itself seems to end successfully: after printing its own PID as its last statement, I can see that this PID is really gone. I can't manually `kill -9` it or the like; `kill` tells me that the process is not available.
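
In outline, the hook looks something like the following sketch (simplified; the real script loops over several databases and uses my own SSH config):

```bash
#!/bin/bash
# simplified BEFORE-hook: detach the SSH dump into the background with
# nohup, then let the hook itself exit
nohup ssh vm1 'mysqldump mydb' > /backup/pipes/mydb.sql 2> /dev/null &

echo "hook PID: $$"   # last statement; this PID is really gone afterwards
```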

Actual behavior (if a bug)

The problem is that BGM seems to keep waiting for subshells of the hook to end. The BGM process keeps waiting without any further log output until I either terminate it or kill all of the subshells. Of course, after killing the subshells, reading from the named pipes doesn't work anymore, and BGM waits forever at those pipes for data which will never come.

Expected behavior (if a bug)

I would expect BGM to recognize that ONLY the started hook process itself has finished, and simply continue its actions, especially processing files. From my point of view it shouldn't care about additional child processes. Though, looking at the code, I couldn't find if/how child processes are taken into account.

But there must be some problem in BGM, because when I change my setup to write into files instead of using named pipes, without background processes, things work as expected. But I would like to avoid writing to files: it consumes storage on the host, my databases always need to be stored encrypted which requires special setup, the backup itself takes longer with intermediate files, the server uses SSDs so it makes sense not to write more than necessary, etc.

Do you have any explanation for what I see?

Thanks!

Environment

borgmatic version: 1.5.15
borgmatic installation method: PIP, system wide
Borg version: 1.1.16
Python version: 3.8.10
operating system and version: Ubuntu 20.04

Author

What I observe really seems to be some Python-specific behaviour; others have seen it as well:

https://stackoverflow.com/questions/65356187/have-subprocess-popen-only-wait-on-its-child-process-to-return-but-not-any-gran

Though, they claim that passing the special argument `shell=True` fixed it for them, but BGM seems to do that already:

```python
execute.execute_command(
    [command],
    output_log_level=logging.ERROR
    if description == 'on-error'
    else logging.WARNING,
    shell=True,
)
```

https://github.com/borgmatic-collective/borgmatic/blob/7bd637475172a11700fc5f62a6b93d803c756123/borgmatic/hooks/command.py#L70

There are some other references about the relationship between first-level and deeper child processes; those mention process groups and some sort of session. Though, that seems to be disabled by default in Python.

```python
class subprocess.Popen(args, bufsize=-1, executable=None, stdin=None, stdout=None,
                       stderr=None, preexec_fn=None, close_fds=True, shell=False,
                       cwd=None, env=None, universal_newlines=None, startupinfo=None,
                       creationflags=0, restore_signals=True, start_new_session=False,
                       pass_fds=(), *, group=None, extra_groups=None, user=None,
                       umask=-1, encoding=None, errors=None, text=None, pipesize=-1)
```

https://alexandra-zaharia.github.io/posts/kill-subprocess-and-its-children-on-timeout-python/
https://docs.python.org/3/library/subprocess.html#popen-constructor
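
If I understand it correctly, `start_new_session=True` from that signature makes the child call `setsid()` before executing the command; the shell-level counterpart would be something like this sketch:

```bash
# run the background job in its own session, so it shares neither process
# group nor controlling terminal with the hook script anymore
setsid nohup sleep 15 < /dev/null > /dev/null 2>&1 &
```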

Author

I've updated BGM to 1.5.24 and still see the same behaviour. I have the feeling that either Python provides too many processes to BGM, or BGM already knows about all those children and wrongly expects output from them.

> `log_outputs`

https://projects.torsion.org/borgmatic-collective/borgmatic/src/commit/4d1d8d74091c4d7ef3e18acd73639be7874b119e/borgmatic/execute.py#L48

Though, BGM really does need to wait until the hook process itself has properly finished, of course, because only afterwards is my setup complete. It cannot just start the process and forget it, as my pipes might not be ready to read yet.

Any hints on what I might check for in `log_outputs`?

Author

Trying to ask some other people with more knowledge of Python than me as well... :-)

https://stackoverflow.com/questions/71936728/under-which-circumstances-does-python-wait-for-sub-processes-of-a-started-proces

Owner

Based on some testing locally, I think what's happening here is that since borgmatic tries to read output from the `before_backup` child process (for purposes of logging it), that's preventing borgmatic from returning and continuing on with the backup. So one way you could work around it is as per the following example. Let's say you've got a background hook like the following:

```yaml
before_backup:
    - "sleep 5s &"
```

And when you run borgmatic, it hangs on that `before_backup` step for five seconds while it's waiting for that command's output. But if you change it to this ...

```yaml
before_backup:
    - "sleep 5s > /dev/null 2>&1 &"
```

... then the child process's stdout/stderr file descriptors are redirected to `/dev/null` instead of borgmatic, and there's nothing for borgmatic to wait on. When I tried that work-around, borgmatic no longer waits for the sleep and happily proceeds on to the backup.

Perhaps you can try something similar with your hook?

Author

> And when you run borgmatic, it hangs on that `before_backup` step for five seconds while it's waiting for that command's output. But if you change it to this ...

Waiting on the started hook script itself is perfectly fine and expected. That hook script is responsible for setting up the pipes etc. which need to be backed up afterwards. That might take 1 second or 2 or ... one can't know, so BGM waiting for output on that first-level, directly started hook script seems correct to me.

The problem is that this hook script finishes at some point, the PID really is gone, and BGM still keeps waiting for something. I expected it to return, but it keeps waiting for the started background children, and that is wrong in my opinion. When I `killall` those children, BGM continues its work, but those children need to stay alive in my use case.

I don't have enough knowledge of Python, but there might be two things happening: either the returned process object doesn't only contain the first-level hook script but all of its children as well, for some reason. In that case BGM would simply wait for all of those processes, while it should only wait for the first one.

Or the process object really only contains the main hook script, but when trying to log process output, the output of the child processes somehow leaks in as well. When you look at the logging code, there are two loops: one over processes and one over buffers. One of those objects most likely wrongly contains the child processes, but I couldn't debug this on my own yet.

I'm also somewhat sure that I already properly detach my subprocesses, as in your example:

```bash
nohup ssh -F ${PATH_LOCAL_SSH_CFG} -f -n ${HOST_OTHER} ${cmd} < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &
```
Owner

> Waiting on the started hook script itself is perfectly fine and expected. That hook script is responsible for setting up the pipes etc. which need to be backed up afterwards. That might take 1 second or 2 or ... one can't know, so BGM waiting for output on that first-level, directly started hook script seems correct to me.

That is just a simple one-level example. The same principle of silencing stdout/stderr can be applied to an application called by your top-level hook script.

> Or the process object really only contains the main hook script, but when trying to log process output, the output of the child processes somehow leaks in as well. When you look at the logging code, there are two loops: one over processes and one over buffers. One of those objects most likely wrongly contains the child processes, but I couldn't debug this on my own yet.

So normally, borgmatic waits for output on the top-level script's stdout/stderr file descriptors (via `select()`). However, when that script forks another application, my guess based on some research is that the application inherits the script's stdout/stderr file descriptors—and therefore borgmatic ends up waiting on that application's output too (unless you do the redirection trick).
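
That inheritance is easy to demonstrate in a plain shell, without borgmatic involved at all: a reader on a pipe only sees EOF once *every* process holding the write end has exited.

```bash
# the subshell exits immediately, but the backgrounded sleep inherits its
# stdout, so cat keeps waiting on the pipe for about five seconds
( sleep 5 & echo "script done" ) | cat

# with the child's stdout/stderr redirected away, cat returns immediately
( sleep 5 > /dev/null 2>&1 & echo "script done" ) | cat
```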

Having said that, you *are* using `nohup` and doing some `/dev/null` redirection of your own, so that *should* be taking care of the file descriptors and prevent the wait. In fact, when I run a similar command from a top-level script, it doesn't wait at all:

```bash
#!/bin/bash

nohup sleep 5s > fake.db 2> '/dev/null' &
```

So I'm not sure why that's not working for you. Maybe try stripping it down to a simple command like the sleep above, confirm that it works, and then build it back to your full command until you find out what's causing it to fail?

witten added the **waiting for response** label 2022-04-28 17:48:54 +00:00
Author

Made some progress... The following are the running processes when the backup software doesn't move forward. All of the bash processes at the bottom most likely contain my started SSH instances, though I'm wondering why they are still associated with their parent bash. OTOH, all of those bash instances seem properly released from my hook shell script, which should be the one zombie process shown. I guess that zombie simply needs to go away to get this fixed...

```
   1641 ?        Ss     0:02 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
1835356 ?        Ss     0:00  \_ sshd: [USR1] [priv]
1835380 ?        S      0:00  |   \_ sshd: [USR1]@pts/1
1835381 pts/1    Ss     0:00  |       \_ -bash
1835418 pts/1    S      0:00  |           \_ sudo -i
1835612 pts/1    S      0:00  |               \_ -bash
1835621 pts/1    S      0:00  |                   \_ sudo -u [USR2] -i
1835622 pts/1    S      0:00  |                       \_ -bash
1840864 pts/1    S+     0:00  |                           \_ sudo borgmatic create --config /[...]/[HOST]_piped.yaml
1840865 pts/1    S+     0:00  |                               \_ /usr/bin/python3 /usr/local/bin/borgmatic create --config /[...]
1840874 pts/1    Z+     0:00  |                                   \_ [sh] <defunct>
1840918 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840920 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840922 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840924 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840926 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840928 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840930 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840932 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840934 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840936 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840938 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840940 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840942 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840944 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840946 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
```

Tested with a reduced sleep and noticed the following: without properly detaching STDIN/STDOUT etc., pretty much the same happens as with my SSH command. The only difference is that I really see individual `sleep` instances running, but BGM waits forever again. When redirecting STDOUT etc. as in the third command, I can see `sleep` instances running, but BGM DOES NOT wait forever on its own hook script anymore! :-)

```bash
#nohup ssh -F ${PATH_LOCAL_SSH_CFG} -f -n ${HOST_OTHER} ${cmd} < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &
#nohup sleep 15 &
nohup sleep 15 < '/dev/null' > '/dev/null' 2> '/dev/null' &
```

So like with your tests, things work in general, and my problem is that some input/output channel is not properly detached from the hook script. But of course, for my use case I need STDOUT redirected to some file/pipe and available that way... :-/ I debugged this further by redirecting STDOUT to `/dev/null` as well, and again BGM DOES NOT wait for its own hook script anymore. Of course, that setup is of not much use... :-) I need to find a way to redirect the output and detach from the hook script at the same time.

Though, some progress at last...

Author
```bash
nohup ssh -F ${PATH_LOCAL_SSH_CFG} -f -n ${HOST_OTHER} ${cmd} < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &
```

The above is wrong if one wants a totally independent background process, because of the redirection of SSH's output into a file. Who performs that redirection in the end? It's the bash process starting the actual command, which in my case is the hook script of the backup software, and that's what makes the hook script a zombie in the end. That zombie tries to take care of the redirection, and that is the reason why all of those bash instances are still visible in my `ps axf` output. When the redirection to the file is changed to `/dev/null`, the hook script really terminates entirely, the backup software continues, and the SSH instances are visible as such in `ps axf`.

Of course, in my concrete use case I need the output redirected to the file; that's what this is all about. Though, I "simply" need it done differently: the shell executing my hook script needs to start a process that takes care of the redirection on its own, while the starting shell is fully detached from that started process. That would be really easy if SSH were able to write to a file on its own, which doesn't seem to be the case; instead, it seems to rely entirely on the shell to take care of redirection. This means I need an additional shell instance: one executing SSH and taking care of the redirection into a file, while at the same time being executed in the background of its parent, with the parent ignoring all input/output.

In fact, I was pretty close to the solution all along, but simply didn't fully understand what I was doing. The examples with `nohup` only made things more difficult, because I need an additional shell instance in the end. The following is what I had before already, with the help of some other SO questions:

```bash
(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}"
) < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &
```

https://gist.github.com/bluekezza/e511f3f4429939a0f9ecb6447099b3dc
https://stackoverflow.com/a/54688673/2055163

The above executes my SSH command as a compound command, allowing SIGHUP etc. to be ignored. Depending on the configuration of the current shell, this might be necessary to keep the executed command running in the background when the parent ends.

The important point to understand is that this actually creates an additional subshell already, so exactly what I need! The problem with the above was that the parent shell waits for the output of the subshell, because the redirection of STDOUT was defined on the parent shell, pretty much like in my `nohup` examples. And this is the easy part now: as we have a subshell with its own STDOUT, that can simply be redirected into a file by itself! This way the parent shell can be fully detached from all channels, the compound command is executed in the background, the backup software continues to run because the hook script doesn't become a zombie, and it is finally able to read from the pipes as the database dumpers write into them through SSH. :-)

```bash
(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}" > "${PATH_LOCAL_MNT2}/${db_name}"
) < '/dev/null' > '/dev/null' 2> '/dev/null' &
```

I've looked at the files the backup software created/restored and they look OK: I can see lots of SQL commands and data and such. What **DOESN'T** work is the following:

```bash
(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}" < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null'
) < '/dev/null' > '/dev/null' 2> '/dev/null' &
```

I thought I'd give that a try simply to be safe, because I don't need STDIN and STDERR anyway. But with the above I don't get any content in the pipes at all, really only EOF, because the backup software creates empty files in the end. Additionally, I can see that the database dumpers are not even executed on their hosts, compared to the working solution, for which I can easily see CPU and I/O load in the system. It might have something to do with how I executed SSH through some function; possibly the extra `< '/dev/null'` overrides the `<<< "${cmd}"` herestring, so the command never reaches SSH at all. I don't care too much anymore.

Author
https://superuser.com/questions/1717487/how-to-properly-make-a-ssh-call-locally-async-background-independent
Owner

Wow, that's pretty tricky! I'm glad to hear you have a solution or at least a work-around.
