Borgmatic takes a long time apparently doing practically nothing #583
Reference: borgmatic-collective/borgmatic#583
What I'm trying to do and why
I am running `borgmatic -v 2 create` to take a backup.
Steps to reproduce (if a bug)
I ran `borgmatic -v 2 create` with this `config.yaml`, except not redacted.
Actual behavior (if a bug)
I have tried this several times. The current process is still running, and it has been about four days since anything was printed. On previous attempts, I killed the process and saw that archives had been created. Below is the output with `--verbosity 2`, redacted.

It seems to me that borgmatic creates an archive and then continues running but does practically nothing.
Expected behavior (if a bug)
I expect this to run faster.
Other notes / implementation ideas
The trouble with backups taking this long is that I can run them only infrequently.
Environment
borgmatic version: 1.7.1
borgmatic installation method: pip
Borg version: 1.2.2
Python version: 3.9.13
Database version (if applicable): 14.5 on both server and client
operating system and version: OpenBSD 7.1
My guess is that this is a hang when attempting to dump the database, as Borg tends to hang if borgmatic hands it a named pipe that's inactive. (borgmatic uses named pipes to stream database dumps to Borg.)

So my first suggestion is to try commenting out the PostgreSQL database hook and see if that "fixes" the apparent hang problem. If so, then we know the problem is somewhere in that database dump/stream process.

I don't know what's in your excludes, but I see that `/` is in your source directories. Unless certain directories are explicitly excluded, that's almost certainly going to pass named pipes and special files to Borg that it can't read from, causing it to hang. (This only happens when database hooks are enabled or you use `read_special` explicitly.)

If that appears to be the case, my recommendation is to take a look at the borgmatic databases documentation, specifically the limitations, and see if you can find what paths in `/` are causing problems. Then, exclude them! Alternatively, just exclude common problem directories like `/dev` and `/run` and go from there.

Let me know what you find!
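For reference, excludes like those might look as follows in a 1.7-era `config.yaml`. This is a hypothetical sketch, not the poster's actual (redacted) configuration:

```yaml
location:
    source_directories:
        - /
    # Keep Borg away from directories full of special files it
    # cannot read from (device nodes, sockets, named pipes).
    exclude_patterns:
        - /dev
        - /run
```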
I ran `find / -type c,b,p,s`, and the results were all among the excludes.

I ran without the postgresql hook, and the process finished. I'm now running without the postgresql hook but with `read_special` true, to confirm that `read_special` is the cause.
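For context, `read_special` lives in the `location:` section of the configuration; a minimal sketch, based on the borgmatic 1.7-era config format:

```yaml
location:
    source_directories:
        - /
    # When enabled, Borg opens and reads device files and named pipes
    # rather than archiving them as metadata -- so a dead FIFO can
    # block the whole backup. borgmatic also enables this implicitly
    # when database hooks are configured.
    read_special: true
```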
In terms of debugging, that sounds like a good course of action. One other thing you could maybe try is adding the `--files` flag to the borgmatic invocation, so that Borg lists out files as it's backing them up. I don't recall whether Borg lists each file before or after it finishes backing it up, but the output might at least give you an idea of approximately where in the filesystem something is going wrong, if not the individual problem file.

Edit: One more idea: you could try running Borg with `strace`, which will spew a ton of output but should tell you exactly what it was doing right before it hung, and possibly what file it was reading from. The easiest way to do that would probably be to make an executable wrapper script (untested) and then set the borgmatic `local_path` option to the path of that script so it gets called by borgmatic instead of Borg directly.

Edit 2: Have you tried just nuking the entire contents of `/backup/root/dump` in case there are some stale named pipes hanging around in there? There shouldn't be, but there might be anyway.

And indeed it was still running.
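The wrapper script described in the first edit above might look something like this. It is a hypothetical, untested sketch; note that `strace` is Linux-specific, so on the poster's OpenBSD system `ktrace` would have to stand in:

```shell
#!/bin/sh
# Hypothetical wrapper (untested): run Borg under strace so the last
# system calls before a hang end up in a log. Assumes strace and borg
# are both on PATH; the log path is an arbitrary choice.
exec strace -f -o /tmp/borg-strace.log borg "$@"
```

With this script saved somewhere executable, pointing borgmatic's `local_path` option at it makes borgmatic invoke the wrapper in place of `borg` itself.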
I did not find `--files`, but I used `--progress` and found the output has stayed like this for a while. I don't know what exactly the file name on that line indicates, but I figured it is either the file that has just been handled or the one about to be handled. The `unwind` file and the adjacent files in alphabetical order are regular files.

So I added my own path-printing line to borg, and I found the troublesome file to be `/backup/root/dump/postgresql_databases/127.0.0.1/all`, which is of course a FIFO that I neglected to delete after killing borgmatic.

On the file order, in case it matters:
This file's real name comes right after `/sbin` in alphabetical order, but I removed some private information from this path and from the borgmatic source directory name before sending the `config.yaml` earlier.

So then I deleted `/backup/root/dump/postgresql_databases/127.0.0.1/all` and ran borg again.

This demonstrates that the `find` parameters in the documentation don't match a FIFO, so I tried `find / -type b -or -type c -or -type p -or -type s` and found some FIFOs. I have found this command to work on NetBSD, OpenBSD, and GNU. It seems to me that `find / -type b,c,p,s` on OpenBSD and NetBSD is equivalent to `find / -type b`. (And perhaps the lack of an error should be considered a bug.)

Nice work debugging this! I'll update the documentation with your modified
`find` command. And it sounds like the underlying issue (borgmatic sometimes leaving named pipes lying around when killed and not cleaning them up on startup) is the same as issue #360, which I should really get around to. So I'll close this ticket in favor of that one. Thanks!

It is too early to say what the underlying issue is, since it will take me reverting all the settings and running a backup to see. And of course the concept of one underlying issue is limiting.
But I am confident they are not the same. In my case I think one underlying issue was the GNU-specific `find` command in the documentation, and another could be the lack of an error message from the borgmatic command. I doubt the problematic FIFO was originally the PostgreSQL dump FIFO, since I expect borgmatic to handle it. When I changed the `find` command, the FIFOs that I found were ii channels, so, when I exclude those, I expect the backup to run differently.
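The difference between the two `find` invocations discussed above can be demonstrated with a throwaway FIFO. This is a small sketch (the scratch directory is made up, not from the thread):

```shell
# Create a scratch directory containing one FIFO (named pipe).
dir=$(mktemp -d)
mkfifo "$dir/pipe"

# Portable form -- matches block/char devices, FIFOs, and sockets
# on GNU, OpenBSD, and NetBSD find alike:
find "$dir" -type b -or -type c -or -type p -or -type s

# GNU-only comma shorthand from the old documentation; on OpenBSD
# and NetBSD this reportedly behaves like plain "-type b" and would
# silently miss the FIFO:
find "$dir" -type b,c,p,s

rm -r "$dir"
```

On a GNU system both commands print the FIFO's path; on the BSDs only the spelled-out `-or` form does, which is why the documentation's command missed the stale dump pipe here.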
Okay, understood! Feel free to post any of your follow-up findings here, and I'd be happy to reopen the ticket if there's still an issue distinct from #360 (and not solved by the correct `find` command you've helpfully provided).