no activity during borgmatic check #920

Closed
opened 2024-10-02 17:49:43 +00:00 by sshaikh · 24 comments

What I'm trying to do and why

I have a regular borgmatic config run twice a day. I have the following in the config:

checks:
    - name: repository
      frequency: 1 week
    - name: archives
      frequency: 1 week
    - name: spot
      frequency: 1 week
      count_tolerance_percentage: 10
      data_sample_percentage: 1
      data_tolerance_percentage: 0.5

Generally this works out fine, but sometimes the checks hang and I'm not sure why (the logs don't seem to help). They hang long enough to run concurrently with the next scheduled backup.

Steps to reproduce

Here's my command:

borgmatic -c config.yaml --syslog-verbosity 1

I don't specify any actions, so it should be the default four (create, prune, compact, and check).

I also back up locally and to a server over SSH.

Actual behavior

A bunch of hung processes (this is after two cycles, which I think is why there are two "sets"):

/root/.local/pipx/venvs/borgmatic/bin/python /root/.local/bin/borgmatic -c config.yaml --syslog-verbosity 1
root     1894709  0.0  0.0   2660   816 ?        S    Oct01   0:00 borg check --glob-archives {hostname}-* --info ssh://borg@server/backup/borg
root     1894710  0.0  1.0  83040 41084 ?        S    Oct01   0:01 borg check --glob-archives {hostname}-* --info ssh://borg@server/backup/borg
root     1894712  0.0  0.3  19736 13384 ?        S    Oct01   0:05 ssh borg@server borg serve --info
root     1990949  0.0  0.9  47140 38564 ?        S    11:00   0:00 /root/.local/pipx/venvs/borgmatic/bin/python /root/.local/bin/borgmatic -c config.yaml --syslog-verbosity 1
root     1993009  0.0  0.0   2660   944 ?        S    11:02   0:00 borg check --glob-archives {hostname}-* --info ssh://borg@server/backup/borg
root     1993010  0.0  1.0  82888 43588 ?        S    11:02   0:01 borg check --glob-archives {hostname}-* --info ssh://borg@server/backup/borg
root     1993012  0.0  0.2  15868  9792 ?        S    11:02   0:00 ssh borg@server borg serve --info

and on the server:

root     1224552  0.0  0.0  17820 10960 ?        Ss   Oct01   0:00 sshd: borg [priv]
borg     1224560  0.0  0.0  18080  6816 ?        S    Oct01   0:00 sshd: borg@notty
root     1683888  0.0  0.0  17820 11036 ?        Ss   11:02   0:00 sshd: borg [priv]
borg     1683895  0.0  0.0  18080  6868 ?        S    11:02   0:00 sshd: borg@notty

Expected behavior

For the checks to finish or exit

Other notes / implementation ideas

No response

borgmatic version

1.8.12

borgmatic installation method

pip install

Borg version

borg 1.4.0

Python version

Python 3.11.2

Database version (if applicable)

No response

Operating system and version

Debian GNU/Linux 12 (bookworm)

Owner

My guess is that it's the "spot" check that's hanging, probably trying to access a source file to hash. The check is probabilistic (meaning it randomly selects the files to read), so that could explain why it only sometimes happens.

To verify this, I'd recommend upping the verbosity level to `2`. That should show which check is being run when it hangs. But it'll also produce more debugging output about each check. Specifically, for the spot check, it'll show you a (truncated) list of the source files it's sampling. That should give you an idea of whether there's anything that maybe should be excluded, e.g. if a special file like `/dev/random` is somehow getting erroneously included.
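
For example, something like this (with your existing config path) will print the per-check debug output on the console:

```
sudo borgmatic -c config.yaml --verbosity 2
```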

Lastly, the spot check can just be really slow depending on how it's configured and the specifics of your data. I see you have a `data_sample_percentage` of `1`. That means 1% of your source files get sampled in each check. But if borgmatic happens to select a really large file, that could take much longer than when it selects a small one.

Feel free to post any verbose logs you get here, and I'd be happy to help decipher them.

Author

If it was a large file, shouldn't we expect some CPU usage?

Presumably the logging would be for /future/ checks? If so, how do I safely abort this run? Just kill the borg processes?

And then how do I force this spot check? Given that it's hung twice in a row now, it seems repeatable?

Owner

> If it was a large file, shouldn't we expect some CPU usage?

Either that or disk I/O.

> Presumably the logging would be for /future/ checks?

Yes, the logs for what checks are to be run occur before the checks themselves. However there is some intermediate debug logging during the spot check that happens while it's running.

> If so, how do I safely abort this run? Just kill the borg processes?

Yes, and borgmatic as well if necessary. That would actually be an interesting data point to help debug this—does the Borg process command-line show anything useful before you kill it?

Also, you may have to run `borgmatic break-lock` afterwards if a Borg lock file is left behind. (You'll know because you'll run Borg/borgmatic and it'll complain about a lock file.)
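
That would look something like this (the repository URL is just your remote repo from the ps output above):

```
borgmatic break-lock -c config.yaml --repository ssh://borg@server/backup/borg
```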

> And then how do I force this spot check?

You can always run `borgmatic check --force` to force checks to run. There's also the `--only spot` flag if you don't want to run other checks.
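
Put together, that's something like:

```
borgmatic check -c config.yaml --force --only spot
```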

> Given that it's hung twice in a row now, it seems repeatable?

Possibly, yeah. Spot is still probabilistic though, so it could vary from run to run. I guess you'll find out!

Author

FWIW the third scheduled backup did "stack up". Killing them cleaned everything up (including the processes on the server).

I then ran the scheduled backup manually (no force, no spot). This exits with an error, which I guess is better than hanging? Although it seems to have passed the check against the first repo (I back up to two).

There are things in the syslog - forgive me, but what's the best way to share these?

$ sudo borgmatic -c config.yaml --syslog-verbosity 2
Sending message 'BORG: Finished backup to bitbucket:
29.547705 seconds
42 / 314440 files changed' to 11054976
Done!
using builtin fallback logging configuration
33 self tests completed in 0.04 seconds
Verified integrity of /backup/borg/index.40556
TAM-verified manifest
security: read previous location '/backup/borg'
security: read manifest timestamp '2024-10-03T15:12:36.020869'
security: determined newest manifest timestamp as 2024-10-03T15:12:36.020869
security: repository checks ok, allowing access
Archive authentication DISABLED.
TAM-verified archive
terminating with success status, rc 0
using builtin fallback logging configuration
33 self tests completed in 0.05 seconds
Verified integrity of /backup/borg/index.40556
TAM-verified manifest
security: read previous location '/backup/borg'
security: read manifest timestamp '2024-10-03T15:12:36.020869'
security: determined newest manifest timestamp as 2024-10-03T15:12:36.020869
security: repository checks ok, allowing access
Archive authentication DISABLED.
TAM-verified archive

terminating with success status, rc 0
cctv: Error running actions for repository
Command 'borg create --debug --show-rc ssh://server/backup/borg::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} config.yaml /root/.borgmatic /jobs /music /server-data /redacted /users' returned non-zero exit status 90.
config.yaml: Variable '{hostname}' is not supported in on-error hook
Sending message 'BORG: Error in repo ssh://server/backup/borg: 'Command '"borg' to 11054976
Done!
config.yaml: An error occurred

summary:
config.yaml: An error occurred
cctv: Error running actions for repository
...
  File "borg/repository.py", line 442, in check_transaction
  File "borg/repository.py", line 936, in replay_segments
  File "borg/repository.py", line 946, in _update_index
  File "borg/repository.py", line 1583, in iter_objects
  File "borg/repository.py", line 1682, in _read
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 175, offset 357046609]
RemoteRepository: 203 B bytes sent, 3.60 kB bytes received, 4 messages sent
Data integrity error: Segment entry checksum mismatch [segment 175, offset 357046609]
Traceback (most recent call last):
  File "borg/archiver.py", line 5391, in main
  File "borg/archiver.py", line 5309, in run
  File "borg/archiver.py", line 178, in wrapper
  File "borg/helpers/manifest.py", line 186, in load
  File "borg/remote.py", line 984, in get
  File "borg/remote.py", line 988, in get_many
  File "borg/remote.py", line 836, in call_many
  File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 175, offset 357046609]
Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0  Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 2247790  CWD: /redacted
sys.argv: ['borg', 'create', '--debug', '--show-rc', 'ssh://server/backup/borg::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}', 'config.yaml', '/root/.borgmatic', '/jobs', '/music', '/server-data', '/redacted', '/users']
SSH_ORIGINAL_COMMAND: None
terminating with error status, rc 90
Command 'borg create --debug --show-rc ssh://server/backup/borg::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} config.yaml /root/.borgmatic /jobs /music /server-data /redacted /users' returned non-zero exit status 90.

Need some help? https://torsion.org/borgmatic/#issues
Owner

> There are things in the syslog - forgive me, but what's the best way to share these?

Pasting like this into a ticket comment or attaching it as a file to a ticket is fine.

Based on those error logs though, it looks like Borg's own checks found some legitimate repository errors. You might try `check --repair`, first making yourself aware of [the warnings here](https://borgbackup.readthedocs.io/en/stable/usage/check.html#about-repair-mode). Also perhaps look into the hardware this is running on and whether anything there is failing (disk, RAM, etc.). Occasional Borg repository/archive check failures happen—and can usually be repaired—but if it's happening frequently, I would suspect the hardware.
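
If you want to run the repair with Borg directly, it'd look something like this (the repository URL is just the remote repo from your logs):

```
borg check --repair ssh://borg@server/backup/borg
```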

Author

I went ahead with the repair and all is running smoothly now. This happens more than I would like, and I suspect I only noticed when the spot checks are performed. Which leaves me with two questions:

  1. Is it expected that Borg/borgmatic hangs on a corrupted backup?
  2. Assuming it is a hardware issue, how to confirm that? The server otherwise works fine and no other errors are seen by other apps. The other Borg checks work fine also so it's unclear when exactly the data is corrupted.
Owner

> Is it expected that Borg/borgmatic hangs on a corrupted backup?

Not generally, no. But if for instance your disk is failing, that could cause hangs related to disk access.

> Assuming it is a hardware issue, how to confirm that? The server otherwise works fine and no other errors are seen by other apps. The other Borg checks work fine also so it's unclear when exactly the data is corrupted.

The two places I would look to start are:

  1. Kernel/system logs. Those will generally show disk errors.
  2. A higher-level utility like [`smartctl`](https://www.smartmontools.org/wiki/BadBlockHowto) that can look for bad blocks.

If that doesn't turn up anything, then maybe your disks are fine—in which case we should look at verbose borgmatic logs when the spot check is failing.
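
Concretely, something along these lines (the device path is just an example; substitute your actual disk):

```
# 1. Scan kernel logs for disk/I/O errors.
journalctl -k | grep -iE 'i/o error|ata[0-9]|sector'

# 2. Check SMART health and kick off a long self-test.
smartctl -H /dev/sda
smartctl -t long /dev/sda
```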

Author

Fwiw we do a smart scan periodically and disks have been healthy for a while. I'll look into the deeper options and will check the logs.

Owner

> Fwiw we do a smart scan periodically and disks have been healthy for a while.

Okay, it's possible that the Borg error has nothing to do with hardware and was just a Borg bug. The same goes for the spot check hang potentially being a borgmatic bug.

> I'll look into the deeper options and will check the logs.

Sounds good.

Author

So, smartctl extended tests came back OK, and I can't see anything obvious in the syslog.

I tried the following:

$ sudo borgmatic check --force -c /config.yaml --verbosity 2 --only spot

And although the second repo completed fine, the first errored out. Unfortunately the console output overflowed with a huge list of file paths (presumably those being spot checked), but I spotted a Python stacktrace - perhaps an "invalid character" error? I think I remember an issue about a strange file path causing problems with borg. Either way, I'm not sure if it's even related to my original issue!

FWIW I've been having a lot of these kinds of check issues since upgrading to the new-database-format version of borg (just as a data point!).

Owner

> So, smartctl extended tests came back OK, and I can't see anything obvious in the syslog.

Okay, I'm glad to hear your disk isn't bad!

> I spotted a Python stacktrace - perhaps an "invalid character" error?

I'm happy to look at the traceback if you've got it, but it's sounding a whole lot like #899, which is fixed in a newer version of borgmatic than you're using. So it might be time to upgrade! But yeah, I'm not sure it would be related to your original hang issue.

> FWIW I've been having a lot of these kinds of check issues since upgrading to the new-database-format version of borg (just as a data point!).

You're speaking of Borg 1.4.0? If the check issues are with the spot check, then it's likely a borgmatic problem rather than anything to do with Borg. But if you're getting more integrity errors, that's almost certainly a Borg (or hardware) issue, and you might consider filing a ticket with the Borg project.

Author

#899 was mine! I've upgraded now.

My suspicion is that something is happening in the daily backups (all of which end in a logged success), which only becomes an issue during the less frequent spot check.

Is there a way to keep every backup run as a timestamped file for future investigation?

Owner

You can try the `--log-file` command-line flag, which does include timestamps in the log. But you'd have to either use a different file name each time or rotate the log files with an external tool (e.g. `logrotate`)—or else each borgmatic run will overwrite the log file.

Another option is to just use the default system logging and run `journalctl` to view logs from the desired time span.
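
For instance (the log directory and time window are just examples):

```
# A unique log file per run, so nothing gets overwritten:
borgmatic -c config.yaml --log-file "/var/log/borgmatic/run-$(date +%Y-%m-%dT%H%M%S).log"

# Or, with the default syslog logging, pull the relevant window out of the journal:
journalctl --since "2024-10-13 10:55" --until "2024-10-13 11:30" | grep borgmatic
```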

Author

Got a log for a failed backup! Strangely it seems to occur during the compact step, right after a successful backup and prune:

config.yaml: Pinging Healthchecks start
localbackup: Creating archive
Creating archive at "/backup/borg::nas-2024-10-13T11:00:02.590570"
config.yaml: Running command for post-backup hook
Sending message 'BORG: Finished backup to localbackup:
29.544958 seconds
264 / 306615 files changed' to telegram
Done!
localbackup: Pruning archives
localbackup: Compacting segments
compaction freed about 3.86 MB repository space.
localbackup: Running consistency checks
Skipping repository check due to configured frequency; 6 days, 12:38:11.994065 until next check (use --force to check anyway)
Skipping archives check due to configured frequency; 6 days, 12:38:11.993885 until next check (use --force to check anyway)
Skipping spot check due to configured frequency; 2 days, 7:03:56.655055 until next check (use --force to check anyway)
remotebackup: Creating archive
Creating archive at "ssh://borg@remotebackupg/backup/borg::nas-2024-10-13T11:01:11.751558"
config.yaml: Running command for post-backup hook
Sending message 'BORG: Finished backup to remotebackup:
29.775529 seconds
262 / 306615 files changed' to telegram
Done!
remotebackup: Pruning archives
remotebackup: Compacting segments
Remote: Repository index missing or corrupted, trying to recover from: File failed integrity check: /backup/borg/index.38546
Data integrity error: Segment entry checksum mismatch [segment 2, offset 10970537]
Traceback (most recent call last):
  File "borg/archiver.py", line 5391, in main
  File "borg/archiver.py", line 5309, in run
  File "borg/archiver.py", line 191, in wrapper
  File "borg/archiver.py", line 1884, in do_compact
  File "borg/remote.py", line 984, in get
  File "borg/remote.py", line 988, in get_many
  File "borg/remote.py", line 836, in call_many
  File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 2, offset 10970537]
Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0  Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 387368  CWD: /root
sys.argv: ['borg', 'compact', '--info', 'ssh://borg@remotebackupg/backup/borg']
SSH_ORIGINAL_COMMAND: None
remotebackup: Error running actions for repository
Command 'borg compact --info ssh://borg@remotebackupg/backup/borg' returned non-zero exit status 90.
config.yaml: Pinging Healthchecks log
config.yaml: Running command for on-error hook
Sending message 'BORG: Error in repo ssh://borg@remotebackupg/backup/borg: 'Command '"borg' to telegram
Done!
config.yaml: Pinging Healthchecks fail
Author

Running it again won't even get past backup:

summary:
config.yaml: An error occurred
remotebackup: Error running actions for repository
Data integrity error: Segment entry checksum mismatch [segment 5, offset 126510872]
Traceback (most recent call last):
  File "borg/archiver.py", line 5391, in main
  File "borg/archiver.py", line 5309, in run
  File "borg/archiver.py", line 178, in wrapper
  File "borg/helpers/manifest.py", line 186, in load
  File "borg/remote.py", line 984, in get
  File "borg/remote.py", line 988, in get_many
  File "borg/remote.py", line 836, in call_many
  File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 5, offset 126510872]
Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0  Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 403771  CWD: /users/omv
sys.argv: ['borg', 'create', 'ssh://borg@remotebackup/backup/borg::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f}', '/config.yaml', '/root/.borgmatic', '/files']
SSH_ORIGINAL_COMMAND: None
Command 'borg create ssh://borg@remotebackup/backup/borg::{hostname}-{now:%Y-%m-%dT%H:%M:%S.%f} /config.yaml /root/.borgmatic /files' returned non-zero exit status 90.

Need some help? https://torsion.org/borgmatic/#issues
Author

Side question (maybe) - when trying to run:

$ sudo borgmatic check -c /config.yaml --repository ssh://borg@remotebackup/backup/borg --verbosity 2

The command fails because borgmatic seems to execute a `borg info` first, which fails because of the segment error. Is that something that `check` always does?

Author

Using `borg` directly for the check:

Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 122, offset 182376752]
Remote: Data integrity error: Segment entry checksum mismatch [segment 131, offset 495071959]
Remote: Data integrity error: Segment entry checksum mismatch [segment 143, offset 191562475]
Remote: Data integrity error: Segment entry checksum mismatch [segment 165, offset 365940210]
Remote: Data integrity error: Segment entry checksum mismatch [segment 166, offset 421447650]
Remote: Data integrity error: Segment entry checksum mismatch [segment 26141, offset 13097722]
Remote: Data integrity error: Segment entry checksum mismatch [segment 32151, offset 118594596]
Remote: Data integrity error: Segment entry checksum mismatch [segment 34757, offset 52439546]
Remote: finished segment check at segment 38546
Remote: Starting repository index check
Remote: Finished full repository check, errors found

These are different from the segments complained about during the other actions. A further check lists different ones still:

Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 22, offset 274334380]
Remote: Data integrity error: Segment entry checksum mismatch [segment 110, offset 16074473]
Remote: Data integrity error: Segment entry checksum mismatch [segment 113, offset 398256964]
Remote: Data integrity error: Segment entry checksum mismatch [segment 147, offset 190731549]
Remote: Data integrity error: Segment entry checksum mismatch [segment 221, offset 356419101]
Remote: Data integrity error: Segment entry checksum mismatch [segment 249, offset 49785478]
Remote: Data integrity error: Segment entry checksum mismatch [segment 251, offset 85867597]
Remote: Data integrity error: Segment entry checksum mismatch [segment 2734, offset 75986588]
Remote: Data integrity error: Segment entry checksum mismatch [segment 3057, offset 190164956]
Remote: Data integrity error: Segment entry checksum mismatch [segment 12239, offset 195621763]
Remote: Data integrity error: Segment entry checksum mismatch [segment 20687, offset 279381070]
Remote: Data integrity error: Segment entry checksum mismatch [segment 22555, offset 86865580]
Remote: finished segment check at segment 38546
Remote: Starting repository index check
Remote: Finished full repository check, errors found.

I won't repair this in case there's something I can dig into while it's in this state - the other repo is backing up okay for now.

Owner

> Running it again won't even get past backup:

Yeah, this is another integrity error within the Borg repository itself. You can try another `check --repair`, but the fact that this keeps happening makes me suspect the hardware again (or maybe even a Borg bug).

> The command fails because borgmatic seems to execute a `borg info` first, which fails because of the segment error. Is that something that `check` always does?

Yes. `borgmatic check` first runs `borg info` under the hood to get the Borg repository ID. This is used to record which checks have been run and when. That way checks can be run at the desired frequency instead of every time borgmatic runs.
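
If you're curious, you can view that ID yourself with something like (using your remote repo URL):

```
borg info --json ssh://borg@remotebackup/backup/borg
```

(The repository ID should be in the "repository" section of the JSON output.)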

> These are different from the segments complained about during the other actions. A further check lists different ones still:
>
> I won't repair this in case there's something I can dig into while it's in this state - the other repo is backing up okay for now.

Yeah, that's not looking good. As per the [Borg docs](https://borgbackup.readthedocs.io/en/stable/faq.html#why-does-running-borg-check-repair-warn-about-data-loss), you might consider making a copy of the repo (maybe even on another machine) and trying the repair there.
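
As a sketch (paths and hostnames are just examples):

```
# Copy the repository somewhere safe first...
rsync -a borg@server:/backup/borg/ /tmp/borg-copy/

# ...then run the repair against the copy.
borg check --repair /tmp/borg-copy
```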

It may also be worth filing a ticket with the Borg project. They'll have much more insight into what to do with a corrupted Borg repository than I do!

Author

Yes, agreed - I've raised it here:

https://github.com/borgbackup/borg/issues/8473

Owner

Okay, I'll close this ticket for now. Hopefully you get some resolution to this problem with that Borg ticket!

Author

Seems like it was a pretty obscure hardware issue - one only caught by borgmatic's scheduled checks. So there's value there!

Owner

Cool! I'm glad to hear you've made some progress on the issue, even if you might be down a hard drive now.

Author

Just to round this off, the issue was down to the RAM I was using, even though it passed memtest. So something I learned there too.

For anyone interested I managed to make the error repeatable - more info in the borg issue at: https://github.com/borgbackup/borg/issues/8473

Owner

I'm glad to hear you've got this resolved now, and thanks for circling back with the root cause. Great detective work!
