healthchecks hook failure blocks backup #439

Closed
opened 11 months ago by BastienDurel · 6 comments

What I'm trying to do and why

I've included an healthchecks hook to monitor my backups.
For some reason, healthchecks.io does not respond to HTTP from my IP since yesterday
As borgmatic calls the /start endpoint on start, the whole backup fails

Steps to reproduce (if a bug)

Have a borgmatic backup running with healthchecks hook

hooks:
    healthchecks: https://hc-ping.com/UUID

Block access to hc-ping.com via your firewall (DROP)

Actual behavior (if a bug)

The backup fails at pre hook phase

août 04 00:52:24 corrin systemd[1]: Starting borgmatic backup...
août 04 00:57:25 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Pinging Healthchecks start
août 04 00:59:27 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running pre hook
août 04 00:59:27 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running pre hook
août 04 00:59:27 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104, '
août 04 00:59:27 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104
août 04 00:59:27 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Running 2 commands for on-error hook
août 04 00:59:27 corrin systemd-inhibit[31258]: Wed 04 Aug 2021 12:59:27 AM CEST - Error while creating a backup.
août 04 00:59:27 corrin borgmatic[31259]: ERROR Wed 04 Aug 2021 12:59:27 AM CEST - Error while creating a backup.
août 04 00:59:27 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Pinging Healthchecks fail
août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running on-error hook
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running on-error hook
août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'E
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104,
août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running configuration file
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running configuration file
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL
août 04 01:03:25 corrin systemd-inhibit[31258]: summary:
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL summary:
août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running configuration file
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running configuration file
août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running pre hook
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running pre hook
août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104, '
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104
août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running on-error hook
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running on-error hook
août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'E
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104,
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL
août 04 01:03:25 corrin systemd-inhibit[31258]: Need some help? https://torsion.org/borgmatic/#issues
août 04 01:03:25 corrin borgmatic[31259]: CRITICAL Need some help? https://torsion.org/borgmatic/#issues
août 04 01:03:25 corrin systemd-inhibit[31258]: /usr/bin/borgmatic failed with exit status 1.
août 04 01:03:25 corrin systemd[1]: borgmatic.service: Main process exited, code=exited, status=1/FAILURE
août 04 01:03:25 corrin systemd[1]: borgmatic.service: Failed with result 'exit-code'.

Other notes / implementation ideas

Backup shoud be done even if healthchecks hook fails. (ignore failure on this hook); it's better to have a backup done reported missing by monitoring than no backup at all (with clear log about reporting failure)

Environment

borgmatic version:

1.5.1

borgmatic installation method:

Debian package

Borg version: [version here]

1.1.9

Python version: [version here]

3.7.3

operating system and version: [OS here]

Debian 10

#### What I'm trying to do and why I've included an `healthchecks` hook to monitor my backups. For some reason, healthchecks.io does not respond to HTTP from my IP since yesterday As borgmatic calls the /start endpoint on start, the whole backup fails #### Steps to reproduce (if a bug) Have a borgmatic backup running with healthchecks hook ``` hooks: healthchecks: https://hc-ping.com/UUID ``` Block access to hc-ping.com via your firewall (DROP) #### Actual behavior (if a bug) The backup fails at pre hook phase ``` août 04 00:52:24 corrin systemd[1]: Starting borgmatic backup... août 04 00:57:25 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Pinging Healthchecks start août 04 00:59:27 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running pre hook août 04 00:59:27 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running pre hook août 04 00:59:27 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104, ' août 04 00:59:27 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104 août 04 00:59:27 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Running 2 commands for on-error hook août 04 00:59:27 corrin systemd-inhibit[31258]: Wed 04 Aug 2021 12:59:27 AM CEST - Error while creating a backup. août 04 00:59:27 corrin borgmatic[31259]: ERROR Wed 04 Aug 2021 12:59:27 AM CEST - Error while creating a backup. août 04 00:59:27 corrin borgmatic[31259]: INFO /etc/borgmatic/config.yaml: Pinging Healthchecks fail août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running on-error hook août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running on-error hook août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'E août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running configuration file août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running configuration file août 04 01:03:25 corrin borgmatic[31259]: CRITICAL août 04 01:03:25 corrin systemd-inhibit[31258]: summary: août 04 01:03:25 corrin borgmatic[31259]: CRITICAL summary: août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running configuration file août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running configuration file août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running pre hook août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running pre hook août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104, ' août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/start (Caused by SSLError(SSLError("bad handshake: SysCallError(104 août 04 01:03:25 corrin systemd-inhibit[31258]: /etc/borgmatic/config.yaml: Error running on-error hook août 04 01:03:25 corrin borgmatic[31259]: CRITICAL /etc/borgmatic/config.yaml: Error running on-error hook août 04 01:03:25 corrin systemd-inhibit[31258]: HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'E août 04 01:03:25 corrin borgmatic[31259]: CRITICAL HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /UUID/fail (Caused by SSLError(SSLError("bad handshake: SysCallError(104, août 04 01:03:25 corrin borgmatic[31259]: CRITICAL août 04 01:03:25 corrin systemd-inhibit[31258]: Need some help? https://torsion.org/borgmatic/#issues août 04 01:03:25 corrin borgmatic[31259]: CRITICAL Need some help? https://torsion.org/borgmatic/#issues août 04 01:03:25 corrin systemd-inhibit[31258]: /usr/bin/borgmatic failed with exit status 1. août 04 01:03:25 corrin systemd[1]: borgmatic.service: Main process exited, code=exited, status=1/FAILURE août 04 01:03:25 corrin systemd[1]: borgmatic.service: Failed with result 'exit-code'. ``` #### Other notes / implementation ideas Backup shoud be done even if healthchecks hook fails. (ignore failure on this hook); it's better to have a backup done reported missing by monitoring than no backup at all (with clear log about reporting failure) #### Environment **borgmatic version:** 1.5.1 **borgmatic installation method:** Debian package **Borg version:** [version here] 1.1.9 **Python version:** [version here] 3.7.3 **operating system and version:** [OS here] Debian 10
Owner

This behavior is actually by design! The thinking is that it could be really bad if your backups suddenly stop being monitored when you expect them to be, because then you could go weeks or months with broken backups. So given a sudden lack of monitoring, the only way to communicate that issue to the user is by failing the backup as a whole. That way, they are immediately made aware that their backups are unmonitored and can hopefully rectify the issue.

Let me know your thoughts on this and whether you have any other ideas. And thanks for taking the time to file ticket!

This behavior is actually by design! The thinking is that it could be really bad if your backups suddenly stop being monitored when you expect them to be, because then you could go weeks or months with broken backups. So given a sudden lack of monitoring, the only way to communicate that issue to the user is by failing the backup as a whole. That way, they are immediately made aware that their backups are unmonitored and can hopefully rectify the issue. Let me know your thoughts on this and whether you have any other ideas. And thanks for taking the time to file ticket!
Poster

If the connection to the minitoring system fails, the monitoring should alert me it wasn't contacted. So I will be notified of the problem, backup done or not.
Or the monitoring system itself is down, and I won't be notified, backup failed or not.

Therefore I did not see how failling the backup should lead to a better state.

If the connection to the minitoring system fails, the monitoring should alert me it wasn't contacted. So I will be notified of the problem, backup done or not. Or the monitoring system itself is down, and I won't be notified, backup failed or not. Therefore I did not see how failling the backup should lead to a better state.

I don't think the reasoning behind failing the backup is sound, at least if you aren't using multiple monitoring services in parallel. If the single monitoring service is down/unreachable so you can't report a backup success, then you most likely can't report a backup failure through the service either. Therefore causing the backup to fail actually won't make you notice the monitoring failure immediately unless you always manually check that backups have been created (but if you do that anyway, why have a monitoring service in the first place?). The only way you'd notice it outside of manual checks is if the monitoring service notifies you after a while that there haven't been any status updates submitted recently, but that would be completely independent of whether you make the backup jobs fail because of monitoring or not.

So what you are actually causing is silently failing backups. In my opinion monitoring functions should never influence the actual work of the system being monitored (other than say causing a logged warning about the failed status submission). Monitoring failures should be detected by the monitoring system itself (eg. through a notification if there hasn't been any status report for x amount of time), not by the monitored system. But if you insist on making the jobs fail at least do it after the new borg archive has been written, so that you still end up with a functional backup archive even if the job is ultimately reported as failed.

I don't think the reasoning behind failing the backup is sound, at least if you aren't using multiple monitoring services in parallel. If the single monitoring service is down/unreachable so you can't report a backup success, then you most likely can't report a backup failure through the service either. Therefore causing the backup to fail actually *won't* make you notice the monitoring failure immediately unless you always manually check that backups have been created (but if you do that anyway, why have a monitoring service in the first place?). The only way you'd notice it outside of manual checks is if the monitoring service notifies you after a while that there haven't been any status updates submitted recently, but that would be completely independent of whether you make the backup jobs fail because of monitoring or not. So what you are actually causing is silently failing backups. In my opinion monitoring functions should never influence the actual work of the system being monitored (other than say causing a logged warning about the failed status submission). Monitoring failures should be detected by the monitoring system itself (eg. through a notification if there hasn't been any status report for x amount of time), not by the monitored system. But if you insist on making the jobs fail at least do it *after* the new borg archive has been written, so that you still end up with a functional backup archive even if the job is ultimately reported as failed.
Owner

I apologize for the lengthy delay in getting back to this. The feature is implemented now in master—connection errors in monitoring hooks are now treated as warnings instead of errors. This will be part of the next release. Thanks for suggesting this!

I apologize for the lengthy delay in getting back to this. The feature is implemented now in master—connection errors in monitoring hooks are now treated as warnings instead of errors. This will be part of the next release. Thanks for suggesting this!
witten closed this issue 1 month ago
Owner

Released in borgmatic 1.6.1!

Released in borgmatic 1.6.1!
Poster

Thanks :)

Thanks :)
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.