healthchecks hook failure blocks backup #439
What I'm trying to do and why
I've included a healthchecks hook to monitor my backups. For some reason, healthchecks.io has not been responding to HTTP requests from my IP since yesterday. Because borgmatic calls the /start endpoint at the beginning of the run, the whole backup fails.
Steps to reproduce (if a bug)
Have a borgmatic backup configured with the healthchecks hook (see the configuration sketch below).
Block access to hc-ping.com via your firewall (DROP).
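Roughly, the setup looks like this in borgmatic's config.yaml, assuming the 1.5.x configuration layout; the repository path and ping URL below are placeholders:

```yaml
location:
    source_directories:
        - /home

    repositories:
        - /mnt/backups/repo.borg

hooks:
    # Healthchecks ping URL; borgmatic pings the /start endpoint before the backup begins.
    healthchecks: https://hc-ping.com/your-uuid-here
```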
Actual behavior (if a bug)
The backup fails during the pre-hook phase.
Other notes / implementation ideas
The backup should still run even if the healthchecks hook fails (i.e., failures from this hook should be ignored). It's better to have a completed backup that monitoring reports as missing than no backup at all, provided the reporting failure is clearly logged.
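A rough sketch of what ignoring a failed ping could look like, using Python's requests library; this is only an illustration of the idea, not borgmatic's actual code, and ping_monitor is a hypothetical helper:

```python
import logging

import requests

logger = logging.getLogger(__name__)


def ping_monitor(ping_url, timeout=10):
    '''
    Ping a monitoring URL, e.g. https://hc-ping.com/<uuid>/start. On any
    connection problem, log a warning and return instead of raising, so
    the backup itself still runs.
    '''
    try:
        response = requests.get(ping_url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        logger.warning(f'Could not reach monitoring URL {ping_url}: {error}')
```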
Environment
borgmatic version: 1.5.1
borgmatic installation method: Debian package
Borg version: 1.1.9
Python version: 3.7.3
operating system and version: Debian 10
This behavior is actually by design! The thinking is that it could be really bad if your backups suddenly stop being monitored when you expect them to be, because then you could go weeks or months with broken backups. So given a sudden lack of monitoring, the only way to communicate that issue to the user is by failing the backup as a whole. That way, they are immediately made aware that their backups are unmonitored and can hopefully rectify the issue.
Let me know your thoughts on this and whether you have any other ideas. And thanks for taking the time to file a ticket!
If the connection to the monitoring system fails, the monitoring system should alert me that it wasn't contacted, so I will be notified of the problem whether the backup ran or not. Or the monitoring system itself is down, in which case I won't be notified whether the backup failed or not. Therefore I don't see how failing the backup leads to a better state.
I don't think the reasoning behind failing the backup is sound, at least if you aren't using multiple monitoring services in parallel. If the single monitoring service is down/unreachable so you can't report a backup success, then you most likely can't report a backup failure through the service either. Therefore causing the backup to fail actually won't make you notice the monitoring failure immediately unless you always manually check that backups have been created (but if you do that anyway, why have a monitoring service in the first place?). The only way you'd notice it outside of manual checks is if the monitoring service notifies you after a while that there haven't been any status updates submitted recently, but that would be completely independent of whether you make the backup jobs fail because of monitoring or not.
So what you are actually causing is silently failing backups. In my opinion, monitoring functions should never influence the actual work of the system being monitored (other than, say, causing a logged warning about the failed status submission). Monitoring failures should be detected by the monitoring system itself (e.g. through a notification if there hasn't been any status report for x amount of time), not by the monitored system. But if you insist on making the jobs fail, at least do it after the new Borg archive has been written, so that you still end up with a functional backup archive even if the job is ultimately reported as failed.
I apologize for the lengthy delay in getting back to this. The feature is implemented now in master—connection errors in monitoring hooks are now treated as warnings instead of errors. This will be part of the next release. Thanks for suggesting this!
Released in borgmatic 1.6.1!
Thanks :)