Soft failure handling for healthchecks #292
Labels
No Label
bug
data loss
design finalized
good first issue
new feature area
question / support
security
waiting for response
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: borgmatic-collective/borgmatic#292
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
I am currently using the "soft failure" option to backup on an external drive that is not always plugged. My backups also integrate healtchecks.io hooks. Currently it seems that healthcheck is started before the before_backup hook. Therefore in case of a soft failure, it never receives the regular ping or fail ping that it would get in case of success or hard failure. The check remains "started" until the grace period comes to an end (in this case it is flagged as failed).
I think it should be preferable to either start healthcheck after the before_bakcup hook, or to send a regular ping in case of a soft failure (as this kind of failure is "expected").
(currently running v1.5.0)
Good find! This is one of those unfortunate interactions between features.. I think you are correct in that the two options are either to start the ping later, or consider a soft failure as a success as far as Healthchecks is concerned.
Given that some folks may want the ping start/end to encompass their
before
/after
hooks as well, my first instinct is to go with option two: Leave the Healthchecks ping start where it is (before thebefore
hooks), and then send a success ping in the case of a soft failure. Looking at the code, it looks like a pretty straight-forward (if slightly duplicative) change.I have been thinking a bit more about it and I think there is unfortunately no absolute best solution.
Basically there are three cases:
Start -> soft fail -> ping "pass": In this case, healthcheck timer will be reset and user may never be warned by healthcheck.io that they need to backup on their removable storage. IMO this defeats the main purpose of using healthcheck for some backups. (Ex: backup everyday to cloud but want to do once a week on removable drive. Timer will be reset everyday and after a week you will skip the removable drive backup)
Start -> soft fail -> ping "fail": In this case the user will get a useless notification (because they now the removable drive is not plugged). Healthcheck status will be down, so user must either reactivate it by pinging manually (then fallback to the situation described in 1.) or let it stay down. In the latter, the user will not receive a notification to remind they to do the backup after the grace period because healthcheck remained down.
Soft fail -> healthcheck do not start: In this situation, healthcheck timer will keep counting down and the user will get a reminder after a while to ensure they do their backup on the removable drive.
Personally I am more leaning towards option 3 because as it is the only one that will accurately remind the user that they need to backup on their removable drive, but I understand your point too. I honestly do not know which one would be the absolute best for everyone. Maybe you could introduce a second parameter for the healthcheck hook (besides the ping url) to let the user chose. But that would require a bit more changes.
for instance
soft_fail_reply = {pass, fail, none}
Switching to option 3 outright may indeed be an okay compromise, especially if borgmatic still does a Healthchecks failure ping whenever a
before
hook hard fails. That way, you'll continue to find out via Healthchecks if yourbefore
hook is broken, even if it's no longer wrapped in start/end pings.The main challenge though is internal.. Currently, the Healthchecks hook starts recording the logs to be sent to Healthchecks at the time the start ping fires. If that start ping is moved to after the
before
hook, then the hook's logs (e.g. in the case of hook failure) won't get sent to Healthchecks.This limitation could potentially be remedied by introducing the concept of hook initialization..
I'd also vote for option 3, having a similar optional harddisk setup as lambtho.
Regarding the "internal challenge", wouldn't it suffice to add a further "after-hooks-but-before-backup" state that is passed on to the healthcheck callback function? Then you could continue to start recoding the logs on
monitor.State.START
, but would only report to healthchecks onmonitor.State.BEFORE_BACKUP, FINISH
andFAIL
. The configurable hooks are already pretty diverse, why not mirror this inmonitor.State
?Something like that could certainly work as well. The main reason I could see to mirror every hook into
monitor.State
would be if there's a need for a monitor to act differently on each of those states, e.g. before pruning, before checking, etc. In terms of satisfying this ticket, even your suggestion of just four different states would do it. Especially since the first state can serve to initialize the logging!Well, I ended up going with the initialization function. I tried the multiple states thing first, but I found myself having to add
if state in states_I_don't_want: return
to a bunch of the non-Healthchecks monitors. So it seemed cleaner to break that out to an initialization function.Given that this is all internal, we can always revisit if it turns out there's more of a need for monitors to find out about per-hook states.
This is in master and will be part of the next release! I'll note here when that happens.
This is released now as part of borgmatic 1.5.6!