Soft failure handling for healthchecks #292

Closed
opened 2020-02-03 21:30:12 +00:00 by lambtho12 · 7 comments

I am currently using the "soft failure" option to backup on an external drive that is not always plugged. My backups also integrate healtchecks.io hooks. Currently it seems that healthcheck is started before the before_backup hook. Therefore in case of a soft failure, it never receives the regular ping or fail ping that it would get in case of success or hard failure. The check remains "started" until the grace period comes to an end (in this case it is flagged as failed).

I think it should be preferable to either start healthcheck after the before_bakcup hook, or to send a regular ping in case of a soft failure (as this kind of failure is "expected").

(currently running v1.5.0)

I am currently using the "soft failure" option to backup on an external drive that is not always plugged. My backups also integrate healtchecks.io hooks. Currently it seems that healthcheck is started before the before_backup hook. Therefore in case of a soft failure, it never receives the regular ping or fail ping that it would get in case of success or hard failure. The check remains "started" until the grace period comes to an end (in this case it is flagged as failed). I think it should be preferable to either start healthcheck after the before_bakcup hook, or to send a regular ping in case of a soft failure (as this kind of failure is "expected"). (currently running v1.5.0)
Owner

Good find! This is one of those unfortunate interactions between features.. I think you are correct in that the two options are either to start the ping later, or consider a soft failure as a success as far as Healthchecks is concerned.

Given that some folks may want the ping start/end to encompass their before/after hooks as well, my first instinct is to go with option two: Leave the Healthchecks ping start where it is (before the before hooks), and then send a success ping in the case of a soft failure. Looking at the code, it looks like a pretty straight-forward (if slightly duplicative) change.

Good find! This is one of those unfortunate interactions between features.. I think you are correct in that the two options are either to start the ping later, or consider a soft failure as a success as far as Healthchecks is concerned. Given that some folks may want the ping start/end to encompass their `before`/`after` hooks as well, my first instinct is to go with option two: Leave the Healthchecks ping start where it is (before the `before` hooks), and then send a success ping in the case of a soft failure. Looking at the code, it looks like a pretty straight-forward (if slightly duplicative) change.
Author

I have been thinking a bit more about it and I think there is unfortunately no absolute best solution.
Basically there are three cases:

  1. Start -> soft fail -> ping "pass": In this case, healthcheck timer will be reset and user may never be warned by healthcheck.io that they need to backup on their removable storage. IMO this defeats the main purpose of using healthcheck for some backups. (Ex: backup everyday to cloud but want to do once a week on removable drive. Timer will be reset everyday and after a week you will skip the removable drive backup)

  2. Start -> soft fail -> ping "fail": In this case the user will get a useless notification (because they now the removable drive is not plugged). Healthcheck status will be down, so user must either reactivate it by pinging manually (then fallback to the situation described in 1.) or let it stay down. In the latter, the user will not receive a notification to remind they to do the backup after the grace period because healthcheck remained down.

  3. Soft fail -> healthcheck do not start: In this situation, healthcheck timer will keep counting down and the user will get a reminder after a while to ensure they do their backup on the removable drive.

Personally I am more leaning towards option 3 because as it is the only one that will accurately remind the user that they need to backup on their removable drive, but I understand your point too. I honestly do not know which one would be the absolute best for everyone. Maybe you could introduce a second parameter for the healthcheck hook (besides the ping url) to let the user chose. But that would require a bit more changes.
for instance soft_fail_reply = {pass, fail, none}

I have been thinking a bit more about it and I think there is unfortunately no absolute best solution. Basically there are three cases: 1. Start -> soft fail -> ping "pass": In this case, healthcheck timer will be reset and user may never be warned by healthcheck.io that they need to backup on their removable storage. IMO this defeats the main purpose of using healthcheck for some backups. (Ex: backup everyday to cloud but want to do once a week on removable drive. Timer will be reset everyday and after a week you will skip the removable drive backup) 2. Start -> soft fail -> ping "fail": In this case the user will get a useless notification (because they now the removable drive is not plugged). Healthcheck status will be down, so user must either reactivate it by pinging manually (then fallback to the situation described in 1.) or let it stay down. In the latter, the user will not receive a notification to remind they to do the backup after the grace period because healthcheck remained down. 3. Soft fail -> healthcheck do not start: In this situation, healthcheck timer will keep counting down and the user will get a reminder after a while to ensure they do their backup on the removable drive. Personally I am more leaning towards option 3 because as it is the only one that will accurately remind the user that they need to backup on their removable drive, but I understand your point too. I honestly do not know which one would be the absolute best for everyone. Maybe you could introduce a second parameter for the healthcheck hook (besides the ping url) to let the user chose. But that would require a bit more changes. for instance `soft_fail_reply = {pass, fail, none}`
Owner

Switching to option 3 outright may indeed be an okay compromise, especially if borgmatic still does a Healthchecks failure ping whenever a before hook hard fails. That way, you'll continue to find out via Healthchecks if your before hook is broken, even if it's no longer wrapped in start/end pings.

The main challenge though is internal.. Currently, the Healthchecks hook starts recording the logs to be sent to Healthchecks at the time the start ping fires. If that start ping is moved to after the before hook, then the hook's logs (e.g. in the case of hook failure) won't get sent to Healthchecks.

This limitation could potentially be remedied by introducing the concept of hook initialization..

Switching to option 3 outright may indeed be an okay compromise, especially if borgmatic still does a Healthchecks failure ping whenever a `before` hook hard fails. That way, you'll continue to find out via Healthchecks if your `before` hook is broken, even if it's no longer wrapped in start/end pings. The main challenge though is internal.. Currently, the Healthchecks hook starts recording the logs to be sent to Healthchecks at the time the start ping fires. If that start ping is moved to after the `before` hook, then the hook's logs (e.g. in the case of hook failure) won't get sent to Healthchecks. This limitation could potentially be remedied by introducing the concept of hook initialization..

I'd also vote for option 3, having a similar optional harddisk setup as lambtho.

Regarding the "internal challenge", wouldn't it suffice to add a further "after-hooks-but-before-backup" state that is passed on to the healthcheck callback function? Then you could continue to start recoding the logs on monitor.State.START, but would only report to healthchecks on monitor.State.BEFORE_BACKUP, FINISH and FAIL. The configurable hooks are already pretty diverse, why not mirror this in monitor.State?

I'd also vote for option 3, having a similar optional harddisk setup as lambtho. Regarding the "internal challenge", wouldn't it suffice to add a further "after-hooks-but-before-backup" state that is passed on to the [healthcheck callback function](https://projects.torsion.org/witten/borgmatic/src/branch/master/borgmatic/hooks/healthchecks.py#L68)? Then you could continue to start recoding the logs on `monitor.State.START`, but would only report to healthchecks on `monitor.State.BEFORE_BACKUP, FINISH` and `FAIL`. The configurable hooks are already [pretty diverse](https://torsion.org/borgmatic/docs/how-to/add-preparation-and-cleanup-steps-to-backups/), why not mirror this in `monitor.State`?
Owner

Something like that could certainly work as well. The main reason I could see to mirror every hook into monitor.State would be if there's a need for a monitor to act differently on each of those states, e.g. before pruning, before checking, etc. In terms of satisfying this ticket, even your suggestion of just four different states would do it. Especially since the first state can serve to initialize the logging!

Something like that could certainly work as well. The main reason I could see to mirror every hook into `monitor.State` would be if there's a need for a monitor to act differently on each of those states, e.g. before pruning, before checking, etc. In terms of satisfying this ticket, even your suggestion of just four different states would do it. Especially since the first state can serve to initialize the logging!
Owner

Well, I ended up going with the initialization function. I tried the multiple states thing first, but I found myself having to add if state in states_I_don't_want: return to a bunch of the non-Healthchecks monitors. So it seemed cleaner to break that out to an initialization function.

Given that this is all internal, we can always revisit if it turns out there's more of a need for monitors to find out about per-hook states.

This is in master and will be part of the next release! I'll note here when that happens.

Well, I ended up going with the initialization function. I tried the multiple states thing first, but I found myself having to add `if state in states_I_don't_want: return` to a bunch of the non-Healthchecks monitors. So it seemed cleaner to break that out to an initialization function. Given that this is all internal, we can always revisit if it turns out there's more of a need for monitors to find out about per-hook states. This is in master and will be part of the next release! I'll note here when that happens.
Owner

This is released now as part of borgmatic 1.5.6!

This is released now as part of borgmatic 1.5.6!
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: borgmatic-collective/borgmatic#292
No description provided.