Healthchecks / Uptime Kuma pinged even when a soft failure occurs #1110
What I'm trying to do and why
When Healthchecks or Uptime Kuma endpoints are configured, they are pinged as if the backup is successful even if a soft failure occurs.
This is poor behavior because the entire point of the ping should be that "the backup completed successfully", not "the backup was skipped successfully because someone forgot to plug in a drive". In fact, the exact reason to have a target using soft failure but hooked up to monitoring is to make it so not having an external backup drive plugged in isn't a failure, but if people forget to do it over a long period of time (e.g. several weeks) then healthchecks/uptime services will note the lack of successful backups.
This also goes against the documented behavior of soft failures, which says that after a soft failure, the rest of the backup steps will be skipped -- which is true, EXCEPT that the healthchecks and uptime notifications proceed as if the backup finished successfully.
Steps to reproduce
Something like this:
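(The original config block didn't carry over here; the following is an illustrative sketch only. Paths, URLs, and tokens are placeholders, the soft-failure hook follows the exit-75 pattern from the removable-drive how-to, and hook option names vary by borgmatic version — newer versions use a `commands:` list instead of `before_backup`.)

```yaml
# Illustrative only -- not the reporter's actual config.
source_directories:
    - /home

repositories:
    - path: /mnt/backup-drive/repo.borg
      label: external

# Exit code 75 from a hook tells borgmatic to soft-fail and skip the
# rest of this configuration's actions (per the removable-drive how-to).
before_backup:
    - findmnt /mnt/backup-drive > /dev/null || exit 75

healthchecks:
    ping_url: https://hc-ping.com/your-uuid-here

uptime_kuma:
    push_url: https://example.com/api/push/your-token
```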
Actual behavior
Example from logs:
Note that even though there was a soft failure, it still pinged the Healthchecks and Uptime Kuma URLs with the finish state, which is completely incorrect -- the backup did NOT finish, and these pings should have been skipped.
Expected behavior
A soft failure skips all further actions on the repository like is documented, INCLUDING skipping Healthchecks and Uptime Kuma pings indicating that the backup completed successfully.
Other notes / implementation ideas
It would be fine if there was a separate event for Healthchecks or Uptime Kuma like `skip` that you could optionally enable along with `start` or `fail` or `finish`. I wouldn't personally use that, but I could see some people wanting that feature, and it could avoid a bunch of arguments about whether a ping should happen here or not. But in any case, it definitely should not be pinging with a "finish" state if there was a soft failure!
borgmatic version
2.0.4
borgmatic installation method
pip
Borg version
1.4.0
Python version
3.11.2
Database version (if applicable)
No response
Operating system and version
Debian GNU/Linux 12 (bookworm)
Thanks for filing this.
I'm not sure I agree about the documented behavior—"skip all subsequent actions for the current repository" doesn't encompass monitoring hooks for the current action—but I understand your broader point about expected behavior.
The challenge comes in what to do in multi-repository situations. For instance, let's say you've got two repositories in a configuration file:
I guess after these, borgmatic should ping the monitoring services with a success state? Or not?
Here's another:
So should borgmatic ping the monitoring services with the fail state? Or not?
And then I guess if all repositories in the configuration file soft fail, borgmatic should not ping the monitoring services afterwards at all?
I'm not calling these out as gotchas, but rather because I literally don't know the right answer here.
Well, I don't want to argue about whether the current behavior matches the documentation, as much as I just want it to work in an intelligent way. =) So let's step back for a moment and look at what the point of monitoring is and what borgmatic should do to support it:
So clearly, what I described initially isn't meeting this goal at all, and because of how borgmatic behaves, there is no good way for me to make it work correctly, at least when having borgmatic be part of the reporting process. Obviously I can (and have) scripted my own correctly-behaving monitoring by just bypassing this feature of borgmatic and doing it myself in bash.
But your example raises a good point: if a soft failure is a failure, what's "soft" about it, and how should it interact with monitoring?
So let me propose how I think it should work. I think there are two scenarios:
In Scenario A, I actually agree with your example completely. If the backup to one repository completes, but the backup to another repository soft fails (disk not inserted, or whatever the user intends), then yes, it SHOULD mark that backup as successful! I think this is probably the original intent of a soft fail.
In Scenario B, this is more like what I set up. Each configuration (from /etc/borgmatic.d/xyz) is essentially independent, with its own sources, repositories, configuration, etc. In this case, I would also agree that if it successfully backed up to a repository but others soft-failed, that should also be a success. However, if there is only one repository, and it soft fails, what I want is this behavior:
I don't think these are incompatible, and I think there is some simple logic that could be implemented to make this behave sensibly in all cases. Here is what I'd propose, pseudo-code style:
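Sketched in Python (the function and result names here are mine, not borgmatic internals):

```python
def classify_batch(repo_results):
    """Classify one configuration file's ("batch") outcome, given the
    per-repository results: "success", "soft_fail", or "hard_fail"."""
    if "hard_fail" in repo_results:
        return "fail"          # any hard failure fails the batch (and the run)
    if "success" in repo_results:
        return "success"       # at least one repo succeeded; soft failures are fine
    return "all_soft_fail"     # nothing backed up, nothing hard-failed either

def overall_exit_code(batch_results):
    # A hard failure in any batch fails the whole borgmatic run (though all
    # batches still get run). What an all-soft-fail batch should contribute
    # here is exactly the open question in this thread.
    return 1 if "fail" in batch_results else 0
```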
Captured in words, the basic idea is:
Run each configuration file as a separate batch (this is basically done already!). If backing up to any repository defined in that batch fails, well, that's a failure both of that batch and of the whole run of borgmatic. If there were no such failures, and backing up to at least one repository in the batch succeeds, with the rest being soft failures, then the batch succeeds, monitoring hooks are triggered, and overall the run of borgmatic moves on.
Essentially the possible scenarios for each configuration file handled comes down to:
Then the overall status (borgmatic exit code, or whatever) is just: did ALL configuration files complete successfully?
Anyway, I'd encourage you to consider something like this. For now, what I've had to do personally is to not use any of the monitoring facilities in borgmatic and instead just do it separately in a wrapper script. This is fine, but unfortunate, because the monitoring hooks in borgmatic would otherwise be really useful for me. =)
Is this related to the default behaviour of Borg, which outputs warnings instead of errors for a lot of things that IMHO should be errors (such as nonexistent source folders), and Borgmatic (unless specified otherwise) will not escalate it into an error? ( #1092 )
If so, then this issue is not restricted to the Uptime Kuma plugin, but also affects the other Borgmatic plugins that report pass/fail to external services (e.g. 'Healthchecks').
wjl: In case of your 'Scenario A' (multiple target repositories to backup to, where Borgmatic is unable to backup to all the target repositories but at least the backup to one repo was successful), I would still not mark that backup task as successful!
I might have multiple target repositories (a local repo for quick access and a remote repo for secure offsite storage). I would not be happy when (after a fire that destroys the source and the local repo) I find out that no backups have been made to the remote repo because warnings were hidden.
I know that my use case might not resemble yours, but I would suggest to use the strict "return an error if anything goes wrong" by default, and allow users to add configuration options to ignore specific warnings, if needed, on a case-by-case basis.
@wjl I appreciate the detailed explanation. I'm going to repeat this idea back to you in my own words just to make sure I understand what you're suggesting.
For each configuration file:
Is that all correct / does it address your original need?
@sgiebels And does it address your stated concern with the proposal? (Modulo the fact that borgmatic still treats Borg warnings as warnings unless configured otherwise.)
Anyway, I can think of some difficulties/questions in implementation:
@sgiebels In regards to this question:
It's related insofar as these Borg warnings won't cause borgmatic to treat these repositories as failed.
Yeah, I would expect any solution for this ticket to be in the general borgmatic code and not in any specific monitoring hook.
@witten
I had to read the If..Otherwise..Otherwise a few times to understand what you mean, I assume in pseudocode this would be:
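In Python form (a sketch; the `report_*` names and the ping bookkeeping are illustrative, not actual borgmatic functions), the three cases would be:

```python
SUCCESS_EXIT, ERROR_EXIT = 0, 1
pings = []  # stand-in for actual monitoring pings

def report_fail():
    pings.append("fail")

def report_success_without_softfail():
    # "without_softfail": report plain success even if some repos soft-failed.
    pings.append("success")

def finish_batch(repo_results):
    """One configuration file's wrap-up: what to ping, what exit status to return."""
    if "hard_fail" in repo_results:
        report_fail()
        return ERROR_EXIT
    if "success" in repo_results:
        report_success_without_softfail()
        return SUCCESS_EXIT
    # All repositories soft-failed: ping nothing at all, so the monitoring
    # service only alerts if this keeps happening past its grace period.
    return SUCCESS_EXIT
```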
Does the "..without_softfail"-part match your idea as well?
My original idea was "if (no hard_fail and no soft_fail) then report_success and return_success_exit" (and have users explicitly add options to the config to allow specific soft fails)
But I'm quite happy with this idea, though the 3rd case would require some explanation in the documentation and in the borgmatic output (a success exit code but nothing reported to monitoring services might confuse people).
That pseudocode does look like it corresponds to what I was proposing, so nice job on decoding that. 😄
Yes. I should've clarified that part!
Agreed! And I'm glad to hear this approach works for you even if it's not quite as configurable as what you were envisioning. It's possible we can iterate on the approach though as people start to use it.
Should we wait on feedback from @wjl to see if he agrees with this solution? I believe his idea:
does not explicitly specify what should happen when a repo was backed up to successfully -with- soft errors. Also I believe he has a stricter idea of what the borgmatic exit code should represent.
Yeah, that's one reason I was hoping for @wjl's feedback on the proposal before proceeding with it!
Hey guys, I think we are mostly saying the same thing, perhaps with some small variations. Let me try to summarize what I think we're all talking about, and perhaps where some disagreement may be coming from. Reading what everyone wrote plus my original issue, I think there are a couple of aspects that are interwoven a little bit.
Overall Goal
I think the most important points that I believe we can all agree on are these two scenarios, ignoring soft failures for the moment:
No arguments here, right?
Treatment of multiple configuration files
So, I thought this was already cut and dried, but just in case, I'll repeat how I think borgmatic works or should work in the face of multiple configuration files.
When I have multiple conf.d files, right now I can configure each one with completely different sources, target repositories, and monitoring endpoints. So I expect when I run borgmatic and it processes each of these it treats each one independently, so it's completely possible to have everything configured in /etc/borgmatic.d/backup_A.conf fail while everything in /etc/borgmatic.d/backup_B.conf passes, or vice versa.
So again, ignoring soft failures for the moment, I think of these as "batches" or "runs" (or whatever terminology you like) and the call to borgmatic is basically saying "run all these batches".
So from this point of view, I expect there to be two different results:
How soft-failures work
Now, when we add soft failures in, I think this is where perhaps we are looking at things a little differently, and probably where the confusion or disagreement in behavior lies. Perhaps it really comes down to the definition of what a "soft failure" is and our expectations for it.
So what I would expect is that treating each batch (separate conf file) independently, the logic is:
The reason this makes sense to me is that I look at a soft failure as something whose existence should not be the sole reason for a backup failure. Like, I'd like to back up to these two repos on different hot-swappable external drives. If one of them isn't plugged in, I don't care, as long as it backs up to at least one of them.
I think the disagreement I see in other comments is the idea that if any soft failures occur, it doesn't count as a backup success. I guess I can see this point of view, but in my mind that isn't very useful, and I don't really see the use case for it.
A big reason I say this is that for most monitoring systems, there is not much difference between sending a failure and not sending anything, and very few monitoring systems have anything like a tristate, it's just "passed" (pinged some URL) or failed (either pinged a URL with "failed" as the status OR didn't ping a URL at all within a period of time).
But I could be wrong, maybe this really is the best behavior and there are monitoring systems I'm not familiar with that work better with this kind of tristate action.
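For reference, Healthchecks.io (one common such service) distinguishes ping states purely by URL suffix: the bare check URL signals success, appending `/fail` or `/start` signals failure or a job start, and silence is the third "state" because an unpinged check eventually goes down. A tiny sketch of that convention (the UUID is a placeholder):

```python
def healthchecks_ping_url(check_url, state):
    """Build the Healthchecks.io ping URL for a given state.

    A plain GET/POST of the check URL signals success; appending
    "/fail" or "/start" signals failure or a job start. Not pinging
    at all lets the check go "down" once its grace period expires.
    """
    return check_url if state == "success" else f"{check_url}/{state}"
```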
So I think this might be the crux of what we're arguing about now. I think we could probably make either of these ways work as long as it's well documented, and everything else works as described above (separate batches; the top-level borgmatic always runs to completion even if one of the batches/conf files fails, continuing to run the other ones; etc.).
The original problem
Going back to my original report, the problem I was having was that the monitoring endpoints were being triggered with success when there were only soft failures and no successes at all. The behavior seemed wrong to me and very surprising based on what was documented.
But I think the important thing is not that things are implemented exactly how I'm suggesting (although I do think it's a reasonable way to work), but just that I need to be able to do things like this:
It might even be possible to do all of these without soft failures at all, maybe I'm missing the point of them! 😅
Summary
I think the solutions we're talking about are different variations and could all be made to work. It might just be a matter of clearly defining what a "soft failure" is and what it's meant to be used for, and then making sure that the implementation supports those use cases (even if it doesn't match my idea).
The other ideas I put forward about the overall borgmatic exit code, running in "batches" (different /etc/borgmatic.d conf files) and not quitting even if one fails, etc, are really all just trying to propose implementation ideas, but there are of course multiple ways to attack this, and I want to go with, not against, borgmatic's core philosophies.
Anyway, hope this helps, sorry it took me a while to come back around to this and respond. =)
Thanks for the detailed explanation!
Right.
Yup.
I personally wasn't saying this.
Interesting! I'm honestly fine with that. It would be a bit of a breaking change, but hopefully not a huge one for most users.
It looks like borgmatic already does this—if one configuration file fails, other configuration files still get run. If you're finding that this isn't happening for you, please file a separate ticket and I'd be happy to have a look!
One problem I can think of with this plan is that there might be legitimate cases where a user really wants "all soft failures" not to get elevated to a hard failure. For instance, if they want to silently skip backups (soft failure) when their laptop is low on battery, but they don't want to get an error, because they're pretty confident they'll be back on power in the next day or so.
So that makes me think that this "all soft failures equals a hard failure" feature should probably be tucked behind an option that can toggle it on or off. E.g.,
`all_soft_failures_policy` (with possible values of `fail` or `finish`) or whatever.

Alternatively, there could be a new exit code that means "soft failure, but error if it's all soft failures". However, that seems potentially more complicated. (What do you do if the various repository exit codes conflict?)
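In a configuration file, the first idea might look like this (hypothetical option, not an existing borgmatic setting; values taken from the proposal above):

```yaml
# Hypothetical option, not an existing borgmatic setting.
# fail:   if every repository in this config file soft-fails,
#         treat the run as failed and ping monitoring with "fail".
# finish: keep today's behavior and ping "finish" (success).
all_soft_failures_policy: fail
```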
Any thoughts on the ideas in the previous comment? Do either of those approaches sound like they would work for you?
I feel a bit overwhelmed with the complexity required for different usage scenarios, for example using borgmatic with multiple repositories, or using one borgmatic call with multiple configuration files as parameters (I believe this is how wjl uses Borgmatic when he talks about 'batches'). English not being my primary language doesn't help either, that's why I represented my argument with some pseudo-code.
I fully support wjl's statement:
as it would fix the issue that I have encountered with Borgmatic, where a task that returns with a soft failure (e.g. due to inaccessible source files) is reported as a 'success' to the monitoring services, which made me incorrectly believe that my files were 100% safely backed up.
As I might have mentioned before, I would prefer soft failures to report as error -by default-, and allow the user to configure options to ignore specific (or all?) soft failures.
I find it hard to deduce the function of a setting named `all_soft_failures_policy`. More descriptive would be a setting with a name like `ignore_soft_failure_200` or `ignore_all_soft_failures` (I would prefer a default value of 'false' for all of these).
For the scenario when using multiple repositories (which I personally do not use, so take my suggestion here with a grain of salt), I can imagine a setting like `all_repositories_need_success` (default: true) or `any_repository_success_is_good_enough` (default: false).
For the use case:
I was unaware that 'battery level monitoring' was a feature of Borgmatic, will have to look into that.
I would think, in the case of not running a backup due to low battery level (or failing any precondition check), not reporting anything to the monitoring service would be the best thing to do. The monitoring service will alert the user if that backup task fails to run for a long enough time (which is configurable).
Yeah, this is a complex feature within a complex set of features. I appreciate you continuing with this discussion, especially with the language barrier!
My understanding of @wjl's statement is that they were disagreeing with that approach, commenting right after: "I guess I can see this point of view, but in my mind that isn't very useful, and I don't really see the use case for it."
I think it's good here to clarify what everyone's talking about. Are you suggesting that any soft failures should be considered a monitoring error (an error sent to the monitoring service) ... or a borgmatic result error (an error from borgmatic on the command line) ... or both?
Because if both, that sounds exactly like a hard error to me!
On this one particular point though:
Inaccessible source files don't currently cause a soft failure; they just cause a warning. So if that's an issue you're grappling with, see #1092.
We can get back to the configuration option design once we're on the same page on the requirements. 😄
borgmatic itself doesn't do battery level monitoring, but it is a supposed use case for the existing soft failure feature. You can see an example of that towards the end of this page: https://torsion.org/borgmatic/how-to/backup-to-a-removable-drive-or-an-intermittent-server/
I would prefer both, by default.
If I'd be running Borgmatic from the console, and the backup had any issues (hard or soft errors, such as borg warnings), by default I would expect it to return a command line error (non-zero exit code), as the expected result ('making a backup') did not succeed.
Running it from the console is also how I would test any changes I've made to configuration files, and that is also the last time I really look at any console output from the borgmatic executable. Because if the command from the console does not return an error (non-zero exit code), I will put it in a cron job or systemd timer, and any console output will be hidden from me in the future. I would then rely on a monitoring service to alert me on -any- issue (including borg warnings).
To complicate things further: Once I'm 100% sure that -any- issue will be reported to me via the monitoring service, I might want to 'mute' the Borgmatic 'exit code' (to avoid cron error messages)..
If I read your comments in our discussion on #1092 correctly, you don't see the need to upgrade 'warnings' from Borg to 'errors' by default in Borgmatic the way as I do, but you do feel that the Borg point of view "it's up to you to review your Borg logs" is unpractical when Borg is used in combination with Borgmatic, as I do.
The monitoring service that I use (HealthChecks.io) does not parse the Borgmatic logs / console output, and only uses the exit code to determine success/fail of a backup job; I prefer not to use some extra tool to parse the output (using 'grep' for a specific error message string gets ugly when the logs also contain filenames that can have arbitrary names..), so I hope I can rely on exit codes to determine whether my backup ran perfectly or something needs my attention.
Okay, a few things to unpack here:
That was certainly my initial view, but if you look at the comments later on in #1092, I was/am willing to change the default behavior for that one Borg warning type (source directories not found)—but only when I can do such a breaking change in a more major release. To be clear, this wouldn't elevate all Borg warnings to errors by default.
Understood. I agree that relying on exit codes makes sense.
It looks as if I have been confusing 'soft failure' with 'warnings', my apologies.
wjl wrote:
So true..
@sgiebels Hah no worries. I'm glad we got to the bottom of that point. Shall we move the discussion on Borg warnings to #1092 then? It's possible that that ticket is at least a partial solution to your ask.
@wjl On this ticket, I still want to address your particular issue with soft failures that kicked off all of this discussion. So does the proposed solution in my comment above work for you? If not, could you say why not? I'm not at all wedded to it.
This is all good discussion, and I think some of the complexities and misunderstandings or differences of opinion throughout this conversation highlight that there are definitely multiple use-cases that different people have. Without really wanting to push super hard to a certain solution, my feeling is we could summarize all of this so far as these points:
I think perhaps point 8 above is probably the most complex one, but I can't see how any of these points would be controversial at a high level, although we might argue about how best to implement them.
TLDR: When I run "borgmatic" I expect it to tell me with hooks/exit-codes/etc "did my data get backed up correctly as expected".
So again, we can argue about how borgmatic makes this happen (hard failures, soft failures, exit-codes, when it pokes various healthchecks / uptime kuma hooks, etc) and I shared my opinion earlier about one way to do it, but really I'll be 100% happy as long as there is some well-supported and clearly documented way to accomplish these things, even if it means I need to change my configuration or assumptions. =)
Thanks again for the thoughtful response. Here are my own thoughts:
I agree!
And yeah, most of these items are already supported by borgmatic, and specifically by soft failures vs. hard failures. Although to your point there could certainly be other solutions for satisfying those requirements.
The main downside I see of soft failures is that you have to do a tiny bit of scripting... Just enough to determine whether the situation merits a soft failure or a hard failure, so borgmatic can interpret your exit code accordingly.
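As a sketch of that scripting (paths illustrative; exit code 75 is borgmatic's documented soft-failure signal, and any other nonzero code is a hard failure; hook option names vary by borgmatic version):

```yaml
before_backup:
    # Drive not plugged in: tolerable, skip quietly (exit 75 = soft failure).
    # Drive mounted but repo directory missing: something is wrong, hard-fail.
    - |
      if ! findmnt /mnt/backup-drive > /dev/null; then
          exit 75
      elif [ ! -d /mnt/backup-drive/repo.borg ]; then
          echo "drive mounted but repository missing" >&2
          exit 1
      fi
```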
Note that I'm intentionally not including the #1092 type of failures in this discussion, as I think those can be addressed separately in that ticket (and maybe even elevated to hard failures by default).
For item number 8, I do think that either the previously proposed `all_soft_failures_policy` or the `ignore_all_soft_failures` option could satisfy the requirement. Personally I would lean towards something like the former, just because I think it is a complex concept and it's going to be hard to encode into a single variable name anyway. For instance, does `ignore_all_soft_failures` mean that any and all soft failures should be ignored—or simply that we shouldn't error if all we receive is 100% soft failures and no hard failures or successes? I'm pretty sure you meant the latter, but you can see how a user might be confused on this point given the name.

Anyway, I think this would also line up with your previous suggestion of "batches", with each configuration file being one "batch." That's because the proposed configuration option, whatever we call it, would presumably exist and apply at the level of one configuration file.
Sorry for jumping late into this very productive conversation, but I'd like to contribute two small pieces based on the summary alone:
"totally expected" should not be read as "healthy". It should mean "skipped" - a distinct execution outcome. IIUC we don't currently have a way to signal that internally.
If a backup does not run because a target repository is intermittently unavailable, that outcome may be anticipated but it is still not OK. Until the data is successfully written to at least one repository, the system is in a degraded state that we need visibility into.
Treating skipped runs as “finished/OK” creates a blind spot:
This is not hypothetical, and is what got me here: I want to be alerted if more than one week has passed since data was successfully backed up to an intermittently available host.
That is currently unachievable because skipped executions are indistinguishable from successes. The monitoring system cannot distinguish “data is safe” from “nothing happened again.”
A collection of soft-failures still represents a retryable unit, not necessarily an alert-worthy failure.
Treating 6) and 8) differently introduces a sharp edge when going from 1 to multiple repos.
Assume we have a backup job to a semi-available host A:
Now, to increase my odds of getting one successful backup every N periods, let's add another semi-available host B.
If we focus on the scenario where host B is unavailable:
Host A now alerts in a scenario where it previously didn't alert, purely because Host B exists. Adding redundancy increased operational noise without increasing risk.
An implicit "at least one successful target” semantic may be useful, but not necessarily for every use-case. That points to either that being an explicit knob (e.g. a new property of a backup task) or externalized (e.g. using the existing configuration-wide hooks to explicitly ping all transient hosts)
I just came across this issue when trying to set up soft failures, which I now understand to be pointless (for my case) with the current code. My use case is backing up a laptop to a NAS in my house, which is not always available. I intended to schedule borgmatic to run every hour or so, and then have a hook that soft-fails if 1) my laptop is not on AC, 2) the NAS is not available, or 3) a backup was already made today (since I do not really want an hourly backup, just a daily one, but with daily scheduling I would have to satisfy 1) and 2) when I first power up my laptop and not get any second chance until the next day). I have healthchecks set up for monitoring, so I would like hard failures to be reported as failures to healthchecks, soft failures to not be reported at all (I do not want an hourly healthcheck failure alert...), and only actual successful backups to result in a healthchecks ping. I only have a single repo currently.
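For illustration, those three conditions could live in a single before hook along these lines (every path, hostname, and the stamp-file mechanism here is an assumption, and hook option names vary by borgmatic version):

```yaml
before_backup:
    - |
      # Soft-fail (exit 75) unless: on AC power, NAS reachable,
      # and no backup recorded yet today.
      on_ac() { grep -q 1 /sys/class/power_supply/AC/online; }
      nas_up() { ping -c 1 -W 2 nas.example.lan > /dev/null 2>&1; }
      done_today() { [ "$(cat /var/lib/backup-stamp 2>/dev/null)" = "$(date +%F)" ]; }
      if ! on_ac || ! nas_up || done_today; then exit 75; fi

after_backup:
    - date +%F > /var/lib/backup-stamp
```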
Reading the thread, I think #1110 (comment) has a good summary of intended behavior.
The only thing I disagree with, is:
As indicated in the post, this is because I use a tri-state monitoring system, healthchecks. This system does have a distinction between "failure" and "no ping at all", since a failure triggers an immediate notification, while "no ping at all" only triggers a notification if no ping is sent for a configured time.
Another point that I think has not been made very explicitly is that the handling of all soft failures (aka no successful repos) can depend on whether you have one or multiple repositories configured.
For just one repository, handling "all soft failures" as a hard failure makes no sense: then you could have just let your hook return a hard failure (or just let the backup fail for intermittent repos). The reason you set up a soft-fail check is because you do not want a hard fail. Handling this as success also makes no sense: no backup has happened, so a monitoring system should not be satisfied (this is the original reported issue). So that leaves only one option: do not hard-fail and do not succeed, but let the batch soft-fail (what that means is TBD, see below). In other words: the batch result is identical to the single repo result.
For multiple repositories, this is less clear. It seems you can either want:
In both cases, there seem to be two possible handlings of "all repos soft-fail". I can either want:
A) failure ("I expect to always have at least one repo available")
B) softfail ("Ok if sometimes no repos are available, my monitoring will check this does not persist too long").
Resulting in four possible handlings (1A, 1B, 2A, 2B).
Of these, 1B and 2B are convenient, since they automatically produce the desired (as described by me above) behavior for single-repo configs (succeed if the repo succeeds, softfail if it softfails). So if option 1A or 2A is adopted, then the single-repo case should be special-cased, which might not be desirable. Alternatively, if all four options are made configurable, then if 1A or 2A is the default, then a single repo config (that uses soft-fail) needs non-default config to work as expected, which I also think is not desirable (OTOH this might be acceptable since it only applies to configs using soft-fail hooks).
Of these, option 1 (especially 1B) has the risk of a broken repository going completely unnoticed.
I do wonder if option 2B really needs to be supported, since this could also be handled by splitting a multi-repo config into multiple single-repo configs, each with its own monitoring endpoint. Then the monitoring is only satisfied when both repos succeed; otherwise monitoring pings will happen. But for consistency, it might be worth supporting all four options anyway.
As suggested above, this handling could be made configurable. My above analysis suggests that there might be two config directives, one to select 1/2 and one to select A/B? For example:

- `some_repositories_softfail`: softfail / success
- `all_repositories_softfail`: fail / softfail

Or maybe support all three values (fail / softfail / success) for both directives - see below.
Meaning of batch softfail
So then the question is: What does it mean for a batch/config to softfail?
For monitoring, this depends on the monitoring solution, but I would typically think it means neither success nor failure - it is as if the backup was not actually started. This can be problematic for solutions with a "start" ping (e.g. healthchecks) that always expect a subsequent success or failure, but I guess that just leaving them hanging (marked as "running") is acceptable? For healthchecks, this could also result in a "log" ping (which produces neither failure nor success, but does explicitly log the soft-fail result for later lookup).
For the process exit status, the result of multiple batches/configs should be summarized into a single exit status. This is essentially the same question as summarizing the result of multiple repos into one batch/config status, and I can imagine the solution is similar and could also have two config directives:

- `some_configs_softfail`: fail / softfail / success
- `all_configs_softfail`: fail / softfail / success

I would suggest that `softfail` here would mean letting borgmatic itself return exit status 75.

That does raise the question about where to configure this. Could be in any of the config files (and maybe require all configs to either omit these directives or set them to the same values?).
I included all three options (fail/softfail/success) for both configs here, since the reasons for omitting them for multiple repos do not apply here:
Writing this, I would actually consider letting the `some_repos_softfail` and `all_repos_softfail` config directives also just support all three values, since that does not add extra implementation complexity (I expect) and does give a bit more flexibility. I can imagine, for example, that you could write a before-check hook that decides that checks should be skipped with a soft-fail, but because the backup succeeded (you cannot set up a softfail check for create in this case), you still want the monitoring to be pinged with success?

Some more thoughts about this issue:
Instead of the configs I suggested in my previous comment, one could also think about a system like PAM uses, where you configure for each repository or config how that result affects the overall outcome. PAM uses verbs like `required` or `optional`, but also supports more detailed configuration where you specify an action for each possible result (e.g. `required` means `[success=ok new_authtok_reqd=ok ignore=ignore default=bad]`).

Here, I can imagine something similar, where for each repo you map the fail/softfail/success outcome to an action like "fail" (set outcome to fail), "ok" (keep outcome unchanged), "succeed" (set outcome to success if not failed), those kinds of things. I haven't thought this through, but I suspect you would also need to configure an initial outcome, and maybe need to do one final mapping of the outcome (e.g. to map (all) soft-fail to either fail or success if you want that).
When discussing the "result" or "outcome" of a repo or batch/config, maybe the term "skipped" is better than "soft-fail"?
I just realized that the issue reported here actually also applies somewhat to checks that are skipped because of their configured frequency, but still ping the monitoring as successful. This is maybe specific to my setup, but I configured the backup and checks in separate config files with separate healthchecks urls, so I can separately verify that backups happen and that checks happen and are successful. However, now the checks config healthchecks url is pinged even when no actual checks happened. So if I would accidentally misconfigure the frequency, or some issue in the timestamp tracking would somehow cause the checks to never run at all, then there is no way the monitoring could detect this. This might make things a lot more complicated, but if this is something that would fit into a general solution, that would be great.
For the original issue reported, I've now implemented a different solution to work around the problem: instead of checking repo availability with a soft-fail hook, I've added `ExecCondition` directives to the borgmatic systemd service, which means that if I am not near my NAS, the borgmatic service will simply not run at all. This is a pragmatic alternative, which works because I have just one repository to back up to, so it might not apply to others, but my immediate problem is solved :-)
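That systemd approach looks roughly like the following drop-in (hostname and timeout are illustrative); `ExecCondition=` skips the service without marking it failed when the check command exits nonzero:

```ini
# /etc/systemd/system/borgmatic.service.d/require-nas.conf
[Service]
# If the NAS doesn't answer, the service is skipped rather than failed.
ExecCondition=/usr/bin/ping -c 1 -W 3 nas.example.lan
```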