Retry failing backups #432
Reference: borgmatic-collective/borgmatic#432
This should retry a failed backup n times, where n is specified in config as
storage: retries: n
Hopefully this closes issue #28 and keeps transient problems, such as a failure to resolve a hostname, from failing the whole backup.
When a repo fails, it is added to the back of the queue, along with a count of how many times it has failed so far. This lets the other repos sync in the meantime and gives round-robin arbitration between failing repos.
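To make the queueing behaviour concrete, here is a minimal sketch of the idea, not the code in this PR. The function name run_with_retries, the backup callable, and the returned attempts/failed lists are all hypothetical and exist only for illustration. It assumes a failed backup raises an exception and that "retries" means extra attempts after the first one.

```python
from queue import Queue


def run_with_retries(repositories, backup, retries):
    """Attempt each repo; a failing repo goes to the back of the queue.

    Hypothetical helper for illustration only. `backup` is any callable
    that raises an exception on failure.
    """
    pending = Queue()
    for repo in repositories:
        pending.put((repo, 0))  # (repo, number of failures seen so far)

    attempts = []  # order in which repos were attempted
    failed = []    # repos that used up all of their retries

    while not pending.empty():
        repo, failures = pending.get()
        attempts.append(repo)
        try:
            backup(repo)
        except Exception:
            if failures < retries:
                # Re-queue behind everything still waiting, so other
                # repos get their turn before this one is retried.
                pending.put((repo, failures + 1))
            else:
                failed.append(repo)

    return attempts, failed
```

With three repos a, b, c where b fails twice before succeeding and retries is 2, this sketch attempts them in the order a, b, c, b, b: the healthy repos finish before b is tried again.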
Possible improvements:
10s * retry_num
Sorry for the delay here! This looks like a totally reasonable solution to me. I don't think a new import is a big deal, especially since it's from the standard library. If you really didn't want the relative complexity of a Queue, you could do something similar with nested for loops (an outer for loop over the repos and an inner for loop over the retry count), but that wouldn't have the nice property you have here of adding retries to the back of the line.
In any case, I think this would benefit from test coverage of some sort, just to make sure the retrying and the number of retries are as expected. Do you want to take a stab at testing or would you prefer I do it?
Awesome, thanks for doing the work for the exhaustive test coverage! I'll merge this and then make a couple of minor tweaks like standardizing the log messages. But you can expect this to be part of the next release.
Oh, and just a heads up: I think I'm going to rename retry_timeout to retry_wait, as the value is not a timeout in the traditional sense of the word!
No worries, the renames sound good to me.
I've been running with retries for a while, and they've significantly reduced the number of false fails.
Though I must admit I'm also being affected by #439 e.g.