Retry failing backups #432

Merged
witten merged 8 commits from cadamswaite/borgmatic:master into master 2021-11-15 19:34:25 +00:00
Contributor

This should retry a failed backup n times, where n is specified in config as storage: retries: n
Hopefully this should close issue #28 and prevent transient issues such as failure to resolve hostname.

When a repo fails, it is added to the back of the queue - keeping track of the number of times fail has been seen. This allows the other repos to sync in the meantime and gives round robbin arbitration between failing repos.

Possible improvements:

  • Add backoff, such as 10s * retry_num
    • Without this, if the user has few repos, the retries may finish before the transient issue is resolved.
  • Add tests..
  • Avoid new imports? Not sure if this is important. Queue is probably overkill in this scenario.
This should retry a failed backup n times, where n is specified in config as `storage: retries: n` Hopefully this should close issue #28 and prevent transient issues such as failure to resolve hostname. When a repo fails, it is added to the back of the queue - keeping track of the number of times fail has been seen. This allows the other repos to sync in the meantime and gives round robbin arbitration between failing repos. Possible improvements: - Add backoff, such as `10s * retry_num` - Without this, if the user has few repos, the retries may finish before the transient issue is resolved. - Add tests.. - Avoid new imports? Not sure if this is important. Queue is probably overkill in this scenario.
cadamswaite added 3 commits 2021-07-14 22:08:29 +00:00
cadamswaite added 1 commit 2021-07-14 22:17:55 +00:00
continuous-integration/drone/pr Build is failing Details
89baf757cf
Sort imports
Owner

Sorry for the delay here! This looks like a totally reasonable solution to me. I don't think a new import is a big deal, especially since it's from the standard library. If you really didn't want the relative complexity of a Queue, you could do something similar with nested for loops (outer for for the repos and inner for for the retry count), but that wouldn't have the nice property you have here of adding retries to the back of the line.

In any case, I think this would benefit from test coverage of some sort—just to make sure the retrying and number of retries is as expected. Do you want to take a stab and testing or would you prefer I do it?

Sorry for the delay here! This looks like a totally reasonable solution to me. I don't think a new import is a big deal, especially since it's from the standard library. If you really didn't want the relative complexity of a `Queue`, you could do something similar with nested `for` loops (outer `for` for the repos and inner `for` for the retry count), but that wouldn't have the nice property you have here of adding retries to the back of the line. In any case, I think this would benefit from test coverage of some sort—just to make sure the retrying and number of retries is as expected. Do you want to take a stab and testing or would you prefer I do it?
cadamswaite added 1 commit 2021-11-14 18:24:21 +00:00
continuous-integration/drone/pr Build is failing Details
6b182c9d2d
Merge branch 'master' into master
cadamswaite added 3 commits 2021-11-14 22:39:30 +00:00
Owner

Awesome, thanks for doing the work for the exhaustive test coverage! I'll merge this and then make a couple of minor tweaks like standardizing the log messages. But you can expect this to be part of the next release.

Awesome, thanks for doing the work for the exhaustive test coverage! I'll merge this and then make a couple of minor tweaks like standardizing the log messages. But you can expect this to be part of the next release.
witten merged commit 180018fd81 into master 2021-11-15 19:34:25 +00:00
Owner

Oh, and just a heads up: I think I'm going to rename retry_timeout to retry_wait, as the value is not a timeout in the traditional sense of the word!

Oh, and just a heads up: I think I'm going to rename `retry_timeout` to `retry_wait`, as the value is not a timeout in the traditional sense of the word!
Author
Contributor

No worries, the renames sound good to me.
I've been running with retries for a while, and they've significantly reduced the number of false fails.
Though I must admit I'm also being effected by #439 e.g.

Remote: ssh: Could not resolve hostname foo: Try again
Connection closed by remote host. Is borg working on the server?
ssh://root@foo:23//borgmatic_data: Error running actions for repository
Command 'borg prune --keep-daily 3 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2 --prefix {hostname}- --stats ssh://root@foo:23//borgmatic_data' returned non-zero exit status 2.
Retrying.. attempt 1/3
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
                       Deleted data:             
 ... ------------------------------------------------------------------------------
Finished a backup.
/etc/borgmatic.d/config.yaml: Error running post hook
HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /hcurl (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9da97e80>: Failed to establish a new connection: [Errno -3] Try again'))
No worries, the renames sound good to me. I've been running with retries for a while, and they've significantly reduced the number of false fails. Though I must admit I'm also being effected by #439 e.g. ``` Remote: ssh: Could not resolve hostname foo: Try again Connection closed by remote host. Is borg working on the server? ssh://root@foo:23//borgmatic_data: Error running actions for repository Command 'borg prune --keep-daily 3 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2 --prefix {hostname}- --stats ssh://root@foo:23//borgmatic_data' returned non-zero exit status 2. Retrying.. attempt 1/3 ------------------------------------------------------------------------------ Original size Compressed size Deduplicated size Deleted data: ... ------------------------------------------------------------------------------ Finished a backup. /etc/borgmatic.d/config.yaml: Error running post hook HTTPSConnectionPool(host='hc-ping.com', port=443): Max retries exceeded with url: /hcurl (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9da97e80>: Failed to establish a new connection: [Errno -3] Try again')) ```
Sign in to join this conversation.
No reviewers
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: borgmatic-collective/borgmatic#432
No description provided.