Excess spinning = excess CPU #321
Labels
No Label
bug
data loss
design finalized
good first issue
new feature area
question / support
security
waiting for response
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: borgmatic-collective/borgmatic#321
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Related to bug #361,
when borg is busy looking for files to backup, borgmatic is chewing up the CPU like mad.
Right now, borg is using 10%, borgmatic is using 80%.
My guess is the execute.py loop (waiting for data on the pipes) is polling with no sleep.
I think it needs to wait on both output_buffers and processes.
I added this line: time.sleep(0.1) (see below), and import time at the top of the file, of course.
And now borgmatic was a lot better behaved.
I'm not a python guy so I can't help further on that.
Good find! I'm not sure what's going on here, because:
It already should be!
This should be calling
process.wait()
whenoutput_buffers
is not set (and therefore theselect()
gets bypassed). So I suspect thatoutput_buffers
is set, and theselect()
is actually firing. (And maybe firing too fast.)In any case, it sounds like something is going wrong, and borgmatic is chewing too much CPU. I'll see if I can reproduce this.
Are you sure that's not reversed, with Borg chewing up most of the CPU, and borgmatic hovering around 10%? Because that's consistent with what I'm seeing here. I haven't been able to get borgmatic to go much above 10%, even when Borg is spewing hundreds of filenames to backup. I've also tried setting
output_buffers
to an empty list, which uses even less CPU.I did try introducing a
sleep()
, which cut borgmatic's CPU usage down to 3-4% or so on my machine. Borg CPU was also lower, probably because overall throughput was reduced.This is when borg is not printing anything out.
ie when its busy scanning files looking for what to backup.
Note the very large memory usage by borg, due to 1-2 million file in the backup.
A bit later, to show its not just at the start (see total CPU time):
After a few minutes, borg still hasn't printed anything for a while.
Notice total CPU time (TIME+) is 5 minutes for borgmatic - same as borg!
Still scanning I guess ...
It found some files to backup, so its started printing ...
Okay, so just to clarify... If you've got
-v 2
set with borgmatic, this is happening when Borg printsProcessing files ...
but before it prints anything else? Or is it happening at a different stage in the process? Is this with or without borgmatic's--files
flag?It might be helpful to see your Borg command and/or (redacted) borgmatic verbose output. Thanks for your help to debug!
Responded via email, and its in the middle of printing files,
so eg it prints 100 files, then stops while its scanning for other files to back up (most already backed up). Eventually it finds some more.
Interesting. I still haven't been able to reproduce this locally. I've tried various combinations of deleting Borg's cache, lowering the Borg checkpoint interval, etc. It might just be that I don't have files laid out such that I can get Borg to scan for (new/modified) files to backup for an appreciable amount of time. Any particular clues in terms of how your files are laid out that might be causing this behavior?
BTW, I don't think I saw a response via email.
I emailed, I guess you aren't getting emails from me.
Create a "data" folder to back up.
NODES: Create 2 hundred folders (from 1 to 200).
MONTHS: In each folder, create 60 folders (one for each month of 5 years, they are numbered eg 2020-03, but they can just be 1,2,3,...)
HOURS: In each of those folders, create a 1000 files, each 120KB of random stuff, ie use dd if=/dev/random of=thefile bs=120k count=1 (thats a guess at the command line)
Back it up.
Then, once backed up, back it up again.
You can add some more files to be sure its actually working.
In my case, new folders are added to the last of each of those "60 folders" (ie a new MONTHS folder for all nodes)
and new files are added to the last of each of those MONTHS folders.
and sometimes a new top-level folder is created (NODES)
The second run highlights the problem, not the first run.
Unfortunately, I'm going to have to admit defeat on this one. Having insufficient disk space to repro this locally, I finally sprung for a temporary cloud server with enough storage space. But it costs US$330/month (billed hourly) and it was taking too long to even get to a repro state. (~12 hours to create the file layout, and then another ~12 hours to get not even a quarter of the way through the first backup.)
So I'm closing this ticket for the time being. However... If you manage to come up with a repro on a (much) smaller data set—or if you can get me temporary access to a server where this problem is manifesting—I'd be happy to reopen this and look into it further! Thanks for all your patience with this ticket, and I'm sorry for not making more progress on it.
Maybe just include the 2 lines that I suggested?
borgmatic has been working much better for me now,
and it won't cause borgmatic to run slower for anyone else...
As a general rule, I try to avoid adding random delays without understanding the underlying issue they're fixing. And in this case, I certainly don't understand, because there should be no busywaiting if the
select()
is working as intended! I'm curious... If you change yourtime.sleep(0.1)
totime.sleep(0)
, does it have the same effect of drastically reducing borgmatic CPU usage? I'm trying to discern whether it's the delay that "fixes" it, or whether it's simply the yielding to other processes (like Borg).I'm closing this ticket due to inactivity. But please feel free to reopen if you have any follow-up comments.