Excess spinning = excess CPU #321

Closed
opened 2020-05-30 00:23:20 +00:00 by paulharris · 11 comments

Related to bug #361: when borg is busy looking for files to back up, borgmatic is chewing up the CPU like mad.

Right now, borg is using 10%, borgmatic is using 80%.

My guess is the `execute.py` loop (waiting for data on the pipes) is polling with no sleep.

I think it needs to wait on both `output_buffers` *and* `processes`.
I added this line: `time.sleep(0.1)` (see below), and `import time` at the top of the file, of course.
And now borgmatic was a lot better behaved.
I'm not a Python guy, so I can't help further on that.

```
if output_buffers:
    (ready_buffers, _, _) = select.select(output_buffers, [], [])

    for ready_buffer in ready_buffers:
        line = ready_buffer.readline().rstrip().decode()
        if not line:
            continue

        # Keep the last few lines of output in case the process errors, and we need the output for
        # the exception below.
        last_lines = buffer_last_lines[ready_buffer]
        last_lines.append(line)
        if len(last_lines) > ERROR_OUTPUT_MAX_LINE_COUNT:
            last_lines.pop(0)

        logger.log(output_log_level, line)
else:  ### HERE is my addition
    time.sleep(0.1)  ### HERE is my addition
```
Owner

Good find! I'm not sure what's going on here, because:

> I think it needs to wait on both output_buffers and processes.

It already should be!

```python
for process in processes:
    exit_code = process.poll() if output_buffers else process.wait()
```

This should be calling `process.wait()` when `output_buffers` is not set (and therefore the `select()` gets bypassed). So I suspect that `output_buffers` *is* set, and the `select()` is actually firing. (And maybe firing too fast.)
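For illustration (a standalone sketch, not borgmatic's code), the blocking difference between the two calls is easy to see: `poll()` returns immediately with `None` while the child is still running, so a loop built around it spins unless something else blocks, whereas `wait()` suspends until the child exits:

```python
import subprocess
import time

# Start a child process that runs for half a second.
process = subprocess.Popen(['sleep', '0.5'])

start = time.monotonic()
assert process.poll() is None      # Child still running; poll() returns at once.
polled = time.monotonic() - start  # Effectively zero.

process.wait()                     # Blocks until the child exits.
waited = time.monotonic() - start  # Roughly the child's lifetime.
```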

In any case, it sounds like something is going wrong, and borgmatic is chewing too much CPU. I'll see if I can reproduce this.

witten added the
bug
label 2020-05-30 20:57:09 +00:00
Owner

> Right now, borg is using 10%, borgmatic is using 80%.

Are you sure that's not reversed, with Borg chewing up most of the CPU, and borgmatic hovering around 10%? Because that's consistent with what I'm seeing here. I haven't been able to get borgmatic to go much above 10%, even when Borg is spewing hundreds of filenames to back up. I've also tried setting `output_buffers` to an empty list, which uses even less CPU.

I did try introducing a `sleep()`, which cut borgmatic's CPU usage down to 3-4% or so on my machine. Borg CPU was also lower, probably because overall throughput was reduced.

witten added the
waiting for response
label 2020-06-06 21:27:37 +00:00
Author
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784  28248   9272 R  49.5   0.3   0:55.94 borgmatic
 8823 root      20   0 3044956   2.9g  11964 R  47.5  36.2   1:50.96 borg
```

This is when borg is *not* printing anything out,
i.e. when it's busy scanning files looking for what to back up.
Note the very large memory usage by borg, due to 1-2 million files in the backup.

A bit later, to show it's not just at the start (see total CPU time):

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784  28248   9272 R  54.3   0.3   2:02.43 borgmatic
 8823 root      20   0 3044956   2.9g  11964 R  42.1  36.2   2:45.19 borg
```

After a few minutes, borg still hasn't printed anything for a while.
Notice total CPU time (TIME+) is 5 minutes for borgmatic - same as borg!

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784  26688   7712 R  61.8   0.3   5:55.88 borgmatic
 8823 root      20   0 3044956   2.9g  10656 R  35.9  36.2   5:30.84 borg
```

Still scanning I guess ...

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784   7604   6972 R  55.8   0.1  20:41.63 borgmatic
 8823 root      20   0 3046748   2.9g   8996 R  42.2  36.1  15:53.57 borg
```

It found some files to back up, so it's started printing ...

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784   8468   6972 R  94.7   0.1  36:38.06 borgmatic
 8823 root      20   0 3047260   2.9g   9172 D   0.0  36.1  26:47.19 borg
```

A little later:

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8077 root      20   0   38784   8468   6972 R  51.3   0.1  37:06.44 borgmatic
 8823 root      20   0 3048284   2.9g   9172 R  39.7  36.1  27:05.47 borg
```
Owner

Okay, so just to clarify... If you've got `-v 2` set with borgmatic, this is happening when Borg prints `Processing files ...` but before it prints anything else? Or is it happening at a different stage in the process? Is this with or without borgmatic's `--files` flag?

It might be helpful to see your Borg command and/or (redacted) borgmatic verbose output. Thanks for your help to debug!

witten removed the
waiting for response
label 2020-06-10 04:35:28 +00:00
Author

Responded via email. It's in the middle of printing files,
so e.g. it prints 100 files, then stops while it's scanning for other files to back up (most are already backed up). Eventually it finds some more.

Owner

Interesting. I still haven't been able to reproduce this locally. I've tried various combinations of deleting Borg's cache, lowering the Borg checkpoint interval, etc. It might just be that I don't have files laid out such that I can get Borg to scan for (new/modified) files to back up for an appreciable amount of time. Any particular clues in terms of how your files are laid out that might be causing this behavior?

BTW, I don't think I saw a response via email.

witten added the
waiting for response
label 2020-06-23 18:05:33 +00:00
Author

I emailed, I guess you aren't getting emails from me.

Create a "data" folder to back up.
NODES: Create 200 folders (numbered 1 to 200).
MONTHS: In each folder, create 60 folders (one for each month of 5 years; they are numbered e.g. 2020-03, but they can just be 1, 2, 3, ...).
HOURS: In each of those folders, create 1000 files, each 120KB of random stuff, i.e. use `dd if=/dev/random of=thefile bs=120k count=1` (that's a guess at the command line).

Back it up.
Then, once backed up, back it up again.
You can add some more files to be sure it's actually working.
In my case, new folders are added to the last of each of those "60 folders" (i.e. a new MONTHS folder for all nodes),
new files are added to the last of each of those MONTHS folders,
and sometimes a new top-level folder is created (NODES).

The second run highlights the problem, not the first run.
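The layout described above can also be generated with a short Python script. This sketch uses tiny counts so it runs quickly; scale `NODES`, `MONTHS`, `FILES`, and `SIZE` up toward 200 x 60 x 1000 files of 120 KB to approximate the real repro (roughly 1.4 TB):

```python
import os
import pathlib

# Tiny counts for a quick smoke test; raise to 200, 60, 1000, 120 * 1024
# to approximate the layout described in the comment above.
NODES, MONTHS, FILES, SIZE = 3, 2, 5, 1024


def make_tree(root):
    for node in range(1, NODES + 1):
        for month in range(1, MONTHS + 1):
            folder = pathlib.Path(root) / str(node) / str(month)
            folder.mkdir(parents=True, exist_ok=True)
            for index in range(FILES):
                # Random payload, so Borg can't deduplicate the files away.
                (folder / f'file{index}').write_bytes(os.urandom(SIZE))


make_tree('data')
```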

witten removed the
waiting for response
label 2020-06-29 21:26:11 +00:00
Owner

Unfortunately, I'm going to have to admit defeat on this one. Having insufficient disk space to repro this locally, I finally sprung for a temporary cloud server with enough storage space. But it costs US$330/month (billed hourly) and it was taking too long to even get to a repro state. (~12 hours to create the file layout, and then another ~12 hours to get not even a quarter of the way through the first backup.)

So I'm closing this ticket for the time being. However... If you manage to come up with a repro on a (much) smaller data set—or if you can get me temporary access to a server where this problem is manifesting—I'd be happy to reopen this and look into it further! Thanks for all your patience with this ticket, and I'm sorry for not making more progress on it.

Author

Maybe just include the 2 lines that I suggested?
borgmatic has been working much better for me now,
and it won't cause borgmatic to run slower for anyone else...

```
diff --git a/borgmatic/execute.py b/borgmatic/execute.py
index e140ea8..2a46576 100644
--- a/borgmatic/execute.py
+++ b/borgmatic/execute.py
@@ -83,6 +83,8 @@ def log_outputs(processes, exclude_stdouts, output_log_level, borg_local_path):
                     last_lines.pop(0)
 
                 logger.log(output_log_level, line)
+        else:
+            time.sleep(0.1)
 
         for process in processes:
             exit_code = process.poll() if output_buffers else process.wait()
```
Owner

As a general rule, I try to avoid adding random delays without understanding the underlying issue they're fixing. And in this case, I certainly don't understand, because there should be no busywaiting if the `select()` is working as intended! I'm curious... If you change your `time.sleep(0.1)` to `time.sleep(0)`, does it have the same effect of drastically reducing borgmatic CPU usage? I'm trying to discern whether it's the delay that "fixes" it, or whether it's simply the yielding to other processes (like Borg).

witten added the
waiting for response
label 2022-03-08 05:53:33 +00:00
Owner

I'm closing this ticket due to inactivity. But please feel free to reopen if you have any follow-up comments.

Reference: borgmatic-collective/borgmatic#321