#321 Excess spinning = excess CPU

Open
opened 4 months ago by paulharris · 7 comments

Related to bug #361,

when borg is busy looking for files to backup, borgmatic is chewing up the CPU like mad.

Right now, borg is using 10%, borgmatic is using 80%.

My guess is the execute.py loop (waiting for data on the pipes) is polling with no sleep.

I think it needs to wait on both output_buffers and processes.
I added this line: time.sleep(0.1) (see below), and import time at the top of the file, of course.
And now borgmatic was a lot better behaved.
I’m not a python guy so I can’t help further on that.

if output_buffers:
    (ready_buffers, _, _) = select.select(output_buffers, [], [])

    for ready_buffer in ready_buffers:
        line = ready_buffer.readline().rstrip().decode()
        if not line:
            continue

        # Keep the last few lines of output in case the process errors, and we need the output for
        # the exception below.
        last_lines = buffer_last_lines[ready_buffer]
        last_lines.append(line)
        if len(last_lines) > ERROR_OUTPUT_MAX_LINE_COUNT:
            last_lines.pop(0)

        logger.log(output_log_level, line)
else: ### HERE is my addition
    time.sleep(0.1) ### HERE is my addition
Related to bug #361, when borg is busy looking for files to backup, borgmatic is chewing up the CPU like mad. Right now, borg is using 10%, borgmatic is using 80%. My guess is the execute.py loop (waiting for data on the pipes) is polling with no sleep. I think it needs to wait on both output_buffers *and* processes. I added this line: time.sleep(0.1) (see below), and import time at the top of the file, of course. And now borgmatic was a lot better behaved. I'm not a python guy so I can't help further on that. ``` if output_buffers: (ready_buffers, _, _) = select.select(output_buffers, [], []) for ready_buffer in ready_buffers: line = ready_buffer.readline().rstrip().decode() if not line: continue # Keep the last few lines of output in case the process errors, and we need the output for # the exception below. last_lines = buffer_last_lines[ready_buffer] last_lines.append(line) if len(last_lines) > ERROR_OUTPUT_MAX_LINE_COUNT: last_lines.pop(0) logger.log(output_log_level, line) else: ### HERE is my addition time.sleep(0.1) ### HERE is my addition ```
witten commented 4 months ago
Owner

Good find! I’m not sure what’s going on here, because:

I think it needs to wait on both output_buffers and processes.

It already should be!

for process in processes:
    exit_code = process.poll() if output_buffers else process.wait()

This should be calling process.wait() when output_buffers is not set (and therefore the select() gets bypassed). So I suspect that output_buffers is set, and the select() is actually firing. (And maybe firing too fast.)

In any case, it sounds like something is going wrong, and borgmatic is chewing too much CPU. I’ll see if I can reproduce this.

Good find! I'm not sure what's going on here, because: > I think it needs to wait on both output_buffers and processes. It already should be! ```python for process in processes: exit_code = process.poll() if output_buffers else process.wait() ``` This should be calling `process.wait()` when `output_buffers` is not set (and therefore the `select()` gets bypassed). So I suspect that `output_buffers` *is* set, and the `select()` is actually firing. (And maybe firing too fast.) In any case, it sounds like something is going wrong, and borgmatic is chewing too much CPU. I'll see if I can reproduce this.
witten added the
bug
label 4 months ago
witten commented 4 months ago
Owner

Right now, borg is using 10%, borgmatic is using 80%.

Are you sure that’s not reversed, with Borg chewing up most of the CPU, and borgmatic hovering around 10%? Because that’s consistent with what I’m seeing here. I haven’t been able to get borgmatic to go much above 10%, even when Borg is spewing hundreds of filenames to backup. I’ve also tried setting output_buffers to an empty list, which uses even less CPU.

I did try introducing a sleep(), which cut borgmatic’s CPU usage down to 3-4% or so on my machine. Borg CPU was also lower, probably because overall throughput was reduced.

> Right now, borg is using 10%, borgmatic is using 80%. Are you sure that's not reversed, with Borg chewing up most of the CPU, and borgmatic hovering around 10%? Because that's consistent with what I'm seeing here. I haven't been able to get borgmatic to go much above 10%, even when Borg is spewing hundreds of filenames to backup. I've also tried setting `output_buffers` to an empty list, which uses even less CPU. I did try introducing a `sleep()`, which cut borgmatic's CPU usage down to 3-4% or so on my machine. Borg CPU was also lower, probably because overall throughput was reduced.
witten added the
waiting for response
label 4 months ago
paulharris commented 4 months ago
Poster
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784  28248   9272 R  49.5   0.3   0:55.94 borgmatic                                                                      
 8823 root      20   0 3044956   2.9g  11964 R  47.5  36.2   1:50.96 borg          

This is when borg is not printing anything out.
ie when its busy scanning files looking for what to backup.
Note the very large memory usage by borg, due to 1-2 million file in the backup.

A bit later, to show its not just at the start (see total CPU time):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784  28248   9272 R  54.3   0.3   2:02.43 borgmatic                                                                      
 8823 root      20   0 3044956   2.9g  11964 R  42.1  36.2   2:45.19 borg             

After a few minutes, borg still hasn’t printed anything for a while.
Notice total CPU time (TIME+) is 5 minutes for borgmatic - same as borg!

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784  26688   7712 R  61.8   0.3   5:55.88 borgmatic                                                                      
 8823 root      20   0 3044956   2.9g  10656 R  35.9  36.2   5:30.84 borg    

Still scanning I guess ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784   7604   6972 R  55.8   0.1  20:41.63 borgmatic                                                                      
 8823 root      20   0 3046748   2.9g   8996 R  42.2  36.1  15:53.57 borg         

It found some files to backup, so its started printing ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784   8468   6972 R  94.7   0.1  36:38.06 borgmatic                                                                      
 8823 root      20   0 3047260   2.9g   9172 D   0.0  36.1  26:47.19 borg                  
 
 
A little later
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                        
 8077 root      20   0   38784   8468   6972 R  51.3   0.1  37:06.44 borgmatic                                                                      
 8823 root      20   0 3048284   2.9g   9172 R  39.7  36.1  27:05.47 borg         
``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 28248 9272 R 49.5 0.3 0:55.94 borgmatic 8823 root 20 0 3044956 2.9g 11964 R 47.5 36.2 1:50.96 borg ``` This is when borg is *not* printing anything out. ie when its busy scanning files looking for what to backup. Note the very large memory usage by borg, due to 1-2 million file in the backup. A bit later, to show its not just at the start (see total CPU time): ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 28248 9272 R 54.3 0.3 2:02.43 borgmatic 8823 root 20 0 3044956 2.9g 11964 R 42.1 36.2 2:45.19 borg ``` After a few minutes, borg still hasn't printed anything for a while. Notice total CPU time (TIME+) is 5 minutes for borgmatic - same as borg! ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 26688 7712 R 61.8 0.3 5:55.88 borgmatic 8823 root 20 0 3044956 2.9g 10656 R 35.9 36.2 5:30.84 borg ``` Still scanning I guess ... ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 7604 6972 R 55.8 0.1 20:41.63 borgmatic 8823 root 20 0 3046748 2.9g 8996 R 42.2 36.1 15:53.57 borg ``` It found some files to backup, so its started printing ... ``` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 8468 6972 R 94.7 0.1 36:38.06 borgmatic 8823 root 20 0 3047260 2.9g 9172 D 0.0 36.1 26:47.19 borg A little later PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 8077 root 20 0 38784 8468 6972 R 51.3 0.1 37:06.44 borgmatic 8823 root 20 0 3048284 2.9g 9172 R 39.7 36.1 27:05.47 borg ```
witten commented 4 months ago
Owner

Okay, so just to clarify... If you’ve got -v 2 set with borgmatic, this is happening when Borg prints Processing files ... but before it prints anything else? Or is it happening at a different stage in the process? Is this with or without borgmatic’s --files flag?

It might be helpful to see your Borg command and/or (redacted) borgmatic verbose output. Thanks for your help to debug!

Okay, so just to clarify... If you've got `-v 2` set with borgmatic, this is happening when Borg prints `Processing files ...` but before it prints anything else? Or is it happening at a different stage in the process? Is this with or without borgmatic's `--files` flag? It might be helpful to see your Borg command and/or (redacted) borgmatic verbose output. Thanks for your help to debug!
witten removed the
waiting for response
label 4 months ago
paulharris commented 4 months ago
Poster

Responded via email, and its in the middle of printing files,
so eg it prints 100 files, then stops while its scanning for other files to back up (most already backed up). Eventually it finds some more.

Responded via email, and its in the middle of printing files, so eg it prints 100 files, then stops while its scanning for other files to back up (most already backed up). Eventually it finds some more.
witten commented 4 months ago
Owner

Interesting. I still haven’t been able to reproduce this locally. I’ve tried various combinations of deleting Borg’s cache, lowering the Borg checkpoint interval, etc. It might just be that I don’t have files laid out such that I can get Borg to scan for (new/modified) files to backup for an appreciable amount of time. Any particular clues in terms of how your files are laid out that might be causing this behavior?

BTW, I don’t think I saw a response via email.

Interesting. I still haven't been able to reproduce this locally. I've tried various combinations of deleting Borg's cache, lowering the Borg checkpoint interval, etc. It might just be that I don't have files laid out such that I can get Borg to scan for (new/modified) files to backup for an appreciable amount of time. Any particular clues in terms of how your files are laid out that might be causing this behavior? BTW, I don't think I saw a response via email.
witten added the
waiting for response
label 4 months ago
paulharris commented 4 months ago
Poster

I emailed, I guess you aren’t getting emails from me.

Create a “data” folder to back up.
NODES: Create 2 hundred folders (from 1 to 200).
MONTHS: In each folder, create 60 folders (one for each month of 5 years, they are numbered eg 2020-03, but they can just be 1,2,3,...)
HOURS: In each of those folders, create a 1000 files, each 120KB of random stuff, ie use dd if=/dev/random of=thefile bs=120k count=1 (thats a guess at the command line)

Back it up.
Then, once backed up, back it up again.
You can add some more files to be sure its actually working.
In my case, new folders are added to the last of each of those “60 folders” (ie a new MONTHS folder for all nodes)
and new files are added to the last of each of those MONTHS folders.
and sometimes a new top-level folder is created (NODES)

The second run highlights the problem, not the first run.

I emailed, I guess you aren't getting emails from me. Create a "data" folder to back up. NODES: Create 2 hundred folders (from 1 to 200). MONTHS: In each folder, create 60 folders (one for each month of 5 years, they are numbered eg 2020-03, but they can just be 1,2,3,...) HOURS: In each of those folders, create a 1000 files, each 120KB of random stuff, ie use dd if=/dev/random of=thefile bs=120k count=1 (thats a guess at the command line) Back it up. Then, once backed up, back it up again. You can add some more files to be sure its actually working. In my case, new folders are added to the last of each of those "60 folders" (ie a new MONTHS folder for all nodes) and new files are added to the last of each of those MONTHS folders. and sometimes a new top-level folder is created (NODES) The second run highlights the problem, not the first run.
witten removed the
waiting for response
label 3 months ago
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.