Added support for grafana loki #747
No reviewers
Labels
No Label
bug
data loss
design finalized
good first issue
new feature area
question / support
security
waiting for response
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: borgmatic-collective/borgmatic#747
Loading…
Reference in New Issue
No description provided.
Delete Branch ":main"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Wow, thanks so much for implementing this hook. Overall it looks great and most of the feedback I left is for admittedly pretty minor stuff. The one potentially bigger thing missing to my eye is unit tests. Looking at the tests for some of the monitoring hooks may prove useful, and I'm happy to answer questions if you get stuck.
@ -1406,0 +1417,4 @@
labels:
type: object
additionalProperties:
type: string
Cool! I didn't know you could do this.
@ -1406,0 +1422,4 @@
Allows setting custom labels for the logging stream. At
least one label is required. "__hostname" gets replaced by
the machine hostname automatically. "__config" gets replaced
by just the name of the configuration file. "__config_path"
What do you think of standardizing on the
{name}
placeholder syntax that's used elsewhere in borgmatic's configuration file and also by Borg as well?@ -0,0 +9,4 @@
logger = logging.getLogger(__name__)
MONITOR_STATE_TO_HEALTHCHECKS = {
This should probably be renamed to
MONITOR_STATE_TO_LOKI
. 😄@ -0,0 +19,4 @@
MAX_BUFFER_LINES = 100
class loki_log_buffer:
Code style convention: Uppercase first letter of class names, e.g.
Loki_log_buffer
.@ -0,0 +30,4 @@
self.root = {}
self.root['streams'] = [{}]
self.root['streams'][0]['stream'] = {}
self.root['streams'][0]['values'] = []
A perhaps more succinct/declarative way to express this might be:
@ -0,0 +32,4 @@
self.root['streams'][0]['stream'] = {}
self.root['streams'][0]['values'] = []
def add_value(self, value):
It would be great if some of these functions have a brief docstring describing what they do.
@ -0,0 +39,4 @@
def add_label(self, label, value):
self.root['streams'][0]['stream'][label] = value
def _to_request(self):
Note that I've been trying to stop using underscore functions in this codebase, operating under the "we're all adults" theory of function privacy. You'll still find them in places though.
@ -0,0 +46,4 @@
return len(self.root['streams'][0]['values'])
def flush(self):
if self.dry_run:
Blank lines after some of these
if
s would be nice. Maybe before thetry
as well. In general I find that helps readability even if it's not required by the language.@ -0,0 +56,4 @@
request_header = {'Content-Type': 'application/json'}
try:
result = requests.post(self.url, headers=request_header, data=request_body, timeout=5)
result.raise_for_status()
This code looks good, but it's a little counter-intuitive to me that it would be here. For instance, I think of a buffer as a data structure for storing stuff, not necessarily as a data structure that also implicitly has the side effect of pushing logs to an external service. Maybe I'm just not used to OOP. 😄 I don't feel super strongly or anything, but this might be less surprising if the push to Loki took place elsewhere like in
ping_monitor()
.I'm guessing part of the reason you're doing it this way though is so that logs get sent to Loki as borgmatic runs rather than all at the end once
ping_monitor()
is called..? The Healthchecks hook for example only sends logs at the end, but the rationale there is that it's explicitly logging the success/failure status of the backup rather than only logs along the way. So the requirements may be a little different.Anyway, let me know your thoughts. (And then maybe put some of them into docstrings. 😄)
Well I am trying to not hit the max request size limits of a lot of loki instances that run behind e.g. nginx. I think it is much better for large volumes of logs to be pushed incrementally instead of pushing it as one who knows how big request in the end.
Gotcha. The Healthchecks hook "solves" that particular problem by reverse truncating the logs so that older messages are not sent if the logs get too big by the time
ping_monitor()
is called. However I can see why you might not want to do that in this case, since loki seems much more about log aggregation than simply tracking service status.@ -0,0 +58,4 @@
result = requests.post(self.url, headers=request_header, data=request_body, timeout=5)
result.raise_for_status()
except requests.RequestException:
logger.warn('Failed to upload logs to loki')
I think this should be
.warning()
rather than.warn()
. See the logging docs for more info.Also.. Nice-to-have but not required: Putting the
config_filename
at the start of this warning message. Seehealthchecks.py
error handling for an example.@ -0,0 +61,4 @@
logger.warn('Failed to upload logs to loki')
class loki_log_handeler(logging.Handler):
Spelling and capitalization:
Loki_log_handler
@ -0,0 +76,4 @@
def add_label(self, key, value):
self.buffer.add_label(key, value)
def raw(self, msg):
Some brief docstrings would be good on these functions too. For instance, I'm not sure at a glance why this function is called
.raw()
.. I'm sure there's a good reason though.@ -0,0 +88,4 @@
def initialize_monitor(hook_config, config, config_filename, monitoring_log_level, dry_run):
'''
Add a handler to the root logger to regularly send the logs to loki
Just a code style convention nit: Period at the end of sentences in docstrings.
@ -0,0 +93,4 @@
url = hook_config.get('url')
loki = loki_log_handeler(url, dry_run)
for key, value in hook_config.get('labels').items():
if value == '__hostname':
Will users expect to substitute these placeholders within longer strings? Example value:
The name of this host is __hostname
Not really. Grafana loki labels are rarely longer than a single word and the placeholders can be compared to what grafana themself have implemented in promtail which is their loki log agent. The label format is taken from https://grafana.com/docs/loki/latest/clients/promtail/configuration/#syslog. It works the same for almost all of promtail and I think I would rather adhere to their ecosystem of how they label streams than this one but thats up for debate.
I'm fine leaving it as-is if there's already convention on the loki side.
@ -0,0 +96,4 @@
if value == '__hostname':
loki.add_label(key, platform.node())
elif value == '__config':
loki.add_label(key, config_filename.split('/')[-1])
Check out
os.path.basename()
for this!I wanted to ask about this anyway: Is the hostname of the system used anywhere else so we can have consistent naming? Not that it matters much for this as users can change the hostname label anyways but it would still be good to be consistent
In borgmatic code, I think the PagerDuty hook uses hostname, but that's about it. Well, I guess hostname can be part of the
archive_name_format
configured in borgmatic but then it's passed to Borg.@ -0,0 +108,4 @@
'''
Adds an entry to the loki logger with the current state
'''
if not dry_run:
You could instead do:
... just to cut down on the deep indentation a bit. That would also give you an opportunity to log about the fact that it's a dry run and you're bailing. See
healthchecks.py
or one of the other monitoring hooks for an example.I don't think adding a entry that we are bailing is useful here as I added the entry for every time we flush the buffer now (which is at least once per program).
Sounds good.
@ -0,0 +112,4 @@
for handler in tuple(logging.getLogger().handlers):
if isinstance(handler, loki_log_handeler):
if state in MONITOR_STATE_TO_HEALTHCHECKS.keys():
handler.raw(f'{config_filename} {MONITOR_STATE_TO_HEALTHCHECKS[state]} backup')
Maybe put a
:
after{config_filename}
as per convention.Is there a reason though you're picking out the loki log handler to log to instead of just logging this as a general log message? E.g., you could just throw the word "loki" somewhere in it and use
logger.info()
or.debug()
or whatever. Again, check out the other monitoring hooks for examples.Yes there is a reason for doing it this way. First of all as far as I am aware the function is just supposed to inform the monitoring app of the current state. If we used
logger.info()
the message will depend on the current log level and settings etc. Skipping the logging system makes sure that: 1. The message is only send to loki and not any other agents (which should generate their own message) 2. The message is always send regardless of how the application logger is configured.Okay, gotcha.
Tests are coming as soon as you are fine with the current state of the hook and the mayor questions are resolved.
014fd78b98
to7e419ec995
Yup, I think this approach is fine. Thanks!
@witten Unit tests are done now.
Just had a look at the tests and left a bunch of comments, some of them no doubt obnoxious. But overall, the tests are lookin' good. I like how you have coverage of both the higher-level integrations and some more detailed tests on the lower-level units. As per usual, let me know if you have any questions or feedback on the comments.
@ -0,0 +4,4 @@
import platform
import logging
import requests
There's a convention in this codebase to import the module under test as
module
. E.g.:... and then refer to it as such in the tests. This makes is super clear what's part of the unit under test and what isn't.
Thats fixed now
@ -0,0 +6,4 @@
import requests
def test_log_handler_gets_added():
Test function naming convention in this codebase is to include the name of the unit under test after
test_
. So for instance you could call thistest_initialize_monitor_adds_log_handler()
. IMO this makes it easier to find relevant tests and forces you to pick a single unit to really focus on in each test.Similar for other test functions in this file.
I'll go ahead and do this after merging rather than having to do another round of back-and-forth on the PR!
@ -0,0 +9,4 @@
def test_log_handler_gets_added():
hook_config = {'url': 'http://localhost:3100/loki/api/v1/push', 'labels': {'app': 'borgmatic'}}
config_filename = 'test.yaml'
dry_run = True
You could move these inline to the function call as keyword arguments (
... , config_filename='test.yaml', ...
) since they're only used once.@ -0,0 +10,4 @@
hook_config = {'url': 'http://localhost:3100/loki/api/v1/push', 'labels': {'app': 'borgmatic'}}
config_filename = 'test.yaml'
dry_run = True
loki.initialize_monitor(hook_config, '', config_filename, '', dry_run)
Instead of
''
as placeholder for values that don't matter, the convention here is to useflexmock()
instances (which you can byfrom flexmock import flexmock
). That has the benefit of giving more sensible error messages if they do end up used.Same thing elsewhere in this file.
@ -0,0 +13,4 @@
loki.initialize_monitor(hook_config, '', config_filename, '', dry_run)
for handler in tuple(logging.getLogger().handlers):
if isinstance(handler, loki.Loki_log_handler):
assert True
You could safely delete this
assert
since it'll always be true.@ -0,0 +15,4 @@
if isinstance(handler, loki.Loki_log_handler):
assert True
return
assert False
If you wanted to make this loop and assertion a little more declarative, you could do:
Do not feel strongly.
@ -0,0 +16,4 @@
assert True
return
assert False
So looking at the preceding test, I don't think it actually qualifies as a true "unit" test in that its unit under test (
initialize_monitor()
) integrates with another unit (Loki_log_handler
). So I think your options here are: 1. Mock outLoki_log_handler
withflexmock
or 2. Move this test into theintegration
test directory.This may seem like unnecessary pedantry, but this sort of rigor is intended to make it super clear what sort of coverage a given unit has—and make sure units actually get unit tested where that makes sense.
Also: IMO you don't need to mock out
platform.node()
oros.path.basename()
in this particular test though, because those code paths aren't triggered here. And I generally don't mock out Python's logging system in a test because it's standard library.@ -0,0 +18,4 @@
assert False
def test_ping():
Besides the name of the unit under test, it would be great to have your expectation encoded into the test name. Example:
test_ping_monitor_flushes_buffer()
if that's indeed what you're expecting to happen.@ -0,0 +26,4 @@
loki.ping_monitor(hook_config, '', config_filename, loki.monitor.State.FINISH, '', dry_run)
for handler in tuple(logging.getLogger().handlers):
if isinstance(handler, loki.Loki_log_handler):
assert len(handler.buffer) <= 1
Interesting. I think what you're doing here is checking that the buffer actually got flushed, which seems like a reasonable thing to do in this test. But one failure mode I can see is if nothing got put into the buffer to begin with. Is there some way to prevent that at the test level? Maybe that case is covered by the other tests below? I'll keep reading!
In any case, why
<= 1
instead of== 0
?I changed this test to makes more sense. It now checks that the buffer gets the exact message I expect.
@ -0,0 +28,4 @@
if isinstance(handler, loki.Loki_log_handler):
assert len(handler.buffer) <= 1
return
assert False
Similar feedback for this test: I think it either needs mocking to make it into a unit test.. or you can move it into
integration
if you prefer to keep it integrating multiple functions.@ -0,0 +39,4 @@
loki.destroy_monitor(hook_config, '', config_filename, '', dry_run)
for handler in tuple(logging.getLogger().handlers):
if isinstance(handler, loki.Loki_log_handler):
assert False
You could use the
any()
idiom here too.@ -0,0 +47,4 @@
buffer.add_label('test', 'label')
assert buffer.root['streams'][0]['stream']['test'] == 'label'
buffer.add_label('test2', 'label2')
assert buffer.root['streams'][0]['stream']['test2'] == 'label2'
For this
assert
(and maybe the previous one) it'd be good to assert the entire'stream'
structure IMO. That way, you're ensuring that adding one label doesn't erroneously replace previous ones. Example:@ -0,0 +62,4 @@
if isinstance(handler, loki.Loki_log_handler):
assert handler.buffer.root['streams'][0]['stream']['hostname'] == platform.node()
assert handler.buffer.root['streams'][0]['stream']['config'] == 'test.yaml'
assert handler.buffer.root['streams'][0]['stream']['config_full'] == config_filename
You could also do a single full-
'stream'
assert here. Python is pretty good about displaying sensible errors for complex value discrepancies in test assertions.@ -0,0 +78,4 @@
def test_log_handler_gets_raw():
handler = loki.Loki_log_handler('', False)
handler.emit(flexmock(getMessage=lambda: 'Some test log line'))
Nice use of mocking. Although personally I'm not above calling
logging.makeLogRecord()
in tests...@ -0,0 +84,4 @@
def test_log_handler_json():
buffer = loki.Loki_log_buffer('', False)
assert json.loads(buffer.to_request()) == json.loads('{"streams":[{"stream":{},"values":[]}]}')
Why not just do an assert that the
to_request()
value is the expected encoded JSON string? Why bother decoding it? Presumably any JSON string you put into a test is known-valid.Mainly whitespace etc. It's more consistent this way and we don't have to worry about anything formatting related. In the end I don't care that the strings are the same but that the json is the same.
@ -0,0 +105,4 @@
handler = loki.Loki_log_handler('', False)
flexmock(loki.requests).should_receive('post').and_return(
flexmock(raise_for_status=lambda: '')
).once()
Good. I'm glad to see this code path has some coverage!
@ -0,0 +106,4 @@
flexmock(loki.requests).should_receive('post').and_return(
flexmock(raise_for_status=lambda: '')
).once()
for x in range(150):
Instead of
150
, you could doloki.MAX_BUFFER_LINES + 1
or evenloki.MAX_BUFFER_LINES * 1.5
.Do not feel strongly. Magic numbers are okay-er in tests than in actual code IMO.
@ -0,0 +119,4 @@
handler.raw(x)
def test_empty_flush():
Since there are no assertions here, I'd encode that fact into the test name. Example:
test_loki_log_handler_flush_does_not_raise()
. That says: We're just running through the motions here to shake out any stupid bugs but not asserting anything.I updated the branch based on what you wrote @witten
Looks great. Thank you for sticking with this!