Log Aggregation with Grafana Loki
We recently deployed Loki to our infrastructure, and in this blog post we want to share the pitfalls and tips we discovered.
With the growing number of servers and services in our infrastructure it was time to introduce a central logging infrastructure to aggregate logs from our hosts and allow searching them centrally. This also has security benefits: if the logs are copied to a separate server, a potential attacker cannot easily hide their traces by manipulating log files. The most common software stack for log aggregation is the ELK stack, consisting of Elasticsearch, Logstash, and Kibana. ELK is, however, known to be a burden to maintain, and there is a relatively new alternative that looked promising to us: Loki. After some research on ELK, Loki, and further alternatives, we decided to go with Loki.
The central difference between Loki and ELK is that Loki does not use a full-text index. The philosophy of Loki is to have only a small number of indexed labels, like host=webhost1 or selector=postgres, and beyond those to rely on highly parallelised brute-force search.
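For example, a query first selects streams by those indexed labels and then filters their content with a line filter; the filtering is the brute-force part (label values here are illustrative):

    {host="webhost1", selector="postgres"} |= "connection refused"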
Loki works in combination with Grafana as its user interface – we already used Grafana, so this was no additional component. The software that scrapes the logs and ships them to the Loki instance is called Promtail. Its similarity to Prometheus is not limited to the name, but there is one central difference: Promtail is usually installed on every host whose logs should be aggregated.
Since we don’t want to rely on VPNs to encrypt and authenticate the traffic between Promtail and Loki, we put both behind an nginx reverse proxy. The nginx configuration for Promtail is fairly straightforward. In the proxy in front of Loki, however, we block access to API endpoints that we are currently not using, to reduce the attack surface. To enable Loki’s live-tail mode in Grafana, one also needs to allow WebSocket connections in the nginx configurations of the reverse proxies in front of Loki and in front of Grafana. Here are shortened versions of our nginx reverse proxy configurations:
Promtail:
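A minimal version of the proxy in front of Promtail’s HTTP endpoint looks roughly like this (hostname, TLS and authentication details are placeholders or omitted):

    server {
        listen 443 ssl;
        server_name promtail.example.com;       # placeholder hostname
        # ssl_certificate / ssl_certificate_key / auth_basic omitted

        location / {
            proxy_pass http://127.0.0.1:9080;   # Promtail's default HTTP port
        }
    }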
Loki:
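For Loki we only pass through the endpoints we actually use and deny everything else (again, hostname and TLS/auth details are placeholders or omitted):

    server {
        listen 443 ssl;
        server_name loki.example.com;           # placeholder hostname
        # ssl_certificate / ssl_certificate_key / auth_basic omitted

        # ingestion endpoint used by Promtail
        location /loki/api/v1/push {
            proxy_pass http://127.0.0.1:3100;
        }

        # query endpoints used by Grafana
        location ~ ^/loki/api/v1/(query|query_range|labels|label|series) {
            proxy_pass http://127.0.0.1:3100;
        }

        # live tailing needs a WebSocket connection
        location /loki/api/v1/tail {
            proxy_pass http://127.0.0.1:3100;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        # block everything else
        location / {
            deny all;
        }
    }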
Grafana:
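For Grafana the relevant detail is the WebSocket upgrade; without it, live tailing does not work (the hostname is a placeholder, and a map on $http_upgrade would be the cleaner way to set the Connection header):

    server {
        listen 443 ssl;
        server_name grafana.example.com;        # placeholder hostname
        # ssl_certificate / ssl_certificate_key omitted

        location / {
            proxy_pass http://127.0.0.1:3000;   # Grafana's default port
            proxy_set_header Host $host;
            # allow WebSocket upgrades for live tailing
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }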
While we also scrape several logs from files, most log lines are scraped from the systemd journal. We initially adopted the scrape configuration shown in the official documentation. It uses the relabel_configs step to save the unit that created the log line in a label unit. This seemed natural and worked fine on our test infrastructure. But when reading Loki’s documentation on best practices, we discovered that this is highly discouraged: since unit can have many different values – especially since it may contain not only service names, but also dynamically created values such as session-10837.scope – its cardinality is terribly high, and this hurts Loki’s performance severely. Even after filtering the values, the cardinality is much higher than recommended.
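The relabel step in question looks roughly like this:

    scrape_configs:
      - job_name: journal
        journal:
          labels:
            job: systemd-journal
        relabel_configs:
          # copy the systemd unit from the journal metadata into a label
          - source_labels: ['__journal__systemd_unit']
            target_label: 'unit'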
So we removed the label entirely – but then we were missing the information altogether. Another dead end in fixing this was the option json: true of the journal scraper: while it solved the problem by merging the metadata into the log message, it resulted in incompatibilities with logs scraped from files. The solution we finally came up with is the use of pack and unpack: the pack stage embeds extracted values and labels into the log line by packing the log line and labels inside a JSON object. Loki’s query language LogQL has a corresponding unpack operation that extracts the labels and restores the original log message. This way it is possible to filter conveniently by labels while these labels are not indexed but searched by brute force.
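A query then looks like this: the stream is still selected via the indexed labels, and the packed labels become available after the unpack step (values are illustrative):

    {host="webhost1"} | unpack | unit="nginx.service" |= "error"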
Here is a shortened version of our scrape configuration:
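Hostnames and concrete label values below are placeholders; what matters is the combination of relabel_configs and the pack stage:

    scrape_configs:
      - job_name: journal
        journal:
          max_age: 12h
          labels:
            job: systemd-journal
            host: webhost1                        # placeholder hostname
        relabel_configs:
          # pull metadata out of the journal entry ...
          - source_labels: ['__journal__systemd_unit']
            target_label: 'unit'
          - source_labels: ['__journal__cmdline']
            target_label: '_cmdline'
        pipeline_stages:
          # ... and pack it into the log line instead of keeping it
          # as indexed (and high-cardinality) labels
          - pack:
              labels:
                - unit
                - _cmdline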
As you might have noticed in the code snippet, we define a label _cmdline. The underscore has the effect that this label is not rendered prominently in the Grafana log view but only shown after opening the details of a log line. This allows us to store data in labels without cluttering the UI. Similarly, we edit the log lines scraped from log files to remove the timestamp from the log message: Promtail has already parsed and stored it, so keeping it would have no benefit but would make the messages harder to read.
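For the file-scraped logs this is a small pipeline; the path, the regular expression and the time format below are only examples and depend on the individual log file:

    scrape_configs:
      - job_name: somelog
        static_configs:
          - targets: [localhost]
            labels:
              job: somelog
              __path__: /var/log/some.log         # placeholder path
        pipeline_stages:
          # split the leading timestamp off the message ...
          - regex:
              expression: '^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<content>.*)$'
          # ... use it as the entry's timestamp ...
          - timestamp:
              source: ts
              format: '2006-01-02 15:04:05'
          # ... and ship only the remaining message as the log line
          - output:
              source: content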
In addition to the Grafana web UI there is a decent command-line client called logcli, which allows us to use standard Unix tools to process logs further.
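Counting failed SSH logins of the last day, for instance, could look like this (the address is a placeholder; credentials for basic auth go into LOKI_USERNAME and LOKI_PASSWORD):

    export LOKI_ADDR=https://loki.example.com     # placeholder address
    logcli query --since=24h --limit=5000 -o raw \
        '{host="webhost1"} | unpack | unit="sshd.service"' | grep -c 'Failed password'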
Our overall impression of Loki is mostly positive. The documentation leaves room for improvement and Grafana’s log UI could still use some polish – but apart from that, working with Loki is fun and the performance is solid. We are confident that Loki was the right choice for us.