Log Aggregation with Grafana Loki
We recently deployed Loki in our infrastructure, and in this blog post we want to share the pitfalls we ran into and the tips we picked up along the way.
With the growing number of servers and services in our infrastructure, it was time to introduce central log aggregation, so that logs from all our hosts can be searched in one place. This also has security benefits: if the logs are copied to a separate server, a potential attacker cannot easily hide their traces by manipulating log files. The most common software stack for log aggregation is the ELK stack, consisting of Elasticsearch, Logstash, and Kibana. ELK is, however, known to be a burden to maintain, and there is a relatively new alternative that looked promising to us: Loki. After some research on ELK, Loki, and further alternatives we decided to go with Loki.
The central difference between Loki and ELK is that Loki does not use a full-text index. The philosophy of Loki is to only have a small number of indexed labels, like host=webhost1 or selector=postgres, and beyond those to rely on highly parallelised brute-force search.
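To illustrate the idea, a typical query in Loki’s query language LogQL first selects log streams by the indexed labels and then filters the result by brute force. The label values and the search string below are made-up examples:

{host="webhost1", selector="postgres"} |= "connection timed out"

Only the stream selector in curly braces uses the index; the |= line filter is applied to every log line of the selected streams.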
Loki works in combination with Grafana as user interface – we already used Grafana, so this was no additional component. The software that scrapes the logs and ships them to the Loki instance is called Promtail. Its similarity with Prometheus is not limited to the name, but there is one central difference: Promtail is usually installed on every host whose logs should be aggregated, whereas Prometheus typically scrapes its targets over the network from a central instance.
Since we don’t want to rely on VPNs to encrypt and authenticate the traffic between Promtail and Loki, we put both behind an nginx reverse proxy. The nginx configuration for Promtail is fairly straightforward. In the proxy in front of Loki, however, we block access to the API endpoints we are currently not using, to reduce the attack surface. To enable Loki’s live-tail mode in Grafana, one needs to allow WebSocket connections in the nginx configuration of the reverse proxies in front of Loki and in front of Grafana. Here are shortened versions of our nginx reverse proxy configurations:
Promtail:
upstream promtail {
    server 127.0.0.1: fail_timeout=0;
}

server {
    listen ssl http2;
    listen [::]: ssl http2;
    server_name ;

    location / {
        auth_basic "closed site";
        auth_basic_user_file /etc/nginx/promtail.htpasswd;
        proxy_pass http://promtail;
    }
}
Loki:
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream loki {
    server 127.0.0.1: fail_timeout=0;
}

server {
    listen ssl http2;
    listen [::]: ssl http2;
    server_name ;

    auth_basic "closed site";
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    # Push logs
    location /loki/api/v1/push {
        auth_basic_user_file loki_push.htpasswd;
        proxy_pass http://loki;
    }
    location /distributor/ring {
        auth_basic_user_file loki_push.htpasswd;
        proxy_pass http://loki;
    }

    # Query logs
    location ~ /loki/api/v1/(index|label|labels|query|query_range|rules|series|tail) {
        satisfy all;
        allow 111.222.333.444; # IP of our grafana host
        deny all;
        auth_basic_user_file loki_query.htpasswd;
        proxy_pass http://loki;
    }

    # Monitor loki with Prometheus
    location ~ /(metrics|prometheus) {
        auth_basic_user_file prometheus.htpasswd;
        proxy_pass http://loki;
    }

    # Block currently unused endpoints to reduce attack surface
    location /api/prom {
        deny all;
    }
    location /loki/api/v1/ {
        deny all;
    }
    location ~* /(ingester|flush|compactor) {
        deny all;
    }
    location / {
        deny all;
    }
}
Grafana:
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream grafana {
    server 127.0.0.1:3000;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ;

    index index.php;

    location / {
        proxy_pass http://localhost:3000/;
        proxy_set_header Host $http_host;
    }

    location ~ /(api/datasources/proxy/\d+/loki) {
        proxy_pass http://localhost:3000$request_uri;
        proxy_set_header Host $http_host;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Upgrade $http_upgrade;
    }

    location /api/live/ws {
        proxy_pass http://localhost:3000$request_uri;
        proxy_set_header Host $http_host;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Upgrade $http_upgrade;
    }
}
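On the sending side, Promtail then has to authenticate against the push endpoint protected above. As a rough sketch, the clients section of a Promtail configuration could look like the following – the hostname, username and password file are placeholders, not our actual values:

clients:
  - url: https://loki.example.com/loki/api/v1/push  # hypothetical hostname
    basic_auth:
      username: promtail                            # placeholder credentials
      password_file: /etc/promtail/loki_push_password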
While we also scrape several logs from files, most log lines get scraped from the systemd journal. We first adopted the scrape configuration shown in the official documentation. It uses the relabel_configs step to save the unit that created the log line in a label unit. This seemed natural and worked fine on our test infrastructure. But when reading Loki’s documentation on best practices, we discovered that this is highly discouraged: since unit can have many different values – especially since it may not only contain service names, but also dynamically created values such as session-10837.scope – its cardinality is terribly high, and this hurts Loki’s performance severely. Even after filtering the values, the cardinality is much higher than recommended.
So we removed the label entirely – but now we were missing the information altogether. Another dead end was the option json: true of the journal scraper: while it solved the problem by merging the metadata into the log message, it resulted in incompatibilities with logs scraped from files. The solution we finally came up with is the usage of pack and unpack: the pack stage embeds extracted values and labels into the log line by packing the log line and the labels inside a JSON object. Loki’s query language LogQL has a corresponding unpack operation that extracts the labels and restores the original log message. This way it is possible to filter conveniently by labels while these labels are not indexed but searched brute-force.
Here is a shortened version of our scrape configuration:
- job_name: journal
  journal:
    path:
    labels:
      job: journal
  relabel_configs:
    # Drop log lines from the kernel
    - action: drop
      source_labels: ['__journal__transport']
      regex: 'kernel'
    # The scraper saves metadata in ephemeral labels whose names
    # start with `__` and that will be discarded after the
    # relabel_configs step.
    # Rename those of interest to us to keep them.
    # Notice that there are sometimes one and sometimes two
    # underscores after `journal`.
    # This is due to a systemd naming scheme,
    # see systemd.journal-fields(7).
    #
    # Unit, e.g. nginx.service
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'
    # Level, e.g. warning
    - source_labels: ['__journal_priority_keyword']
      target_label: 'level'
    # Syslog identifier, e.g. docker
    # One service might use multiple syslog identifiers and the
    # other way around
    - source_labels: ['__journal_syslog_identifier']
      target_label: 'syslog_identifier'
    # Command line of the process that printed the log, e.g.
    # /usr/bin/docker start -a my_container
    - source_labels: ['__journal__cmdline']
      target_label: '_cmdline'
  pipeline_stages:
    # Drop messages from selected units
    - drop:
        source: "unit"
        expression: "(dhclient|systemd-udevd|systemd-tempfiles)"
    # Add a label `selector=postfix` to log lines from postfix.
    # We will *not* pack this (compare below), therefore it
    # will become a regular indexed label.
    # We can use this to select all postfix logs in a performant
    # way (index & no regexp).
    - match:
        selector: '{unit=~"postfix.*"}'
        stages:
          - static_labels:
              selector: "postfix"
    # Pack all labels that should not be indexed into the log
    # message, which then looks like:
    # { unit: "sshd.service", level: "warning",
    #   syslog_identifier: "sshd", _cmdline: "/usr/bin/sshd",
    #   _entry: "original message by sshd" }
    - pack:
        labels:
          - unit
          - level
          - syslog_identifier
          - _cmdline

- job_name: fail2ban
  static_configs:
    - targets:
        - localhost
      labels:
        job: fail2ban
        __path__: /var/log/fail2ban.log
  pipeline_stages:
    # Remove the time stamp from the log line, see explanation below
    - replace:
        expression: "^(\\d+-\\d+-\\d+ \\d+:\\d+:\\d+,\\d+ )"
    # No need to pack anything, we only use pack to be compatible
    # with the systemd logs. This way we can always use `unpack`
    # in the query, no matter where the message originated from
    - pack:
        labels:
As you might have noticed in the code snippet, we define a label _cmdline. The underscore has the effect that this label is not rendered prominently in the Grafana log view, but only shown after opening the details of a log line. This allows us to store data in labels without messing up the UI. Similarly, we edit the log lines scraped from log files to remove the timestamp from the log message. Promtail has already parsed and stored it, so keeping it would have no benefit but would make the messages harder to read.
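To give an impression of how the packed labels are used at query time, a query against the configuration above might look like this – the concrete label values are, again, just examples:

{job="journal"} | unpack | unit="sshd.service" | level="warning"

The unpack parser restores the original message as the displayed log line and turns unit, level, and the other packed fields into labels that can be filtered on without being indexed.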
In addition to the Grafana web UI there is a decent command-line client called logcli, which allows us to use standard Unix tools to process logs further.
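As a sketch of how this can look in practice – the address is a placeholder, the command has to run from a host that is allowed to reach the query endpoints behind our proxy, and the exact flags may differ between logcli versions – one could count recent fail2ban bans like this:

logcli --addr=https://loki.example.com query --since=24h --limit=5000 '{job="fail2ban"} | unpack' | grep -c "Ban"

Here logcli prints the matching log lines to stdout and grep does the counting.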
Our overall impression of Loki is mostly positive. The documentation leaves room for improvement and the log UI in Grafana could still use some polish – but apart from that, working with Loki is fun and the performance is solid. We are confident that Loki was the right choice for us.