Log Aggregation with Grafana Loki

We recently deployed Loki in our infrastructure, and in this blog post we want to share the pitfalls and tips we discovered.

With the growing number of servers and services in our infrastructure, it was time to introduce a central logging infrastructure to aggregate logs from our hosts and allow searching them centrally. This also has security benefits: if the logs are copied to a separate server, a potential attacker cannot easily hide their traces by manipulating log files. The most common software stack for log aggregation is the ELK stack, consisting of Elasticsearch, Logstash, and Kibana. ELK is, however, known to be a burden to maintain, and there is a relatively new alternative that looked promising to us: Loki. After some research on ELK, Loki, and further alternatives, we decided to go with Loki.

The central difference between Loki and ELK is that Loki does not use a full-text index. The philosophy of Loki is to only have a small number of indexed labels, like host=webhost1 or selector=postgres, and beyond those, rely on highly parallelised brute-force search. Loki works in combination with Grafana as its user interface – we already used Grafana, so this was no additional component. The software that scrapes the logs and ships them to the Loki instance is called Promtail. Its similarity with Prometheus is not limited to the name, but there is one central difference: Promtail is usually installed on every host whose logs should be aggregated.
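
For illustration, a LogQL query that selects log streams via the indexed labels and then brute-forces through the matching lines could look like this (the label names are taken from the example above, the filter string is just an example):

{host="webhost1", selector="postgres"} |= "connection refused"

Only the part in curly braces touches the index; the |= line filter is the brute-force part that scans the selected streams.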

Since we don’t want to rely on VPNs to encrypt and authenticate the traffic between Promtail and Loki, we put both behind an nginx reverse proxy. The nginx configuration for Promtail is fairly straightforward. In the proxy in front of Loki, however, we block access to API endpoints that we are currently not using, to reduce the attack surface. To enable Loki’s live-tail mode in Grafana, one needs to allow WebSocket connections in the nginx configuration of the reverse proxy in front of Loki and in front of Grafana. Here are shortened versions of our nginx reverse proxy configurations:

Promtail:

upstream promtail {
   server 127.0.0.1: fail_timeout=0;
}

server {
   listen  ssl http2;
   listen [::]: ssl http2;
   server_name ;

   location / {
      auth_basic           "closed site";
      auth_basic_user_file /etc/nginx/promtail.htpasswd;
      proxy_pass           http://promtail;
   }
}
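
Since nginx terminates TLS and handles authentication, Promtail itself only needs to listen on the loopback interface. A minimal sketch of the corresponding server section of the Promtail configuration – the port shown is an assumption (9080 is Promtail's default) and has to match the one in the upstream block above:

server:
  # Bind Promtail's built-in HTTP server to loopback only, so it is
  # reachable solely through the nginx reverse proxy
  http_listen_address: 127.0.0.1
  http_listen_port: 9080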

Loki:

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream loki {
        server 127.0.0.1: fail_timeout=0;
}

server {
        listen  ssl http2;
        listen [::]: ssl http2;
        server_name ;

        auth_basic			"closed site";
        proxy_set_header	Upgrade $http_upgrade;
        proxy_set_header	Connection $connection_upgrade;

        # Push logs
        location /loki/api/v1/push {
                auth_basic_user_file loki_push.htpasswd;
                proxy_pass 			http://loki;
        }
        location /distributor/ring {
                auth_basic_user_file loki_push.htpasswd;
                proxy_pass 			http://loki;
        }

        # Query logs
        location ~ /loki/api/v1/(index|label|labels|query|query_range|rules|series|tail) {
                satisfy all;
                allow   111.222.333.444;  # IP of our grafana host
                deny    all;
                auth_basic_user_file  loki_query.htpasswd;
                proxy_pass	      http://loki;
        }

        # Monitor loki with Prometheus
        location ~ /(metrics|prometheus) {
                auth_basic_user_file	prometheus.htpasswd;
                proxy_pass 		http://loki;
        }

        # Block currently unused endpoints to reduce attack surface
        location /api/prom {
                deny all;
        }
        location /loki/api/v1/ {
                deny all;
        }
        location ~* /(ingester|flush|compactor) {
                deny all;
        }
        location / {
                deny all;
        }
}
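
On the shipping side, each Promtail instance authenticates against the push endpoint protected above. A sketch of the clients section of the Promtail configuration – the hostname and the password file path are placeholders, and the username must of course match an entry in loki_push.htpasswd:

clients:
  # Ship logs to Loki through the nginx reverse proxy,
  # encrypted with TLS and authenticated via HTTP basic auth
  - url: https://loki.example.com/loki/api/v1/push
    basic_auth:
      username: promtail
      password_file: /etc/promtail/loki_push.password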

Grafana:

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

upstream grafana {
    server 127.0.0.1:3000;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name ;

    index index.php;

    location / {
        proxy_pass http://localhost:3000/;
        proxy_set_header Host $http_host;
    }

    location ~ /(api/datasources/proxy/\d+/loki) {
        proxy_pass          http://localhost:3000$request_uri;
        proxy_set_header    Host              $http_host;
        proxy_set_header    Connection        $connection_upgrade;
        proxy_set_header    Upgrade           $http_upgrade;
    }
    location /api/live/ws {
        proxy_pass          http://localhost:3000$request_uri;
        proxy_set_header    Host              $http_host;
        proxy_set_header    Connection        $connection_upgrade;
        proxy_set_header    Upgrade           $http_upgrade;
    }
}
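
The missing piece is the Loki data source in Grafana, which talks to the query endpoints whitelisted for the Grafana host above. A sketch of a provisioning file – hostname and user are placeholders, the user has to match an entry in loki_query.htpasswd:

# /etc/grafana/provisioning/datasources/loki.yml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: https://loki.example.com
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: "..."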

While we also scrape several logs from files, most log lines get scraped from the systemd journal. We first adopted the scrape configuration shown in the official documentation. It uses the relabel_configs step to save the unit that created the log line in a label unit. This seemed natural and worked fine on our test infrastructure. But when reading Loki’s documentation on best practices, we discovered that this is highly discouraged: since unit can have many different values – especially since it may not only contain service names, but also dynamically created values such as session-10837.scope – its cardinality is terribly high, and this hurts Loki’s performance severely. Even after filtering the values, the cardinality is much higher than recommended.

So we removed the label entirely – but now we were missing the information altogether. Another dead end was the json: true option of the journal scraper: while it solved the problem by merging the metadata into the log message, it resulted in incompatibilities with logs scraped from files. The solution we finally settled on uses pack and unpack: the pack stage embeds extracted values and labels into the log line by packing the log line and the labels inside a JSON object. Loki’s query language LogQL has a corresponding unpack operation that extracts the labels and restores the original log message. This way it is possible to filter conveniently by labels while these labels are not indexed but searched by brute force.
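
To give an idea of how this looks in practice, a query over the packed journal logs might be written like this (the unit value is purely illustrative):

{job="journal"} | unpack | unit="nginx.service" |= "error"

The stream selector {job="journal"} uses the index, unpack restores the original message and turns the packed fields back into labels, and the remaining filters are evaluated by brute force.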

Here is a shortened version of our scrape configuration:

    - job_name: journal
      journal:
        path: 
        labels:
          job: journal
      relabel_configs:
        # Drop log lines from the kernel
        - action: drop
          source_labels: ['__journal__transport']
          regex: 'kernel'
        # The scraper saves metadata in ephemeral labels whose names
        # start with `__` and that will be discarded after the
        # relabel_configs step.
        # Rename those of interest for us to keep them.
        # Notice that there are sometimes one and sometimes two
        # underscores after `journal`
        # This is due to a systemd naming scheme,
        # see systemd.journal-fields(7)
        #
        # Unit, e.g. nginx.service
        - source_labels: ['__journal__systemd_unit']
          target_label: 'unit'
        # Level, e.g. warning
        - source_labels: ['__journal_priority_keyword']
          target_label: 'level'
        # Syslog identifier, e.g. docker
        # one service might use multiple syslog identifiers and
        # vice versa
        - source_labels: ['__journal_syslog_identifier']
          target_label: 'syslog_identifier'
        # Command line of the process that printed the log, e.g.
        # /usr/bin/docker start -a my_container
        - source_labels: ['__journal__cmdline']
          target_label: '_cmdline'
      pipeline_stages:
        # Drop messages from selected units
        - drop:
            source: "unit"
            expression: "(dhclient|systemd-udevd|systemd-tempfiles)"
        # Add a label `selector=postfix` to log lines from postfix,
        # we will *not* pack this (compare below), therefore this
        # will become a regular indexed label.
        # We can use this to select all postfix logs in a performant
        # way (index & no regexp)
        - match:
            selector: '{unit=~"postfix.*"}'
            stages:
                - static_labels:
                    selector: "postfix"
        # Pack all labels, that should not be indexed, into the log
        # message, that then looks like:
        # { unit: "sshd.service", level: "warning",
        # syslog_identifier: "sshd", _cmdline: "/usr/bin/sshd",
        # _entry: "original message by sshd" }
        - pack:
            labels:
                - unit
                - level
                - syslog_identifier
                - _cmdline

    - job_name: fail2ban
      static_configs:
        - targets:
            - localhost
          labels:
            job: fail2ban
            __path__: /var/log/fail2ban.log
      pipeline_stages:
        # Remove the timestamp from the log line, see explanation below
        - replace:
            expression: "^(\\d+-\\d+-\\d+ \\d+:\\d+:\\d+,\\d+ )"
        # No need to pack anything, we only use pack to be compatible
        # with the systemd logs. This way we can always use `unpack`
        # in the query, no matter where the message originated from
        - pack:
            labels:

As you might have noticed in the code snippet, we define a label _cmdline. The leading underscore has the effect that this label is not rendered prominently in the Grafana log view but only shown after opening the details of a log line. This allows us to store data in labels without cluttering the UI. Similarly, we edit the log lines scraped from log files to remove the timestamp from the log message: Promtail has already parsed and stored it, so keeping it would have no benefit but would make the messages harder to read. In addition to the Grafana web UI there is a decent command-line client called logcli, which allows us to use standard Unix tools to process logs further.
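
As a sketch of how this looks on the command line – address and credentials are placeholders:

# Tell logcli where to find Loki and how to authenticate
export LOKI_ADDR=https://loki.example.com
export LOKI_USERNAME=grafana
export LOKI_PASSWORD=...

# Fetch the last hour of postfix logs, unpack the embedded labels
# and count deferred deliveries with standard Unix tools
logcli query --since=1h '{selector="postfix"} | unpack' | grep -c "status=deferred"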

Our overall impression of Loki is mostly positive. The documentation leaves room for improvement and Grafana’s log UI could still be polished – but apart from that, working with Loki is fun and the performance is solid. We are confident that Loki was the right choice for us.