This is a follow-up to the previous post Elastic Stack (formerly ELK) - Logstash (Part 1). We continue where we left off last time and dive right into it. No intro, no explanation of concepts or terms, only configuration of Logstash pipelines.
Date | Change description |
---|---|
2018-03-02 | The first release |
To get the most out of this post, it’s advisable to read the two previous posts in this series:
If you already know that content, here we go.
Create a new pipeline configuration file named /etc/logstash/conf.d/elasticsearch.conf and add the content below:
input {
  file {
    id => "my-app2-id-in"
    path => "/var/log/app2/source.log"
  }
}

output {
  elasticsearch {
    id => "my-app2-id-out"
    hosts => ["http://es1:9200"]
    index => "app2"
  }
}
Add some log entries to our log file:
$ echo $(date -Is) >> /var/log/app2/source.log
Query Elasticsearch to check whether the new index got created:
$ curl es1:9200/_cat/indices?format=yaml
The response we get:
---
- health: "yellow"
  status: "open"
  index: "app2"
  uuid: "hL-CbpmmTm2I_aKryzqj-A"
  pri: "5"
  rep: "1"
  docs.count: "2"
  docs.deleted: "0"
  store.size: "11kb"
  pri.store.size: "11kb"
Query Elasticsearch for the documents of the index app2:
$ curl -s es1:9200/app2/_search? | jq
The response we get:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "app2",
"_type": "doc",
"_id": "YhO3r2EBDLUsSt4lTcKT",
"_score": 1,
"_source": {
"path": "/var/log/app2/source.log",
"host": "ls1",
"@version": "1",
"@timestamp": "2018-02-19T20:18:00.592Z",
"message": "2018-02-19T20:18:00+00:00"
}
},
{
"_index": "app2",
"_type": "doc",
"_id": "YRO2r2EBDLUsSt4l88J0",
"_score": 1,
"_source": {
"message": "2018-02-19T20:17:36+00:00",
"path": "/var/log/app2/source.log",
"@timestamp": "2018-02-19T20:17:37.420Z",
"host": "ls1",
"@version": "1"
}
}
]
}
}
Logstash did its job: it encapsulated our messages into new JSON objects with metadata and forwarded these objects to Elasticsearch. To be precise, the elasticsearch output plugin did that job. The index also got created as specified in the config file.
So far, we’ve used fairly useless data for the log entries. The next section will use more realistic data.
Writing the current timestamp into a file was just a vehicle to have unique log entries, which makes the examples easier to follow. Let’s create more realistic logging data. This will surface some nice problems, which we will solve in the sections after this one.
After logging into the virtual machine ls1 with vagrant ssh ls1, create a Python script /opt/example-app/example.py on server ls1 which logs into a file:
#!/usr/bin/env python
import logging

logging.basicConfig(format='%(asctime)s %(levelname)s '
                           '%(process)d %(thread)d '
                           '%(name)s %(funcName)s %(lineno)s '
                           '%(message)s',
                    level=logging.DEBUG,
                    filename="/var/log/example-app/example.log")

logger = logging.getLogger('example')

def do_something():
    logger.debug('I did something!')


def main():
    logger.info('Started the application.')
    do_something()


if __name__ == '__main__':
    main()
The filename argument of basicConfig is the important part. It ensures that our log record data [1] will be stored in a file, which will be the data source for Logstash.
Make sure the example app can be executed and the log directory exists:
$ chmod +x /opt/example-app/example.py
$ mkdir -p /var/log/example-app
Create a new pipeline file in /etc/logstash/conf.d/example-app.conf:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
Warning
The start_position and sincedb_path settings tell Logstash to forget where its last scanning position for this file was. This means every restart of the Logstash service triggers a scan of the whole file. While this allows me to use the same logging data throughout this post, which makes following the examples easier, it is most likely NOT what you want in a production environment with tons of logging data.
Restart the Logstash service to read the new config:
$ systemctl restart logstash.service
Execute the example app:
$ /opt/example-app/example.py
This creates these entries in /var/log/example-app/example.log:
2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.
2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!
Query Elasticsearch for the documents of the index example:
$ curl -s es1:9200/example/_search? | jq
The response looks like this:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "qV0C42EBmk4Q3X4_1k9M",
"_score": 1,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.894Z",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "ql0C42EBmk4Q3X4_1k9M",
"_score": 1,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.896Z",
"path": "/var/log/example-app/example.log"
}
}
]
}
Note
I trimmed down the JSON responses in this post to keep the focus on the important parts.
Given the knowledge from before, this is the expected outcome.
Assume we have thousands of log entries like this for a big application with tons of functions, classes, modules, log levels and threads: how do we reasonably query only the log messages we care about in a specific scenario?
We could use the built-in query DSL of Elasticsearch [2] to search on the message value only. This would feel like a grep and will probably bring good results at first:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "message" : "DEBUG"
    }
  }
}
' | jq
The response we get:
{
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "ql0C42EBmk4Q3X4_1k9M",
"_score": 0.2876821,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.896Z",
"path": "/var/log/example-app/example.log"
}
}
]
}
}
The query we provided to Elasticsearch matched one of our logging entries.
This is already pretty cool, but maybe we can split the very long logging message into something more meaningful, which can then be queried more easily.
The dissect filter plugin [3] allows you to transform your log entries based on delimiters. It’s a good fit for classical log files like the one from before.
Change the file /etc/logstash/conf.d/example-app.conf to look like this:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The filter block is the part that got added. No changes were made to the rest of the previous configuration.
The dissect filter is based on delimiters but is much smarter than your typical split(' ') method. The mapping we added basically takes message as input and splits it up into parts. Everything before %{ and after } is treated as a delimiter; in this case, it’s only a blank. Everything between %{ and } is a named field. The + appends the value of this field to the previously declared field of the same name. We used it here for the timestamp, %{time} %{+time}. It will become clearer when we see the result of this filter.
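To make the splitting and the + append tangible, here is a rough Python sketch of what this particular mapping does to one of our log lines. This is only an approximation for illustration, not how the plugin is actually implemented:

# Rough approximation of the dissect mapping
# "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
def dissect_like(message):
    # Nine fields are declared, so split on at most eight blanks;
    # everything that remains stays in the last field (msg).
    date_part, time_part, level, pid, thread, loggername, function, line, msg = message.split(" ", 8)
    return {
        "time": date_part + " " + time_part,  # %{+time} appends to %{time}, keeping the delimiter
        "level": level,
        "pid": pid,
        "thread": thread,
        "loggername": loggername,
        "function": function,
        "line": line,
        "msg": msg,
    }

print(dissect_like("2018-03-01 19:20:56,709 INFO 8777 140603093944064 "
                   "example main 18 Started the application."))

Running this prints a dict with the same field names we will see in the Elasticsearch documents below.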
Delete the example index [4] in Elasticsearch and restart the Logstash service:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
Note
It takes a few seconds until Logstash is up and running again and realizes that there are logs to deal with.
Query Elasticsearch with the index example:
$ curl -s es1:9200/example/_search? | jq
The JSON response we get:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "rV0U42EBmk4Q3X4_A09R",
"_score": 1,
"_source": {
"@version": "1",
"level": "INFO",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "18",
"thread": "140603093944064",
"msg": "Started the application.",
"function": "main",
"@timestamp": "2018-03-01T19:39:54.710Z"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "rl0U42EBmk4Q3X4_A09R",
"_score": 1,
"_source": {
"@version": "1",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "14",
"thread": "140603093944064",
"msg": "I did something!",
"function": "do_something",
"@timestamp": "2018-03-01T19:39:54.711Z"
}
}
]
}
This shows that our log entries in the log file got properly split up into fields. Please note that the time field contains the full timestamp, although there is a blank between date and time in the log file. This was made possible by the “append” syntax %{time} %{+time}.
We should now be able to query Elasticsearch for those fields:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "level" : "DEBUG"
    }
  }
}
' | jq
The response we get:
{
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "rl0U42EBmk4Q3X4_A09R",
"_score": 0.6931472,
"_source": {
"@version": "1",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "14",
"thread": "140603093944064",
"msg": "I did something!",
"function": "do_something",
"@timestamp": "2018-03-01T19:39:54.711Z"
}
}
]
}
}
Elasticsearch correctly found only one match: the log entry where the level field has the value DEBUG.
Three things aren’t pretty yet:
- The message field is not necessary anymore, as we have all the information in the other fields.
- The @timestamp field doesn’t match our actual timestamp, which is captured in the field time.
- The fields line, pid and thread are treated as strings, although they contain numeric values.
The next sections will show ways to fix that.
The message field from above contains the full log entry, but we don’t need it anymore, as we have split it into its fields. One way to remove that extraneous field is the remove_field filter option.
Change the file /etc/logstash/conf.d/example-app.conf to look like this:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The remove_field line is the one which got added. Like before:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
The response now looks like this:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "sF0h42EBmk4Q3X4_9U-7",
"_score": 1,
"_source": {
"function": "do_something",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"@version": "1",
"thread": "140603093944064",
"@timestamp": "2018-03-01T19:55:08.776Z",
"line": "14",
"msg": "I did something!",
"host": "ls1",
"loggername": "example",
"pid": "8777",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "r10h42EBmk4Q3X4_9U-f",
"_score": 1,
"_source": {
"function": "main",
"level": "INFO",
"time": "2018-03-01 19:20:56,709",
"@version": "1",
"thread": "140603093944064",
"@timestamp": "2018-03-01T19:55:08.774Z",
"line": "18",
"msg": "Started the application.",
"host": "ls1",
"loggername": "example",
"pid": "8777",
"path": "/var/log/example-app/example.log"
}
}
]
}
The unnecessary message field is gone.
It looks much cleaner now. A few more things until we’re happy.
The date filter plugin [5] is capable of taking our time field and using it as input for the @timestamp field:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
  }
  date {
    match => [ "time", "ISO8601" ]
    remove_field => [ "time" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The time field has served its purpose, so we remove it as shown in the previous section. This example also shows nicely that multiple filters can be specified in one config and that they are executed in the given order.
If your log files don’t use ISO8601, you can match other formats, like Unix time in seconds or milliseconds since epoch. Any other format is also valid, e.g. "MMM dd yyyy HH:mm:ss". See the docs at [5] for more details.
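For illustration only, here is roughly what this parsing step means for our time field, expressed as a small Python sketch. The date filter itself is implemented differently and also handles time zones, which are omitted here:

from datetime import datetime

# The application writes e.g. "2018-03-01 19:20:56,709" into the "time" field.
# Parsing it and using the result as the event timestamp is essentially
# what ends up in @timestamp after the date filter has run.
time_field = "2018-03-01 19:20:56,709"
parsed = datetime.strptime(time_field, "%Y-%m-%d %H:%M:%S,%f")
print(parsed.isoformat())  # 2018-03-01T19:20:56.709000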
After changing the file /etc/logstash/conf.d/example-app.conf to look like the configuration above, we repeat the steps:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "sl0p42EBmk4Q3X4_pk-z",
"_score": 1,
"_source": {
"function": "do_something",
"@timestamp": "2018-03-01T19:20:56.709Z",
"msg": "I did something!",
"loggername": "example",
"pid": "8777",
"line": "14",
"host": "ls1",
"thread": "140603093944064",
"@version": "1",
"path": "/var/log/example-app/example.log",
"level": "DEBUG"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "sV0p42EBmk4Q3X4_pk-z",
"_score": 1,
"_source": {
"function": "main",
"@timestamp": "2018-03-01T19:20:56.709Z",
"msg": "Started the application.",
"loggername": "example",
"pid": "8777",
"line": "18",
"host": "ls1",
"thread": "140603093944064",
"@version": "1",
"path": "/var/log/example-app/example.log",
"level": "INFO"
}
}
]
}
The time field is gone and the @timestamp field uses the date and time specified in the original log file.
The dissect filter provides the option convert_datatype, which can convert values to integer or float numbers. We only need conversions to int here:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
    convert_datatype => {
      line => "int"
      pid => "int"
      thread => "int"
    }
  }
  date {
    match => [ "time", "ISO8601" ]
    remove_field => [ "time" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
After changing that in /etc/logstash/conf.d/example-app.conf, it’s the same story as before:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
The JSON response contains numbers now:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "tF0u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "I did something!",
"level": "DEBUG",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 14,
"function": "do_something",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "s10u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "Started the application.",
"level": "INFO",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 18,
"function": "main",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
}
]
}
This can be useful when you search for a range of numeric values [6]:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
      "line" : {
        "gte" : 10,
        "lte" : 15
      }
    }
  }
}
' | jq
The response:
{
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "tF0u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "I did something!",
"level": "DEBUG",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 14,
"function": "do_something",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
}
]
}
}
This example isn’t that impressive, admittedly, but I’m sure you get the principle and have an idea where this might come in handy with your log files.
This post showed how a log file with realistic data can be split into its parts by the dissect filter plugin and brought into the Elasticsearch service for later queries.
While writing this post, a few questions came up in my mind, and maybe you are asking them yourself too:
What if dissect is not powerful enough for my logs?
There is a more powerful filter named grok, which is based on regular expressions [7]. I don’t have plans to dive into that filter at the moment.
Do I need to install a Logstash server on every application server to gather my logs?
Nope, Filebeat to the rescue [8]. That’s an insight I got while writing this post. I will probably look into it at a later point in time, after I have talked about Kibana.
How do I get my syslog in, to get the bigger picture?
There is a syslog input plugin [9], but I couldn’t get it to work yet, although the intro looks easy. I tried different things but have to defer that to a follow-up post.
Doesn’t using the dissect filter create a contract on the structure of my log file?
Yeah, I think so. Adding another field to the logger of your application will most likely break the defined mapping. The structure of log files doesn’t change that often, I guess. I was still wondering whether it may be reasonable to create log files with JSON Lines [10]. This would imply that a log file is no longer meant to be read directly by humans, but to be used as input for a log processing engine like Logstash. Not sure if this change would be welcomed by everyone. Maybe I’ll give it a try in a follow-up post.
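As a teaser, here is a minimal sketch of what such a JSON Lines logger could look like, using only the Python standard library. The field names and the file path example.jsonl are made up for illustration; they are not any kind of standard:

#!/usr/bin/env python
import json
import logging


class JsonLinesFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "pid": record.process,
            "thread": record.thread,
            "loggername": record.name,
            "function": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


handler = logging.FileHandler("/var/log/example-app/example.jsonl")
handler.setFormatter(JsonLinesFormatter())
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.info("Started the application.")

With a file like that, the Logstash pipeline could use the json filter or codec instead of dissect, and the structural contract would move from the filter configuration into the application itself.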
We have enough in place to put a nice GUI on top of it. In ELK, this is Kibana. The next post in this series will show this.
[1] | https://docs.python.org/2/library/logging.html#logrecord-attributes |
[2] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl.html |
[3] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-dissect.html |
[4] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-delete-index.html |
[5] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-date.html |
[6] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-range-query.html |
[7] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-grok.html |
[8] | https://www.elastic.co/products/beats/filebeat |
[9] | https://www.elastic.co/guide/en/logstash/6.2/plugins-inputs-syslog.html |
[10] | http://jsonlines.org/ |