Elastic Stack (formerly ELK) - Logstash (Part 1)

In this post I’ll talk about the Logstash service, which is part of the Elastic Stack, formerly known as the ELK stack. The purpose of Logstash is to ingest (logging) data, do some transformation or filtering on it, and output it into a data store like Elasticsearch. For details about Elasticsearch, see my previous post Elastic Stack (formerly ELK) - Elasticsearch.

Change history:
Date         Change description
2018-02-16   The first release

Intro

This is the second part of a multi-part series about the Elastic Stack (formerly known as the ELK stack). This stack consists of 3 parts:

  • storing data with Elasticsearch
  • ingesting data with Logstash (and/or Beats)
  • visualizing data with Kibana

This post will focus on the second part, Logstash. I’ll use a virtualized environment with only one virtual machine, named ls1. This virtual machine runs the Logstash service, which provides a REST API on port 9600. But more on that REST API later.

Skip the next section if you don’t want to repeat the steps locally.

Set up the environment

To reproduce the steps in this post, you need to have installed locally:

  • Vagrant (plus a hypervisor it can use, for example VirtualBox)

After these prerequisites are fulfilled:

  1. download the compressed project source files.
  2. extract the archive
  3. change to the env directory
  4. start the Vagrant setup
$ wget http://www.markusz.io/_downloads/elastic-stack-elk-logstash-part1.tar.gz
$ tar -zxvf elastic-stack-elk-logstash-part1.tar.gz
$ cd env
$ vagrant up  # also does all of the installation

After this is fully done, you can access the server with:

$ vagrant ssh         # log into the logstash server
[vagrant@ls1] $ exit  # log out

Note

Once you decide that you don’t need this environment anymore, you can remove it with vagrant destroy -f

While the setup runs for a minute or two, let’s have a look at a few basic terms and concepts of Logstash.

Terms and Concepts

Logstash has a concept of pipelines. It reads data from a source, optionally transforms and/or filters the data, and writes it to a data sink. A pipeline is configured with these three steps:

  1. the input step
  2. the filter step
  3. the output step

[Figure: The pipeline concept of Logstash]

This allows you to run multiple pipelines in parallel, or to let the output of one pipeline be the input of another pipeline. It’s the same idea as pipes in a shell.
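
As a rough analogy in plain shell (not Logstash itself), the same read-transform-write idea looks like this; the filtered.log target file is made up just for this illustration, with grep playing the role of the filter step:

$ cat /var/log/app1/source.log | grep "2018-02" > /var/log/app1/filtered.log  # read, filter, write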

Every atomic piece of input such a pipeline reads is called an event. The filter step can transform such events into a different format or even drop some of them. This makes Logstash very flexible, and you can adjust it to your data.

The steps described above are implemented as plugins. There are many input plugins [4], filter plugins [5] and output plugins [6]. Let’s see in the next section what we can do with these plugins and how to specify a pipeline.
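
If you want to check which plugins your installation actually ships with, the logstash-plugin tool can list them. The path below assumes a package-based installation under /usr/share/logstash, so adjust it to your setup:

$ /usr/share/logstash/bin/logstash-plugin list                 # all installed plugins
$ /usr/share/logstash/bin/logstash-plugin list --group input   # only the input plugins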

Pipeline Configuration

The pipelines get configured with a custom grammar which resembles Ruby a little, but isn’t Ruby. The syntax is described at [7]. For our first pipeline, create a file /etc/logstash/conf.d/logstash-simple.conf with this content:

input {
  file {
    id => "my-app1-id-in"
    path => "/var/log/app1/source.log"
  }
}

output {
  file {
    id => "my-app1-id-out"
    path => "/var/log/app1/target.log"
  }
}

This file gets picked up because Logstash follows the familiar conf.d convention:

$ cat /etc/logstash/pipelines.yml
# This file is where you define your pipelines. You can define multiple.
# For more information on multiple pipelines, see the documentation:
#   https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"
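
As a hedged sketch (not part of this environment), pipelines.yml can hold more than one entry, each with its own id and config path; the second entry below is purely hypothetical:

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"
- pipeline.id: app2                                # hypothetical second pipeline
  path.config: "/etc/logstash/app2-conf.d/*.conf"  # hypothetical path, just for illustration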

This example is rather useless in a real-life setup, but it shows the structure well and how to use one of the many input and output plugins. I deliberately left out the (optional) filter plugin to keep the example simple; a sketch of such a block follows after the list below.

What’s happening here is:

  • we use the file input plugin [8]
  • we gave this plugin instance the ID my-app1-id-in
  • we specified that this input plugin should watch the file /var/log/app1/source.log for changes
  • we configured the file output plugin similarly [9]
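
As announced, the filter step is left out above. A hedged sketch of such a block, using the mutate filter plugin to attach an extra field, could look like this (the field name app and its value are my own choice, purely for illustration):

filter {
  mutate {
    add_field => { "app" => "app1" }  # hypothetical field, just for illustration
  }
}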

Let’s use this pipeline with some dummy data:

$ vagrant ssh
$ echo $(date -Is) >> /var/log/app1/source.log

Execute this last line a few times and take a look at what the Logstash pipeline has output into the target file.
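
If you prefer to generate a few events in one go, a small shell loop does the same; the sleep is only there to get distinct timestamps:

$ for i in 1 2 3; do date -Is >> /var/log/app1/source.log; sleep 2; done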

Tip

The jq CLI is very useful for pretty-printing that JSON output.

Execute cat /var/log/app1/target.log | jq:

{
  "@version": "1",
  "message": "2018-02-14T18:24:19+00:00",
  "@timestamp": "2018-02-14T18:24:21.364Z",
  "path": "/var/log/app1/source.log",
  "host": "ls1"
}
{
  "@version": "1",
  "message": "2018-02-14T18:24:33+00:00",
  "@timestamp": "2018-02-14T18:24:34.414Z",
  "path": "/var/log/app1/source.log",
  "host": "ls1"
}
{
  "@version": "1",
  "message": "2018-02-14T18:24:36+00:00",
  "@timestamp": "2018-02-14T18:24:37.491Z",
  "path": "/var/log/app1/source.log",
  "host": "ls1"
}

As a comparison, this is our source file:

$ cat /var/log/app1/source.log
2018-02-14T18:24:19+00:00
2018-02-14T18:24:33+00:00
2018-02-14T18:24:36+00:00

A few interesting observations with this small example:

  • Logstash encapsulates the message we created into a JSON object and adds metadata like a timestamp, the host and a version
  • there is a small delay between the creation of our message and the timestamp that Logstash adds itself

When working with those pipelines and events, it may become useful to get some insights into Logstash itself. The next section will show how to get them.

Basic interaction with Logstash

As mentioned in the beginning, Logstash provides a REST API that exposes some statistics. This is useful to get some insights, for example whether your configured pipelines are recognized and how many events they have processed.

For this, Logstash needs to bind to an IP address you can reach from outside the server. In my virtual environment, I have this setting in the file /etc/logstash/logstash.yml:

# ------------ Metrics Settings --------------
# Bind address for the metrics REST endpoint
http.host: "192.168.73.12"

To make the following queries a bit easier to read, export the URI of the Logstash server as a variable.

$ export ls1="http://192.168.73.12:9600"

Let’s do some queries.

Request details about the Logstash instance:

$ curl "$ls1/?pretty"

Note

During my experiments, the REST API didn’t come up unless at least one pipeline was defined. I’m not sure if this is a bug or a feature. The section about pipeline configuration above shows how such a pipeline is defined.

Response:

{
  "host" : "ls1",
  "version" : "6.2.1",
  "http_address" : "192.168.73.12:9600",
  "id" : "b961f021-8470-48ad-ba6c-a4f1ca4ca5f1",
  "name" : "ls1",
  "build_date" : "2018-02-07T21:17:29+00:00",
  "build_sha" : "2b141ed331d8372b0cdd01fd1caad330ecc77df6",
  "build_snapshot" : false
}

You’ll notice that we’re on host ls1, as described in the environment section at the beginning of this post, and that we run Logstash in version 6.2. This request helps to figure out whether the instance is running at all.
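
Because this endpoint answers as soon as the instance is up, it also works as a simple health check. A hedged sketch, assuming curl and jq are available on the machine you query from:

$ curl -s "$ls1" > /dev/null && echo "logstash is up" || echo "logstash is down"
$ curl -s "$ls1" | jq -r .version    # should print 6.2.1 in this environment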

Request details about the pipelines:

$ curl "$ls1/_node/pipelines?pretty"

Response:

{
  "host" : "ls1",
  "version" : "6.2.1",
  "http_address" : "192.168.73.12:9600",
  "id" : "b961f021-8470-48ad-ba6c-a4f1ca4ca5f1",
  "name" : "ls1",
  "pipelines" : {
    "main" : {
      "workers" : 4,
      "batch_size" : 125,
      "batch_delay" : 50,
      "config_reload_automatic" : false,
      "config_reload_interval" : 3000000000,
      "dead_letter_queue_enabled" : false
    }
  }
}

There is one pipeline configured, named main. Its configuration was shown in the previous section. To be honest, I have no clue yet what the other key-value pairs in that dictionary mean. My assumption is that they become important in high-availability setups, but let’s ignore them for now.
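
What can be said is that the names match the pipeline settings that can be tuned in /etc/logstash/logstash.yml. Treat the snippet below as a hedged sketch of the corresponding keys (values taken from the output above, not recommendations):

pipeline.workers: 4             # reported as "workers"
pipeline.batch.size: 125        # reported as "batch_size"
pipeline.batch.delay: 50        # reported as "batch_delay"
config.reload.automatic: false  # reported as "config_reload_automatic"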

Request metrics about the events:

$ curl "$ls1/_node/stats/events?pretty"

Response:

{
  "host" : "ls1",
  "version" : "6.2.1",
  "http_address" : "192.168.73.12:9600",
  "id" : "b961f021-8470-48ad-ba6c-a4f1ca4ca5f1",
  "name" : "ls1",
  "events" : {
    "in" : 3,
    "filtered" : 3,
    "out" : 3,
    "duration_in_millis" : 99,
    "queue_push_duration_in_millis" : 0
  }
}

This shows nicely that Logstash’s processing is based on events. The pipeline steps from above can be found here as well, and the three events I created earlier are reflected too.
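
If you only care about the counters, jq can strip away the surrounding node metadata; this assumes jq is installed on the machine you query from:

$ curl -s "$ls1/_node/stats/events" | jq .events    # just the event counters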

Request metrics about the pipeline named main:

$ curl "$ls1/_node/stats/pipelines/main?pretty"

Response:

{
  "host" : "ls1",
  "version" : "6.2.1",
  "http_address" : "192.168.73.12:9600",
  "id" : "b961f021-8470-48ad-ba6c-a4f1ca4ca5f1",
  "name" : "ls1",
  "pipelines" : {
    "main" : {
      "events" : {
        "duration_in_millis" : 99,
        "in" : 3,
        "out" : 3,
        "filtered" : 3,
        "queue_push_duration_in_millis" : 0
      },
      "plugins" : {
        "inputs" : [ {
          "id" : "my-app1-id-in",
          "events" : {
            "out" : 3,
            "queue_push_duration_in_millis" : 0
          },
          "name" : "file"
        } ],
        "filters" : [ ],
        "outputs" : [ {
          "id" : "my-app1-id-out",
          "events" : {
            "duration_in_millis" : 93,
            "in" : 3,
            "out" : 3
          },
          "name" : "file"
        } ]
      },
      "reloads" : {
        "last_error" : null,
        "successes" : 0,
        "last_success_timestamp" : null,
        "last_failure_timestamp" : null,
        "failures" : 0
      },
      "queue" : {
        "type" : "memory"
      }
    }
  }
}

This shows the IDs we specified earlier and which plugin types we used for input and output. A follow-up post will use a filter plugin and show some of its capabilities.

Summary

To keep this post at a digestible size, I make a cut here and will focus in a follow-up post more on the filter plugins, more realistic input, and how to connect to Elasticsearch as a data store.

This post gave a brief overview of the basics, which is necessary before diving deeper into the great possibilities Logstash offers. We’ve seen the pipeline concept and that events get encapsulated into JSON objects. The metrics REST API of Logstash provides some observability.

One thing I didn’t talk about, but which you should be aware of when you consider using Logstash in a production environment, is the fact that there are in-memory queues and persistent queues, and you need to decide which fits your requirements best [10]. Unfortunately I have too little real-life experience to give a recommendation here.
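
For reference, and as far as I understand it, switching from the default in-memory queue to the persistent queue is done via the queue settings in /etc/logstash/logstash.yml; treat this as a hedged sketch, not a recommendation:

queue.type: persisted    # the default is "memory"
queue.max_bytes: 1gb     # example upper bound for the on-disk queue, not a recommendation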