This is a follow-up to the previous post Elastic Stack (formerly ELK) - Logstash (Part 1). We continue where we left off last time and dive right into it. No intro, no explanation of concepts or terms, only configuration of Logstash pipelines.
Date | Change description |
---|---|
2018-03-02 | The first release |
To get the most out of this post, it’s advisable to read the two previous posts in this series:
If you already know that content, here we go.
Create a new pipeline configuration file named /etc/logstash/conf.d/elasticsearch.conf and add the content below:
input {
  file {
    id => "my-app2-id-in"
    path => "/var/log/app2/source.log"
  }
}

output {
  elasticsearch {
    id => "my-app2-id-out"
    hosts => ["http://es1:9200"]
    index => "app2"
  }
}
Add some log entries to our log file:
$ echo $(date -Is) >> /var/log/app2/source.log
Query Elasticsearch to check whether the new index got created:
$ curl es1:9200/_cat/indices?format=yaml
The response we get:
---
- health: "yellow"
  status: "open"
  index: "app2"
  uuid: "hL-CbpmmTm2I_aKryzqj-A"
  pri: "5"
  rep: "1"
  docs.count: "2"
  docs.deleted: "0"
  store.size: "11kb"
  pri.store.size: "11kb"
Query Elasticsearch for the documents of the index app2:
$ curl -s es1:9200/app2/_search? | jq
The response we get:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "app2",
"_type": "doc",
"_id": "YhO3r2EBDLUsSt4lTcKT",
"_score": 1,
"_source": {
"path": "/var/log/app2/source.log",
"host": "ls1",
"@version": "1",
"@timestamp": "2018-02-19T20:18:00.592Z",
"message": "2018-02-19T20:18:00+00:00"
}
},
{
"_index": "app2",
"_type": "doc",
"_id": "YRO2r2EBDLUsSt4l88J0",
"_score": 1,
"_source": {
"message": "2018-02-19T20:17:36+00:00",
"path": "/var/log/app2/source.log",
"@timestamp": "2018-02-19T20:17:37.420Z",
"host": "ls1",
"@version": "1"
}
}
]
}
}
Logstash did its job: it encapsulated our messages into new JSON objects with metadata and forwarded these objects to Elasticsearch. To be precise, the elasticsearch output plugin did that job. The index also got created as specified in the config file.
So far, we’ve used fairly useless data for the log entries. The next section will use more realistic data.
Writing the current timestamp into a file was just a vehicle to have unique log entries, which makes the examples easier to follow. Let’s create more realistic logging data. This will surface some nice problems, which we will solve in the sections after this one.
After logging into the virtual machine ls1 with vagrant ssh ls1, create a Python script /opt/example-app/example.py on server ls1 which logs into a file:
#!/usr/bin/env python
import logging

logging.basicConfig(format='%(asctime)s %(levelname)s '
                           '%(process)d %(thread)d '
                           '%(name)s %(funcName)s %(lineno)s '
                           '%(message)s',
                    level=logging.DEBUG,
                    filename="/var/log/example-app/example.log")

logger = logging.getLogger('example')

def do_something():
    logger.debug('I did something!')


def main():
    logger.info('Started the application.')
    do_something()


if __name__ == '__main__':
    main()
The filename argument of basicConfig is the important part. It ensures that our log record data [1] will be stored in a file, which will be the data source for Logstash.
Make sure the example app can be executed and the log directory exists:
$ chmod +x /opt/example-app/example.py
$ mkdir -p /var/log/example-app
Create a new pipeline file in /etc/logstash/conf.d/example-app.conf:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
Warning
The start_position and sincedb_path settings tell Logstash to forget where its last scanning position for this file was. This means every restart of the Logstash service triggers a scan of the whole file. While this allows me to use the same logging data throughout this post, which makes following the examples easier, it is most likely NOT what you want in a production environment with tons of logging data.
Restart the Logstash service to read the new config:
$ systemctl restart logstash.service
Execute the example app:
$ /opt/example-app/example.py
This creates these entries in /var/log/example-app/example.log:
2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.
2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!
Query Elasticsearch for the documents of the index example:
$ curl -s es1:9200/example/_search? | jq
The response looks like this:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "qV0C42EBmk4Q3X4_1k9M",
"_score": 1,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.894Z",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "ql0C42EBmk4Q3X4_1k9M",
"_score": 1,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.896Z",
"path": "/var/log/example-app/example.log"
}
}
]
}
Note
I trimmed down the JSON responses in this post to keep the focus on the important parts.
Given the knowledge from before, this is the expected outcome.
Assume we have thousands of log entries like this for a big application with tons of functions, classes, modules, log levels and threads: how do we reasonably query only the log messages we care about in a specific scenario?
We could use the built-in query DSL of Elasticsearch [2] to search on the message value only. This would feel like a grep and will probably bring good results at first:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "message" : "DEBUG"
    }
  }
}
' | jq
The response we get:
{
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "ql0C42EBmk4Q3X4_1k9M",
"_score": 0.2876821,
"_source": {
"host": "ls1",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"@version": "1",
"@timestamp": "2018-03-01T19:21:08.896Z",
"path": "/var/log/example-app/example.log"
}
}
]
}
}
The query we provided to Elasticsearch matched one of our logging entries.
This is already pretty cool, but maybe we can split the very long logging message into something more meaningful, which can then be queried more easily.
The dissect filter plugin [3] allows you to transform your log entries based on delimiters. It’s a good fit for classical log files like the one from before.
Change the file /etc/logstash/conf.d/example-app.conf to look like this:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The filter block is the part that got added. No changes were made to the rest of the previous configuration.
The dissect filter is based on delimiters but is much smarter than your typical split(' ') method. The mapping we added basically takes message as input and splits it up into parts. Everything before %{ and after } is treated as a delimiter; in this case, it’s only a blank. Everything between %{ and } is a named field. The + appends the value of this field to the previously declared field of the same name. We used it here for the timestamp, %{time} %{+time}. It will become clearer when we see the result of this filter.
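To make the splitting and the + append tangible, here is a rough Python sketch of what this particular mapping does to one of our log lines. This is only an approximation for illustration, not how the plugin is actually implemented:

# Rough approximation of the dissect mapping
# "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
def dissect_like(message):
    # Nine fields are declared, so split on at most eight blanks;
    # everything that remains stays in the last field (msg).
    date_part, time_part, level, pid, thread, loggername, function, line, msg = message.split(" ", 8)
    return {
        "time": date_part + " " + time_part,  # %{+time} appends to %{time}, keeping the delimiter
        "level": level,
        "pid": pid,
        "thread": thread,
        "loggername": loggername,
        "function": function,
        "line": line,
        "msg": msg,
    }

print(dissect_like("2018-03-01 19:20:56,709 INFO 8777 140603093944064 "
                   "example main 18 Started the application."))

Running this prints a dict with the same field names we will see in the Elasticsearch documents below.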
Delete the example index [4] in Elasticsearch and restart the Logstash service:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
Note
It takes a few seconds until Logstash is up and running again and realizes that there are logs to deal with.
Query Elasticsearch with the index example:
$ curl -s es1:9200/example/_search? | jq
The JSON response we get:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "rV0U42EBmk4Q3X4_A09R",
"_score": 1,
"_source": {
"@version": "1",
"level": "INFO",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 INFO 8777 140603093944064 example main 18 Started the application.",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "18",
"thread": "140603093944064",
"msg": "Started the application.",
"function": "main",
"@timestamp": "2018-03-01T19:39:54.710Z"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "rl0U42EBmk4Q3X4_A09R",
"_score": 1,
"_source": {
"@version": "1",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "14",
"thread": "140603093944064",
"msg": "I did something!",
"function": "do_something",
"@timestamp": "2018-03-01T19:39:54.711Z"
}
}
]
}
This shows that our log entries in the log file got properly split up into fields. Please note that the time field contains the full timestamp, although there is a blank between date and time in the log file. This was made possible by the “append” syntax %{time} %{+time}.
We should now be able to query Elasticsearch for those fields:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "level" : "DEBUG"
    }
  }
}
' | jq
The response we get:
{
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "rl0U42EBmk4Q3X4_A09R",
"_score": 0.6931472,
"_source": {
"@version": "1",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"pid": "8777",
"message": "2018-03-01 19:20:56,709 DEBUG 8777 140603093944064 example do_something 14 I did something!",
"host": "ls1",
"path": "/var/log/example-app/example.log",
"loggername": "example",
"line": "14",
"thread": "140603093944064",
"msg": "I did something!",
"function": "do_something",
"@timestamp": "2018-03-01T19:39:54.711Z"
}
}
]
}
}
Elasticsearch correctly found only one match: the log entry where the level field has the value DEBUG.
Three things aren’t pretty yet:
- The message field is not necessary anymore, as we have all the information in the other fields.
- The @timestamp field doesn’t match our actual timestamp, which is captured in the field time.
- The fields line, pid and thread are treated as strings, although they contain numeric values.
The next sections will show ways to fix that.
The message field from above contains the full log entry, but we don’t need it anymore, as we have split it into its fields. One way to remove that extraneous field is the remove_field filter option.
Change the file /etc/logstash/conf.d/example-app.conf to look like this:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The remove_field line is the one which got added. Like before:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
The response now looks like this:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "sF0h42EBmk4Q3X4_9U-7",
"_score": 1,
"_source": {
"function": "do_something",
"level": "DEBUG",
"time": "2018-03-01 19:20:56,709",
"@version": "1",
"thread": "140603093944064",
"@timestamp": "2018-03-01T19:55:08.776Z",
"line": "14",
"msg": "I did something!",
"host": "ls1",
"loggername": "example",
"pid": "8777",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "r10h42EBmk4Q3X4_9U-f",
"_score": 1,
"_source": {
"function": "main",
"level": "INFO",
"time": "2018-03-01 19:20:56,709",
"@version": "1",
"thread": "140603093944064",
"@timestamp": "2018-03-01T19:55:08.774Z",
"line": "18",
"msg": "Started the application.",
"host": "ls1",
"loggername": "example",
"pid": "8777",
"path": "/var/log/example-app/example.log"
}
}
]
}
The unnecessary message field is gone.
It looks much cleaner now. A few more things until we’re happy.
The date filter plugin [5] is capable of taking our time field and using it as input for the @timestamp field:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
  }
  date {
    match => [ "time", "ISO8601" ]
    remove_field => [ "time" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
The time field has served its purpose, so we remove it as shown in the previous section. This example also shows nicely that multiple filters can be specified in one config and that they are executed in the given order.
If your log files don’t use ISO8601, you can match other formats, like Unix time in seconds or milliseconds since epoch. Any other format is also valid, e.g. "MMM dd yyyy HH:mm:ss". See the docs at [5] for more details.
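For illustration only, here is roughly what this parsing step means for our time field, expressed as a small Python sketch. The date filter itself is implemented differently and also handles time zones, which are omitted here:

from datetime import datetime

# The application writes e.g. "2018-03-01 19:20:56,709" into the "time" field.
# Parsing it and using the result as the event timestamp is essentially
# what ends up in @timestamp after the date filter has run.
time_field = "2018-03-01 19:20:56,709"
parsed = datetime.strptime(time_field, "%Y-%m-%d %H:%M:%S,%f")
print(parsed.isoformat())  # 2018-03-01T19:20:56.709000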
After changing the file /etc/logstash/conf.d/example-app.conf to look like the configuration above, we repeat the steps:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "sl0p42EBmk4Q3X4_pk-z",
"_score": 1,
"_source": {
"function": "do_something",
"@timestamp": "2018-03-01T19:20:56.709Z",
"msg": "I did something!",
"loggername": "example",
"pid": "8777",
"line": "14",
"host": "ls1",
"thread": "140603093944064",
"@version": "1",
"path": "/var/log/example-app/example.log",
"level": "DEBUG"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "sV0p42EBmk4Q3X4_pk-z",
"_score": 1,
"_source": {
"function": "main",
"@timestamp": "2018-03-01T19:20:56.709Z",
"msg": "Started the application.",
"loggername": "example",
"pid": "8777",
"line": "18",
"host": "ls1",
"thread": "140603093944064",
"@version": "1",
"path": "/var/log/example-app/example.log",
"level": "INFO"
}
}
]
}
The time field is gone and the @timestamp field uses the date and time specified in the original log file.
The dissect filter provides the option convert_datatype, which can convert values to integer or float numbers. We only need conversions to int here:
input {
  file {
    id => "example-in"
    path => "/var/log/example-app/example.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  dissect {
    mapping => {
      "message" => "%{time} %{+time} %{level} %{pid} %{thread} %{loggername} %{function} %{line} %{msg}"
    }
    remove_field => [ "message" ]
    convert_datatype => {
      line => "int"
      pid => "int"
      thread => "int"
    }
  }
  date {
    match => [ "time", "ISO8601" ]
    remove_field => [ "time" ]
  }
}

output {
  elasticsearch {
    id => "example-out"
    hosts => ["http://es1:9200"]
    index => "example"
  }
}
After changing that in /etc/logstash/conf.d/example-app.conf, it’s the same story as before:
$ curl -X DELETE es1:9200/example/
$ systemctl restart logstash.service
$ curl -s es1:9200/example/_search? | jq
The JSON response contains numbers now:
{
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "tF0u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "I did something!",
"level": "DEBUG",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 14,
"function": "do_something",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
},
{
"_index": "example",
"_type": "doc",
"_id": "s10u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "Started the application.",
"level": "INFO",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 18,
"function": "main",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
}
]
}
This can be useful when you search for a range of numeric values [6]:
$ curl -s 'es1:9200/example/_search' -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
      "line" : {
        "gte" : 10,
        "lte" : 15
      }
    }
  }
}
' | jq
The response:
{
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "example",
"_type": "doc",
"_id": "tF0u42EBmk4Q3X4_6k-i",
"_score": 1,
"_source": {
"msg": "I did something!",
"level": "DEBUG",
"loggername": "example",
"@timestamp": "2018-03-01T19:20:56.709Z",
"thread": 140603093944064,
"line": 14,
"function": "do_something",
"host": "ls1",
"pid": 8777,
"@version": "1",
"path": "/var/log/example-app/example.log"
}
}
]
}
}
This example isn’t that impressive, admittedly, but I’m sure you get the principle and have an idea where this might come in handy with your log files.
This post showed how a log file with realistic data can be split into its parts by the dissect filter plugin and brought into the Elasticsearch service for later queries.
While writing this post, a few questions came up in my mind, and maybe you are asking them yourself too:
What if dissect is not powerful enough for my logs?
There is a more powerful filter named grok, which is based on regular expressions [7]. I don’t have plans to dive into that filter at the moment.
Do I need to install a Logstash server on every application server to gather my logs?
Nope, Filebeat to the rescue [8]. That’s an insight I got while writing this post. I will probably look into it at a later point in time, after I have talked about Kibana.
How do I get my syslog in, to get the bigger picture?
There is a syslog input plugin [9], but I couldn’t get it to work yet, although the intro looks easy. I tried different things but have to defer that to a follow-up post.
Doesn’t using the dissect filter create a contract on the structure of my log file?
Yeah, I think so. Adding another field to the logger of your application will most likely break the defined mapping. The structure of log files doesn’t change that often, I guess. I was still wondering whether it may be reasonable to create log files with JSON Lines [10]. This would imply that a log file is no longer meant to be read directly by humans, but to be used as input for a log processing engine like Logstash. Not sure if this change would be welcomed by everyone. Maybe I’ll give it a try in a follow-up post.
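As a teaser, here is a minimal sketch of what such a JSON Lines logger could look like, using only the Python standard library. The field names and the file path example.jsonl are made up for illustration; they are not any kind of standard:

#!/usr/bin/env python
import json
import logging


class JsonLinesFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "pid": record.process,
            "thread": record.thread,
            "loggername": record.name,
            "function": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


handler = logging.FileHandler("/var/log/example-app/example.jsonl")
handler.setFormatter(JsonLinesFormatter())
logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.info("Started the application.")

With a file like that, the Logstash pipeline could use the json filter or codec instead of dissect, and the structural contract would move from the filter configuration into the application itself.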
We have enough in place to put a nice GUI on top of it. In ELK, this is Kibana. The next post in this series will show this.
[1] | https://docs.python.org/2/library/logging.html#logrecord-attributes |
[2] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl.html |
[3] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-dissect.html |
[4] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-delete-index.html |
[5] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-date.html |
[6] | https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-dsl-range-query.html |
[7] | https://www.elastic.co/guide/en/logstash/6.2/plugins-filters-grok.html |
[8] | https://www.elastic.co/products/beats/filebeat |
[9] | https://www.elastic.co/guide/en/logstash/6.2/plugins-inputs-syslog.html |
[10] | http://jsonlines.org/ |