Manuel Aldana — Top-Tools for Log-Analysis (CLI based)

< ‘Embrace Failure’ approaches in Software Development

Unit-Testing: Situations when NOT to do it >

March 13th, 2011 · No Comments

Top-Tools for Log-Analysis (CLI based)

After all the years it is still remarkable how powerful plain text is for software development (human readability, advanced SCM tools, expressiveness, no GUI necessary, etc.). This applies also on logging, where plain text makes a very nice source for doing adhoc analysis. Following samples show how bash, its interprocess-communication piping and the usual suspect tools (awk, grep, sed, tr, etc.) give you great control when doing analysis.

Format

Logs should have a standard format, so analysis can rely on structured output and operate on it, programs are less flexible as humans and need strict patterns. Most often such logs are line-based and white-space separate based. Fortunately most logging frameworks make this a non brainer (e.g apache log-format settings, log4j).

Usual log-pattern, rows/lines form single datasets and are most often ordered by time, because they are appended by the software over time, they often contain also timestamps. The lines most often are independant of each other, but for each line data needs to follow strict order (apart from loose text like error-messages or furhter stack traces):

dataset1(line1): data1 data2 data3
dataset2(line2): data1 data2 data3
dataset3(line3): data1 data2 data3
…

To better understand examples I refer to apache-logging pattern (HTTP based web-traffic). Still all the example adhoc queries can be applied to all other log-output target/formats:

…
127.0.0.1 – - [29/Feb/2011:13:58:50 +0100] “GET /api/image/status/ok HTTP/1.1″ 200 476 “-” “Jakarta Commons-HttpClient/3.1″
…

Log analysis in Action

Quick overview for further analysis steps:

# using ‘less’ so reading longer output is easier
grep ” 200 “ log-file | less

# only direct stuff to ‘less’ which is of interest, in this case show relative URI path (7th position)
grep ” 200 “ log-file | awk ‘{print $7}’ | less

# plain text is good candidate for compressing therefore on production
# environment files are often zipped to save disk space, therefore use zgrep
zgrep * | awk ‘{print $7}’ | less

Counts:

# count of HTTP calls which had 404 status
grep -c ” 404 “ log-file
# more verbose equivalent
cat log-file | grep ” 404 “ | wc -l

# count of all other but 404 HTTP status (inverse grep)
grep -v -c ” 404 “ log-file

# piping (filtering step-by-step)
# showing number of failed (non 201 status) image uploads (POST)
cat log-file | grep -v ” 201 “ | grep POST | grep image | wc -l

Filtering:

# spit out all respone times of HTTP 200 status
# response time is whitespace-separated on 10th position
grep ” 200 “ log-file | awk ‘{print $10}’

# show number of unique called URLs (stand on 7th position)
cat log-file | awk ‘{print $7}’ | uniq

# show slowest response time
grep ” 200 “ log-file | awk ‘{print $10}’ | sort -n | tail -n1
# show fastest response time
grep ” 200 “ log-file | awk ‘{print $10}’ | sort -n | head -n1

# filter lines ‘from,to’ (e.g. analyse critical time section)
cat log-file | sed -n ‘20,30′

Transformations:

# chain response times with ‘,’ instead of linebreaks ‘\n’
grep ” 200 “ log-file | awk ‘{print $10}’ | tr -d “\n“ “,”

# delete/omit lines 10-20
cat log-file | awk ‘{print $10}’ | sed ‘20,30d’
# slightly shorter
awk ‘{print $10}’ < log-file | sed ‘20,30d’

Side-Note: While running adhoc log-files queries, often it is enough to go the pragmatic way, e.g. above I used the ” 404 ” in grep command which is not correct (404 could occur inside the response time part also). Never the less it is quicker and shorter to write and collision is unlikely.

Server farm approach (horizontal scaled)

How do you run log-analysis in server-farm environments (lots of machines are used to scale up horizontally)? In this situation logs are distributed around on each machine’s log directory. An option here is to have a dedicated more storage focused server which runs a batch/cron-job over night copying all the server logs to a single place. This of course is practicable if the files aren’t too huge and can be processed in reasonable time by single script. If this is not the case you may want to use a more scalable sophisticated solution (like MapReduce with Hadoop). Another alternative is to analyse log on single machine and extrapolate. Most likely you use a load-balancer which causes the traffic to be distributed equally on each machine, i.e. running the same log-query on each machine would give back similar results. The big advantage of the single machine is simplicity: no infrastructure necessary (terminal and log-file sufficient), commands are simple one-liners and speed (only one machine log-file needs to be analysed).

Summary

The CLI triangle (bash, its utilities, piping) form a very convenient ad-hoc query-interface for logs. At first glance they look a bit cryptic, but you get used to it very quickly. Also you already achieve a lot by only knowing a small subset of functionality (e.g. I use only a minimum subset of ‘awk’ and ’sed’).

Am happy about any comments/hints of other lightweight tooling for log-analysis.

Tags: Uncategorized

< ‘Embrace Failure’ approaches in Software Development

Unit-Testing: Situations when NOT to do it >

0 responses

You must log in to post a comment.

Last entries

About me

Blog Categories

Important SE Links