SOFTWARE ENGINEERING blog & .lessons_learned
manuel aldana

July 3rd, 2011

Accessing rrdtool-files Data with Java / Scala

Though rrdtool is widely used, especially for monitoring, it took me a while to find a simple and compatible bridge for reading data from rrd-files with Java. The following hopefully helps others to save some time and to work around some pitfalls.

My requirements were:

  • Monitoring website showing trends of metrics, i.e. comparing current with past values, both absolute and relative (hour over hour, day over day, etc.).
  • Data backend is a round-robin database. The rrd-files are generated by munin (with rrdtool under the hood). Data should be accessed read-only.
  • The website’s technology stack is Java based (Play! framework with Scala integration).

Because I use the Play! framework I needed a way to access rrd-files with Java technology. My first look at rrd4j seemed promising but failed, because rrd4j is a port and cannot read rrd-files created by the original rrdtool. After some time-consuming research and Google digging I finally stumbled upon java-rrd.

Installation Steps for java-rrd

I couldn’t find the library built and distributed in any Maven repository, so you have to download + build it yourself:

# download + unpack
wget http://oss.stamfest.net/java-rrd-hg/archive/tip.tar.gz
tar xfz <downloaded-tarball.tar.gz>
# build .jar library with Maven
cd <untarred-directory>
mvn package
cp target/*.jar <your-target-lib-folder>

Instead of dumb-copying the .jar around and to keep your build system clean, you might want to deploy it to a repository manager like Nexus.

Read-Access Usage of java-rrd

As I use the Play! framework with Scala integration, Scala was the integration language:

import net.stamfest.rrd._
….
val rrd = new RRDp("/tmp", "55555")
val command = Array("fetch", "your-rrd-file.rrd", "MAX", "-r", "1800", "-s", "-1d")
val result = rrd.command(command)

if (!result.ok)
  println(result.error)
else
  println(result.output)

For completeness, the same snippet in Java:

import net.stamfest.rrd.CommandResult;
import net.stamfest.rrd.RRDp;
….
RRDp rrd = new RRDp("/tmp", "55555");
String[] command = {"fetch", "your-rrd-file.rrd", "MAX", "-r", "1800", "-s", "-1d"};
CommandResult result = rrd.command(command);

if (!result.ok)
    System.out.println(result.error);
else
    System.out.println(result.output);

With java-rrd you can also access rrdtool over sockets/network.

Parsing rrdtool output

rrdtool output after fetch is plain text, something like:

        speed


920804700: nan
920805000: 4.0000000000e+02
920805300: 2.0000000000e+03
920805600: 0.0000000000e+00
920806800: 3.3333333333e+01

The above needs to be processed further to be “structured enough” for your code. For example, do the following (I only show Scala and skip the Java iteration syntax hell…, did I mention that I love passing functions? ;):

def outputToNumberPairs(rrdFetchOutput: String) = {
  // filter unneeded + empty lines
  val list = rrdFetchOutput.trim().split("\n").filter((n) => n.contains(":") && !n.contains("nan"))
  // parse strings to numeric values and combine them to a pair
  for (i <- list) yield i.split(":")(0).trim().toLong -> i.split(":")(1).trim().toFloat.round
}

Gives back:

res34: Array[(Long, Int)] = Array((920805000,400), (920805300,2000), (920805600,0), (920806800,33))
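From these (timestamp, value) pairs the trend comparison from the requirements (hour over hour, day over day) is only a few lines away. A minimal sketch, assuming the pairs come from the function above; the function name and the percentage formula are my own:

// compare the latest value against the value from roughly one hour (3600s) earlier;
// returns the relative change in percent, e.g. 25.0 means "25% higher than an hour ago"
def hourOverHour(pairs: Array[(Long, Int)]): Option[Double] = {
  val sorted = pairs.sortBy(_._1)
  for {
    (latestTs, latestValue) <- sorted.lastOption
    (_, pastValue) <- sorted.filter(_._1 <= latestTs - 3600).lastOption
    if pastValue != 0
  } yield (latestValue - pastValue) * 100.0 / pastValue
}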

Tags: Technologies/Tools

May 15th, 2011

‘Embrace Failure’ approaches in Software Development

Software developers (and humans generally) must not try to be 100% perfect but should take imperfection as a given and “embrace” it. The following gives some thoughts on how to approach the ‘embrace failure’ principle in the field of software development.

Log thoughtfully

Proper and insightful logging cannot be overstated: you will praise the developer who provides good logs, i.e. logs in which you can easily search for problem symptoms. Inside the team, find an appropriate logging convention for log patterns + semantics of levels. Some short and abbreviated ideas:

  • DEBUG: Detailed information while going through integration points (like printing the XML payload of an API call). Quantified information like statistics while refreshing/invalidating a cache.
  • INFO: Bootstrapping/startup information, initialization routines, recurring asynchronous actions (including elapsed time).
  • WARN: Suspicious/erroneous application state where a fallback is still possible and the routine/call can proceed.
  • ERROR: No fallback possible. For the user/customer this most likely means a crashed call/request (in HTTP jargon this would be a 500 status). Examples of such failures are NullPointerExceptions or database unavailability.

NEVER EVER swallow Exceptions. I’ve seen several empty catch {} blocks which later caused major headaches in production: on user level the system crashed on many calls, but nothing could be seen inside the logs. At the topmost application layer, log uncaught Exceptions as ERROR (a short sketch follows below).
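A minimal Scala sketch of these conventions, assuming slf4j as the logging facade (the class, method and message names are made up for illustration):

import org.slf4j.LoggerFactory

class ImageUploadService {
  private val log = LoggerFactory.getLogger(classOf[ImageUploadService])

  // hypothetical upload going through an integration point
  def upload(payload: Array[Byte]): Boolean = {
    log.debug(s"calling image api, payload size=${payload.length} bytes")
    try {
      // ... call the external api here ...
      true
    } catch {
      case e: java.io.IOException =>
        // WARN: a fallback (e.g. retry queue) is still possible
        log.warn("image api not reachable, falling back to retry queue", e)
        false
      case e: Exception =>
        // ERROR: no fallback, never swallow -- log with stacktrace and rethrow
        log.error("image upload crashed", e)
        throw e
    }
  }
}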

Also have a look at a longer blog-entry about appropriate logging.

Make deployment + upgrade easy

When a critical failure occurs you want to be able to react quickly. Reaction time spans from learning about the issue until the deployment is finished and the customer sees the change. This includes getting the code (thanks to a fast SCM), understanding the problem, fixing/extending the code, verifying the change in a local and/or QA environment and finally pushing to production. Automate and speed up building the deployment package and executing the deployment as much as possible.

If you are lucky, your app is a typical browser-server based webapp. It has the highest potential for instant deployment: as soon as the new app version has been rolled out to your server machine (or server farm, in case you scaled out), the upgrade is complete and will instantly be visible inside the client’s browser. Yes, webapp deployments have their own pitfalls (a high-traffic site’s rollout isn’t atomic, cache settings, browser state inside DOM or cookies), but other application classes like embedded systems, desktop apps or mobile apps have much tougher deployment requirements and a more complicated upgrade process. Still, Google’s Chrome browser shows that even for these applications a well-designed and transparent upgrade mechanism is possible.

Monitor your application

At some point the system will fail, therefore you have to focus on notifications of the respective problems. A typical and annoying scenario is a ‘false negative’ production system: you think everything is alright, but in fact the production behaviour is unstable and unreliable. Unstable production behaviour can be infrastructure or application related. Infrastructure problems include network problems, disk crashes, external-system/integration-point downtime or not enough capacity/resources to handle the load. Application-related problems are the more “real” application bugs like unchecked null references, deadlocks, infinite loops, thread-unsafety or simply wrongly implemented requirements.

To sense + sanity-check such problems you need system hooks which deliver data (e.g. Servlet-Container built-in monitoring, JMX). Sometimes you will also write a little extension to fulfill a specific monitoring requirement (e.g. an ERROR log counter). Apart from data provision you also need an entity which processes the numbers. To not hire monkeys you should use existing tools: for instance munin + rrdtool track the data over time and plot nice graphs, and Nagios alerts you actively in case a metric peaks and a potential incident occurs.
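Such a little extension, e.g. the ERROR log counter, can be exposed via JMX so munin or Nagios can poll it. A minimal sketch under that assumption (the class and object names are illustrative; the counter would be incremented e.g. from a custom log appender):

import java.lang.management.ManagementFactory
import java.util.concurrent.atomic.AtomicLong
import javax.management.ObjectName

// standard MBean contract: the interface name must end with "MBean"
trait ErrorCounterMBean {
  def getErrorCount: Long
}

class ErrorCounter extends ErrorCounterMBean {
  private val count = new AtomicLong(0)
  def increment(): Unit = count.incrementAndGet()
  override def getErrorCount: Long = count.get()
}

object Monitoring {
  val errorCounter = new ErrorCounter
  // register so JMX clients (jconsole, a munin JMX plugin, ...) can read the value
  ManagementFactory.getPlatformMBeanServer.registerMBean(
    errorCounter, new ObjectName("com.example.monitoring:type=ErrorCounter"))
}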

Make it visible

We as humans are also far from perfect at number crunching and interpreting masses of detailed numeric data. Often application “health” degrades over time, so you should deliver comparisons of ‘now’ to historic data (like hour over hour, day over day, week over week). Try to abstract away numbers behind meaningful graphs or charts. The signal-light analogy (green, orange, red) also helps a lot to classify the criticality of numbers.
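As a tiny illustration of the signal-light idea (the metric and the thresholds are made-up examples, not a recommendation):

object Signal extends Enumeration {
  val Green, Orange, Red = Value
}

// classify an error rate (errors per minute) against assumed thresholds
def classifyErrorRate(errorsPerMinute: Double): Signal.Value =
  if (errorsPerMinute < 1.0) Signal.Green
  else if (errorsPerMinute < 10.0) Signal.Orange
  else Signal.Red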

Enable live diagnostics

Sometimes app behaviour isn’t well enough understood and you won’t be able to reproduce a scenario in a local, isolated environment. You then have to try to get more insight into the production system. In case resources are the problem (like CPU, memory), several profilers support attaching to a live production machine instance to get metrics like running threads, heap stats or garbage-collector info.

If you need more semantic application insight and followed the log conventions, you should also make it possible to enable DEBUG or TRACE logs for certain code sections. Make your logging levels configurable at runtime: if you have to restart your app when changing log settings, you wipe away the current application state you want to analyse (issues are often not reproducible straight after a restart).
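With log4j 1.x, for example, switching a logger’s level at runtime is a one-liner which you could expose through a small admin endpoint or a JMX operation (the logger name below is just an example):

import org.apache.log4j.{Level, Logger}

// raise verbosity for one code section only, without touching the rest
def enableDebug(loggerName: String): Unit =
  Logger.getLogger(loggerName).setLevel(Level.DEBUG)

// e.g. triggered from an admin endpoint:
enableDebug("com.example.imageupload")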

Another helpful option is live debugging, where you debug remotely and directly on the problematic server instance. You can directly observe the concrete problem as it is and don’t need to waste effort on local reproduction. One of your users will have bad luck and hit a connection timeout, but that request would have resulted in an error page anyway.

Use Users/Customers as feedback providers

Even if you are running a diagnostics-friendly webapp (the server is under the app provider’s control), you may end up with very weird cases which happen on the client’s side. Therefore make it very easy for customers to give you feedback: either provide a free phone contact or an easy-to-find feedback form. In case you don’t run a webapp, an option would be that in case of an error the user just needs to click a button and some diagnostic data is sent to the server (I remember that when reporting a bug, Ubuntu fetched a good deal of OS configuration data and attached it to the bug report). Some embedded systems don’t even need user action but automatically dump application state to persistent storage before a reset; this data can then be retrieved later by a mechanic in a car repair shop.
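On the JVM such a diagnostic snapshot can be as simple as collecting a few system properties and runtime figures and attaching them to the feedback report. A sketch, with the map keys and the hard-coded app version being purely illustrative:

// collect some basic diagnostic data to attach to a user's feedback/bug report
def diagnosticSnapshot(): Map[String, String] = {
  val rt = Runtime.getRuntime
  Map(
    "app.version"  -> "1.2.3", // hypothetical build info
    "java.version" -> System.getProperty("java.version"),
    "os.name"      -> System.getProperty("os.name"),
    "mem.free.mb"  -> (rt.freeMemory() / 1024 / 1024).toString,
    "mem.max.mb"   -> (rt.maxMemory() / 1024 / 1024).toString
  )
}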

Conclusion

Your production system will fail at some point. You can try to fight this by getting your QA as close as possible to production and investing A LOT in testing. Still, you have to find the sweet spot of the cost/benefit ratio, because this effort increases exponentially (providing the same hardware, 100% test coverage + test data, simulating real user behaviour, etc.). There are software-system classes where this effort must be pushed to the limits, like money- or life-critical systems, but most applications we build aren’t so sensitive that damage cannot be tolerated or reversed. Therefore build in as much diagnostics as sensible and have good tooling + process around a quick alert/fix/deploy roundtrip.

Tags: Software Engineering

March 13th, 2011

Top-Tools for Log-Analysis (CLI based)

After all these years it is still remarkable how powerful plain text is for software development (human readability, advanced SCM tools, expressiveness, no GUI necessary, etc.). This also applies to logging, where plain text makes a very nice source for doing ad-hoc analysis. The following samples show how bash, its inter-process piping and the usual suspect tools (awk, grep, sed, tr, etc.) give you great control when doing analysis.

Format

Logs should have a standard format so analysis can rely on structured output and operate on it; programs are less flexible than humans and need strict patterns. Most often such logs are line-based and whitespace-separated. Fortunately most logging frameworks make this a no-brainer (e.g. Apache log-format settings, log4j).

The usual log pattern: rows/lines form single datasets and are most often ordered by time, because they are appended by the software over time; they usually also contain timestamps. The lines are most often independent of each other, but within each line the data needs to follow a strict order (apart from loose text like error messages or further stack traces):

dataset1(line1): data1 data2 data3
dataset2(line2): data1 data2 data3
dataset3(line3): data1 data2 data3

To better understand the examples I refer to the Apache logging pattern (HTTP-based web traffic). Still, all the example ad-hoc queries can be applied to any other log output target/format:


127.0.0.1 - - [29/Feb/2011:13:58:50 +0100] "GET /api/image/status/ok HTTP/1.1" 200 476 "-" "Jakarta Commons-HttpClient/3.1"

Log analysis in Action

Quick overview for further analysis steps:

# using 'less' so reading longer output is easier
grep " 200 " log-file | less

# only direct stuff to 'less' which is of interest, in this case show the relative URI path (7th position)
grep " 200 " log-file | awk '{print $7}' | less

# plain text is a good candidate for compression, therefore on production
# environments files are often gzipped to save disk space; use zgrep then
zgrep " 200 " log-file.gz | awk '{print $7}' | less

Counts:

# count of HTTP calls which had 404 status
grep -c " 404 " log-file
# more verbose equivalent
cat log-file | grep " 404 " | wc -l

# count of all other but 404 HTTP status (inverse grep)
grep -v -c " 404 " log-file

# piping (filtering step-by-step)
# showing number of failed (non 201 status) image uploads (POST)
cat log-file | grep -v " 201 " | grep POST | grep image | wc -l

Filtering:

# spit out all response times of HTTP 200 status
# response time is whitespace-separated on 10th position
grep " 200 " log-file | awk '{print $10}'

# show number of unique called URLs (standing on 7th position)
cat log-file | awk '{print $7}' | sort -u | wc -l

# show slowest response time
grep " 200 " log-file | awk '{print $10}' | sort -n | tail -n1
# show fastest response time
grep " 200 " log-file | awk '{print $10}' | sort -n | head -n1

# print only lines 'from,to' (e.g. analyse a critical time section)
cat log-file | sed -n '20,30p'

Transformations:

# chain response times with ',' instead of linebreaks '\n'
grep " 200 " log-file | awk '{print $10}' | tr '\n' ','

# delete/omit lines 20-30
cat log-file | awk '{print $10}' | sed '20,30d'
# slightly shorter
awk '{print $10}' < log-file | sed '20,30d'

Side note: while running ad-hoc log-file queries it is often enough to go the pragmatic way, e.g. above I used " 404 " in the grep command, which is not strictly correct (404 could also occur inside the response-time column). Nevertheless it is quicker and shorter to write, and a collision is unlikely.

Server farm approach (horizontally scaled)

How do you run log analysis in a server-farm environment (lots of machines used to scale horizontally)? In this situation logs are distributed across each machine’s log directory. One option is to have a dedicated, more storage-focused server which runs a batch/cron job over night, copying all the server logs to a single place. This of course is practicable if the files aren’t too huge and can be processed in reasonable time by a single script. If this is not the case you may want to use a more scalable, sophisticated solution (like MapReduce with Hadoop). Another alternative is to analyse the log on a single machine and extrapolate: most likely you use a load-balancer which distributes the traffic equally across machines, i.e. running the same log query on each machine would give back similar results. The big advantage of the single-machine approach is simplicity: no infrastructure necessary (a terminal and the log-file are sufficient), commands are simple one-liners, and speed (only one machine’s log-file needs to be analysed).

Summary

The CLI triangle (bash, its utilities, piping) forms a very convenient ad-hoc query interface for logs. At first glance it looks a bit cryptic, but you get used to it very quickly. Also, you already achieve a lot by only knowing a small subset of the functionality (e.g. I use only a minimal subset of ‘awk’ and ‘sed’).

I am happy about any comments/hints on other lightweight tooling for log analysis.

Tags: Uncategorized