SOFTWARE ENGINEERING blog & .lessons_learned
manuel aldana
Manuel Aldana

>May 15th, 2011 · No Comments

‘Embrace Failure’ approaches in Software Development

Software-Developers (and humans generally) must not try to be 100% perfect but should take the imperfection as granted and “embrace” it. Following gives some thoughts on how to approach the ‘Embracing failure’ principle in the field of Software Development.

Log thoughtfully

Proper and insightful logging cannot be overstated, you will praise the developer who provides good logs, i.e. you can easily search for problem-symptoms inside logs. Inside team find an appropriate logging convention for log-patterns + semantics of level. Some short and abbreviated ideas:

  • DEBUG: Detailed information while going through integration points (like printing payload of XML for api call). Quantified information like statistics while refreshing/invalidating cache.
  • INFO: Bootstrapping/startup information, initialization routines, reoccurring asynchronous actions (including elapsed time).
  • WARN: Logging suspicious/errornous application state, where fallback is still possible and routine/call can be proceeded.
  • ERROR: No fallback possible. For the user/customer this most likely means a crashed call/request (in HTTP jargon this would be a 500 status). Examples for such failures could be NullPointers or database unavailability.

NEVER EVER swallow away Exceptions, I’ve seen several empty catch{} blocks which later caused major headaches during production, on user-level the system crashed on many calls but nothing could be seen inside logs. At topmost application layer log uncaught Exceptions as ERROR.

Also have a look at a longer blog-entry about appropriate logging.

Make deployment + upgrade easy

When a critical failure occurs you want to be able to react quickly. Reaction-time lasts between the time of knowing the issue until the deployment is finished and the customers sees the change. This includes getting the code (thank a fast SCM), understanding the problem, fixing/extending the code, verifying change on local and/or qa-environment and finally push to production. For building deployment package and executing deployment automate and speed up as much as possible.

If you are lucky your app is a typical browser-server based webapp. It has the highest potential regarding instant deployment: As soon as the new app version has been rolled out to your server machine (or server farm, in case you scaled out) the upgrade is complete and will be instantly seen inside the client’s browser. Yes, webapp deployments have their own pitfalls (high traffic site’s rollout isn’t atomic, cache settings, browser state inside DOM or cookies), but other application classes like embedded-systems, desktop-app or mobile-apps have much tougher deployment requirements and upgrade process is more complicated. But Google’s Chrome Webbrowser shows that even for these applications a well designed and transparent upgrade mechanism is possible.

Monitor your application

At some point the system will fail, therefore you have to focus on notfications of respective problems. A typical and annoying scenario is a ‘false negative’ production system: You think everything is alright but in fact the production behaviour is unstable and unreliable. Unstable production behaviour can be infrastructure or application related. Infrastructure problems include network problems, disk crashes, external-system/integration-points downtime or not enough capacity/resources to handle the load. Application related are more “real” usual application bugs like unchecked null-references, dead-locks, infinity-loops, thread-unsafety or simply wrong implemented requirements.

To sense + sanity-check such problems you need system-hooks which deliver data (e.g. Servlet-Container built-in monitoring, JMX). Sometimes you will also write little extension to fulfill a specific monitoring requirement (e.g. ERROR log counter). Apart from data-provision you also need an entity which processes the numbers. To not hire monkeys you should use existing tools. For instance munin + rdd-tool tracks the data over time and plots nice graphs and Nagios alerts you actively in case of a peak of a metric and a potential incident.

Make it visible

We as humans are also far from perfect at number crunching and intepreting masses of detailed numeric data. Often application “health” degrades over time so you should deliver comparison of ‘now’ to historic data (like hour over hour, day over day, week over week). Try to abstract away numbers behind meaningful graphs or charts. The signal light analogy (green, orange, red) also helps a lot to classify the criticality of numbers.

Enabling live diagnostics

Sometimes app behaviour isn’t well enough understood and you won’t be able to reproduce a scenario in local isolated environment. You then have to try to get more insight of production system. In case resources are the problem (like CPU, memory) several profilers support live production machine-instance attachements to get some metrics like running threads, heap-stats or garbage collector info.

If you need more semantically application insights and followed log conventions, you should also make it possible to enable DEBUG or TRACE logs for a certain code sections. Make your logging-levels configurable during runtime. If you have to restart your app when changing log-settings you can wipe away the current application’s state you want to analyse (issues are often not reproducible straight after restart).

Another helpful option is live debugging, where you debug remotely and direcetly on the problem-server instance. You can directly sense the concrete problem as is and don’t need to waste effort for local reproduction. One of your users will have bad luck and will hit a connection timeout, but the request anyway would have caused an error page.

Use Users/Customers as feedback provider

Even if you are running under a diagnostic friendly webapp (server is under app-provider’s control) you may end-up with very weird cases, which happen on the client’s side. Therefore make it very easy for customers to give you feedback, either provide free-phone-contact or a easy to find feedback form. In case you don’t run as webapp an option would also be that in case of an error the user just needs to click a button and some diagnostic data is sent to the server (I remember, when reporting a bug Ubuntu OS fetched a good deal of OS-configuration-data and attached it to the bug-report). Some embedded systems don’t need user action but automatically dump application-state to a persistent storage before a reset. This data then can be retrieved later from a mechanic inside a car repair-shop.

Conclusion

Your production system will fail at some point. You can try to fight this by getting your QA as close as possible to production and invest A LOT into testing. Still you have to find the sweet spot of cost/benefit ratio, because this effort increases exponentially (effort of providing same hardware, 100% test-coverage + test-data, simulate real-user behaviour, etc.). There are software-system classes where this effort must be pushed to the limits like in money + life critical intense systems, but most applications we are building aren’t that sensitive where damage cannot be tolerated or reversed. Therefore build in as many diagnostics as sensible and have good tooling + process around a quick alert/fix/deploy roundtrip.

→ No CommentsTags: Software Engineering

>March 13th, 2011 · No Comments

Top-Tools for Log-Analysis (CLI based)

After all the years it is still remarkable how powerful plain text is for software development (human readability, advanced SCM tools, expressiveness, no GUI necessary, etc.). This applies also on logging, where plain text makes a very nice source for doing adhoc analysis. Following samples show how bash, its interprocess-communication piping and the usual suspect tools (awk, grep, sed, tr, etc.) give you great control when doing analysis.

Format

Logs should have a standard format, so analysis can rely on structured output and operate on it, programs are less flexible as humans and need strict patterns. Most often such logs are line-based and white-space separate based. Fortunately most logging frameworks make this a non brainer (e.g apache log-format settings, log4j).

Usual log-pattern, rows/lines form single datasets and are most often ordered by time, because they are appended by the software over time, they often contain also timestamps. The lines most often are independant of each other, but for each line data needs to follow strict order (apart from loose text like error-messages or furhter stack traces):

dataset1(line1): data1 data2 data3
dataset2(line2): data1 data2 data3
dataset3(line3): data1 data2 data3

To better understand examples I refer to apache-logging pattern (HTTP based web-traffic). Still all the example adhoc queries can be applied to all other log-output target/formats:


127.0.0.1 – - [29/Feb/2011:13:58:50 +0100] “GET /api/image/status/ok HTTP/1.1″ 200 476 “-” “Jakarta Commons-HttpClient/3.1″

Log analysis in Action

Quick overview for further analysis steps:

# using ‘less’ so reading longer output is easier
grep ” 200 “ log-file | less

# only direct stuff to ‘less’ which is of interest, in this case show relative URI path (7th position)
grep ” 200 “ log-file | awk ‘{print $7}’ | less

# plain text is good candidate for compressing therefore on production
# environment files are often zipped to save disk space, therefore use zgrep
zgrep * | awk ‘{print $7}’ | less

Counts:

# count of HTTP calls which had 404 status
grep -c ” 404 “ log-file
# more verbose equivalent
cat log-file | grep ” 404 “ | wc -l

# count of all other but 404 HTTP status (inverse grep)
grep -v -c ” 404 “ log-file

# piping (filtering step-by-step)
# showing number of failed (non 201 status) image uploads (POST)
cat log-file | grep -v ” 201 “ | grep POST | grep image | wc -l

Filtering:

# spit out all respone times of HTTP 200 status
# response time is whitespace-separated on 10th position
grep ” 200 “ log-file | awk ‘{print $10}’

# show number of  unique called URLs (stand on 7th position)
cat log-file | awk ‘{print $7}’ | uniq

# show slowest response time
grep ” 200 “ log-file | awk ‘{print $10}’ | sort -n | tail -n1
# show fastest response time
grep ” 200 “ log-file | awk ‘{print $10}’ | sort -n | head -n1

# filter lines ‘from,to’ (e.g. analyse critical time section)
cat log-file | sed -n ‘20,30′

Transformations:

# chain response times with ‘,’ instead of linebreaks ‘\n’
grep ” 200 “ log-file | awk ‘{print $10}’ | tr -d \n “,”

# delete/omit lines 10-20
cat log-file | awk ‘{print $10}’ | sed ‘20,30d’
# slightly shorter
awk ‘{print $10}’ <  log-file | sed ‘20,30d’

Side-Note: While running adhoc log-files queries, often it is enough to go the pragmatic way, e.g. above I used the ” 404 ” in grep command which is not correct (404 could occur inside the response time part also). Never the less it is quicker and shorter to write and collision is unlikely.

Server farm approach (horizontal scaled)

How do you run log-analysis in server-farm environments (lots of machines are used to scale up horizontally)? In this situation logs are distributed around on each machine’s log directory. An option here is to have a dedicated more storage focused server which runs a batch/cron-job over night copying all the server logs to a single place. This of course is practicable if the files aren’t too huge and can be processed in reasonable time by single script. If this is not the case you may want to use a more scalable sophisticated solution (like MapReduce with Hadoop). Another alternative is to analyse log on single machine and extrapolate. Most likely you use a load-balancer which causes the traffic to be distributed equally on each machine, i.e. running the same log-query on each machine would give back similar results. The big advantage of the single machine is simplicity: no infrastructure necessary (terminal and log-file sufficient), commands are simple one-liners and speed (only one machine log-file needs to be analysed).

Summary

The CLI triangle (bash, its utilities, piping) form a very convenient ad-hoc query-interface for logs. At first glance they look a bit cryptic, but you get used to it very quickly. Also you already achieve a lot by only knowing a small subset of functionality (e.g. I use only a minimum subset of ‘awk’ and ’sed’).

Am happy about any comments/hints of other lightweight tooling for log-analysis.

→ No CommentsTags: Uncategorized

>February 6th, 2011 · 4 Comments

Unit-Testing: Situations when NOT to do it

I am a big fan and practioner of automated unit-testing, but throughout the years I took my lessons. Starting with “everything has to be automated tested” throughout years I experienced situations where doing unit-testing is not optimum approach.

The presented sections go along with my favorite test-smells:

  1. Brittle tests: Though functionality hasn’t been changed the test fails. Test should show green but in fact shows red (false positive).
  2. Inefficient tests: The effort of writing automated tests doesn’t pay out at all. The benefit/cost ratio (short + long term) is extremely low.

Unit-Test little scripts/tools

There is often no sense to write unit-tests for little scripts or tools, which are one or two-liners. The script content is already so “declaritive”, short and compact that the code is too simple to break. Further more often stubbing or mocking the dependencies is tough (e.g. writing to stdout/file, shutdown machine, doing an HTTP call). You can end up writing a external system emulator which is overkill in this situation. Surely testing is important but for that I go the manual way (executing script, and smoke-test sanity check the outcome).

Unit-Test high level orchestration services

Orchestration services have many dependencies and chain-call lower services. The effort of writing such unit-tests is very high: Stubbing/Mocking all these outgoing dependencies is tough, test setup logic can get very complex and make your test-code hard to read and understand. Further more these tests tend to be very brittle, e.g. minor refactoring changes to production code will break them. Main reason is that inside test-code you have to put a lot of implementation detail knowledge to make stubbing/mocking work. You can argue having many fan-out/outgoing dependencies is a bad smell and you should refactor from start on. This is true in some cases but higher order service often have the nature to orchestrate lower ones, so refactoring won’t change/simplify much and make design even more complicated. In the end for such high level services I much prefer to let test-cover them by automated or non-automated acceptance tests.

Test-first during unclear Macro-Design

When implementing a feature or something from scratch often the macro-design is blurry, I like to call this “diving-in”. For diving-in development or quick prototyping you get a feeling which design fits or not. During this phase class structures/interactions change a lot, sometimes even big chunks of code are thrown away and you restart again. Such wide code changes/deletions often will break your tests and you have to adapt or even delete them. In these situations test-first approach doesn’t work for me, writing test-code even distracts me and slows me down. Yes, unit-tests and test-first approach can and should guide your design but I experienced this counts more when the bigger design decisions have been settled.

100% Code-Coverage

I can’t overstate this: Code-Coverage != Test-Coverage. The Code-Coverage of unit-tests is a nice metric to see untested spots, but it is by far not enough. It just tells you that the code has been executed and simply misses the assert part of your test. Without proper asserts, which check the side-effect of your production code and expected behaviour, the test gives zero value. You can reach 100% code-coverage without having tested anything at all. In the end this wrong feeling of security is much worse as having no test at all! Further more 100% code-coverage is inefficient because you will test a lot of code, which is “too simple to break” (e.g. getters/setters, simple constructors + factory-methods).

Summary

Above points shouldn’t give you the impression that I do speak against automated unit-tests, I think they are great: They can guide you to write incremental changes and help you to focus, you get affinity for green colors ;), they are cheap to execute and regression testing gives you security of not breaking things and more courage to refactor. Still going with the attitude that you have to go for 100% code-coverage and to test every code snippet will kill testing culture and end up in Red-Green color blindness.

→ 4 CommentsTags: Continous Integration · Software Engineering · Software Maintenance