SOFTWARE ENGINEERING blog & .lessons_learned
manuel aldana
Manuel Aldana

May 15th, 2011 · No Comments

‘Embrace Failure’ approaches in Software Development

Software-Developers (and humans generally) must not try to be 100% perfect but should take the imperfection as granted and “embrace” it. Following gives some thoughts on how to approach the ‘Embracing failure’ principle in the field of Software Development.

Log thoughtfully

Proper and insightful logging cannot be overstated, you will praise the developer who provides good logs, i.e. you can easily search for problem-symptoms inside logs. Inside team find an appropriate logging convention for log-patterns + semantics of level. Some short and abbreviated ideas:

  • DEBUG: Detailed information while going through integration points (like printing payload of XML for api call). Quantified information like statistics while refreshing/invalidating cache.
  • INFO: Bootstrapping/startup information, initialization routines, reoccurring asynchronous actions (including elapsed time).
  • WARN: Logging suspicious/errornous application state, where fallback is still possible and routine/call can be proceeded.
  • ERROR: No fallback possible. For the user/customer this most likely means a crashed call/request (in HTTP jargon this would be a 500 status). Examples for such failures could be NullPointers or database unavailability.

NEVER EVER swallow away Exceptions, I’ve seen several empty catch{} blocks which later caused major headaches during production, on user-level the system crashed on many calls but nothing could be seen inside logs. At topmost application layer log uncaught Exceptions as ERROR.

Also have a look at a longer blog-entry about appropriate logging.

Make deployment + upgrade easy

When a critical failure occurs you want to be able to react quickly. Reaction-time lasts between the time of knowing the issue until the deployment is finished and the customers sees the change. This includes getting the code (thank a fast SCM), understanding the problem, fixing/extending the code, verifying change on local and/or qa-environment and finally push to production. For building deployment package and executing deployment automate and speed up as much as possible.

If you are lucky your app is a typical browser-server based webapp. It has the highest potential regarding instant deployment: As soon as the new app version has been rolled out to your server machine (or server farm, in case you scaled out) the upgrade is complete and will be instantly seen inside the client’s browser. Yes, webapp deployments have their own pitfalls (high traffic site’s rollout isn’t atomic, cache settings, browser state inside DOM or cookies), but other application classes like embedded-systems, desktop-app or mobile-apps have much tougher deployment requirements and upgrade process is more complicated. But Google’s Chrome Webbrowser shows that even for these applications a well designed and transparent upgrade mechanism is possible.

Monitor your application

At some point the system will fail, therefore you have to focus on notfications of respective problems. A typical and annoying scenario is a ‘false negative’ production system: You think everything is alright but in fact the production behaviour is unstable and unreliable. Unstable production behaviour can be infrastructure or application related. Infrastructure problems include network problems, disk crashes, external-system/integration-points downtime or not enough capacity/resources to handle the load. Application related are more “real” usual application bugs like unchecked null-references, dead-locks, infinity-loops, thread-unsafety or simply wrong implemented requirements.

To sense + sanity-check such problems you need system-hooks which deliver data (e.g. Servlet-Container built-in monitoring, JMX). Sometimes you will also write little extension to fulfill a specific monitoring requirement (e.g. ERROR log counter). Apart from data-provision you also need an entity which processes the numbers. To not hire monkeys you should use existing tools. For instance munin + rdd-tool tracks the data over time and plots nice graphs and Nagios alerts you actively in case of a peak of a metric and a potential incident.

Make it visible

We as humans are also far from perfect at number crunching and intepreting masses of detailed numeric data. Often application “health” degrades over time so you should deliver comparison of ‘now’ to historic data (like hour over hour, day over day, week over week). Try to abstract away numbers behind meaningful graphs or charts. The signal light analogy (green, orange, red) also helps a lot to classify the criticality of numbers.

Enabling live diagnostics

Sometimes app behaviour isn’t well enough understood and you won’t be able to reproduce a scenario in local isolated environment. You then have to try to get more insight of production system. In case resources are the problem (like CPU, memory) several profilers support live production machine-instance attachements to get some metrics like running threads, heap-stats or garbage collector info.

If you need more semantically application insights and followed log conventions, you should also make it possible to enable DEBUG or TRACE logs for a certain code sections. Make your logging-levels configurable during runtime. If you have to restart your app when changing log-settings you can wipe away the current application’s state you want to analyse (issues are often not reproducible straight after restart).

Another helpful option is live debugging, where you debug remotely and direcetly on the problem-server instance. You can directly sense the concrete problem as is and don’t need to waste effort for local reproduction. One of your users will have bad luck and will hit a connection timeout, but the request anyway would have caused an error page.

Use Users/Customers as feedback provider

Even if you are running under a diagnostic friendly webapp (server is under app-provider’s control) you may end-up with very weird cases, which happen on the client’s side. Therefore make it very easy for customers to give you feedback, either provide free-phone-contact or a easy to find feedback form. In case you don’t run as webapp an option would also be that in case of an error the user just needs to click a button and some diagnostic data is sent to the server (I remember, when reporting a bug Ubuntu OS fetched a good deal of OS-configuration-data and attached it to the bug-report). Some embedded systems don’t need user action but automatically dump application-state to a persistent storage before a reset. This data then can be retrieved later from a mechanic inside a car repair-shop.

Conclusion

Your production system will fail at some point. You can try to fight this by getting your QA as close as possible to production and invest A LOT into testing. Still you have to find the sweet spot of cost/benefit ratio, because this effort increases exponentially (effort of providing same hardware, 100% test-coverage + test-data, simulate real-user behaviour, etc.). There are software-system classes where this effort must be pushed to the limits like in money + life critical intense systems, but most applications we are building aren’t that sensitive where damage cannot be tolerated or reversed. Therefore build in as many diagnostics as sensible and have good tooling + process around a quick alert/fix/deploy roundtrip.

Tags: Software Engineering

0 responses

    You must log in to post a comment.