Recipes for shipping good code are easy to find on the internet, and the practices are well established. But what comes next? Given high code coverage, mutation testing, and end-to-end tests, can we say our system is bulletproof? Would it be worth spending an additional day writing system health checks that ensure the live application is really doing its job? To state the obvious: things do break, and it's good to know about it as soon as possible!

The HealthCheck library’s code is available on GitHub: https://github.com/codete/HealthCheck

One of the challenges we faced while maintaining a huge e-commerce system for one of our clients was instability: tests were sparse, solid documentation didn't exist, and sometimes introducing a feature caused a butterfly effect. To say the least, deployment felt like playing Russian roulette, where the bullet could hit you two days later when somebody finally filed a support ticket. It was often hard to prepare an automated test for such cases, as multiple conditions had to be fulfilled for the error to occur: entering the page through a certain advertisement vendor was one, buying a specific set of products combined with a particular voucher another. “You could have tested that manually!” – yes, we could, and to some extent we did, but due to the limited capacity of our QA team we often had to skip some “less important” areas to focus on the main flows. To stop learning about errors only through user reports, we introduced health checks.

What is that “health check” thing?

A good one-sentence definition would be “a script that runs regularly and checks whether the application is performing as expected”. “Performing as expected” can be really broad, and the possibilities are endless. For instance, the first indicators that the business defined were:

  1. At least 50 orders should be placed during the last hour
  2. At least 100 orders should come from Facebook Ads on a daily basis
  3. At least 100 orders should contain 2 or more products on a daily basis
  4. At least 30 users should register within the last hour

While for developers, the most important things were:

  1. Messages in RabbitMQ are being consumed and are not piling up
  2. Error rate in microservices is <= 0.5%
  3. 3rd party APIs are operational

With the performance indicators in place, the “script” part of the definition remains, plus one thing that isn't in the definition but is crucial to the entire operation: notifications about failed health checks. When it came to that part, we weren't happy with the existing solutions, so we decided to roll our own framework for such checks.

HealthCheck library

The main goal we aimed for was checks that are easy to write – and we accomplished it. Let's have a look at one of the checks we wrote:
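Below is a minimal sketch of such a check, covering the first business indicator from the list above. The interface shape, method names, and the `orders` table are assumptions for illustration – the library's readme defines the actual contract:

```php
<?php

namespace App\HealthCheck;

use Doctrine\DBAL\Connection;

// Illustrative sketch only: the real base class/interface shipped with the
// library may differ – consult the readme for the actual contract.
class RecentOrdersCheck
{
    public function __construct(private Connection $connection)
    {
    }

    public function getName(): string
    {
        return 'At least 50 orders placed during the last hour';
    }

    // Returns true when the indicator is healthy.
    public function check(): bool
    {
        $count = (int) $this->connection->fetchOne(
            'SELECT COUNT(id) FROM orders WHERE created_at >= :since',
            ['since' => (new \DateTimeImmutable('-1 hour'))->format('Y-m-d H:i:s')]
        );

        return $count >= 50;
    }
}
```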

The only things needed when writing a check are its name and the actual logic. The check doesn't run itself, nor does it notify about results, as that would violate the single responsibility principle and greatly complicate things. To let the library know about the check we wrote, we only need to tag it with “hc.health_check” (if using the provided Symfony bundle):
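With standard Symfony service configuration this could look roughly like the following (shown here with PHP configuration; YAML or XML works the same way – the class is the hypothetical check from the sketch above):

```php
<?php
// config/services.php

use App\HealthCheck\RecentOrdersCheck;
use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->services()
        ->set(RecentOrdersCheck::class)
        ->autowire()
        ->tag('hc.health_check'); // the tag the bundle collects checks by
};
```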

As mentioned earlier, the health check itself doesn't notify about its result; that responsibility is delegated to ResultHandlers instead. In the case of our e-commerce application, Slack is the communication tool, and all failing checks were reported to a specially created channel. The Slack handler is one of the few handlers available out of the box with the library – you can configure it along with the bundle's configuration:
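A sketch of what such configuration might look like – note that the configuration keys below are assumptions, not the bundle's documented schema, so check the readme for the exact tree:

```php
<?php
// config/packages/health_check.php – illustrative; key names are assumptions.

use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->extension('health_check', [
        'result_handlers' => [
            'slack' => [
                'webhook_url' => '%env(SLACK_WEBHOOK_URL)%',
                'channel'     => '#health-checks',
            ],
        ],
    ]);
};
```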

You can also define your own result handlers; the one we implemented right after Slack was an SMS notification handler which, for the most critical failures, would send a message to a list of people. To learn more about writing custom result handlers, please refer to the ResultHandler section of the library's readme file.
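To give a rough idea, a custom handler could be shaped like the sketch below; the `handle()` signature and the `SmsGatewayInterface` service are hypothetical stand-ins, not the library's actual API:

```php
<?php

namespace App\HealthCheck\Handler;

// Hypothetical SMS gateway abstraction – substitute your provider's client.
interface SmsGatewayInterface
{
    public function send(string $phoneNumber, string $message): void;
}

// Illustrative custom result handler; the interface the library really
// expects is described in the ResultHandler section of its readme.
class SmsResultHandler
{
    /** @param string[] $phoneNumbers */
    public function __construct(
        private SmsGatewayInterface $gateway,
        private array $phoneNumbers,
    ) {
    }

    public function handle(string $checkName, bool $passed): void
    {
        if ($passed) {
            return; // only failures should wake people up
        }

        foreach ($this->phoneNumbers as $number) {
            $this->gateway->send($number, sprintf('Health check failed: %s', $checkName));
        }
    }
}
```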

Running the checks

Yet another thing health checks don't do on their own is run themselves. The only thing the library provides are two console commands: one runs all checks, the other runs a single check specified as an argument. How you run the checks is out of the library's scope, to give you the most freedom and to avoid making assumptions about your environment.

The first stab we took was using UNIX's cron, as it's as easy as it gets: the checks ran on time with almost no effort on the developers' side (a typical crontab entry is sketched below). That approach came with a price, though, as cron's scope of interest is limited: you have no history of what was run or how it ended, and if the server went down for any reason we wouldn't even know (yes, yes – unless another layer of monitoring is in place). Soon enough we replaced our cron-based approach with a more sophisticated Nagios setup.
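For illustration, the cron entry can be as simple as the line below – the console command name and paths are assumptions, not necessarily what the library ships with:

```
# run all health checks every 5 minutes (command name is illustrative)
*/5 * * * * php /var/www/app/bin/console health-check:run >> /var/log/health-checks.log 2>&1
```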

Concluding

Implementing the library hasn't mitigated the root causes of the platform's problems, nor has it made the Russian roulette deployments any better. On the other hand, finding weird errors on live no longer depends on user-submitted bug reports or on sheer luck while browsing gargantuan logs. The situation is far from perfect, but both sides, ours and the business's, already see tangible benefits – we learn about overlooked mistakes faster and can react faster, so fewer users are affected by the side effects of releases.

Do you already have some interesting use cases for our library? We have some ideas for the next features, but we're more than eager to hear what you may need so we can make our library a good fit for you too!

Team Leader

As a child, perhaps like most of us, he dreamt about creating all those cool computer games he was playing, but in the end, around 2006, he found himself in the PHP world and has stayed there to this very day, trying to make that world a little bit better every day. MongoDB Team @ Doctrine // Fan @ Symfony