Recipes for shipping good code are easy to find on the internet, and the practices are well established. But what comes next? Given high code coverage, mutation testing, and end-to-end tests, can we say our system is bulletproof? Would it be worth spending an additional day writing system health checks that ensure the live application is really doing its job? To state the obvious: things do break, and it's good to know about it as soon as possible!

The HealthCheck library’s code is available on GitHub: https://github.com/codete/HealthCheck

One of the challenges we faced while maintaining a huge e-commerce system for one of our clients was instability: tests were sparse, solid documentation didn't exist, and sometimes introducing a feature caused a butterfly effect. To say the least, deployment felt like playing Russian roulette, where the bullet could hit you two days later when somebody finally filed a support ticket. It was often hard to prepare an automated test for such cases, as multiple conditions had to be fulfilled for the error to occur: entering the page through a certain advertisement vendor was one, buying a specific set of products combined with a particular voucher another. “You could have tested that manually!” – yes, we could, and to some extent we did, but due to the limited capacity of our QA team we often had to skip some “less important” areas to focus on the main flows. To stop learning about errors only through user reports, we introduced health checks.

What is that “health check” thing?

A good one-sentence definition would be “a script that runs regularly and checks whether the application is performing as expected”. “Performing as expected” can be really broad, and the possibilities are endless. For instance, the first indicators that the business defined were:

  1. At least 50 orders should be placed during the last hour
  2. At least 100 orders should come from Facebook Ads on a daily basis
  3. At least 100 orders should contain 2 or more products on a daily basis
  4. At least 30 users should register within the last hour

While for developers, the most important things were:

  1. Messages in RabbitMQ are being consumed and are not piling up
  2. Error rate in microservices is <= 0.5%
  3. 3rd party APIs are operational

With the performance indicators in place, the “script” part of the definition remains, plus one thing that isn't in the definition but is crucial to the entire operation: notifications about failed health checks. When it came to that part, we weren't happy with the existing solutions, so we decided to roll our own framework for such checks.

HealthCheck library

The main goal we aimed for was checks that are easy to write – and we accomplished it. Let's have a look at one of the checks we wrote:
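Below is a minimal sketch of such a check, covering the first business indicator from the list above. The interface shape, method names, and the `orders` table are assumptions for illustration – the library's readme defines the actual contract:

```php
<?php

namespace App\HealthCheck;

use Doctrine\DBAL\Connection;

// Illustrative sketch only: the real base class/interface shipped with the
// library may differ – consult the readme for the actual contract.
class RecentOrdersCheck
{
    public function __construct(private Connection $connection)
    {
    }

    public function getName(): string
    {
        return 'At least 50 orders placed during the last hour';
    }

    // Returns true when the indicator is healthy.
    public function check(): bool
    {
        $count = (int) $this->connection->fetchOne(
            'SELECT COUNT(id) FROM orders WHERE created_at >= :since',
            ['since' => (new \DateTimeImmutable('-1 hour'))->format('Y-m-d H:i:s')]
        );

        return $count >= 50;
    }
}
```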

The only things needed when writing a check are its name and the actual logic. The check doesn't run itself, nor does it notify about results, as that would violate the single responsibility principle and greatly complicate things. To let the library know about the check we wrote, we only need to tag it with “hc.health_check” (if using the provided Symfony bundle):
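With standard Symfony service configuration this could look roughly like the following (shown here with PHP configuration; YAML or XML works the same way – the class is the hypothetical check from the sketch above):

```php
<?php
// config/services.php

use App\HealthCheck\RecentOrdersCheck;
use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->services()
        ->set(RecentOrdersCheck::class)
        ->autowire()
        ->tag('hc.health_check'); // the tag the bundle collects checks by
};
```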

As mentioned earlier, the health check itself doesn't notify about its result; that responsibility is delegated to ResultHandlers instead. In the case of our e-commerce application, Slack is the communication tool, and all failing checks were reported to a specially created channel. The Slack handler is one of the few handlers available out of the box with the library – you can configure it along with the bundle's configuration:
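A sketch of what such configuration might look like – note that the configuration keys below are assumptions, not the bundle's documented schema, so check the readme for the exact tree:

```php
<?php
// config/packages/health_check.php – illustrative; key names are assumptions.

use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->extension('health_check', [
        'result_handlers' => [
            'slack' => [
                'webhook_url' => '%env(SLACK_WEBHOOK_URL)%',
                'channel'     => '#health-checks',
            ],
        ],
    ]);
};
```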

You can also define your own result handlers; the one we implemented right after Slack was an SMS notification handler which, for the most critical failures, would send a message to a list of people. To learn more about writing custom result handlers, please refer to the ResultHandler section of the library's readme file.
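To give a rough idea, a custom handler could be shaped like the sketch below; the `handle()` signature and the `SmsGatewayInterface` service are hypothetical stand-ins, not the library's actual API:

```php
<?php

namespace App\HealthCheck\Handler;

// Hypothetical SMS gateway abstraction – substitute your provider's client.
interface SmsGatewayInterface
{
    public function send(string $phoneNumber, string $message): void;
}

// Illustrative custom result handler; the interface the library really
// expects is described in the ResultHandler section of its readme.
class SmsResultHandler
{
    /** @param string[] $phoneNumbers */
    public function __construct(
        private SmsGatewayInterface $gateway,
        private array $phoneNumbers,
    ) {
    }

    public function handle(string $checkName, bool $passed): void
    {
        if ($passed) {
            return; // only failures should wake people up
        }

        foreach ($this->phoneNumbers as $number) {
            $this->gateway->send($number, sprintf('Health check failed: %s', $checkName));
        }
    }
}
```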

Running the checks

Yet another thing health checks don't do on their own is run themselves. The only thing the library provides are two console commands: one runs all checks, the other runs a single check specified as an argument. How you run the checks is out of the library's scope, to give you the most freedom and to avoid making assumptions about your environment.

The first stab we took was using UNIX's cron, as it's as easy as it gets: the checks ran on time with almost no effort on the developers' side (a typical crontab entry is sketched below). That approach came with a price, though, as cron's scope of interest is limited: you have no history of what was run or how it ended, and if the server went down for any reason we wouldn't even know (yes, yes – unless another layer of monitoring is in place). Soon enough we replaced our cron-based approach with a more sophisticated Nagios setup.
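For illustration, the cron entry can be as simple as the line below – the console command name and paths are assumptions, not necessarily what the library ships with:

```
# run all health checks every 5 minutes (command name is illustrative)
*/5 * * * * php /var/www/app/bin/console health-check:run >> /var/log/health-checks.log 2>&1
```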

Concluding

Implementing the library hasn't mitigated the root causes of the platform's problems, nor has it made the Russian roulette deployments any better. On the other hand, finding weird errors on live no longer depends on user-submitted bug reports or on sheer luck while browsing gargantuan logs. The situation is far from perfect, but both sides, ours and the business's, already see tangible benefits – we learn about overlooked mistakes faster and can react faster, so fewer users are affected by the side effects of releases.

Do you already have some interesting use cases for our library? We have some ideas for the next features, but we're more than eager to hear what you may need so we can make our library a good fit for you too!

Team Leader

As a child, perhaps like most of us, he dreamt about creating all those cool computer games he was playing, but in the end, around 2006, he found himself in the PHP world and has stayed there to this very day, trying to make that world a little bit better every day. MongoDB Team @ Doctrine // Fan @ Symfony