I have become increasingly convinced that there is little difference between monitoring and testing. Often we can run our automated tests against a production system with only a little effort.
We are used to listening to our automated tests for feedback about our software in the form of test-smells. If our tests are complicated, it’s often a strong indicator that our code is poorly designed and thus hard to test. Perhaps code is too tightly coupled, making it hard to test parts in isolation. Perhaps a class has too many responsibilities so we find ourselves wanting to observe and alter private state.
Doesn’t the same apply to our production monitoring? If it’s hard to write checks to ensure our system is behaving as desired in production, it’s easy to blame the monitoring tool, but isn’t it also a reflection of the applications we are monitoring? Why isn’t it easy to monitor them?
Just as we test-drive our code, I like to do monitoring check-driven development for our applications. Add checks that the desired business features are working in production even before we implement them, then implement them and deploy to production in order to make the monitoring checks pass. (This does require that the same team building the software is responsible for monitoring it in production.)
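To make that concrete, here is a rough sketch of what such a check might look like: an ordinary little program that a scheduler runs against production, failing on every run until the feature is implemented and deployed. The endpoint and exit-code convention are invented for illustration.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical check, written before the feature it monitors has been built.
class ExportPaymentsCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://payments.example.com/exports/latest")) // invented URL
            .GET()
            .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            System.err.println("CHECK FAILED: export feature is not serving exports yet");
            System.exit(2); // non-zero exit tells the monitoring scheduler the check failed
        }
    }
}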
As I do this more, I notice that in the same way that TDD gives me feedback on code design (helping me write better code), check-driven development gives me feedback on application design.
Why are our apps hard to monitor? Perhaps they are too monolithic, so the monitoring tools can’t check behaviour or data buried in the middle. Perhaps they do not publish the information we need to ascertain whether they are working.
Here are 4 smells of production monitoring checks that I think give design feedback, just like their test counterparts. There are lots more.
- Delving into Private State
- Complexity, Verbosity, Repetition
- Non-Determinism
- Heisentests
Delving into Private State
Testing private methods in a class often indicates there’s another class screaming to be extracted. If something is complex enough to require its own test, but not part of the public interface of the object under test, it likely violates the single responsibility principle and should be pulled out.
Similarly, monitoring checks that peek into the internal state of an app may indicate there’s another app that could be extracted with a clearly separate responsibility that we could monitor independently.
For example, we had an application that ingested some logs, parsed them and generated some sequenced events, which then triggered various state changes.
We had thoroughly tested all of this, but we still occasionally had production issues, caused by unexpected input and by sequence generation going wrong when the infrastructure or environment misbehaved.
In response, we added monitoring that spied on the intermediate state of the events flowing through the system, and spied on the database state to keep track of the sequencing.
Monitoring that pokes into the internals of an application like this is similar to tests that use reflection to spy on private state – and similar problems arise. In this case schema changes and refactorings of the application would break the monitoring checks.
In the end we split out some of these responsibilities. We ended up with an entirely stateless app that created events from logs, a separate event sequencer, and the original app, which now consumed the resulting events.
The result is much easier to monitor because the responsibilities are clearer. We can monitor the inputs and outputs of the event generator, both passively and actively, by passing test data through it in production.
Our monitoring checks rely on the same public interface that the other applications use, so we are less likely to break the checks inadvertently. It’s similarly easy to monitor what the sequencer is doing, and we can ingest the events from our production system into a copy of it to spot problems early.
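As a sketch of the kind of check this split makes possible (the names here are invented), asserting that the sequencer’s published sequence numbers are contiguous needs nothing beyond the events that other applications already consume:
import java.util.List;

// Illustrative check over the sequencer's public output, not its internal state.
class SequencingCheck {
    static void assertContiguous(List<Long> sequenceNumbers) {
        for (int i = 1; i < sequenceNumbers.size(); i++) {
            long previous = sequenceNumbers.get(i - 1);
            long current = sequenceNumbers.get(i);
            if (current != previous + 1) {
                throw new AssertionError("Gap in event sequence between " + previous + " and " + current);
            }
        }
    }
}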
Complexity, Verbosity, Repetition
I’ve also been guilty of writing long monitoring checks that poll various systems, perform some calculations on the results, and then determine whether there is a problem based on some business logic embedded in the check.
These checks tend to be quite long, and are at risk of copy-paste duplication when someone wants to check something similar. Just as long and duplicated unit tests are a smell, so are long and duplicated monitoring checks. Sometimes the checks are complex enough that we even want to TDD them.
When our tests get long and repetitive, we sometimes need to invest in improving our test infrastructure to help us write more concise tests in the business domain language. The same applies to monitoring checks – perhaps we need to invest in improving our monitoring tools if the checks are long and repetitive.
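That investment can be as small as a thin helper layer so that checks read in the business language rather than as raw polling and arithmetic. A purely illustrative sketch, with the interface and metric names invented:
import java.time.Duration;

// Hypothetical adapter over whatever monitoring tool is in use.
interface MetricsClient {
    long count(String metricName, Duration window);
}

// Checks written against this helper stay short and read in domain terms.
class PaymentChecks {
    private final MetricsClient metrics;

    PaymentChecks(MetricsClient metrics) {
        this.metrics = metrics;
    }

    boolean paymentsFlowedInLastHour() {
        return metrics.count("payments.accepted", Duration.ofHours(1)) > 0;
    }
}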
Sometimes verbose and unclear tests can be a sign that the implementation they are testing is also unclear and at the wrong abstraction level. If we have modelled concepts poorly in our code then we’ll struggle to write the tests (which are often closer to the requirements). It’s a sign we need to improve our model to simplify how it fulfils the requirements.
For example, suppose we had a payment amount in our code, represented as an integer or a currency value, that could only be positive in our context. We might end up with tests in several places that check the number is positive, and that check what happens if we ended up with a negative number due to misuse.
public void deposit(int paymentAmount) { ... }
public void withdraw(int paymentAmount) { ... }
@Test
public void deposit_should_not_accept_negative_paymentAmount() { ... }
@Test
public void withdraw_should_not_accept_negative_paymentAmount() { ... }
We might spot this repetition and complexity in our tests and realise we need to improve our design. We could introduce a PaymentAmount concept that can only be instantiated with positive numbers and pass that around instead, removing the need to test the same thing everywhere.
class PaymentAmount { ... }
@Test
public void should_not_be_able_to_create_negative_paymentamount() { ... }
public void deposit(PaymentAmount amount) { ... }
public void withdraw(PaymentAmount amount) { ... }
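A minimal sketch of what that type might look like, assuming we simply reject misuse with an exception (further down, the same guard reports to the monitoring system instead):
// Sketch: the invariant now lives in one place, so callers no longer re-test it.
class PaymentAmount {
    private final int value;

    PaymentAmount(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("Payment amount cannot be negative: " + value);
        }
        this.value = value;
    }

    int value() {
        return value;
    }
}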
In the same way, repetitive monitoring checks can often be replaced by enforcing invariants within the applications themselves. Have the applications notify the monitoring system if an assumption or constraint is violated, rather than having the monitoring system do the checking. This encapsulates the business logic in the applications and keeps the checks simple.
Compare
#!/bin/bash
COUNT=$(runSql "select count(*) from payment_amounts where value < 0")
# Fail the check if any negative amounts exist
[ "$COUNT" -eq 0 ] || { echo "Negative payment amounts found: $COUNT"; exit 2; }
with
class PaymentAmount {
    public PaymentAmount(int value) {
        monitoredAssert(value > 0, "Negative payment amount observed");
        ...
    }
}

static void monitoredAssert(boolean condition, String message) {
    if (!condition) notifyMonitoringSystemOfFailure(message);
}
Now your monitoring system just alerts you to the published failures.
Too often I've written checks for things that could have been database constraints or assertions in the applications themselves.
Non-Determinism
Non-deterministic tests are dangerous. They rapidly erode our confidence in a test suite. If we think a test failure might be a false alarm we may ignore errors until some time after we introduce them. If it takes multiple runs of a test suite to confirm everything is OK then our feedback loops get longer and we’re likely to run our tests less frequently.
Non-deterministic monitoring checks are just as dangerous, but they’re so common that most monitoring tools have built-in support for only triggering an incident or alert if the check fails n times in a row. When people get paged erroneously on a regular basis, it increases the risk that they’ll ignore a real emergency.
Non-deterministic tests or monitoring checks are often a sign that the system under test is also unreliable. We had a problem with a sporadically failing check that always seemed to fix itself. It turned out to be due to our application not handling errors from the S3 API (which happen quite frequently). We fixed our app to retry in that scenario and the check became reliable.
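The fix amounted to retrying the transient failure a bounded number of times before treating it as real. A generic sketch of the shape of it, not the actual client code:
import java.util.function.Supplier;

// Illustrative retry wrapper for a flaky remote call; assumes maxAttempts >= 1.
class Retry {
    static <T> T withRetries(Supplier<T> call, int maxAttempts) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) { // e.g. a transient error from the S3 API
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
The flaky fetch is then wrapped as Retry.withRetries(() -> fetchFromS3(key), 3), where fetchFromS3 stands in for whatever call was failing.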
Heisentests
A test or check that influences the behaviour it is trying to measure is another smell. We've had integration tests that ran against the same system as our performance tests and influenced the results by priming caches. We've also had similar issues with our production monitoring.
In one particularly fun example, we had reports that one of our webapps was slow during the night. We were puzzled, and decided to start by monitoring the response times of the pages in question at regular intervals so we could confirm the users’ reports.
The data suggested there was no problem, and the users actually reported the problem fixed. It turned out that by loading these pages regularly we were keeping data in caches that normally expired during low usage periods. By monitoring the application we had changed the behaviour.
Both of these scenarios were resolved by having the application publish performance metrics that we could check in our tests and query from a central metrics database for production monitoring. This way we were checking performance against our acceptance criteria and real-world user behaviour, without influencing that behaviour.
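In outline (the names are invented), the application times its own work and publishes the result, and the checks query the central store rather than generating traffic of their own:
import java.time.Duration;
import java.time.Instant;

// Hypothetical adapter over the central metrics database.
interface MetricsPublisher {
    void recordTiming(String name, long millis);
}

// The application measures real requests, so checks need not probe the pages themselves.
class TimedPageHandler {
    private final MetricsPublisher metrics;

    TimedPageHandler(MetricsPublisher metrics) {
        this.metrics = metrics;
    }

    String handle(String pageId) {
        Instant start = Instant.now();
        try {
            return renderPage(pageId);
        } finally {
            metrics.recordTiming("page.render.millis", Duration.between(start, Instant.now()).toMillis());
        }
    }

    private String renderPage(String pageId) {
        return "..."; // placeholder for the real page rendering
    }
}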
Conclusion
Writing your production monitoring checks before you implement features can help you towards a better design. Look out for smelly monitoring: what is it telling you about your application design?
What other smells have you noticed from monitoring your systems in production?