Tests, like any code, should be deleted when their cost exceeds their value.
We are often unduly reluctant to delete test code, because it's easy to ignore the cost of tests. We can end up paying a lot for tests long after they are written.
When embarking on a major refactoring or a new feature, it can be liberating to delete all the tests associated with the old implementation and test-drive the new one, especially if you have higher-level tests covering the important behaviour you want to retain.
This may seem obvious, but in practice it's often hard to spot where tests should be deleted. Here are some attributes to think about when evaluating the cost-effectiveness of tests.
Value
Let’s start with value. Good tests provide lots of value. They provide protection against regressions introduced as the software is refactored or altered. They give rapid feedback on unanticipated side-effects. They also provide executable documentation about the intent of code, why it exists, and which use cases were foreseen.
Redundancy
Unneeded Tests
If there are multiple tests covering the same behaviours, perhaps you don't need all of them. If there's code that's only referenced from tests, then you can probably delete that code along with its tests. Even if static analysis indicates code is reachable – is it actually being used in production? Or is it a dead or seldom-used feature that can be pruned from your system?
Documentation and Duplicating Implementation
Some tests simply duplicate their implementation. This can happen both with very declarative code which has little behaviour to test, and with excessively mocked-out code where the tests become a specification of the implementation. Neither of these provide a lot of value as they tend to break during refactoring of the implementation as well as when the behaviour changes. Nor do they provide any significant documentation value if they are simply repeating the implementation.
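To make that concrete, here's an illustrative sketch (the names InvoiceTotaller and TaxCalculator are invented for this example, and it assumes JUnit 4 and Mockito) of a mock-heavy test that simply restates its one-line implementation. It breaks on any refactoring, yet documents nothing beyond what the code already says:

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.Test;

public class InvoiceTotallerTest {

    // The production code under test: a single line of delegation.
    interface TaxCalculator { double addTax(double net); }

    static class InvoiceTotaller {
        private final TaxCalculator tax;
        InvoiceTotaller(TaxCalculator tax) { this.tax = tax; }
        double total(double net) { return tax.addTax(net); }
    }

    @Test
    public void adds_tax_to_the_net_amount() {
        TaxCalculator tax = mock(TaxCalculator.class);
        when(tax.addTax(100.0)).thenReturn(120.0);

        double total = new InvoiceTotaller(tax).total(100.0);

        // These lines mirror the implementation one-for-one: they pin down
        // how the total is computed, not what total the user should see.
        verify(tax).addTax(100.0);
        assertEquals(120.0, total, 0.0);
    }
}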
Type System
Tests can also be made redundant by the type system. Simple tests on preconditions or valid inputs to methods can often be replaced by preconditions the type system can enforce.
For example, if you have three methods that all expect an argument that is an integer between 1 and 11, why not make them accept an object of a type that can only hold a value between 1 and 11? While you're at it, you can give it a more meaningful name than integer: onChange(VolumeLevel input) is more meaningful than onChange(int volumeLevel), and removes the need for those tests at the same time.
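As a minimal sketch of such a type (assuming Java; the factory method name is invented for this example), the precondition is enforced once, for every method that uses it:

// A volume level that can only ever be between 1 and 11.
// Methods accepting a VolumeLevel no longer need their own
// "rejects out-of-range input" tests.
public final class VolumeLevel {
    private final int value;

    private VolumeLevel(int value) {
        this.value = value;
    }

    public static VolumeLevel of(int value) {
        if (value < 1 || value > 11) {
            throw new IllegalArgumentException(
                    "Volume must be between 1 and 11, got " + value);
        }
        return new VolumeLevel(value);
    }

    public int asInt() {
        return value;
    }
}

A caller then writes onChange(VolumeLevel.of(11)), and the compiler, rather than a test per method, guarantees the range.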
Risk
When evaluating the regression protection value that we get from a test we need to think about the risk of the behaviour under test being broken in production.
A system that processes money, where a small tweak could lose thousands of dollars a minute, carries a big risk of expensive regressions, even if they are caught quickly in production.
Therefore tests on the behaviour of that system are going to have higher value than tests on a weekly batch log-analysis job that can just be re-run if it fails or produces the wrong results.
Outlived Usefulness
Is the test testing desired behaviour, or behaviour that’s really just a side effect of the chosen implementation approach? Some tests are really useful for “driving” a particular implementation design, but don’t provide much regression test value.
Does the tested behaviour still match what customers and/or users want now? Usage patterns change over time as users change, their knowledge of the software changes, and new features are added and removed.
The desired behaviour at the time the test was written might no longer match the desired behaviour now.
Speed
Speed is one of the most important attributes of tests to consider when evaluating their value. Fast tests are valuable because they give us rapid feedback on unanticipated side effects of changes we make. Slow tests are less valuable because it takes longer for them to give you feedback.
This brings us nicely to…
Cost
Speed
Worse than being less valuable in and of themselves, slow tests in your test suite delay you from getting feedback from your other tests while you wait for them (unless you parallelise every single test). This is a significant cost of having them in your test suite at all.
Slow tests can discourage you from running your suite of tests as regularly as you would otherwise, which can lead to you wasting time on changes that you later realise will break important functionality.
Test suites that take a long time also increase the time it takes to deploy changes to production. This reduces the effectiveness of another feedback loop – getting input from users & customers about your released changes. As test suites get slower it is inevitable that you will also release changes less frequently. Releases become bigger and scarier events that are more likely to go wrong.
In short, slow tests threaten continuous integration and continuous delivery.
There are ways of combating slow tests while keeping them. You can profile and optimise them – just like any other code. You can decompose your architecture into smaller, decoupled services so that you don't have to run so many tests at a time. In some scenarios it's appropriate to migrate the tests into monitoring of your production environment, rather than tests in the traditional sense.
However, if the cost imposed by slow tests is not outweighed by their value then don’t hesitate to remove them.
Brittleness
Have you ever worked on a codebase where seemingly any change you made broke hundreds of tests, regardless of whether it changed any behaviour? These brittle tests impose a significant cost on the development of any new features, performance improvements, or refactoring work.
There can be several causes of this. Poorly factored tests with lots of duplication tend to be brittle – especially when assertions (or verifications of mock behaviour) are duplicated between tests. Excessive use of strict mocks (which fail on any unexpected interaction) can encourage this. Another common cause is tests that are coupled to their implementation, such as user interface tests with hard-coded CSS selectors, copy, and x/y coordinates.
You can refactor tests to make them less brittle. You can remove duplication, split tests up so that each asserts only one piece of behaviour, and decouple tests from the implementation using a domain-specific language (DSL) or the page object pattern, as in the sketch below.
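Here is an illustrative page object (assuming Selenium WebDriver; the page and selectors are invented for this example). The CSS details live in one class, so a markup change breaks one page object rather than every test that logs in:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

// Tests talk to this page object; only it knows the selectors.
public class LoginPage {
    private final WebDriver driver;

    public LoginPage(WebDriver driver) {
        this.driver = driver;
    }

    public LoginPage enterCredentials(String username, String password) {
        driver.findElement(By.cssSelector("#username")).sendKeys(username);
        driver.findElement(By.cssSelector("#password")).sendKeys(password);
        return this;
    }

    public void submit() {
        driver.findElement(By.cssSelector("form.login button[type=submit]")).click();
    }
}

A test then reads new LoginPage(driver).enterCredentials("alice", "secret").submit(); and never mentions a selector.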
Or, if the cost of continually fixing and/or refactoring these tests is not outweighed by the value they're providing, you could just delete them.
Determinism
Non-deterministic tests have a particularly high cost. They cause us to waste time re-trying them when they fail. They reduce our faith in the test suite to protect us against regressions. They also tend to take a particularly long time to diagnose and fix.
Common causes include calling sleep() in tests rather than waiting for, and responding to, an event. Any test that relies on code or systems that you do not control can be prone to non-determinism. How do your tests cope if your DNS server is slow or returns incorrect results? Do they even work without an internet connection?
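A minimal sketch of the alternative to sleep() (the helper name WaitFor is invented for this example): poll for the condition you actually care about, with a timeout, rather than guessing how long to pause:

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Waits for a condition instead of sleeping for a fixed period:
// fast when the event happens quickly, and fails with a clear
// error when it never happens at all.
public final class WaitFor {
    private WaitFor() {}

    public static void condition(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError(
                        "Condition not met within " + timeoutMillis + "ms");
            }
            TimeUnit.MILLISECONDS.sleep(50);
        }
    }
}

A test calls WaitFor.condition(() -> outbox.contains(expectedMessage), 2000); instead of Thread.sleep(2000).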
Due to the high cost they impose, and the often high cost of fixing them, non-deterministic tests are often ideal candidates for deletion.
Measurement
These cost/value attributes of tests are fairly subjective. Different people will judge costs and risks differently. Most of the time this is fine, and it's not worth imposing overhead to make more data-driven decisions.
Monitor Production
Code coverage, logging, or analytics from your production system can help you determine which features are used and which can be removed, along with their tests. As can feedback from your users.
Build Server
Some people will collect data from their build server test runs to record test determinism, test suite time and similar. This can be useful, but it ignores all the times the tests are run by developers on their workstations.
I favour a lightweight approach: only measure things that are actually useful and that you actually need.
JUnit Rules
A method for getting feedback on tests that I have found useful is to use the test framework features themselves. JUnit provides a way of hooking in before and after test execution using Rules.
A simple Rule implementation can record the time a test takes to execute. It can record whether it's non-deterministic by re-trying a failing test and logging if it passes a second time. It can record how often a test is run, how often a test fails, and information about why tests fail.
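As a minimal sketch of that idea (assuming JUnit 4; the class name and log format are invented for this example), a Rule can wrap every test to time it and retry a failure once, flagging tests that pass on the second attempt as likely non-deterministic:

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class TestMetricsRule implements TestRule {
    @Override
    public Statement apply(final Statement base, final Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                long start = System.currentTimeMillis();
                try {
                    base.evaluate();
                } catch (Throwable firstFailure) {
                    // Retry once: a pass on the second attempt suggests flakiness.
                    base.evaluate();
                    System.out.println("FLAKY: " + description.getDisplayName()
                            + " passed on retry; first failure: " + firstFailure);
                } finally {
                    System.out.println("TIMING: " + description.getDisplayName()
                            + " took " + (System.currentTimeMillis() - start) + "ms");
                }
            }
        };
    }
}

Each test class then declares @Rule public TestMetricsRule metrics = new TestMetricsRule(); and the output can be gathered as described below.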
This data can then be logged and collected centrally for analysis with a log management tool, or published via email, dashboard or similar.
This approach means you can get visibility on how your tests are really being used on workstations, rather than how they behave when run in your controlled build server environment.
Final thoughts
Having so many tests that their speed and reliability become a big problem is a nice #firstworldproblem to have. You can also bear these cost/value attributes in mind when writing tests, to help yourself write better ones.
Don’t forget, you can always get tests back from revision control if you need them again. There’s no reason to @Ignore tests.
So don’t be afraid to delete tests if their cost exceeds their value.
Steve Smith
Hi Benji
This is a great post. I broadly agree with it, although your #firstworldproblem caveat should probably be at the start! Certainly tests can become a burden over time, but for every project such as yours there are many, many more projects that a) do not have tests, or b) have entire test suites that don't work.
Coincidentally I’ve been looking at test cost as a function of execution time, determinism, and robustness that is proportional to System Under Test scope. What interests me is how test cost can increase over time, and gradually a tipping point is reached where cost exceeds value. There is then a difficult, contextual decision to make… should that test cost be reduced or entirely eliminated?
Thought-provoking stuff!
Steve
Steve Smith
Hi
We had a great tool at LMAX for measuring non-determinism in acceptance tests known as AutoTrish (named in honour of Trisha Gee) – from memory the past 10 runs of each acceptance test were persisted, and AutoTrish calculated the standard deviation of test results. Intermittent tests were quickly identified and moved into a separate test suite before being fixed.
Steve