Test failure: Too many tests?

pretix comes with a test suite of currently 2167 tests. When executing these tests, 81 % of pretix's codebase is run. These tests are intended to verify that pretix is operating correctly and that errors and regressions are spotted early. Having such a test suite has saved us from a number of major problems in the past and will hopefully continue to do so in the future.

However, writing many tests comes at a price. The obvious price is that our test suite, by now, runs pretty slowly. On a modern computer with the tests running in parallel, you can still get through it in about 5 minutes, but on CI it takes more like 20 minutes, including dependency installation. Since we currently support 3 databases and 3 Python versions, we also run the test suite at least 9 times on CI, leading to very slow feedback on commits and pull requests.

Part of this is due to the fact that the largest portion of our tests are not “clean”, low-level unit tests, but functional tests that emulate HTTP requests at the framework level and therefore execute many layers of code in between. The test matrix will also shrink once we drop support for Python 3.4 soon (when we upgrade to Django 2.0).
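To illustrate the difference: a low-level unit test exercises a single function in isolation, while one of our typical functional tests goes through URL routing, middleware, views, templates and the database for every request. The following sketch is purely illustrative and uses made-up names, not actual pretix code:

    # Illustrative only, not actual pretix code.
    from django.test import TestCase


    def calculate_gross(net, tax_rate):
        # Hypothetical helper containing plain business logic
        return net * (100 + tax_rate) // 100


    class GrossPriceUnitTest(TestCase):
        def test_gross_price(self):
            # A "clean", low-level unit test: one function, no HTTP, no database
            self.assertEqual(calculate_gross(100, 19), 119)


    class TicketShopFunctionalTest(TestCase):
        def test_ticket_list_page(self):
            # A functional test: emulates a full HTTP request on the framework
            # level, executing routing, middleware, view, template and ORM code
            response = self.client.get('/events/demo/tickets/')  # hypothetical URL
            self.assertEqual(response.status_code, 200)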

We don't like that the tests run slowly, and we'd be happy to do something about it, but that's a known problem that we have been living with for a while. Yesterday, however, we stumbled upon a more interesting downside of having too many tests. When looking into why our latest PR to pretix failed only on SQLite, we noticed that we had run into an interesting bug.

We’re by no means the first to run into this, but since it is non-obvious and we’re probably not the last ones either, I thought it might be useful to share some details.

Some background

Django already applies some optimizations when running a test suite on SQLite databases. In particular, Django uses in-memory SQLite databases during testing, i.e. the database is never written to disk, which saves a lot of time.
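For reference, the relevant database configuration looks roughly like this. With the SQLite backend, Django creates the test database in memory by default, and the TEST setting below merely makes that explicit (a sketch, not pretix's actual settings):

    # Sketch of a Django settings module, not pretix's actual configuration.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.sqlite3',
            'NAME': 'db.sqlite3',
            'TEST': {
                # This is the default for SQLite anyway: the test database
                # lives purely in memory and is never written to disk.
                'NAME': ':memory:',
            },
        },
    }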

Additionally, Django makes heavy use of transactions and savepoints to speed up the tests while still providing a clean database state for every test. Roughly speaking, Django sets a savepoint before running a test and afterwards rolls the database back to that savepoint. This avoids having to recreate the database structure for every single test, which would be very slow.
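Conceptually, this per-test isolation boils down to something like the following. This is a simplified sketch of the idea, not Django's actual test-runner implementation:

    # Simplified sketch of the idea, not Django's actual implementation.
    from django.db import connection, transaction


    def run_in_savepoint(test_body):
        with transaction.atomic():                  # make sure a transaction is open
            sid = connection.savepoint()            # savepoint before the test
            try:
                test_body()                         # run the actual test
            finally:
                connection.savepoint_rollback(sid)  # undo everything it wrote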

The bug

In SQLite versions before 3.12.1, there is a bug leading to a segmentation fault when the “size of the savepoint journal is an exact multiple of the in-memory journal buffer size”. I think that’s a really pretty bug.

Unsurprisingly, this bug can be triggered by making heavy use of savepoints on an in-memory database, which is exactly what we do when testing pretix, and what others have done before us.

Unfortunately, this affects the SQLite version running on Travis CI, where we cannot easily update it ourselves (or at least not to a version that fixes the bug), so we needed to find a way to work around the issue.
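If you want to check whether your own environment is affected: the bug lives in the SQLite library itself, not in Python's sqlite3 module, and the linked library version can be inspected like this:

    # Print the version of the SQLite library Python is linked against.
    # Note that sqlite3.version is the version of the Python module and
    # is unrelated to the library version.
    import sqlite3

    print(sqlite3.sqlite_version)  # anything below 3.12.1 is affected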

Our workaround

Multiple workarounds are circulating on the internet, some of which sound as obscure as the bug itself, like renaming tests to change their order of execution. Others are simpler, like no longer using in-memory databases and writing to disk instead. Hoping not to slow down our builds even further, we tried to go the opposite route and speed them up instead!

We installed pytest-xdist and are now running the test suite across two worker processes on Travis CI. We hadn't considered this before, since we assumed there wouldn't be much to gain: our tests are CPU-bound and we thought Travis would run builds on a single core. This turned out to be wrong, however. Travis in fact assigns two cores to its build containers.
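The change itself is small: with pytest-xdist installed, pytest's "-n" option controls the number of worker processes. Shown here via pytest.main() for illustration; pretix's actual test invocation looks different:

    # Equivalent to running the test suite with two worker processes,
    # i.e. "py.test -n 2". The 'tests/' path is just a placeholder.
    import pytest

    pytest.main(['-n', '2', 'tests/'])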

With the test suite split across two processes, each process has its own in-memory database, so each of those databases is under less stress than before. At the same time, our builds now seem to run approximately 1.5 times faster! 💯

A downside of running tests in multiple processes is that measuring coverage becomes harder. We work around this problem as well by measuring our test coverage on an unparallelized PostgreSQL build, which doesn't run into this issue at all. 😉