Do you know these blog posts about growing companies that have large-scale outages because they ran out of available IDs for their database rows? This is not one of them. We’re highly committed to making new and exciting mistakes, not repeat the ones we’ve read about a dozen times. This is the story of how we went through that migration before it hurt us.
If you haven’t read all the other blog posts about this, here’s a quick recap of what the problem is. You are starting your project, possibly with Django, and you start with a table layout like this:
Years later, your project is growing further than you ever expected, and you have two billion orders in your database.
Only then you learn that
integer can only support values of up to 2,147,483,647 and you can’t create any more orders with higher ID numbers.
When the problem hits you, it’s usually to late for any simple solution.
You can’t just run an
ALTER TABLE statement, because PostgreSQL will need to re-write the entire table for you (and lock the table while it does so).
And your table is huge – otherwise you wouldn’t have the problem in the first place.
ALTER TABLE, just as well as any other solution that is not horrifyingly complex, will cause a long downtime to your system.
We haven’t quite sold two billion tickets yet (we’re still in the lower millions), but we have busier tables (such as log tables) that are in the tens of millions of rows. Going by these numbers, we’re still a number of years away from really hitting this wall.
However, we’ve already reached a size at which running
ALTER TABLE on the primary key of over 400 tables with lots of foreign keys is too slow and fragile,
but we’re still at a size where it is generally possible to rewrite the entire database content within a somewhat reasonable downtime window.
We figured it would be nice to solve this problem now and not run into it in a few years!
Coincidentally, we have been planning upgrading our PostgreSQL cluster from PostgreSQL 11 to PostgreSQL 15. If you have performed major version upgrades of highly-available PostgreSQL setups before, you know that there are multiple approaches to this, each with their own set of disadvantages. The only approaches without any significant downtime (e.g. logical replication) come with huge additional complexity. The appraoch we settled on was to create a completely new cluster from scratch, dump all data from the old system and import it into the new one.
At our size, this is still doable in a reasonable maintenance window (~ 15 minutes for the dump, 30 minutes for the restore, plus lots of time for safety checks). What an opportunity to change more things in the same maintenance window! Ultimately, we have combined a number goals that are hard to do within the running system into this one migration:
- Upgrade to the newest PostgreSQL version
- Upgrade to new hardware with recent CPUs and more network bandwidth
- Upgrade of all integer primary keys to bigints
- Getting rid of a few rows with transaction integrity errors, likely caused by a hardware failure over 2 years ago
- Getting rid of high database bloat, likely caused by wrong autovacuum configurations in the past
- Integrating PostgreSQL more tightly with our log collection system
Preparing this upgrade was a multi-month effort for many reasons, but in this blog post, let’s talk about the primary key migration.
Instead of a full dump, we split our database dump into two parts. On the old database cluster, we first dump the database schema and all database content separately:
old_host> pg_dump --schema-only pretix_prod > /path/schema-prod.sql old_host> pg_dump --data-only -j8 -Fd pretix_prod -f /path/data-prod
Later, on the new host, we import the schema first and then add the data:
new_host> psql pretix_prod < /path/schema-prod.sql new_host> pg_restore -j8 -d pretix_prod \ -Fd /path/dumps/data-prod --disable-triggers
Luckily, the data dump does not contain any type information!
So as long as the types coerce correctly on import, we’re free to change the types in the schema betewen dump and import.
All we need to do is
s/integer/bigint/g on the schema SQL!
Except we don’t really want to replace all integer columns.
We’re fine with converting all
id values to
bigint, and we also need to convert all columns with a foreign key constraint pointing to an ID column.
But there are other
integer columns, which are not IDs, that should keep their type.
To do so, we first want to look for all sequences, such as
and change their type:
Then, we want to find all columns which are generated by that sequence. In the schema dump, these look like this:
Then, we need to find the
CREATE TABLE statement including this column:
Next, we need to find all foreign key constraints pointing to our column, such as:
And again, find the
CREATE TABLE statement for this one and change it to bigint:
Phew. A regex is not going to cut it!
Luckily, someone already built a SQL parser module in Python!
I knew about
sqlparse because it is used in multiple places of the Django ecosystem.
So we can just parse the SQL, inspect the resulting data structure, follow the steps above and be done with it?
Not quite. First, sqlparse unfortunately does not parse SQL into an abstract syntax tree, it just parses it into tokens without interpreation of larger-scale language contracts. Second, apparently sqlparse has not been built to support DDL at all!
For example, the DDL statement
Would be parsed by sqlparse into the following tokens (one token per line, whitespace omitted):
CREATE TABLE public.pretixbase_order ( id integer NOT NULL, name character varying(3) NOT NULL )
NOT NULL, name together as a
IdentifierList token is… not really helpful.
Luckily, we’re not building software here that needs to be really stable and work forever with all kinds of SQL!
We just need to convert that one schema. Once. So we can tune down the perfectionism a bit.
In the end, we end up with a big for loop over all statements, walking through the tokens.
For this blog post, let’s just look at parsing the
CREATE SEQUENCE statements:
In this way, we build four mappings in total:
- A mapping of all sequence names to the token in which their type is defined (see above)
- A mapping of all field names of integer fields to the token in which their type is defined
- A mapping of all sequence names to the field name that uses them
- A mapping of all fields with a foreign key constraint to the field that they refer to
Of course, this makes a lot of assumptions, such as every sequence only being used by one field. But this assumption holds true for PostgreSQL schemas that are, as in our case, automatically generated by Django.
With this information, we can then walk through all entries in the various dictionary and figure out which tokens we need to change:
Finally, we can output the tokens as SQL again:
Tada! If we now run a
diff -u between the old and the new schema, we see exactly the changes we want to see:
CREATE TABLE public.pretixbase_order ( - id integer NOT NULL, + id bigint NOT NULL, code character varying(16) NOT NULL, ... total numeric(10,2) NOT NULL, - event_id integer NOT NULL, + event_id bigint NOT NULL, comment text NOT NULL, ... CREATE SEQUENCE public.pretixbase_order_id_seq - AS integer + AS bigint START WITH 1 INCREMENT BY 1 NO MINVALUE
If you want to see the entire script in its undocumented, unmaintained and untidy state, have a look here.
What about Django?
So far, we’ve covered this only from a database-level point of view and entirely skipped the fact that we just committed blasphemy:
We changed our database schema without going through Django’s migrations framework!
We’ll deal with this afterwords, after the migration is already live, by setting the following setting in our
Now, we can generate lots of migrations in Django:
> python manage.py makemigrations
We can safely apply them to our database without risking table rewrites since they will generate SQL that will be a no-op on the database.
Or, if we want, we can even skip them with
python manage.py migrate --fake.
That’s it …
… is what we thought in July after executing this migration. We even finished a first draft of this blog post in July! Just before we clicked “Merge” on the pull request for the Django migrations, we double-checked our production database and noticed, horrified, that not all columns had been converted to bigints correctly. In contrast to our test system, our production database still had some integer ID columns, mostly on tables that exist since the beginning of pretix.
Digging deeper, it showed that even before our migration, these tables already had a bigint sequence but an integer column. We don’t know why, we assume some old version of Django created sequences without specifying a type and bigint is the default in PostgreSQL. Since our migration script described above only looks into integer sequences, we missed converting those columns. We fixed this in our migration script and scheduled another downtime window in September for another round of dump and restore.
If you start a new project, do yourself a favour and start with
bigint right away.
This is part of the project template in the most recent Django versions, and generally good advice.
If your project (or your table) stays small forever, the additional storage overhead of
bigint will be negligible, since a few bytes times a small number of rows is still small.
If your project (or your table) grows large, you will one day be glad you chose the larger type in the first place.