PGH Web


Schema Migrations Without Downtime (and Why You Should Practice Now)

Jeff Straney

I did my first production schema migration at 2am on a Sunday because a critical table was running out of space and nothing else could wait. It took three hours. The table was locked for 20 minutes. Everyone was awake. It was not fun.

That was the moment I learned that schema migrations are not something you figure out when you need them. They are something you practice when the stakes are low, so that when you do need one at 2am, you at least know what you are doing.

Why Migrations Are Scary

A VACUUM FULL locks a table for the entire rewrite. An ALTER TABLE ... SET DATA TYPE locks a table while every row is rewritten. Even an ALTER TABLE ADD COLUMN needs an exclusive lock, however brief. Any of these can mean your table is inaccessible for the duration of the operation. On a small table, that is milliseconds. On a table with millions of rows, that is minutes. On a table with millions of rows being written to constantly, the lock queue backs up and now you have a cascade of timeout errors.
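One practical safeguard before running any DDL by hand: set a lock timeout, so the statement fails fast instead of sitting in the lock queue while everything piles up behind it. A minimal sketch (the timeout value and column name are illustrative):

-- Fail after 5 seconds instead of queueing indefinitely behind
-- a long-running transaction and blocking every query behind us.
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN some_column TEXT;  -- hypothetical example

If the ALTER times out, nothing has changed, and you can simply retry at a quieter moment.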

Most people's first instinct is to avoid migrations altogether. Denormalize the schema early so you never have to change it. Never rename columns. Never change types. Never add constraints. Build for the schema you think you might need in five years.

The problem with that approach is that you are wrong about what you will need. You are always wrong. Building for a hypothetical future is worse than building for now and being willing to change it later.

What Actually Works

Schema migrations happen in phases: a pre-change migration, a code change, and a cleanup migration. Each phase is safe in isolation.

Let's say you want to rename a column from user_name to username. You cannot just rename it because your code is still reading from user_name. You have to:

  1. Create the new column
  2. Deploy code that writes to both columns
  3. Run a backfill to copy data from old to new
  4. Deploy code that reads from the new column but still writes to both
  5. Run a cleanup to delete the old column

Each step is safe. If step 3 takes 20 minutes because the table is large, the table is not locked. If step 4 goes wrong, you roll back to step 2 and try again without losing data.

For adding a column, the approach is similar:

  1. Add the column (since PostgreSQL 11, adding a column with a default is fast, because the default is stored as metadata rather than rewriting every row)
  2. Deploy code that populates the column with real values
  3. If the column cannot be nullable, add the NOT NULL constraint once all rows have values (not before)
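For step 3, PostgreSQL has a way to add the constraint without scanning the whole table under an exclusive lock: add a CHECK constraint as NOT VALID (which is instant), then VALIDATE it separately, which checks existing rows without blocking writes. A sketch, with a hypothetical column name:

-- Instant: no scan, only a brief lock to record the constraint.
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;

-- Scans existing rows, but does not block reads or writes.
ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;

Since PostgreSQL 12, a subsequent SET NOT NULL can use the validated check constraint as proof and skip its own full-table scan.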

For dropping a column, reverse the order:

  1. Deploy code that stops using the column
  2. Run a cleanup migration that drops it
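The cleanup migration itself is cheap: in PostgreSQL, DROP COLUMN is a metadata change that does not rewrite the table. For example (column name hypothetical):

-- Fast: marks the column dropped; space is reclaimed gradually.
ALTER TABLE users DROP COLUMN IF EXISTS legacy_nickname;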

Why This Matters Before You Need It

The first time you do this, it will go wrong. You will forget that PostgreSQL locks the table during constraint application. You will run the migration at the wrong time. You will get confused about which code version is deployed. Something will go wrong. That is fine when the table is small and nobody is using it.

If the first time you do this is when the table has millions of rows and your customers are waiting, the mistakes are more expensive.

I practice on a small test table. I write the migration. I deploy code that uses it. I understand how the phases work. I know what locking looks like. Then when I have to do it for real, I have done it before and I am just doing it bigger.

The Practical Commands

For PostgreSQL, migrations are usually expressed as SQL files that run in order. There are tools like Flyway or Liquibase that handle versioning, but at its core it is just "run this SQL in this order and record that you ran it."
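The record-keeping part is simple enough to sketch. Tools like Flyway maintain a table along these lines (the names here are illustrative, not Flyway's actual schema):

CREATE TABLE IF NOT EXISTS schema_migrations (
    version    TEXT PRIMARY KEY,               -- e.g. the file's timestamp prefix
    applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- After each migration file runs successfully:
INSERT INTO schema_migrations (version) VALUES ('20240115120000');

On startup, the tool compares this table against the files on disk and runs whatever has not been recorded yet, in order.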

For a column rename, you need two migration files separated by code deploys. The first migration adds the new column:

ALTER TABLE users ADD COLUMN full_name TEXT;
-- CONCURRENTLY builds the index without blocking writes;
-- note it cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_full_name ON users (full_name);

At this point you deploy code that writes to both name and full_name. You run a backfill script to copy existing data. Once the data is in sync, you deploy code that reads from full_name. Only then do you run the second migration:

ALTER TABLE users DROP COLUMN name;

Each SQL file is safe in isolation. The dangerous part is doing both migrations at once while the old code is still reading from name.
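The backfill script does not have to be clever. A common pattern is to update in small batches, so no single transaction holds locks on many rows at once; you run it repeatedly until it reports zero rows updated. A sketch (batch size illustrative; it assumes name is NOT NULL, otherwise filter those rows out so the loop terminates):

-- Copy name into full_name, 1000 rows at a time.
-- Re-run until UPDATE reports 0 rows.
UPDATE users
SET full_name = name
WHERE id IN (
    SELECT id
    FROM users
    WHERE full_name IS NULL
    ORDER BY id
    LIMIT 1000
);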

Each migration is a separate file with a timestamp. The system records which ones have run. If a migration fails, you fix it and run it again (or write a new one that undoes it).

The Learning is Cheap

The thing I did not understand until I had been burned is that learning migrations on a small table is nearly free. The steps are identical whether the table has 100 rows or 100 million; only the waiting is shorter. You learn the same lessons. You just learn them faster.

So practice. Add a column to a test table. Backfill it. Drop the old one. Rename something. Get comfortable with the fact that migrations lock tables, and that the goal is to minimize lock time by splitting the work into phases that hold locks only briefly or not at all.

Then when the 2am emergency comes and you really need to expand a table or add a constraint, you are not learning how to do it. You are just doing it.