Schema Migrations for Data Projects

16 Aug 2019

Introduction

Database migrations are a tool for managing the evolution of a database schema in an automated and centralized way.

Without a migration tool, you might perform schema updates on your database by manually running SQL commands.

For example imagine you want to create a database to track your favorite pizza toppings. You could create the DB manually from the command line:

psql -c "CREATE DATABASE pizza_toppings;"

Then you could run another command to add a table:

psql -d pizza_toppings -c "CREATE TABLE toppings (name text not null, ranking integer);"

But maybe you realize a simple number isn’t sufficient to express your topping rankings, so you update the table to use an enum (in this case taking advantage of a specific feature in postgresql):

psql -d pizza_toppings -c "CREATE TYPE topping_attitude AS ENUM ('like', 'dont_like');"
psql -d pizza_toppings -c "ALTER TABLE toppings DROP COLUMN ranking;"
psql -d pizza_toppings -c "ALTER TABLE toppings ADD COLUMN attitude topping_attitude;"

We can connect to the database and see that the schema ends up like this:

pizza_toppings=# \d toppings
                   Table "public.toppings"
  Column  |       Type       | Collation | Nullable | Default
----------+------------------+-----------+----------+---------
 name     | text             |           | not null |
 attitude | topping_attitude |           |          |

Manual Migration Problems

This can work well enough, especially for small projects or quick experiments. However it brings several problems:

Without talking to the engineer who configured the DB (or perhaps consulting a soon-to-be-outdated README) we have no way of knowing how things got to this state.
Manually entering commands is error prone. It’s hard to remember the right SQL syntax for various operations, and it’s easy to forget certain less-visible DB effects like adding indices or NULL constraints.
We will eventually need to re-run the process for other environments, for example to create a separate copy of the database for testing or production, and these processes will have to be excecuted manually from memory or documentation.

You’re not off the hook just because your DB is “internal”

This type of on-the-fly DB configuration is especially common in data engineering and analysis projects. Developers working on online production systems may be more accustomed to the need to automate and standardize their DB configuration, but when working on analytics DBs that will mostly be used by other internal engineers, we may be more likely to give ourselves a pass.

Don’t do this.

There’s no such thing as a “one time” setup process in modern data engineering. You will need to re-do this process sooner or later (hopefully sooner, since you’ll want another copy of the database for running all those automated tests you are writing, right?) And in that case, you’ll be happy to have some automation around it.

Proper Tools

So what can we do instead? The tool we’re looking for is called a “Database Migration” or “Schema Migration”. Perhaps confusingly Database Migrations don’t involve moving your DB between locations (for example migrating from one service provider to another). Rather the term refers to migrating the configuration of the database from one state to another. (Some tools also call this a database “evolution.”)

In practice, this comes up most often in the context of relational (e.g. SQL) databases, since these systems enforce pre-defined schemas which must be explicitly upated in order to change the structure of your data. Adding or removing tables, updating indices, or modifying null constraints are all examples of things we might do with a migration tool.

Schema Migration Benefits

Schema Migrations make the process of modifying your database’s schema predictable, repeatable, and manageable. By managing schema evolution with an automated tool rather than manual commands, we gain some important benefits:

Repeatability: You’ll be able to re-run the process reliably on different databases and different environments.
Version Control: By treating your migration scripts as code artifacts, you’re able to commit them into a repository along with the rest of your code. This helps with reviewing and communicating about potential changes.
Documentation: Similar to the previous point, your project’s migrations document how the database got to its current state and help new developers understand the history of the project.
Operability: Using a dedicated tool to run schema migrations helps you integrate the process into automated deployment pipelines.

Essence of A DB Migration Tool

At its core, a migration tool contains a few parts:

1. Migration scripts describing schema changes

Depending on the tool these could be written explicitly as SQL commands, or specified using a language-specific DSL. Most tools version migration scripts in some way, such as prepending an incrementing counter (001, 002, etc) or a timestamp (20190101) to the migration files.

2. Version-tracking mechanism

If we have a series of migrations 001_create_users.sql, 002_create_items.sql, 003_create_orders.sql, we need to know which version we’re currently on so we can decide which ones remain to be run. Many tools create a dedicated table like schema_migrations or schema_version in your database for storing this information. It may go without saying, but you should allow your migration tool to manage this table and avoid editing it manually.

Migration Workflow

With these pieces in mind, a common migration workflow might look like this:

1. Create a new migration file

To start a DB for a new e-commerce web application, we might create a migration file like 001_create_users.sql and populate it with a bit of SQL:

CREATE TABLE users(
  id integer NOT NULL,
  email text,
);

2. Run the migration

At its most primitive, we could do this manually with a combination of running the sql script:

psql -d our_database -f 001_create_users.sql

And inserting the version to our migrations table:

psql -d our_databse -c "INSERT INTO schema_migrations (version) ('001_create_users');"

However these steps will generally be handled by a migration tool for us, so often all we have to do is run some command like cool_migrator migrate up.

Bells and Whistles

On top of this basic functionality, some migration tools include various additional features such as:

Rollbacks

Some tools give you a way to define a “backwards” or “down” migration alongside each forward one. For example you might have:

-- 001_create_users.sql

-- up:
CREATE TABLE users(
  id integer NOT NULL,
  email text,
);

-- down:
DROP TABLE users;

This gives you the ability to revert a migration if you decided there was a problem or you decided you need to change something.

Some tools can even infer this for you automatically. For example the ActiveRecord migrator included with Ruby on Rails can automatically reverse common operations like create table statements.

However keep a few things in mind:

It’s up to you to ensure that your “down” step is a proper inverse of your “up.” If you rely on this functionality in production, it can be a good idea to test it in development to ensure that you have the schema operations correct.
Schema migrations generally operate as a Stack. So if you have migrated versions 001, 002, and 003, and decide you need to make a change to 001. You’ll need to revert 003 and 002 in order to get back to 001. At this point, it may be easier to just make a new version, 004, which performs whatever modifications you need.

Column Helpers

Some tools include built-in helpers for performing common schema modification tasks, like adding inserted and modified timestamps to a table, or configuring primary keys. This is especially common with DSL-based tools like ActiveRecord, Django, or Ecto migrations. This can be surprisingly helpful in enforcing common standards, so that you don’t end up using different timestamp implementations on different tables in your application.

Script Fingerprinting

Some tools record a hash (such as an MD5) of a migration script’s contents when it is run. This can help you prevent a conflict if you accidentally modified the file and tried to run it again;

Script Generation

Many tools include command-line helpers for generating new migration files with the proper naming conventions (like versioning) and some of the common boilerplate filled in.

Configuration

A common theme among schema migration tools is that connection configuration is specified separately from the code that defines your actual migration operations. Different tools have different techniques for this – it could be a DATABASE_URL system var (like many Twelve-Factor apps use), or some combination of a JSON or YAML file and a command-line flag.

The important part is that this configurability allows you to run migrations independently and repeatably across many environments. You might have 2 copies of the db on each developers local machine (one for testing and one for development), one for staging, and one for production, and they can all be reliably kept in sync.

Migration Tooling Lay of the Land

Ok, enough background, how do we get one? There are a lot of tools out there, here is an incomplete listing of them.

ORM-Bundled Solutions

Many full-featured ORM libraries include their own implementation of a Schema Migrator.

Perhaps the most famous is ActiveRecord, the ORM (and migration tool) that ships with Ruby on Rails. ActiveRecord in particular popularized some of the quality-of-life features we have come to expect from these tools, like having a clean DSL (using Ruby in this case) for defining migrations, and including CLI commands for common operations.

Django similarly includes a bundled migration solution along with its ORM. Django’s implementation includes an interesting feature which can infer necessary migrations by looking at changes in your model definitions:

Working with migrations is simple. Make changes to your models - say, add a field and remove a model - and then run makemigrations:

[…]

Your models will be scanned and compared to the versions currently contained in your migration files, and then a new set of migrations will be written out.

The list goes on: Ecto (Elixir), Laravel (PHP), Entity Framework (.NET / C#), and Play (Scala/Java) all include their own solutions. So if you’re using an ORM or a full-stack web framework, check to see if it includes built-in migration support.

Standalone Tools

However, you may not be using one of these larger tools, so it’s nice to have standalone (and potentially smaller or more lightweight) options. There are plenty of these as well. Here are a few:

Flyway is a popular choice in the JVM ecosystem. It operates separately from any ORM or runtime database library, and has integrations with popular JVM build tools like Maven, Gradle, or SBT. This seems to be a popular option for people using Java-based persistence libraries like Hibernate or Spring.
In node.js, as is often the case, you have as many options as you are willing to spend time researching. node-db-migrate, node-pg-migrate, and sequelize all seem like popular solutions (sequelize being included with a popular ORM framework as well).
Ruby’s Sequel and Python’s SQLAlchemy (in the form of Alembic), both include migration options.

A recent favorite: dbmate

Finally, a great option I have been enjoying lately is dbmate.

This is a standalone, library and language-independent migration tool. dbmate itself is written in Go, so it can be easily built as a standalone binary for many platforms (on Mac OS you can install it with brew install dbmate). It’s designed to run as a separate tool from your runtime application process, which gives you a lot of flexibility in how you integrate it with your deployment workflow. And since it’s deployed as a standalone binary, you don’t need to integrate an additional language dependency chain into your environment just to support your migration tool.

The dbmate CLI includes many of the common commands you’ll find in ActiveRecord or other tools:

COMMANDS:
     new, n          Generate a new migration file
     up              Create database (if necessary) and migrate to the latest version
     create          Create database
     drop            Drop database (if it exists)
     migrate         Migrate to the latest version
     rollback, down  Rollback the most recent migration
     dump            Write the database schema to disk
     wait            Wait for the database to become available
     help, h         Shows a list of commands or help for one command

By default, you store your migrations in a db/migrations directory in your project, and you can generate a new one with the new command: dbmate new create_users_table.

This will generate a placeholder file which you can then populate with the contents of your migration:

-- migrate:up
create table users (
  id integer,
  name varchar(255),
  email varchar(255) not null
);

-- migrate:down
drop table users;

Then you can run dbmate up to create the database (if necessary) and migrate it to the latest version.

dbmate uses the environment variable DATABASE_URL for its configuration, so a full example might look like:

DATABASE_URL="postgres://postgres:postgres@localhost:5432/my_db" dbmate up
Creating: my_db
Applying: 20190816175153_create_users.sql
Writing: ./db/schema.sql

And we can see what it created here:

psql -d my_db
psql (10.8 (Ubuntu 10.8-1.pgdg16.04+1))
Type "help" for help.

my_db=> \dt
               List of relations
 Schema |       Name        | Type  |  Owner
--------+-------------------+-------+----------
 public | schema_migrations | table | postgres
 public | users             | table | postgres
(2 rows)

Note that dbmate automatically created the schema_migrations table which it uses to track schema versions.

I like this tool for data engineering projects in particular, since we often have unconventional deployment models.

For example my “application” might be a collection of Airflow DAGs that manage ETL in and out of some database. Projects like these often don’t revolve around a core framework or application server, so sometimes it can be simpler to just manage database migrations as an independent process. And this is where a standalone tool like dbmate can really shine.

Postscript: What about non-SQL DBs?

More to come!