Schema Migrations for Data Projects
Database migrations are a tool for managing the evolution of a database schema in an automated and centralized way.
Without a migration tool, you might perform schema updates on your database by manually running SQL commands.
For example imagine you want to create a database to track your favorite pizza toppings. You could create the DB manually from the command line:
psql -c "CREATE DATABASE pizza_toppings;"
Then you could run another command to add a table:
psql -d pizza_toppings -c "CREATE TABLE toppings (name text not null, ranking integer);"
But maybe you realize a simple number isn’t sufficient to express your topping rankings, so you update the table to use an enum (in this case taking advantage of a specific feature in postgresql):
psql -d pizza_toppings -c "CREATE TYPE topping_attitude AS ENUM ('like', 'dont_like');" psql -d pizza_toppings -c "ALTER TABLE toppings DROP COLUMN ranking;" psql -d pizza_toppings -c "ALTER TABLE toppings ADD COLUMN attitude topping_attitude;"
We can connect to the database and see that the schema ends up like this:
pizza_toppings=# \d toppings Table "public.toppings" Column | Type | Collation | Nullable | Default ----------+------------------+-----------+----------+--------- name | text | | not null | attitude | topping_attitude | | |
Manual Migration Problems
This can work well enough, especially for small projects or quick experiments. However it brings several problems:
- Without talking to the engineer who configured the DB (or perhaps consulting a soon-to-be-outdated README) we have no way of knowing how things got to this state.
- Manually entering commands is error prone. It’s hard to remember the right SQL syntax for various operations, and it’s easy to forget certain less-visible DB effects like adding indices or NULL constraints.
- We will eventually need to re-run the process for other environments, for example to create a separate copy of the database for testing or production, and these processes will have to be excecuted manually from memory or documentation.
You’re not off the hook just because your DB is “internal”
This type of on-the-fly DB configuration is especially common in data engineering and analysis projects. Developers working on online production systems may be more accustomed to the need to automate and standardize their DB configuration, but when working on analytics DBs that will mostly be used by other internal engineers, we may be more likely to give ourselves a pass.
Don’t do this.
There’s no such thing as a “one time” setup process in modern data engineering. You will need to re-do this process sooner or later (hopefully sooner, since you’ll want another copy of the database for running all those automated tests you are writing, right?) And in that case, you’ll be happy to have some automation around it.
So what can we do instead? The tool we’re looking for is called a “Database Migration” or “Schema Migration”. Perhaps confusingly Database Migrations don’t involve moving your DB between locations (for example migrating from one service provider to another). Rather the term refers to migrating the configuration of the database from one state to another. (Some tools also call this a database “evolution.”)
In practice, this comes up most often in the context of relational (e.g. SQL) databases, since these systems enforce pre-defined schemas which must be explicitly upated in order to change the structure of your data. Adding or removing tables, updating indices, or modifying null constraints are all examples of things we might do with a migration tool.
Schema Migration Benefits
Schema Migrations make the process of modifying your database’s schema predictable, repeatable, and manageable. By managing schema evolution with an automated tool rather than manual commands, we gain some important benefits:
- Repeatability: You’ll be able to re-run the process reliably on different databases and different environments.
- Version Control: By treating your migration scripts as code artifacts, you’re able to commit them into a repository along with the rest of your code. This helps with reviewing and communicating about potential changes.
- Documentation: Similar to the previous point, your project’s migrations document how the database got to its current state and help new developers understand the history of the project.
- Operability: Using a dedicated tool to run schema migrations helps you integrate the process into automated deployment pipelines.
Essence of A DB Migration Tool
At its core, a migration tool contains a few parts:
1. Migration scripts describing schema changes
Depending on the tool these could be written explicitly as SQL commands, or specified using a language-specific DSL. Most tools version migration scripts in some way, such as prepending an incrementing counter (
002, etc) or a timestamp (
20190101) to the migration files.
2. Version-tracking mechanism
If we have a series of migrations
003_create_orders.sql, we need to know which version we’re currently on so we can decide which ones remain to be run. Many tools create a dedicated table like
schema_version in your database for storing this information. It may go without saying, but you should allow your migration tool to manage this table and avoid editing it manually.
With these pieces in mind, a common migration workflow might look like this:
1. Create a new migration file
To start a DB for a new e-commerce web application, we might create a migration file like
001_create_users.sql and populate it with a bit of SQL:
CREATE TABLE users( id integer NOT NULL, email text, );
2. Run the migration
At its most primitive, we could do this manually with a combination of running the sql script:
psql -d our_database -f 001_create_users.sql
And inserting the version to our migrations table:
psql -d our_databse -c "INSERT INTO schema_migrations (version) ('001_create_users');"
However these steps will generally be handled by a migration tool for us, so often all we have to do is run some command like
cool_migrator migrate up.
Bells and Whistles
On top of this basic functionality, some migration tools include various additional features such as:
Some tools give you a way to define a “backwards” or “down” migration alongside each forward one. For example you might have:
-- 001_create_users.sql -- up: CREATE TABLE users( id integer NOT NULL, email text, ); -- down: DROP TABLE users;
This gives you the ability to revert a migration if you decided there was a problem or you decided you need to change something.
Some tools can even infer this for you automatically. For example the ActiveRecord migrator included with Ruby on Rails can automatically reverse common operations like create table statements.
However keep a few things in mind:
- It’s up to you to ensure that your “down” step is a proper inverse of your “up.” If you rely on this functionality in production, it can be a good idea to test it in development to ensure that you have the schema operations correct.
- Schema migrations generally operate as a Stack. So if you have migrated versions
003, and decide you need to make a change to
001. You’ll need to revert
002in order to get back to
001. At this point, it may be easier to just make a new version,
004, which performs whatever modifications you need.
Some tools include built-in helpers for performing common schema modification tasks, like adding inserted and modified timestamps to a table, or configuring primary keys. This is especially common with DSL-based tools like ActiveRecord, Django, or Ecto migrations. This can be surprisingly helpful in enforcing common standards, so that you don’t end up using different timestamp implementations on different tables in your application.
Some tools record a hash (such as an MD5) of a migration script’s contents when it is run. This can help you prevent a conflict if you accidentally modified the file and tried to run it again;
Many tools include command-line helpers for generating new migration files with the proper naming conventions (like versioning) and some of the common boilerplate filled in.
A common theme among schema migration tools is that connection configuration is specified separately from the code that defines your actual migration operations. Different tools have different techniques for this – it could be a
DATABASE_URL system var (like many Twelve-Factor apps use), or some combination of a JSON or YAML file and a command-line flag.
The important part is that this configurability allows you to run migrations independently and repeatably across many environments. You might have 2 copies of the db on each developers local machine (one for testing and one for development), one for staging, and one for production, and they can all be reliably kept in sync.
Migration Tooling Lay of the Land
Ok, enough background, how do we get one? There are a lot of tools out there, here is an incomplete listing of them.
Many full-featured ORM libraries include their own implementation of a Schema Migrator.
Perhaps the most famous is ActiveRecord, the ORM (and migration tool) that ships with Ruby on Rails. ActiveRecord in particular popularized some of the quality-of-life features we have come to expect from these tools, like having a clean DSL (using Ruby in this case) for defining migrations, and including CLI commands for common operations.
Django similarly includes a bundled migration solution along with its ORM. Django’s implementation includes an interesting feature which can infer necessary migrations by looking at changes in your model definitions:
Working with migrations is simple. Make changes to your models - say, add a field and remove a model - and then run makemigrations:
Your models will be scanned and compared to the versions currently contained in your migration files, and then a new set of migrations will be written out.
The list goes on: Ecto (Elixir), Laravel (PHP), Entity Framework (.NET / C#), and Play (Scala/Java) all include their own solutions. So if you’re using an ORM or a full-stack web framework, check to see if it includes built-in migration support.
However, you may not be using one of these larger tools, so it’s nice to have standalone (and potentially smaller or more lightweight) options. There are plenty of these as well. Here are a few:
- Flyway is a popular choice in the JVM ecosystem. It operates separately from any ORM or runtime database library, and has integrations with popular JVM build tools like Maven, Gradle, or SBT. This seems to be a popular option for people using Java-based persistence libraries like Hibernate or Spring.
- In node.js, as is often the case, you have as many options as you are willing to spend time researching. node-db-migrate, node-pg-migrate, and sequelize all seem like popular solutions (sequelize being included with a popular ORM framework as well).
- Ruby’s Sequel and Python’s SQLAlchemy (in the form of Alembic), both include migration options.
A recent favorite: dbmate
Finally, a great option I have been enjoying lately is dbmate.
This is a standalone, library and language-independent migration tool. dbmate itself is written in Go, so it can be easily built as a standalone binary for many platforms (on Mac OS you can install it with
brew install dbmate). It’s designed to run as a separate tool from your runtime application process, which gives you a lot of flexibility in how you integrate it with your deployment workflow. And since it’s deployed as a standalone binary, you don’t need to integrate an additional language dependency chain into your environment just to support your migration tool.
The dbmate CLI includes many of the common commands you’ll find in ActiveRecord or other tools:
COMMANDS: new, n Generate a new migration file up Create database (if necessary) and migrate to the latest version create Create database drop Drop database (if it exists) migrate Migrate to the latest version rollback, down Rollback the most recent migration dump Write the database schema to disk wait Wait for the database to become available help, h Shows a list of commands or help for one command
By default, you store your migrations in a
db/migrations directory in your project, and you can generate a new one with the
dbmate new create_users_table.
This will generate a placeholder file which you can then populate with the contents of your migration:
-- migrate:up create table users ( id integer, name varchar(255), email varchar(255) not null ); -- migrate:down drop table users;
Then you can run
dbmate up to create the database (if necessary) and migrate it to the latest version.
dbmate uses the environment variable
DATABASE_URL for its configuration, so a full example might look like:
DATABASE_URL="postgres://postgres:postgres@localhost:5432/my_db" dbmate up Creating: my_db Applying: 20190816175153_create_users.sql Writing: ./db/schema.sql
And we can see what it created here:
psql -d my_db psql (10.8 (Ubuntu 10.8-1.pgdg16.04+1)) Type "help" for help. my_db=> \dt List of relations Schema | Name | Type | Owner --------+-------------------+-------+---------- public | schema_migrations | table | postgres public | users | table | postgres (2 rows)
Note that dbmate automatically created the
schema_migrations table which it uses to track schema versions.
I like this tool for data engineering projects in particular, since we often have unconventional deployment models.
For example my “application” might be a collection of Airflow DAGs that manage ETL in and out of some database. Projects like these often don’t revolve around a core framework or application server, so sometimes it can be simpler to just manage database migrations as an independent process. And this is where a standalone tool like dbmate can really shine.
Postscript: What about non-SQL DBs?
More to come!