Zero Downtime DB Migration
You are sitting peacefully in your cosy little corner of the office completely minding your own business and then suddenly your come up with an idea that can skyrocket the interactions on your app. A few API calls to another micro-service, a DB migration and done for the day. It seemed like a trivial task. But destiny had something else written for me.
As a software developer, once in a while you get some task that seems quite straightforward in the first go, but in later stages it surprises you with unforeseen issues and blockers. To be honest, this is part of a developer’s life. Solving these not-so-trivial problems makes engineering interesting!
So here it was, another non trivial problem to solve on a normal day. The problem that I faced that day excited me to a level that I knew this is no trivial everyday challenge we face as a developer.
The way I had to come up with the solution to the problem involved migrating one of the tables of a database (adding a new column to a table). A critical microservice is using this very database through which all the authentication and authorisation occurs. This sounds like a trivial task, but it wasn’t. The table that needs to be altered contained 22 Million rows, and algorithm inplace was not working. MySQL gave an error asking me to try algorithm copy. (Disclaimer: If you are not familiar with terms like algorithm copy/inplace, no worries, I will explain later in the article). Using algorithm copy instead of inplace would involve write locks to the table for the duration of the migration (which would be in hours). Such a lock will render the database unusable for the duration of the update.
Of course, we cannot afford such a downtime.
What is algorithm copy and algorithm inplace in respect to MySQL?
Algorithm inplace — As the name suggests, it alters the schema of a table without creating a temporary table of the original one, rather it make changes in the original table itself. It doesn’t cause Read Write Locks to the original table for DML operations (Data Manipulation Language) during alteration of table schema (DDL). e.g. adding a new column to the existing table.
Algorithm Copy — As the name suggests, it alters the schema of the existing table by creating a new temporary table with the altered schema (in our example, adding a new column), migrates the data to the new temporary table, changes the linkage to new table, drops the old table, done. But there’s an issue, as MySQL docs will tell.
An ALTER TABLE operation run with the ALGORITHM=COPY clause prevents concurrent DML operations. Concurrent queries are still allowed. That is, a table-copying operation always includes at least the concurrency restrictions of LOCK=SHARED(allow queries but not DML). You can further restrict concurrency for such operations by specifying LOCK=EXCLUSIVE, which prevents DML and queries.
So in layman terms, copy allow reads, but not writes/updates depending upon which locking strategy is used for the operation.
Why did algorithm inplace not work in my scenario?
A little background, the table I wanted to migrate was built in MySQL version 5.5, and now the DB got upgraded to MySQL version 5.6.
When algorithm inplace, did not work, I dug deeper into the MySQL docs and I found this
As of MySQL 5.6.16, ALTER TABLE upgrades MySQL 5.5 temporal columns to 5.6 format for ADD COLUMN, CHANGE COLUMN, MODIFY COLUMN, ADD INDEX, and FORCE operations. This conversion cannot be done using the INPLACE algorithm because the table must be rebuilt, so specifying ALGORITHM=INPLACE in these cases results in an error. Specify ALGORITHM=COPY if necessary.
In layman terms, MySQL suggests me to use algorithm copy, which I can’t on our production instance.(read downtime of few hours, where we can’t even afford few minutes).
So, what are temporal columns? On spending more time with the docs, I found this!!
Limitiations to online DDL operations:
InnoDBtables created before MySQL 5.6 do not support
ALTER TABLE ... ALGORITHM=INPLACEfor tables that include temporal columns (
TIMESTAMP) and have not been rebuilt using
ALTER TABLE ... ALGORITHM=COPY. In this case, an
ALTER TABLE ... ALGORITHM=INPLACEoperation returns the following error:
ERROR 1846 (0A000): ALGORITHM=INPLACE is not supported. Reason: Cannot change column type INPLACE. Try ALGORITHM=COPY.
A little overview of temporal types -
The date and time types for representing temporal values are DATE , TIME , DATETIME , TIMESTAMP , and YEAR . Each temporal type has a range of valid values, as well as a “zero” value that may be used when you specify an invalid value that MySQL cannot represent.
The datatypes which stores date, time, datetime, timestamp, etc are temporal types.
Finally I got the reason why inplace did not work. The reason was written clearly in the docs.
The table that I want to add column to was built in MySQL version 5.5
This table has temporal types.
MySQL version has been updated to 5.6
Any change in schema for this table can’t be done using algorithm inplace because the schema needs to be rebuilt.
To change the schema now, algorithm copy has to be run first which will change the schema by rebuilding it.
After the schema has been rebuilt using copy, any subsequent changes to the schema can be performed using inplace.
And hence I have no other option left (at least that’s what MySQL docs were suggesting me).
I have to use algorithm copy.
Solution : pt-online-schema-change by percona toolkit
I approached the go-to-guy in our tech department, and with Flash like speed he suggested me to use pt-online-schema-change to mitigate the above mentioned problem.
So what exactly does this tool do? I went through the documentation (https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html) and came to know that it will solve my problem.
The documentation has the solution to our problem.
pt-online-schema-change alters a table’s structure without blocking reads or writes. pt-online-schema-change emulates the way that MySQL alters tables internally, but it works on a copy of the table you wish to alter. This means that the original table is not locked, and clients may continue to read and change data in it. pt-online-schema-change works by creating an empty copy of the table to alter, modifying it as desired, and then copying rows from the original table into the new table. When the copy is complete, it moves away the original table and replaces it with the new one. By default, it also drops the original table.
Any modifications to data in the original tables during the copy will be reflected in the new table, because the tool creates triggers on the original table to update the corresponding rows in the new table. The use of triggers means that the tool will not work if any triggers are already defined on the table.
So, in short, here is what pt-online-schema-change does
Creates a new temporary table with updated schema.
Copies data from old table to the newly created table in chunks.
Doesn’t put locks on read.
Doesn’t put locks on writes.
Creates triggers that will copy the newly added data in old table to the new one.
Perform an atomic RENAME operation on the new table.
Drops the old table.
Issue seems to resolve, ran the following two commands on the server for migration -
sudo apt-get install percona-toolkit pt-online-schema-change --alter "Add column <new_column> varchar(60) NOT NULL default <default_value>" --user=root D=<database_name>,t=<table_name> --execute
And… *drum roll*
of course it didn’t work the first time. Figured that we ran out of memory as pt-online-schema-change require more memory to run. So the first attempt was a failure.
We increased the memory and tried again, so a second attempt, *** drum roll again ***
Bingo it worked! But just to give you a heads-up, we started the operation at around 11:30 PM, and by the time it got completed it was already 5:45 AM! It took around 6 whopping hours to run the migration for 22 Million rows, but the best part for this operation -there was no downtime at all. Phew!!! Things done, day saved, another all-nighter, something new learnt!