Mimeo - DML Replication
For the last introductory post about mimeo, I’ll be talking about DML replication (previous posts here and here). This is the most common way replication is done on a per table basis (at least that I’ve seen). Typically, a trigger is placed on the source table that tracks all changes (INSERTS, UPDATES & DELETES) and then some mechanism is used to replay those statements on the destination.
For mimeo, this is done with a queue table that just contains the primary key columns to note that a change was done to that row. The trigger places the primary key values into a queue table (also located on the source system) and then mimeo reads the queue table values to replay them on the destination. Saying that the statements are just replayed on the destination is really simplifying things though. While that is technically a legitimate way to replicate table changes, it is far from the most efficient. What mimeo actually does is
- Grab all queue table values, using a SELECT DISTINCT to only get a single copy of each row changed (since multiple changes to the same row put the same value into the queue table multiple times).
- Grab the full row from the source using the primary key to get the most recent values.
- Perform a DELETE … USING … command, removing all rows from the destination table that have a matching primary key value in the queue table.
- INSERT full rows from step 2 into the destination table.
- Clear the processed rows from the queue table.
This method is much more efficient because
- Even if a row is updated 100,000 times between refresh runs, only one update is ever run on the destination with the latest value of that row.
- And since all rows that were changed are deleted from the destination, this avoids having to check if something was actually an update or a delete. If it was an update, it will be reinserted from the queue. If it was deleted, it won’t be in the queue for insert.
And since this is all done in a single transaction on the destination, the result appears exactly the same as if the statements had actually been replayed as they happened on the source.
Mimeo also has a specialized DML replication method that can be useful in a data warehousing environment. One common need is to preserve deleted rows, but not track every single update done to a row. Just the last value that row has needs to be kept for archive purposes. The log deletion (logdel) replication method can provide this. It basically uses the same method as the normal DML above, but the trigger & queue table on the source are a little different. The queue table has the same columns as the source table as well as an extra timestamp column that records when a row was deleted. For an insert or update, just the primary key values are stored in the queue table, but for deletes the entire row gets stored. The replication steps are pretty much the same as DML except there’s an extra one to insert the deleted rows. And the destination table has an extra timestamp column as well to record when that row was deleted on the source.
So that’s basically how mimeo dml replication works. The dml/logdel maker functions take care of setting up the source table triggers, trigger functions, and queue tables for you as long as you’ve got the permissions set properly. The rest of the replication methods also have maker & destroyer functions to make setup and tear down easier. I’ll have further blog posts with some tips and use cases of how we’ve put mimeo to use for our clients. If you’ve got any questions or suggestions, please feel free to post here, on github or poke me on freenode IRC in #postgresql.