MySQL Snippets: MySQL 5.6 GTID replication 'Got fatal error 1236 from master when reading data from binary log'

MySQL 5.6 GTID replication 'Got fatal error 1236 from master when reading data from binary log' typically occurs when the MASTER purges it's binary logs and your SLAVE is using either replicate-do-db or replicate-ignore-db.

Why does this happen?

It's because of the way MySQL (as of 5.6.17) is currently keeping track of your MASTER(s) Executed GTIDs. Here is an example:

mysql slave> SHOW SLAVE STATUS;

and we look at the Executed_Gtid_Set column.

22f1e449-03ba-11e3-9b29-003048f2f703:1-17957,
69cf02cd-1731-11e3-9a19-002590854928:6-11:16:21-50:55-58:63-64:69-169,
708bb615-d393-11e3-a682-003048c3ab22:6:11:16:21:26:31:36:41:46:51:56,
819c985c-d384-11e3-a621-00259002979a:1-18:34:39:44:49:54:59:64:69:74,
9204e764-d379-11e3-a5d9-0013726268ea:4:55-58:92-219:1669-1782:2329

In this example we have five (5) servers shown. Typically the first UUID (22f1e449-03ba-11e3-9b29-003048f2f703) is the local server (in this case, the SLAVE). The other UUIDs shown are the MASTER servers. Pay close attention to the comma (,) that separates each UUID GTID set!

It's important to note that in this discussion our MySQL replication topography is circular (aka looped) with four masters and one slave.

What am I looking at?

You are seeing the unique ids (UUIDs) for each server participating in the replication process - from this servers point of view (very important here). Lets dissect this a bit:

22f1e449-03ba-11e3-9b29-003048f2f703:1-17957

22f1e449-03ba-11e3-9b29-003048f2f703 is the UUID of the local server - an end-point slave - that is replicating from a MASTER server. The 1-17957 indicates what GTIDs have been executed by this server for this server. Yes, it's keeping track of itself.

69cf02cd-1731-11e3-9a19-002590854928 is the UUID of one of the four MASTER servers. Here we see the GTIDs this server (the SLAVE) has executed for server 69cf02cd-1731-11e3-9a19-002590854928 and they are: 6-11, 16, 21-50, 55-58 ... :69-169. Notice the sequence gaps in the GTIDs? Notice that it doesn't say 1-169?

OK, Why the GTID sequence gaps?

Because this SLAVE is using replicate-do-db in it's configuration (ex. my.cnf):

...
replicate-do-db=database1
replicate-do-db=database2
replicate-do-db=mysql
...

The gaps occur when the SLAVEs SLAVE_IO thread reads from the MASTERs binlog and encounters GTID's for databases that it will not replicate. In other words, the SLAVE is ignoring those GTIDs. When it does find a valid GTID (i.e. matches one of our replicate-do-db databases) it copies the transaction to be executed by the SLAVE_SQL thread. When the SLAVEs SLAVE_SQL thread successfully executes the GTID it updates its Executed_Gtid_Set and notates that a sequence skip has occurred. Hence the 6-11:16:21-50:55-58: etc.

Got it. By why the 1236 error?

This occurs because the MASTER purged its GTIDs. This purge typically occurs when the MASTER expires its binlogs (remember your expire-logs-days=XX). Now, when the SLAVEs SLAVE_IO thread attempts to resolve what has been executed locally versus what the MASTER has executed the SLAVEs SLAVE_IO thread finds that the MASTER is missing GTIDs. Remember those sequence gaps? The SLAVE is trying to resolve (fill in) in those sequence gaps from the MASTER but the MASTER says it has already purged those GTIDs! In this example the SLAVE has 6-11:16 so it's trying to locate GTIDs (transactions) 12-15 (and so on) but those GTIDs are no longer present on the MASTER server. Problem!

Big problem!

Yes, it is. A bug has been reported to MySQL here.

How do I fix this?

Fortunately, there is a way to fix it without having to dump and reload your slave from the master. This fix will require you to place the SLAVE server into read-only mode for a few minutes. You may need to perform these steps during some scheduled downtime. Plan accordingly.

Step 1: Stop your slave.

mysql slave> STOP SLAVE;

Step 2: Slave to READ-ONLY mode

mysql slave> FUSH TABLES WITH READ LOCK;
mysql slave> SET GLOBAL READ_ONLY = TRUE;

Why? Because we cannot allow the local server (SLAVE) to update it's binlog while we make our changes. So this step keeps updates from occurring (if you permit them).

Step 3: Get the Executed_GTID_SET

mysql slave> SHOW SLAVE STATUS;

Copy and paste the Executed_GTID_SET column into your favorite text editor. You are going to manually edit these GTIDs.

Step 4: Remove the GTID gaps

ORIGINAL

22f1e449-03ba-11e3-9b29-003048f2f703:1-17957,
69cf02cd-1731-11e3-9a19-002590854928:6-11:16:21-50:55-58:63-64:69-169,
708bb615-d393-11e3-a682-003048c3ab22:6:11:16:21:26:31:36:41:46:51:56,
819c985c-d384-11e3-a621-00259002979a:1-18:34:39:44:49:54:59:64:69:74,
9204e764-d379-11e3-a5d9-0013726268ea:4:55-58:92-219:1669-1782:2329

NEW

69cf02cd-1731-11e3-9a19-002590854928:1-169,
708bb615-d393-11e3-a682-003048c3ab22:1-56,
819c985c-d384-11e3-a621-00259002979a:1-74,
9204e764-d379-11e3-a5d9-0013726268ea:1-2329

Notice how I made the first number a one (1) and used the very last number from each servers set. I also removed the local servers entry (i.e. the 22f1e449-03ba-11e3-9b29-003048f2f703:1-17957). We do this because the local servers GTIDs will be reset in the next step.

DOUBLE CHECK your work! COMMAS are VERY IMPORTANT!

Step 5: Reset Master

mysql slave> RESET MASTER;

Why? Because we cannot update the global GTID_PURGED when it has data. The only current way to reset GTID_PURGED is by performing a RESET MASTER;

!! WARNING !! Executing RESET MASTER on a MASTER server is devastating. This command will reset this servers GTIDs. If you must perform these steps on a MASTER server then you will need to update the GTID's on all downstream servers too! Better think this through thoroughly!

Step 6: Set new global GTID_PURGED

Copy and paste your GTID changes from your editor. Remember to leave out the local servers GTID set.

mysql slave> SET GLOBAL GTID_PURGED="69cf02cd-1731-11e3-9a19-002590854928:1-169, 708bb615-d393-11e3-a682-003048c3ab22:1-56, 819c985c-d384-11e3-a621-00259002979a:1-74, 9204e764-d379-11e3-a5d9-0013726268ea:1-2329"

NOTICE:
If you get the error:

Error Code: 1840 @@GLOBAL.GTID_PURGED can only be set when @@GLOBAL.GTID_EXECUTED is empty.

This is because the server has updated its binlog and the GTID_PURGED has been updated. In this case you may need to execute Steps 5 and 6 simultaneously.

Step 7: Remove READ-ONLY

mysql slave> SET GLOBAL READ_ONLY = FALSE;
mysql slave> UNLOCK TABLES;

Step 8: Restart Slave

mysql slave> STOP SLAVE;
mysql slave> RESET SLAVE;
mysql slave> START SLAVE;

And, if everything was performed correctly, your slave should be back up and replicating.

MySQL Snippets

Thursday, May 29, 2014

MySQL 5.6 GTID replication 'Got fatal error 1236 from master when reading data from binary log'