Sunday, May 28, 2017

How to break MySQL InnoDB cluster

A few weeks ago I started experimenting with MySQL InnoDB cluster. As part of the testing, I tried to kill a node to see what happens to the cluster.

The good news is that the cluster is resilient. When the primary node goes missing, the cluster replaces it immediately, and operations continue. This is one of the features of a High Availability system, but this feature alone does not define the usefulness or the robustness of the system. In one of my previous jobs, I worked on testing a commercial HA system, and I learned a few things about what makes a system reliable.

Armed with this knowledge, I did some more experiments with InnoDB Cluster. The attempt from my previous article had no other expectation than seeing operations continue with ease (primary node replacement.) In this article, I examine a few more features of an HA system:

  • Making sure that a failed primary node does not try to force itself back into the cluster;
  • Properly welcoming a failed node into the cluster;
  • Handling a Split Brain cluster.

To explore the above features (or lack thereof), we are going to simulate some mundane occurrences. We start with the same cluster seen in the previous article, using Docker InnoDB Cluster. The initial state is:

{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr1:3306",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    }
}

The first experiment is to restart a non-primary node

$ docker restart mysqlgr2

and see what happens to the cluster

$ ./tests/check_cluster.sh | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

The cluster detects that one member is missing. But after a few seconds, it goes back to normality:

$ ./tests/check_cluster.sh | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK",
    "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"
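
The grep filter is a quick way to eyeball the status. For a scripted check, the same fields could be pulled from the parsed JSON instead. What follows is a minimal sketch, assuming the status document has the shape shown in the initial state above; summarizeStatus and the sample object are hypothetical illustrations, not part of MySQL Shell or of the test scripts.

```javascript
// Hypothetical helper: condenses a cluster status document (same shape as
// the initial state shown earlier) into the fields the grep filter displays.
function summarizeStatus(doc) {
  const rs = doc.defaultReplicaSet;
  const members = Object.values(rs.topology).map((m) => ({
    address: m.address,
    status: m.status,
  }));
  return { primary: rs.primary, status: rs.status, members };
}

// Sample input mirroring the output seen while mysqlgr2 was restarting.
// Only the fields used by the helper are filled in.
const sample = {
  clusterName: "testcluster",
  defaultReplicaSet: {
    name: "default",
    primary: "mysqlgr1:3306",
    status: "OK_NO_TOLERANCE",
    topology: {
      "mysqlgr1:3306": { address: "mysqlgr1:3306", status: "ONLINE" },
      "mysqlgr2:3306": { address: "mysqlgr2:3306", status: "(MISSING)" },
      "mysqlgr3:3306": { address: "mysqlgr3:3306", status: "ONLINE" },
    },
  },
};

const summary = summarizeStatus(sample);
// Every member whose status is not ONLINE is worth a second look.
const missing = summary.members.filter((m) => m.status !== "ONLINE");
console.log(summary.primary, summary.status, missing.map((m) => m.address));
```

Working on the parsed structure keeps each status attached to its address, which the flat grep output only conveys by line order.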

This looks good. Now, let's do the same to the primary node

$ docker restart mysqlgr1

$ ./tests/check_cluster.sh 2 | grep 'primary\|address\|status'
    "primary": "mysqlgr2:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

As before, the cluster detects that a node is missing, and excludes it from the cluster. Since it was the primary node, another one becomes primary.

However, this time the node does not come back into the cluster. Checking the cluster status again after several minutes, node 1 is still reported missing. This is not a bug. This is a feature of well-behaved HA systems: a primary node that has already been replaced should not come back to the cluster automatically.

This experiment also went well. Now, for the interesting part, let's look at the split-brain situation.

At this moment, there are two parts of the cluster, and each one sees it in a different way. The view from the current primary node is the one reported above, and it is what we would expect: node 1 is not available. But if we ask node 1 for the cluster status, we get a different picture:

$ ./tests/check_cluster.sh 1 | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "(MISSING)"

Node 1 thinks it's the primary and that the other two nodes are missing. Nodes 2 and 3 think that node 1 is missing.
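
This kind of disagreement can be detected mechanically by asking every node for its view and comparing the answers. Here is a minimal sketch, assuming each view has already been reduced to the primary that the node reports. The function groupByReportedPrimary and the views object are hypothetical, and node 3's view is assumed to match node 2's, since only the views from nodes 1 and 2 were captured at this point.

```javascript
// Hypothetical check: group nodes by the primary they report. More than one
// group means the nodes disagree about who the primary is (split brain).
function groupByReportedPrimary(views) {
  const groups = new Map();
  for (const [node, view] of Object.entries(views)) {
    if (!groups.has(view.primary)) groups.set(view.primary, []);
    groups.get(view.primary).push(node);
  }
  return groups;
}

// Node 1 reports itself as primary; node 2 reports mysqlgr2.
// ASSUMPTION: node 3 agrees with node 2 (its output is not shown above).
const views = {
  "mysqlgr1:3306": { primary: "mysqlgr1:3306" },
  "mysqlgr2:3306": { primary: "mysqlgr2:3306" },
  "mysqlgr3:3306": { primary: "mysqlgr2:3306" },
};

const groups = groupByReportedPrimary(views);
const splitBrain = groups.size > 1;
console.log(splitBrain, [...groups.entries()]);
```

With these inputs the nodes fall into two groups, one per reported primary, which is the situation described above.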

In a sane system, the logical way to operate is to admit the failed node back into the cluster, after checking that it is safe to do so. In the InnoDB cluster management there is a rejoinInstance method that allows us to get an instance back:

$ docker exec -it mysqlgr2 mysqlsh --uri root@mysqlgr2:3306 -p$(cat secretpassword.txt)

mysql-js> cluster = dba.getCluster()
<Cluster:testcluster>

mysql-js> cluster.rejoinInstance('mysqlgr1:3306')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@mysqlgr1:3306':
Rejoining instance to the cluster ...

The instance 'root@mysqlgr1:3306' was successfully rejoined on the cluster.

The instance 'mysqlgr1:3306' was successfully added to the MySQL Cluster.

Sounds good, eh? Apparently, we have node 1 back in the fold. Let's check:

$ ./tests/check_cluster.sh 2 | grep 'primary\|address\|status'
    "primary": "mysqlgr2:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

Nope. Node 1 is still missing. And if we try to rescan the cluster, we see that the rejoin call was not effective:

mysql-js> cluster.rescan()
Rescanning the cluster...

Result of the rescanning operation:
{
    "defaultReplicaSet": {
        "name": "default",
        "newlyDiscoveredInstances": [],
        "unavailableInstances": [
            {
                "host": "mysqlgr1:3306",
                "label": "mysqlgr1:3306",
                "member_id": "6bd04911-4374-11e7-b780-0242ac170002"
            }
        ]
    }
}

The instance 'mysqlgr1:3306' is no longer part of the HA setup. It is either offline or left the HA group.
You can try to add it to the cluster again with the cluster.rejoinInstance('mysqlgr1:3306') command or you can remove it from the cluster configuration.
Would you like to remove it from the cluster metadata? [Y|n]: n

It's curious (and frustrating) that we get a recommendation to run the very same function that we attempted a minute ago.

But, just as a devilish thought, let's try the same experiment from the invalid cluster.

$ docker exec -it mysqlgr1 mysqlsh --uri root@mysqlgr1:3306 -p$(cat secretpassword.txt)

mysql-js> cluster = dba.getCluster()
<Cluster:testcluster>


mysql-js> cluster.rejoinInstance('mysqlgr2:3306')
Rejoining the instance to the InnoDB cluster. Depending on the original
problem that made the instance unavailable, the rejoin operation might not be
successful and further manual steps will be needed to fix the underlying
problem.

Please monitor the output of the rejoin operation and take necessary action if
the instance cannot rejoin.

Please provide the password for 'root@mysqlgr2:3306':
Rejoining instance to the cluster ...

The instance 'root@mysqlgr2:3306' was successfully rejoined on the cluster.

The instance 'mysqlgr2:3306' was successfully added to the MySQL Cluster.
mysql-js> cluster.status()
{
    "clusterName": "testcluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "mysqlgr1:3306",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "mysqlgr1:3306": {
                "address": "mysqlgr1:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr2:3306": {
                "address": "mysqlgr2:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "mysqlgr3:3306": {
                "address": "mysqlgr3:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        }
    }
}

Now this was definitely not supposed to happen. The former failed node has invited a healthy node into its minority cluster and the operation succeeded!
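
The rule being violated here is the usual majority requirement: a partition should change cluster membership only if it can see more than half of the configured members. Below is a minimal sketch of that check. The function name hasQuorum is hypothetical, and the code illustrates the principle only; it is not how MySQL Shell or Group Replication implements it.

```javascript
// Hypothetical guard: a partition has quorum only when it sees a strict
// majority of the configured cluster members.
function hasQuorum(onlineCount, totalMembers) {
  return onlineCount > Math.floor(totalMembers / 2);
}

// Node 1's view before the rejoin: 1 of 3 members online, so no quorum.
// A membership change requested from this side should be refused.
console.log(hasQuorum(1, 3)); // false

// Node 2's view at the same time: 2 of 3 members online, so quorum holds.
console.log(hasQuorum(2, 3)); // true
```

Under this rule, the rejoin requested from node 1 would have been rejected, because node 1 was alone in its partition.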

The horrible part? This illegal operation succeeded in reconciling the views from node 1 and node 2. Now node 2 also thinks that node 1 is the primary again, and node 3 (which was minding its own business and never had any accidents) is considered missing:

$ ./tests/check_cluster.sh 2 | grep 'primary\|address\|status'
    "primary": "mysqlgr1:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
            "address": "mysqlgr1:3306",
            "status": "ONLINE"
            "address": "mysqlgr2:3306",
            "status": "ONLINE"
            "address": "mysqlgr3:3306",
            "status": "(MISSING)"

And node 3 all of a sudden finds itself in the role of failed node, although it had nothing to do with the previous operations:

$ ./tests/check_cluster.sh 3 | grep 'primary\|address\|status'
    "primary": "mysqlgr3:3306",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
            "address": "mysqlgr1:3306",
            "status": "(MISSING)"
            "address": "mysqlgr2:3306",
            "status": "(MISSING)"
            "address": "mysqlgr3:3306",
            "status": "ONLINE"

In short, while we were attempting to fix a split brain, we ended up with a different split brain, and an unexpected node promotion. This is clearly a bug, and I hope the MySQL team can make the system more robust.

3 comments:

Matt Lord said...

Hi Giuseppe,

The primary node (mysqlgr1) restart issues you noticed were because the mysqlgr1 container was started with the BOOTSTRAP=1 option and thus the mysqld process is running with:
--loose-group_replication_bootstrap_group=ON

(Note: it also has no group seeds specified)

So when you restart that container, it bootstraps an entirely new group with a new group UUID. Otherwise normally when you restart the old primary, it would automatically rejoin as a secondary.

This is a current pain that we're working on addressing in multiple ways:

1. Always attempt to join an existing group first--meaning a group with same group_replication_group_name that can be joined via any of the nodes specified in group_replication_group_seeds--and only bootstrap a new group when there is no existing group found. This would allow you to keep the bootstrap option in the permanent config of nodes which would in turn allow for automatic re-bootstrap in the event of total cluster loss (DC power failure, etc.).

(Note: this wouldn't help here, as the container is bootstrapping a new/different group UUID; it also has no group_seeds specified).

2. Using a one-time command to bootstrap a group instead of using a command-line option:
START GROUP_REPLICATION BOOTSTRAP;

I have to think about how to best try and remove that option from the bootstrap container after it's been started... certainly is an awkward situation with Docker. Have any ideas there?

So most of the issues seen here are actually related to the container setup. Although what happened after the rescan() operation definitely doesn't seem right... that shouldn't have changed the state of anything. Have to dig into that with the devs.


Thank you!

Matt

Giuseppe Maxia said...

Hi Matt,
Thanks for the explanation.
While I understand the reasoning for the failure, the fact that the system allows a split brain situation to be resolved in favor of the minority slice is a deadly sin that should not happen under any circumstance.
Moreover, if MySQL Shell is advertised as a generally available tool to maintain the cluster, it should give accurate information about what operations are available and their outcome.
Looking forward to seeing a better response from the management tool.

Matt Lord said...

Just FYI, I finally got around to addressing this in my personal container:
1. https://github.com/mattlord/Docker-InnoDB-Cluster/commit/9ee7c4bda5b048e073cbe536bd1393b9dd76bad6

2. https://github.com/mattlord/Docker-InnoDB-Cluster/commit/49239cd2a30eda61eb58e62d4c1e371e52c8cece

I fully agree with the underlying point, however, and want to see all of this become much more bulletproof. Just wanted to note that I've addressed the specific issue in my container, if you're still using it for testing/etc (as I am).