- What is Galera Replication?
- What is the difference between MySQL Replication and Galera Replication?
- What are the pros and cons of Galera replication?
- Can I use Galera Replication to replicate between Data Centers?
- What is Certification based Replication?
- How do I deploy a Galera Replication cluster?
- Failure handling with ClusterControl™ and Galera
- Adding a new node to a Galera Replication cluster
- Schema changes in a Galera cluster
- Common Questions about Galera
ClusterControl™ for MySQL Galera enables customers to deploy, manage, monitor and scale a clustered MySQL database platform based on the Galera Replication protocol.
Galera Replication is a synchronous multi-master replication plug-in for InnoDB. It is very different from the regular MySQL Replication, and addresses a number of issues including write conflicts when writing on multiple masters, replication lag and slaves being out of sync with the master. Users do not have to know which server they can write to (the master) and which servers they can read from (the slaves).
An application can write to any node in a Galera Replication cluster, and transaction commits (RBR events) are then applied on all servers via certification-based replication.
Certification-based replication is an alternative approach to synchronous database replication using Group Communication and transaction ordering techniques.
A minimal Galera cluster consists of 3 nodes. The reason is that, should there be a problem applying a transaction on one node (e.g., network problem or the machine becomes unresponsive), the two other nodes will have a quorum (i.e. a majority) and will be able to proceed with the transaction commit.
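As an illustration, the wsrep-related settings for one node of such a three-node cluster can be sketched in a my.cnf fragment like the one below. The host names, cluster name and library path are placeholders, not values from this document; check them against your own installation.

```ini
# Galera settings for one node of a hypothetical 3-node cluster
[mysqld]
binlog_format=ROW                      # Galera requires row-based events
default_storage_engine=InnoDB          # Galera replicates InnoDB
innodb_autoinc_lock_mode=2             # required for parallel applying
wsrep_provider=/usr/lib/galera/libgalera_smm.so   # path varies per distribution
wsrep_cluster_name="my_galera_cluster"
# All three nodes listed; with 3 nodes, any 2 still form a quorum
wsrep_cluster_address="gcomm://node1,node2,node3"
wsrep_node_name="node1"
```

With this layout, losing any single node leaves two nodes in contact, which is a majority, so the cluster can keep committing transactions.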
MySQL Replication is part of the standard MySQL database, and is mainly asynchronous in nature. Updates are always done on one master, and these are propagated to slaves. It is possible to create a ring topology with multiple masters, however this is not recommended as it is very easy for the servers to get out of sync in case of a master failing. There is no automatic failover or resynchronization in these cases.
Galera Replication is a plug-in to MySQL, and enables a true master-master setup for InnoDB. In a Galera Replication cluster, all nodes are masters, and applications can read from and write to any node. Transactions are synchronously committed on all nodes. If a node fails, the other nodes continue to operate and stay up to date. When the failed node comes up again, it automatically synchronizes with the other nodes before it is allowed back into the cluster. No data is lost when a node fails.
Galera Replication has a number of benefits:
- A high availability solution with synchronous replication, failover and resynchronization
- No loss of data
- All servers have up-to-date data (no slave lag)
- Read scalability
- 'Pretty good' write scalability
- High availability across data centers
Like any solution, there are some limitations:
- It supports only InnoDB
- With increasing number of writeable masters, the transaction rollback rate may increase, especially if there is write contention on the same dataset. This increases transaction latency.
- It is possible for a slow/overloaded master node to affect performance of the Galera Replication cluster, therefore it is recommended to have uniform servers across the cluster.
Although Galera Replication is synchronous, it is possible to deploy a Galera Replication cluster across data centers. Synchronous replication is traditionally implemented via 2-phase commit, where messages are sent to all nodes in a cluster in a 'prepare' phase, and another set of messages are sent in a 'commit' phase. This approach is usually not suitable for geographically disparate nodes, because of the latencies in sending messages between nodes.
Galera Replication makes use of certification-based replication, which is a form of synchronous replication with reduced overhead.
Certification-based replication uses group communication and transaction ordering techniques to achieve synchronous replication. Transactions execute optimistically on a single node (or replica) and, at commit time, run a coordinated certification process to enforce global consistency. Global coordination is achieved with the help of a broadcast service that establishes a global total order among concurrent transactions.
Prerequisites for certification-based replication:
- database is transactional (i.e. it can rollback uncommitted changes)
- each replication event changes the database atomically
- replicated events are globally ordered (i.e. applied on all instances in the same order)
The main idea is that a transaction is executed conventionally until the commit point, under the assumption that there will be no conflict. This is called optimistic execution. When the client issues a COMMIT command (but before the actual commit has happened), all changes made to the database by the transaction and the primary keys of changed rows are collected into a writeset. This writeset is then replicated to the rest of the nodes. After that, the writeset undergoes a deterministic certification test (using the collected primary keys) on each node (including the writeset originator node) which determines if the writeset can be applied or not.
If the certification test fails, the writeset is dropped and the original transaction is rolled back. If the test succeeds, the transaction is committed and the writeset is applied on the rest of the nodes.
The certification test implemented in Galera depends on the global ordering of transactions. Each transaction is assigned a global ordinal sequence number during replication. Thus, when a transaction reaches the commit point, it is known what was the sequence number of the last transaction it did not conflict with. The interval between those two numbers is an uncertainty land: transactions in this interval have not seen the effects of each other. Therefore, all transactions in this interval are checked for primary key conflicts with the transaction in question. The certification test fails if a conflict is detected.
Since the procedure is deterministic and all replicas receive transactions in the same order, all nodes reach the same decision about the outcome of the transaction. The node that started the transaction can then notify the client application if the transaction has been committed or not.
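The effect of a failed certification is visible on the client side: when two transactions on different nodes update the same row concurrently, the transaction that loses certification is rolled back with a deadlock error at COMMIT time. A hypothetical illustration (the table and values are made up for this sketch):

```sql
-- On node1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- Meanwhile on node2:
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
COMMIT;   -- node2's writeset is replicated and certified first

-- Back on node1:
COMMIT;
-- ERROR 1213 (40001): Deadlock found when trying to get lock;
-- try restarting transaction
```

Note that the error surfaces at COMMIT rather than at the conflicting UPDATE, which is a consequence of the optimistic execution described above; applications should be prepared to retry transactions.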
Certification based replication (or more precisely, certification-based conflict resolution) is based on academic research, in particular on Fernando Pedone's Ph.D. thesis (http://www.inf.usi.ch/faculty/pedone/Paper/199x/These-2090-Pedone.pdf).
A Galera Replication cluster can be configured using the Severalnines Configurator.
The wizard collects some high level data (e.g. IP addresses of machines, size of database, amount of RAM in machines, type of database workload, etc.). The application then generates the configuration of the Galera Replication cluster. At the end, a deployment package is automatically generated and emailed to the user.
The deployment package automates a number of tasks:
- copy and install binaries (including the Galera plugin) on all servers
- create data directories
- create the OS user
- create the mysql user for the MySQL servers
- install system tables
- set permissions for MySQL
- enable the Galera service
- install ClusterControl™ on all hosts to monitor and manage the cluster
Finally, a verification phase checks that the cluster has been properly set up. The Galera Replication cluster is now deployed.
In order to keep the database cluster stable and running, it is important for the system to be resilient to failures. Failures are caused by either software bugs or hardware problems, and can happen at any time. In case a server goes down, failure handling, failover and reconfiguration of the Galera cluster needs to be automatic, so as to minimize downtime.
In case of a node failing, applications connected to that node can connect to another node and continue to do database requests. Keepalive messages are sent between nodes in order to detect failures, in which case the failed node is excluded from the cluster.
ClusterControl™ will restart the failed database process, and point it to one of the existing nodes (a 'donor') to resynchronize. The resynchronization process is handled by Galera.
While the failed node is resynchronizing, any new transactions (writesets) coming from the existing nodes will be cached in a slave queue. Once the node has caught up, it will be considered as SYNCED and is ready to accept client connections.
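A node's progress through this process can be observed via the wsrep status variables, for example:

```sql
mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';
-- Typically progresses through 'Joining', 'Joined' and finally 'Synced'

mysql> SHOW STATUS LIKE 'wsrep_local_recv_queue';
-- Length of the slave queue: writesets received but not yet applied
```

A persistently large `wsrep_local_recv_queue` on a SYNCED node is usually a sign that the node is struggling to keep up with the rest of the cluster.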
Adding a new node is automatic with Galera. It follows the same process as recovering a failed node. When the node is introduced, ClusterControl™ will select a 'donor' and do a full state snapshot transfer from that node.
During the synchronization process, any incoming transactions are cached in a slave queue. Once the node is SYNCED, it is ready to accept client connections.
DDL statements are replicated at statement level (not as row-based events, as is the case with DML on InnoDB data).
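For example, a schema change such as the following is sent to all nodes as the statement itself (the table and column names are illustrative):

```sql
-- Replicated to every node as a statement, not as row events
ALTER TABLE customers ADD COLUMN last_login DATETIME;
```

In standard Galera configurations such DDL typically runs under Total Order Isolation (TOI), meaning the statement executes at the same point in the replication stream on every node, blocking conflicting transactions while it runs.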
What can cause Galera to crash?
Stay clear of Galera's known limitations to avoid problems.
There are several things that can cause Galera to crash:
- Too many deadlocks under heavy load when writing to the same set of rows
- OS swapping and/or high iowait
- Running out of disk space
- InnoDB crashes
- Using a binlog_format other than ROW
- Tables without a PRIMARY KEY; in particular, a DELETE from a table without a primary key can crash the cluster
- Replication of MyISAM tables (this is experimental, and MyISAM tables should be avoided)
- No primary component available, or the cluster is out of quorum
- MySQL misconfiguration
- Galera software bugs
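Two of the schema-related causes above can be checked for up front. The following information_schema queries are a sketch (verify them against your MySQL version before relying on them): the first lists tables without an explicit PRIMARY KEY, the second lists tables still using MyISAM.

```sql
-- Tables lacking an explicit PRIMARY KEY
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON t.table_schema = c.table_schema
      AND t.table_name   = c.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE c.constraint_name IS NULL
  AND t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('mysql','information_schema','performance_schema');

-- Tables still using the MyISAM engine
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql','information_schema','performance_schema');
```

Running these before moving a schema onto Galera helps avoid the PRIMARY KEY and MyISAM pitfalls listed above.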
What happens when disk is full?
Galera node provisioning is smart enough to kick a problematic node out of the cluster if it detects an inconsistency among the members. When mysqld runs out of disk space (in the data directory), the node is no longer able to apply writesets, and Galera detects this as failed transactions. Since this compromises node consistency, Galera signals the node to close group communication and forces mysqld to terminate.
Restarting the node will produce disk-full errors (such as "no space left on device") or quota-exceeded errors (such as "write failed" or "user block limit reached"). You will need to add another data file on another disk, or free up some space, before the node can rejoin the cluster.
OS is swapping and/or high iowait
If the OS starts swapping and/or iowait is very high, it can "freeze" the server for a period of time. During this time the Galera node may stop responding to the other nodes and be deemed dead. In virtualized environments, it can also be the host OS that is swapping.
How to handle Galera crash?
First of all, make sure you are running the latest stable Galera release, so you do not run into older bugs that have already been fixed. Start by inspecting the MySQL error log on the Galera nodes, as Galera logs to this file. Look for any lines indicating errors or failures. If the Galera nodes happen to be responsive, you may also try to collect the following output:
mysql> SHOW FULL PROCESSLIST;
mysql> SHOW ENGINE INNODB STATUS;
mysql> SHOW STATUS LIKE 'wsrep%';
Next, inspect the system resources: check network, firewall, disk usage and memory utilization, and look through the general system activity logs (syslog, messages, dmesg). If you still find no indication of the problem, you may have hit a bug, which you can report directly on the Galera bugs page on Launchpad, or you can request technical support directly from the vendor (Codership, Percona or MariaDB). You may also join the Galera Google Group mailing list to ask for assistance.
- If you are using rsync for state transfer, and a node crashes before the state transfer is over, the rsync process might hang forever, occupying the port and preventing the node from restarting. The problem will show up as 'port in use' in the server error log. Find the orphan rsync process and kill it manually.
- Before re-initializing the cluster, you can determine which DB node has the most up-to-date data by comparing the wsrep_last_committed value across the nodes. The node holding the highest number is the recommended reference node when bootstrapping the cluster.
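For example, run the following on every node and compare the results:

```sql
-- Run on each node; bootstrap the cluster from the node
-- reporting the highest value
mysql> SHOW STATUS LIKE 'wsrep_last_committed';
```

The value is the global sequence number of the last committed transaction, so the node with the highest value has applied the most writesets.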
What happens if I don't have primary keys in my table?
As described earlier, the certification test uses the primary keys of changed rows to detect conflicts between writesets. Without a primary key, Galera cannot reliably identify which rows a transaction touched. In particular, a DELETE FROM statement requires a PK, or node(s) will die. Always define an explicit PRIMARY KEY in all tables; a simple AUTO_INCREMENT primary key is enough.
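As an illustration (the table and column names are made up for this sketch):

```sql
-- Risky: on a table without a PRIMARY KEY, a bulk DELETE like this
-- can bring node(s) down:
--   DELETE FROM session_log WHERE created < '2014-01-01';

-- Safe version of the table: a surrogate AUTO_INCREMENT key is enough
CREATE TABLE session_log (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id INT NOT NULL,
  created DATETIME NOT NULL
) ENGINE=InnoDB;
```

With the surrogate key in place, every row in a writeset can be identified unambiguously on every node, so the same DELETE is safe.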
Is it true that MySQL Galera is as slow as the slowest node?
Yes, to a large extent. Galera relies on group communication between nodes: a transaction cannot commit until its writeset has been replicated to, and acknowledged by, all nodes. An overloaded node takes longer to acknowledge, which delays commits for the rest of the cluster. It is therefore recommended to have uniform servers across the cluster. Also, transaction latency can be no shorter than the RTT (round trip time) to the slowest node.
Usage of MyISAM tables? Why is it not recommended?
MySQL Galera treats MyISAM tables in quite a different way:
- All DDL (create, drop, alter table...) on MyISAM will be replicated.
- DML (update, delete, insert) on MyISAM tables only, will not be replicated.
- Transactions containing both InnoDB and MyISAM access will be replicated.
So, MyISAM tables will appear on all nodes (since DDL is replicated). If you access MyISAM tables outside of InnoDB transactions, all data changes to those tables remain local to each node. If you access MyISAM tables inside InnoDB transactions, the MyISAM changes are replicated along with the InnoDB changes. However, if a cluster-wide conflict happens, the MyISAM changes cannot be rolled back, and your MyISAM tables will become inconsistent.
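This behaviour can be sketched as follows (the table name is illustrative):

```sql
-- DDL: replicated, so the empty table appears on all nodes
CREATE TABLE legacy_log (msg VARCHAR(255)) ENGINE=MyISAM;

-- DML outside an InnoDB transaction: stays local to the node it ran on
INSERT INTO legacy_log VALUES ('only visible on this node');

-- Safest fix: convert the table to InnoDB so changes replicate normally
ALTER TABLE legacy_log ENGINE=InnoDB;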
Which version of Galera is the best (PXC, Maria, Codership)?
The Galera technology is developed by Codership Oy and is available as a patch for standard MySQL and InnoDB. Percona and MariaDB leverage the Galera library in Percona XtraDB Cluster (PXC) and MariaDB Galera Cluster respectively.
Since they all leverage the same Galera library, replication performance should be fairly similar. The Codership build usually has the latest version of Galera, although that could change in the future.