Galera uses a preallocated, fixed-size file called gcache to store writesets in a circular buffer. By default, its size is 128MB. In this post, we explore how to tune gcache to improve the operation of a Galera cluster.
We have a four-node Galera cluster running the latest release, 23.2.7(r157). A table called t1 is replicated by Galera on all nodes. Each node uses the default 128MB gcache.size, and we’ll execute a large writeset to see how gcache responds.
Let’s create a big writeset using LOAD DATA. The writeset is about 200 MB:
mysql> LOAD DATA LOCAL INFILE '/tmp/mysql_statistics.sql' INTO TABLE t1 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 3784725 rows affected (4 min 2.94 sec)
Records: 3784725 Deleted: 0 Skipped: 0 Warnings: 0
Because the writeset does not fit into the 128MB gcache, Galera creates a gcache.page file to hold it, as reported in the MySQL error log:
[Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 208431655 bytes
[Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000001
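To spot oversized writesets on a running cluster, you could scan the error log for these messages. The following is a minimal sketch, assuming the message format matches the two lines shown above; the function name find_gcache_pages and the sample_log string are hypothetical and exist only for illustration.

```python
import re

# Matches the "Created page" message in the format shown in the error log above.
PAGE_CREATED = re.compile(r"WSREP: Created page (\S+) of size (\d+) bytes")


def find_gcache_pages(log_text):
    """Return a (path, size_in_bytes) tuple for every gcache page creation found."""
    return [(m.group(1), int(m.group(2))) for m in PAGE_CREATED.finditer(log_text)]


# Sample input copied from the log excerpt above.
sample_log = (
    "[Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 208431655 bytes\n"
    "[Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000001\n"
)

pages = find_gcache_pages(sample_log)
for path, size in pages:
    # Report each page size in MiB so it can be compared with gcache.size.
    print(f"{path}: {size / 1024**2:.1f} MiB")
```

A page that is larger than gcache.size is a hint that a single writeset did not fit into the ring buffer.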
We are going to shut down one of the Galera nodes (node1) to see how it performs when rejoining the cluster:
$ service mysql stop
Next, let’s run the following two queries on one of the nodes that is up (e.g., node2):
mysql> TRUNCATE t1;
mysql> LOAD DATA LOCAL INFILE '/tmp/mysql_statistics.sql' INTO TABLE t1 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Query OK, 3784725 rows affected (3 min 52.24 sec)
Records: 3784725 Deleted: 0 Skipped: 0 Warnings: 0
Let’s start node1 so it rejoins the cluster:
$ service mysql start --wsrep-cluster-address=gcomm://node2
Oops, the joiner node is performing SST, which is not expected since it was down for only about 5 minutes:
131009 23:22:40 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 554674)
131009 23:22:40 [Note] WSREP: State transfer required: Group state: a247c359-218e-11e3-0800-e65e826600e9:554674 Local state: a247c359-218e-11e3-0800-e65e826600e9:554672
131009 23:22:40 [Note] WSREP: New cluster view: global state: a247c359-218e-11e3-0800-e65e826600e9:554674, view# 25: Primary, number of nodes: 4, my index: 2, protocol version 2
131009 23:22:40 [Warning] WSREP: Gap in state sequence. Need state transfer.
131009 23:22:40 [Note] WSREP: Setting wsrep_ready to 0
131009 23:22:42 [Note] WSREP: waiting for client connections to close: 2
131009 23:22:42 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.0.121' --auth 'root:password' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '22605''
131009 23:22:42 [Note] WSREP: Prepared SST request: rsync|192.168.0.121:4444/rsync_sst
131009 23:22:42 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131009 23:22:42 [Note] WSREP: Assign initial position for certification: 554674, protocol version: 2
131009 23:22:42 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.0.121:4568
131009 23:22:42 [Note] WSREP: Node 2 (g1.cluster.com) requested state transfer from '*any*'. Selected 1 (g2.cluster.com)(SYNCED) as donor.
131009 23:22:42 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 554674)
131009 23:22:42 [Note] WSREP: Requesting state transfer: success, donor: 1
Note: Using LOAD DATA is not recommended in Galera. Because replication is virtually synchronous per writeset, a large writeset delays replication while it undergoes certification, and it takes time to apply on the other nodes.
When a Galera node rejoins the cluster, it determines whether it needs an incremental state transfer (IST) or a state snapshot transfer (SST). It calculates the gap between the group state and its local state stored in grastate.dat. If all of the writesets it missed can be found in the donor's gcache, it performs IST: it fetches the missing writesets and catches up with the group by replaying them. This is faster than SST, and non-blocking on the donor side.
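The decision described above can be sketched in a few lines. This is a simplified illustration and not Galera's actual implementation: the function can_use_ist and its parameters are hypothetical, and it assumes the donor's gcache holds a contiguous range of writesets starting at donor_oldest_cached_seqno. The joiner and group values are taken from the log above; the donor value is made up for the example.

```python
def can_use_ist(joiner_uuid, joiner_seqno, group_uuid, group_seqno,
                donor_oldest_cached_seqno):
    """Return True if the joiner could catch up via IST (simplified model).

    Assumption: the donor's gcache contains every writeset from
    donor_oldest_cached_seqno up to group_seqno.
    """
    # A different state UUID means a different history, so only SST can help.
    if joiner_uuid != group_uuid:
        return False
    # Nothing was missed, so no state transfer is needed at all.
    if joiner_seqno >= group_seqno:
        return True
    # The first writeset the joiner is missing must still be cached on the donor.
    first_missing = joiner_seqno + 1
    return donor_oldest_cached_seqno <= first_missing


uuid = "a247c359-218e-11e3-0800-e65e826600e9"

# Joiner is at 554672 and the group is at 554674, as in the log above.
# Hypothetical donor whose gcache only reaches back to 554674: SST is required.
print(can_use_ist(uuid, 554672, uuid, 554674, donor_oldest_cached_seqno=554674))

# Hypothetical donor with a larger gcache reaching back to 554000: IST is possible.
print(can_use_ist(uuid, 554672, uuid, 554674, donor_oldest_cached_seqno=554000))
```

In the test case, the joiner missed only two writesets, but one of them was too large for the 128MB gcache, so the donor could not serve it from there.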
The gcache size determines how many writesets the donor node can serve via IST. By default, Galera creates the gcache file under the MySQL datadir. One might think that gcache is similar to the MySQL binlog, but it is not, as explained in detail in the Galera FAQ.
IST is always the preferred choice when joining a node to a cluster. We do not want the cluster to spend time and resources on a lengthy process like SST, which can also block writes on the donor (if the SST method is mysqldump or rsync). Sometimes a writeset is too large to fit into gcache, in which case Galera creates a gcache page file (as happened in the test case). So, sizing the gcache properly is crucial at this stage.
To increase the gcache size, add the following line to the MySQL configuration file:
wsrep_provider_options="gcache.size = 5G"
Perform a rolling restart of the cluster, then repeat the above test. The joiner node will now perform IST instead of SST, which you can observe in the MySQL error log:
131009 23:45:53 [Note] WSREP: Signalling provider to continue.
131009 23:45:53 [Note] WSREP: SST received: a247c359-218e-11e3-0800-e65e826600e9:554676
131009 23:45:53 [Note] WSREP: Receiving IST: 3 writesets, seqnos 554676-554679
You can estimate how much downtime you can afford while still being able to perform an IST. For example, with a write stream of 128MB per hour and gcache.size set to 1.3GB, a node can be down for about 10 hours. When brought back into the cluster, it will be able to perform an IST rather than a full SST.
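The estimate is a simple division of the gcache size by the write rate. Here is a rough sketch of that arithmetic; the function names are hypothetical, and the result should be treated as an upper bound, since it assumes a steady write rate and ignores any per-writeset overhead inside gcache.

```python
def max_downtime_hours(gcache_size_mb, write_rate_mb_per_hour):
    """Approximate how long a node can stay down and still rejoin via IST."""
    if write_rate_mb_per_hour <= 0:
        raise ValueError("write rate must be positive")
    return gcache_size_mb / write_rate_mb_per_hour


def required_gcache_mb(downtime_hours, write_rate_mb_per_hour):
    """Inverse calculation: the gcache size needed to cover a given downtime."""
    return downtime_hours * write_rate_mb_per_hour


# The example from the text: 1.3GB of gcache at 128MB of writes per hour.
hours = max_downtime_hours(1.3 * 1024, 128)
print(f"Affordable downtime: about {hours:.1f} hours")

# Sizing the other way round, e.g. to survive a 24-hour outage at the same rate.
print(f"gcache needed for 24 hours: {required_gcache_mb(24, 128)} MB")
```

Measuring the write rate at peak load rather than on average gives a safer figure.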
Since gcache is created under the MySQL working directory by default, you can offload its disk IO from the database partition by pointing gcache.name to another location or partition in wsrep_provider_options:
wsrep_provider_options="gcache.size = 5G; gcache.name = /another_partition/galera.cache"
Take note that Galera pre-allocates the gcache space during MySQL startup, so make sure you have sufficient disk space before increasing the gcache size.
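A quick free-space check before restarting can catch this early. The following is a minimal sketch using Python's standard shutil.disk_usage; the function has_room_for_gcache and the headroom_bytes parameter are hypothetical, and the example assumes that 5G means 5 * 1024**3 bytes. It also ignores any space already occupied by an existing gcache file.

```python
import shutil


def has_room_for_gcache(directory, gcache_size_bytes, headroom_bytes=0):
    """Check whether the filesystem holding `directory` can fit the gcache file."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= gcache_size_bytes + headroom_bytes


# Replace "." with the directory that will hold the gcache file,
# for example the MySQL datadir or the partition set in gcache.name.
target_directory = "."
planned_size = 5 * 1024**3  # assumed byte value of gcache.size = 5G

if has_room_for_gcache(target_directory, planned_size):
    print("Enough free space for the planned gcache size")
else:
    print("Not enough free space, choose a smaller gcache.size or another partition")
```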