1.背景概述 在一次主從複製架構中,由於主節點binlog損壞,導致從節點無法正常同步數據,只能重做從節點;因此使用MySQL 8.0.17開始提供的clone技術進行恢復,恢復後的2天都發生了主從報錯數據衝突。 通過解析binlog發現,同一時刻主從節點都在執行同一條語句,因此詢問業務是否在主從節 ...
1.背景概述
在一次主從複製架構中,由於主節點binlog損壞,導致從節點無法正常同步數據,只能重做從節點;因此使用MySQL 8.0.17開始提供的clone技術進行恢復,恢復後的2天都發生了主從報錯數據衝突。
通過解析binlog發現,同一時刻主從節點都在執行同一條語句,因此詢問業務是否在主從節點都執行了定時任務,業務回覆定時任務只在主節點執行。
最後排查發現,克隆後的從節點的定時任務也會是開啟的狀態,因此同一時刻,主從節點同時執行定時任務,導致主從報錯,最終將從節點的定時任務關閉後解決此問題。
2.問題復現
本次測試基於 GreatSQL 8.0.32-24
greatsql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.32-24 |
+-----------+
1 row in set (0.00 sec)
1.搭建一套主從架構
略
2.創建event
greatsql> create database test;
greatsql> use test;
greatsql> CREATE TABLE `test` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
`now` datetime DEFAULT NULL COMMENT '時間',
PRIMARY KEY (`id`)
);
greatsql> CREATE EVENT event_test
ON SCHEDULE EVERY 1 MINUTE
ON COMPLETION PRESERVE
ENABLE
COMMENT '每隔1分鐘向test表插入記錄'
DO INSERT INTO test VALUES(NULL, now());
3.查看event狀態
主節點,預設情況下event狀態為 ENABLED
greatsql> show events;
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| Db | Name | Definer | Time zone | Type | Execute at | Interval value | Interval field | Starts | Ends | Status | Originator | character_set_client | collation_connection | Database Collation |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| test | event_test | root@% | SYSTEM | RECURRING | NULL | 1 | MINUTE | 2023-10-12 17:11:14 | NULL | ENABLED | 1 | utf8mb4 | utf8mb4_unicode_ci | utf8mb4_unicode_ci |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
1 row in set (0.00 sec)
從節點,預設情況下event狀態為 SLAVESIDE_DISABLED
greatsql> show events;
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
| Db | Name | Definer | Time zone | Type | Execute at | Interval value | Interval field | Starts | Ends | Status | Originator | character_set_client | collation_connection | Database Collation |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
| test | event_test | root@% | SYSTEM | RECURRING | NULL | 1 | MINUTE | 2023-10-12 17:11:14 | NULL | SLAVESIDE_DISABLED | 1 | utf8mb4 | utf8mb4_unicode_ci | utf8mb4_unicode_ci |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+--------------------+------------+----------------------+----------------------+--------------------+
1 row in set (0.00 sec)
4.查看數據
greatsql> select * from test.test;
+----+---------------------+
| id | now |
+----+---------------------+
| 1 | 2023-08-08 16:00:39 |
| 2 | 2023-08-08 16:01:39 |
| 3 | 2023-08-08 16:02:39 |
+----+---------------------+
3 rows in set (0.00 sec)
5.從節點進行克隆
# 安裝克隆插件,主從節點都需要
greatsql> install plugin clone soname 'mysql_clone.so';
# 從節點進行clone
greatsql> set global clone_valid_donor_list='172.17.137.162:6001';
greatsql> clone instance from root@'172.17.137.162':6001 identified by 'greatsql';
6.重新建立主從複製
greatsql> change master to master_user='root',master_password='greatsql',master_host='172.17.137.162',master_port=6001,master_auto_position=1;
Query OK, 0 rows affected, 7 warnings (0.04 sec)
greatsql> start slave;
Query OK, 0 rows affected, 1 warning (0.04 sec)
7.查看主從狀態
greatsql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for source to send event
Master_Host: 172.17.137.162
Master_User: root
Master_Port: 6001
Connect_Retry: 60
Master_Log_File: binlog.000001
Read_Master_Log_Pos: 2959
Relay_Log_File: relaylog.000002
Relay_Log_Pos: 395
Relay_Master_Log_File: binlog.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 2307
Relay_Log_Space: 1242
Until_Condition: None
Until_Log_File:
greatsql> select * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'e8bf88f9-2acd-11ee-a98a-00163e605c74:8' at master log binlog.000001, end_log_pos 2606; Could not execute Write_rows event on table test.test; Duplicate entry '5' for key 'test.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 2606
LAST_ERROR_TIMESTAMP: 2023-08-08 16:03:39.033240
LAST_APPLIED_TRANSACTION:
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION: e8bf88f9-2acd-11ee-a98a-00163e605c74:8
APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2023-08-08 16:02:45.795753
APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2023-08-08 16:03:39.032510
LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
APPLYING_TRANSACTION_RETRIES_COUNT: 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE:
APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
1 row in set (0.00 sec)
可以看到從節點報錯發生了主鍵衝突。
8.查看從節點定時任務狀態
當前從節點定時任務狀態為 ENABLED
greatsql> show events;
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| Db | Name | Definer | Time zone | Type | Execute at | Interval value | Interval field | Starts | Ends | Status | Originator | character_set_client | collation_connection | Database Collation |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+----------------------+----------------------+--------------------+
| test | event_test | root@% | SYSTEM | RECURRING | NULL | 1 | MINUTE | 2023-08-08 15:58:45 | NULL | ENABLED | 1 | utf8mb4 | utf8mb4_unicode_ci | utf8mb4_unicode_ci |
+------+------------+---------+-----------+-----------+------------+----------------+----------------+---------------------+------+---------+------------+---------------------+----------------------+--------------------+
1 row in set (0.00 sec)
可以看到由於從節點的定時任務也執行了,從節點寫入數據,導致主鍵衝突。
9.故障解決
greatsql> alter event event_test DISABLE;
Query OK, 0 rows affected (0.01 sec)
關閉從節點的定時任務event,然後跳過主鍵衝突的報錯,最後重新啟動主從複製。
3.總結
1.如果主庫有定時任務,通過clone的方式搭建從庫,在從庫恢復之後需要關閉定時任務,避免主從同時執行定時任務導致主從故障。
2.克隆時,如果捐贈節點有主從複製信息,則克隆後的接收節點也會克隆此複製信息,併在克隆完成自動重啟實例後,自動啟動複製;避免此問題可以在接收節點的配置文件中增加 skip-slave-start,避免節點重啟後自動啟動複製。
Enjoy GreatSQL