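Redis Sentinel is Redis's high-availability solution, officially introduced in Redis 2.8.
With plain master-slave replication, when the master failed an operator had to promote one slave to master by hand, repoint the remaining slaves to the new master, and update the master address used by the application. Every step required manual intervention.
The logs below walk through exactly how Sentinel performs a failover.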
The Sentinel failover process
The cluster topology is as follows.
Role        IP         Port   runID
Master      127.0.0.1  6379
Slave-1     127.0.0.1  6380
Slave-2     127.0.0.1  6381
Sentinel-1  127.0.0.1  26379  d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2  127.0.0.1  26380  18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3  127.0.0.1  26381  3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8
The master process is killed with kill -9.
1. The slaves react first.
They immediately print output like the following.
28244:S 08 Oct 16:03:34.184 # Connection with master lost.
28244:S 08 Oct 16:03:34.184 * Caching the disconnected master state.
28244:S 08 Oct 16:03:34.548 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:34.548 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:03:34.548 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:03:35.556 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:35.556 * MASTER <-> SLAVE sync started
...
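2. The Sentinel logs show output only 30 seconds later, which follows from the setting "sentinel down-after-milliseconds mymaster 30000".
The log output of each Sentinel node and each slave is listed below in turn.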
Sentinel-1
28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:04.379 # +new-epoch 1
28087:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28087:X 08 Oct 16:04:05.388 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:35.656 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-2
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-3
28234:X 08 Oct 16:04:04.288 # +sdown master mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:04.378 # +new-epoch 1
28234:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2
28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28234:X 08 Oct 16:04:05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:05.630 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:35.709 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
slave2 (the slave at 127.0.0.1:6380)
28244:S 08 Oct 16:04:04.762 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:04.762 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
slave3 (the slave at 127.0.0.1:6381)
28253:S 08 Oct 16:04:03.655 * MASTER <-> SLAVE sync started
28253:S 08 Oct 16:04:03.655 # Error condition on socket for SYNC: Connection refused
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
28253:M 08 Oct 16:04:05.770 * Slave 127.0.0.1:6380 asks for synchronization
28253:M 08 Oct 16:04:05.770 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 156 bytes of backlog starting from offset 24302.
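To compare log lines like those above across nodes, it can help to split each one into its fields first. The following is a minimal sketch, assuming the `pid:role day month time level message` layout seen in these logs; the names `LogLine`, `parse_line`, and `event_name` are illustrative and not part of Redis.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    pid: int
    role: str       # X = sentinel, M = master, S = slave, C = child process
    timestamp: str  # e.g. "08 Oct 16:04:04.289"
    level: str      # one of . - * #
    message: str


# Layout assumed from the log excerpts in this article.
LINE_RE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[XMSC]) "
    r"(?P<ts>\d{2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[LogLine]:
    """Split one log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogLine(int(m["pid"]), m["role"], m["ts"], m["level"], m["msg"])


def event_name(entry: LogLine) -> Optional[str]:
    """Return the Sentinel event (e.g. '+odown') if the message starts with one."""
    first = entry.message.split(" ", 1)[0]
    return first if first.startswith(("+", "-")) else None


entry = parse_line(
    "28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2"
)
print(entry.role, entry.timestamp, event_name(entry))
```

Filtering the parsed lines by `event_name` and sorting by timestamp gives a single merged timeline for all three Sentinels.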
Putting the logs above together, we can see the following.
Each Sentinel node marks 127.0.0.1 6379 as subjectively down (Subjectively Down, abbreviated sdown).
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
Once the quorum setting is met, Sentinel-2 marks the master as objectively down (Objectively Down, abbreviated odown). Comparing with the logs of the other two Sentinel nodes shows that Sentinel-2 was the first to reach that verdict. Next comes the Sentinel leader election. Generally speaking, whichever Sentinel completes the odown determination first becomes the leader, and only the Sentinel leader may carry out the failover.
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
The leader looks for a suitable slave to promote to master.
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable slave for promotion.
127.0.0.1 6381 is selected as the new master.
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+selected-slave <instance details> -- We found the specified good slave to promote.
The leader orders node 6381 to run slaveof no one, which makes it a master.
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted slave as master, waiting for it to switch.
The leader waits for node 6381 to be promoted to master.
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
The leader confirms that node 6381 has been promoted to master.
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
Now look at slave3's log output in the window from 16:04:04.528 to 16:04:05.543. It switched to MASTER mode and rewrote its configuration file.
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
The failover enters the slave reconfiguration phase.
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
The leader orders node 6380 to replicate from the new master.
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-sent <instance details> -- The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.
slave2's log output at this point in time largely matches. It performed an incremental (partial) resynchronization.
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
At the same moment the other Sentinels also log output. Taking sentinel1 as an example, the log shows that it updates its configuration at this point.
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
+switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
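Because +switch-master is the event clients usually care about, here is a minimal sketch of splitting its five fields into old and new addresses. It assumes the payload is the text that follows `+switch-master` in the log line; `SwitchMaster` and `parse_switch_master` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Tuple


class SwitchMaster(NamedTuple):
    master_name: str
    old_addr: Tuple[str, int]
    new_addr: Tuple[str, int]


def parse_switch_master(payload: str) -> SwitchMaster:
    """Parse '<master name> <oldip> <oldport> <newip> <newport>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return SwitchMaster(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


event = parse_switch_master("mymaster 127.0.0.1 6379 127.0.0.1 6381")
print(f"{event.master_name}: {event.old_addr} -> {event.new_addr}")
```

An application could use `new_addr` to rebuild its connection pool once the switch is announced.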
The synchronization has not finished yet.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-inprog <instance details> -- The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.
Master-slave synchronization is complete.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-done <instance details> -- The slave is now synchronized with the new master.
The failover is complete.
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
After the failover succeeds, the master switch message is published.
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
The slaves of the new master are attached. Note that the old master is registered as a slave of the new master.
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
+slave <instance details> -- A new slave was detected and attached.
After another 30 seconds, the old master (now registered as a slave) is marked as subjectively down.
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
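To get a feel for how long each phase took, the timestamps of a few Sentinel-2 events quoted above can be subtracted from one another. This is only a small sketch over values copied from the log; the choice of which events mark a "phase" is an assumption made for illustration.

```python
from datetime import datetime, timedelta

# (event, time) pairs copied from the Sentinel-2 log above.
EVENTS = [
    ("+sdown", "16:04:04.289"),
    ("+odown", "16:04:04.366"),
    ("+elected-leader", "16:04:04.450"),
    ("+promoted-slave", "16:04:05.543"),
    ("+failover-end", "16:04:06.606"),
]


def to_time(text: str) -> datetime:
    """Parse an HH:MM:SS.mmm timestamp (the date part is irrelevant here)."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def phase_durations_ms(events):
    """Return (event, milliseconds since the previous event) pairs."""
    times = [to_time(t) for _, t in events]
    return [
        (events[i + 1][0], (times[i + 1] - times[i]) // timedelta(milliseconds=1))
        for i in range(len(events) - 1)
    ]


durations = phase_durations_ms(EVENTS)
total_ms = sum(ms for _, ms in durations)
for name, ms in durations:
    print(f"{name:>16} reached after {ms} ms")
print(f"total from +sdown to +failover-end: {total_ms} ms")
```

In this run almost all of the outage came from the 30-second down-after-milliseconds wait; the failover itself took a little over two seconds.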
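Taken together, Sentinel performs a failover as follows.
1. Once per second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are currently reachable. When a node gives no valid reply for longer than down-after-milliseconds, the Sentinel marks that node as subjectively down.
2. If the node marked subjectively down is a master, the Sentinel uses the sentinel is-master-down-by-addr command to ask the other Sentinel nodes for their verdict on the master. Once at least <quorum> Sentinels agree, the Sentinel marks the master as objectively down. If a slave or a Sentinel node is marked subjectively down, no failover follows.
3. The Sentinels elect a leader, which carries out the rest of the failover. The election algorithm is based on Raft.
4. The Sentinel leader starts the failover.
5. A suitable slave is chosen as the new master.
6. The Sentinel leader sends slaveof no one to the slave chosen in the previous step, turning it into a master.
7. The remaining slaves are told to become slaves of the new master. How many are repointed at a time depends on the parallel-syncs parameter.
8. The old master is recorded as a slave and stays under Sentinel's management, so that once it recovers it replicates from the new master.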
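Steps 1 and 2 above can be sketched as two small predicates. This is a simplified model, not Sentinel's implementation: it assumes each Sentinel's last valid reply time is known in one place, whereas real Sentinels exchange verdicts over the network. `SentinelView`, `is_sdown`, and `is_odown` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentinelView:
    name: str
    last_valid_reply_ms: int  # when this Sentinel last got a valid reply from the master


def is_sdown(view: SentinelView, now_ms: int, down_after_ms: int) -> bool:
    """Subjectively down: no valid reply for longer than down-after-milliseconds."""
    return now_ms - view.last_valid_reply_ms > down_after_ms


def is_odown(views: List[SentinelView], now_ms: int,
             down_after_ms: int, quorum: int) -> bool:
    """Objectively down: at least `quorum` Sentinels consider the master sdown."""
    agreeing = sum(is_sdown(v, now_ms, down_after_ms) for v in views)
    return agreeing >= quorum


views = [
    SentinelView("Sentinel-1", last_valid_reply_ms=0),
    SentinelView("Sentinel-2", last_valid_reply_ms=0),
    SentinelView("Sentinel-3", last_valid_reply_ms=29_000),  # still got a recent reply
]
print(is_odown(views, now_ms=30_001, down_after_ms=30_000, quorum=2))
```

With a quorum of 2, two agreeing Sentinels are enough even though the third has not yet reached its own sdown verdict, which matches the "#quorum 2/2" line in the Sentinel-3 log.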
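The Sentinel leader election process
The Sentinel leader election is based on the Raft protocol.
1. Every online Sentinel node is eligible to become leader. Once it has confirmed that the master is objectively down and tries to start a failover (the +odown and +try-failover lines in the Sentinel-2 log above), it sends the sentinel is-master-down-by-addr command to the other Sentinel nodes, asking them to make it the leader.
2. A Sentinel that receives the command grants the request if it has not already granted another Sentinel's sentinel is-master-down-by-addr request; otherwise it refuses.
3. If the requesting Sentinel finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.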
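The voting rule above can be sketched as follows. This is a toy model under stated assumptions: a single epoch, requests processed in arrival order, and each voter granting only the first request it sees. The names `votes_needed` and `elect_leader` and the short Sentinel labels are illustrative.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def votes_needed(quorum: int, num_sentinels: int) -> int:
    """Threshold from step 3: max(quorum, num(sentinels)/2 + 1)."""
    return max(quorum, num_sentinels // 2 + 1)


def elect_leader(sentinels: List[str],
                 requests: List[Tuple[str, str]],
                 quorum: int) -> Optional[str]:
    """requests is a list of (candidate, voter) pairs in arrival order.

    Each voter grants its vote to the first candidate that asks and
    refuses everyone after that. Returns the leader, or None if nobody
    reaches the threshold in this epoch.
    """
    granted: Dict[str, str] = {}
    for candidate, voter in requests:
        granted.setdefault(voter, candidate)  # first request wins
    tally = Counter(granted.values())
    needed = votes_needed(quorum, len(sentinels))
    for candidate, votes in tally.most_common():
        if votes >= needed:
            return candidate
    return None


# Mirrors the logs: Sentinel-2 votes for itself, then Sentinel-3 and Sentinel-1 agree.
sentinels = ["S1", "S2", "S3"]
requests = [("S2", "S2"), ("S2", "S3"), ("S2", "S1")]
print(elect_leader(sentinels, requests, quorum=2))
```

If the votes split so that no candidate reaches the threshold, the function returns None; in that situation a new election in a later epoch would be needed.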
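How the new master is selected
1. Discard all slaves that are already down or disconnected.
2. Discard slaves that have not replied to the leader Sentinel's INFO command within the last 5 seconds.
3. Discard slaves whose link to the failed master has been broken for more than down-after-milliseconds * 10 milliseconds.
4. Pick the slave with the highest priority.
5. If priorities tie, pick the slave with the largest replication offset.
6. If offsets also tie, pick the slave with the smallest runid.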
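The filter-then-rank procedure above might look like this in code. It is a sketch, not Sentinel's source: the `Slave` fields are hypothetical, the run ids in the example are made up, and it assumes that a lower priority number means a higher priority, as with Redis's slave-priority setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    addr: str
    run_id: str
    priority: int                     # assumed: lower number = higher priority
    repl_offset: int
    is_down: bool
    ms_since_info_reply: int
    ms_disconnected_from_master: int


def select_slave(slaves: List[Slave], down_after_ms: int) -> Optional[Slave]:
    """Apply rules 1-3 as filters, then rank the survivors by rules 4-6."""
    candidates = [
        s for s in slaves
        if not s.is_down                                         # rule 1
        and s.ms_since_info_reply <= 5_000                       # rule 2
        and s.ms_disconnected_from_master <= down_after_ms * 10  # rule 3
    ]
    if not candidates:
        return None
    # Rules 4-6: best priority, then largest offset, then smallest run id.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.run_id))


slaves = [
    Slave("127.0.0.1:6380", "bbbb", 100, 24302, False, 800, 30_000),
    Slave("127.0.0.1:6381", "aaaa", 100, 24302, False, 900, 30_000),
]
chosen = select_slave(slaves, down_after_ms=30_000)
print(chosen.addr if chosen else None)
```

In the example both slaves tie on priority and offset, so the made-up run id decides, which is one plausible way the 6381 node could have been chosen in the logs.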
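Three periodic monitoring tasks
1. Every 10 seconds, each Sentinel node sends the info command to the master and the slaves to obtain the latest topology. This serves to:
1> Learn about the slaves by running info against the master, which is why Sentinel does not need the slaves to be configured explicitly.
2> Notice promptly when a new slave joins.
3> Refresh the topology through info after a node becomes unreachable or a failover completes.
2. Every 2 seconds, each Sentinel node publishes its verdict on the master, together with its own information, to the __sentinel__:hello channel of the Redis data nodes. Every Sentinel node also subscribes to that channel to learn about the other Sentinel nodes and their verdicts on the master. This serves to:
1> Discover new Sentinel nodes: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the others. When a newly joined Sentinel appears, its information is saved and a connection to it is created.
2> Exchange the master's state between Sentinel nodes, which is the basis for the later objectively-down verdict and the leader election.
3. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are currently reachable. This task is the main basis for deciding that a node has failed.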
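As an illustration of the second task, the sketch below parses one message from the __sentinel__:hello channel. The eight comma-separated fields reflect my understanding of what Sentinel publishes; treat that layout as an assumption to check against your Redis version. The sample payload is constructed from the addresses in this article, not captured from a real run, and `Hello` and `parse_hello` are hypothetical names.

```python
from typing import NamedTuple


class Hello(NamedTuple):
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Parse an assumed 8-field, comma-separated hello payload."""
    fields = payload.split(",")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ip, port, runid, epoch, name, m_ip, m_port, m_epoch = fields
    return Hello(ip, int(port), runid, int(epoch), name, m_ip, int(m_port), int(m_epoch))


# Constructed example: Sentinel-2 announcing the new master after the failover.
sample = "127.0.0.1,26380,18311edfbfb7bf89fe4b67d08ef432053db62fff,1,mymaster,127.0.0.1,6381,1"
hello = parse_hello(sample)
print(hello.sentinel_runid[:8], "says", hello.master_name, "is at",
      f"{hello.master_ip}:{hello.master_port}")
```

A receiving Sentinel would compare the sender's run id against the Sentinels it already knows, which corresponds to the discovery step described above.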
Sentinel parameters
# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes
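Where:
dir: sets Sentinel's working directory.
sentinel monitor mymaster 127.0.0.1 6379 2: the 2 is the quorum, meaning at least two Sentinel nodes must consider the master subjectively down before it can be marked objectively down. The usual recommendation is half the number of Sentinel nodes plus 1. The quorum also affects the Sentinel leader election: electing a leader requires votes from at least max(quorum, num(sentinels) / 2 + 1) Sentinel nodes.
sentinel down-after-milliseconds mymaster 30000: each Sentinel node periodically sends ping commands to determine whether the Redis nodes and the other Sentinel nodes are reachable.
If no valid reply arrives from the master within the specified time, the master is marked subjectively down. Note that this parameter is used not only to judge the master but also the slaves under that master and the other Sentinels. The default is 30 s.
sentinel parallel-syncs mymaster 1: how many slaves may be repointed to the new master at the same time during a failover. Replication does not block the master, but if this value (numslaves) is set high, several slaves syncing from the new master at once increases the master's network and disk I/O load.
sentinel failover-timeout mymaster 180000: defines the failover timeout. The default is 180000, in milliseconds, i.e. 3 min. Note that this is not the total time allowed for a failover; it applies to several distinct stages of the failover.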
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
The first case: if Redis Sentinel fails to fail over a master, the earliest time the next failover against that master may start is 2 × failover-timeout later. This is visible in the Sentinel log (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018).
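The arithmetic behind that log line is easy to check. The sketch below assumes the attempt started at 16:04:04 (the log only prints whole seconds for the deadline) and uses the failover-timeout of 180000 ms from the configuration above; the function name is illustrative.

```python
from datetime import datetime, timedelta


def next_failover_not_before(attempt_started: datetime,
                             failover_timeout_ms: int) -> datetime:
    """Earliest time a new failover may start: 2 x failover-timeout later."""
    return attempt_started + timedelta(milliseconds=2 * failover_timeout_ms)


started = datetime(2018, 10, 8, 16, 4, 4)
print(next_failover_not_before(started, 180_000))  # 2018-10-08 16:10:04
```

Two times 180000 ms is 6 minutes, which lands exactly on the 16:10:04 printed by the Sentinel.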
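sentinel notification-script: defines a notification script that is invoked whenever Sentinel generates a WARNING-level event. It receives two arguments: the event type and the event description.
sentinel client-reconfig-script: the script defined here is invoked when the master changes. It receives the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
Scripts must follow certain rules.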
# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
sentinel deny-scripts-reconfig: disallows using SENTINEL SET to change notification-script and client-reconfig-script.
Common Sentinel operations
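To tie the script arguments and exit-code rules together, here is a minimal sketch of the core of a client-reconfig-script written in Python. The function `update_client_config` is a hypothetical placeholder for whatever a deployment actually needs to do, and the mapping of error types to exit codes is one possible choice, not something Redis prescribes.

```python
from typing import List

EXIT_OK = 0
EXIT_RETRY = 1     # Sentinel retries the script later (up to 10 times)
EXIT_NO_RETRY = 2  # Sentinel does not retry


def update_client_config(master_name: str, ip: str, port: int) -> None:
    """Hypothetical placeholder: push the new master address to clients."""
    print(f"{master_name} is now served by {ip}:{port}")


def handle_reconfig(argv: List[str]) -> int:
    """Handle <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>."""
    if len(argv) != 7:
        return EXIT_NO_RETRY  # malformed call: retrying would not help
    master_name, role, state, from_ip, from_port, to_ip, to_port = argv
    try:
        update_client_config(master_name, to_ip, int(to_port))
    except ValueError:
        return EXIT_NO_RETRY  # bad port value: permanent error
    except OSError:
        return EXIT_RETRY     # e.g. temporary I/O problem: ask for a retry
    return EXIT_OK


# The role and state values here are illustrative.
code = handle_reconfig(
    ["mymaster", "leader", "failover", "127.0.0.1", "6379", "127.0.0.1", "6381"]
)
print("exit code:", code)
```

In a real script the last step would be passing `sys.argv[1:]` to `handle_reconfig` and exiting with its return value; the work should also stay well under the 60-second limit mentioned in the rules.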
- PING This command simply returns PON