為什麼創建 Redis 集群時會自動錯開主從節點？

哈嘍大家好，我是鹹魚在《[一臺伺服器上部署 Redis 偽集群》](https://mp.weixin.qq.com/s?__biz=MzkzNzI1MzE2Mw==&mid=2247486439&idx=1&sn=0b10317397ef3259dd98d493915dd706&chksm=c2 ...

哈嘍大家好，我是鹹魚

在《一臺伺服器上部署 Redis 偽集群》這篇文章中，鹹魚在創建 Redis 集群時並沒有明確指定哪個 Redis 實例將擔任 master，哪個將擔任 slave

/usr/local/redis-4.0.9/src/redis-trib.rb create --replicas 1 192.168.149.131:6379 192.168.149.131:26379 192.168.149.131:6380 192.168.149.131:26380 192.168.149.131:6381 192.168.149.131:26381

然而 Redis 卻自動完成了主從節點的分配工作

如果大家在多台伺服器部署過 Redis 集群的話，比如說在三台機器上部署三主三從的 redis 集群，你會觀察到 Redis 自動地將主節點和從節點的部署位置錯開

舉個例子： master 1 和 slave 3 在同一臺機器上； master 2和 slave 1 在同一臺機器上； master 3 和 slave 2 在同一臺機器上

這是為什麼呢？

我們知道老版本的 Redis 集群管理命令是 redis-trib.rb，新版本則換成了 redis-cli

這兩個可執行文件其實是一個用 C 編寫的腳本，小伙伴們如果看過這兩個文件的源碼就會發現原因就在下麵這段代碼里

/* Return the anti-affinity score, which is a measure of the amount of
 * violations of anti-affinity in the current cluster layout, that is, how
 * badly the masters and slaves are distributed in the different IP
 * addresses so that slaves of the same master are not in the master
 * host and are also in different hosts.
 *
 * The score is calculated as follows:
 *
 * SAME_AS_MASTER = 10000 * each slave in the same IP of its master.
 * SAME_AS_SLAVE  = 1 * each slave having the same IP as another slave
                      of the same master.
 * FINAL_SCORE = SAME_AS_MASTER + SAME_AS_SLAVE
 *
 * So a greater score means a worse anti-affinity level, while zero
 * means perfect anti-affinity.
 *
 * The anti affinity optimization will try to get a score as low as
 * possible. Since we do not want to sacrifice the fact that slaves should
 * not be in the same host as the master, we assign 10000 times the score
 * to this violation, so that we'll optimize for the second factor only
 * if it does not impact the first one.
 *
 * The ipnodes argument is an array of clusterManagerNodeArray, one for
 * each IP, while ip_count is the total number of IPs in the configuration.
 *
 * The function returns the above score, and the list of
 * offending slaves can be stored into the 'offending' argument,
 * so that the optimizer can try changing the configuration of the
 * slaves violating the anti-affinity goals. */
static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
	...
    return score;
}

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes,
    int ip_count)
{
	...
}

通過註釋我們可以得知，clusterManagerGetAntiAffinityScore 函數是用來計算反親和性得分，這個得分表示了當前 Redis 集群佈局中是否符合反親和性的要求

反親和性指的是 master 和 slave 不應該在同一臺機器上，也不應該在相同的 IP 地址上

那如何計算反親和性得分呢？

如果有多個 slave 與同一個 master 在相同的 IP 地址上，那麼對於每個這樣的 slave，得分增加 10000
如果有多個 slave 在相同的 IP 地址上，但它們彼此之間不是同一個 master，那麼對於每個這樣的 slave，得分增加 1
最終得分是上述兩部分得分之和

也就是說，得分越高，親和性越高；得分越低，反親和性越高；得分為零表示完全符合反親和性的要求

獲得得分之後，就會對得分高（反親和性低）的節點進行優化

為了讓 Redis 主從之間的反親和性更高，clusterManagerOptimizeAntiAffinity 函數會對那些反親和性很低的節點進行優化，它會嘗試通過交換從節點的主節點，來改善集群中主從節點分佈，從而減少反親和性低問題

接下來我們分別來看下這兩個函數

反親和性得分計算

static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
	...
}

可以看到，該函數接受了四個參數：

ipnodes：一個包含多個 clusterManagerNodeArray 結構體的數組，每個結構體表示一個 IP 地址上的節點數組
ip_count：IP 地址的總數
offending：用於存儲違反反親和性規則的節點的指針數組（可選參數）
offending_len：存儲 offending 數組中節點數量的指針（可選參數）

第一層 for 迴圈是遍歷 ip 地址，第二層迴圈是遍歷每個 IP 地址的節點數組

    ...
    for (i = 0; i < ip_count; i++) {
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
			...
        }
    ...

我們來看下第二層 for 迴圈

    for (i = 0; i < ip_count; i++) {
        /* 獲取每個 IP 地址的節點數組 */
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        /* 創建字典 related */
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
            /* 獲取當前節點 */
            clusterManagerNode *node = node_array->nodes[j];
			...
            /* 在 related 字典中查找是否已經存在相應的鍵 */
            dictEntry *entry = dictFind(related, key);
            if (entry) types = sdsdup((sds) dictGetVal(entry));
            else types = sdsempty();
            if (node->replicate) types = sdscat(types, "s");
            else {
                sds s = sdscatsds(sdsnew("m"), types);
                sdsfree(types);
                types = s;
            }
            dictReplace(related, key, types);
        }

首先遍歷每個 IP 地址的節點數組，對於每個 IP 地址上的節點數組，函數通過字典related來記錄相同主節點和從節點的關係

其中字典 related的 key 是節點的名稱，value 是一個字元串，表示該節點類型 types

對於每個節點，根據節點構建一個字元串類型的關係標記（types），將主節點標記為 m，從節點標記為 s

然後通過字典將相同關係標記的節點關聯在一起，構建了一個記錄相同主從節點關係的字典 related

	...     
		/* 創建字典迭代器，用於遍歷節點分組信息 */
		dictIterator *iter = dictGetIterator(related);
        dictEntry *entry;
        while ((entry = dictNext(iter)) != NULL) {
            /* key 是節點名稱，value 是 types，即節點類型 */
            sds types = (sds) dictGetVal(entry);
            sds name = (sds) dictGetKey(entry);
            int typeslen = sdslen(types);
            if (typeslen < 2) continue;
            /* 計算反親和性得分 */
            if (types[0] == 'm') score += (10000 * (typeslen - 1));
            else score += (1 * typeslen);
            ...
        }

上面代碼片段可知，while 迴圈遍歷字典 related中的分組信息，計算相同主從節點關係的得分

獲取節點類型信息並長度
如果是主節點類型，得分 += (10000 * (typeslen - 1))；否則，得分 += (1 * typeslen)

如果有提供 offending 參數，將找到違反反親和性規則的節點並存儲到 offending 數組中，同時更新違反規則節點的數量，如下代碼所示

            if (offending == NULL) continue;
            /* Populate the list of offending nodes. */
            listIter li;
            listNode *ln;
            listRewind(cluster_manager.nodes, &li);
            while ((ln = listNext(&li)) != NULL) {
                clusterManagerNode *n = ln->value;
                if (n->replicate == NULL) continue;
                if (!strcmp(n->replicate, name) && !strcmp(n->ip, ip)) {
                    *(offending_p++) = n;
                    if (offending_len != NULL) (*offending_len)++;
                    break;
                }
            }

最後返回得分 score，完整函數代碼如下

static int clusterManagerGetAntiAffinityScore(clusterManagerNodeArray *ipnodes,
    int ip_count, clusterManagerNode ***offending, int *offending_len)
{
    int score = 0, i, j;
    int node_len = cluster_manager.nodes->len;
    clusterManagerNode **offending_p = NULL;
    if (offending != NULL) {
        *offending = zcalloc(node_len * sizeof(clusterManagerNode*));
        offending_p = *offending;
    }
    /* For each set of nodes in the same host, split by
     * related nodes (masters and slaves which are involved in
     * replication of each other) */
    for (i = 0; i < ip_count; i++) {
        clusterManagerNodeArray *node_array = &(ipnodes[i]);
        dict *related = dictCreate(&clusterManagerDictType);
        char *ip = NULL;
        for (j = 0; j < node_array->len; j++) {
            clusterManagerNode *node = node_array->nodes[j];
            if (node == NULL) continue;
            if (!ip) ip = node->ip;
            sds types;
            /* We always use the Master ID as key. */
            sds key = (!node->replicate ? node->name : node->replicate);
            assert(key != NULL);
            dictEntry *entry = dictFind(related, key);
            if (entry) types = sdsdup((sds) dictGetVal(entry));
            else types = sdsempty();
            /* Master type 'm' is always set as the first character of the
             * types string. */
            if (node->replicate) types = sdscat(types, "s");
            else {
                sds s = sdscatsds(sdsnew("m"), types);
                sdsfree(types);
                types = s;
            }
            dictReplace(related, key, types);
        }
        /* Now it's trivial to check, for each related group having the
         * same host, what is their local score. */
        dictIterator *iter = dictGetIterator(related);
        dictEntry *entry;
        while ((entry = dictNext(iter)) != NULL) {
            sds types = (sds) dictGetVal(entry);
            sds name = (sds) dictGetKey(entry);
            int typeslen = sdslen(types);
            if (typeslen < 2) continue;
            if (types[0] == 'm') score += (10000 * (typeslen - 1));
            else score += (1 * typeslen);
            if (offending == NULL) continue;
            /* Populate the list of offending nodes. */
            listIter li;
            listNode *ln;
            listRewind(cluster_manager.nodes, &li);
            while ((ln = listNext(&li)) != NULL) {
                clusterManagerNode *n = ln->value;
                if (n->replicate == NULL) continue;
                if (!strcmp(n->replicate, name) && !strcmp(n->ip, ip)) {
                    *(offending_p++) = n;
                    if (offending_len != NULL) (*offending_len)++;
                    break;
                }
            }
        }
        //if (offending_len != NULL) *offending_len = offending_p - *offending;
        dictReleaseIterator(iter);
        dictRelease(related);
    }
    return score;
}

反親和性優化

計算出反親和性得分之後，對於那些得分很低的節點，redis 就需要對其進行優化，提高集群中節點的分佈，以避免節點在同一主機上

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes, int ip_count){	
    clusterManagerNode **offenders = NULL;
    int score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count,
                                                   NULL, NULL);
    if (score == 0) goto cleanup;  
    ...
cleanup:
    zfree(offenders);
}

從上面的代碼可以看到，如果得分為 0 ，說明反親和性已經很好，無需優化。直接跳到 cleanup 去釋放 offenders 節點的記憶體空間

如果得分不為 0 ，則就會設置一個最大迭代次數maxiter，這個次數跟節點的數量成正比，然後 while 迴圈在有限次迭代內進行優化操作

    ...
    int maxiter = 500 * node_len; // Effort is proportional to cluster size...
    while (maxiter > 0) {
    	...
        maxiter--;
    }
    ...

這個函數的核心就在 while 迴圈里，我們來看一下其中的一些片段

首先 offending_len 來存儲違反規則的節點數，然後如果之前有違反規則的節點(offenders != NULL)則釋放掉（zfree(offenders)）

然後重新計算得分，如果得分為0或沒有違反規則的節點，退出 while 迴圈

    int offending_len = 0;  
    if (offenders != NULL) {
        zfree(offenders);  // 釋放之前存儲的違反規則的節點
        offenders = NULL;
    }
    score = clusterManagerGetAntiAffinityScore(ipnodes,
                                               ip_count,
                                               &offenders,
                                               &offending_len);
    if (score == 0 || offending_len == 0) break;

接著去隨機選擇一個違反規則的節點，嘗試交換分配的 master

        int rand_idx = rand() % offending_len;
        clusterManagerNode *first = offenders[rand_idx],
                           *second = NULL;

		// 創建一個數組，用來存儲其他可交換 master 的 slave
        clusterManagerNode **other_replicas = zcalloc((node_len - 1) *
                                                      sizeof(*other_replicas));

然後遍歷集群中的節點，尋找能夠交換 master 的 slave。如果沒有找到，那就退出迴圈

    while ((ln = listNext(&li)) != NULL) {
        clusterManagerNode *n = ln->value;
        if (n != first && n->replicate != NULL)
            other_replicas[other_replicas_count++] = n;
    }
    
    if (other_replicas_count == 0) {
        zfree(other_replicas);
        break;
    }

如果找到了，就開始交換並計算交換後的反親和性得分

    // 隨機選擇一個可交換的節點作為交換目標
    rand_idx = rand() % other_replicas_count;
    second = other_replicas[rand_idx];
    
    // 交換兩個 slave 的 master 分配
    char *first_master = first->replicate,
         *second_master = second->replicate;
    first->replicate = second_master, first->dirty = 1;
    second->replicate = first_master, second->dirty = 1;
    
    // 計算交換後的反親和性得分
    int new_score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                       ip_count,
                                                       NULL, NULL);

如果交換後的得分比之前的得分還大，說明白交換了，還不如不交換，就會回顧；如果交換後的得分比之前的得分小，說明交換是沒毛病的

    if (new_score > score) {
        first->replicate = first_master;
        second->replicate = second_master;
    }

最後釋放資源，準備下一次 while 迴圈

    zfree(other_replicas);
    maxiter--;

總結一下：

每次 while 迴圈會嘗試隨機選擇一個違反反親和性規則的從節點，並與另一個隨機選中的從節點交換其主節點分配，然後重新計算交換後的反親和性得分
如果交換後的得分變大，說明交換不利於反親和性，會回滾交換
如果交換後得分變小，則保持，後面可能還需要多次交換
這樣，通過多次隨機的交換嘗試，最終可以達到更好的反親和性分佈

最後則是一些收尾工作，像輸出日誌信息，釋放記憶體等，這裡不過多介紹

    char *msg;
    int perfect = (score == 0);
    int log_level = (perfect ? CLUSTER_MANAGER_LOG_LVL_SUCCESS :
                               CLUSTER_MANAGER_LOG_LVL_WARN);
    if (perfect) msg = "[OK] Perfect anti-affinity obtained!";
    else if (score >= 10000)
        msg = ("[WARNING] Some slaves are in the same host as their master");
    else
        msg=("[WARNING] Some slaves of the same master are in the same host");
    clusterManagerLog(log_level, "%s\n", msg);

下麵是完整代碼

static void clusterManagerOptimizeAntiAffinity(clusterManagerNodeArray *ipnodes,
    int ip_count)
{
    clusterManagerNode **offenders = NULL;
    int score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count,
                                                   NULL, NULL);
    if (score == 0) goto cleanup;
    clusterManagerLogInfo(">>> Trying to optimize slaves allocation "
                          "for anti-affinity\n");
    int node_len = cluster_manager.nodes->len;
    int maxiter = 500 * node_len; // Effort is proportional to cluster size...
    srand(time(NULL));
    while (maxiter > 0) {
        int offending_len = 0;
        if (offenders != NULL) {
            zfree(offenders);
            offenders = NULL;
        }
        score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                   ip_count,
                                                   &offenders,
                                                   &offending_len);
        if (score == 0 || offending_len == 0) break; // Optimal anti affinity reached
        /* We'll try to randomly swap a slave's assigned master causing
         * an affinity problem with another random slave, to see if we
         * can improve the affinity. */
        int rand_idx = rand() % offending_len;
        clusterManagerNode *first = offenders[rand_idx],
                           *second = NULL;
        clusterManagerNode **other_replicas = zcalloc((node_len - 1) *
                                                      sizeof(*other_replicas));
        int other_replicas_count = 0;
        listIter li;
        listNode *ln;
        listRewind(cluster_manager.nodes, &li);
        while ((ln = listNext(&li)) != NULL) {
            clusterManagerNode *n = ln->value;
            if (n != first && n->replicate != NULL)
                other_replicas[other_replicas_count++] = n;
        }
        if (other_replicas_count == 0) {
            zfree(other_replicas);
            break;
        }
        rand_idx = rand() % other_replicas_count;
        second = other_replicas[rand_idx];
        char *first_master = first->replicate,
             *second_master = second->replicate;
        first->replicate = second_master, first->dirty = 1;
        second->replicate = first_master, second->dirty = 1;
        int new_score = clusterManagerGetAntiAffinityScore(ipnodes,
                                                           ip_count,
                                                           NULL, NULL);
        /* If the change actually makes thing worse, revert. Otherwise
         * leave as it is because the best solution may need a few
         * combined swaps. */
        if (new_score > score) {
            first->replicate = first_master;
            second->replicate = second_master;
        }
        zfree(other_replicas);
        maxiter--;
    }
    score = clusterManagerGetAntiAffinityScore(ipnodes, ip_count, NULL, NULL);
    char *msg;
    int perfect = (score == 0);
    int log_level = (perfect ? CLUSTER_MANAGER_LOG_LVL_SUCCESS :
                               CLUSTER_MANAGER_LOG_LVL_WARN);
    if (perfect) msg = "[OK] Perfect anti-affinity obtained!";
    else if (score >= 10000)
        msg = ("[WARNING] Some slaves are in the same host as their master");
    else
        msg=("[WARNING] Some slaves of the same master are in the same host");
    clusterManagerLog(log_level, "%s\n", msg);
cleanup:
    zfree(offenders);
}