Linux CFS scheduler queue operations -- Linux process management and scheduling (27)

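1. CFS task enqueue and dequeue

The Completely Fair Scheduler (CFS) provides two functions for adding and removing run queue members: enqueue_task_fair adds a task to the CFS run queue, and dequeue_task_fair removes one.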

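2 The enqueue_task_fair enqueue operation

2.1 The enqueue_task_fair function

Placing a new task on the run queue is the job of enqueue_task_fair, which is defined in kernel/sched/fair.c, line 5442.

The function inserts the task pointed to by task_struct *p into the run queue rq. Besides these two pointers it takes a third parameter, flags (older kernels used an int wakeup here). The ENQUEUE_WAKEUP bit in flags tells the function whether the task being enqueued was woken up only recently and has just become runnable (bit set), or was already runnable before (bit clear). Its prototype is as follows: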

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

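enqueue_task_fair works as follows:

If the on_rq member of struct sched_entity shows that the task is already on the run queue, there is nothing to do.
Otherwise the actual work is delegated to enqueue_entity, where the kernel takes the opportunity to update the statistics with update_curr.

Inside enqueue_entity, __enqueue_entity is called where needed to insert the task at the right node of the CFS red-black tree.

2.2 The complete enqueue_task_fair function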

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se) {
        if (se->on_rq)
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, flags);

        /*
         * end evaluation on encountering a throttled cfs_rq
         *
         * note: in the case of encountering a throttled cfs_rq we will
         * post the final h_nr_running increment below.
        */
        if (cfs_rq_throttled(cfs_rq))
            break;
        cfs_rq->h_nr_running++;

        flags = ENQUEUE_WAKEUP;
    }

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        cfs_rq->h_nr_running++;

        if (cfs_rq_throttled(cfs_rq))
            break;

        update_load_avg(se, 1);
        update_cfs_shares(cfs_rq);
    }

    if (!se)
        add_nr_running(rq, 1);

    hrtick_update(rq);
}

2.3 for_each_sched_entity

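The kernel first obtains the scheduling entity of the task p that is to be added, then uses for_each_sched_entity to iterate over that entity and, with group scheduling, each of its parent entities:

//  enqueue_task_fair function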
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se)
    {
    /*  ......  */
    }
}

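Linux support for group scheduling is enabled with CONFIG_FAIR_GROUP_SCHED. Many functions are implemented differently depending on whether the option is set, and the for_each_sched_entity macro is a particularly clear case: compare its definition with CONFIG_FAIR_GROUP_SCHED enabled and with it disabled.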
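The difference between the two builds can be tried out in user space. The two macro bodies below follow the kernel's definitions (walk up through se->parent with group scheduling, run the body exactly once without it). The _group and _flat suffixes, the cut-down struct and the count_levels_* helpers are made up for this illustration so that both variants fit in one file; this is a sketch, not kernel code.

```c
#include <stddef.h>

/* Cut-down stand-in for the kernel's struct sched_entity (illustration only). */
struct sched_entity {
    struct sched_entity *parent;   /* NULL for a top-level entity */
};

/* With CONFIG_FAIR_GROUP_SCHED: visit the entity and every ancestor. */
#define for_each_sched_entity_group(se) \
    for (; se; se = se->parent)

/* Without CONFIG_FAIR_GROUP_SCHED: the loop body runs exactly once. */
#define for_each_sched_entity_flat(se) \
    for (; se; se = NULL)

/* Hypothetical helper: number of entities visited by the group variant. */
static int count_levels_group(struct sched_entity *se)
{
    int n = 0;

    for_each_sched_entity_group(se)
        n++;
    return n;
}

/* Hypothetical helper: number of entities visited by the flat variant. */
static int count_levels_flat(struct sched_entity *se)
{
    int n = 0;

    for_each_sched_entity_flat(se)
        n++;
    return n;
}
```

In both variants se is NULL after a complete walk, which is what the if (!se) test at the end of enqueue_task_fair relies on: an early break (for example on a throttled cfs_rq) leaves se non-NULL.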

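If the on_rq member of struct sched_entity shows that the task is already on the run queue, there is nothing to do.
Otherwise the actual work is delegated to enqueue_entity, where the kernel takes the opportunity to update the statistics with update_curr.

//  enqueue_task_fair function (loop body)
{
        /*  the entity is already on the run queue  */
        if (se->on_rq)
            break;

        /*  get the cfs_rq run queue this entity belongs to  */
        cfs_rq = cfs_rq_of(se);
        /*  the kernel delegates the actual insertion to enqueue_entity  */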
        enqueue_entity(cfs_rq, se, flags);
}
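Why the loop may stop at the first entity that is already queued can be seen in a toy model. The struct toy_entity type and the toy_enqueue function are hypothetical, and setting on_rq stands in for the call to enqueue_entity; throttling and load accounting are left out.

```c
#include <stddef.h>

struct toy_entity {
    struct toy_entity *parent;  /* group entity one level up, or NULL */
    int on_rq;                  /* 1 if already queued on its cfs_rq */
};

/*
 * Walk upwards and queue entities until one is found that is already
 * queued. Returns the number of entities that were newly queued.
 */
static int toy_enqueue(struct toy_entity *se)
{
    int newly_queued = 0;

    for (; se; se = se->parent) {
        if (se->on_rq)
            break;          /* in this model, the levels above are queued too */
        se->on_rq = 1;      /* stands in for enqueue_entity() */
        newly_queued++;
    }
    return newly_queued;
}
```

When the first task of a group is enqueued, the task entity and the group entity are both queued. A second task in the same group only queues itself, because the walk ends at the group entity that is already on its run queue.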

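2.4 enqueue_entity inserts the task

enqueue_entity performs the actual enqueue. Its steps are:

Update some statistics with update_curr, update_cfs_shares and related helpers.
If the task was sleeping before, call place_entity, which first adjusts the task's virtual runtime.
Finally, insert the entity into the red-black tree with __enqueue_entity, unless it is the currently running entity.

If the task has been running recently, its virtual runtime is still valid, so (unless it is the one currently executing) it can be inserted into the red-black tree directly with __enqueue_entity. That function has to handle some red-black tree mechanics, for which it can rely on the kernel's standard implementation; see __enqueue_entity in kernel/sched/fair.c, line 483.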

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through calling update_curr().
     *
     * If the task was already runnable and is not being woken up,
     * min_vruntime is added back to its virtual runtime.
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     * Update the task's statistics
     */
    update_curr(cfs_rq);
    enqueue_entity_load_avg(cfs_rq, se);
    account_entity_enqueue(cfs_rq, se);
    update_cfs_shares(cfs_rq);

    /*  the task was sleeping and has just been woken up  */
    if (flags & ENQUEUE_WAKEUP)
    {
        /*  adjust the task's virtual runtime  */
        place_entity(cfs_rq, se, 0);
        if (schedstat_enabled())
            enqueue_sleeper(cfs_rq, se);
    }

    check_schedstat_required();
    if (schedstat_enabled()) {
        update_stats_enqueue(cfs_rq, se);
        check_spread(cfs_rq, se);
    }

    /*  insert the task into the red-black tree  */
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
    se->on_rq = 1;

    if (cfs_rq->nr_running == 1) {
        list_add_leaf_cfs_rq(cfs_rq);
        check_enqueue_throttle(cfs_rq);
    }
}

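2.5 place_entity handles sleeping tasks

If the task was sleeping before, place_entity is called to adjust its virtual runtime.

Suppose the vruntime of a sleeping task stayed unchanged while the vruntime of the running tasks kept advancing. When the sleeper finally woke up, its vruntime would be far smaller than everyone else's, it would be able to hold the CPU for a long time, and the other tasks would starve. That is unfairness of another kind, so CFS resets the vruntime of a sleeping task when it wakes up: the new value is based on min_vruntime and includes some compensation, but not too much.

This reset is done by place_entity. A newly created task also gets its initial vruntime from place_entity. The third parameter, initial, distinguishes the two cases: creation of a new task and wakeup of a sleeping one.

place_entity is defined in kernel/sched/fair.c, line 3135: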

//  http://lxr.free-electrons.com/source/kernel/sched/fair.c?v=4.6#L3134
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     *
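     * When a new task is enqueued for the first time, its vruntime
     * has to be initialized. Normally the min_vruntime of the cfs_rq
     * is good enough. But the tasks already running have been promised
     * the current period, so the new task's vruntime is pushed back
     * by one slice of its own. Enqueuing the new task increases the
     * total weight of the run queue, while the period shared by all
     * tasks does not necessarily grow with it, so each task's promised
     * time shrinks, i.e. the tasks' virtual clocks are slowed down.
     */
    /*  initial marks the task as newly created  */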
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);

    /* sleeps up to a single latency don't count.
     * a task waking up from sleep  */
    if (!initial)
    {
        /*  one scheduling latency period  */
        unsigned long thresh = sysctl_sched_latency;

        /*
         * Halve their sleep time's effect, to allow
         * for a gentler effect of sleepers:
         */
        /*  if GENTLE_FAIR_SLEEPERS is set  */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;   /*  compensation is cut to half a latency period  */

        vruntime -= thresh;
    }

    /* ensure we never gain time by being placed backwards.
     * when an existing task is woken up, vruntime only moves forward
     */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

As we can see, the enqueue path (enqueue_task_fair via enqueue_entity) calls place_entity with initial set to 0:

place_entity(cfs_rq, se, 0);

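So the statements after if (!initial) are executed. Once a task sleeps, its vruntime stops growing. When it wakes up an unknown amount of time has passed, and its vruntime may be far below min_vruntime. If it were simply inserted into the run queue, it would chase min_vruntime for as long as it takes, because it would always be the leftmost node of the red-black tree. It would consume a large amount of CPU time and starve the tasks further right in the tree. Yet woken tasks must be served promptly, since they may have work that needs immediate attention. The kernel therefore compromises: it takes the current cfs_rq->min_vruntime minus sysctl_sched_latency (halved when GENTLE_FAIR_SLEEPERS is set) as the candidate vruntime, which places the task at or near the left edge of the run queue, so the freshly woken task is scheduled onto the CPU once the currently running task has used up its time. If the task has not slept that long, its original value is kept instead, via se->vruntime = max_vruntime(se->vruntime, vruntime). What is the benefit? In my view it lines up all the woken tasks, and those that slept longer are served sooner.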
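The arithmetic of the wakeup branch can be checked with a small user-space sketch. The function names are hypothetical; latency_ns plays the role of sysctl_sched_latency and gentle the role of the GENTLE_FAIR_SLEEPERS feature, both passed as plain parameters here. The wrap-safe comparison is modelled on the kernel's max_vruntime helper.

```c
#include <stdint.h>

/* Wrap-safe maximum of two vruntime values, modelled on max_vruntime(). */
static uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
    int64_t delta = (int64_t)(b - a);

    return delta > 0 ? b : a;
}

/*
 * Placement of a woken sleeper (the !initial branch of place_entity).
 * Returns the vruntime the task ends up with.
 */
static uint64_t toy_place_sleeper(uint64_t se_vruntime, uint64_t min_vruntime,
                                  uint64_t latency_ns, int gentle)
{
    uint64_t thresh = latency_ns;

    if (gentle)
        thresh >>= 1;       /* compensate by half a latency period only */

    /* never move the task backwards in virtual time */
    return toy_max_vruntime(se_vruntime, min_vruntime - thresh);
}
```

Taking 6 ms as an example latency and min_vruntime at 100 ms, a task that slept for a long time (vruntime 10 ms) is placed at 97 ms with the gentle setting, while a task that slept only briefly (vruntime 99 ms) keeps its 99 ms.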

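When a new task is created, initial is 1, so vruntime += sched_vslice(cfs_rq, se); is executed (provided the START_DEBIT feature is set). Here vruntime starts out as the min_vruntime of the current CFS run queue. A new task should be scheduled soon to keep response times low, and we already know that the smaller a task's vruntime, the further left it sits in the red-black tree and the sooner it gets the CPU. However, the kernel has to decide when to schedule the new task based on its weight; it cannot let any arbitrary task jump ahead of the leftmost tasks already on the queue, because every task on the run queue should receive the share of time it is entitled to. sched_vslice converts the task's load weight into the equivalent amount of virtual time; it is defined in kernel/sched/fair.c, line 626.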
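How a weight turns into virtual time can be sketched with plain integer arithmetic. This is a simplification under stated assumptions: a single flat run queue, a fixed period, and ordinary division instead of the kernel's fixed-point inverse-weight multiplication. All toy_* names are hypothetical, and total_weight is assumed to already include the new task's weight.

```c
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL   /* weight of a nice-0 task */

/* Wall-clock share of one period for a task of the given weight. */
static uint64_t toy_sched_slice(uint64_t period_ns, uint64_t weight,
                                uint64_t total_weight)
{
    return period_ns * weight / total_weight;
}

/* Convert wall-clock time to virtual time: heavier tasks age more slowly. */
static uint64_t toy_calc_delta_fair(uint64_t delta_ns, uint64_t weight)
{
    return delta_ns * TOY_NICE_0_LOAD / weight;
}

/* Virtual-time slice, i.e. the amount a new task is pushed back by. */
static uint64_t toy_sched_vslice(uint64_t period_ns, uint64_t weight,
                                 uint64_t total_weight)
{
    return toy_calc_delta_fair(
        toy_sched_slice(period_ns, weight, total_weight), weight);
}
```

With a 6 ms period and three nice-0 tasks (total weight 3072), the new task's wall-clock slice is 2 ms and its virtual slice is also 2 ms. In this simplified model the weight cancels out algebraically (period * NICE_0_LOAD / total_weight), so a heavier task gets a longer wall-clock slice but is charged the same amount of virtual time.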

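The function uses the value of initial to tell the two cases apart. In general the parameter is set only when a new task is added to the system, which is not the case here:

Because the kernel has promised that every active task runs at least once within the current latency period, the queue's min_vruntime serves as the base virtual time; subtracting sysctl_sched_latency ensures that the newly woken task runs only after the current latency period has ended.

If, however, the task still carries a comparatively large se->vruntime from before it slept, the kernel has to take that into account. If se->vruntime is larger than the difference computed above, it is kept as the task's vruntime. The task then sits further to the right in the red-black tree than a fully compensated sleeper would, and tasks with smaller vruntime values are scheduled earlier.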

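2.6 __enqueue_entity performs the red-black tree insertion

If the task has been running recently, its virtual runtime is valid and it can be added to the red-black tree directly with __enqueue_entity:

//  excerpt from enqueue_entity
    /*  insert the task into the red-black tree  */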
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
    se->on_rq = 1;

__enqueue_entity is defined in kernel/sched/fair.c, line 486. It is essentially a mechanical red-black tree insertion:

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
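     * Locate the position where se belongs in the red-black tree.
     * leftmost records whether that position is the leftmost node:
     * if the search ever descends to the right, leftmost is set to 0;
     * otherwise it only ever went left, ends at the leftmost node,
     * and leftmost stays 1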
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         * nodes are compared using se->vruntime as the key
         */
        if (entity_before(se, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }
    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     * If leftmost is 1, se is now the leftmost node of the red-black tree,
     * i.e. it has the smallest vruntime, so cache it in the rb_leftmost
     * field of the cfs run queue
     */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;

    /*  link the new task's node into the red-black tree  */
    rb_link_node(&se->run_node, parent, link);
    /*  recolor and rebalance the tree after the insertion  */
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}
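The descent and the leftmost cache can be reproduced outside the kernel with an ordinary, unbalanced binary search tree. This is only a sketch: the toy_* types and functions are hypothetical, there is no recoloring or rebalancing, and toy_before imitates the wrap-safe signed comparison that the kernel's entity_before uses on vruntime.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_node {
    uint64_t vruntime;
    struct toy_node *left, *right;
};

struct toy_timeline {
    struct toy_node *root;
    struct toy_node *leftmost;   /* cached node with the smallest vruntime */
};

/* Wrap-safe "a sorts before b", in the spirit of entity_before(). */
static int toy_before(const struct toy_node *a, const struct toy_node *b)
{
    return (int64_t)(a->vruntime - b->vruntime) < 0;
}

static void toy_insert(struct toy_timeline *t, struct toy_node *n)
{
    struct toy_node **link = &t->root;
    int leftmost = 1;

    while (*link) {
        struct toy_node *parent = *link;

        if (toy_before(n, parent)) {
            link = &parent->left;
        } else {
            link = &parent->right;   /* equal keys go to the right */
            leftmost = 0;            /* went right once: not the minimum */
        }
    }

    n->left = n->right = NULL;
    *link = n;                       /* no rebalancing in this sketch */

    if (leftmost)
        t->leftmost = n;
}
```

Because equal keys descend to the right, an entity with the same vruntime as the current leftmost node does not replace it in the cache; the older entry is picked first.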

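3 The dequeue_task_fair dequeue operation

dequeue_task_fair is called when a task goes to sleep, among other cases, and removes the task from the run queue.

Its flow follows the same idea as enqueue_task_fair, with every operation reversed.

dequeue_task_fair works as follows:

For each scheduling entity the actual work is delegated to dequeue_entity, where the kernel takes the opportunity to update the statistics with update_curr.
Inside dequeue_entity, __dequeue_entity is called where needed to remove the entity from the CFS red-black tree, and se->on_rq is cleared.

dequeue_task_fair is defined in /kernel/sched/fair.c, line 4155. Its overall structure is as follows: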

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
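static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;
    int task_sleep = flags & DEQUEUE_SLEEP;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        /*  the actual removal is delegated to dequeue_entity  */
        dequeue_entity(cfs_rq, se, flags);

        /* end evaluation on encountering a throttled cfs_rq */
        if (cfs_rq_throttled(cfs_rq))
            break;
        cfs_rq->h_nr_running--;

        /* Don't dequeue parent if it has other entities besides us */
        if (cfs_rq->load.weight) {
            /* bias pick_next towards this cfs_rq if p went to sleep */
            if (task_sleep && parent_entity(se))
                set_next_buddy(parent_entity(se));

            /* avoid re-evaluating load for this entity */
            se = parent_entity(se);
            break;
        }
        /*  parent entities are dequeued as if they went to sleep  */
        flags |= DEQUEUE_SLEEP;
    }
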

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        cfs_rq->h_nr_running--;

        if (cfs_rq_throttled(cfs_rq))
            break;

        update_load_avg(se, 1);
        update_cfs_shares(cfs_rq);
    }

    if (!se)
        sub_nr_running(rq, 1);

    hrtick_update(rq);
}

3.2 dequeue_entity dequeues a scheduling entity

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    dequeue_entity_load_avg(cfs_rq, se);

    if (schedstat_enabled())
        update_stats_dequeue(cfs_rq, se, flags);

    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    se->on_rq = 0;
    account_entity_dequeue(cfs_rq, se);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!(flags & DEQUEUE_SLEEP))
        se->vruntime -= cfs_rq->min_vruntime;

    /* return excess runtime on last dequeue */
    return_cfs_rq_runtime(cfs_rq);

    update_min_vruntime(cfs_rq);
    update_cfs_shares(cfs_rq);
}
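The vruntime normalization at the end of dequeue_entity pairs with the se->vruntime += cfs_rq->min_vruntime step at the top of enqueue_entity. Each cfs_rq has its own min_vruntime, so an absolute vruntime means little on another queue; what travels with the entity is its offset from min_vruntime. A minimal sketch of that round trip, with hypothetical toy_* names and everything else stripped away:

```c
#include <stdint.h>

struct toy_cfs_rq { uint64_t min_vruntime; };
struct toy_se     { uint64_t vruntime; };

/* Dequeue without DEQUEUE_SLEEP: keep only the offset from min_vruntime. */
static void toy_dequeue_normalize(const struct toy_cfs_rq *rq,
                                  struct toy_se *se)
{
    se->vruntime -= rq->min_vruntime;
}

/* Enqueue without ENQUEUE_WAKEUP: make the offset absolute again. */
static void toy_enqueue_denormalize(const struct toy_cfs_rq *rq,
                                    struct toy_se *se)
{
    se->vruntime += rq->min_vruntime;
}

/* Hypothetical helper: move an entity from one queue to another. */
static void toy_migrate(const struct toy_cfs_rq *from,
                        const struct toy_cfs_rq *to, struct toy_se *se)
{
    toy_dequeue_normalize(from, se);
    toy_enqueue_denormalize(to, se);
}
```

An entity that was 200 units ahead of min_vruntime on the old queue ends up 200 units ahead on the new one. Unsigned wraparound keeps this correct even when the entity was behind min_vruntime, since the intermediate value wraps and wraps back.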

3.3 __dequeue_entity performs the actual removal

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}
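The important detail in __dequeue_entity is the order of operations: the cached leftmost pointer is advanced to the in-order successor before the node is erased. The sketch below shows the same idea with a sorted doubly linked list standing in for the red-black tree, so next plays the role of rb_next. The toy_* names are hypothetical.

```c
#include <stddef.h>

struct toy_item {
    struct toy_item *prev, *next;    /* neighbours in ascending vruntime order */
};

struct toy_queue {
    struct toy_item *leftmost;       /* cached smallest element, or NULL */
};

static void toy_remove(struct toy_queue *q, struct toy_item *it)
{
    /* Advance the cache first, while it->next is still valid. */
    if (q->leftmost == it)
        q->leftmost = it->next;

    /* Unlink the item from its neighbours. */
    if (it->prev)
        it->prev->next = it->next;
    if (it->next)
        it->next->prev = it->prev;
    it->prev = it->next = NULL;
}
```

Removing an element other than the cached one leaves the cache untouched, and removing the last element leaves the cache NULL, matching rb_next returning NULL for the last node of the tree.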