Linux Process Exit in Detail (do_exit) -- Linux Process Management and Scheduling (14)

Process exit in Linux

Ways a process can exit on Linux

Normal exit

Returning from main
Calling exit
Calling _exit

Abnormal exit

Calling abort
Being terminated by a signal
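The two categories can be told apart from the parent by looking at the wait status. The following is a minimal sketch, not part of the original article: the helper names run_child, exit_normally and exit_abnormally are made up for illustration, and only standard POSIX calls are used.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn() in a child, return the raw wait status or -1. */
static int run_child(void (*fn)(void))
{
    int status = 0;
    pid_t pid;

    assert(fn != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(127);                     /* only reached if fn() returns */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Normal exit: the child ends itself with an exit code. */
static void exit_normally(void)
{
    _exit(3);
}

/* Abnormal exit: abort() raises SIGABRT in the calling process. */
static void exit_abnormally(void)
{
    struct rlimit no_core = { 0, 0 };

    setrlimit(RLIMIT_CORE, &no_core);   /* avoid leaving a core file behind */
    abort();
}
```

For the status returned by run_child(exit_normally), WIFEXITED is true and WEXITSTATUS yields 3. For run_child(exit_abnormally), WIFSIGNALED is true and WTERMSIG yields SIGABRT.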

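Differences and relationships between _exit, exit and _Exit

_exit is a Linux system call: it closes all file descriptors and then terminates the process.

exit is a C library function that eventually calls _exit. Before doing so it flushes the standard I/O buffers and calls the functions registered with atexit, among other things. Returning from main in a C program is equivalent to calling exit.

_Exit is a C library function added in C99. It is equivalent to _exit, so you can think of it as calling _exit directly.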
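The difference in cleanup work between exit and _exit can be observed with an atexit handler. The sketch below is an illustration only: the names report_fd, report_handler_ran and atexit_handler_ran are hypothetical. A child registers a handler that writes one byte into a pipe, then terminates with either exit or _exit, and the parent checks whether the byte arrived.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;      /* write end of the pipe, set in the child */

/* atexit handler: tells the parent that library-level cleanup ran. */
static void report_handler_ran(void)
{
    if (write(report_fd, "A", 1) != 1) {
        /* nothing useful can be done here if the write fails */
    }
}

/* Returns 1 if the child's atexit handler ran, 0 if not, -1 on error. */
static int atexit_handler_ran(int use_library_exit)
{
    int fds[2];
    char byte;
    ssize_t n;
    pid_t pid;

    assert(use_library_exit == 0 || use_library_exit == 1);
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(report_handler_ran) != 0)
            _exit(1);
        if (use_library_exit)
            exit(0);            /* runs atexit handlers, then terminates */
        _exit(0);               /* terminates without running them */
    }

    close(fds[1]);              /* so that read() sees EOF if nothing is written */
    n = read(fds[0], &byte, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

atexit_handler_ran(1) reports that the handler ran, while atexit_handler_ran(0) reports that it did not.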

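Basically, _Exit (or _exit; the capitalized version is recommended) is a special API intended for child processes after fork. Its behavior is specified in the POSIX standard (_Exit); for discussion see
c - how to exit a child process

"In the child branch created by fork(), using exit() is normally incorrect, because it can cause the stdio buffers to be flushed twice and temporary files to be deleted unexpectedly."

After fork and before exec, many resources are still shared with the parent (for example, some file descriptors). Calling exit performs cleanup on those resources and can cause unintended side effects, such as deleting temporary files.

"Flush" here means writing the contents of the in-memory buffer out to the file, not merely discarding them (which is why the common trick of calling fflush on stdin is just an abuse). If the parent has buffered data in memory at the time of fork, the child inherits a copy of that buffer, and each of the two processes writes it out once, so the data is duplicated. See c - How does fork() work with buffered streams like stdout?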
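The double flush can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions: the helper name size_after_fork is made up, a tmpfile() stream stands in for stdout redirected to a file, and the stream is assumed to be fully buffered (the usual default for regular files). Six bytes are written into the stdio buffer before fork, and the resulting file size is measured afterwards.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the size of the temporary file after parent and child are done, or -1. */
static long size_after_fork(int child_uses_exit)
{
    struct stat st;
    long size = -1;
    FILE *f;
    pid_t pid;

    assert(child_uses_exit == 0 || child_uses_exit == 1);
    f = tmpfile();
    if (f == NULL)
        return -1;

    fputs("hello\n", f);        /* 6 bytes, still in the user-space buffer */

    pid = fork();
    if (pid < 0) {
        fclose(f);
        return -1;
    }
    if (pid == 0) {
        if (child_uses_exit)
            exit(0);            /* flushes the child's copy of the buffer */
        _exit(0);               /* discards the child's copy */
    }
    waitpid(pid, NULL, 0);

    fflush(f);                  /* the parent writes its own copy */
    if (fstat(fileno(f), &st) == 0)
        size = (long)st.st_size;
    fclose(f);
    return size;
}
```

With exit in the child the file ends up holding the text twice (12 bytes), because parent and child share the file offset. With _exit it holds the text once (6 bytes).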

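System calls for process exit

The _exit and exit_group system calls

The _exit system call

Process exit is carried out by the exit system call, which gives the kernel the opportunity to release the resources used by the process back to the system.

When a process terminates, it normally calls the exit library function (either explicitly by the programmer, or implicitly because exit is arranged to run after the last statement of main) to release the resources the process owns.

The entry point of the exit system call is the sys_exit() function. It takes an error code as its argument, which is used as the exit status of the process.

Its definition is architecture independent; see kernel/exit.c.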

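A multithreaded user-space application corresponds to multiple processes (tasks) in the kernel. These tasks share a virtual address space and resources. Each has its own process ID (pid), but they all have the same thread group ID (tgid), which equals the pid of the group leader.

The Linux kernel does not treat threads specially; they are still managed through task_struct. From the kernel's point of view, a user-space thread is essentially a process. The threads of one process (in the user-space sense) share the same tgid but have different pids. The main thread is the group_leader. For a single-threaded process (in the user-space sense), pid equals tgid.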
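From user space the two kernel IDs are visible as follows: getpid() returns the kernel tgid, and the gettid system call returns the kernel pid of the calling thread. The sketch below is a minimal illustration with hypothetical names (struct ids, current_ids, compare_threads). It assumes a pthreads implementation; on older glibc versions it must be built with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t tgid;     /* what the kernel calls tgid (user space: getpid()) */
    pid_t pid;      /* what the kernel calls pid  (user space: gettid)   */
};

static struct ids current_ids(void)
{
    struct ids v;

    v.tgid = getpid();
    v.pid = (pid_t)syscall(SYS_gettid);
    return v;
}

/* Thread body: record the IDs seen by the worker thread. */
static void *worker(void *arg)
{
    *(struct ids *)arg = current_ids();
    return NULL;
}

/* Fills in the IDs of the calling thread and of one worker thread; 0 on success. */
static int compare_threads(struct ids *main_ids, struct ids *worker_ids)
{
    pthread_t t;

    assert(main_ids != NULL && worker_ids != NULL);
    *main_ids = current_ids();
    if (pthread_create(&t, NULL, worker, worker_ids) != 0)
        return -1;
    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

When compare_threads is called from the main thread, both results carry the same tgid, the pid values differ, and the main thread's pid equals its tgid.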

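We have already discussed this several times.

See

Linux Process IDs -- Linux Process Management and Scheduling (3)

The Process Descriptor task_struct in Detail -- Linux Process Management and Scheduling (1)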

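Why exit_group is also needed

Anyone familiar with how Linux implements threads knows that all threads of a program belong to one thread group. Beyond threads, Linux also lets multiple processes form a process group and multiple process groups form a session. In every case the underlying idea is the same: a set of processes (tasks) treated as a unit. When a multithreaded application exits, it naturally wants every task in its thread group to exit at once.

That is why exit_group exists.

The do_group_exit function kills all processes belonging to the thread group of the current process. It takes the process termination code as its argument; this may be a value specified by the exit_group system call (normal termination) or an error code supplied by the kernel (abnormal termination).

Accordingly, the C library function exit uses the exit_group system call to terminate the whole thread group, while the library function pthread_exit uses the _exit system call to terminate a single thread.

The kernel entry points for the _exit and exit_group system calls are sys_exit and sys_exit_group respectively.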

System call declarations

The declarations are in include/linux/syscalls.h, line 535:

asmlinkage long sys_exit(int error_code);
asmlinkage long sys_exit_group(int error_code);

asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
                                int options, struct rusage __user *ru);
asmlinkage long sys_waitid(int which, pid_t pid,
                           struct siginfo __user *infop,
                           int options, struct rusage __user *ru);
asmlinkage long sys_waitpid(pid_t pid, int __user *stat_addr, int options);

System call numbers

The system calls themselves are implemented in kernel/exit.c:

SYSCALL_DEFINE1(exit, int, error_code)
{
        do_exit((error_code&0xff)<<8);
}


/*
 * this kills every thread in the thread group. Note that any externally
 * wait4()-ing process will get the correct exit code - even if this
 * thread is not the thread group leader.
 */
SYSCALL_DEFINE1(exit_group, int, error_code)
{
        do_group_exit((error_code & 0xff) << 8);
        /* NOTREACHED */
        return 0;
}
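The (error_code & 0xff) << 8 encoding is visible from user space in the raw status that the wait family returns. The sketch below is an illustration only; the helper name raw_wait_status is hypothetical. It enters exit_group directly through syscall() in a forked child and returns the status the parent observes.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Raw wait status for a child that called exit_group(code), or -1 on error. */
static int raw_wait_status(int code)
{
    int status = 0;
    pid_t pid;

    assert(code >= 0);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        syscall(SYS_exit_group, code);  /* bypasses the C library's exit() */
        _exit(127);                     /* not reached */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Only the low 8 bits of the code survive: raw_wait_status(0x1234) yields 0x34 << 8, and WEXITSTATUS simply shifts the value back down.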

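do_group_exit flow

The do_group_exit() function kills all processes belonging to the thread group of current. It takes the process termination code as its argument, which may be a value specified by the exit_group() system call or an error code supplied by the kernel.

The function performs the following steps:

1. Check whether the SIGNAL_GROUP_EXIT flag of the exiting process is nonzero. If it is, the kernel has already started the exit procedure for this thread group. In that case, use the value stored in current->signal->group_exit_code as the exit code and jump to step 4.
2. Otherwise, set the SIGNAL_GROUP_EXIT flag of the process and store the termination code in the current->signal->group_exit_code field.
3. Call zap_other_threads() to kill the other processes in the thread group of current. To do so, the function scans the PID list of the PIDTYPE_TGID hash table corresponding to current->tgid and sends SIGKILL to every process in the list other than current. As a result, all of these processes eventually execute do_exit() and are killed.
4. Call do_exit(), passing it the process termination code. As we will see below, do_exit() kills the process and never returns.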

/*
 * Take down every thread in the group.  This is called by fatal signals
 * as well as by sys_exit_group (below).
 */
void
do_group_exit(int exit_code)
{
    struct signal_struct *sig = current->signal;

    BUG_ON(exit_code & 0x80); /* core dumps don't get here */
    /*
        Check whether the SIGNAL_GROUP_EXIT flag in current->signal->flags is set,
        or whether current->signal->group_exit_task is non-NULL
    */
    if (signal_group_exit(sig))
        exit_code = sig->group_exit_code;   /*  group_exit_code holds the thread group's termination code  */
    else if (!thread_group_empty(current)) {    /*  check whether the thread group list is non-empty  */
        struct sighand_struct *const sighand = current->sighand;

        spin_lock_irq(&sighand->siglock);
        if (signal_group_exit(sig))
            /* Another thread got here before we took the lock.  */
            exit_code = sig->group_exit_code;
        else {
            sig->group_exit_code = exit_code;
            sig->flags = SIGNAL_GROUP_EXIT;
            zap_other_threads(current);     /*  walk the thread group list and kill every other thread in it  */
        }
        spin_unlock_irq(&sighand->siglock);
    }

    do_exit(exit_code);
    /* NOTREACHED */
}

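do_exit flow

All the work needed to terminate a process is handled by the do_exit function.

The function is defined in kernel/exit.c.

Invoking the handlers of the task_exit_nb notifier-chain instances

profile_task_exit(tsk);

This function is defined in kernel/profile.c: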

void profile_task_exit(struct task_struct *task)
{
    blocking_notifier_call_chain(&task_exit_notifier, 0, task);
}

This raises the task_exit_notifier notification, which in turn invokes the registered handlers.

task_exit_notifier is defined as follows:

//  http://lxr.free-electrons.com/source/kernel/profile.c?v=4.6#L134
static BLOCKING_NOTIFIER_HEAD(task_exit_notifier);


// http://lxr.free-electrons.com/source/include/linux/notifier.h?v=4.6#L111
#define BLOCKING_NOTIFIER_INIT(name) {                      \
                .rwsem = __RWSEM_INITIALIZER((name).rwsem),     \
                .head = NULL }

// http://lxr.free-electrons.com/source/include/linux/rwsem.h?v4.6#L74
#define __RWSEM_INITIALIZER(name)                               \
        { .count = RWSEM_UNLOCKED_VALUE,                        \
          .wait_list = LIST_HEAD_INIT((name).wait_list),        \
          .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
          __RWSEM_OPT_INIT(name)                                \
          __RWSEM_DEP_MAP_INIT(name) }
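To make the notifier-chain idea concrete, here is a simplified user-space model. It is a sketch, not kernel code: every demo_ name is hypothetical, the read-write semaphore that protects a blocking chain is omitted, and the kernel's handling of handler return values (such as stopping the walk early) is left out. It shows only the core mechanism, a priority-ordered linked list of callbacks that is walked when the event fires.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct notifier_block. */
struct demo_notifier_block {
    int (*notifier_call)(struct demo_notifier_block *nb,
                         unsigned long action, void *data);
    struct demo_notifier_block *next;
    int priority;
};

struct demo_notifier_head {
    struct demo_notifier_block *head;
};

/* Insert nb so that the list stays sorted by descending priority. */
static void demo_chain_register(struct demo_notifier_head *nh,
                                struct demo_notifier_block *nb)
{
    struct demo_notifier_block **link = &nh->head;

    assert(nb != NULL && nb->notifier_call != NULL);
    while (*link != NULL && (*link)->priority >= nb->priority)
        link = &(*link)->next;
    nb->next = *link;
    *link = nb;
}

/* Call every registered handler in list order; returns how many ran. */
static int demo_call_chain(struct demo_notifier_head *nh,
                           unsigned long action, void *data)
{
    struct demo_notifier_block *nb;
    int calls = 0;

    for (nb = nh->head; nb != NULL; nb = nb->next) {
        nb->notifier_call(nb, action, data);
        calls++;
    }
    return calls;
}

/* Example handler: counts events in the int that data points to. */
static int demo_on_task_exit(struct demo_notifier_block *nb,
                             unsigned long action, void *data)
{
    (void)nb;
    (void)action;
    ++*(int *)data;
    return 0;
}
```

In this model, profile_task_exit corresponds to a single demo_call_chain call on the chain head, with the exiting task passed as data.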

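Checking that the process's blk_plug is empty

This makes sure that the plug field in task_struct is NULL, or that the queues the plug field points to are empty. The plug field is used for stack plugging.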

//  http://lxr.free-electrons.com/source/include/linux/blkdev.h?v=4.6#L1095
WARN_ON(blk_needs_flush_plug(tsk));

blk_needs_flush_plug is defined in include/linux/blkdev.h as follows:

static inline bool blk_needs_flush_plug(struct task_struct *tsk)
{
    struct blk_plug *plug = tsk->plug;

    return plug &&
        (!list_empty(&plug->list) ||
        !list_empty(&plug->mq_list) ||
        !list_empty(&plug->cb_list));
}

OOPS messages

do_exit must not be executed in interrupt context, and it must not terminate the process with PID 0 (the idle task).

if (unlikely(in_interrupt()))
    panic("Aiee, killing interrupt handler!");
if (unlikely(!tsk->pid))
    panic("Attempted to kill the idle task!");

Setting the upper limit of the virtual addresses the process may use (user space)

/*
 * If do_exit is called because this processes oopsed, it's possible
 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
 * continuing. Amongst other possible reasons, this is to prevent
 * mm_release()->clear_child_tid() from writing to a user-controlled
 * kernel address.
 *
 * Sets the upper limit of the virtual addresses the process may use (user space)
 * http://lxr.free-electrons.com/ident?v=4.6;i=set_fs
 */
set_fs(USER_DS);

This code is architecture specific. It is defined in arch/<architecture>/include/asm/uaccess.h:

Architecture	Definition
arm	arch/arm/include/asm/uaccess.h, line 99
arm64	arch/arm64/include/asm/uaccess.h, line 66
x86	arch/x86/include/asm/uaccess.h, line 32
generic	include/asm-generic/uaccess.h, line 28

The arm64 definition is as follows:

static inline void set_fs(mm_segment_t fs)
{
    current_thread_info()->addr_limit = fs;

    /*
     * Enable/disable UAO so that copy_to_user() etc can access
     * kernel memory with the unprivileged instructions.
    */
    if (IS_ENABLED(CONFIG_ARM64_UAO) && fs == KERNEL_DS)
        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(1), ARM64_HAS_UAO));
    else
        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(0), ARM64_HAS_UAO,
        CONFIG_ARM64_UAO));
}

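Checking and setting the process's PF_EXITING flag

First the PF_EXITING flag is checked. This flag indicates that the process is exiting.

If the flag is already set, the PF_EXITPIDONE flag is set as well, the process state is set to the uninterruptible state TASK_UNINTERRUPTIBLE, and the scheduler is invoked once.

    /*  The PF_EXITING flag in current->flags means the process is being torn down  */
    if (unlikely(tsk->flags & PF_EXITING)) {  /*  true if PF_EXITING is already set  */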
        pr_alert("Fixing recursive fault but reboot is needed!\n");
        /*
         * We can do this unlocked here. The futex code uses
         * this flag just to verify whether the pi state
         * cleanup has been done or not. In the worst case it
         * loops once more. We pretend that the cleanup was
         * done as there is no way to return. Either the
         * OWNER_DIED bit is set by now or we push the blocked
         * task into the wait for ever nirwana as well.
         */
        /*  set the PF_EXITPIDONE flag on the process  */
        tsk->flags |= PF_EXITPIDONE;
        /*  put the process into the uninterruptible wait state  */
        set_current_state(TASK_UNINTERRUPTIBLE);
        /*  schedule another process  */
        schedule();
    }

If the flag is not yet set, exit_signals sets it:

    /*
        tsk->flags |= PF_EXITING;
        http://lxr.free-electrons.com/source/kernel/signal.c#L2383
    */
    exit_signals(tsk);  /* sets PF_EXITING in tsk->flags */

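Memory barrier

    /*
     * tsk->flags are checked in the futex code to protect against
     * an exiting task cleaning up the robust pi futexes.
     */
    /*  memory barrier: ensures that the operations before it have completed before the operations after it start  */
    smp_mb();
    /*  wait until any current holder of tsk->pi_lock has released it (the lock itself is not taken)  */
    raw_spin_unlock_wait(&tsk->pi_lock);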

Synchronizing the rss_stat of the process's mm

    /* sync mm's RSS info before statistics gathering */
    if (tsk->mm)
        sync_mm_rss(tsk->mm);

Reading the current->mm->rss_stat.count[member] counters

    /*
        acct_update_integrals - update mm integral fields in task_struct
        Updates the process's accounted run time and reads the current->mm->rss_stat.count[member] counters
        http://lxr.free-electrons.com/source/kernel/tsacct.c?v=4.6#L152
    */
    acct_update_integrals(tsk);

The function is implemented as follows; see http://lxr.free-electrons.com/source/kernel/tsacct.c?v=4.6#L156

void acct_update_integrals(struct task_struct *tsk)
{
    cputime_t utime, stime;
    unsigned long flags;

    local_irq_save(flags);
    task_cputime(tsk, &utime, &stime);
    __acct_update_integrals(tsk, utime, stime);
    local_irq_restore(flags);
}

Here task_cputime obtains the CPU time of the process.

__acct_update_integrals is defined as follows:

See http://lxr.free-electrons.com/source/kernel/tsacct.c#L125

static void __acct_update_integrals(struct task_struct *tsk,
                    cputime_t utime, cputime_t stime)
{
    cputime_t time, dtime;
    u64 delta;

    if (!likely(tsk->mm))
        return;

    time = stime + utime;
    dtime = time - tsk->acct_timexpd;
    /* Avoid division: cputime_t is often in nanoseconds already. */
    delta = cputime_to_nsecs(dtime);

    if (delta < TICK_NSEC)
        return;

    tsk->acct_timexpd = time;
    /*
     * Divide by 1024 to avoid overflow, and to avoid division.
     * The final unit reported to userspace is Mbyte-usecs,
     * the rest of the math is done in xacct_add_tsk.
     */
    tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
    tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
}

Cancelling timers

    group_dead = atomic_dec_and_test(&tsk->signal->live);
    if (group_dead) {
        hrtimer_cancel(&tsk->signal->real_timer);
        exit_itimers(tsk->signal);
        if (tsk->mm)
            setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
    }

Collecting process accounting information

    acct_collect(code, group_dead);

Auditing

    if (group_dead)
        tty_audit_exit();   //  record the audit event
    audit_free(tsk);    //  free the struct audit_context

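Releasing the resources held by the process

Releasing the memory region descriptors and page tables

    /*  Release the address space:
    drop the process's reference to its mm; if no other process is using that mm, free it.
     */
    exit_mm(tsk);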

Writing out process accounting information

    if (group_dead)
        acct_process();
    trace_sched_process_exit(tsk);

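Releasing user-space "semaphores"

exit_sem(tsk);   /*  release the user-space "semaphores" (System V IPC semaphores)  */

This walks the current->sysvsem.undo_list list and undoes the traces of the operations the process performed on each IPC semaphore.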
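The effect of this undo processing can be observed from user space with the SEM_UNDO flag of semop. The sketch below is an illustration under assumptions: it requires System V IPC to be available, and the names demo_semun and semval_after_child_exit are hypothetical. A child takes a semaphore (value 1 to 0) and exits without releasing it; the parent then reads the value back.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* semctl() expects the caller to define this union; the name is ours. */
union demo_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Semaphore value seen by the parent after the child exited, or -1 on error. */
static int semval_after_child_exit(int use_undo)
{
    union demo_semun arg;
    int id, val;
    pid_t pid;

    assert(use_undo == 0 || use_undo == 1);
    id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    arg.val = 1;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        struct sembuf op;

        op.sem_num = 0;
        op.sem_op = -1;                         /* take the semaphore */
        op.sem_flg = use_undo ? SEM_UNDO : 0;   /* ask the kernel to undo on exit */
        semop(id, &op, 1);
        _exit(0);                               /* exit without releasing it */
    }
    waitpid(pid, NULL, 0);

    val = semctl(id, 0, GETVAL);
    semctl(id, 0, IPC_RMID);                    /* remove the semaphore set */
    return val;
}
```

With SEM_UNDO the parent sees the value restored to 1, because the kernel reverted the child's operation during exit. Without it the semaphore stays at 0.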

Releasing System V shared memory

exit_shm(tsk);  /* clean up the System V shared memory segments created by the process  */

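Releasing file-related resources

exit_files(tsk); /*  release the files the process has open   */
exit_fs(tsk);   /*  release the structures describing the working directory and so on  */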
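That the kernel closes a process's open files on exit, even when the process never calls close, can be seen with a pipe. The sketch below is an illustration only; the helper name read_after_child_exit is hypothetical. The child holds the write end of a pipe and terminates with _exit, and the parent's read then reports end of file.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns what read() reports once the child is gone: 0 means EOF; -2 on setup error. */
static ssize_t read_after_child_exit(void)
{
    int fds[2];
    char byte;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) != 0)
        return -2;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -2;
    }
    if (pid == 0)
        _exit(0);               /* the child never closes fds[1] itself */

    close(fds[1]);              /* drop the parent's own write end */
    n = read(fds[0], &byte, 1); /* returns 0 once no writer is left */
    assert(n <= 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

The read returns 0 because the last remaining write end was released by the kernel as part of the child's exit.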

Detaching from the controlling terminal

    if (group_dead)
        disassociate_ctty(1);

Releasing namespaces

exit_task_namespaces(tsk);  /*  release the namespaces  */
exit_task_work(tsk);

Releasing the thread_struct in task_struct

    exit_thread();

On ARM, for example, this invokes the handlers of all notifier-chain instances on the thread_notify_head list, which deal with struct thread_info.

Releasing the resources used by Performance Events

perf_event_exit_task(tsk);

Detaching the process from its cgroups

cgroup_exit(tsk);

Unregistering hardware breakpoints

   /*
     * FIXME: do that only when needed, using sched_exit tracepoint
     */
    flush_ptrace_hw_breakpoint(tsk);

Updating the parent of all child processes

    exit_notify(tsk, group_dead);

Process events connector (it reports fork, exec and exit events as well as changes to a process's user ID and group ID)

    proc_exit_connector(tsk);

For NUMA: release the memory held by struct mempolicy when its reference count drops to 0

#ifdef CONFIG_NUMA
    task_lock(tsk);
    mpol_put(tsk->mempolicy);
    tsk->mempolicy = NULL;
    task_unlock(tsk);
#endif

Releasing the I/O context of the process

    if (tsk->io_context)
        exit_io_context(tsk);

Releasing the resources associated with the splice_pipe field of the process descriptor

    if (tsk->splice_pipe)
        free_pipe_info(tsk->splice_pipe);

    if (tsk->task_frag.page)
        put_page(tsk->task_frag.page);

Checking how much of the process's kernel stack went unused

    check_stack_usage();

Scheduling another process

    /* causes final put_task_struct in finish_task_switch(). */
    tsk->state = TASK_DEAD;
    tsk->flags |= PF_NOFREEZE;      /* tell freezer to ignore us */
    /*
        Reschedule. Because the process has been marked TASK_DEAD, it will never be scheduled back in, which is how do_exit achieves its goal of never returning.
    */
    schedule();

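Once its state is set to TASK_DEAD, the process enters the zombie state and can never be scheduled again. As far as applications and user space are concerned, the process is dead. Although it can no longer be scheduled, the system still keeps its process descriptor, so that information about the process can still be obtained after it has terminated.

Only after the parent has collected the information about its terminated child is the child's task_struct freed (together with the process's kernel stack).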