[apue] 進程式控制制那些事兒_ZenDei技術網路在線

進程 ID 是唯一的嗎？fork 後子進程記憶體頁會 Copy-On-Write 嗎？vfork 後子進程為何不能使用 return 或 exit？如何在 exec 後保持目錄流打開？解釋器文件首行能支持多於一個參數嗎？切換進程身份時 setuid、setreuid、seteuid 該用哪個？set-... ...

進程標識

在介紹進程的創建、啟動與終止之前，首先瞭解一下進程的唯一標識——進程 ID，它是一個非負整數，在系統範圍內唯一，不過這種唯一是相對的，當一個進程消亡後，它的 ID 可能被重用。不過大多數 Unix 系統實現延遲重用演算法，防止將新進程誤認為是使用同一 ID 的某個已終止的進程，下麵這個例子展示了這一點：

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <set>

int main (int argc, char *argv[])
{
    std::set<pid_t> pids;
    pid_t pid = getpid();
    time_t start = time(NULL);
    pids.insert(pid);
    while (true)
    {
        if ((pid = fork ()) < 0)
        {
            printf ("fork error\n");
            return 1;
        }
        else if (pid == 0)
        {
            printf ("[%u] child running\n", getpid());
            break;
        }
        else
        {
            printf ("fork and exec child %u\n", pid);

            int status = 0;
            pid = wait(&status);
            printf ("wait child %u return %d\n", pid, status);
            if (pids.find (pid) == pids.end())
            {
                pids.insert(pid);
            }
            else
            {
                time_t end = time(NULL);
                printf ("duplicated pid find: %u, total %lu, elapse %lu\n", pid, pids.size(), end-start);
                break;
            }
        }
    }

    exit (0);
}

上面的程式製造了一個進程 ID 復用的場景：父進程不斷創建子進程，將它的進程 ID 保存在 set 容器中，並將每次新創建的 pid 與容器中已有的進行對比，如果發現有重覆的 pid，則列印一條消息退出迴圈，下麵是程式輸出日誌：

> ./pid
fork and exec child 18687
[18687] child running
wait child 18687 return 0
fork and exec child 18688
[18688] child running
wait child 18688 return 0
fork and exec child 18689
...
wait child 18683 return 0
fork and exec child 18684
[18684] child running
wait child 18684 return 0
fork and exec child 18685
[18685] child running
wait child 18685 return 0
fork and exec child 18687
[18687] child running
wait child 18687 return 0
duplicated pid find: 18687, total 31930, elapse 8

在大約創建了 3W 個進程後，進程 ID 終於復用了，整個耗時大約 8 秒左右，可見在頻繁創建進程的場景中，進程 ID 被覆用的間隔還是很快的，如果依賴進程 ID 的唯一性做一些記錄的話，還是要小心，例如使用進程 ID 做為文件名，最好是加上時間戳等其它維度以確保唯一性。

另外一個有趣的現象是，進程 ID 重覆時，剛好是第一個子進程的進程 ID，看起來這個進程 ID 分配是個周而複始的過程，在漲到一定數量後會回捲，追蹤中間的日誌，發現有以下輸出：

...
[32765] child running
wait child 32765 return 0
fork and exec child 32766
[32766] child running
wait child 32766 return 0
fork and exec child 32767
[32767] child running
wait child 32767 return 0
fork and exec child 300
[300] child running
wait child 300 return 0
fork and exec child 313
[313] child running
wait child 313 return 0
fork and exec child 314
[314] child running
wait child 314 return 0
...

看起來最大達到 32767 (SHORT_MAX) 後就開始回捲了，這比我想象的要早，畢竟 pid_t 類型為 4 位元組整型：

sizeof (pid_t) = 4

最大可以達到 2147483647，這也許是出於某種向後相容考慮吧。在 macOS 上這個過程略長一些：

> ./pid
fork and exec child 12629
[12629] child running
wait child 12629 return 0
fork and exec child 12630
[12630] child running
wait child 12630 return 0
fork and exec child 12631
[12631] child running
wait child 12631 return 0
...
[12626] child running
wait child 12626 return 0
fork and exec child 12627
[12627] child running
wait child 12627 return 0
fork and exec child 12629
[12629] child running
wait child 12629 return 0
duplicated pid find: 12629, total 99420, elapse 69

總共產生了不到 10W 個 pid，歷時一分多鐘，看起來要比 linux 做的好一點。查看中間日誌，pid 也發生了回捲：

...
fork and exec child 99996
[99996] child running
wait child 99996 return 0
fork and exec child 99997
[99997] child running
wait child 99997 return 0
fork and exec child 99998
[99998] child running
wait child 99998 return 0
fork and exec child 100
[100] child running
wait child 100 return 0
fork and exec child 102
[102] child running
wait child 102 return 0
fork and exec child 103
[103] child running
wait child 103 return 0
...

回捲點是 99999，emmmm 有意思，會不會是喬布斯定的，哈哈。

雖然進程 ID 的合法範圍是 [0~INT_MAX]，但實際上前幾個進程 ID 會被系統占用：

0: swapper 進程 (調度)
1: init 進程 (用戶態)
…

其中 ID=0 的通常是調度進程，也稱為交換進程，是內核的一部分，並不執行任何磁碟上的程式，因此也被稱為系統進程；ID=1 的通常是 init 進程，在自舉過程結束時由內核調用，該程式的程式文件在 UNIX 早期版本中是 /sbin/init，不過在我的測試機 CentOS 上是 /usr/lib/systemd/systemd：

> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Oct24 ?        00:00:19 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
root         2     0  0 Oct24 ?        00:00:00 [kthreadd]
root         4     2  0 Oct24 ?        00:00:00 [kworker/0:0H]
root         6     2  0 Oct24 ?        00:00:01 [ksoftirqd/0]
root         7     2  0 Oct24 ?        00:00:01 [migration/0]
root         8     2  0 Oct24 ?        00:00:00 [rcu_bh]
...

查看文件系統：

> ls -lh /sbin/init
lrwxrwxrwx. 1 root root 22 Sep  7  2022 /sbin/init -> ../lib/systemd/systemd

就是個軟鏈接，其實是一回事。在 macOS 又略有不同，

> ps -ef
  UID   PID  PPID   C STIME   TTY           TIME CMD
    0     1     0   0  3:34PM ??         0:15.45 /sbin/launchd
    0    74     1   0  3:34PM ??         0:00.89 /usr/sbin/syslogd
    0    75     1   0  3:34PM ??         0:01.42 /usr/libexec/UserEventAgent (System)
...

為 launched。這裡將進程 ID=1 的統稱為 init 進程，它通常讀取與系統有關的初始化文件，並將系統引導到一個狀態 (e.g. 多用戶)，且不會終止，雖然運行在用戶態，但具有超級用戶許可權。在孤兒進程場景下，它負責做預設的父進程，關於這一點可以參考後面 "進程終止" 一節。正因為進程 ID 0 永遠不可能分配給用戶進程，所以它可以用作介面的臨界值，就如上面例子中 fork 所做的那樣，關於 fork 的詳細說明可以參考後面 "進程創建" 一節。

各種進程 ID 通過下麵的介面返回：

#include <sys/types.h>
#include <unistd.h>

pid_t getpid(void);     // process ID
pid_t getppid(void);    // parent process ID
uid_t getuid(void);     // user ID
uid_t geteuid(void);    // effect user ID
gid_t getgid(void);     // group ID
gid_t getegid(void);    // effect group ID

各個介面返回的 ID 已在註釋中說明。進程是動態的程式文件、文件又由進程生成，而它們都受系統中用戶和組的轄制，用戶態進程必然屬於某個用戶和組，就像文件一樣，關於這一點，可以參考這篇《[apue] linux 文件訪問許可權那些事兒》。再說深入一點，用戶 ID、組 ID 標識的是執行進程的用戶；有效用戶 ID、有效組 ID 則標識了進程程式文件通過 set-user-id、set-group-id 標識指定的用戶，一般用作許可權"後門"；還有 saved-user-id、saved-group-id，則用來恢復更改 uid、gid 之前的身份。關於兩組三種 ID 之間的關係、相互如何轉換及這樣做的目的，可以參考後面 "更改用戶 ID 和組 ID" 一節。

進程創建

Unix 系統的進程主要依賴 fork 創建：

#include <unistd.h>
pid_t fork(void);

fork 本意為分叉，像一條路突然分開變成兩條一樣，調用 fork 後會裂變出兩個進程，新進程具有和原進程完全相同的環境，包括執行堆棧。即在調用 fork 處會產生兩次返回，一次是在父進程，一次是在子進程。

但是父、子進程的返回值卻大不相同，父進程返回的是成功創建的子進程 ID；子進程返回的是 0。通過上一節對進程 ID 的說明，0 是一個合法但不會分配給用戶進程的 ID，這裡作為區分父子進程的關鍵，從而執行不同的邏輯。父進程 fork 返回子進程的 pid 也是必要的，因為目前沒有一種介面可以返回父進程所有的子進程 ID，通過 fork 返回值父進程就可以得到子進程的 ID；而反過來，子進程可以通過 get_ppid 輕鬆獲取父進程 ID，所以不需要在 fork 處返回，且為了區別於父進程中的 fork 返回，這裡有必要返回 0 來快速標識自己是子進程 (通過記錄父進程 ID 等辦法也可以標識，但是明顯不如這種來得簡潔)。

int pid = fork(); 
if (pid < 0)
{
    // error
    exit(1); 
}
else if (pid == 0)
{
    // child
    printf ("%d spawn from %d\n", getpid(), getppid()); 
}
else 
{
    // parent
    sleep (1); 
    printf ("%d create %d\n", getpid(), pid); 
}

新建的子進程具有和父進程完全相同的數據空間、堆、棧，但這不意味著與父進程共用，除代碼段這種只讀的區域，其他的都可以理解為父進程的副本：

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int g_count = 1;
int main()
{
    int v_count = 42;
    static int s_count = 1024;
    int* h_count = (int*)malloc (sizeof (int)); 
    *h_count = 36; 
    
    int pid = fork();
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        g_count ++;
        v_count ++;
        s_count ++;
        (*h_count) ++; 
        printf ("%d spawn from %d\n", getpid(), getppid());
    }
    else
    {
        // parent
        sleep (1);
        printf ("%d create %d\n", getpid(), pid);
    }

    printf ("%d: global %d, local %d, static %d, heap %d\n", getpid(), g_count, v_count, s_count, *h_count);
    return 0;
}

這個例子就很說明問題，運行得到下麵的輸出：

$ ./forkit
18270 spawn from 18269
18270: global 2, local 43, static 1025, heap 37
18269 create 18270
18269: global 1, local 42, static 1024, heap 36

子進程修改全局、局部、靜態、堆變數對父進程不可見，父、子進程是相互隔離的，子進程一般會在 fork 之後調用函數族來將進程空間替換為新的程式文件。這就是 exec 函數族，它們會把當前進程內容替換為磁碟上的程式文件並執行新程式的代碼段，和 fork 是一對好搭檔。關於 exec 函數族的更多內容，請參考後面 "exec" 一節。

對於習慣在 Windows 上創建進程的用戶來說，CreateProcess 還是更容易理解一些，它直接把 fork + exec 的工作都包攬了，完全不知道還有複製進程這種騷操作。那 Unix 為什麼要繞這樣一個大彎呢？這是由於如果想在執行新程式文件之前，對進程屬性做一些設置，則必需在 fork 之後、exec 之前進行處理，例如 I/O 重定向、設置用戶 ID 和組 ID、信號安排等等，而封裝成一整個的 CretaeProcess 對此是無能為力的，只能將這些代碼安排在新程式的開頭才行，而有時新進程的代碼是不受我們控制的，對此就無能為力了。

Unix 有沒有類似 CreateProcess 這樣的東西呢，也有，而且是在 POSIX 標準層面定義的：

#include <spawn.h>

int posix_spawn(pid_t *restrict pid, const char *restrict path,
       const posix_spawn_file_actions_t *file_actions,
       const posix_spawnattr_t *restrict attrp,
       char *const argv[restrict], char *const envp[restrict]);
int posix_spawnp(pid_t *restrict pid, const char *restrict file,
       const posix_spawn_file_actions_t *file_actions,
       const posix_spawnattr_t *restrict attrp,
       char *const argv[restrict], char * const envp[restrict]);

這就是 posix_spawn 和 posix_spawnp，兩者的參數完全相同，區別僅在於路徑參數是絕對路徑 (path) 還是帶搜索能力的相對路徑 (file)。不過這個介面無意取代 fork + exec，僅用來支持對存儲管理缺少硬體支持的系統，這種系統通常難以有效的實現 fork。

有的人認為基於 fork+exec 的 posix_spawn 不如 CreateProcess 性能好，畢竟要複製父進程一大堆東西，而大部分對新進程又無用。實際上 Unix 採取了兩個策略，導致 fork+exec 也不是那麼低效，通常情況下都能媲美 CreateProcess。這些策略分別是寫時複製 (COW：Copy-On-Write) 與 vfork。

COW

fork 之後並不執行一個父進程數據段、棧、堆的完全複製，作為替代，這些區域由父、子進程共用，並且內核將它們的訪問許可權標記為只讀。如果父、子進程中的任一個試圖修改這些區域，則內核只為修改區域的那塊記憶體製作一個副本，通常是虛擬存儲器系統中的一頁。在更深入的說明這個技術之前，先來看看 Linux 是如何將虛擬地址轉換為物理地址的：

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

unsigned long virtual2physical(void* ptr)
{
    unsigned long vaddr = (unsigned long)ptr;
    int pageSize = getpagesize();
    unsigned long v_pageIndex = vaddr / pageSize;
    unsigned long v_offset = v_pageIndex * sizeof(uint64_t);
    unsigned long page_offset = vaddr % pageSize;
    uint64_t item = 0;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if(fd == -1)
    {
        printf("open /proc/self/pagemap error\n");
        return NULL;
    }

    if(lseek(fd, v_offset, SEEK_SET) == -1)
    {
        printf("sleek error\n");
        return NULL;
    }

    if(read(fd, &item, sizeof(uint64_t)) != sizeof(uint64_t))
    {
        printf("read item error\n");
        return NULL;
    }

    if((((uint64_t)1 << 63) & item) == 0)
    {
        printf("page present is 0\n");
        return NULL;
    }

    uint64_t phy_pageIndex = (((uint64_t)1 << 55) - 1) & item;
    return (unsigned long)((phy_pageIndex * pageSize) + page_offset);
}

這段代碼可以在用戶空間將一個虛擬記憶體地址轉換為一個物理地址，具體原理就不介紹了，感興趣的請參考附錄 2。用它做個小測試：

void test_ptr(void *ptr, char const* prompt)
{
   uint64_t addr = virtual2physical(ptr);
   printf ("%s: virtual: 0x%x, physical: 0x%x\n", prompt, ptr, addr);
}

int g_val1=0;
int g_val2=1;

int main(void) {
    test_ptr(&g_val1, "global1");
    test_ptr(&g_val2, "global2");

    int l_val3=2;
    int l_val4=3;
    test_ptr(&l_val3, "local1");
    test_ptr(&l_val4, "local2");

    static int s_val5=4;
    static int s_val6=5;
    test_ptr(&s_val5, "static1");
    test_ptr(&s_val6, "static2");

    int *h_val7=(int*)malloc(sizeof(int));
    int *h_val8=(int*)malloc(sizeof(int));
    test_ptr(h_val7, "heap1");
    test_ptr(h_val8, "heap2");
    free(h_val7);
    free(h_val8);
    return 0;
};

測試種類還是比較豐富的，有局部變數、靜態變數、全局變數和堆上分配的變數。在 CentOS 上有以下輸出：

> sudo ./memtrans
global1: virtual: 0x60107c, physical: 0x8652f07c
global2: virtual: 0x60106c, physical: 0x8652f06c
local1: virtual: 0x9950ff2c, physical: 0xfb1df2c
local2: virtual: 0x9950ff28, physical: 0xfb1df28
static1: virtual: 0x601070, physical: 0x8652f070
static2: virtual: 0x601074, physical: 0x8652f074
heap1: virtual: 0xc3e010, physical: 0xb7ebe010
heap2: virtual: 0xc3e030, physical: 0xb7ebe030

發現以下幾點：

同類型的變數虛擬、物理地址相差不大
靜態和全局變數虛擬地址相近、物理地址也相近，很可能是分配在同一個頁上了
局部、全局、堆上的變數虛擬地址相差較大、物理地址也相差較大，應該是分配在不同的頁上了

必需使用超級用戶許可權執行這段程式，否則看起來不是那麼正確：

> ./memtrans
global1: virtual: 0x60107c, physical: 0x7c
global2: virtual: 0x60106c, physical: 0x6c
local1: virtual: 0x6a433e2c, physical: 0xe2c
local2: virtual: 0x6a433e28, physical: 0xe28
static1: virtual: 0x601070, physical: 0x70
static2: virtual: 0x601074, physical: 0x74
heap1: virtual: 0x116b010, physical: 0x10
heap2: virtual: 0x116b030, physical: 0x30

雖然 virtual2physical 沒有報錯，但是一眼看上去這個結果就是有問題的。能將虛擬地址轉化為物理地址後，就可以拿它在 fork 的場景中做個驗證：

int g_count = 1;
int main()
{
    int v_count = 42;
    static int s_count = 1024;
    int* h_count = (int*)malloc (sizeof (int));
    *h_count = 36;
    printf ("%d: global ptr 0x%x:0x%x, local ptr 0x%x:0x%x, static ptr 0x%x:0x%x, heap ptr 0x%x:0x%x\n", getpid(),
            &g_count, virtual2physical(&g_count),
            &v_count, virtual2physical(&v_count),
            &s_count, virtual2physical(&s_count),
            h_count, virtual2physical(h_count));

    int pid = fork();
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        printf ("%d spawn from %d\n", getpid(), getppid());
#if 0
        g_count ++;
        v_count ++;
        s_count ++;
        (*h_count) ++;
#endif
    }
    else
    {
        // parent
        sleep (1);
        printf ("%d create %d\n", getpid(), pid);
    }

    printf ("%d: global %d, local %d, static %d, heap %d\n", getpid(), g_count, v_count, s_count, *h_count);
    printf ("%d: global ptr 0x%x:0x%x, local ptr 0x%x:0x%x, static ptr 0x%x:0x%x, heap ptr 0x%x:0x%x\n", getpid(),
            &g_count, virtual2physical(&g_count),
            &v_count, virtual2physical(&v_count),
            &s_count, virtual2physical(&s_count),
            h_count, virtual2physical(h_count));

    return 0;
}

增加了對虛擬、物理地址的列印，並屏蔽了子進程對變數的修改，先看看父、子進程是否共用了記憶體頁：

> sudo ./forkit
19216: global ptr 0x60208c:0x5769308c, local ptr 0x22c50040:0xf4fe2040, static ptr 0x602090:0x57693090, heap ptr 0x1e71010:0x89924010
19217 spawn from 19216
19217: global 1, local 42, static 1024, heap 36
19217: global ptr 0x60208c:0x5769308c, local ptr 0x22c50040:0xf4fe2040, static ptr 0x602090:0x57693090, heap ptr 0x1e71010:0x89924010
19216 create 19217
19216: global 1, local 42, static 1024, heap 36
19216: global ptr 0x60208c:0x412f308c, local ptr 0x22c50040:0xea994040, static ptr 0x602090:0x412f3090, heap ptr 0x1e71010:0x89924010

發現以下現象：

所有變數虛擬地址不變
僅堆變數的物理地址不變
子進程所有變數的物理地址不變，父進程局部、靜態、全局變數的物理地址發生了改變

從現象可以得到以下結論：

COW 生效，否則堆變數的物理地址不可能不變
局部、靜態、全局變數的物理地址發生改變很可能是因為該頁上有其它數據發生了變更需要複製
率先複製的那一方物理地址會發生變更

下麵再看下子進程修改變數的情況：

> sudo ./forkit
23182: global ptr 0x60208c:0x1037008c, local ptr 0x677e8540:0xe65b6540, static ptr 0x602090:0x10370090, heap ptr 0x252d010:0x9fb3d010
23183 spawn from 23182
23183: global 2, local 43, static 1025, heap 37
23183: global ptr 0x60208c:0x1037008c, local ptr 0x677e8540:0xe65b6540, static ptr 0x602090:0x10370090, heap ptr 0x252d010:0x6dafb010
23182 create 23183
23182: global 1, local 42, static 1024, heap 36
23182: global ptr 0x60208c:0xf045708c, local ptr 0x677e8540:0x5bc6f540, static ptr 0x602090:0xf0457090, heap ptr 0x252d010:0x9fb3d010

這下所有變數的物理地址都改變了，進一步驗證了 COW 的介入，特別是子進程堆變數物理地址改變 (0x6dafb010) 而父進程的沒有改變 (0x9fb3d010)，說明系統確實為修改頁的一方分配了新的頁。另一方面，子進程修改了局部、靜態、全局變數而物理地址沒有發生改變，則說明當頁不再標記為共用後，子進程再修改這些頁也不會為它重新分配頁了。最後父進程沒有修改局部、靜態、全局變數而物理地址發生了變化，一定是這些變數所在頁的其它部分被修改導致的，且這些修改發生在用戶修改這些變數之前，即 fork 內部。

vfork

另外一種提高 fork 性能的方法是 vfork：

#include <unistd.h>
pid_t vfork(void);

它的聲明與 fork 完全一致，用法也差不多，但是卻有以下根本不同：

父、子進程並不進行任何數據段、棧、堆的複製，連 COW 都沒有，完全是共用同樣的記憶體空間
父進程只有在子進程調用 exec 或 exit 之後才能繼續運行

vfork 是面向 fork+exec 使用場景的優化，所以在 exec (或 exit) 之前，子進程就是在父進程的地址空間運行的。而為了避免父、子進程訪問同一個記憶體頁導致的競爭問題，父進程在此期間會被短暫掛起，預期子進程會立刻調用 exec，所以這個延遲還是可以接受的。修改上面的 forkit 代碼：

#if 0
    int pid = fork();
#else
    int pid = vfork();
#endif

使用 vfork 代替 fork，再來觀察結果有何不同：

> sudo ./forkit
15421: global ptr 0x60208c:0x9f6d608c, local ptr 0x91d548c0:0xa98148c0, static ptr 0x602090:0x9f6d6090, heap ptr 0x1cc1010:0xf3a5c010
15422 spawn from 15421
15422: global 2, local 43, static 1025, heap 37
15422: global ptr 0x60208c:0x9f6d608c, local ptr 0x91d548c0:0xa98148c0, static ptr 0x602090:0x9f6d6090, heap ptr 0x1cc1010:0xf3a5c010
15421 create 15422
Segmentation fault

子進程運行正常而父進程在 fork 返回後崩潰了，打開 gdb 掛上 coredmp 文件查看：

> sudo gdb ./forkit --core=core.15421
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /ext/code/apue/08.chapter/forkit...done.
[New LWP 15421]
Core was generated by `./forkit'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000400ace in main () at forkit.c:90
90	    printf ("%d: global %d, local %d, static %d, heap %d\n", getpid(), g_count, v_count, s_count, *h_count);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) i lo
v_count = 43
s_count = 1025
h_count = 0x0
pid = 15422
(gdb)

因為生成的 core 文件具有 root 許可權，所以這裡也使用 sudo 提權。列印本地變數查看，發現 h_count 指針為空了，導致 printf 崩潰。再看 vfork 的使用說明，發現有下麵這麼一段：

       vfork()  differs  from  fork(2)  in  that  the  calling  thread is suspended until the child terminates (either normally, by calling
       _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2).  Until that point, the child shares all
       memory  with  its  parent,  including  the stack.  The child must not return from the current function or call exit(3), but may call
       _exit(2).

大意是說因 vfork 後子進程甚至會共用父進程執行堆棧，所以子進程不能通過 return 和 exit 退出，只能通過 _exit。嘖嘖，一不小心就踩了坑，修改代碼如下：

    
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        printf ("%d spawn from %d\n", getpid(), getppid());
#if 1
        g_count ++;
        v_count ++;
        s_count ++;
        (*h_count) ++;
#endif
        printf ("%d: global %d, local %d, static %d, heap %d\n", getpid(), g_count, v_count, s_count, *h_count);
        printf ("%d: global ptr 0x%x:0x%x, local ptr 0x%x:0x%x, static ptr 0x%x:0x%x, heap ptr 0x%x:0x%x\n", getpid(),
                &g_count, virtual2physical(&g_count),
                &v_count, virtual2physical(&v_count),
                &s_count, virtual2physical(&s_count),
                h_count, virtual2physical(h_count));

        _exit(0);
    }
    else
    {
        // parent
        // sleep (1);
        printf ("%d create %d\n", getpid(), pid);
    }

主要修改點如下：

列印語句複製一份到子進程
子進程通過 _exit 退出
父進程去除 sleep 調用

再次編譯運行：

> sudo ./forkit
22831: global ptr 0x60208c:0xde9ee08c, local ptr 0x9c8a3ac0:0x2661dac0, static ptr 0x602090:0xde9ee090, heap ptr 0x1a90010:0x88797010
22832 spawn from 22831
22832: global 2, local 43, static 1025, heap 37
22832: global ptr 0x60208c:0xde9ee08c, local ptr 0x9c8a3ac0:0x2661dac0, static ptr 0x602090:0xde9ee090, heap ptr 0x1a90010:0x88797010
22831 create 22832
22831: global 2, local 43, static 1025, heap 37
22831: global ptr 0x60208c:0xde9ee08c, local ptr 0x9c8a3ac0:0x2661dac0, static ptr 0x602090:0xde9ee090, heap ptr 0x1a90010:0x88797010

這回不崩潰了，而且可以看到以下有趣的現象：

父進程的所有變數都被子進程修改了
父、子進程的所有變數虛擬、物理地址完全一致

進一步印證了上面的結論。由於 vfork 根本不存在記憶體空間的複製，所以理論上它是性能最高的，畢竟 COW 在底層還是發生了很多記憶體頁複製的。

vfork 這個介面是屬於 SUS 標準的，目前流行的 Unix 都支持，只不過它被標識為了廢棄，使用時需要小心，尤其是處理子進程的退出。

fork + fd

子進程會繼承父進程以下屬性：

打開文件描述符
實際用戶 ID、實際組 ID、有效用戶 ID、有效組 ID
附加組 ID
進程組 ID
會話 ID
控制終端
設置用戶 ID 標誌和設置組 ID 標誌
當前工作目錄
根目錄
文件模式創建屏蔽字
信號屏蔽和安排
打開文件描述符的 close-on-exec 標誌
環境變數
連接的共用存儲段
存儲映射
資源限制
……

以打開文件描述符為例，有如下測試程式：

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    printf ("before fork\n");
    int pid = fork();
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        printf ("%d spawn from %d\n", getpid(), getppid());
    }
    else
    {
        // parent
        sleep (1);
        printf ("%d create %d\n", getpid(), pid);
    }

    printf ("after fork\n");
    return 0;
}

運行程式輸出如下：

> ./forkfd
before fork
7204 spawn from 7203
after fork
7203 create 7204
after fork

before fork 只在父進程輸出一次，符合預期，如果在 main 函數第一行插入以下代碼：

    setvbuf (stdout, NULL, _IOFBF, 0);

將標準輸出設置為全緩衝模式，(關於標準 IO 的緩衝模式，可以參考這篇《[apue] 標準 I/O 庫那些事兒》)，則輸出會發生改變：

> ./forkfd
before fork
6955 spawn from 6954
after fork
before fork
6954 create 6955
after fork

可以看到 before fork 這條語句輸出了兩次，分別在父、子進程各輸出一次，這是由於 stdout 由行緩衝變更為全緩衝後，積累的內容並不隨換行符 flush，從而就會被 fork 複製到子進程，並與子進程生成的信息一起 flush 到控制台，最終輸出兩次。如果仍保持行緩衝模式，還會導致多次輸出嗎？答案是有可能，只要將上面的換行符去掉就可以：

printf ("before fork ");

新的輸出如下：

> ./forkfd
before fork 17736 spawn from 17735
after fork
before fork 17735 create 17736
after fork

原理是一樣的。其實還存在另外的隱式修改標準輸出緩衝方式的辦法：文件重定向，仍以有換行符的版本為例：

> ./forkfd > output.txt
> cat output.txt
before fork
15505 spawn from 15504
after fork
before fork
15504 create 15505
after fork

通過將標準輸出重定向到 output.txt 文件，實現了行緩衝到全緩衝的變化，從而得到了與調用 setvbuf 相同的結果。使用不帶緩衝的 write、或者在 fork 前主動 flush 緩衝，以避免上面的問題。

除了緩存複製，父、子進程共用打開文件描述符的另外一個問題是讀寫競爭，fork 後父、子進程共用文件句柄的情況如下圖 (參考《[apue] 一圖讀懂 unix 文件句柄及文件共用過程》)：

父、子進程共用文件句柄特別像進程內 dup 的情況，此時對於共用的雙方而言，任一進程更新文件偏移量對另一個進程都是可見的，保證了一個進程添加的數據會在另一個進程之後。但如果不做任何同步，它們的數據會相互混合，從而使輸出變得混亂。一般遵循以下慣例來保證父、子進程不會在共用的文件句柄上產生讀寫競爭：

父進程等待子進程完成
父、子進程各自執行不同的程式段 (關閉各自不需要使用的文件描述符)

如果必需使用共用的文件句柄，則需要引入進程間同步機制來解決讀寫衝突，關於這一點，可以參考後續 "父子進程同步" 的文章。

在上一節介紹 vfork 時，瞭解到它是不複製進程空間的，子進程需要保證在退出時使用 _exit 來清理進程，避免 return 語句破壞棧指針。這裡有個疑問，如果使用 exit 代替上例中的 _exit 會如何呢？修改上面的程式進行驗證：

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    setvbuf (stdout, NULL, _IOFBF, 0);
    printf ("before fork\n");
    int pid = vfork();
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        printf ("%d spawn from %d\n", getpid(), getppid());
        exit(0);
    }
    else
    {
        // parent
        printf ("%d create %d\n", getpid(), pid);
    }

    printf ("after fork\n");
    return 0;
}

發現父進程可以正常終止：

> ./forkfd
before fork
25923 spawn from 25922
25922 create 25923
after fork

_exit 是不會做任何清理工作的，所以是安全的；exit 至少會 flush 標準 IO，至於是否關閉它們則沒有標準明確的要求這一點，由各個實現自行決定。如果 exit 關閉了標準 IO，那麼父進程一定無法輸出 after fork 這句，可見 CentOS 上的exit 沒有關閉標準 IO。目前大多數系統的 exit 實現不在這方面給自己找麻煩，畢竟進程結束時系統會自動關閉進程打開的所有文件句柄，在庫中關閉它們，只是增加了開銷而不會帶來任何益處。

apue 原文講，即使 exit 關閉了標準 IO，STDOUT_FILENO 句柄還是可用的，通過 write 仍可以正常輸出，子進程關閉自己的標準 IO 句柄並不影響父進程的那一份，對此進行驗證：

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    printf ("before fork\n");
    char buf[128] = { 0 };
    int pid = vfork();
    if (pid < 0)
    {
        // error
        exit(1);
    }
    else if (pid == 0)
    {
        // child
        printf ("%d spawn from %d\n", getpid(), getppid());
        fclose (stdin);
        fclose (stdout);
        fclose (stderr);
        exit(0);
    }
    else
    {
        // parent
        sprintf (buf, "%d create %d\n", getpid(), pid);
        write (STDOUT_FILENO, buf, strlen(buf));
    }
    
    sprintf (buf, "after fork\n");
    write (STDOUT_FILENO, buf, strlen(buf));
    return 0;
}

主要修改點有三處：

去除標準輸出重定向
在 child exit 前主動關閉標準 IO 庫
在 parent vfork 返回後，使用 write 代替 printf 列印日誌

新的輸出如下：

> ./forkfd
before fork
20910 spawn from 20909
20909 create 20910
after fork

和書上說的一致，看來關閉標準 IO 庫隻影響父進程的 printf 調用，不影響 write 調用。再試試直接關閉文件句柄：

        close (STDIN_FILENO);
        close (STDOUT_FILENO);
        close (STDERR_FILENO);

新的輸出如下：

> ./forkfd
before fork
17462 spawn from 17461
17461 create 17462
after fork

仍然沒有影響！看起來 vfork 子進程雖然沒有複製任何父進程空間的內容，但句柄仍是做了 dup 的，所以關閉子進程的任何句柄，對父進程沒有影響。

標準 IO (stdin/stdout/stderr) 還和文件句柄不同，它們帶有一些額外信息例如緩存等是存儲在堆或棧上的，如果 vfork 後子進程的 exit 關閉了它們，父進程是會受到影響的，這進一步反證了 exit 不會關閉標準 IO。

關於子進程繼承父進程的其它屬性，這裡就不一一驗證了，有興趣的讀者可以自行構造 demo。最後補充一下 fork 後子進程與父進程不同的屬性：

fork 返回值
進程 ID
父進程 ID
子進程的 CPU 時間 (tms_utime / tms_stime / tms_cutime / tms_ustime 均置為 0)
文件鎖不會繼承
未處理的鬧鐘 (alarm) 將被清除
未處理的信號集將設置為空
……

clone

在 fork 的全複製和 vfork 全不複製之間，有沒有一個介面可以自由定製進程哪些信息需要複製？答案是 clone，不過這個是 Linux 特有的：

#include <sched.h>
int clone(int (*fn)(void *), void *child_stack,int flags, void *arg, .../* pid_t *ptid, void *newtls, pid_t *ctid */ );

與 fork 不同，clone 子進程啟動時將運行用戶提供的 fn(arg) ，並且需要用戶提前開闢好棧空間 (child_stack)，而控制各種信息共用就是通過 flags 參數了，下麵列一些主要的控制參數：

CLONE_FILES：是否共用文件句柄
CLONE_FS：是否共用文件系統相關信息，這些信息由 chroot、chdir、umask 指定
CLONE_NEWIPC：是否共用 IPC 命名空間
CLONE_PID：是否共用 PID
CLONE_SIGHAND：是否共用信號處理
CLONE_THREAD：是否共用相同的線程組
CLONE_VFORK：是否在子進程 exit 或 execve 之前掛起父進程
CLONE_VM：是否共用同一地址空間
……

其實 glibc clone 底層依賴的 clone 系統調用 (sys_clone) 介面更接近於 fork 系統調用，glibc 僅僅是在 sys_clone 的子進程返回中調用用戶提供的 fn(arg) 而已。它將 fork 中的各種進程信息是否共用的決定權交給了用戶，所以有更大的靈活性，甚至可以基於 clone 實現用戶態線程庫。上一節中說 vfork 後子進程在退出時可以關閉 STDOUT_FILENO 而不影響父進程，這是因為標準 IO 句柄是經過 vfork dup 的，如果使用 clone 並指定共用父進程的文件句柄 (CLONE_FILES) 會如何？下麵寫個例子進行驗證：

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>

int child_func(void *arg)
{
    // child
    printf ("%d spawned from %d\n", getpid(), getppid());
    return 1;
}

int main()
{
    printf ("before fork\n");

    size_t stack_size = 1024 * 1024;
    char *stack = (char *)malloc (stack_size);
    int pid = clone(child_func, stack+stack_size, CLONE_VM | CLONE_VFORK | SIGCHLD, 0);
    if (pid < 0)
    {
        // error
        exit(1);
    }

    // parent
    printf ("[1] %d create %d\n", getpid(), pid);

    char buf[128] = { 0 };
    sprintf (buf, "[2] %d create %d\n", getpid(), pid);
    write (STDOUT_FILENO, buf, strlen(buf));
    return 0;
}

先演示下不加 CLONE_FILES 的效果：

> ./clonefd
before fork
1271 spawned from 1270
[1] 1270 create 1271
[2] 1270 create 1271

這個和 vfork 效果相同。這裡為了驗證標準 IO 庫被關閉的情況，父進程最後一句日誌使用兩種方法列印，能輸出兩行就證明標準 IO 和底層句柄都沒有被關閉，不同的方法使用首碼數字進行區別。

clone 在這個場景的使用有幾點需要註意：

至少需要為 clone 指定 CLONE_VM 選項，用於父、子進程共用記憶體地址空間
指定的 stack 地址是開闢記憶體地址的末尾，因為棧是向上增長的，剛開始 child 進程一啟動就掛掉，就是這裡沒設置對
指定 CLONE_VFORK 標記，這樣父進程會在子進程退出後才繼續運行，避免了多餘的 sleep

在子進程關閉標準 IO 庫嘗試：

> ./clonefd
before fork
5433 spawned from 5432
[2] 5432 create 5433

父進程的 printf 不工作但 write 可以工作，符合預期。在子進程關閉 STDOUT_FILENO 嘗試：

> ./clonefd
before fork
11688 spawned from 11687
[1] 11687 create 11688
[2] 11687 create 11688

兩個都能列印，證實了 fd 是經過 dup 的，與之前 vfork 的結果完全一致。下麵為 clone 增加一個共用文件描述表的設置：

    int pid = clone(child_func, stack+stack_size, CLONE_VM | CLONE_VFORK | CLONE_FILES | SIGCHLD, 0);

再運行上面兩個用例：

> ./clonefd
before fork
8676 spawned from 8675

兩個場景父進程的 printf 與 write 都不輸出了，但是原理稍有差別，前者是因為關閉標準 IO 對象後底層的句柄也被關閉了；後者是雖然標準 IO 對象雖然還打開著，但底層的句柄已經失效了，所以也無法輸出信息。

clone 雖然強大但不具備可移植性，唯一與它類似的是 FreeBSD 上的 rfork。

fork + pthread

fork 並不複製進程的線程信息，請看下例：

#include "../apue.h"
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <pthread.h>
#include <errno.h>

static void* thread_start (void *arg)
{
  printf ("thread start %lu\n", pthread_self ());
  sleep (2);
  printf ("thread exit %lu\n", pthread_self ());
  return 0;
}

int main (int argc, char *argv[])
{
    int ret = 0;
    pthread_t tid = 0;
    ret = pthread_create (&tid, NULL, &thread_start, NULL);
    if (ret != 0)
        err_sys ("pthread_create");

    pid_t pid = 0;
    if ((pid = fork ()) < 0)
        err_sys ("fork error");
    else if (pid == 0)
    {
        printf ("[%u] child running, thread %lu\n", getpid(), pthread_self());
        sleep (3);
    }
    else
    {
        printf ("fork and exec child %u in thread %lu\n", pid, pthread_self());
        sleep (4);
    }

    exit (0);
}

做個簡單說明：

父進程啟動一個線程 (thread_start)
線程啟動後休眠 2 秒
父進程啟動一個子進程，子進程啟動後休眠 3 秒後退出
父進程休眠 4 秒後退出

執行程式有如下輸出：

> ./fork_pthread
fork and exec child 9825 in thread 140542546036544
thread start 140542537676544
[9825] child running, thread 140542546036544
thread exit 140542537676544
> ./fork_pthread
fork and exec child 28362 in thread 139956664842048
[28362] child running, thread 139956664842048
thread start 139956656482048
thread exit 139956656482048

註意這個 threadid，長長的一串首尾相同，容易讓人誤認為是同一個 thread，實際上兩個是不同的，體現在中間的差異，以第二次執行的輸出為例，一個是 6484，另一個是 5648，猛的一眼看上去不容易看出來，坑爹~

兩次運行線程的啟動和子進程的啟動順序有別，但結果都是一樣的，子進程沒有觀察到線程的退出日誌，從而可以斷定沒有複製父進程的線程信息。對上面的例子稍加改造，看看線上程中 fork 子進程會如何：

#include "../apue.h"
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <pthread.h>
#include <errno.h>

static void* thread_start (void *arg)
{
  printf ("thread start %lu\n", pthread_self ());
  pid_t pid = 0;
  if ((pid = fork ()) < 0)
      err_sys ("fork error");
  else if (pid == 0)
  {
      printf ("[%u] child running, thread %lu\n", getpid(), pthread_self());
      sleep (3);
  }
  else
  {
      printf ("fork and exec child %u in thread %lu\n", pid, pthread_self());
      sleep (2);
  }
  printf ("thread exit %lu\n", pthread_self ());
  return 0;
}

int main (int argc, char *argv[])
{
    int ret = 0;
    pthread_t tid = 0;
    ret = pthread_create (&tid, NULL, &thread_start, NULL);
    if (ret != 0)
        err_sys ("pthread_create");

    sleep (4);
    printf ("main thread exit %lu\n", pthread_self());
    exit (0);
}

重新執行：

> ./fork_pthread
thread start 139848844396288
fork and exec child 17141 in thread 139848844396288
[17141] child running, thread 139848844396288
thread exit 139848844396288
thread exit 139848844396288
main thread exit 139848852756288

發現這次只複製了新線程 (4439)，沒有複製主線程 (5275)，仍然是不完整的。不過 POSIX 語義本來如此：只複製 fork 所在的線程，如果想複製進程的所有線程信息，目前僅有 Solaris 系統能做到，而且只對 Solaris 線程有效，POSIX 線程仍保持只複製一個的語義。而為了和 POSIX 語義一致 (即只複製一個 Solaris 線程)，它特意推出了 fork1 介面乾這件事，看來複制全部線程反而是個小眾需求。

exec

exec 函數族並不創建新的進程，只是用一個全新的程式替換了當前進程的正文、數據、堆和棧段，所以調用前後進程 ID 並不改變。函數族共包含六個原型：

#include <unistd.h>
extern char **environ;
int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg, ..., char * const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execve(const char *file, char *const argv[], char *const envp[]);

不同的尾碼有不同的含義：

l：使用可變參數列表傳遞新程式參數 (list)，一般需要配合 va_arg / va_start / va_end 來提取參數
v：與 l 參數相反，使用參數數組傳遞新程式參數 (vector)
p：傳遞程式文件名而非路徑，如果 file 參數不包含 / 字元，則在 PATH 環境變數中搜索可執行文件
e：指定環境變數數組 envp 參數而不是預設的 environ 變數作為新程式的環境變數

書上有個圖很好的解釋了它們的之間的關係：

做個簡單說明：

所有 l 尾碼的介面，將參數列表提取為數組後調用 v 尾碼的介面
execvp 在 PATH 環境變數中查找可執行文件，確認新程式路徑後調用 execv
execv 使用 environ 全局變數作為 envp 參數調用 execve

百川入海，execve 就是最終被調用的那個，實際上它是一個系統調用，而其它 5 個都是庫函數。上面就是 exec 函數族的主要關係，還有一些細節需要註意，下麵分別說明。

路徑搜索

帶 p 尾碼的函數在搜索 PATH 環境變數時，會依據分號(:)分隔多個路徑欄位，例如

> echo $PATH
/bin:/usr/bin:/usr/local/bin:.

包含了四個路徑，按順序分別是

/bin
/usr/bin
/usr/local/bin
當前目錄

其中當前目錄的表示方式有多種，除了顯示指定點號外，還可以

放置在最前 PATH=:/bin:/usr/bin:/usr/local/bin
放置在最後 PATH=/bin:/usr/bin:/usr/local/bin:
放置在中間 PATH=/bin::/usr/bin:/usr/local/bin

當然了，不同的位置搜索優先順序也不同，並且也不建議將當前路徑放置在 PATH 環境變數中。

參數列表

帶 l 尾碼的函數，以空指針作為參數列表的結尾，像下麵這個例子

if (execlp("echoall", "echoall", "test", (char *)0) < 0)
    err_sys ("execlp error");

如果使用常數 0，必需使用 char* 進行強制轉換，否則它將被解釋為整型參數，在整型長度與指針長度不同的平臺上， exec 函數的實際參數將會出錯。

帶 v 尾碼的函數，也需要保證數組以空指針結尾，無論是 argv 還是 envp，最終都會被新程式的 main 函數接收，所以要求與 main 函數參數相同 (參考《[apue] 進程環境那些事兒》)，它們的 man 手冊頁中也有明確說明：

       The execv(), execvp(), and execvpe() functions  provide  an  array  of  pointers  to  null-terminated
       strings  that  represent the argument list available to the new program.  The first argument, by con‐
       vention, should point to the filename associated with the file being executed.  The array of pointers
       must be terminated by a NULL pointer.

配合 execve 的 man 內容閱讀：

       argv is an array of argument strings passed to the new program.  By convention, the  first  of  these
       strings  should  contain  the  filename associated with the file being executed.  envp is an array of
       strings, conventionally of the form key=value, which are passed as environment to  the  new  program.
       Both  argv and envp must be terminated by a NULL pointer.  The argument vector and environment can be
       accessed by the called program's main function, when it is defined as:

           int main(int argc, char *argv[], char *envp[])

像附錄 8 那樣沒有給 argv 參數以空指針結尾帶來的問題就很好理解了。

參數列表中的第一個參數一般指定為程式文件名，但這隻是一種慣例，並無任何強制校驗。每個系統對命令行參數和環境變數參數的總長度都有一個限制，通過sysconf(ARG_MAX)可獲取：

> getconf ARG_MAX
2097152

POSIX 規定此值不得小於 4096，當使用 shell 的文件名擴充功能 (*) 產生一個文件列表時，可能會超過這個限制從而被截斷，為避免產生這種問題，可藉助 xargs 命令將長參數拆分成幾部分傳遞，書上給了一個查找 man 手冊中所有的 getrlimit 的例子：

查看代碼

> zgrep getrlimit /usr/share/man/*/*.gz
/usr/share/man/man0p/sys_resource.h.0p.gz:for the \fIresource\fP argument of \fIgetrlimit\fP() and \fIsetrlimit\fP():
/usr/share/man/man0p/sys_resource.h.0p.gz:int  getrlimit(int, struct rlimit *);
/usr/share/man/man0p/sys_resource.h.0p.gz:\fIgetrlimit\fP()
/usr/share/man/man1/g++.1.gz:\&\s-1RAM \s0>= 1GB.  If \f(CW\*(C`getrlimit\*(C'\fR is available, the notion of \*(L"\s-1RAM\*(R"\s0 is
/usr/share/man/man1/gcc.1.gz:\&\s-1RAM \s0>= 1GB.  If \f(CW\*(C`getrlimit\*(C'\fR is available, the notion of \*(L"\s-1RAM\*(R"\s0 is
/usr/share/man/man1/perl561delta.1.gz:offers the getrlimit/setrlimit interface that can be used to adjust
/usr/share/man/man1/perl56delta.1.gz:offers the getrlimit/setrlimit interface that can be used to adjust
/usr/share/man/man1/perlhpux.1.gz:  truncate,       getrlimit,      setrlimit
/usr/share/man/man2/brk.2.gz:.BR getrlimit (2),
/usr/share/man/man2/execve.2.gz:.BR getrlimit (2))
/usr/share/man/man2/fcntl.2.gz:.BR getrlimit (2)
/usr/share/man/man2/getpriority.2.gz:.BR getrlimit (2)
/usr/share/man/man2/getrlimit.2.gz:.\" 2004-11-16 -- mtk: the getrlimit.2 page, which formally included
/usr/share/man/man2/getrlimit.2.gz:getrlimit, setrlimit, prlimit \- get/set resource limits
/usr/share/man/man2/getrlimit.2.gz:.BI "int getrlimit(int " resource ", struct rlimit *" rlim );
/usr/share/man/man2/getrlimit.2.gz:.BR getrlimit ()
/usr/share/man/man2/getrlimit.2.gz:.BR getrlimit ()
/usr/share/man/man2/getrlimit.2.gz:.BR getrlimit ().
/usr/share/man/man2/getrlimit.2.gz:.BR getrlimit ().
/usr/share/man/man2/getrlimit.2.gz:.BR getrlimit (),
/usr/share/man/man2/getrlimit.2.gz:.\" getrlimit() and setrlimit() that use prlimit() to work around
/usr/share/man/man2/getrusage.2.gz:.\" 2004-11-16 -- mtk: the getrlimit.2 page, which formerly included
/usr/share/man/man2/getrusage.2.gz:.\" history, etc., see getrlimit.2
/usr/share/man/man2/getrusage.2.gz:.BR getrlimit (2),
/usr/share/man/man2/madvise.2.gz:.BR getrlimit (2),
/usr/share/man/man2/mremap.2.gz:.BR getrlimit (2),
/usr/share/man/man2/prlimit.2.gz:.so man2/getrlimit.2
/usr/share/man/man2/quotactl.2.gz:.BR getrlimit (2),
/usr/share/man/man2/sched_setscheduler.2.gz:.BR getrlimit (2))
/usr/share/man/man2/sched_setscheduler.2.gz:.BR getrlimit (2)).
/usr/share/man/man2/sched_setscheduler.2.gz:.BR getrlimit (2)
/usr/share/man/man2/sched_setscheduler.2.gz:.BR getrlimit (2).
/usr/share/man/man2/setrlimit.2.gz:.so man2/getrlimit.2
/usr/share/man/man2/syscalls.2.gz:\fBgetrlimit\fP(2)    1.0
/usr/share/man/man2/syscalls.2.gz:\fBugetrlimit\fP(2)   2.4
/usr/share/man/man2/syscalls.2.gz:.BR getrlimit (2)
/usr/share/man/man2/syscalls.2.gz:.IR sys_old_getrlimit ()
/usr/share/man/man2/syscalls.2.gz:.IR __NR_getrlimit )
/usr/share/man/man2/syscalls.2.gz:.IR sys_getrlimit ()
/usr/share/man/man2/syscalls.2.gz:.IR __NR_ugetrlimit ).
/usr/share/man/man2/ugetrlimit.2.gz:.so man2/getrlimit.2
/usr/share/man/man3/getdtablesize.3.gz:.BR getrlimit (2);
/usr/share/man/man3/getdtablesize.3.gz:.BR getrlimit (2)
/usr/share/man/man3/getdtablesize.3.gz:.BR getrlimit (2),
/usr/share/man/man3/malloc.3.gz:.BR getrlimit (2)).
/usr/share/man/man3/pcrestack.3.gz:  getrlimit(RLIMIT_STACK, &rlim);
/usr/share/man/man3/pcrestack.3.gz:This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
/usr/share/man/man3p/exec.3p.gz:\fIgetenv\fP(), \fIgetitimer\fP(), \fIgetrlimit\fP(), \fImmap\fP(),
/usr/share/man/man3p/fclose.3p.gz:\fIclose\fP(), \fIfopen\fP(), \fIgetrlimit\fP(), \fIulimit\fP(),
/usr/share/man/man3p/fflush.3p.gz:\fIgetrlimit\fP(), \fIulimit\fP(), the Base Definitions volume of
/usr/share/man/man3p/fputc.3p.gz:\fIferror\fP(), \fIfopen\fP(), \fIgetrlimit\fP(), \fIputc\fP(),
/usr/share/man/man3p/fseek.3p.gz:\fIgetrlimit\fP(), \fIlseek\fP(), \fIrewind\fP(), \fIulimit\fP(),
/usr/share/man/man3p/getrlimit.3p.gz:.\" getrlimit
/usr/share/man/man3p/getrlimit.3p.gz:getrlimit, setrlimit \- control maximum resource consumption
/usr/share/man/man3p/getrlimit.3p.gz:int getrlimit(int\fP \fIresource\fP\fB, struct rlimit *\fP\fIrlp\fP\fB);
/usr/share/man/man3p/getrlimit.3p.gz:The \fIgetrlimit\fP() function shall get, and the \fIsetrlimit\fP()
/usr/share/man/man3p/getrlimit.3p.gz:Each call to either \fIgetrlimit\fP() or \fIsetrlimit\fP() identifies
/usr/share/man/man3p/getrlimit.3p.gz:considered to be larger than any other limit value. If a call to \fIgetrlimit\fP()
/usr/share/man/man3p/getrlimit.3p.gz:When using the \fIgetrlimit\fP() function, if a resource limit can
/usr/share/man/man3p/getrlimit.3p.gz:is unspecified unless a previous call to \fIgetrlimit\fP()
/usr/share/man/man3p/getrlimit.3p.gz:Upon successful completion, \fIgetrlimit\fP() and \fIsetrlimit\fP()
/usr/share/man/man3p/getrlimit.3p.gz:The \fIgetrlimit\fP() and \fIsetrlimit\fP() functions shall fail if:
/usr/share/man/man3p/setrlimit.3p.gz:.so man3p/getrlimit.3p
/usr/share/man/man3/pthread_attr_setstacksize.3.gz:.BR getrlimit (2),
/usr/share/man/man3/pthread_create.3.gz:.BR getrlimit (2),
/usr/share/man/man3/pthread_getattr_np.3.gz:.BR getrlimit (2),
/usr/share/man/man3/pthread_setschedparam.3.gz:.BR getrlimit (2),
/usr/share/man/man3/pthread_setschedprio.3.gz:.BR getrlimit (2),
/usr/share/man/man3p/ulimit.3p.gz:\fIgetrlimit\fP(), \fIsetrlimit\fP(), \fIwrite\fP(), the Base Definitions
/usr/share/man/man3p/write.3p.gz:\fIchmod\fP(), \fIcreat\fP(), \fIdup\fP(), \fIfcntl\fP(), \fIgetrlimit\fP(),
/usr/share/man/man3/ulimit.3.gz:.BR getrlimit (2),
/usr/share/man/man3/ulimit.3.gz:.BR getrlimit (2),
/usr/share/man/man3/vlimit.3.gz:.so man2/getrlimit.2
/usr/share/man/man3/vlimit.3.gz:.\" getrlimit(2) briefly discusses vlimit(3), so point the user there.
/usr/share/man/man5/core.5.gz:.BR getrlimit (2)
/usr/share/man/man5/core.5.gz:.BR getrlimit (2)
/usr/share/man/man5/core.5.gz:.BR getrlimit (2),
/usr/share/man/man5/limits.conf.5.gz:\fBgetrlimit\fR(2)\fBgetrlimit\fR(3p)
/usr/share/man/man5/proc.5.gz:.BR getrlimit (2)).
/usr/share/man/man5/proc.5.gz:.BR getrlimit (2).
/usr/share/man/man5/proc.5.gz:.BR getrlimit (2)).
/usr/share/man/man5/proc.5.gz:.BR getrlimit (2))
/usr/share/man/man7/credentials.7.gz:.BR getrlimit (2);
/usr/share/man/man7/daemon.7.gz:\fBgetrlimit()\fR
/usr/share/man/man7/mq_overview.7.gz:.BR getrlimit (2).
/usr/share/man/man7/mq_overview.7.gz:.BR getrlimit (2),
/usr/share/man/man7/signal.7.gz:.BR getrlimit (2),
/usr/share/man/man7/time.7.gz:.BR getrlimit (2),

我做了兩點改進：

使用 zgrep 代替 grep 或 bzgrep 搜索 gz 壓縮文件中的內容
使用 /usr/share/man/*/*.gz 代替 */* 過濾子目錄

實測沒有報錯，看起來是因為數據量還不夠大：

$ find /usr/share/man/ -type f -name "*.gz" | wc
   9509    9509  361540

總位元組大小為 361540 仍小於限制值 2097152。不過還是改成下麵的形式更安全：

> find /usr/share/man -type f -name "*.gz" | xargs zgrep getrlimit

xargs 會自動切分參數，確保它們不超過限制，分批“喂”給 zgrep，從而實現參數長度限制的突破，不過這樣做的前提是作業可被切分為多個進程，如果必需由單個進程完成，就不能這樣搞了。

最後，exec 的環境變數與命令行參數有類似的地方：

必需以空指針結尾
有總長度限制

也有不同之處，那就是不指定 envp 參數時，也可以通過修改當前進程的環境變數，來影響子進程中的環境變數，這主要是通過 setenv、putenv 介面，關於這點請參考《[apue] 進程環境那些事兒》中環境變數一節的說明。

解釋器文件

如果為帶 p 尾碼的 exec 指定的文件不是一個由鏈接器產生的可執行文件，則將該文件當作一個腳本文件處理，此時將嘗試調用腳本首行中記錄的解釋器，格式如下：

#! pathname [ optional-argument ]

對這種文件的識別是由內核作為 exec 系統調用處理的一部分來完成的，pathname 通常是路徑名 (絕對 & 相對)，並不對它進行路徑搜索。內核使調用 exec 函數的進程實際執行的並不是 file 參數本身，而是腳本第一行中 pathname 所指定的解釋器，例如最常見的：

#!/bin/sh

相當於調用 /bin/sh path/to/script，其中 #! 之後的空格是可選的；如果沒有首行標記，則預設是 shell 腳本；若解釋器需要選項才能支持腳本文件，則需要帶上相應的選項 (optional-argument)，例如：

#! /bin/awk -f

最終相當於調用 /bin/awk -f path/to/script。書上有個不錯的例子拿來做個測試：

#! /bin/awk -f
BEGIN {
  for (i =0; i<ARGC; i++)
    printf "argv[%d]: %s\n", i, ARGV[i]
  exit
}

用於列印所有傳遞到 awk 腳本中的命令行參數，執行之：

> ./echoall.awk file1 FILENAME2 f3
argv[0]: awk
argv[1]: file1
argv[2]: FILENAME2
argv[3]: f3

有以下發現：

第一個參數是 awk 而不是 echoall.awk
沒有參數 -f

和書上講的不同，懷疑是 awk 做了處理 (-f 明顯沒有傳遞到內部的必要)，改為自己寫 C 程式版 echoall 驗證：

#include <stdio.h>

int main (int argc, char *argv[])
{
  int i;
  for (i=0; i<argc; ++ i)
    printf ("argv[%d]: %s\n", i, argv[i]);

  exit (0);
}

腳本也需要稍微改進一下：

#! ./echoall -f

因為程式已經做了所有工作，這裡腳本內容反而只有首行解釋器定義，再次執行：

> ./echoall.sh file1 FILENAME2 f3
argv[0]: ./echoall
argv[1]: -f
argv[2]: ./echoall.sh
argv[3]: file1
argv[4]: FILENAME2
argv[5]: f3

這回有了 -f 選項，並且它會被編排到 exec 函數中 argv 參數列表之前。書上的例子是直接使用 execl 來模擬內核處理解釋器文件的：

#include "../apue.h"
#include <sys/wait.h>
#include <limits.h>

int main (int argc, char *argv[])
{
  pid_t pid;
  char *exename = "echoall.sh";
  char pwd[PATH_MAX] = { 0 };
  getcwd(pwd, PATH_MAX);
  if (argc > 1)
    exename = argv[1];

  strcat (pwd, "/");
  strcat (pwd, exename);
  if ((pid = fork ()) < 0)
    err_sys ("fork error");
  else if (pid == 0)
  {
    if (execl (pwd, exename, "file1", "FILENAME2", "f3", (char *)0) < 0)
      err_sys ("execl error");
  }

  if (waitpid (pid, NULL, 0) < 0)
    err_sys ("wait error");

  exit (0);
}

輸出與上例完全一致：

> ./exec
argv[0]: ./echoall
argv[1]: -f
argv[2]: /ext/code/apue/08.chapter/echoall.sh
argv[3]: file1
argv[4]: FILENAME2
argv[5]: f3

有趣的是 optional-argument (-f) 之後的第一個參數 (argv[2])，execl 使用的是 path 參數 (pwd)，而不是參數列表中的第一個參數 (exename)：這是因為 path 參數包含了比第一個參數更多的信息，或者說第一個參數是人為指定的，可以傳入任意值，存在較大的隨意性，遠不如 path 參數可靠。

再考查一下多個 optional-argument 的場景：

#! ./echoall -f test foo bar

新的輸出看起來把他們當作了一個：

> ./echoall.sh
argv[0]: ./echoall
argv[1]: -f test foo bar
argv[2]: ./echoall.sh

最多只有一個解釋器參數，這就意味著除了 -f，不能為 awk 指定更多的額外參數，例如 -F 指定分隔符，這一點需要註意。

解釋器首行也有最大長度限制，而且與命令行參數長度限制不是一回事，以上面的腳本為例，設置一個 128 長度的參數：

#! ./echoall aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa