Linux Kernel Notes: How epoll Is Implemented (Part 1)


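1. Introduction

These notes target kernel version 4.4.10.

This post is only my own brief notes from reading the source. If you want to understand how epoll is implemented, I strongly recommend the following articles: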

The Implementation of epoll(1)

The Implementation of epoll(2)

The Implementation of epoll(3)

The Implementation of epoll(4)

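2. epoll_create()

The system call epoll_create() creates an epoll instance and returns the file descriptor fd for that instance. Inside the kernel, each epoll instance corresponds one-to-one with an object of type struct eventpoll. This object is the core of epoll and is declared in fs/eventpoll.c.

The epoll_create interface is defined in fs/eventpoll.c as well; the main parts of the source are analyzed below.

First, a struct eventpoll object is created: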

struct eventpoll *ep = NULL;
...
error = ep_alloc(&ep);
if (error < 0)
    return error;

Next, an unused file descriptor is allocated:

fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
if (fd < 0) {
    error = fd;
    goto out_free_ep;
}

Then a struct file object is created. Its struct file_operations *f_op is set to the global variable eventpoll_fops, and its void *private_data is pointed at the eventpoll object ep that was just created:

struct file *file;
...
file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep, O_RDWR | (flags & O_CLOEXEC));
if (IS_ERR(file)) {
    error = PTR_ERR(file);
    goto out_free_fd;
}

Then the file pointer inside eventpoll is set:

ep->file = file;

Finally, the file descriptor is installed into the current process's file descriptor table and returned to the user:

fd_install(fd, file);
return fd;
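
Seen from user space, the steps above collapse into a single call. The following is a minimal sketch, assuming a Linux system with glibc; the helper names make_epoll_cloexec() and has_cloexec() are my own and do not appear in the kernel. It creates an epoll instance with EPOLL_CLOEXEC and checks that the flag ended up on the returned descriptor, which is the user-visible effect of the (flags & O_CLOEXEC) expression passed to get_unused_fd_flags().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance with close-on-exec set.
 * Returns the new file descriptor, or -1 on error. */
int make_epoll_cloexec(void)
{
    /* EPOLL_CLOEXEC has the same value as O_CLOEXEC, which is why the
     * kernel can mask the flags with O_CLOEXEC directly. */
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: return 1 if close-on-exec is set on fd,
 * 0 if it is not, and -1 if fcntl() fails. */
int has_cloexec(int fd)
{
    int fdflags;

    assert(fd >= 0);
    fdflags = fcntl(fd, F_GETFD);
    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling has_cloexec(make_epoll_cloexec()) should report that the flag is set; passing 0 to epoll_create1() instead should leave it clear.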

After these steps, the main structures are related as shown in the figure below:

3. epoll_ctl()

The system call epoll_ctl() is defined in the kernel as follows; for the meaning of each parameter, see the epoll_ctl man page:

SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, struct epoll_event __user *, event)

epoll_ctl() first checks whether op is a delete operation. If it is not, the event argument is copied from user space into the kernel:

struct epoll_event epds;
...
if (ep_op_has_event(op) &&
     copy_from_user(&epds, event, sizeof(struct epoll_event)))
         goto error_return;

ep_op_has_event() simply tests whether op is something other than a delete operation:

static inline int ep_op_has_event(int op)
{
    return op != EPOLL_CTL_DEL;
}
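
Because the event argument is never copied for EPOLL_CTL_DEL, user space can pass NULL for it when removing a descriptor. The sketch below illustrates this on a pipe; it assumes Linux with glibc, and del_with_null_event() is a name I made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: add the read end of a pipe to a fresh epoll
 * instance, then remove it again with a NULL event pointer.
 * Returns 0 if both operations succeed, -1 otherwise. */
int del_with_null_event(void)
{
    int pipefd[2];
    int epfd, rc = -1;
    struct epoll_event ev = { .events = EPOLLIN };

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.data.fd = pipefd[0];
    /* ADD needs a real event; DEL never looks at the pointer. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL) == 0)
        rc = 0;

    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```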

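Next comes a check for whether the user set the EPOLLEXCLUSIVE flag. This flag only exists from kernel 4.5 onward. It mainly addresses the "thundering herd" problem that arises when the same file descriptor is added to several epoll instances at once; the linked write-up has a detailed description. The flag comes with some restrictions: it may only be set in an EPOLL_CTL_ADD operation, and the target file descriptor must not itself be an epoll instance. The code below checks these restrictions: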

/*
 * epoll adds to the wakeup queue at EPOLL_CTL_ADD time only,
 * so EPOLLEXCLUSIVE is not allowed for a EPOLL_CTL_MOD operation.
 * Also, we do not currently supported nested exclusive wakeups.
 */
if (epds.events & EPOLLEXCLUSIVE) {
    if (op == EPOLL_CTL_MOD)
        goto error_tgt_fput;
    if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
            (epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
        goto error_tgt_fput;
}

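Starting from the file descriptors passed in, the kernel obtains the struct file objects step by step, and then reads the struct eventpoll object from the private_data field of the epoll file. (The fdget() calls have to come before the check above, since that check already uses tf.file.)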

struct fd f, tf;
struct eventpoll *ep;
... 
f = fdget(epfd); 
... 
tf = fdget(fd); 
...
ep = f.file->private_data;

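If the file descriptor being added itself represents an epoll instance, a cycle could be created. The kernel checks for this case and returns an error if a cycle exists. I have not read that part of the code closely yet, so it is not reproduced here.

Next, the red-black tree of the epoll instance is searched for the struct epitem object that corresponds to the monitored file. If none exists, meaning the file has not been added before, NULL is returned.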

struct epitem *epi;
...
epi = ep_find(ep, tf.file, fd);

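ep_find() is essentially a red-black tree lookup. The comparison function used for both lookup and insertion is ep_cmp_ffd(): it first compares the addresses of the struct file objects and, if they are equal, compares the file descriptors. One case in which the struct file addresses are equal is when the dup() system call makes different file descriptors refer to the same struct file object.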

static inline int ep_cmp_ffd(struct epoll_filefd *p1, 
                             struct epoll_filefd *p2)
{
        return (p1->file > p2->file ? +1:
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}
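
A consequence of keying the tree on the pair (file, fd) is that a descriptor and its dup() copy count as two distinct entries, even though they share one struct file. The following user-space sketch, assuming Linux with glibc and using the made-up helper name add_fd_and_its_dup(), registers both in the same epoll instance.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register a pipe's read end and a dup() of it in
 * one epoll instance. Both descriptors share a struct file, so only the
 * fd half of the key differs. Returns 0 if both adds succeed, else -1. */
int add_fd_and_its_dup(void)
{
    int pipefd[2], dupfd = -1, epfd = -1, rc = -1;
    struct epoll_event ev = { .events = EPOLLIN };

    if (pipe(pipefd) < 0)
        return -1;
    dupfd = dup(pipefd[0]);
    epfd = epoll_create1(0);
    if (dupfd < 0 || epfd < 0)
        goto out;

    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto out;
    ev.data.fd = dupfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) < 0)
        goto out;
    rc = 0;
out:
    if (epfd >= 0)
        close(epfd);
    if (dupfd >= 0)
        close(dupfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```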

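Next, different handling is applied depending on op; here we only look at the add operation, where op equals EPOLL_CTL_ADD. The kernel first checks whether the epitem pointer returned by the previous step is NULL. If it is not NULL, the file has already been added and the error -EEXIST is returned; otherwise ep_insert() is called to do the actual add. Before adding the file, the kernel automatically adds the POLLERR and POLLHUP events to the requested event set.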

if (!epi) {
    epds.events |= POLLERR | POLLHUP;
    error = ep_insert(ep, &epds, tf.file, fd, full_check);
} else
    error = -EEXIST;
if (full_check)
    clear_tfile_check_list();

After ep_insert() returns, the full_check flag is tested. This flag is related to the cycle detection mentioned above and is also skipped here.
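
Both behaviours of the add path are visible from user space. The sketch below, assuming Linux with glibc, uses two helpers whose names are my own: second_add_errno() registers the same descriptor twice and reports the errno of the second attempt, and events_after_writer_closes() asks only for EPOLLIN on a pipe and then looks at what epoll_wait() reports once the write end is closed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register the read end of a pipe twice. Returns the
 * errno of the second EPOLL_CTL_ADD, 0 if it unexpectedly succeeded, or
 * -1 if the setup failed. */
int second_add_errno(void)
{
    int pipefd[2], epfd, result = -1;
    struct epoll_event ev = { .events = EPOLLIN };

    if (pipe(pipefd) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
            result = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0
                         ? errno : 0;
        close(epfd);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}

/* Hypothetical helper: request only EPOLLIN, close the write end, and
 * return the event mask reported by epoll_wait() (0 if none or on error). */
uint32_t events_after_writer_closes(void)
{
    int pipefd[2], epfd;
    uint32_t reported = 0;
    struct epoll_event ev = { .events = EPOLLIN }, out;

    if (pipe(pipefd) < 0)
        return 0;
    epfd = epoll_create1(0);
    if (epfd >= 0) {
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
            close(pipefd[1]);   /* hang up the writer side */
            pipefd[1] = -1;
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                reported = out.events;
        }
        close(epfd);
    }
    close(pipefd[0]);
    if (pipefd[1] >= 0)
        close(pipefd[1]);
    return reported;
}
```

The second helper should see EPOLLHUP in the returned mask even though only EPOLLIN was requested, which matches the kernel OR-ing POLLERR | POLLHUP into epds.events.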

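4. ep_insert()

ep_insert() first checks whether the number of files watched by the current user (ep->user->epoll_watches) has reached the limit max_user_watches. If not, it allocates a struct epitem object for the file being added: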

int error, revents, pwake = 0;
unsigned long flags;
long user_watches;
struct epitem *epi;
struct ep_pqueue epq;
 
user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
        return -ENOSPC;
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;
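
The limit compared against here is exposed to user space as a sysctl. The sketch below reads it from /proc/sys/fs/epoll/max_user_watches, the path documented in epoll(7); the helper name read_max_user_watches() is my own, and the code assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read the per-user watch limit from procfs.
 * Returns the limit, or -1 if the file is missing or unreadable. */
long read_max_user_watches(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Once a user's total number of registered descriptors reaches this value, further EPOLL_CTL_ADD calls fail with ENOSPC, as the check above shows.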

Next comes the initialization of the epitem:

INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
if (epi->event.events & EPOLLWAKEUP) {
        error = ep_create_wakeup_source(epi);
        if (error)
                goto error_create_wakeup_source;
} else {
        RCU_INIT_POINTER(epi->ws, NULL);
}

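Next comes a more important step: attaching the epitem object to the wait queue of the monitored file. A wait queue is essentially a linked list of callback functions, defined in include/linux/wait.h. Because different file systems are implemented differently, the wait queue cannot be obtained directly from the struct file object. Instead, the poll operation of the struct file is invoked, and it hands back the object's wait queue through a callback. The callback set here is ep_ptable_queue_proc: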

struct ep_pqueue epq;
...
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

/*
 * Attach the item to the poll hooks and get current event bits.
 * We can safely use the file* here because its usage count has
 * been increased by the caller of this function. Note that after
 * this operation completes, the poll callback can start hitting
 * the new item.
 */
revents = ep_item_poll(epi, &epq.pt);

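The struct ep_pqueue in the code above exists so that the poll callback can get back to the corresponding epitem object. This technique is very common in the Linux kernel.

Inside the callback ep_ptable_queue_proc(), the kernel creates a struct eppoll_entry object and sets the callback function of its wait queue entry to ep_poll_callback(). In other words, when an event arrives on the monitored file, for example when a socket receives data, ep_poll_callback() is invoked. The code of ep_ptable_queue_proc() is as follows: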
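
The technique is embedding one structure inside another and recovering the outer one from a pointer to the inner one, which the kernel does with container_of(). The following is a user-space sketch of that idea, not the kernel code itself: the macro is a simplified stand-in for the kernel's container_of(), and the demo_* types and item_from_table() are hypothetical names that only mirror the roles of poll_table, ep_pqueue, epitem and ep_item_from_epqueue().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature versions of the structures involved. */
struct demo_item {
    int fd;
};

struct demo_poll_table {
    int unused;                  /* placeholder for the real contents */
};

struct demo_pqueue {
    struct demo_poll_table pt;   /* only this member is passed to poll */
    struct demo_item *item;      /* extra context carried alongside it */
};

/* Given the embedded table, step back to the wrapper and fetch the item. */
struct demo_item *item_from_table(struct demo_poll_table *pt)
{
    return container_of(pt, struct demo_pqueue, pt)->item;
}
```

The callee only ever receives a pointer to the embedded member, yet it can reach any extra context stored next to it, without the generic interface having to know about that context.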

static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
        struct epitem *epi = ep_item_from_epqueue(pt);
        struct eppoll_entry *pwq;

        if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
                init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
                pwq->whead = whead;
                pwq->base = epi;
                if (epi->event.events & EPOLLEXCLUSIVE)
                        add_wait_queue_exclusive(whead, &pwq->wait);
                else
                        add_wait_queue(whead, &pwq->wait);
                list_add_tail(&pwq->llink, &epi->pwqlist);
                epi->nwait++;
        } else {
                /* We have to signal that an error occurred */
                epi->nwait = -1;
        }
}
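
To make the "list of callback functions" idea concrete, here is a toy user-space model of the registration and wakeup flow. It is only a sketch under heavy simplification: there is no locking, no exclusive waiters and no memory allocation, and every toy_* name is hypothetical rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct toy_wait_entry;
typedef void (*toy_wait_func)(struct toy_wait_entry *entry, unsigned key);

/* One registered waiter: a callback plus the object it belongs to. */
struct toy_wait_entry {
    toy_wait_func func;
    void *owner;
    struct toy_wait_entry *next;
};

/* The queue head owned by the monitored object (e.g. a socket). */
struct toy_wait_head {
    struct toy_wait_entry *first;
};

/* Stand-in for an epitem: remembers which events have fired. */
struct toy_item {
    int fd;
    unsigned ready_events;
};

/* Plays the role of ep_poll_callback(): record the event on the item. */
void toy_poll_callback(struct toy_wait_entry *entry, unsigned key)
{
    struct toy_item *item = entry->owner;

    item->ready_events |= key;
}

/* Plays the role of ep_ptable_queue_proc(): hook the item onto the queue. */
void toy_queue_proc(struct toy_wait_head *whead, struct toy_wait_entry *entry,
                    struct toy_item *item)
{
    entry->func = toy_poll_callback;
    entry->owner = item;
    entry->next = whead->first;
    whead->first = entry;
}

/* Plays the role of a wakeup: run every callback; returns how many ran. */
int toy_wake_up(struct toy_wait_head *whead, unsigned key)
{
    int called = 0;
    struct toy_wait_entry *e;

    for (e = whead->first; e != NULL; e = e->next) {
        e->func(e, key);
        called++;
    }
    return called;
}
```

After toy_queue_proc() has run once, a later toy_wake_up() on the same head reaches the item without the waker knowing anything about epoll, which is the point of routing everything through the wait queue.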

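The relationships among eppoll_entry, epitem and the other structures are shown in the figure below:

Back in ep_insert(): once ep_item_poll() returns, the fllink field of the epitem is added to the f_ep_links list in struct file. This makes it possible to find every struct epitem associated with a struct file, and from each struct epitem the struct eventpoll of the epoll instance it belongs to.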

spin_lock(&tfile->f_lock);
list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
spin_unlock(&tfile->f_lock);

Then the epitem is inserted into the red-black tree:

ep_rbtree_insert(ep, epi);

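Finally, some state is updated and the function returns, which completes the insertion.

Before returning, ep_insert() also checks once more whether the file just added already has events ready. If so, the item is added to the epoll instance's ready list. The ready list is covered in the next part, so it is skipped here.

Finally, here is the diagram I drew of the relationships among these structures.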
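
This ready-at-add-time check can be observed from user space: if data is already waiting when a descriptor is registered, a non-blocking epoll_wait() reports it immediately, with no further wakeup needed. The sketch below assumes Linux with glibc; ready_at_add_time() is a name I made up.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: write one byte into a pipe *before* registering
 * its read end, then poll with a zero timeout. Returns the number of
 * ready events reported by epoll_wait(), or -1 on failure. */
int ready_at_add_time(void)
{
    int pipefd[2], epfd, n = -1;
    struct epoll_event ev = { .events = EPOLLIN }, out;

    if (pipe(pipefd) < 0)
        return -1;
    if (write(pipefd[1], "x", 1) != 1)
        goto out_pipe;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_pipe;

    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
        n = epoll_wait(epfd, &out, 1, 0);   /* timeout 0: do not block */

    close(epfd);
out_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```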