Linux I/O Models and Related Techniques, Illustrated

Blocking I/O model

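The Linux kernel originally provided only blocking read and write operations.

When a client connects, the kernel allocates a file descriptor for the connection. It appears in the process's file descriptor directory (/proc/<pid>/fd) alongside the standard ones (0 standard input; 1 standard output; 2 standard error), for example as fd 8 or fd 9.
When the application needs data, it issues the read system call on that descriptor, for example read(8, ...). If no data has arrived yet, the calling process or thread blocks until some does.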

man 2 read

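SYNOPSIS
       #include <unistd.h>
       ssize_t read(int fd, void *buf, size_t count);
DESCRIPTION
       read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
       If count is zero, read() returns zero and has no other effect. If count is greater than SSIZE_MAX, the result is unspecified.
RETURN VALUE
       On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number. It is not an error if this number is smaller than the number of bytes requested; this may happen because fewer bytes are available right now (for example, close to end of file, or when reading from a pipe or a terminal), or because read() was interrupted by a signal.
       On error, -1 is returned and errno is set to indicate the error. In this case it is left unspecified whether the file position changes.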
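
The short-read and signal-interruption cases in the return value description can be handled with a small loop. The following is a minimal sketch, not a definitive implementation; the helper name read_full is hypothetical and is not part of any library.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() on a blocking descriptor until
 * cap bytes have arrived or end of file is reached.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, char *buf, size_t cap)
{
    size_t got = 0;
    while (got < cap) {
        /* On a blocking descriptor this call sleeps until data arrives. */
        ssize_t n = read(fd, buf + got, cap - got);
        if (n == 0)
            break;                /* zero means end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            return -1;            /* real error, errno is set */
        }
        got += (size_t)n;         /* a short read is not an error */
    }
    return (ssize_t)got;
}
```

While the thread sits inside read(), it can do nothing else, which is why this model needs one process or thread per connection.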

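Problem

With many client connections, say 1000, the application needs 1000 processes or threads, each blocked waiting on its own connection. This causes a performance problem:

The CPU keeps switching between them, and the process or thread context-switch overhead shrinks the share of time spent actually reading data, wasting CPU capacity.
This motivated non-blocking I/O.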

Non-blocking I/O model

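The Linux kernel then offered non-blocking read and write, which can be enabled on a socket by passing the SOCK_NONBLOCK flag to socket().

The application no longer needs one thread per file descriptor. A single thread can loop over all descriptors calling read; if no data has arrived on a descriptor, the call returns immediately instead of blocking (with -1 and errno set to EAGAIN).
If data is available, the thread can dispatch it to the business logic.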

man 2 socket

Since  Linux  2.6.27, the type argument serves a second purpose: in addition to specifying a socket type, it may include the bitwise OR of any of the following values, to modify the behavior of
       socket():

       SOCK_NONBLOCK   Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor.  Using this flag saves extra calls to fcntl(2) to  achieve
                       the same result.

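This shows that socket() has accepted SOCK_NONBLOCK since Linux 2.6.27. Non-blocking mode itself is older: as the excerpt notes, the flag only saves the extra fcntl(2) call that was previously needed to set O_NONBLOCK.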
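
As an illustration, here is a minimal sketch of the fcntl(2) route and of a read that returns immediately when nothing is available. The helper names set_nonblocking and try_read, and the -2 "no data yet" return code, are hypothetical choices made for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Set O_NONBLOCK on an existing descriptor. SOCK_NONBLOCK achieves the
 * same result at socket creation time. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: returns the number of bytes read, 0 on end of file,
 * -1 on a real error, and -2 when no data is available yet. */
ssize_t try_read(int fd, void *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;   /* nothing to read right now; the caller may poll again */
    return n;
}
```

A polling thread would call try_read on each descriptor in turn and skip those that return -2.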

Problem

Likewise, with many client connections, say 1000, every polling pass issues 1000 system calls, and that overhead is considerable.

Hence select.

I/O multiplexing model - select

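The Linux kernel provides select, which reduces those 1000 system calls to one; the polling happens in kernel space.

When select returns, the fd sets it was given contain only the ready descriptors. The application just checks which descriptors are still set, reads from them, and runs the business logic.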

man 2 select

SYNOPSIS
       #include <sys/select.h>
       int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
         
DESCRIPTION
       select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become "ready" for some class of I/O operation (e.g., input possible). A file
       descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.

       select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation. See BUGS.

As the excerpt shows, multiple file descriptors can be handed to the kernel in one call, and the kernel does the polling.
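
A minimal sketch of that pattern follows, watching several descriptors for readability with a single select call. The helper name wait_readable and its parameters are hypothetical.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: wait up to timeout_ms for any of fds[0..n) to become
 * readable. On success, ready[i] is 1 if fds[i] is readable and 0 otherwise.
 * Returns the number of ready descriptors, or -1 on error. */
int wait_readable(const int *fds, int n, int *ready, int timeout_ms)
{
    fd_set rset;
    int maxfd = -1;

    FD_ZERO(&rset);
    for (int i = 0; i < n; i++) {
        if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
            return -1;            /* select cannot monitor this descriptor */
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv = {
        .tv_sec = timeout_ms / 1000,
        .tv_usec = (timeout_ms % 1000) * 1000,
    };

    /* One system call; the kernel clears the bits of descriptors
     * that are not ready. */
    int rc = select(maxfd + 1, &rset, NULL, NULL, &tv);
    if (rc < 0)
        return -1;

    for (int i = 0; i < n; i++)
        ready[i] = FD_ISSET(fds[i], &rset) ? 1 : 0;
    return rc;
}
```

Note that the fd set has to be rebuilt before every call, because select overwrites it with the result.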

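Problem

This cuts 1000 system calls down to one, but each select call still has to pass all 1000 file descriptors to the kernel, and the fd sets are copied between user space and kernel space every time. That copying has its own cost.

Hence epoll.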

I/O multiplexing model - epoll

man epoll
man 2 epoll_create
man 2 epoll_ctl
man 2 epoll_wait

epoll:

SYNOPSIS
       #include <sys/epoll.h>
			 
DESCRIPTION
       The  epoll  API  performs  a  similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them.  The epoll API can be used either as an edge-triggered or a
       level-triggered interface and scales well to large numbers of watched file descriptors.

       The central concept of the epoll API is the epoll instance, an in-kernel data structure which, from a user-space perspective, can be considered as a container for two lists:

       • The interest list (sometimes also called the epoll set): the set of file descriptors that the process has registered an interest in monitoring.

       • The ready list: the set of file descriptors that are "ready" for I/O.  The ready list is a subset of (or, more precisely, a set of references to) the file descriptors in  the  interest  list.
         The ready list is dynamically populated by the kernel as a result of I/O activity on those file descriptors.

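epoll_create:

The kernel creates an epoll instance data structure and returns a file descriptor, epfd, that refers to it.

epoll_ctl:

Registers a file descriptor fd together with the events to monitor (an epoll_event), removes it, or modifies its monitored events.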

SYNOPSIS
       #include <sys/epoll.h>
       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
       This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed
       for the target file descriptor, fd.

       Valid values for the op argument are:
       EPOLL_CTL_ADD
              Add an entry to the interest list of the epoll file descriptor, epfd. The entry includes the file descriptor, fd, a reference to the corresponding open file description (see epoll(7)
              and open(2)), and the settings specified in event.
       EPOLL_CTL_MOD
              Change the settings associated with fd in the interest list to the new settings specified in event.
       EPOLL_CTL_DEL
              Remove (deregister) the target file descriptor fd from the interest list. The event argument is ignored and can be NULL (but see BUGS below).

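epoll_wait:

Blocks until a registered event occurs, returns the number of ready events, and writes them into the caller-supplied array of epoll_event structures.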
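
The three calls fit together roughly as follows. This is a minimal sketch assuming level-triggered monitoring of readability only; the helper names watch_readable and collect_ready, and the batch size of 64, are hypothetical.

```c
#include <sys/epoll.h>

/* Register fd on the epoll instance epfd for readability (level-triggered).
 * Returns 0 on success, -1 on error. */
int watch_readable(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: wait up to timeout_ms and copy the ready descriptors
 * into out[0..cap). Returns the number of ready descriptors, or -1 on error. */
int collect_ready(int epfd, int *out, int cap, int timeout_ms)
{
    struct epoll_event events[64];   /* arbitrary batch size for this sketch */
    if (cap > 64)
        cap = 64;

    /* Only descriptors on the ready list are returned; the interest list
     * stays in the kernel and is not passed again on each call. */
    int n = epoll_wait(epfd, events, cap, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = events[i].data.fd;
    return n;
}
```

A caller would create the instance once with epoll_create1(0), register each connection with watch_readable, and then call collect_ready in a loop.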

Further topics

Other I/O optimization techniques

man 2 mmap
man 2 sendfile
man 2 fork

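mmap:

mmap maps a file on disk into a free range of the calling process's virtual address space. The process can then read and write the file through ordinary memory accesses instead of calling read and write, which removes the copy between the kernel's page cache and a user-space buffer and makes file access cheaper.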
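
To make this concrete, here is a minimal sketch that reads a file through a read-only mapping instead of calling read. The helper name count_byte_mmap is hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count occurrences of byte c in the file at path by
 * mapping it into memory. Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, char c)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {        /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                    /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* plain memory access, no read() */
        if (p[i] == c)
            count++;

    munmap(p, (size_t)st.st_size);
    return count;
}
```

Pages are loaded on first access through page faults, so the loop touches the page cache directly rather than a private copy in a user buffer.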

Taking a read as an example:

(Figure: the read path with mmap)

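Further reading: "A deep dive into how mmap works, starting from three key questions": https://www.jianshu.com/p/eece39beee20

Use cases

Kafka memory-maps some of its on-disk files (notably its index files), so writes go straight into the page cache without a copy from a user-space buffer into the kernel, and the kernel flushes them to disk.

Likewise, Java's MappedByteBuffer is backed by mmap on Linux.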

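sendfile:

The sendfile system call transfers data directly between two file descriptors, entirely inside the kernel. This avoids copying data between kernel buffers and user buffers, which makes it very efficient; the technique is known as zero copy.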

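Use cases

Kafka is one example: when a consumer fetches messages, Kafka calls sendfile (through FileChannel.transferTo in Java), so data goes from the page cache or the data file straight to the network card without the two extra copies through user space. This is the so-called "zero copy".

Web servers such as Tomcat, Nginx, and Apache also use sendfile when serving static resources over the network.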

fork

man 2 fork

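There are three ways to create a child process:

fork: after the call, the child has its own pid and task_struct and a logical copy of all of the parent's data and resources. What is actually copied up front is mainly the page tables (the pointers), not the contents of the parent's virtual memory. Parent and child run concurrently, and their variables are isolated from each other.

Linux now uses copy-on-write (COW) to reduce the cost: fork does not initially produce two separate copies, because at that moment most of the data is identical anyway.
Copy-on-write defers the real copy. If a write does happen later, the parent's and child's data diverge, so the affected page is copied and each process gets its own. This keeps the cost of fork itself low.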

NOTES
       Under  Linux,  fork()  is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique
       task structure for the child.

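vfork: unlike fork, a child created with vfork shares the parent's address space. The child runs entirely in the parent's address space, so any change it makes to data there is visible to the parent. After vfork, the parent is suspended until the child calls exec or exits.
clone: can be thought of as a generalization of fork and vfork. The caller chooses through the flags argument which resources are shared and which are copied. The CLONE_VFORK flag decides whether the parent is suspended while the child runs: without it, parent and child run concurrently; with it, the parent is suspended until the child calls exec or exits.

Summary
- What fork is for
  A process wants a copy of itself so that parent and child can execute different code paths at the same time.
  For example, Redis RDB persistence uses fork: the snapshot reflects an exact point in time, it is fast, and the parent keeps serving requests.
- What vfork is for
  A process created with vfork is mainly meant to call an exec function right away to run another program.
- What clone is for
  Choosing selectively which resources parent and child share and which are copied.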
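
The isolation that fork and copy-on-write provide can be observed with a small experiment. This is a minimal sketch; the helper name fork_and_modify and the variable counter are hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 100;   /* lives in a page shared copy-on-write after fork */

/* Hypothetical helper: fork a child that increments counter and reports its
 * value through its exit status. Stores the parent's view of counter in
 * *parent_view. Returns the child's value, or -1 on error. */
int fork_and_modify(int *parent_view)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        counter += 1;           /* the write gives the child a private page */
        _exit(counter);         /* child exits with 101 */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    *parent_view = counter;     /* the parent still sees 100 */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child sees its own modification while the parent's value is untouched, which is the property a point-in-time snapshot relies on.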

@SvenAugustus (https://my.oschina.net/langxSpirit)