Python multithreading and multiprocessing, and how they differ

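I have always felt that concepts matter most when learning anything. Once the concepts and principles are understood, the details can be worked out in practice. The key to that understanding is to get a feel for the concepts through concrete examples and hands-on experiments. This article uses a few concrete examples to introduce Python's multithreading and multiprocessing; later articles will cover inter-process and inter-thread communication.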

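Python multithreading

Python 2 ships two standard-library modules for threads, thread and threading. Python 3 renamed the former to _thread and no longer recommends using it directly; the latter is a higher-level wrapper, and every example below uses threading.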

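Creating threads

There are two ways to create a thread in Python:

Instantiate a threading.Thread object and pass in a function object (the initial function) as the thread's entry point;
Subclass threading.Thread and override its run method.

Method 1: create a threading.Thread object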

import threading
import time

def tstart(arg):
    time.sleep(0.5)
    print("%s running...." % arg)

if __name__ == '__main__':
    t1 = threading.Thread(target=tstart, args=('This is thread 1',))
    t2 = threading.Thread(target=tstart, args=('This is thread 2',))
    t1.start()
    t2.start()
    print("This is main function")

Output:

This is main function
This is thread 2 running....
This is thread 1 running....

Method 2: subclass threading.Thread and override run

import threading
import time

class CustomThread(threading.Thread):
    def __init__(self, thread_name):
        # step 1: call base __init__ function
        super(CustomThread, self).__init__(name=thread_name)
        self._tname = thread_name

    def run(self):
        # step 2: override the run function
        time.sleep(0.5)
        print("This is %s running...." % self._tname)

if __name__ == "__main__":
    t1 = CustomThread("thread 1")
    t2 = CustomThread("thread 2")
    t1.start()
    t2.start()
    print("This is main function")

The output is the same as for method 1.

threading.Thread

Both methods above ultimately rely, directly or indirectly, on the threading.Thread class:

threading.Thread(group=None, target=None, name=None, args=(), kwargs={})

The following example ties the two ways of creating a thread together:

import threading
import time

class CustomThread(threading.Thread):
    def __init__(self, thread_name, target = None):
        # step 1: call base __init__ function
        super(CustomThread, self).__init__(name=thread_name, target=target, args = (thread_name,))
        self._tname = thread_name

    def run(self):
        # step 2: override the run function
        # time.sleep(0.5)
        # print("This is %s running....@run" % self._tname)
        super(CustomThread, self).run()

def target(arg):
    time.sleep(0.5)
    print("This is %s running....@target" % arg)

if __name__ == "__main__":
    t1 = CustomThread("thread 1", target)
    t2 = CustomThread("thread 2", target)
    t1.start()
    t2.start()
    print("This is main function")

Output:

This is main function
This is thread 1 running....@target
This is thread 2 running....@target

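This code shows two things:

Whichever way the thread is created, the arguments supplied end up being passed to the threading.Thread class;
The target function passed to the thread is called inside the run method of the base class Thread, provided run has not been overridden (or, as here, the override delegates to the base-class run).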
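
The second point can be made concrete with a toy model. The sketch below is not the real threading.Thread source; MiniThread, Overriding and greet are names made up for this illustration, nothing runs on a separate thread, and run returns a value only so the result is easy to see (the real run returns nothing).

```python
class MiniThread:
    """Toy stand-in that only models how run() dispatches to the target."""

    def __init__(self, target=None, args=(), kwargs=None):
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}

    def run(self):
        # The default run() simply calls the target, if one was given.
        if self._target is not None:
            return self._target(*self._args, **self._kwargs)


class Overriding(MiniThread):
    def run(self):
        # Overriding run() without calling super().run() means the
        # target is never invoked.
        return "run() was overridden"


def greet(name):
    return "hello from %s" % name


print(MiniThread(target=greet, args=("target",)).run())
print(Overriding(target=greet, args=("target",)).run())
```

This mirrors the example above: CustomThread.run works with a target only because it calls super().run().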

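For the module-level attributes and functions of threading, see the official documentation; the focus here is on the methods of threading.Thread objects.

The methods and attributes that a threading.Thread object provides are:

start(): starts the thread after it has been created; the thread then waits to be scheduled on a CPU, and run is invoked in the new thread;

run(): the entry point of the thread's execution; it either calls the user-supplied target function or, if run has been overridden, executes the overriding method;

join([timeout]): blocks the thread that calls it until the thread on which join was called finishes or the timeout expires. It is usually called from the main thread to wait for other threads to complete;

name, getName() & setName(): the thread's name; in Python 3 the name attribute is preferred over the getter and setter;

ident: an integer thread identifier; it is None until the thread has been started;

isAlive(), is_alive(): return True from the moment start returns until run finishes; isAlive() was removed in Python 3.9, so use is_alive();

daemon, isDaemon() & setDaemon(): daemon-thread control; the daemon flag must be set before start() is called.

These are the means by which a thread, once created, is managed and queried through its thread object.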
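
To see several of these attributes change over a thread's lifetime, here is a small sketch for Python 3. The worker function, the thread name and the delays are arbitrary choices for this illustration.

```python
import threading
import time


def worker(delay):
    # Simulate a short task; the thread stays alive while this runs.
    time.sleep(delay)


t = threading.Thread(target=worker, args=(0.3,), name="demo-worker")
# Before start(): the name is set, ident is None, the thread is not alive.
before_start = (t.name, t.ident, t.is_alive())

t.start()
# After start(): ident is an integer and the thread is alive.
while_running = (t.ident is not None, t.is_alive())

t.join(timeout=0.05)   # gives up after about 0.05 s; the worker still sleeps
alive_after_timeout = t.is_alive()

t.join()               # no timeout: wait until the worker has finished
alive_after_join = t.is_alive()

print(before_start)
print(while_running)
print(alive_after_timeout, alive_after_join)
```

Note that join with a timeout returns None either way, so is_alive() is the way to tell whether the thread actually finished.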

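Running multiple threads

Once the main thread has created several threads, nothing coordinates or synchronizes them by default: every thread other than the main thread starts executing at run and continues until it finishes.

join

The join method makes the main thread block and wait until a thread it created has finished.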

import threading
import time

def tstart(arg):
    print("%s running....at: %s" % (arg,time.time()))
    time.sleep(1)
    print("%s is finished! at: %s" % (arg,time.time()))

if __name__ == '__main__':
    t1 = threading.Thread(target=tstart, args=('This is thread 1',))
    t1.start()
    t1.join()   # block the current thread until t1 finishes
    print("This is main function at：%s" % time.time())

Output:

This is thread 1 running....at: 1564906617.43
This is thread 1 is finished! at: 1564906618.43
This is main function at：1564906618.43

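Without any further arrangement, the program does not exit when the main thread finishes; the process ends only after all (non-daemon) threads have finished.

With t1.join() removed from the program above, the output becomes: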

This is thread 1 running....at: 1564906769.52
This is main function at：1564906769.52
This is thread 1 is finished! at: 1564906770.52

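A created thread can be marked as a daemon thread. When the main thread finishes, the program then exits at once and any daemon threads that are still running are terminated abruptly.

Daemon threads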

import threading
import time

def tstart(arg):
    print("%s running....at: %s" % (arg,time.time()))
    time.sleep(1)
    print("%s is finished! at: %s" % (arg,time.time()))

if __name__ == '__main__':
    t1 = threading.Thread(target=tstart, args=('This is thread 1',))
    t1.daemon = True   # same effect as the older t1.setDaemon(True)
    t1.start()
    # t1.join()   # would block the current thread until t1 finishes
    print("This is main function at：%s" % time.time())

Output:

This is thread 1 running....at: 1564906847.85
This is main function at：1564906847.85

Python multiprocessing

Where the threading module is used to create threads, Python provides multiprocessing for creating processes. First, the two ways of creating a process.

"The multiprocessing package mostly replicates the API of the threading module." (Python documentation)

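Creating processes

Creating a process works much like creating a thread:

Instantiate a multiprocessing.Process object and pass in a function object (the initial function) as the new process's entry point;
Subclass multiprocessing.Process and override its run method.

Method 1: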

from multiprocessing import Process  
import os, time

def pstart(name):
    # time.sleep(0.1)
    print("Process name: %s, pid: %s "%(name, os.getpid()))

if __name__ == "__main__": 
    subproc = Process(target=pstart, args=('subprocess',))  
    subproc.start()  
    subproc.join()
    print("subprocess pid: %s"%subproc.pid)
    print("current process pid: %s" % os.getpid())

Output:

Process name: subprocess, pid: 4888 
subprocess pid: 4888
current process pid: 9912

Method 2:

from multiprocessing import Process  
import os, time

class CustomProcess(Process):
    def __init__(self, p_name, target=None):
        # step 1: call base __init__ function()
        super(CustomProcess, self).__init__(name=p_name, target=target, args=(p_name,))

    def run(self):
        # step 2: override the run function
        # time.sleep(0.1)
        print("Custom Process name: %s, pid: %s "%(self.name, os.getpid()))

if __name__ == '__main__':
    p1 = CustomProcess("process_1")
    p1.start()
    p1.join()
    print("subprocess pid: %s"%p1.pid)
    print("current process pid: %s" % os.getpid())

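A question worth considering: if, as in the multithreaded case, there is a global variable share_data, is it a problem when different processes access share_data at the same time?

Each process has its own, isolated memory address space, so the share_data that each process sees is a separate copy living in a different address space. Simultaneous access therefore causes no conflict, but by the same token a change made in one process is not visible in any other. Keep this in mind.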
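
A minimal sketch of this isolation, assuming a module-level share_data as in the question above. The child and run_demo functions and the use of a multiprocessing.Queue to report the child's view are choices made for this illustration.

```python
from multiprocessing import Process, Queue

share_data = 0  # module-level global; every process has its own copy


def child(q):
    global share_data
    share_data = 100      # changes only the child's copy
    q.put(share_data)     # report the child's view back to the parent


def run_demo():
    q = Queue()
    p = Process(target=child, args=(q,))
    p.start()
    child_value = q.get(timeout=10)   # read before join() so the queue is drained
    p.join()
    # The parent's share_data is untouched by the child's assignment.
    return child_value, share_data


if __name__ == "__main__":
    child_value, parent_value = run_demo()
    print("child saw: %s, parent still sees: %s" % (child_value, parent_value))
```

The queue is itself a small example of inter-process communication: the only way the parent learns the child's value is by having it sent explicitly.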

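The subprocess module

Since we are on the subject of multiple processes, here is one more way to create a process.

Python's subprocess module lets a program launch external programs while it is running.

For example, from a Python program on Windows we can open Notepad, open cmd, or shut the machine down at a chosen moment (the last command below really does power the machine off):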

>>> import subprocess
>>> subprocess.Popen(['cmd'])
<subprocess.Popen object at 0x0339F550>
>>> subprocess.Popen(['notepad'])
<subprocess.Popen object at 0x03262B70>
>>> subprocess.Popen(['shutdown', '-p'])

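Or use ping to check network connectivity (a Python 2 session on a Chinese-locale Windows machine, so print is written as a statement and the output is localized):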

>>> res = subprocess.Popen(['ping', 'www.cnblogs.com'], stdout=subprocess.PIPE).communicate()[0]
>>> print res
正在 Ping www.cnblogs.com [101.37.113.127] 具有 32 位元組的數據:

來自 101.37.113.127 的回覆: 位元組=32 時間=1ms TTL=91
來自 101.37.113.127 的回覆: 位元組=32 時間=1ms TTL=91
來自 101.37.113.127 的回覆: 位元組=32 時間=1ms TTL=91
來自 101.37.113.127 的回覆: 位元組=32 時間=1ms TTL=91

101.37.113.127 的 Ping 統計信息:
數據包: 已發送 = 4，已接收 = 4，丟失 = 0 (0% 丟失)，
往返行程的估計時間(以毫秒為單位):
最短 = 1ms，最長 = 1ms，平均 = 1ms
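
On Python 3.5 and later, subprocess.run is a convenient wrapper for the common "start a program, wait, collect its output" case. The sketch below launches a second Python interpreter via sys.executable instead of ping or notepad, purely so that it does not depend on any operating-system command; the command line is an arbitrary choice for this illustration.

```python
import subprocess
import sys

# Run a child Python interpreter and capture what it prints.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,   # decode the captured output to str
)

print(result.returncode)       # exit status of the child process
print(result.stdout.strip())   # text the child wrote to its standard output
```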

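Python multithreading versus multiprocessing

Consider two examples first.

The first starts two Python threads that each perform one hundred million increments, and then performs the same one hundred million increments in the main thread alone: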

import threading
import time

def tstart(arg):
    var = 0
    for i in xrange(100000000):  # xrange is Python 2; use range on Python 3
        var += 1

if __name__ == '__main__':
    t1 = threading.Thread(target=tstart, args=('This is thread 1',))
    t2 = threading.Thread(target=tstart, args=('This is thread 2',))
    start_time = time.time()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("Two thread cost time: %s" % (time.time() - start_time))
    start_time = time.time()
    tstart("This is thread 0")
    print("Main thread cost time: %s" % (time.time() - start_time))

Output:

Two thread cost time: 20.6570000648
Main thread cost time: 2.52800011635

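If only one of the two threads t1 and t2 is started in the example above, the running time is essentially the same as in the main thread. The reason is explained below.

The same operation using two processes: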

from multiprocessing import Process
import time

def pstart(arg):
    var = 0
    for i in xrange(100000000):  # xrange is Python 2; use range on Python 3
        var += 1

if __name__ == '__main__':
    p1 = Process(target = pstart, args = ("1", ))
    p2 = Process(target = pstart, args = ("2", ))
    start_time = time.time()
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print("Two process cost time: %s" % (time.time() - start_time))
    start_time = time.time()
    pstart("0")
    print("Current process cost time: %s" % (time.time() - start_time))

Output:

Two process cost time: 2.91599988937
Current process cost time: 2.52400016785

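Analysis

Two processes running in parallel each execute the same computation as the single process, yet the total time is nearly the same. The two-process run takes slightly longer, probably because creating and destroying processes involves system calls that add overhead.

With Python threads, however, running two threads at once takes far longer than a single thread: about 8 times as long here (20.66 s versus 2.53 s) for only twice the work. If the two threads are run one after the other instead, that is: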

    t1.start()
    t1.join()
    t2.start()
    t2.join()
    #Two thread cost time: 5.12199997902
    #Main thread cost time: 2.54200005531

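Run serially, the three threads (t1, t2 and the main thread) each take about the same time, roughly 2.5 s.

The underlying reason is that the two threads execute concurrently rather than truly in parallel, and the cause is the GIL.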

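The GIL

No discussion of Python multithreading can avoid the GIL (Global Interpreter Lock). It is a lock that CPython, currently the dominant Python interpreter, implements to keep the interpreter's internal data safe. However many threads a process has, only the thread holding the GIL can execute Python bytecode, even on a multi-core processor. In other words, at any moment only one thread of a process is running Python code. For CPU-bound threads, multithreading is therefore not merely unhelpful; it can be slower than a single thread. Python threads suit I/O-bound programs better, because a thread releases the GIL while it waits on blocking I/O. Programs that genuinely need parallel execution should consider multiple processes.

Contention for the lock, the scheduling of threads by the CPU, and switching between threads all cost time.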
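
The flip side is that threads do overlap when they spend their time waiting. The following is a minimal sketch, assuming time.sleep as a stand-in for a blocking I/O call (sleeping releases the GIL, as blocking I/O does); fake_io and timed_threads are names made up for this illustration.

```python
import threading
import time


def fake_io(delay):
    # Stand-in for a blocking I/O call; the GIL is released while sleeping.
    time.sleep(delay)


def timed_threads(n, delay):
    """Start n threads that each wait `delay` seconds; return total wall time."""
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = timed_threads(4, 0.2)
print("4 threads waiting 0.2 s each took %.2f s in total" % elapsed)
```

The waits overlap, so the total should be close to one wait rather than the sum of four, in contrast to the CPU-bound counting loop above.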

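Differences between threads and processes

A brief comparison of threads and processes:

A process is the basic unit of resource allocation; a thread is the basic unit of CPU execution and scheduling;
Communication and synchronization:

Processes:
- Communication: pipes, FIFOs, message queues, signals, shared memory, sockets, streams;
- Synchronization: P/V semaphores, monitors
Threads:
- Synchronization: mutexes, recursive locks, condition variables, semaphores
- Communication: threads in the same process share that process's resources, so threads have no counterpart to the data-passing mechanisms used between processes; communication between threads serves mainly to synchronize them.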

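What actually runs on a CPU is a thread. Threads are lighter than processes, and switching and scheduling them costs less;
Threads must consider thread safety when they touch shared process data. Processes are isolated from one another and each has its own memory, which is comparatively safe; they can exchange data only through the IPC (Inter-Process Communication) mechanisms listed above;
A system is made up of processes. Each process contains a code segment, a data segment, heap space and stack space, plus the part shared with the operating system, and is in one of three states: waiting, ready or running;
A process can contain several threads. The threads share the process's resources (file descriptors, global variables, heap space and so on), while register state and stack space are private to each thread;
One process crashing does not affect other processes in the operating system. If a thread inside a process crashes and the OS supports threads with a many-to-one model, the whole process crashes;
If the CPU and the system support both multithreading and multiprocessing, several processes can run in parallel while the threads inside each process also run in parallel, which is how the hardware's performance is used to the full.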
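
The thread-safety point in the list above can be illustrated with the simplest of the synchronization tools it names, a mutex. This is a sketch only: the counter, the add_many function and the thread and iteration counts are arbitrary choices for this illustration.

```python
import threading

counter = 0                 # shared by all threads of this process
lock = threading.Lock()     # mutex guarding the counter


def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # one thread at a time does the read-modify-write
            counter += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

With the lock in place the final count is always 4 x 10000. Without it, counter += 1 is a read followed by a write, and two threads can interleave between the two steps and lose an update.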

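Context switching for threads and processes

A process switch involves a great deal: saving register contents to the task state segment (TSS), switching page tables, switching stacks, and so on. Put simply, it consists of two steps:

Switch the page global directory, so that the CPU addresses memory in the new process's linear address space;
Switch the kernel-mode stack and the hardware context; the hardware context comprises the contents of the CPU registers and is stored in the TSS.

Threads run inside their process's address space, so a switch between threads of the same process involves no change of address space and requires only the second step.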

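Threads or processes?

CPU-bound: the program needs the CPU for large amounts of computation and data processing;

I/O-bound: the program performs I/O frequently, for example sending and reading data over network sockets;

Because Python threads do not execute in parallel, they are better suited to I/O-bound programs; processes, which do run in parallel, suit CPU-bound programs.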
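
One way to apply this rule of thumb in Python 3 is the concurrent.futures module, which this article has not used so far; it offers the same interface over a thread pool and a process pool. The sketch below is an illustration under that assumption, and cpu_task, io_task, run_cpu_bound and run_io_bound are made-up names.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time


def cpu_task(n):
    # CPU-bound: pure-Python arithmetic that holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i
    return total


def io_task(delay):
    # I/O-bound stand-in: sleeping releases the GIL.
    time.sleep(delay)
    return delay


def run_cpu_bound(inputs):
    # Separate processes, each with its own interpreter and its own GIL.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(cpu_task, inputs))


def run_io_bound(delays):
    # Threads are enough here because the tasks mostly wait.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(io_task, delays))


if __name__ == "__main__":
    print(run_cpu_bound([100000, 200000]))
    print(run_io_bound([0.1, 0.1, 0.1, 0.1]))
```

Because both pools share one interface, switching a workload from threads to processes is mostly a matter of changing the executor class, provided the task function and its arguments can be pickled.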