Advanced Python Regular Expressions


1. The Python regular expression module

1.1 The four main uses of regular expressions on strings

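Match: check whether a string conforms to a regular expression; most languages return true or false.
Extract: use a regular expression to pull out the text in a string that meets the requirements.
Replace: find the text in a string that matches a regular expression and substitute another string for it.
Split: use a regular expression to split a string.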
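The four uses can be sketched with the module-level functions of re. The sample string and patterns here are made up for illustration only.

```python
import re

text = "order 66, order 77"  # illustrative sample string

# 1. Match: does the string fit the pattern? (None means no match)
is_order = re.match(r"order \d+", text) is not None

# 2. Extract: pull out every piece of text that fits the pattern
numbers = re.findall(r"\d+", text)

# 3. Replace: swap each matching piece for another string
masked = re.sub(r"\d+", "#", text)

# 4. Split: cut the string wherever the pattern matches
parts = re.split(r",\s*", text)

print(is_order, numbers, masked, parts)
```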

1.2 Two ways to use regular expressions with Python's re module

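Call re.compile(r, f) with a pattern string r and optional flags f to build a regular expression object, then call that object's methods. The advantage is that a compiled object can be reused many times.
Every method of a regular expression object has a corresponding module-level function in re. The main difference is that the first argument is the pattern string; the module-level functions also do not take the start and end positions that some object methods accept. This form suits a regular expression that is used only once.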
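A minimal sketch of the two approaches side by side, using an illustrative sample string:

```python
import re

s = "a1b22c333"  # illustrative sample string

# Approach 1: compile once, then call methods on the pattern object.
rx = re.compile(r"\d+")
from_object = rx.findall(s)

# Approach 2: module-level function; the pattern string is the first argument.
from_module = re.findall(r"\d+", s)

# Only the pattern object's method accepts a start position.
from_offset = rx.findall(s, 3)

print(from_object, from_module, from_offset)
```

Both approaches give the same result for the same pattern; the compiled object mainly pays off when the pattern is reused.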

1.3 Common methods of regular expression objects

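1. rx.findall(s, start, end):

    Returns a list. With no groups in the pattern, the list holds every matched substring. With exactly one group, it holds the text captured by that group. With several groups, each element is a tuple of the text captured by each group; the text matched by the whole pattern is not returned.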
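The three cases can be compared on one illustrative string (the string and patterns are assumptions chosen for the sketch):

```python
import re

s = "x=1, y=22"  # illustrative sample string

no_group = re.compile(r"\w=\d+").findall(s)        # whole matches
one_group = re.compile(r"\w=(\d+)").findall(s)     # text of the single group
two_groups = re.compile(r"(\w)=(\d+)").findall(s)  # one tuple per match

print(no_group, one_group, two_groups)
```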

2. rx.finditer(s, start, end):

    Returns an iterator.
    Each iteration step yields a match object. Call its group() method to see what a given group matched; 0 means the text matched by the whole expression.
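A short sketch of iterating over matches; the sample string is made up for illustration:

```python
import re

rx = re.compile(r"(\w+)@(\w+)")
s = "alice@home bob@work"  # illustrative sample string

# Each m is a match object: group(0) is the whole match,
# group(1) the first group, start() the offset of the match.
found = [(m.group(0), m.group(1), m.start()) for m in rx.finditer(s)]

print(found)
```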

3. rx.search(s, start, end):

    Returns a match object, or None if nothing matches.
    search stops at the first match and does not continue scanning.

4. rx.match(s, start, end):

    Returns a match object if the regular expression matches at the start of the string, otherwise None.
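The difference between search and match shows up when the first match is not at the start of the string. A small sketch on an illustrative string:

```python
import re

rx = re.compile(r"\d+")
s = "abc123def456"  # illustrative sample string

first = rx.search(s)        # scans forward and stops at the first match
at_start = rx.match(s)      # None: the string does not begin with a digit
at_offset = rx.match(s, 3)  # matching is anchored at position 3 instead

print(first.group(), at_start, at_offset.span())
```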

5. rx.sub(x, s, m):

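    Returns a string in which each match is replaced by x; if m is given, at most m replacements are made. x may be a string, which can refer to captured groups with \1 or \g<name>, or a function that takes a match object and returns the replacement text.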
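A sketch of the three forms of replacement described above. The helper name add and the sample string are assumptions made for the example.

```python
import re

rx = re.compile(r"(\d+)-(\d+)")
s = "1-2, 3-4, 5-6"  # illustrative sample string

# String replacement with numbered group references.
swapped = rx.sub(r"\2-\1", s)

# Same idea with \g<number> references, limited to one replacement.
swapped_once = rx.sub(r"\g<2>-\g<1>", s, count=1)

# Function replacement: receives the match object, returns the new text.
def add(m):  # hypothetical helper for this sketch
    return str(int(m.group(1)) + int(m.group(2)))

summed = rx.sub(add, s)

print(swapped, swapped_once, summed, sep="\n")
```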

6. rx.subn(x, s, m):

    Same as rx.sub(), except that it returns a 2-tuple: the resulting string and the number of replacements made.

7. rx.split(s, m):

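    Splits the string at each place the regular expression matches and returns a list. If m is given, at most m splits are made.

    If the pattern contains groups, the text captured by the groups is also placed in the list, between the pieces on either side of each split point. For example:

    rx = re.compile(r"(\d)[a-z]+(\d)")
    s = "ab12dk3klj8jk9jks5"
    result = rx.split(s)

This returns ['ab1', '2', '3', 'klj', '8', '9', 'jks5']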

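8. rx.flags: the flags set when the regular expression was compiled (an attribute, not a method)

9. rx.pattern: the string the regular expression was compiled from (an attribute, not a method)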
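Both values are read without parentheses. A minimal sketch:

```python
import re

rx = re.compile(r"\d+", re.IGNORECASE)

# pattern is the original pattern string.
source = rx.pattern

# flags is an integer bit mask; test a single flag with &.
# (It may include flags beyond the ones passed in, so avoid ==.)
ignores_case = bool(rx.flags & re.IGNORECASE)

print(source, ignores_case)
```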

1.4 Attributes and methods of match objects

01. m.group(g, ...)

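    Returns the text matched by the group with the given number or name. With no argument, or with 0, it returns the text matched by the whole expression. With several arguments it returns a tuple.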

02. m.groupdict(default)

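    Returns a dictionary whose keys are the names of all named groups and whose values are the text those groups captured.
    If default is given, it is used as the value for groups that did not take part in the match.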

03. m.groups(default)

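    Returns a tuple containing the text captured by every group, starting from group 1. If default is given, it is used as the value for groups that captured nothing.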
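A sketch of group(), groupdict() and groups() on a pattern with an optional group that does not take part in the match. The date-like pattern and sample string are assumptions for the example.

```python
import re

m = re.search(
    r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(?P<day>\d{2}))?",
    "released 2024-05",  # illustrative string with no day part
)

whole = m.group()              # same as m.group(0)
pair = m.group(1, "month")     # several arguments give a tuple
named = m.groupdict()          # unmatched group maps to None
named_default = m.groupdict("01")
all_groups = m.groups("??")    # default fills unmatched groups

print(whole, pair, named, named_default, all_groups, sep="\n")
```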

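04. m.lastgroup

    The name of the last capturing group that matched, or None if that group has no name or no group matched (rarely used). This is an attribute, not a method.

05. m.lastindex

    The number of the last capturing group that matched, or None if no group matched. This is an attribute, not a method.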

06. m.start(g):

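    The position in the string where group g of this match starts (the whole match if g is omitted). Returns -1 if the group did not take part in the match.

07. m.end(g)

    The position in the string where group g of this match ends. Returns -1 if the group did not take part in the match.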

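08. m.span(g)

    Returns a 2-tuple of m.start(g) and m.end(g).

09. m.re

    The regular expression object that produced this match object (an attribute).

10. m.string

    The string passed to match or search (an attribute).

11. m.pos

    The position where the search started: the beginning of the string, or the position given by start (an attribute, rarely used).

12. m.endpos

    The position where the search ended: the end of the string, or the position given by end (an attribute, rarely used).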
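The position methods and the attributes above can be checked together. In this sketch the pattern has two alternative groups so that one of them does not take part in the match; the pattern and string are illustrative.

```python
import re

rx = re.compile(r"(a)|(b)")
s = "xxb"  # illustrative sample string

m = rx.search(s, 1)  # start scanning at position 1

whole_span = m.span()       # span of the whole match
unused_start = m.start(1)   # group 1 did not participate
unused_span = m.span(1)
last_index = m.lastindex    # attribute: number of the last matched group
last_name = m.lastgroup     # attribute: None, the group is unnamed

print(whole_span, unused_start, unused_span, last_index, last_name)
print(m.re is rx, m.string, m.pos, m.endpos)
```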

1.5 Summary

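Matching: Python has no method that returns true or false, but you can test whether the return value of match or search is None.
Searching: to find one occurrence, use the match object returned by search or match; to find every occurrence, iterate over the iterator returned by finditer.
Replacing: use the sub or subn method of a regular expression object, or the module-level functions re.sub and re.subn. In both forms the replacement can be a string or a function that builds the replacement text.
Splitting: use the split method of a regular expression object. Note that if the pattern contains groups, the text they capture is also placed in the returned list.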
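Two of these points in a short sketch: turning a search result into a boolean, and passing a function as the replacement through both the object method and the module function. The helper name contains is hypothetical.

```python
import re

def contains(pattern, s):
    """Hypothetical helper: True if pattern matches anywhere in s."""
    return re.search(pattern, s) is not None

has_digit = contains(r"\d", "abc1")
no_digit = contains(r"\d", "abc")

# A function replacement works with the compiled object and with re.sub.
def upper_first(m):
    return m.group().upper()

rx = re.compile(r"\b\w")
via_object = rx.sub(upper_first, "hello world")
via_module = re.sub(r"\b\w", upper_first, "hello world")

print(has_digit, no_digit, via_object, via_module)
```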

2 Regex matching and substitution

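1. Referring back to a matched group within a Python regular expression

Earlier we covered group matching: anything enclosed in one pair of parentheses is a group. A more complex regular expression such as (1)(2)(3) has three groups. To refer back, inside the same expression, to what an earlier group matched, the simplest way is by group number, as in (1)(2)(3)\1. This \num syntax is a backreference, as in the following example: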

#python 3.6
#
import re
 
address = re.compile(
    r'''
    # The regular name
    (\w+)               # first name
    \s+
    (([\w.]+)\s+)?      # optional middle name or initial
    (\w+)               # last name
    \s+
    <
    # The address: [email protected]
    (?P<email>
      \1               # first name
      \.
      \4               # last name
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )
    >
    ''',
    re.VERBOSE | re.IGNORECASE)
 
candidates = [
    u'First Last <[email protected]>',
    u'Different Name <[email protected]>',
    u'First Middle Last <[email protected]>',
    u'First M. Last <[email protected]>',
]
 
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print('  Match name :', match.group(1), match.group(4))
        print('  Match email:', match.group(5))
    else:
        print('  No match')

The output is:

Candidate: First Last <[email protected]>
  Match name : First Last
  Match email: [email protected]
Candidate: Different Name <[email protected]>
  No match
Candidate: First Middle Last <[email protected]>
  Match name : First Last
  Match email: [email protected]
Candidate: First M. Last <[email protected]>
  Match name : First Last
  Match email: [email protected]

This example refers back to group 1 (first name) and group 4 (last name), so an entry whose email address does not agree with the name before it is discarded.
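The same idea in a reduced form that can be run as is. The pattern below is a simplified variant, not the one above, and the example.com addresses are hypothetical values chosen for the sketch.

```python
import re

# \1 and \2 must repeat what groups 1 and 2 captured (case-insensitively).
address = re.compile(
    r"(\w+)\s+(\w+)\s+<(\1\.\2@example\.com)>",
    re.IGNORECASE,
)

ok = address.search("First Last <first.last@example.com>")       # hypothetical address
bad = address.search("Different Name <first.last@example.com>")  # name and address disagree

print(ok.group(3) if ok else None)
print(bad)
```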

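2. Referring back to a matched group by name within a Python regular expression

We have seen that groups can be referenced by number. That looks simple, but it is hard to maintain: if one day you insert a new group into the expression, all the later group numbers shift and every reference has to be updated one by one. A more maintainable way is to refer to groups by name, as in the following example: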

#python 3.6

#
import re
 
address = re.compile(
    r'''
    # The regular name
    (?P<first_name>\w+)
    \s+
    (([\w.]+)\s+)?      # optional middle name or initial
    (?P<last_name>\w+)
    \s+
    <
    # The address: [email protected]
    (?P<email>
      (?P=first_name)
      \.
      (?P=last_name)
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
    )
    >
    ''',
    re.VERBOSE | re.IGNORECASE)
 
candidates = [
    u'cai junsheng <[email protected]>',
    u'Different Name <[email protected]>',
    u'Cai Middle junsheng <[email protected]>',
    u'Cai M. junsheng <[email protected]>',
]
 
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print('  Match name :', match.groupdict()['first_name'],
              end=' ')
        print(match.groupdict()['last_name'])
        print('  Match email:', match.groupdict()['email'])
    else:
        print('  No match')

The output is:

Candidate: cai junsheng <[email protected]>
  Match name : cai junsheng
  Match email: [email protected]
Candidate: Different Name <[email protected]>
  No match
Candidate: Cai Middle junsheng <[email protected]>
  Match name : Cai junsheng
  Match email: [email protected]
Candidate: Cai M. junsheng <[email protected]>
  Match name : Cai junsheng
  Match email: [email protected]

In this example the references are made with (?P=first_name) and (?P=last_name).
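A smaller sketch of a named backreference, here used to find a doubled word. The group name word and the sample sentence are assumptions for the example.

```python
import re

# (?P=word) must repeat exactly what the group named "word" captured.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = doubled.search("this is is a test")  # illustrative sentence
repeated = m.group("word")
matched_text = m.group(0)

print(repeated, "|", matched_text)
```

Because the reference uses the name, adding another group earlier in the pattern would not require changing it.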

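3. Choosing a pattern in Python regular expressions depending on whether a group matched

So far we have referred to a group's content by name or number, which lets one part of the expression be required to equal an earlier part. Going one step further, we can pick one pattern when an earlier group matched and a different pattern when it did not, much like an if...else... statement. The syntax is (?(id)yes-expression|no-expression), where id is a group name or group number, yes-expression is the pattern used when that group matched, and no-expression is the pattern used when it did not. For example: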

#python 3.6
#
import re
 
address = re.compile(
    r'''
    ^
    # A name is made up of letters, and may include "."
    # for title abbreviations and middle initials.
    (?P<name>
       ([\w.]+\s+)*[\w.]+
     )?
    \s*
    # Email addresses are wrapped in angle brackets, but
    # only if a name is found.
    (?(name)
      # remainder wrapped in angle brackets because
      # there is a name
      (?P<brackets>(?=(<.*>$)))
      |
      # remainder does not include angle brackets without name
      (?=([^<].*[^>]$))
     )
    # Look for a bracket only if the look-ahead assertion
    # found both of them.
    (?(brackets)<|\s*)
    # The address itself: [email protected]
    (?P<email>
      [\w\d.+-]+       # username
      @
      ([\w\d.]+\.)+    # domain name prefix
      (com|org|edu)    # limit the allowed top-level domains
     )
    # Look for a bracket only if the look-ahead assertion
    # found both of them.
    (?(brackets)>|\s*)
    $
    ''',
    re.VERBOSE)
 
candidates = [
    u'Cai junsheng <[email protected]>',
    u'No Brackets [email protected]',
    u'Open Bracket <[email protected]',
    u'Close Bracket [email protected]>',
    u'[email protected]',
]
 
for candidate in candidates:
    print('Candidate:', candidate)
    match = address.search(candidate)
    if match:
        print('  Match name :', match.groupdict()['name'])
        print('  Match email:', match.groupdict()['email'])
    else:
        print('  No match')

The output is:

Candidate: Cai junsheng <[email protected]>
  Match name : Cai junsheng
  Match email: [email protected]
Candidate: No Brackets [email protected]
  No match
Candidate: Open Bracket <[email protected]
  No match
Candidate: Close Bracket [email protected]>
  No match
Candidate: [email protected]
  Match name : None
  Match email: [email protected]

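Here the angle brackets < > are looked for only when the name group is present, and the match fails if they are not paired. When the name group is absent, no brackets are expected, so the other pattern is chosen.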
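The conditional syntax can be tried on a much smaller pattern. This sketch is a simplified variant rather than the pattern above: the group name open and the sample addresses are hypothetical, and the no-expression branch is left out, which makes it match the empty string.

```python
import re

# If the optional "<" was matched, a closing ">" is required; otherwise nothing.
rx = re.compile(r"^(?P<open><)?(?P<email>\w+@\w+(?:\.\w+)+)(?(open)>)$")

samples = [
    "<user@host.com>",  # both brackets: match
    "user@host.com",    # no brackets: match
    "<user@host.com",   # opening bracket only: no match
    "user@host.com>",   # closing bracket only: no match
]

results = [bool(rx.search(s)) for s in samples]
print(results)
```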

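4. Replacing matched groups with Python regular expressions

The earlier examples only tested for a match and never changed the original text. Now we modify the text once a match is found, using sub() for the substitution and a numbered group reference to put the originally matched characters back. For example: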

#python 3.6
#
import re
 
bold = re.compile(r'\*{2}(.*?)\*{2}')
 
text = 'Make this **cai**.  This **junsheng**.'
 
print('Text:', text)
print('Bold:', bold.sub(r'<b>\1</b>', text))

The output is:

Text: Make this **cai**.  This **junsheng**.
Bold: Make this <b>cai</b>.  This <b>junsheng</b>.

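5. Replacing matched groups by name with Python regular expressions

We have seen how to substitute by group number, as in bold.sub(r'<b>\1</b>', text), where \1 inserts the group. That is quick and simple, but hard to maintain and to remember. To improve on it, substitute by group name: as with named-group matching, give the group a name, just as people are given names so they are easy to tell apart and remember. The syntax is \g<name>: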

#python 3.6
#
import re
 
bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}')
 
text = 'Make this **cai**.  This **junsheng**.'
 
print('Text:', text)
print('Bold:', bold.sub(r'<b>\g<bold_text></b>', text))

The output is:

Text: Make this **cai**.  This **junsheng**.
Bold: Make this <b>cai</b>.  This <b>junsheng</b>.
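The \g<...> form also accepts a group number, which helps when the reference is followed directly by a digit. A minimal sketch with an illustrative pattern:

```python
import re

rx = re.compile(r"(\d)(\d)")

# Insert a literal "0" between the two captured digits.
# r"\10" would be read as a reference to group 10, so \g<1>0 is used.
out = rx.sub(r"\g<1>0\g<2>", "12")

print(out)
```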

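6. Replacing matched groups and limiting the number of replacements with Python regular expressions

The substitution by group name above replaces every match, however many there are. If one day the project manager asks for only the first match, or the first five, to be replaced, use the count parameter of sub(), which sets the maximum number of replacements. For example: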

#python 3.6 
#
import re
 
bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}')
 
text = 'Make this **cai**.  This **junsheng**.'
 
print('Text:', text)
print('Bold:', bold.sub(r'<b>\g<bold_text></b>', text, count=1))

The output is:

Text: Make this **cai**.  This **junsheng**.
Bold: Make this <b>cai</b>.  This **junsheng**.

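7. Replacing matched groups and reporting the number of replacements with Python regular expressions

We have seen how to limit the number of replacements. To find out how many replacements were actually made, use another function, subn(). It returns both the resulting text and the number of replacements. For example: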

#python 3.6
#
import re
 
bold = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}')
 
text = 'Make this **cai**.  This **junsheng**.'
 
print('Text:', text)
print('Bold:', bold.subn(r'<b>\g<bold_text></b>', text))

The output is:

Text: Make this **cai**.  This **junsheng**.
Bold: ('Make this <b>cai</b>.  This <b>junsheng</b>.', 2)