Day 4: Regular Expressions (the re module)

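To use regular expressions in Python, import the re module (re is short for "regular expression"). Regular expressions are a tool for processing strings. Strings often contain information we want to extract, and knowing these string-processing methods makes many tasks much easier.

Reference on regular expressions and string processing: http://www.cnblogs.com/alex3714/articles/5169958.html

Regular expressions come up constantly in Python because file processing is so common: files contain strings, and regular expressions are one of the main tools for working with them, so they are worth learning well. Let's look at the functions the re module provides: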

（1）match(pattern, string, flags=0)

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

As the docstring says ("Try to apply the pattern at the start of the string, returning a match object, or None if no match was found"), match() tries the pattern at the beginning of the string only. It returns a match object on success and None otherwise.

Key points: (1) matching starts at the beginning of the string; (2) None is returned when there is no match.

Here are a few examples:

    import re
    string = "abcdef"
    m = re.match("abc", string)     # (1) match "abc" and look at what is returned
    print(m)
    print(m.group())

    n = re.match("abcf", string)    # (2) the pattern does not occur in the string
    print(n)

    l = re.match("bcd", string)     # (3) the pattern occurs only in the middle of the string
    print(l)

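Output:

    <_sre.SRE_Match object; span=(0, 3), match='abc'>     (1)
    abc                                                   (2)
    None                                                  (3)
    None                                                  (4)

Output (1) shows that match() returns a match object; to get the matched text, call group() on it, as in (2). (On Python 3.7 and later the object prints as <re.Match object; ...>.) If the pattern does not occur in the string, None is returned (3). Because match(pattern, string, flags) only ever matches at the start of the string, a pattern that occurs only in the middle of the string also yields None (4).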
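Since match() returns None on failure, calling group() directly on the result raises AttributeError when nothing matched. A minimal sketch of guarding against that; the helper name match_prefix is made up for this illustration and is not part of the re module:

```python
import re

def match_prefix(pattern, text):
    """Hypothetical helper: return the text matched at the start of
    `text`, or None if the pattern does not match there."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so check before calling group().
    return m.group() if m is not None else None

print(match_prefix("abc", "abcdef"))   # abc
print(match_prefix("bcd", "abcdef"))   # None
```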

（2）fullmatch(pattern, string, flags=0)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

From the docstring above: Try to apply the pattern to all of the string, returning a match object, or None if no match was found...
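The difference from match() is easiest to see side by side. A small sketch, reusing the "abcdef" string from the match() example (the variable names are only for illustration):

```python
import re

text = "abcdef"

prefix_only = re.match("abc", text)        # succeeds: "abc" is a prefix of text
whole = re.fullmatch("abc", text)          # fails: "def" is left unmatched
whole_ok = re.fullmatch("[a-f]+", text)    # succeeds: the pattern covers all of text

print(prefix_only.group())  # abc
print(whole)                # None
print(whole_ok.group())     # abcdef
```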

（3）search(pattern,string,flags)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

The docstring of search(pattern, string, flags) reads: "Scan through string looking for a match to the pattern, returning a match object, or None if no match was found." search() looks for the pattern at any position in the string and returns a match object for the first place it matches, or None if it matches nowhere.

Key points: (1) the match may start anywhere in the string, unlike match(), which only tries the beginning; (2) None is returned when there is no match.

    import re
    string = "ddafsadadfadfafdafdadfasfdafafda"

    m = re.search("a", string)         # (1) the first match is in the middle of the string
    print(m)
    print(m.group())

    n = re.search("N", string)         # (2) no match
    print(n)

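Output:

    <_sre.SRE_Match object; span=(2, 3), match='a'>     (1)
    a                                                    (2)
    None                                                 (3)

Output (1) shows that search(pattern, string, flags=0) can match at any position in the string, which makes it more widely useful than match(), and that on success it also returns a match object. (2) To get the matched text out of a match object, call group(). (3) If nothing matches, None is returned.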
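Besides group(), a match object also records where the match was found. A small sketch, with a shortened sample string made up for illustration:

```python
import re

text = "ddafsad"
m = re.search("a", text)

# start() and end() delimit the slice of text that matched; span() returns both.
print(m.start(), m.end())        # 2 3
print(m.span())                  # (2, 3)
print(text[m.start():m.end()])   # a
```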

（4）sub(pattern,repl,string,count=0,flags=0)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a string, backslash escapes in it are processed. If it is
    a callable, it's passed the match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

sub(pattern, repl, string, count=0, flags=0) is search-and-replace: it finds the places where pattern matches in string and replaces each of them with repl. count limits how many matches are replaced; the default of 0 replaces all of them. For example:

    import re
    string = "ddafsadadfadfafdafdadfasfdafafda"

    m = re.sub("a", "A", string)            # (1) no replacement count given
    print(m)

    n = re.sub("a", "A", string, count=2)   # (2) replace at most 2 matches
    print(n)

    l = re.sub("F", "B", string)            # (3) the pattern does not match
    print(l)

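Output:

    ddAfsAdAdfAdfAfdAfdAdfAsfdAfAfdA        -- (1)
    ddAfsAdadfadfafdafdadfasfdafafda        -- (2)
    ddafsadadfadfafdafdadfasfdafafda        -- (3)

In (1) no count is given, so by default every match is replaced. In (2) a count is given, so only that many matches are replaced. In (3) the pattern does not occur in the string, so the original string is returned unchanged.

Key points: (1) count limits the number of replacements, and without it all matches are replaced; (2) if nothing matches, the original string is returned.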
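The docstring also says repl may be a callable that receives the match object and returns the replacement string. A minimal sketch of that form; the function double_number and the sample input are invented for this example:

```python
import re

def double_number(m):
    """Hypothetical replacement function: receives a match object and
    returns the replacement text for that match."""
    return str(int(m.group()) * 2)

# Each run of digits is passed to double_number and replaced by its result.
result = re.sub(r"\d+", double_number, "a1b22c333")
print(result)  # a2b44c666
```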

（5）subn(pattern,repl,string,count=0,flags=0)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl. number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

From the docstring, "Return a 2-tuple containing (new_string, number)": subn() returns a tuple holding the new string after substitution and the number of substitutions that were made.

    import re
    string = "ddafsadadfadfafdafdadfasfdafafda"

    m = re.subn("a", "A", string)            # (1) replace every match
    print(m)

    n = re.subn("a", "A", string, count=3)   # (2) replace only some matches
    print(n)

    l = re.subn("F", "A", string)            # (3) the pattern does not match
    print(l)

Output:

    ('ddAfsAdAdfAdfAfdAfdAdfAsfdAfAfdA', 11)     (1)
    ('ddAfsAdAdfadfafdafdadfasfdafafda', 3)      (2)
    ('ddafsadadfadfafdafdadfasfdafafda', 0)      (3)

The output shows that sub() and subn(pattern, repl, string, count=0, flags=0) perform the same replacement and differ only in what they return: sub() returns the new string, while subn() returns a tuple of the new string and the number of replacements made.
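Because subn() returns a 2-tuple, the result is usually unpacked straight into two names. A short sketch with a made-up input string:

```python
import re

# Unpack the (new_string, number) tuple returned by subn().
new_string, replaced = re.subn("a", "A", "banana")

print(new_string)  # bAnAnA
print(replaced)    # 3
```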

（6）split(pattern,string,maxsplit=0,flags=0)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

split(pattern, string, maxsplit=0, flags=0) splits a string wherever pattern matches and returns the pieces in a list ("returning a list containing the resulting substrings"). For example:

    import re
    string = "ddafsadadfadfafdafdadfasfdafafda"

    m = re.split("a", string)               # (1) split on every match
    print(m)

    n = re.split("a", string, maxsplit=3)   # (2) limit the number of splits
    print(n)

    l = re.split("F", string)               # (3) the separator does not occur in the string
    print(l)

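Output:

    ['dd', 'fs', 'd', 'df', 'df', 'fd', 'fd', 'df', 'sfd', 'f', 'fd', '']     (1)
    ['dd', 'fs', 'd', 'dfadfafdafdadfasfdafafda']                             (2)
    ['ddafsadadfadfafdafdadfasfdafafda']                                      (3)

(1) shows that when the string begins or ends with the separator, the list gets an empty string ("") at that end; here the string ends with "a", so the last element is "". (2) shows that the number of splits can be limited. (3) shows that when the separator does not occur in the string, the result is a list containing just the original string.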
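The docstring mentions that capturing parentheses in the pattern cause the separators themselves to be included in the result. A small sketch of the difference, using an invented arithmetic-style string:

```python
import re

text = "1+2-3"

# Without a capturing group, the separators are dropped.
without_group = re.split(r"[+-]", text)
# With a capturing group, each matched separator is kept in the list.
with_group = re.split(r"([+-])", text)

print(without_group)  # ['1', '2', '3']
print(with_group)     # ['1', '+', '2', '-', '3']
```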

（7）findall(pattern,string,flags=0)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

findall(pattern, string, flags=0) returns a list containing every match of the pattern in the string. For example:

    import re
    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"

    m = re.findall("[a-z]", string)       # (1) match every letter and return them in a list
    print(m)

    n = re.findall("[0-9]", string)       # (2) match every digit and return them in a list
    print(n)

    l = re.findall("[ABC]", string)       # (3) nothing matches
    print(l)

Output:

    ['d', 'd', 'a', 'd', 'f', 'a', 'd', 'f', 'a', 'f', 'd', 'a', 'f', 'd', 'a', 'd', 'f', 'a', 's', 'f', 'd', 'a', 'f', 'a', 'f', 'd', 'a']        (1)
    ['1', '2', '3', '2', '4', '6', '4', '6', '5', '1', '6', '4', '8', '1', '5', '6', '4', '1', '2', '7', '1', '1', '3', '0', '0', '2', '5', '8']   (2)
    []                 (3)

In (1) every letter in the string is matched, one character per match. In (2) every digit is matched and returned in a list. In (3) nothing matches, so an empty list is returned.

Key points: (1) when nothing matches, an empty list is returned; (2) without a quantifier in the pattern, each match is a single character.
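The docstring notes that capturing groups change what findall() returns: with one group you get the group's text, with several groups you get tuples. A brief sketch with an invented key=value string:

```python
import re

text = "x=1, y=22, z=333"

whole = re.findall(r"[a-z]=\d+", text)           # no groups: the full matches
one_group = re.findall(r"[a-z]=(\d+)", text)     # one group: just that group's text
two_groups = re.findall(r"([a-z])=(\d+)", text)  # two groups: a list of tuples

print(whole)       # ['x=1', 'y=22', 'z=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('x', '1'), ('y', '22'), ('z', '333')]
```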

（8）finditer(pattern,string,flags=0)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string. For each match, the iterator returns a match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

finditer(pattern, string) searches for the pattern and, per its docstring, will "Return an iterator over all non-overlapping matches in the string. For each match, the iterator returns a match object."

The code:

    import re
    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"

    m = re.finditer("[a-z]", string)
    print(m)

    n = re.finditer("AB", string)
    print(n)

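Output:

    <callable_iterator object at 0x7fa126441898> (1)
    <callable_iterator object at 0x7fa124d6b710> (2)

The output shows that finditer(pattern, string, flags=0) returns an iterator object, even when the pattern matches nothing, as in (2). To see the matches themselves, iterate over it; each item is a match object.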
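Printing the iterator only shows its address, so here is a short sketch of actually consuming it. The shortened sample string is chosen just for this illustration:

```python
import re

string = "dd12a32d4"

# Each item produced by finditer() is a match object.
found = [(m.group(), m.span()) for m in re.finditer(r"[0-9]+", string)]
print(found)  # [('12', (2, 4)), ('32', (5, 7)), ('4', (8, 9))]

# A pattern that never matches gives an iterator with no items.
nothing = list(re.finditer("AB", string))
print(nothing)  # []
```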

（9）compile(pattern,flags=0)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a pattern object."
    return _compile(pattern, flags)

（10）purge()

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _cache_repl.clear()

（11）template(pattern,flags=0)

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)
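Regular expressions:

Syntax:

    import re
    string = "dd12a32d46465fad1648fa1564fda127fd11ad30fa02sfd58afafda"

    p = re.compile("[a-z]+")   # first compile the pattern with compile(pattern)
    m = p.match(string)        # then match against the string
    print(m.group())

The compile and match steps above can also be merged into a single call:

    m = re.match("[a-z]+", string)

The result is the same. The difference is that the first form parses the pattern once, up front, and reuses the compiled pattern object for every match. The one-line form has to resolve the pattern on every call; the re module keeps a cache of recently compiled patterns, so this is usually a cache lookup rather than a full recompile, but it is still extra work per call. So if you need to pick out every line that starts with a digit from a file of 50,000 lines, compile the pattern first and then match; it will be somewhat faster.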
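The "lines that start with a digit" case can be sketched as follows. The list of lines stands in for a file and is made up for this example; with a real file you would loop over the file object instead:

```python
import re

# Sample data standing in for the lines of a file (an assumption for this sketch).
lines = ["14534Abc", "abc123", "9 lives", "no digits first"]

# Compile once, outside the loop, then reuse the pattern object for every line.
starts_with_digit = re.compile(r"^[0-9]")
matching = [line for line in lines if starts_with_digit.match(line)]

print(matching)  # ['14534Abc', '9 lives']
```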

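Pattern syntax:

（1）^ matches the start of the string

    import re
    string = "dd12a32d41648f27fd11a0sfdda"

    # ^ anchors the pattern to the start of the string; here we use it with search()
    m = re.search("^[0-9]", string)    # (1) does the string start with a digit?
    print(m)

    n = re.search("^[a-z]+", string)   # (2) does the string start with letters? Anchored with ^, search() behaves much like match()
    print(n.group())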

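Output:

    None
    dd

In (1) we use ^ to anchor the match to the start of the string and test whether it starts with a digit. The string starts with letters, not digits, so the match fails and None is returned. In (2) we test for leading letters; the string does start with letters, so the match succeeds and returns them. Seen this way, ^ makes search() behave like match(), which always starts at the beginning.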

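（2）$ matches the end of the string

    import re
    string = "15111252598"

    # ^ anchors the start and $ anchors the end, so the whole string must be 11 digits
    m = re.match("^[0-9]{11}$", string)
    print(m.group())

Output:

    15111252598

re.match("^[0-9]{11}$", string) matches a string that consists of exactly 11 digits from start to end.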

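（3）The dot (.) matches any character except a newline. When the re.DOTALL flag is given, it matches any character, including a newline

    import re
    string = "1511\n1252598"

    # The dot (.) matches any character except a newline
    m = re.match(".", string)     # (1) with no quantifier, the dot matches a single character
    print(m.group())

    n = re.match(".+", string)    # (2) .+ matches one or more characters, none of them a newline
    print(n.group())

Output:

    1
    1511

The output shows that in (1) the dot matches any single character. In (2) we match as many characters as possible, but because the string contains a newline (not a space) in the middle, only the part before the newline is matched and the rest is left out.

Key points: (1) the dot (.) matches any character except a newline; (2) .+ matches one or more characters, none of which is a newline.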
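For comparison, here is a small sketch of the same pattern with and without the re.DOTALL flag, reusing the string from the example above:

```python
import re

string = "1511\n1252598"

# By default the dot stops at the newline.
default = re.match(r".+", string).group()
# With re.DOTALL the dot also matches the newline, so the whole string matches.
dotall = re.match(r".+", string, re.DOTALL).group()

print(repr(default))  # '1511'
print(repr(dotall))   # '1511\n1252598'
```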

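（4）[...] For example, [abc] matches "a", "b", or "c"

[...] matches any one of the characters listed inside the brackets. [A-Za-z0-9] matches one character in A-Z, a-z, or 0-9.

    import re
    string = "1511\n125dadfadf2598"

    # [] matches any character listed inside the brackets
    m = re.findall("[5fd]", string)    # match every 5, f, or d in the string
    print(m)

Output:

    ['5', '5', 'd', 'd', 'f', 'd', 'f', '5']

The code above matches every 5, f, and d in the string and returns them in a list.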

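（5）[^...] For example, [^abc] matches any character other than a, b, or c

    import re
    string = "1511\n125dadfadf2598"

    # [^] matches any character NOT listed inside the brackets
    m = re.findall("[^5fd]", string)    # match every character except 5, f, and d
    print(m)

Output:

    ['1', '1', '1', '\n', '1', '2', 'a', 'a', '2', '9', '8']

The code above matches every character other than 5, f, and d. [^...] matches any character that is not listed inside the brackets.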

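（6）* matches zero or more repetitions of the preceding expression

    import re
    string = "1511\n125dadfadf2598"

    # * matches zero or more repetitions of the preceding expression
    m = re.findall(r"\d*", string)    # match zero or more digits
    print(m)

Output:

    ['1511', '', '125', '', '', '', '', '', '', '', '2598', '']

The output shows that * matches zero or more repetitions. Here we match zero or more digits: at every position where there is no digit, the pattern still succeeds with zero digits and yields an empty string (""), and there is one more empty match at the very end of the string.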

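（7）+ matches one or more repetitions of the preceding expression

    import re
    string = "1511\n125dadfadf2598"

    # + matches one or more repetitions of the preceding expression
    m = re.findall(r"\d+", string)    # match one or more digits
    print(m)

Output:

    ['1511', '125', '2598']

The plus sign (+) matches one or more repetitions. \d+ above matches runs of one or more digits, so every match contains at least one digit.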

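（8）? matches zero or one occurrence of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy

    import re
    string = "1511\n125dadfadf2598"

    # ? matches zero or one occurrence of the preceding expression
    m = re.findall(r"\d?", string)    # match zero or one digit
    print(m)

Output:

    ['1', '5', '1', '1', '', '1', '2', '5', '', '', '', '', '', '', '', '2', '5', '9', '8', '']

The question mark (?) matches zero or one occurrence. Here each match is at most one digit, and wherever there is no digit the pattern matches zero digits and yields an empty string ("").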
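The other role of ?, turning a greedy quantifier into a non-greedy one, is not shown in the example above. A minimal sketch of the difference, using an invented snippet of markup:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* grabs as much as possible, so the match runs to the last </b>.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? grabs as little as possible, so each element matches separately.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```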

（9）{n} matches exactly n repetitions of the preceding expression

（10）{n,m} matches from n to m repetitions of the preceding expression
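These two quantifiers have no example of their own, so here is a short sketch. The sample string is made up, and \b (a word boundary) is added so that a longer run of digits is not matched in part:

```python
import re

string = "1 22 333 4444 55555"

# {3}: exactly three digits between word boundaries.
exactly_three = re.findall(r"\b\d{3}\b", string)
# {2,4}: between two and four digits between word boundaries.
two_to_four = re.findall(r"\b\d{2,4}\b", string)

print(exactly_three)  # ['333']
print(two_to_four)    # ['22', '333', '4444']
```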

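（11）\w matches letters, digits, and the underscore

\w matches the letters, digits, and underscores in a string. For example:

    import re
    string = "1511\n125dadfadf2598"

    # \w matches a single letter, digit, or underscore
    m = re.findall(r"\w", string)    # match every word character, one at a time
    print(m)

Output:

    ['1', '5', '1', '1', '1', '2', '5', 'd', 'a', 'd', 'f', 'a', 'd', 'f', '2', '5', '9', '8']

The output shows that \w picks out the letters and digits in the string; the newline is not matched.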

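（12）\W Uppercase \W matches any character that is not a letter, digit, or underscore; it is the exact opposite of lowercase \w

For example:

    import re
    string = "1511\n125dadfadf2598"

    # \W matches any character that is not a letter, digit, or underscore
    m = re.findall(r"\W", string)
    print(m)

Output:

    ['\n']

In the code above, \W matches the characters that are not letters or digits, so the only thing it finds is the newline.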
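One detail that is easy to miss is that \w also matches the underscore, so \W does not. A small sketch with an invented identifier-like string:

```python
import re

text = "user_1-name"

word_chars = re.findall(r"\w", text)   # letters, digits, and the underscore
non_word = re.findall(r"\W", text)     # everything else, here only the hyphen

print("".join(word_chars))  # user_1name
print(non_word)             # ['-']
```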

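（13）\s matches any whitespace character; for ASCII text it is equivalent to [ \t\n\r\f\v]

For example:

    import re
    string = "1511\n125d\ta\rdf\fadf2598"

    # \s matches any whitespace character, such as space, \t, \n, \r, \f
    m = re.findall(r"\s", string)
    print(m)

Output:

    ['\n', '\t', '\r', '\x0c']

The output shows that \s matches whitespace characters: the newline, tab, carriage return, and form feed (printed as '\x0c') were all picked out.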

（14）\S matches any non-whitespace character

For example:

    import re
    string = "1511\n125d\ta\rdf\fadf2598"

    # \S matches any non-whitespace character
    m = re.findall(r"\S", string)
    print(m)

Output:

    ['1', '5', '1', '1', '1', '2', '5', 'd', 'a', 'd', 'f', 'a', 'd', 'f', '2', '5', '9', '8']

The output shows that \S matches every character that is not whitespace.

（15）\d matches any digit; for ASCII text it is equivalent to [0-9]

（16）\D matches any non-digit character
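Neither of these has an example of its own, so here is a brief sketch that reuses the string from the earlier sections:

```python
import re

string = "1511\n125dadfadf2598"

digits = re.findall(r"\d", string)       # every digit, one per match
non_digits = re.findall(r"\D", string)   # everything else, including the newline

print("".join(digits))            # 15111252598
print(repr("".join(non_digits)))  # '\ndadfadf'
```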

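Summary: findall() and split() both return lists. split() treats the matches as separators and returns the text between them, while findall() returns the matches themselves. They are exact opposites.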