Full error output when the bug is triggered (unrelated local paths are masked with ****):
**************\lib\site-packages\bs4\builder\_htmlparser.py:78: UserWarning: unknown status keyword 'end ' in marked section
  warnings.warn(msg)
Traceback (most recent call last):
  File "**************/test.py", line 5, in <module>
    bs = BeautifulSoup(html, 'html.parser')
  File "**************\lib\site-packages\bs4\__init__.py", line 281, in __init__
    self._feed()
  File "**************\lib\site-packages\bs4\__init__.py", line 342, in _feed
    self.builder.feed(self.markup)
  File "**************\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
    parser.feed(markup)
  File "D:\Program Files\Python37\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "D:\Program Files\Python37\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "D:\Program Files\Python37\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "D:\Program Files\Python37\lib\_markupbase.py", line 160, in parse_marked_section
    if not match:
UnboundLocalError: local variable 'match' referenced before assignment
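When parsing HTML, the bug is triggered if a section opens with a browser-detection conditional comment of the form <!-[if IE eq 9]> and ends with the closing tag <![end if]-> (the correct opening and closing tags are <!--[if IE 9]> and <![endif]-->), so that the section cannot be matched and closed properly.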
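To make the difference between the two closing tags concrete, the sketch below imitates how a keyword might be read after the `<![` prefix. The pattern `NAME` and the helper `section_keyword` are hypothetical stand-ins written for this illustration and are not the standard library's own code; the `KNOWN` set simply lists the keywords accepted by parse_marked_section.

```python
import re

# Hypothetical name pattern: a letter, more name characters, optional trailing whitespace.
NAME = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*')

# Keywords accepted by the two branches of parse_marked_section.
KNOWN = {"temp", "cdata", "ignore", "include", "rcdata", "if", "else", "endif"}

def section_keyword(markup):
    """Return the lowercased keyword that follows '<![', or None if there is none."""
    if not markup.startswith('<!['):
        return None
    m = NAME.match(markup, 3)
    return m.group().strip().lower() if m else None

for tag in ('<![endif]-->', '<![end if]->'):
    kw = section_keyword(tag)
    print(tag, '->', repr(kw), 'known' if kw in KNOWN else 'unknown')
```

Under these assumptions, the well-formed tag yields the keyword 'endif', while the malformed one yields 'end', which is in neither keyword set.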
Sample code that triggers the bug:
from bs4 import BeautifulSoup
html = """
<!-[if IE eq 9]>
<a href="https://www.shwww.net/">https://www.shwww.net/</a>
<![end if]->
"""
bs = BeautifulSoup(html, 'html.parser')
In Python 3.7.0, the code that triggers the bug is in \Lib\_markupbase.py, in the parse_marked_section method at line 146. The method reads as follows:
https://github.com/python/cpython/blob/bb9ddee3d4e293f0717f8c167afdf5749ebf843d/Lib/_markupbase.py#L160
def parse_marked_section(self, i, report=1):
    rawdata= self.rawdata
    assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
    sectName, j = self._scan_name( i+3, i )
    if j < 0:
        return j
    if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
        # look for standard ]]> ending
        match= _markedsectionclose.search(rawdata, i+3)
    elif sectName in {"if", "else", "endif"}:
        # look for MS Office ]> ending
        match= _msmarkedsectionclose.search(rawdata, i+3)
    else:
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
    if not match:
        return -1
    if report:
        j = match.start(0)
        self.unknown_decl(rawdata[i+3: j])
    return match.end(0)
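Because the malformed HTML is not closed correctly, the section keyword is not a recognized one (the warning shows it as 'end '), so control flow enters neither the if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}: branch nor the elif sectName in {"if", "else", "endif"}: branch. It falls through to else, where the error is only reported as UserWarning: unknown status keyword 'end ' in marked section (through warnings.warn(msg)) rather than stopping execution. The code then continues to if not match, and because match was never assigned on this path, the UnboundLocalError is raised.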
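The same failure mode can be reproduced without any HTML. The following is a minimal sketch, assuming an error handler that only warns and then returns (as the UserWarning in the traceback suggests); `report_error`, `classify` and `triggers_unbound_local` are hypothetical names used only here.

```python
import warnings

def report_error(msg):
    # Assumption: the handler warns and returns instead of raising.
    warnings.warn(msg)

def classify(keyword):
    if keyword in {"cdata", "include"}:
        match = "standard"
    elif keyword in {"if", "endif"}:
        match = "ms-office"
    else:
        report_error("unknown status keyword %r" % keyword)
    # `match` is bound only in the first two branches, so reaching this
    # line through the else branch reads an unassigned local variable.
    if not match:
        return -1
    return match

def triggers_unbound_local(keyword):
    """Return True if classify() fails with UnboundLocalError for this keyword."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        try:
            classify(keyword)
        except UnboundLocalError:
            return True
    return False
```

Calling `triggers_unbound_local("end")` exercises the else branch, while a recognized keyword such as "endif" takes a branch that assigns `match` first.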
This bug exists in multiple Python versions. The fix is to predefine a match variable before if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
https://github.com/python/cpython/blob/bb9ddee3d4e293f0717f8c167afdf5749ebf843d/Lib/_markupbase.py#L152
def parse_marked_section(self, i, report=1):
    rawdata= self.rawdata
    assert rawdata[i:i+3] == '<![', "unexpected call to parse_marked_section()"
    sectName, j = self._scan_name( i+3, i )
    if j < 0:
        return j
    match = None
    if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
        # look for standard ]]> ending
        match= _markedsectionclose.search(rawdata, i+3)
    elif sectName in {"if", "else", "endif"}:
        # look for MS Office ]> ending
        match= _msmarkedsectionclose.search(rawdata, i+3)
    else:
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
    if not match:
        return -1
    if report:
        j = match.start(0)
        self.unknown_decl(rawdata[i+3: j])
    return match.end(0)
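As a rough check of the patched control flow, here is a standalone sketch of the same logic outside the parser class. The two regular expressions and the function `find_section_end` are assumptions made for illustration, and the error call is replaced by a plain warning.

```python
import re
import warnings

# Assumed closing patterns: "]]>" for standard sections, "]>" for MS Office ones.
STANDARD_CLOSE = re.compile(r']\s*]\s*>')
MS_CLOSE = re.compile(r']\s*>')

def find_section_end(rawdata, keyword, start=0):
    """Return the index just past the section's closing marker, or -1 if none is found."""
    match = None  # bound up front, so every branch leaves it defined
    if keyword in {"temp", "cdata", "ignore", "include", "rcdata"}:
        match = STANDARD_CLOSE.search(rawdata, start)
    elif keyword in {"if", "else", "endif"}:
        match = MS_CLOSE.search(rawdata, start)
    else:
        warnings.warn("unknown status keyword %r in marked section" % keyword)
    if not match:
        return -1
    return match.end(0)
```

With `match` initialized to None, an unknown keyword such as 'end' takes the `return -1` path instead of raising UnboundLocalError.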