I've been using Python for a while and have long wanted to write a crawler, but with finals approaching I really had no time, so I only put together a demo. It occasionally throws errors, but it does run and can download a few hundred images without trouble; the remaining issues will probably have to wait until the holidays. I'm posting the code here for discussion, and suggestions are very welcome.
Getting to the point.
I referred to 純潔的微笑's blog when writing this crawler, and the idea is basically the same. Here is his post: http://www.cnblogs.com/ityouknow/p/6013074.html
My code is as follows:
from bs4 import BeautifulSoup
import re
import os
import requests
import json
import time
import OpenSSL
path = "D:\\"  # base output directory; it was not defined in the original snippet, change it to wherever the images should be saved
mainsite = "http://1024-url-not-posted.com/"  # the real 1024 forum address is deliberately left out

def getbs(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Referer": "http://t66y.com//thread0806.php?fid=16&search=&page=1",
        "Host": "t66y.com"
    }
    req = requests.get(url, headers=header)
    req.encoding = "gbk"  # the 1024 image threads are encoded in gbk; without setting this the text comes back garbled
    bsobj = BeautifulSoup(req.text, "html5lib")
    return bsobj
def getallpage(start, end):
    urls = []
    for i in range(start, end + 1):
        url = "http://address-redacted/thread0806.php?fid=16&search=&page={}".format(str(i))
        bsobj = getbs(url)
        urls += bsobj.find_all("a", {"href": re.compile("^htm_data.*")})
    return urls
def getpicofpage(url):
    bsobj = getbs(url)
    div = bsobj.find("div", {"class": "tpc_content do_not_catch"})
    if div is None:
        print("Could not get the content, skipping")
        return -1
    inputs = div.find_all("input")
    title = bsobj.find("h4").text
    if inputs == []:
        print("No images on this page, skipping")
        return -1
    num = 1
    savedir = os.path.join(path, "new", "tupian", title)
    if not os.path.exists(savedir):
        os.makedirs(savedir)  # makedirs also creates the intermediate new\tupian directories if they are missing
    else:
        print("This folder already exists, skipping")
        return -1
    for i in inputs:
        try:  # this is where the problems mainly occur
            res = requests.get(i["src"], timeout=25)
            # include num in the file name so images fetched within the same second do not overwrite each other
            with open(os.path.join(savedir, str(time.time())[:10] + "_" + str(num) + ".jpg"), "wb") as f:
                f.write(res.content)
        except requests.exceptions.Timeout:  # some images time out; without a timeout the request may hang forever
            print("Timed out, skipping this page")
            return -1
        except OpenSSL.SSL.WantReadError:  # also a problem: this exception sometimes appears, but it cannot be caught here; I still haven't figured out what it is
            print("OpenSSL.SSL.WantReadError, skipping")
            return -1
        print(num)
        num += 1
l = getallpage(5, 10)
page = 1
ed = []
for i in l:
    url = mainsite + i["href"]
    if url in ed:
        print(url + " has already been collected, skipping")
        continue
    print(url)
    getpicofpage(url)
    ed.append(url)
    print("Finished collecting page {}".format(page))
    page += 1
    time.sleep(3)
Here is the SSL exception mentioned above:
Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 441, in wrap_socket
    cnx.do_handshake()
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1806, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "D:\python\Lib\site-packages\OpenSSL\SSL.py", line 1521, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "D:\python\Lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "D:\python\Lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "D:\python\Lib\site-packages\urllib3\contrib\pyopenssl.py", line 445, in wrap_socket
    raise timeout('select timed out')
socket.timeout: select timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\python\Lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "D:\python\Lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\python\Lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\PyCharm 2017.3.1\helpers\pydev\pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "D:\PyCharm 2017.3.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/learnPython/crawler/crawler.py", line 301, in <module>
    getpicofpage(url)
  File "D:/learnPython/crawler/crawler.py", line 281, in getpicofpage
    res = requests.get(i["src"],timeout=25)
  File "D:\python\Lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\python\Lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\python\Lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "D:\python\Lib\site-packages\requests\adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.srimg.com', port=443): Max retries exceeded with url: /u/20180104/11315126.jpg (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out',)))
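Judging from this traceback, the WantReadError is raised deep inside urllib3/pyOpenSSL and already handled there; what finally propagates out of requests.get is requests.exceptions.ProxyError (the last exception in the chain), which would explain why the except OpenSSL.SSL.WantReadError clause in getpicofpage never sees it. A minimal sketch of catching it at the requests level instead (download_one is a hypothetical helper, not part of the crawler above):

import requests

def download_one(src, dest, timeout=25):
    # Hypothetical helper: fetch a single image and report failure instead of raising.
    try:
        res = requests.get(src, timeout=timeout)
        res.raise_for_status()
    except requests.exceptions.RequestException as e:
        # RequestException is the common base class of Timeout, ProxyError, SSLError, etc.,
        # so the wrapped WantReadError (surfacing here as ProxyError) is caught as well.
        print("download failed: {}".format(e))
        return False
    with open(dest, "wb") as f:
        f.write(res.content)
    return True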
One more thing: even though I had a VPN on, crawling directly got no content at all, with the host simply not responding. I later found that it works as soon as Fiddler is open. I guess it's an IP issue; I haven't looked into it carefully yet, so please don't hesitate to share advice on this either.
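One possible explanation for "it only works while Fiddler is running" is that requests picks up the system proxy that Fiddler registers, and that local proxy (or the VPN behind it) is what actually reaches the site. A small sketch of pointing requests at the local proxy explicitly instead of relying on the system setting, assuming Fiddler is listening on its default address 127.0.0.1:8888 (adjust the port if yours differs):

import requests

# Assumption: a local proxy (Fiddler by default) is listening on 127.0.0.1:8888.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
res = requests.get("http://t66y.com/thread0806.php?fid=16&search=&page=1",
                   proxies=proxies, timeout=25)
print(res.status_code, len(res.content))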