Python Web Scraping (Fetching Sina News)


Preparation before scraping:

Install BeautifulSoup: pip install BeautifulSoup4

Install requests: pip install requests

Install Jupyter Notebook: pip install jupyter notebook

Install Python and set up the environment (Anaconda is an option; it ships with many Python modules)

JSON

Definition: a text format used for data exchange.

JavaScript object

Definition: a reference type in JavaScript.
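To make the distinction concrete on the Python side, here is a minimal sketch: JSON is plain text, and json.loads turns it into a Python dict, which plays the role a JavaScript object plays in the browser. The sample string is invented for illustration.

```python
import json

# A JSON document is just text (sample data, invented for this example).
json_text = '{"title": "sample news", "comments": 12, "tags": ["a", "b"]}'

# json.loads parses the text into Python objects (dict, list, str, int, ...).
data = json.loads(json_text)
print(type(data).__name__)
print(data["comments"] + 1)  # numbers are real numbers after parsing

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(data, ensure_ascii=False)
print(round_trip)
```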

Besides 'utf-8', the encodings you may run into when scraping Chinese pages include 'GBK', 'GB2312' and 'ISO-8859-1'.
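Why the encoding matters can be shown without any network access. The sketch below encodes a short sample string two ways and then decodes it with the wrong codec; in requests, setting res.encoding before reading res.text chooses the codec in the same way.

```python
text = "新浪新闻"  # sample string

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

# The same four characters take a different number of bytes in each encoding.
print(len(utf8_bytes), len(gbk_bytes))

# Decoding with the matching codec recovers the original text.
print(gbk_bytes.decode("gbk") == text)

# Decoding with the wrong codec does not fail here, but the result is garbage.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled == text)
```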

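requests fetches the content of a web page.

BeautifulSoup turns that content into an object you can query.

soup = BeautifulSoup(res.text, 'html.parser')
# Convert the page text fetched by requests into a BeautifulSoup object stored in soup; naming the parser ('html.parser') explicitly avoids a warning.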

BeautifulSoup's select method returns the matching elements as a list, so a for loop can process them one by one.

alink = soup.select('h1')

for link in alink:
    print(link.text)

Once you have a tag, ['href'] returns the value of its href attribute, for example:

for link in soup.select('a'):
    print(link['href'])
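The two snippets above can be tried without a network request. This is a minimal sketch on invented inline HTML; it also shows one thing to watch for: link['href'] raises KeyError for an a tag that has no href, while link.get('href') returns None.

```python
from bs4 import BeautifulSoup

# Inline sample HTML (invented) so the example runs offline.
html = """
<h1>Sample headline</h1>
<a href="http://news.sina.com.cn/">home</a>
<a name="top">anchor without href</a>
"""

soup = BeautifulSoup(html, "html.parser")

# select returns a list-like result, so a comprehension works on it.
headlines = [h.text for h in soup.select("h1")]
print(headlines)

# Use .get so that links without an href are skipped instead of raising KeyError.
hrefs = [a.get("href") for a in soup.select("a") if a.get("href")]
print(hrefs)
```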

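Getting the news ID:

* .strip() removes leading and trailing whitespace; given a string argument, it removes any of the characters in that string from both ends (a set of characters, not the string as a whole). rstrip() does the same on the right side only and lstrip() on the left side only;

* split('/') cuts a string at the given separator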
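The two string methods can be combined to pull the ID out of an article URL. The sketch below uses the article URL that appears later in this post; extract_news_id is a hypothetical helper that slices off the 'doc-i' prefix and '.shtml' suffix rather than relying on lstrip/rstrip, because those remove character sets and can eat part of the ID.

```python
newsurl = "http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml"

# split('/') cuts the URL at every slash; the last piece is the file name.
filename = newsurl.split("/")[-1]
print(filename)

# Pitfall: rstrip with an argument removes a set of characters, not a suffix.
pitfall = "smith.shtml".rstrip(".shtml")
print(pitfall)  # more than ".shtml" has been removed

def extract_news_id(url):
    """Hypothetical helper: return the text between 'doc-i' and '.shtml'."""
    name = url.split("/")[-1]
    if name.startswith("doc-i") and name.endswith(".shtml"):
        return name[len("doc-i"):-len(".shtml")]
    return None

news_id = extract_news_id(newsurl)
print(news_id)
```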

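Using regular expressions (re):

import re

m = re.search(r'doc-i(.*)\.shtml', newsurl)  # returns a match object for the first match in newsurl (None if nothing matches)
print(m.group(1))  # group(0) is the whole matched text; group(1) is only the part inside the parentheses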
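A self-contained version of the regular-expression approach, as a sketch: it runs on the article URL used later in this post, and find_news_id is a hypothetical helper that returns None instead of failing when the URL does not match.

```python
import re

newsurl = "http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml"

# Raw string; the dot before "shtml" is escaped so it matches a literal dot.
pattern = re.compile(r"doc-i(.+)\.shtml")

def find_news_id(url):
    """Hypothetical helper: return the captured ID, or None when nothing matches."""
    match = pattern.search(url)
    return match.group(1) if match else None

m = pattern.search(newsurl)
print(m.group(0))  # the whole match
print(m.group(1))  # only the part captured by the parentheses
print(find_news_id("http://news.sina.com.cn/"))  # URL without an article ID
```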

Using a for loop to build the links for multiple pages of news

url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'

for i in range(0, 10):
    print(url.format(i))
# format replaces the curly braces in url (we deleted the part that changes and put braces in its place) with the value we pass in (i in the code above)
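Two details of this pattern are easy to get wrong: range excludes its stop value, and a literal brace inside a format string has to be doubled. A short sketch with a hypothetical, shortened URL template (example.com is a placeholder, not the real API):

```python
# Hypothetical, shortened URL template; the {} marks the page number.
url_template = "http://example.com/list?format=json&page={}"

# range(1, 4) yields 1, 2, 3: the stop value is not included.
page_urls = [url_template.format(page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)

# A literal brace inside a format string must be written twice.
escaped = "{{page}} = {}".format(2)
print(escaped)
```

Note that range(0, 10) as used above requests pages 0 through 9, while the complete code later in this post uses range(1, 10), which requests pages 1 through 9.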

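Getting the publication time:

The selected element may contain more than we want, such as the publisher or other elements we do not need. contents splits the element's children into a list, and contents[0] gives the first one, here the time text.

# Get the publication time
import requests
from bs4 import BeautifulSoup
from datetime import datetime

res = requests.get('http://news.sina.com.cn/c/nd/2017-07-22/doc-ifyihrmf3191202.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
timesource = soup.select('.time-source')
print(timesource[0].contents[0])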
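What contents does is easier to see on a small piece of markup. The HTML below is invented, modelled on the description above (time text followed by a nested source link); the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Invented markup: a text node followed by a nested tag.
html = (
    '<span class="time-source">2017年07月22日09:58 '
    '<span><a href="http://example.com/">Sample Source</a></span></span>'
)

soup = BeautifulSoup(html, "html.parser")
node = soup.select(".time-source")[0]

print(node.text)           # text of the element plus everything nested in it
print(len(node.contents))  # direct children: one text node and one tag

# contents[0] is only the leading text node; strip() removes the stray space.
time_text = node.contents[0].strip()
source_name = node.select("span a")[0].text
print(time_text)
print(source_name)
```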

Converting time strings

# String to datetime: strptime
dt = datetime.strptime(timesource[0].contents[0].strip(), '%Y年%m月%d日%H:%M')

# datetime to string: strftime
dt.strftime('%Y-%m-%d')
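strptime requires the format to match the string exactly, so stray spaces or a full-width colon raise ValueError. A minimal sketch on an invented sample value that normalises the text before parsing:

```python
from datetime import datetime

# Sample value (invented) with stray spaces and a full-width colon.
raw = " 2017年07月22日09：58 "

# Normalise before parsing: trim whitespace, map the full-width colon to ASCII.
cleaned = raw.strip().replace("：", ":")

dt = datetime.strptime(cleaned, "%Y年%m月%d日%H:%M")
print(dt.year, dt.month, dt.day, dt.hour, dt.minute)

formatted = dt.strftime("%Y-%m-%d %H:%M")
print(formatted)
```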

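Getting the article body:

Inspect the page to find the class that holds the body, then use select as above. The result is a list; a for loop can strip the tags from each item and append the text to a list of your own (e.g. article = []).

* '\n'.join(article) joins the items of the article list with a newline '\n' between them;

# Get the text of a single article
article = []
for p in soup.select('.article p'):
    article.append(p.text.strip())
print('\n'.join(article))

The code above can be written in one line:

# The same thing in one line
print('\n'.join([p.text.strip() for p in soup.select('.article p')]))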

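Getting the comment count: (the comment data is delivered to the browser as JavaScript, so the "var data=" prefix has to be removed before the rest can be parsed as JSON into a Python dictionary)

# Get the number of comments
import requests
import json
comment = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-fyihrmf3218511&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20')  # fetch the data from the comment URL
comment.encoding = 'utf-8'
# strip('var data=') removes a set of characters rather than the prefix, so split on the first '=' instead
jd = json.loads(comment.text.split('=', 1)[1])
jd['result']['count']['total']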
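Both endpoints in this post wrap their JSON in JavaScript: one as "var data=...", the other as a newsloadercallback(...) call. A sketch of one way to unwrap either form is to cut out the outermost braces. unwrap_json is a hypothetical helper, and the sample responses are invented to mimic those two shapes.

```python
import json

def unwrap_json(text):
    """Hypothetical helper: cut out the outermost {...} and parse it as JSON."""
    start = text.index("{")
    end = text.rindex("}")
    return json.loads(text[start:end + 1])

# Invented sample responses shaped like the two wrappers.
comment_js = 'var data={"result": {"count": {"total": 42}}}'
list_js = '  newsloadercallback({"result": {"data": [{"url": "http://example.com/a"}]}});'

total = unwrap_json(comment_js)["result"]["count"]["total"]
first_url = unwrap_json(list_js)["result"]["data"][0]["url"]
print(total)
print(first_url)
```

This assumes the payload is a JSON object; a response whose top level is an array would need a different cut.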

Complete code (using Sina News as the example):

# Get the title, body, time and comment count of each news article
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import json
import pandas

def getNewsdetial(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsTitle = soup.select('.page-header h1')[0].text.strip()
    nt = datetime.strptime(soup.select('.time-source')[0].contents[0].strip(), '%Y年%m月%d日%H:%M')
    newsTime = datetime.strftime(nt, '%Y-%m-%d %H:%M')
    newsArticle = getnewsArticle(soup.select('.article p'))
    newsAuthor = newsArticle[-1]  # last paragraph, stored below as the editor
    return newsTitle, newsTime, newsArticle, newsAuthor

def getnewsArticle(news):
    newsArticle = []
    for p in news:
        newsArticle.append(p.text.strip())
    return newsArticle

# Get the comment count

def getCommentCount(newsurl):
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    commenturl = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'
    comment = requests.get(commenturl.format(newsid))   # the part that changes is replaced by braces; format puts newsid in their place
    jd = json.loads(comment.text.split('=', 1)[1])   # drop the 'var data=' prefix
    return jd['result']['count']['total']


def getNewsLinkUrl():
    # Get the news URLs that are loaded asynchronously (i.e. the URLs from every list page)
    urlFormat = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1501000415111'
    url = []
    for i in range(1, 10):
        res = requests.get(urlFormat.format(i))
        text = res.text.strip()
        jd = json.loads(text[text.index('(') + 1:text.rindex(')')])   # unwrap newsloadercallback(...)
        url.extend(getUrl(jd))     # extend adds each URL to the list; append would add the whole list as one item
    return url

def getUrl(jd):
    # Get the news URLs on one list page
    url = []
    for i in jd['result']['data']:
        url.append(i['url'])
    return url

# Collect the time, editor, body, title and comment count of every article into total_2
def getNewsDetial():
    title_all = []
    author_all = []
    commentCount_all = []
    article_all = []
    time_all = []
    url_all = getNewsLinkUrl()
    for url in url_all:
        newsTitle, newsTime, newsArticle, newsAuthor = getNewsdetial(url)   # fetch each article page only once
        title_all.append(newsTitle)
        time_all.append(newsTime)
        article_all.append(newsArticle)
        author_all.append(newsAuthor)
        commentCount_all.append(getCommentCount(url))
    total_2 = {'a_title':title_all,'b_article':article_all,'c_commentCount':commentCount_all,'d_time':time_all,'e_editor':author_all}
    return total_2

# (Entry point) Use pandas to process the data and save it as an Excel file

df = pandas.DataFrame(getNewsDetial())
df.to_excel('news2.xlsx')
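The comment in getNewsLinkUrl mentions the difference between extend and append. A small sketch with invented sample URLs shows why extend is the right call when merging the per-page lists:

```python
# Invented sample URLs standing in for two pages of results.
page_1 = ["http://example.com/a", "http://example.com/b"]
page_2 = ["http://example.com/c"]

# extend adds the items of each list, giving one flat list of URLs.
flat = []
flat.extend(page_1)
flat.extend(page_2)
print(flat)

# append adds each list as a single item, giving a list of lists.
nested = []
nested.append(page_1)
nested.append(page_2)
print(nested)
```

With append, the later loop over the URLs would receive whole lists instead of URL strings.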

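The saved Excel file, news2.xlsx, has one row per article, with the columns a_title, b_article, c_commentCount, d_time and e_editor.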

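TIPS:

Problem: importing pandas in Jupyter Notebook may fail with an import error

Solution: do not start Jupyter Notebook from the command line; launch the application directly or open it from Anaconda Navigator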
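The post does not say what the error was. One common cause, assumed here, is that the notebook runs a different Python interpreter from the one pandas was installed into. A quick check is to run the following in a notebook cell and again in the command-line Python, then compare the results:

```python
import importlib.util
import sys

# Path of the Python interpreter running this code.
interpreter_path = sys.executable
print(interpreter_path)

# find_spec returns None when this interpreter cannot find the module.
pandas_found = importlib.util.find_spec("pandas") is not None
print("pandas importable:", pandas_found)
```

If the two paths differ and only one of them finds pandas, installing pandas into the interpreter the notebook uses is another way around the problem.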

2017-07-29 21:49:37