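This is my first time using Python, so the code is very simple. The development tools are PyCharm and Python 3.4, which are very convenient.
Some Python modules need other dependent modules when they are installed. You can install wheel first: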
pip install wheel
Then download the .whl file and install it directly:
pip install lxml-3.5.0-cp34-none-win32.whl
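The script later in this post passes "lxml" to BeautifulSoup, so it fails if the wheel did not install correctly. A minimal sketch of a check, using only the standard library: `pick_parser` is a hypothetical helper name, and falling back to Python's built-in "html.parser" is an assumption of mine, not something the original script does.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return the name of a parser that can be used in this environment."""
    # find_spec returns None when the module cannot be imported
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


if __name__ == "__main__":
    print("parser:", pick_parser())
```

The returned name can be passed as the second argument of BeautifulSoup(...) in place of the hard-coded "lxml".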
Define a class that holds the fields to be saved:

class CnblogArticle:
    def __init__(self):
        self.num = ''
        self.category = ''
        self.title = ''
        self.author = ''
        self.postTime = ''
        self.articleComment = ''
        self.articleView = ''
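The CSDN blog channel has only 18 pages, so the script parses 18 pages. The main function contains both a multithreaded version (the commented-out part) and a plain sequential version.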
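As an alternative to starting one thread per page, the 18 pages could be handed to a small thread pool. This is only a sketch under assumptions: `page_urls` and `crawl` are hypothetical names, the worker count of 4 is arbitrary, and `parse_page` stands for any callable with the same (url, page_num) arguments as parsePage.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'http://blog.csdn.net/?&page={}'  # URL pattern used by the script
PAGE_COUNT = 18


def page_urls(count=PAGE_COUNT):
    """Build (url, page_num) pairs the same way main() does."""
    return [(BASE_URL.format(i), i + 1) for i in range(count)]


def crawl(parse_page, workers=4):
    """Run parse_page(url, page_num) for every page on a small thread pool."""
    jobs = page_urls()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in page order; an exception raised in a
        # worker is re-raised here when its result is consumed
        return list(pool.map(lambda job: parse_page(*job), jobs))


if __name__ == "__main__":
    # Stand-in for CnblogUtils().parsePage so the sketch runs offline
    print(crawl(lambda url, page_num: page_num))
```

A bounded pool keeps the number of simultaneous requests small, which is gentler on the site than 18 connections at once.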
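Note: each item is delimited by class=blog_list. Most items contain a class=category link but a few do not, so the code has to handle both cases or it will raise an error. The two samples below show an item with a category and one without.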
<div class="blog_list">
    <h1>
        <a href="/other/index.html" class="category">[綜合]</a>
        <a name="49786427" href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank">Python: scikit-image canny 邊緣檢測</a>
        <img src="http://static.blog.csdn.net/images/icon-zhuanjia.gif" class="blog-icons" alt="專家" title="專家">
    </h1>
    <dl>
        <dt>
            <a href="http://blog.csdn.net/matrix_space" target="_blank">
                <img src="http://avatar.csdn.net/F/9/7/3_shinian1987.jpg" alt="shinian1987" />
            </a>
        </dt>
        <dd>這個用例說明canny 邊緣檢測的用法 import numpy as np import matplotlib.pyplot as plt from scipy import ndimage as ndi from skimage import feature # Generate noisy image of a square im = np.zeros((128, 128)) im[3...</dd>
    </dl>
    <p>
        <a class="tag" href="/tag/details.html?tag=python" target="_blank">python</a>
    </p>
    <div class="about_info">
        <span class="fr digg" id="digg_49786427" blog="1164951" digg="0" bury="0"></span>
        <span class="fl">
            <a href="http://blog.csdn.net/matrix_space" target="_blank" class="user_name">shinian1987</a>
            <span class="time">3小時前</span>
            <a href="http://blog.csdn.net/matrix_space/article/details/49786427" target="_blank" class="view">閱讀(104)</a>
            <a href="http://blog.csdn.net/matrix_space/article/details/49786427#comments" target="_blank" class="comment">評論(0)</a>
        </span>
    </div>
</div>

<div class="blog_list">
    <h1>
        <a name="50524490" href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank">STL_演算法 for_each 和 transform 比較</a>
    </h1>
    <dl>
        <dt>
            <a href="http://blog.csdn.net/u010579068" target="_blank">
                <img src="http://avatar.csdn.net/9/9/B/3_u010579068.jpg" alt="u010579068" />
            </a>
        </dt>
        <dd>C++ Primer 學習中。。。   簡單記錄下我的學習過程 (代碼為主) 所有容器適用 /**---------------------------------------------------------------------------------- for_each                    速度快              ...</dd>
    </dl>
    <p>
        <a class="tag" href="/tag/details.html?tag=STL_演算法" target="_blank">STL_演算法</a>
        <a class="tag" href="/tag/details.html?tag=for_each" target="_blank">for_each</a>
        <a class="tag" href="/tag/details.html?tag=transform" target="_blank">transform</a>
        <a class="tag" href="/tag/details.html?tag=STL" target="_blank">STL</a>
    </p>
    <div class="about_info">
        <span class="fr digg" id="digg_50524490" blog="1499803" digg="0" bury="0"></span>
        <span class="fl">
            <a href="http://blog.csdn.net/u010579068" target="_blank" class="user_name">u010579068</a>
            <span class="time">3小時前</span>
            <a href="http://blog.csdn.net/u010579068/article/details/50524490" target="_blank" class="view">閱讀(149)</a>
            <a href="http://blog.csdn.net/u010579068/article/details/50524490#comments" target="_blank" class="comment">評論(0)</a>
        </span>
    </div>
</div>
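The two items above differ only in whether the class=category link is present. One way to handle both, sketched here on a shortened copy of that markup: look the link up directly and treat a missing result as "no category". The name `extract_category_and_title` is hypothetical, and using the built-in "html.parser" instead of lxml is my assumption so that the sketch needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Shortened copies of the two sample items (only the <h1> part)
SAMPLE = """
<div class="blog_list"><h1>
  <a href="/other/index.html" class="category">[綜合]</a>
  <a name="49786427" href="http://blog.csdn.net/matrix_space/article/details/49786427">Python: scikit-image canny 邊緣檢測</a>
</h1></div>
<div class="blog_list"><h1>
  <a name="50524490" href="http://blog.csdn.net/u010579068/article/details/50524490">STL_演算法 for_each 和 transform 比較</a>
</h1></div>
"""


def extract_category_and_title(item):
    """Return (category, title) for one blog_list item; category may be None."""
    category_link = item.find('a', 'category')         # None when the link is absent
    title_link = item.find('a', attrs={'name': True})  # the title link has a name attribute
    category = category_link.get_text(strip=True) if category_link else None
    title = title_link.get_text(strip=True) if title_link else None
    return category, title


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = [extract_category_and_title(item) for item in soup.find_all('div', 'blog_list')]
```

Because find returns None for a missing tag, checking the result before reading its text avoids the AttributeError that an item without a category would otherwise cause.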
The Beautiful Soup 4.2.0 documentation can be read directly on the official site.
# -*- coding: utf-8 -*-
import threading
import time
import urllib.request

from bs4 import BeautifulSoup


class CnblogUtils(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        self.contentAll = set()       # distinct author names
        self.lock = threading.Lock()  # guards contentAll and a.txt when threads are used

    def getPage(self, url=None):
        # Download one list page and parse it with the lxml parser
        request = urllib.request.Request(url, headers=self.headers)
        with urllib.request.urlopen(request) as response:
            return BeautifulSoup(response.read(), "lxml")

    def parsePage(self, url=None, page_num=None):
        soup = self.getPage(url)
        itemBlog = soup.find_all('div', 'blog_list')
        for i, itemSingle in enumerate(itemBlog):
            # A fresh CnblogArticle (the class defined above) for every item
            cnArticle = CnblogArticle()
            cnArticle.num = i
            cnArticle.author = itemSingle.find('a', 'user_name').string
            cnArticle.postTime = itemSingle.find('span', 'time').string
            cnArticle.articleComment = itemSingle.find('a', 'comment').string
            cnArticle.articleView = itemSingle.find('a', 'view').string
            # Only some items carry a class="category" link
            categoryLink = itemSingle.find('a', 'category')
            cnArticle.category = categoryLink.string if categoryLink else "None"
            # The title link is the one with a name attribute in both layouts
            cnArticle.title = itemSingle.find('a', attrs={'name': True}).string
            with self.lock:
                self.contentAll.add(str(cnArticle.author))
                self.writeFile(page_num, cnArticle.num, cnArticle.author, cnArticle.postTime, cnArticle.articleComment, cnArticle.articleView, cnArticle.category, cnArticle.title)

    def writeFile(self, page_num, num, author, postTime, articleComment, articleView, category, title):
        fields = ['page_num is {}'.format(page_num), num, author, postTime, articleComment, articleView, category, title]
        # utf-8 so the Chinese titles are written the same way on every platform
        with open("a.txt", 'a', encoding='utf-8') as f:
            f.write('\t'.join(str(field) for field in fields) + '\n')


def main(thread_num):
    start = time.perf_counter()  # time.clock() no longer exists in Python 3.8+
    cnblog = CnblogUtils()
    # Multithreaded variant: one thread per page
    # thread_list = []
    # for i in range(0, thread_num):
    #     thread_list.append(threading.Thread(target=cnblog.parsePage, args=('http://blog.csdn.net/?&page={}'.format(i), i + 1)))
    # for thread in thread_list:
    #     thread.start()
    # for thread in thread_list:
    #     thread.join()
    # print(cnblog.contentAll)
    for i in range(0, 18):
        cnblog.parsePage('http://blog.csdn.net/?&page={}'.format(i), i + 1)
    end = time.perf_counter()
    print('time = {}'.format(end - start))


if __name__ == '__main__':
    main(18)
Program output:
page_num is 1	0	foruok	18分鐘前	評論(0)	閱讀(0)	[編程語言]	Windows下從源碼編譯SKIA
page_num is 1	1	u013467442	31分鐘前	評論(0)	閱讀(3)	[編程語言]	Cubieboard學習資源
page_num is 1	2	tuke_tuke	32分鐘前	評論(0)	閱讀(15)	[移動開發]	UI組件之AdapterView及其子類關係,Adapter介面及其實現類關係
page_num is 1	3	xiaominghimi	53分鐘前	評論(0)	閱讀(51)	[移動開發]	【COCOS2D-X 備註篇】ASSETMANAGEREX使用異常解決備註->CHECK_JNI/CC‘JAVA.LANG.NOCLASSDEFFOUNDERROR’
page_num is 1	4	shinian1987	1小時前	評論(0)	閱讀(64)	[綜合]	Python: scikit-image canny 邊緣檢測
page_num is 1	5	u010579068	1小時前	評論(0)	閱讀(90)	None	STL_演算法 for_each 和 transform 比較
page_num is 1	6	u013467442	1小時前	評論(0)	閱讀(94)	[編程語言]	OpenGLES2.0著色器語言glsl
page_num is 1	7	u013467442	1小時前	評論(0)	閱讀(89)	[編程語言]	OpenGl 坐標轉換
page_num is 1	8	AaronGZK	1小時前	評論(0)	閱讀(95)	[編程語言]	bzoj4390【Usaco2015 Dec】Max Flow
page_num is 1	9	AaronGZK	1小時前	評論(0)	閱讀(95)	[編程語言]	bzoj1036【ZJOI2008】樹的統計Count
page_num is 1	10	danhuang2012	1小時前	評論(0)	閱讀(90)	[編程語言]	Node.js如何處理健壯性
page_num is 1	11	EbowTang	1小時前	評論(0)	閱讀(102)	[編程語言]	<LeetCode OJ> 121. Best Time to Buy and Sell Stock
page_num is 1	12	cartzhang	2小時前	評論(0)	閱讀(98)	[架構設計]	給虛幻4添加記憶體跟蹤功能
page_num is 1	13	u013595419	2小時前	評論(0)	閱讀(93)	[綜合]	第2章第1節練習題3 共用棧的基本操作
page_num is 1	14	ghostbear	2小時前	評論(0)	閱讀(115)	[系統運維]	Dynamics CRM 2016 Series: Overview
page_num is 1	15	u014723529	2小時前	評論(0)	閱讀(116)	[編程語言]	將由BeanUtils的getProperty方法返回的Date對象的字元串表示還原為對象
page_num is 1	16	Evankaka	2小時前	評論(1)	閱讀(142)	[架構設計]	Jenkins詳細安裝與構建部署使用教程
page_num is 1	17	Evankaka	2小時前	評論(0)	閱讀(141)	[編程語言]	Ubuntu安裝配置JDK、Tomcat、SVN伺服器
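Each line of a.txt can be turned back into typed fields for later processing. A rough sketch, assuming the fields are separated by tabs exactly as writeFile writes them; `Record` and `parse_line` are hypothetical names, and the regular expression only pulls the number out of 評論(0) and 閱讀(104).

```python
import re
from collections import namedtuple

Record = namedtuple('Record', 'page num author post_time comments views category title')

COUNT = re.compile(r'\((\d+)\)')  # the number inside 評論(0) / 閱讀(104)


def parse_line(line):
    """Split one tab-separated line of a.txt into a Record."""
    # maxsplit=7 keeps any stray tab inside the title in the last field
    page, num, author, post_time, comment, view, category, title = line.rstrip('\n').split('\t', 7)
    return Record(
        page=int(page.rsplit(' ', 1)[-1]),  # "page_num is 1" -> 1
        num=int(num),
        author=author,
        post_time=post_time,
        comments=int(COUNT.search(comment).group(1)),
        views=int(COUNT.search(view).group(1)),
        category=None if category == 'None' else category,
        title=title,
    )
```

Converting the counts to integers makes it possible to sort or filter the articles by views or comments.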
When the network is slow, the multithreaded version may raise errors.
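One common way to make a slow network less fatal is to retry a failed download a few times before giving up. The sketch below is an assumption about how that could look, not part of the original script: `with_retries` is a hypothetical helper, and the attempt count and delay are arbitrary. It catches OSError because urllib.error.URLError, timeouts and connection errors are all subclasses of it.

```python
import time


def with_retries(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); on a network error wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            if attempt == attempts:
                raise               # out of retries: surface the error
            sleep(delay * attempt)  # wait a little longer each time
```

It is probably safer to wrap only the download, for example with_retries(lambda: cnblog.getPage(url)), than the whole of parsePage, since retrying parsePage after a partial run would append duplicate lines to a.txt.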
Once the data has been collected, you can analyze it or crawl deeper, for example by using each author name to fetch that author's blog.
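As a small starting point for such an analysis, the saved lines can be counted per author. This is a sketch under assumptions: `posts_per_author` and `author_blog_url` are hypothetical names, the author is taken to be the third tab-separated field as written by writeFile, and the blog URL pattern is the one seen in the sample HTML (http://blog.csdn.net/ followed by the user name).

```python
from collections import Counter


def posts_per_author(lines):
    """Count how many saved rows belong to each author (third tab-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 3:  # skip blank or malformed lines
            counts[fields[2]] += 1
    return counts


def author_blog_url(author):
    """Build the author's blog URL, following the pattern in the sample HTML."""
    return 'http://blog.csdn.net/{}'.format(author)
```

With the file produced by the script, something like posts_per_author(open('a.txt', encoding='utf-8')).most_common(5) would list the most active authors, and author_blog_url gives the next page to crawl for each of them.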