Web Crawler Basics, Case 1: Scraping Posts from Baidu Tieba


References:

Python: http://www.runoob.com/python/python-intro.html

Python crawler tutorial series: http://www.cnblogs.com/xin-xin/p/4297852.html

Regular expressions: http://www.cnblogs.com/deerchao/archive/2006/08/24/zhengzhe30fengzhongjiaocheng.html

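Goals of this post:

1. Scrape any thread on Baidu Tieba

2. Optionally scrape only the original poster's posts

3. Parse the scraped content and save it to a file

4. Download the images that appear in the thread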
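
The second goal comes down to how the page URL is built: the listing that follows appends a see_lz query parameter (1 for the original poster only, 0 for everyone) and a pn page number to the thread address. A minimal sketch of that construction on its own, assuming the same URL layout as the listing; the helper name build_page_url and the thread ID 1234567890 are hypothetical and exist only for illustration:

```python
def build_page_url(thread_id, see_lz, page_num):
    """Return the URL of one page of a Tieba thread.

    build_page_url is a hypothetical helper; it mirrors the string
    concatenation done in BDTB.__init__ and BDTB.getPage.
    """
    base_url = 'http://tieba.baidu.com/p/' + str(thread_id)
    return base_url + '?see_lz=' + str(see_lz) + '&pn=' + str(page_num)


# 1234567890 is a made-up thread ID used only for illustration.
url = build_page_url(1234567890, 1, 2)
print(url)
```

Keeping the URL construction in one place makes it easy to check by eye before any request is sent.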

# -*- coding: utf-8 -*-
"""
Created on Fri Apr 15 11:47:02 2016

@author: wuhan
"""
import urllib
import urllib2
import re
import time
import os


#reload(sys)
#sys.setdefaultencoding("utf-8")

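class Tool:
    """Strip HTML markup from a post body and turn it into plain text."""
    # Remove image tags and runs of 12 spaces
    removeImg = re.compile('<img.*?>| {12}')
    # Remove hyperlink tags but keep the link text
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that end a line are replaced with a newline
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Table cells are replaced with a tab
    replaceTD = re.compile('<td>')
    # Paragraph openings are replaced with a newline plus a two-space indent
    replacePara = re.compile('<p.*?>')
    # One or two consecutive <br> tags are replaced with a single newline
    replaceBR = re.compile('<br><br>|<br>')
    # Remove any tags that are still left
    removeExtraTag = re.compile('<.*?>')

    def replace(self,x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.replacePara, "\n  ", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()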
        

class BDTB:
    def __init__(self, baseUrl, seeLZ, floorTag):
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        self.file = None
        self.floor = 1
        self.defaultTitle = u'百度貼吧'
        self.floorTag = floorTag
        
    def getPage(self, pageNum):
        # Fetch one page of the thread; return the decoded HTML, or None on failure
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read().decode('utf-8')
        except urllib2.URLError as e:
            if hasattr(e, "reason"):
                print u'Failed to connect to Baidu Tieba, reason:', e.reason
            return None
                
    def getTitle(self, page):
        pattern = re.compile('<h1 class="core_title_txt.*?>(.*?)</h1>',re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None
             
    def getPageNum(self, page):
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>',re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None
    
    def getContents(self,page):
        pattern = re.compile('<div id="post_content.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            content = "\n" + self.tool.replace(item) + "\n"
            contents.append(content.encode('utf-8'))
        return contents
        
    def setFileTitle(self, title):
        if title is not None:
            self.file = open(title + ".txt" , "w+")
        else:
            self.file = open(self.defaultTitle + ".txt" , "w+")
            
    def writeData(self, contents):
        for item in contents:
            if self.floorTag == '1':
                floorLine = "\n" + str(self.floor) + u"-----------------------------------------------------------------------------------------------------------------------------------------\n"
                self.file.write(floorLine)
            self.file.write(item)
            self.floor += 1
    
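    def start(self):
        indexPage = self.getPage(1)
        if indexPage is None:
            print "Could not fetch the thread, please try again"
            return
        pageNum = self.getPageNum(indexPage)
        if pageNum is None:
            print "The URL is no longer valid, please try again"
            return
        # Open the output file only after the thread is known to be reachable
        title = self.getTitle(indexPage)
        self.setFileTitle(title)
        try:
            print "This thread has " + str(pageNum) + " pages"
            for i in range(1, int(pageNum)+1):
                print "Writing page " + str(i)
                page = self.getPage(i)
                if page is None:
                    # Skip pages that failed to download
                    continue
                contents = self.getContents(page)
                self.writeData(contents)
                self.getPicture(page, i)
        except IOError as e:
            print "Write failed, reason: " + str(e)
        finally:
            self.file.close()
            print "Writing finished"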
            
    def getPicture(self, page, PageNum):
        # Match the src of every post image (class BDE_Image) that ends in .jpg
        reg = r'<img class="BDE_Image".*?src="(.+?\.jpg)'
        imgre = re.compile(reg)  # compile the regular expression into a pattern object
        imglist = re.findall(imgre, page)  # collect every image URL in the HTML that matches imgre
        t = time.localtime(time.time())
        foldername = '%d-%d-%d' % (t.tm_year, t.tm_mon, t.tm_mday)
        picpath = 'E:\\Python\\ImageDownload\\%s' % (foldername)  # local download directory
        if not os.path.exists(picpath):  # create the directory if it does not exist
            os.makedirs(picpath)

        x = 0
        for imgurl in imglist:
            target = picpath+'\\%s_%s.jpg' % (PageNum, x)
            urllib.urlretrieve(imgurl, target)  # download the remote file straight to disk
            x+=1
        
print u"Enter the thread ID"
baseURL = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))
seeLZ = raw_input("Fetch only the original poster's posts? Enter 1 for yes, 0 for no\n")
floorTag = raw_input("Write floor numbers to the file? Enter 1 for yes, 0 for no\n")
bdtb = BDTB(baseURL,seeLZ,floorTag)
bdtb.start()
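
The regular expressions used by getTitle, getContents, getPicture and Tool can be tried offline before pointing the crawler at a live thread. The sketch below runs them against a made-up HTML fragment shaped like the markup those patterns expect; the fragment, the example.com URLs and the simplified clean function are assumptions for illustration, real Tieba pages may look different, and the image pattern here escapes the dot before jpg. It uses only the re module.

```python
import re

# Made-up fragment shaped like the markup the patterns in the listing expect;
# real Tieba pages may differ.
SAMPLE_HTML = (
    '<h1 class="core_title_txt" title="demo"> Demo thread </h1>'
    '<div id="post_content_1" class="d_post_content">'
    'First line<br>second line <a href="http://example.com">a link</a>'
    '<img class="BDE_Image" src="http://example.com/pic/a.jpg" width="100">'
    '</div>'
)

title_re = re.compile(r'<h1 class="core_title_txt.*?>(.*?)</h1>', re.S)
content_re = re.compile(r'<div id="post_content.*?>(.*?)</div>', re.S)
image_re = re.compile(r'<img class="BDE_Image".*?src="(.+?\.jpg)')


def clean(fragment):
    """Simplified stand-in for Tool.replace: drop images and links, keep text."""
    fragment = re.sub(r'<img.*?>', '', fragment)         # remove image tags
    fragment = re.sub(r'<a.*?>|</a>', '', fragment)      # unwrap hyperlinks
    fragment = re.sub(r'<br><br>|<br>', '\n', fragment)  # line breaks to newlines
    fragment = re.sub(r'<.*?>', '', fragment)            # strip remaining tags
    return fragment.strip()


title = title_re.search(SAMPLE_HTML).group(1).strip()
posts = [clean(item) for item in content_re.findall(SAMPLE_HTML)]
images = image_re.findall(SAMPLE_HTML)

print(title)
print(posts)
print(images)
```

Checking the patterns against a saved fragment like this makes it easier to tell whether an empty result comes from a network problem or from markup that no longer matches.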
