Scraping the cnblogs Featured Section with Scrapy

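Crawl target

For each post in the cnblogs featured section, collect the title, title link, author, link to the author's blog homepage, summary, publication time, comment count, view count, and recommendation count, and store them in MongoDB.

Environment

Scrapy is installed
MongoDB is installed
pymongo is installed (the pipeline below imports it)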

Create the project

scrapy startproject cnblogs

Running this command at the command prompt creates a folder named cnblogs.

Create the spider file

cd cnblogs
scrapy genspider cn cnblogs.com

Running these commands creates a spider file named cn.py under cnblogs\spiders\; cnblogs.com is the domain the spider is allowed to crawl.

Write the items.py file

Define the fields to be scraped.

import scrapy

class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    post_author = scrapy.Field()    # post author
    author_link = scrapy.Field()    # link to the author's blog homepage
    post_date = scrapy.Field()      # publication time
    digg_num = scrapy.Field()       # recommendation count
    title = scrapy.Field()          # title
    title_link = scrapy.Field()     # title link
    item_summary = scrapy.Field()   # summary
    comment_num = scrapy.Field()    # comment count
    view_num = scrapy.Field()       # view count

Write the spider file cn.py

import scrapy
from cnblogs.items import CnblogsItem

class CnSpider(scrapy.Spider):
    name = 'cn'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/pick/']

    def parse(self, response):
        # Each child div of #post_list is one post entry.
        div_list = response.xpath("//div[@id='post_list']/div")
        for div in div_list:
            item = CnblogsItem()
            item["post_author"] = div.xpath(".//div[@class='post_item_foot']/a/text()").extract_first()
            item["author_link"] = div.xpath(".//div[@class='post_item_foot']/a/@href").extract_first()
            # extract() returns every text node of the footer div; the pipeline picks out the date.
            item["post_date"] = div.xpath(".//div[@class='post_item_foot']/text()").extract()
            item["comment_num"] = div.xpath(".//span[@class='article_comment']/a/text()").extract_first()
            item["view_num"] = div.xpath(".//span[@class='article_view']/a/text()").extract_first()
            item["title"] = div.xpath(".//h3/a/text()").extract_first()
            item["title_link"] = div.xpath(".//h3/a/@href").extract_first()
            # The summary is also returned as a list of text nodes and cleaned in the pipeline.
            item["item_summary"] = div.xpath(".//p[@class='post_item_summary']/text()").extract()
            item["digg_num"] = div.xpath(".//span[@class='diggnum']/text()").extract_first()
            yield item

        # Follow the "Next >" link; on the last page it is absent and the crawl ends.
        next_url = response.xpath(".//a[text()='Next >']/@href").extract_first()
        if next_url is not None:
            # urljoin resolves relative and absolute hrefs alike against the current page.
            yield scrapy.Request(
                response.urljoin(next_url),
                callback=self.parse
            )
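Pagination depends on turning the href of the "Next >" link into an absolute URL. The sketch below uses only the standard library's urllib.parse.urljoin to show how different href shapes resolve against the page they were found on. The helper name absolute_next_url and the sample hrefs are hypothetical and were not taken from the live site.

```python
from urllib.parse import urljoin

# Page on which the (hypothetical) "Next >" link was found.
page_url = "https://www.cnblogs.com/pick/"


def absolute_next_url(base_url, href):
    """Resolve a possibly relative href against the page it came from."""
    if href is None:
        return None  # no "Next >" link, so there is no further page
    return urljoin(base_url, href)


# Hypothetical href values, for illustration only.
root_relative = absolute_next_url(page_url, "/pick/2/")
page_relative = absolute_next_url(page_url, "2/")
already_absolute = absolute_next_url(page_url, "https://www.cnblogs.com/pick/3/")
last_page = absolute_next_url(page_url, None)

print(root_relative)
print(page_relative)
print(already_absolute)
print(last_page)
```

Plain string concatenation with the site root only works for the root-relative form; resolving against the page URL also covers the other two.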

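Write the pipelines.py file

Apply some simple cleaning to the scraped data, removing whitespace and empty strings, and save the result to MongoDB.

from pymongo import MongoClient
import re

# Connect to MongoDB on the default host and port and use the
# "cnblogs" collection of the "test" database.
client = MongoClient()
collection = client["test"]["cnblogs"]

class CnblogsPipeline(object):
    def process_item(self, item, spider):
        item["post_date"] = self.process_string_list(item["post_date"])
        item["comment_num"] = self.process_string(item["comment_num"])
        item["item_summary"] = self.process_string_list(item["item_summary"])
        print(item)
        # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
        collection.insert_one(dict(item))
        return item

    def process_string(self, content_string):
        # Remove every whitespace character from a single string.
        if content_string is not None:
            content_string = re.sub(r"\s", "", content_string)
        return content_string

    def process_string_list(self, string_list):
        # Strip whitespace from each string and return the first non-empty one,
        # or None when nothing is left (avoids an IndexError on an empty list).
        if string_list is None:
            return None
        cleaned = [re.sub(r"\s", "", i) for i in string_list]
        non_empty = [i for i in cleaned if len(i) > 0]
        return non_empty[0] if non_empty else None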
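To make the cleaning step concrete, here is a minimal standalone sketch of what the two helpers do to a list of text nodes. The function names are stand-ins for the pipeline methods, and the sample nodes are hypothetical, shaped like what text() might return for a footer div. Note that removing all whitespace also removes the space between the date and the time; collapse_whitespace is an assumed alternative, not part of the original code, for cases where that space should be kept.

```python
import re


def strip_whitespace(text):
    """Remove every whitespace character, as the pipeline's re.sub call does."""
    return re.sub(r"\s", "", text)


def first_non_empty(strings):
    """Return the first string that is non-empty after stripping, else None."""
    cleaned = [strip_whitespace(s) for s in strings]
    non_empty = [s for s in cleaned if s]
    return non_empty[0] if non_empty else None


def collapse_whitespace(text):
    """Alternative: trim the ends and keep single spaces between words."""
    return " ".join(text.split())


# Hypothetical text nodes: one whitespace-only node, one carrying the date.
raw_nodes = ["\r\n        ", "\r\n        Posted 2018-11-28 10:00\r\n        "]

stripped = first_non_empty(raw_nodes)
collapsed = collapse_whitespace(raw_nodes[1])
nothing_left = first_non_empty(["   ", "\n"])

print(stripped)
print(collapsed)
print(nothing_left)
```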

Modify the settings.py file

Add USER_AGENT

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
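Alongside USER_AGENT, settings.py is also where crawl pacing is configured. The fragment below is a sketch of optional throttling settings. The setting names are Scrapy's own, but the values are illustrative assumptions and are not part of the original project.

```python
# Optional additions to settings.py (assumed values, tune as needed).
DOWNLOAD_DELAY = 1                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests to one domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server response times
```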

Enable the pipeline

ITEM_PIPELINES = {
   'cnblogs.pipelines.CnblogsPipeline': 300,
}
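ITEM_PIPELINES registers a pipeline class by its import path, and Scrapy then calls that class's hook methods during the crawl. Besides process_item, a pipeline may define open_spider and close_spider, which run once when the spider starts and once when it finishes. The sketch below is a hypothetical variant, named MongoLifecyclePipeline here, that opens the MongoDB connection in open_spider instead of at import time. It assumes a MongoDB server on the default local host and port and omits the cleaning step for brevity.

```python
class MongoLifecyclePipeline:
    """Hypothetical pipeline variant that manages its own MongoDB connection."""

    def open_spider(self, spider):
        # Imported here so the connection is only set up when a crawl starts.
        from pymongo import MongoClient

        self.client = MongoClient()  # assumes the default localhost:27017
        self.collection = self.client["test"]["cnblogs"]

    def close_spider(self, spider):
        # Release the connection once the crawl has finished.
        self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item on unchanged.
        self.collection.insert_one(dict(item))
        return item
```

To use a variant like this, its import path would take the place of 'cnblogs.pipelines.CnblogsPipeline' in ITEM_PIPELINES.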

Run the program

Run the following command to start the crawl.

scrapy crawl cn

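Results

Once the program has finished, the scraped data can be viewed in MongoDB (the original post shows a screenshot at this point); the GUI tool used is Robo 3T.

Thank you for reading. If anything in this article is incorrect, please point it out, and I will gladly learn from it and correct it.
Thanks again for patiently reading this article to the end.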
