Build a Distributed Crawler in 21 Days: The Scrapy Framework (Part 7)

7.1. Qiushibaike (糗事百科)

Installation

pip install pypiwin32

pip install Twisted-18.7.0-cp36-cp36m-win_amd64.whl

pip install scrapy

Creating and running the project

scrapy startproject qsbk   # create the project

scrapy genspider qsbk_spider "qiushibaike.com"   # create the spider

scrapy crawl qsbk_spider         # run the spider

Code

qsbk_spider.py

# -*- coding: utf-8 -*-
import scrapy
from qsbk.items import QsbkItem

class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/8hr/page/1/']
    base_domain = "https://www.qiushibaike.com"

    def parse(self, response):
        duanzidivs = response.xpath("//div[@id='content-left']/div")
        for duanzidiv in duanzidivs:
            # default="" avoids an AttributeError when a block has no <h2>
            author = duanzidiv.xpath(".//h2/text()").get(default="").strip()
            content = duanzidiv.xpath(".//div[@class='content']//text()").getall()
            content = "".join(content).strip()
            item = QsbkItem(author=author,content=content)
            yield item
        # Crawl the following pages: the last <li> of the pagination holds the "next" link
        next_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if next_url:
            yield scrapy.Request(self.base_domain + next_url, callback=self.parse)
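
The extraction and pagination logic in parse() can be tried without hitting the site. The following is only a sketch: it uses the standard library's ElementTree in place of Scrapy's selectors, and SAMPLE_PAGE is a made-up, well-formed fragment that merely imitates the structure the XPath expressions above expect. The function name extract is also hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page fragment, shaped like the markup the spider's XPaths assume.
SAMPLE_PAGE = """
<html><body>
<div id="content-left">
  <div>
    <h2> alice </h2>
    <div class="content"><span>first <b>joke</b></span></div>
  </div>
  <div>
    <h2> bob </h2>
    <div class="content"><span>second joke</span></div>
  </div>
</div>
<ul class="pagination">
  <li><a href="/8hr/page/1/">1</a></li>
  <li><a href="/8hr/page/2/">next</a></li>
</ul>
</body></html>
"""


def extract(page_source, page_url):
    """Return (items, next_url) for one page, mirroring the steps in parse()."""
    root = ET.fromstring(page_source.strip())
    items = []
    # Each direct child <div> of #content-left is one post.
    for div in root.findall(".//div[@id='content-left']/div"):
        author = (div.findtext(".//h2") or "").strip()
        content_node = div.find(".//div[@class='content']")
        # itertext() plays the role of //text() followed by "".join(...)
        content = "".join(content_node.itertext()).strip() if content_node is not None else ""
        items.append({"author": author, "content": content})
    # The last <li> of the pagination list carries the link to the next page.
    link = root.find(".//ul[@class='pagination']/li[last()]/a")
    next_url = urljoin(page_url, link.get("href")) if link is not None else None
    return items, next_url


items, next_url = extract(SAMPLE_PAGE, "https://www.qiushibaike.com/8hr/page/1/")
print(items)
print(next_url)
```

urljoin is used here instead of string concatenation because it also copes with relative or already absolute links; inside a spider, response.urljoin(next_url) does the same job.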

items.py

import scrapy

class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

import json

# 1. Manually convert the dict to JSON

# class QsbkPipeline(object):
#     def __init__(self):
#         self.fp = open('duanzi.json','w',encoding='utf-8')
#
#     def open_spider(self,spider):
#         print('Spider started')
#
#     def process_item(self, item, spider):
#         item_json = json.dumps(dict(item),ensure_ascii=False)
#         self.fp.write(item_json+'\n')
#         return item
#
#     def close_spider(self,spider):
#         self.fp.close()
#         print('Spider finished')

# 2. Use JsonItemExporter; suitable when the amount of data is small
# from scrapy.exporters import JsonItemExporter
# class QsbkPipeline(object):
#     def __init__(self):
#         self.fp = open('duanzi.json','wb')
#         self.exporter = JsonItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')
#         self.exporter.start_exporting()
#
#     def open_spider(self,spider):
#         print('Spider started')
#
#     def process_item(self, item, spider):
#         self.exporter.export_item(item)
#         return item
#
#     def close_spider(self,spider):
#         self.exporter.finish_exporting()
#         self.fp.close()
#         print('Spider finished')


# 3. JsonLinesItemExporter; suitable when the amount of data is large
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open('duanzi.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started')

    def process_item(self, item, spider):
        # Each item is written immediately as one JSON object per line
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished')
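
The difference between variants 2 and 3 is mostly the shape of the output file: a single JSON array that is only valid once the closing bracket is written, versus one independent JSON object per line. A rough standard-library sketch of the two shapes follows; the helper names write_json_array, write_json_lines and read_json_lines are hypothetical and this is not the exporters' actual implementation.

```python
import io
import json


def write_json_array(items, fp):
    """Single-document output: the file is valid JSON only after the final ']'."""
    fp.write("[")
    for index, item in enumerate(items):
        if index:
            fp.write(",")
        fp.write(json.dumps(item, ensure_ascii=False))
    fp.write("]")


def write_json_lines(items, fp):
    """Line-oriented output: every line is a complete JSON object on its own."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(fp):
    """Read items back one line at a time, without loading the whole file."""
    for line in fp:
        if line.strip():
            yield json.loads(line)


sample = [{"author": "alice", "content": "first"}, {"author": "bob", "content": "second"}]

array_fp = io.StringIO()
write_json_array(sample, array_fp)
print(array_fp.getvalue())

lines_fp = io.StringIO()
write_json_lines(sample, lines_fp)
print(lines_fp.getvalue())
```

With the line-oriented form, a crawl that is interrupted halfway still leaves a file in which every complete line can be parsed, and a consumer can stream it with read_json_lines. That is one reason it tends to suit larger crawls.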

settings.py

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

ITEM_PIPELINES = {
   'qsbk.pipelines.QsbkPipeline': 300,
}
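
The number next to each entry in ITEM_PIPELINES is its order: when several pipelines are enabled, items pass through them from the lowest value to the highest (values are customarily kept in the 0-1000 range). The sketch below imitates that ordering with plain functions; strip_fields, add_length and run_pipelines are hypothetical stand-ins and not part of the qsbk project or of Scrapy.

```python
def strip_fields(item):
    """Hypothetical stage: trim whitespace from every string field."""
    return {key: value.strip() for key, value in item.items()}


def add_length(item):
    """Hypothetical stage: record the length of the (already cleaned) content."""
    return {**item, "length": len(item["content"])}


# Same shape as ITEM_PIPELINES: stage -> order value. Lower values run first.
PIPELINES = {add_length: 800, strip_fields: 300}


def run_pipelines(item, pipelines):
    """Pass the item through every stage in ascending order of its value."""
    order = []
    for stage in sorted(pipelines, key=pipelines.get):
        item = stage(item)
        order.append(stage.__name__)
    return item, order


result, order = run_pipelines({"author": " alice ", "content": " hi "}, PIPELINES)
print(order)
print(result)
```

Here strip_fields runs before add_length even though it is listed second, because 300 is lower than 800.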

start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl qsbk_spider".split())
