Where should crawled data be stored? In a database, of course! Deduplication and databases for data ingestion, explained!

Source: https://www.cnblogs.com/py1357/archive/2018/06/08/9157217.html

4. Why do we need URL deduplication?

When running a crawler we do not want the same page to be downloaded more than once: repeated downloads waste CPU and put extra load on the engine, so URLs should be deduplicated during the crawl. There is a second reason: when crawling at large scale, a failure should not force us to re-fetch links that were already crawled (re-crawling wastes resources and time).

5. How do we decide how strong the deduplication should be?

Here the crawl period is used to decide the strength (a simple persistence sketch follows the list):

If the period is within one hour, do not persist the crawled links (persisting URLs is mainly useful when designing an incremental-crawling scheme).

If the period is within one day (or the total volume is under 300k URLs), do a simple persistence of the crawled links.

If the period is longer than one day, fully persist the crawled links.
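To make "simple persistence" concrete, here is a minimal sketch (not from the original post) that stores SHA-1 fingerprints of crawled URLs in a local SQLite file; the class name and schema are illustrative assumptions.

# Minimal sketch of persisting seen URLs (illustrative; not from the original post).
import hashlib
import sqlite3
import time


class SeenUrlStore:
    """Keeps SHA-1 fingerprints of already-crawled URLs in a SQLite file."""

    def __init__(self, path='seen_urls.db'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY, ts REAL)')

    def _fingerprint(self, url):
        # hash the URL so every key has a fixed size
        return hashlib.sha1(url.encode('utf-8')).hexdigest()

    def seen(self, url):
        cur = self.conn.execute(
            'SELECT 1 FROM seen WHERE fp = ?', (self._fingerprint(url),))
        return cur.fetchone() is not None

    def add(self, url):
        self.conn.execute(
            'INSERT OR IGNORE INTO seen (fp, ts) VALUES (?, ?)',
            (self._fingerprint(url), time.time()))
        self.conn.commit()

A crawler would call store.seen(url) before scheduling a request and store.add(url) after a successful download; for crawls shorter than an hour the same idea works with an in-memory set instead of SQLite.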

 

 

Step 2: install the dependency packages:

 

 

Step 3: install scrapy-deltafetch

Open a terminal and install it with a single command: pip install scrapy-deltafetch
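After installation, scrapy-deltafetch still has to be enabled in the project's settings.py. According to its documentation this is done through the spider-middleware settings shown below (the priority value 100 is just a common choice):

# settings.py -- enable the DeltaFetch spider middleware
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True              # switch the middleware on
# DELTAFETCH_DIR = 'deltafetch_dir'    # optional: directory for the per-spider state .db file
# DELTAFETCH_RESET = True              # optional: ignore stored state and re-crawl everything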

 

Below is a supplementary note on installing the packages on Ubuntu 16.04 (reference: http://jinbitou.net/2018/01/27/2579.html).

The successful installation output is pasted directly below. First, install the Berkeley DB database.

 

Then install scrapy-deltafetch; before that, install its dependency package bsddb3 as well:

(course-python3.5-env) bourne@bourne-vm:~$ pip install bsddb3
Collecting bsddb3
  Using cached https://files.pythonhosted.org/packages/ba/a7/131dfd4e3a5002ef30e20bee679d5e6bcb2fcc6af21bd5079dc1707a132c/bsddb3-6.2.5.tar.gz
Building wheels for collected packages: bsddb3
  Running setup.py bdist_wheel for bsddb3 ... done
  Stored in directory: /home/bourne/.cache/pip/wheels/58/8e/e5/bfbc89dd084aa896e471476925d48a713bb466842ed760d43c
Successfully built bsddb3
Installing collected packages: bsddb3
Successfully installed bsddb3-6.2.5
(course-python3.5-env) bourne@bourne-vm:~$ pip install scrapy-deltafetch
Collecting scrapy-deltafetch
  Using cached https://files.pythonhosted.org/packages/90/81/08bd21bc3ee364845d76adef09d20d85d75851c582a2e0bb7f959d49b8e5/scrapy_deltafetch-1.2.1-py2.py3-none-any.whl
Requirement already satisfied: bsddb3 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (6.2.5)
Requirement already satisfied: Scrapy>=1.1.0 in ./course-python3.5-env/lib/python3.5/site-packages (from scrapy-deltafetch) (1.5.0)
Requirement already satisfied: PyDispatcher>=2.0.5 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (2.0.5)
Requirement already satisfied: lxml in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (4.2.1)
Requirement already satisfied: cssselect>=0.9 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.0.3)
Requirement already satisfied: queuelib in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.5.0)
Requirement already satisfied: w3lib>=1.17.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.19.0)
Requirement already satisfied: service-identity in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (17.0.0)
Requirement already satisfied: Twisted>=13.1.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (18.4.0)
Requirement already satisfied: parsel>=1.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.4.0)
Requirement already satisfied: pyOpenSSL in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (17.5.0)
Requirement already satisfied: six>=1.5.2 in ./course-python3.5-env/lib/python3.5/site-packages (from Scrapy>=1.1.0->scrapy-deltafetch) (1.11.0)
Requirement already satisfied: attrs in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (18.1.0)
Requirement already satisfied: pyasn1-modules in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (0.2.1)
Requirement already satisfied: pyasn1 in ./course-python3.5-env/lib/python3.5/site-packages (from service-identity->Scrapy>=1.1.0->scrapy-deltafetch) (0.4.2)
Requirement already satisfied: incremental>=16.10.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (17.5.0)
Requirement already satisfied: constantly>=15.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (15.1.0)
Requirement already satisfied: Automat>=0.3.0 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (0.6.0)
Requirement already satisfied: hyperlink>=17.1.1 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (18.0.0)
Requirement already satisfied: zope.interface>=4.4.2 in ./course-python3.5-env/lib/python3.5/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (4.5.0)
Requirement already satisfied: cryptography>=2.1.4 in ./course-python3.5-env/lib/python3.5/site-packages (from pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (2.2.2)
Requirement already satisfied: idna>=2.5 in ./course-python3.5-env/lib/python3.5/site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (2.6)
Requirement already satisfied: setuptools in ./course-python3.5-env/lib/python3.5/site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->Scrapy>=1.1.0->scrapy-deltafetch) (39.1.0)
Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in ./course-python3.5-env/lib/python3.5/site-packages (from cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (1.11.5)
Requirement already satisfied: asn1crypto>=0.21.0 in ./course-python3.5-env/lib/python3.5/site-packages (from cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (0.24.0)
Requirement already satisfied: pycparser in ./course-python3.5-env/lib/python3.5/site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->Scrapy>=1.1.0->scrapy-deltafetch) (2.18)
Installing collected packages: scrapy-deltafetch
Successfully installed scrapy-deltafetch-1.2.1
(course-python3.5-env) bourne@bourne-vm:~$

 

scrapy-deltafetch is implemented as a spider middleware; the core logic lives in its process_spider_output() method:

def process_spider_output(self, response, result, spider):
    for r in result:
        if isinstance(r, Request):  # if it is a Request (a URL), check it against the db
            key = self._get_key(r)  # build the key via _get_key()
            if key in self.db:  # is the key already in the database?
                logger.info("Ignoring already visited: %s" % r)  # log and skip an already-visited request
                if self.stats:
                    self.stats.inc_value('deltafetch/skipped', spider=spider)
                continue
        elif isinstance(r, (BaseItem, dict)):  # an item produced by the spider
            key = self._get_key(response.request)  # dedup only on the URL of the page that yielded data
            self.db[key] = str(time.time())  # store the key in the database together with a timestamp
            if self.stats:
                self.stats.inc_value('deltafetch/stored', spider=spider)
        yield r
def _get_key(self, request):
    # either a unique key you designed yourself (deltafetch_key in request.meta),
    # or the fingerprint produced by Scrapy's built-in dedup scheme; opening its
    # source shows that a hash algorithm is used
    key = request.meta.get('deltafetch_key') or request_fingerprint(request)
    # request_fingerprint() returns `hashlib.sha1().hexdigest()`, is a string
    return to_bytes(key)
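Because _get_key() checks request.meta for 'deltafetch_key' before falling back to the fingerprint, a spider can supply its own stable identifier. A minimal sketch (the spider name, selectors and URLs are illustrative, not from the original post):

import scrapy
from scrapy.http import Request

class DetailDemoSpider(scrapy.Spider):
    name = 'detail_demo'  # hypothetical spider, only to show deltafetch_key usage
    start_urls = ['http://cd.58.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            # use the detail-page URL as the DeltaFetch key; any stable,
            # unique identifier (e.g. a listing id) works just as well
            yield Request(response.urljoin(href),
                          callback=self.parse_detail,
                          meta={'deltafetch_key': href})

    def parse_detail(self, response):
        yield {'url': response.url,
               'title': response.css('title::text').extract_first()}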
 1 """
 2 This module provides some useful functions for working with
 3 scrapy.http.Request objects
 4 """
 5 
 6 from __future__ import print_function
 7 import hashlib
 8 import weakref
 9 from six.moves.urllib.parse import urlunparse
10 
11 from w3lib.http import basic_auth_header
12 from scrapy.utils.python import to_bytes, to_native_str
13 
14 from w3lib.url import canonicalize_url
15 from scrapy.utils.httpobj import urlparse_cached
16 
17 
18 _fingerprint_cache = weakref.WeakKeyDictionary()
19 def request_fingerprint(request, include_headers=None):
20 """
21 Return the request fingerprint.
22 
23 The request fingerprint is a hash that uniquely identifies the resource the
24 request points to. For example, take the following two urls:
25 
26 http://www.example.com/query?id=111&cat=222
27 http://www.example.com/query?cat=222&id=111
28 
29 Even though those are two different URLs both point to the same resource
30 and are equivalent (ie. they should return the same response).
31 
32 Another example are cookies used to store session ids. Suppose the
33 following page is only accesible to authenticated users:
34 
35 http://www.example.com/members/offers.html
36 
37 Lot of sites use a cookie to store the session id, which adds a random
38 component to the HTTP Request and thus should be ignored when calculating
39 the fingerprint.
40 
41 For this reason, request headers are ignored by default when calculating
42 the fingeprint. If you want to include specific headers use the
43 include_headers argument, which is a list of Request headers to include.
44 
45 """
46 if include_headers:
47 include_headers = tuple(to_bytes(h.lower())
48 for h in sorted(include_headers))
49 cache = _fingerprint_cache.setdefault(request, {})
50 if include_headers not in cache:
51 fp = hashlib.sha1() #哈希演算法,生成一段暗紋,用來進行唯一標識
52 fp.update(to_bytes(request.method))
53 fp.update(to_bytes(canonicalize_url(request.url)))
54 fp.update(request.body or b'')
55 if include_headers:
56 for hdr in include_headers:
57 if hdr in request.headers:
58 fp.update(hdr)
59 for v in request.headers.getlist(hdr):
60 fp.update(v)
61 cache[include_headers] = fp.hexdigest()
62 return cache[include_headers]
63 
64 
65 def request_authenticate(request, username, password):
66 """Autenticate the given request (in place) using the HTTP basic access
67 authentication mechanism (RFC 2617) and the given username and password
68 """
69 request.headers['Authorization'] = basic_auth_header(username, password)
70 
71 
72 def request_httprepr(request):
73 """Return the raw HTTP representation (as bytes) of the given request.
74 This is provided only for reference since it's not the actual stream of
75 bytes that will be send when performing the request (that's controlled
76 by Twisted).
77 """
78 parsed = urlparse_cached(request)
79 path = urlunparse(('', '', parsed.path or '/', parsed.params, parsed.query, ''))
80 s = to_bytes(request.method) + b" " + to_bytes(path) + b" HTTP/1.1\r\n"
81 s += b"Host: " + to_bytes(parsed.hostname or b'') + b"\r\n"
82 if request.headers:
83 s += request.headers.to_string() + b"\r\n"
84 s += b"\r\n"
85 s += request.body
86 return s
87 
88 
89 def referer_str(request):
90 """ Return Referer HTTP header suitable for logging. """
91 referrer = request.headers.get('Referer')
92 if referrer is None:
93 return referrer
94 return to_native_str(referrer, errors='replace')
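A quick way to verify the canonicalisation behaviour described in the docstring: the two example URLs with reordered query parameters hash to the same fingerprint (checked against the Scrapy 1.5-era API shown above):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://www.example.com/query?id=111&cat=222')
r2 = Request('http://www.example.com/query?cat=222&id=111')

# canonicalize_url() sorts the query string, so both requests
# produce the same SHA-1 hex digest
assert request_fingerprint(r1) == request_fingerprint(r2)
print(request_fingerprint(r1))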

 

(3) A hands-on example

Create a project named spider_city_58 and generate the spider.py crawler.

(1) Modify spider.py

 

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['58.com']
    start_urls = ['http://cd.58.com/']

    def parse(self, response):
        yield Request('http://bj.58.com', callback=self.parse)
        yield Request('http://wh.58.com', callback=self.parse)

 

(2) Create init_utils.py with the following content

 

 1 #author: "xian"
 2 #date: 2018/6/1
 3 from scrapy.http import Request
 4 
 5 def init_add_request(spider, url):
 6 """
 7 此方法用於在,scrapy啟動的時候添加一些已經跑過的url,讓爬蟲不需要重覆跑
 8 
 9 """
10 rf = spider.crawler.engine.slot.scheduler.df #找到實例化對象
11 
12 request = Request(url)
13 rf.request_seen(request) #調用request_seen方法

 

(3) Modify pipeline.py — open_spider() runs once when the spider starts, which makes it a convenient place to pre-mark URLs as already seen:

 

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from .init_utils import init_add_request

class City58Pipeline(object):
    def process_item(self, item, spider):
        return item

    def open_spider(self, spider):
        init_add_request(spider, 'http://wh.58.com')

 

(4) Modify settings.py
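The original screenshot of settings.py is not reproduced here; at a minimum it has to register the pipeline defined above, otherwise open_spider() never runs. A sketch under that assumption (the module path depends on how the project files are actually named):

# settings.py (sketch -- the article's screenshot is missing, so these values are assumptions)
ITEM_PIPELINES = {
    'spider_city_58.pipelines.City58Pipeline': 300,
}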

 

 

(5) Create the test file main.py

1 #author: "xian"
2 #date: 2018/6/1
3 from scrapy.cmdline import execute
4 execute('scrapy crawl spider'.split())

Run result:

 

Closing note: we will analyse scrapy-redis deduplication in a follow-up post!
