WebMagic basic methods

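WebMagic is built from four components: Downloader, PageProcessor, Scheduler, and Pipeline. Spider wires them together. The four components correspond to the download, processing, management, and persistence stages of a crawler's life cycle.

PageProcessor: you have to write this yourself.

Scheduler: there is no need to customize it unless the project has special distributed-crawling requirements.

Pipeline: to save results to a database, you need to write your own.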
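
To make the division of labour concrete, here is a minimal JDK-only sketch of a crawl loop. It is not WebMagic code: `MiniCrawler` and every name in it are hypothetical, the "web" is an in-memory map, and the real components are far more capable. It only shows how downloading, processing, scheduling, and persisting fit together in one loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical, JDK-only model of the four roles; none of these names are WebMagic classes.
class MiniCrawler {
    // "Downloader" stand-in: an in-memory map from URL to the links found on that page.
    private final Map<String, List<String>> fakeWeb;
    // "Scheduler" stand-in: a FIFO queue of pending URLs plus a set for de-duplication.
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    // "Pipeline" stand-in: results are simply collected in a list.
    private final List<String> stored = new ArrayList<>();

    MiniCrawler(Map<String, List<String>> fakeWeb) {
        this.fakeWeb = fakeWeb;
    }

    // Plays the Spider's part: drives the loop until no URLs remain.
    List<String> crawl(String startUrl) {
        schedule(startUrl);
        while (!pending.isEmpty()) {
            String url = pending.poll();
            // "Download" the page (unknown URLs yield no links).
            List<String> links = fakeWeb.getOrDefault(url, Collections.<String>emptyList());
            // "Process" the page: every discovered link goes back to the scheduler.
            for (String link : links) {
                schedule(link);
            }
            // "Persist" the result.
            stored.add(url);
        }
        return stored;
    }

    private void schedule(String url) {
        if (seen.add(url)) { // add() returns false for a URL that was already seen
            pending.add(url);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("/a", Arrays.asList("/b", "/c"));
        web.put("/b", Arrays.asList("/a", "/c"));
        System.out.println(new MiniCrawler(web).crawl("/a")); // prints [/a, /b, /c]
    }
}
```

Each page is visited once even though `/a` and `/b` link to each other, because the scheduling step drops URLs it has already seen.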

Selectable

Method	Description	Example
xpath(String xpath)	Select with XPath	html.xpath("//div[@class='title']")
$(String selector)	Select with a CSS selector	html.$("div.title")
$(String selector,String attr)	Select with a CSS selector and take the named attribute	html.$("div.title","text")
css(String selector)	Same as $(): select with a CSS selector	html.css("div.title")
links()	Select all links	html.links()
regex(String regex)	Extract with a regular expression	html.regex("<div>(.*?)</div>")
regex(String regex,int group)	Extract with a regular expression and specify the capture group	html.regex("<div>(.*?)</div>",1)
replace(String regex, String replacement)	Replace content	html.replace(regex,"")

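Returning results

Method	Description	Example
get()	Return a single result as a String	String link = html.links().get()
toString()	Same as get(): return a single result as a String	String link = html.links().toString()
all()	Return all extracted results	List<String> links = html.links().all()
match()	Whether there is any matching result	if (html.links().match()){ xxx; }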
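
The difference between a capture group and the whole match, and between one result and all results, is easiest to see on a small input. The sketch below uses the JDK's `java.util.regex` rather than WebMagic's Selectable, so it only illustrates the semantics; `RegexGroupDemo`, `extractAll`, and `extractFirst` are hypothetical names, and returning `null` for "no match" is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// JDK-only illustration of capture groups; this is not WebMagic's Selectable.
class RegexGroupDemo {
    // Collects the given capture group from every match (comparable to asking for all results).
    static List<String> extractAll(String text, String regex, int group) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            results.add(matcher.group(group));
        }
        return results;
    }

    // Returns only the first result, or null when nothing matched (a choice made for this sketch).
    static String extractFirst(String text, String regex, int group) {
        List<String> all = extractAll(text, regex, group);
        return all.isEmpty() ? null : all.get(0);
    }

    public static void main(String[] args) {
        String html = "<div>first</div><div>second</div>";
        String regex = "<div>(.*?)</div>";
        System.out.println(extractAll(html, regex, 1));   // prints [first, second]
        System.out.println(extractAll(html, regex, 0));   // prints [<div>first</div>, <div>second</div>]
        System.out.println(extractFirst(html, regex, 1)); // prints first
    }
}
```

Group 0 is the whole match including the tags, while group 1 is only the text captured by the parentheses.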

Spider

Method	Description	Example
create(PageProcessor)	Create a Spider	Spider.create(new GithubRepoProcessor())
addUrl(String...)	Add initial URLs	spider.addUrl("http://webmagic.io/docs/")
addRequest(Request...)	Add initial Requests	spider.addRequest(new Request("http://webmagic.io/docs/"))
thread(n)	Use n threads	spider.thread(5)
run()	Start the spider; blocks the current thread	spider.run()
start()/runAsync()	Start asynchronously; the current thread continues	spider.start()
stop()	Stop the spider	spider.stop()
test(String)	Fetch a single page as a test	spider.test("http://webmagic.io/docs/")
addPipeline(Pipeline)	Add a Pipeline; a Spider can have several Pipelines	spider.addPipeline(new ConsolePipeline())
setScheduler(Scheduler)	Set the Scheduler; a Spider can have only one Scheduler	spider.setScheduler(new RedisScheduler())
setDownloader(Downloader)	Set the Downloader; a Spider can have only one Downloader	spider.setDownloader(new SeleniumDownloader())
get(String)	Synchronous call that returns the result directly	ResultItems result = spider.get("http://webmagic.io/docs/")
getAll(String...)	Synchronous call that returns a list of results directly	List<ResultItems> results = spider.getAll("http://webmagic.io/docs/", "http://webmagic.io/xxx")
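
The table distinguishes a blocking `run()` from an asynchronous `start()`. The JDK-only sketch below illustrates that distinction with a hypothetical `Job` class standing in for a spider; it is an assumption-laden model, not WebMagic's implementation. The job records which thread executed it, so the difference is observable.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// JDK-only illustration of blocking vs asynchronous start; Job is a hypothetical stand-in.
class BlockingVsAsyncDemo {
    static class Job implements Runnable {
        volatile String executedOn;
        private final CountDownLatch done = new CountDownLatch(1);

        // Blocking entry point: the work happens on whichever thread calls run().
        @Override
        public void run() {
            executedOn = Thread.currentThread().getName();
            done.countDown();
        }

        // Asynchronous entry point: hand the job to a new thread and return immediately.
        void start() {
            new Thread(this, "crawler-thread").start();
        }

        // Waits up to five seconds for the job to finish; returns false on timeout or interrupt.
        boolean awaitDone() {
            try {
                return done.await(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Job blocking = new Job();
        blocking.run(); // returns only after the work is finished
        System.out.println("run() executed on: " + blocking.executedOn);

        Job async = new Job();
        async.start();     // returns at once; the caller is free to continue
        async.awaitDone(); // waiting here only so the demo can print the result
        System.out.println("start() executed on: " + async.executedOn); // prints start() executed on: crawler-thread
    }
}
```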

Site

Method	Description	Example
setCharset(String)	Set the character encoding	site.setCharset("utf-8")
setUserAgent(String)	Set the User-Agent	site.setUserAgent("Spider")
setTimeOut(int)	Set the timeout in milliseconds	site.setTimeOut(3000)
setRetryTimes(int)	Set the number of retries	site.setRetryTimes(3)
setCycleRetryTimes(int)	Set the number of cycle retries	site.setCycleRetryTimes(3)
addCookie(String,String)	Add a cookie	site.addCookie("dotcomt_user","code4craft")
setDomain(String)	Set the domain; addCookie only takes effect once the domain is set	site.setDomain("github.com")
addHeader(String,String)	Add a header	site.addHeader("Referer","https://github.com")
setHttpProxy(HttpHost)	Set an HTTP proxy	site.setHttpProxy(new HttpHost("127.0.0.1",8080))

Xsoup

Name	Expression	Support
nodename	nodename	yes
immediate parent	/	yes
parent	//	yes
attribute	[@key=value]	yes
nth child	tag[n]	yes
attribute	/@key	yes
wildcard in tagname	/*	yes
wildcard in attribute	/[@*]	yes
function	function()	part
or	a | b	yes since 0.2.0
parent in path	. or ..	no
predicates	price>35	no
predicates logic	@class=a or @class=b	yes since 0.2.0
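
Several rows in this table are standard XPath 1.0, so they can be tried out with the XPath engine that ships with the JDK. The sketch below does that on a small, well-formed XML snippet. It uses `javax.xml.xpath`, not Xsoup, so it says nothing about Xsoup's own behaviour on real-world HTML; `XPathSubsetDemo`, `eval`, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// JDK-only illustration of standard XPath expressions; this is not Xsoup.
class XPathSubsetDemo {
    static final String XML =
            "<html><body>"
            + "<div class=\"title\">Hello</div>"
            + "<ul><li>one</li><li>two</li></ul>"
            + "<a href=\"/docs\">Docs</a>"
            + "</body></html>";

    // Evaluates an XPath 1.0 expression and returns the string value of the first matching node.
    static String eval(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(XML, "//div[@class='title']"));                    // attribute predicate: prints Hello
        System.out.println(eval(XML, "//li[2]"));                                  // nth child: prints two
        System.out.println(eval(XML, "//a/@href"));                                // attribute value: prints /docs
        System.out.println(eval(XML, "//div | //a"));                              // union, first node in document order: prints Hello
        System.out.println(eval(XML, "//div[@class='title' or @class='other']"));  // predicate logic: prints Hello
    }
}
```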

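The author has also defined several XPath functions that are handy for crawlers. Note that these functions are not part of standard XPath.

Expression	Description	XPath1.0
text(n)	The nth direct text child node; 0 means all of them	text() only
allText()	All direct and indirect text child nodes	not supported
tidyText()	All direct and indirect text child nodes, with some tags replaced by line breaks so the plain text reads more cleanly	not supported
html()	Inner HTML, excluding the tag itself	not supported
outerHtml()	HTML including the tag itself	not supported
regex(@attr,expr,group)	@attr and group are both optional; the default is group 0	not supported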

Proxy

API	Description
HttpClientDownloader.setProxyProvider(ProxyProvider proxyProvider)	Set the proxy

1. Use a single plain HTTP proxy at 101.101.101.101, port 8888, with the username and password "username" and "password"

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101",8888,"username","password")));
spider.setDownloader(httpClientDownloader);

2. Use a proxy pool containing two IPs, 101.101.101.101 and 102.102.102.102, without a password

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(
    new Proxy("101.101.101.101",8888),
    new Proxy("102.102.102.102",8888)));
spider.setDownloader(httpClientDownloader);
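
The text does not say how a pool chooses a proxy for each request. One common policy is round-robin, sketched below with the JDK only. `RoundRobinPool` is a hypothetical class written for illustration, proxies are plain `host:port` strings, and nothing here should be read as a description of how WebMagic's own provider works.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin pool; an illustration of the policy, not WebMagic's provider.
class RoundRobinPool {
    private final List<String> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinPool(String... proxies) {
        if (proxies.length == 0) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        // Copy the array so later changes by the caller cannot affect the pool.
        this.proxies = Arrays.asList(proxies.clone());
    }

    // Thread-safe: each call atomically takes the next index and wraps at the end of the list.
    // floorMod keeps the index non-negative even after the counter overflows.
    String next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinPool pool = new RoundRobinPool("101.101.101.101:8888", "102.102.102.102:8888");
        System.out.println(pool.next()); // prints 101.101.101.101:8888
        System.out.println(pool.next()); // prints 102.102.102.102:8888
        System.out.println(pool.next()); // prints 101.101.101.101:8888
    }
}
```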

HttpRequestBody

API	Description
HttpRequestBody.form(Map<String,Object> params, String encoding)	Submit the body as a form
HttpRequestBody.json(String json, String encoding)	Submit the body as JSON; json is the already serialized string
HttpRequestBody.xml(String xml, String encoding)	Submit the body as XML; xml is the already serialized string
HttpRequestBody.custom(byte[] body, String contentType, String encoding)	Set a custom request body
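
For readers unsure what a form-submitted body looks like on the wire, the JDK-only sketch below turns a parameter map into the usual `application/x-www-form-urlencoded` string. It assumes that conventional encoding and does not call `HttpRequestBody`; `FormBodyDemo` and `encodeForm` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// JDK-only illustration of form encoding; this does not use WebMagic's HttpRequestBody.
class FormBodyDemo {
    // Joins URL-encoded key=value pairs with '&', in the iteration order of the map.
    static String encodeForm(Map<String, Object> params, String encoding) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, Object> entry : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(entry.getKey(), encoding))
                    .append('=')
                    .append(URLEncoder.encode(String.valueOf(entry.getValue()), encoding));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is predictable.
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("q", "web magic");
        params.put("page", 2);
        System.out.println(encodeForm(params, "UTF-8")); // prints q=web+magic&page=2
    }
}
```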

Using the components

Method	Description	Example
setScheduler()	Set the Scheduler	spider.setScheduler(new FileCacheQueueScheduler("D:\\data\\webmagic"))
setDownloader()	Set the Downloader	spider.setDownloader(new SeleniumDownloader())
addPipeline()	Add a Pipeline; a Spider can have several Pipelines	spider.addPipeline(new FilePipeline())

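Result output

Class	Description	Notes
ConsolePipeline	Print results to the console	Extracted results need to implement toString
FilePipeline	Save results to a file	Extracted results need to implement toString
JsonFilePipeline	Save results to a file in JSON format
ConsolePageModelPipeline	(Annotation mode) Print results to the console
FilePageModelPipeline	(Annotation mode) Save results to a file
JsonFilePageModelPipeline	(Annotation mode) Save results to a file in JSON format	Fields to be persisted need getter methods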
