0x00 Preface
This post walks through Java crawlers by comparing Java's crawling mechanics with Python's. For ordinary needs, either Java or Python is up to the job.
If you need to simulate logins or fight anti-scraping measures, Python is the more convenient choice. If you need to handle complex pages, parse page content into structured data, or do fine-grained parsing of page content, Java is the better fit. For simple data collection Python will do; for collection that must be stored in a specific structure, Java is preferable.
0x01 Basic GET and POST crawling
0x1 Basic code for GET and POST
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;

public class JAVA_TEST1 {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create a default client
        HttpGet httpGet = new HttpGet("https://www.itcast.cn");
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the request once
        if (response.getStatusLine().getStatusCode() == 200) {
            String s = EntityUtils.toString(response.getEntity(), "utf-8"); // read the body as UTF-8
            System.out.println(s);
        }
        response.close();   // release the connection
        httpClient.close(); // shut the client down
    }
}
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class JAVA_test02 {
    public static void main(String[] args) throws Exception {
        // create the httpclient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // build the URI with an extra query parameter
        URIBuilder uriBuilder = new URIBuilder("https://www.baidu.com");
        uriBuilder.setParameter("question", "hello");
        System.out.println(uriBuilder.build().toString());
        // create the httpget object and send the request
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute once
        if (response.getStatusLine().getStatusCode() == 200) {
            String s = EntityUtils.toString(response.getEntity(), "utf-8");
            System.out.println(s.length());
        }
        // close the response and the client
        response.close();
        httpClient.close();
    }
}
0x2 Common methods
1. CloseableHttpClient httpClient = HttpClients.createDefault();
Creates an HttpClient.
2. HttpGet httpGet = new HttpGet()
Creates an HttpGet; you can pass the URL in directly as a String.
a. You can also pass in a URI object, which lets you attach extra parameters:
URIBuilder uriBuilder = new URIBuilder("https://www.baidu.com");
uriBuilder.setParameter("param", "value");
HttpPost httpPost = new HttpPost(uriBuilder.build());
To attach a parameter, create the builder and call setParameter: the first argument is the parameter name and the second is the parameter value, much like the id=1 pairs familiar from SQL injection, i.e. param=value.
b. CloseableHttpResponse response = httpClient.execute(httpPost);
Creates a CloseableHttpResponse to receive the response.
3. Class methods
EntityUtils.toString(response.getEntity()) // turn the response body into a String for output
httpClient.execute(httpPost); // have the HttpClient object execute the request
response.getStatusLine().getStatusCode() == 200 // check the returned status code
4. POST requests carrying parameters
List<NameValuePair> pairList = new ArrayList<NameValuePair>();
pairList.add(new BasicNameValuePair("question", "wwww"));
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(pairList, "utf-8");
httpPost.setEntity(formEntity);
Because POST transmits form data, we build the form as a List with NameValuePair as the generic type, convert the list into a UrlEncodedFormEntity, and set it on the request, as in the runnable sketch below.
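To tie these fragments together, here is a minimal runnable sketch (the class name PostFormExample is hypothetical; the baidu endpoint and the question field are illustrative placeholders carried over from the snippets above):

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import java.util.ArrayList;
import java.util.List;

public class PostFormExample {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpPost httpPost = new HttpPost("https://www.baidu.com"); // placeholder endpoint
        // each NameValuePair is one form field
        List<NameValuePair> pairList = new ArrayList<NameValuePair>();
        pairList.add(new BasicNameValuePair("question", "wwww"));
        httpPost.setEntity(new UrlEncodedFormEntity(pairList, "utf-8"));
        CloseableHttpResponse response = httpClient.execute(httpPost);
        if (response.getStatusLine().getStatusCode() == 200) {
            System.out.println(EntityUtils.toString(response.getEntity(), "utf-8").length());
        }
        response.close();
        httpClient.close();
    }
}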
0x03 Connection pool
Creating a fresh HttpClient object for every request means constantly opening and closing connections, which is tedious, so a connection pool is used to manage them automatically.
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
Creates the connection pool.
public void setMaxTotal(int max)
Sets the maximum total number of connections.
public void setDefaultMaxPerRoute(int max)
Sets the maximum number of concurrent connections per host; wiring it all together is shown in the short sketch below.
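As a short sketch, plugging the pool into a client looks roughly like this (the limits 100 and 10 are illustrative values, not recommendations):

PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);          // at most 100 connections in the whole pool
cm.setDefaultMaxPerRoute(10); // at most 10 concurrent connections per host
// hand the pool to the client instead of letting it manage its own connections
CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();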
HttpClient configuration
setConnectTimeout(1000)
// maximum time allowed to establish a connection
setConnectionRequestTimeout(500)
// maximum time allowed to obtain a connection from the pool
setSocketTimeout(500).build();
// maximum time allowed to wait for data during transfer
package is.text;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import java.io.IOException;

public class gethttp1params {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // maximum time to establish a connection
                .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the pool
                .setSocketTimeout(500)            // maximum time to wait for data
                .build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = client.execute(httpGet);
        String s = EntityUtils.toString(response.getEntity());
        System.out.println(s);
        response.close();
        client.close();
    }
}
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import java.io.IOException;

public class JAVA_test05 {
    public static void main(String[] args) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(10);          // at most 10 connections in total
        cm.setDefaultMaxPerRoute(2); // at most 2 concurrent connections per host
        doget(cm);
        doget(cm);
        dopost(cm);
    }

    private static void dopost(PoolingHttpClientConnectionManager cm) {
        // the client borrows connections from the pool, so we do not close the client itself
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpPost httpPost = new HttpPost("https://www.baidu.com");
        try {
            CloseableHttpResponse response = httpClient.execute(httpPost);
            if (response.getStatusLine().getStatusCode() == 200) {
                System.out.println(EntityUtils.toString(response.getEntity()).length());
            }
            response.close(); // return the connection to the pool
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void doget(PoolingHttpClientConnectionManager cm) {
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.baidu.com");
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute once
            if (response.getStatusLine().getStatusCode() == 200) {
                String string = EntityUtils.toString(response.getEntity());
                System.out.println(string);
            }
            response.close(); // return the connection to the pool
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
0x04 Follow-up and summary
Next I plan to hand-write a crawler that scrapes some real articles. Java crawling leans mainly on Jsoup, which covers what regular expressions and HTML parsing give a Python crawler: fetching HTML and picking its content apart. Since that involves some CSS and HTML details, I will put it in a separate post.
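As a small preview, and assuming the jsoup dependency is on the classpath (the class name JsoupPreview is hypothetical; the URL is reused from the first GET example), a minimal fetch-and-parse sketch might look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupPreview {
    public static void main(String[] args) throws Exception {
        // fetch the page and parse it into a DOM in one call
        Document doc = Jsoup.connect("https://www.itcast.cn").get();
        System.out.println(doc.title());            // the <title> text
        System.out.println(doc.select("a").size()); // count links via a CSS selector
    }
}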