Java: From Zero to Crawler Guru

Let's start with the simplest crawler logic.

The simplest way to fetch a page really is just this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class Test {
    // Fetch the page at the given URL and parse it into a jsoup Document
    public static void Get_Url(String url) {
        try {
            Document doc = Jsoup.connect(url)
                    //.data("query", "Java")
                    //.userAgent("your User-Agent string")
                    //.cookie("auth", "token")
                    //.timeout(3000)
                    //.post()
                    .get();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

That is only a function,

so add this below it:

    // main function
    public static void main(String[] args) {
        String url = "...";
        Get_Url(url);
    }

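Ha, done.

That's a whole crawler right there.

Amazing.

But all it gets you is the raw HTML of the page,

and nothing has been filtered yet.

So let's filter it: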

    // Also needs: import org.jsoup.nodes.Element; import org.jsoup.select.Elements;
    public static void Get_Url(String url) {
        try {
            Document doc = Jsoup.connect(url)
                    //.data("query", "Java")
                    //.userAgent("your User-Agent string")
                    //.cookie("auth", "token")
                    //.timeout(3000)
                    //.post()
                    .get();

            // The element whose id is "content" (null if the page has none)
            Element content = doc.getElementById("content");
            if (content == null) {
                System.out.println("No element with id \"content\" in " + url);
                return;
            }
            // Every <a>...</a> element inside it
            Elements links = content.getElementsByTag("a");
            //Elements links = doc.select("a[href]");
            // Images whose src ends in .png
            Elements pngs = doc.select("img[src$=.png]");
            // The first div with class "masthead"
            Element masthead = doc.select("div.masthead").first();

            for (Element link : links) {
                // The URL in the <a> tag's href attribute
                String linkHref = link.attr("href");
                // The text between <a> and </a>
                String linkText = link.text();
                System.out.println(linkText);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Now let's use the code above to parse my cnblogs blog.

What it extracts is the content between <a>...</a>.

Looks pretty good, right?
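
One thing the loop above skips over: the href values it reads are often relative (something like /about or p/123.html), so they cannot be fetched as they are. Here is a minimal sketch of turning them into absolute URLs using only java.net.URI from the standard library. The class name LinkResolver, the method resolve and the sample paths are made up for illustration and are not from the original code. jsoup can do the same job with link.absUrl("href") when the document was fetched with a base URI.

```java
import java.net.URI;

public class LinkResolver {

    // Resolve an href found in a page against the URL the page was fetched from.
    // Returns null when the href is not a syntactically valid URI.
    public static String resolve(String pageUrl, String href) {
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://www.cnblogs.com/TTyb/";
        // Hypothetical hrefs, only to show the three common cases
        System.out.println(resolve(page, "p/123.html"));           // relative to the page
        System.out.println(resolve(page, "/about"));               // relative to the site root
        System.out.println(resolve(page, "http://example.com/x")); // already absolute
    }
}
```

Returning null for a malformed href keeps the calling loop simple: it can just skip those links.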

-------------------------------I am one awesome divider-------------------------------

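There is actually another way to crawl that works even better.

It can fetch pages in bulk and save them locally.

Save everything locally first, then pick out what you want with regexes or whatever.

That is a lot more efficient than the approach above.

Here's the code!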

    // Needs: java.io.BufferedInputStream, BufferedOutputStream, File,
    // FileOutputStream, IOException, InputStream and java.net.URL
    // Fetch a page and save it locally as an HTML file
    public static void Save_Html(String url) {
        // Placeholder file name; the folder src/temp_html/ must already exist
        File dest = new File("src/temp_html/" + "saved_page.html");
        // try-with-resources closes every stream, even if the copy fails
        try (InputStream is = new URL(url).openStream();
             // Buffer the byte input stream
             BufferedInputStream bis = new BufferedInputStream(is);
             // Buffer the byte output stream
             BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(dest))) {

            int length;
            byte[] bytes = new byte[1024 * 20];
            while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
                // Write through the buffered stream, not the raw file stream
                bos.write(bytes, 0, length);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

This method saves the HTML straight into the folder src/temp_html/.
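
Since the file name in the method is fixed, every page saved this way overwrites the previous one, which gets in the way of bulk crawling. One possible fix is to derive the file name from the URL. The sketch below is an assumption about how that could look, not something from the original code: the class HtmlFileNames and the method fromUrl are hypothetical names, and the replacement rule (anything outside letters, digits, dot, underscore and hyphen becomes an underscore) is just one simple choice.

```java
public class HtmlFileNames {

    // Turn a URL into a file name that is safe on common file systems:
    // drop the scheme, replace every other unsafe character with '_',
    // trim trailing underscores, and append ".html".
    public static String fromUrl(String url) {
        String withoutScheme = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        String safe = withoutScheme.replaceAll("[^A-Za-z0-9._-]", "_");
        // A trailing slash in the URL would otherwise leave a dangling underscore
        safe = safe.replaceAll("_+$", "");
        return safe + ".html";
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("http://www.cnblogs.com/TTyb/"));
    }
}
```

The result could then be used as new File("src/temp_html/" + HtmlFileNames.fromUrl(url)). Two different URLs can still map to the same name under this rule (for example a/b and a_b), so a real crawler might append a hash of the URL as well.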

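When crawling pages in bulk,

I always fetch them first and save them as HTML or JSON,

then run regexes or whatever over them and load the results into a database.

Once the data is local, you can do whatever you like with it.

Anti-crawler measures are no longer my problem.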
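
To make the "regex it afterwards" step concrete, here is a rough sketch that pulls href values out of a string of HTML with java.util.regex. The class RegexLinkFilter, the method extractLinks and the sample HTML are all invented for this example. The pattern is deliberately simple and will trip over unusual markup (unquoted attributes, or an attribute such as data-href), so treat it as a quick filter rather than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkFilter {

    // Matches href="..." or href='...' inside an <a ...> tag.
    // Group 1 is the quote character, group 2 is the URL.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Collect every quoted href value found in the given HTML text, in order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page content; in practice this would be read from a saved file
        String html = "<div><a href=\"http://a.com/1\">one</a>"
                + " <A class='x' HREF='/two'>two</A></div>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```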

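Both methods above run into the same problem: the request comes back with an error.

That error means

this way of crawling is too crude.

Most sites block it.

So you need to add a header,

namely the UA (User-Agent).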

For method one, put it right where the User-Agent placeholder is:

.userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

For method two, add it indirectly through a URLConnection:

URL temp = new URL(url);
URLConnection uc = temp.openConnection();
uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");
// Read from uc: temp.openStream() would open a second connection without the header
is = uc.getInputStream();

With the header added, this copes with most sites.
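
The detail that is easy to get wrong in method two is that the header lives on the URLConnection object, so the page has to be read from that same object. Below is a small sketch of a helper that prepares such a connection. The class UaConnection and the method open are hypothetical names, and the 3000 ms timeouts are an assumption borrowed from the timeout(3000) used with jsoup in this post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

public class UaConnection {

    // UA string copied from the post; any browser-like value is set the same way
    public static final String USER_AGENT =
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)";

    // Prepare a connection that carries the User-Agent header.
    // Nothing is sent over the network until getInputStream() or connect() is called.
    public static URLConnection open(String url) {
        try {
            URLConnection uc = URI.create(url).toURL().openConnection();
            uc.setRequestProperty("User-Agent", USER_AGENT);
            uc.setConnectTimeout(3000); // milliseconds
            uc.setReadTimeout(3000);    // milliseconds
            return uc;
        } catch (IOException e) {
            // Covers a malformed URL as well as a failure to create the connection
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection uc = open("http://www.cnblogs.com/TTyb/");
        // Shows the header that will be sent; no request has been made yet
        System.out.println(uc.getRequestProperty("User-Agent"));
    }
}
```

To download the page, read from open(url).getInputStream() rather than calling openStream() on the URL again.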

-------------------------------I am one awesome divider-------------------------------

Once the HTML has been downloaded locally, it still has to be parsed.

Here is how to parse it:

    // Parse the HTML files saved locally
    public static void Get_Localhtml(String path) {

        // The folder that holds the local HTML files
        File file = new File(path);
        // The files in that folder (null if path is not a readable directory)
        File[] array = file.listFiles();
        if (array == null) {
            System.out.println("Not a readable directory: " + path);
            return;
        }

        // Loop over the files
        for (int i = 0; i < array.length; i++) {
            try {
                if (array[i].isFile()) {
                    // File name
                    System.out.println("Parsing file: " + array[i].getName());

                    // Parse the local HTML
                    Document doc = Jsoup.parse(array[i], "UTF-8");
                    // The element whose id is "content"; if it is missing, the
                    // next line throws and the catch below reports the file
                    Element content = doc.getElementById("content");
                    // Every <a>...</a> element inside it
                    Elements links = content.getElementsByTag("a");
                    //Elements links = doc.select("a[href]");
                    // Images whose src ends in .png
                    Elements pngs = doc.select("img[src$=.png]");
                    // The first div with class "masthead"
                    Element masthead = doc.select("div.masthead").first();

                    for (Element link : links) {
                        // The URL in the <a> tag's href attribute
                        String linkHref = link.attr("href");
                        // The text between <a> and </a>
                        String linkText = link.text();
                        System.out.println(linkText);
                    }
                }
            } catch (Exception e) {
                // Report the failure and move on to the next file
                System.out.println("Failed to parse file: " + array[i].getName());
                e.printStackTrace();
            }
        }
    }

The text comes out nice and clean.

And that's it parsed.

Add this to the main function:

    // main function
    public static void main(String[] args) {
        String url = "http://www.cnblogs.com/TTyb/";
        String path = "src/temp_html/";
        Get_Localhtml(path);
    }

Now every HTML file in that folder gets parsed.

All done.

This is what three days of Java and one day of crawling gets you.

-------------------------------I am a happy divider-------------------------------

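Really, the two ways of fetching HTML work best when combined.

I tested both:

method two is not stable enough,

and method one is slow.

So I fixed it myself

by putting method one inside method two's catch block.

When method two fails, method one takes over.

If method one fails as well, the page is simply skipped.

The combined version: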

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class JavaSpider {

    // Fetch a page and save it locally as an HTML file
    public static void Save_Html(String url) {
        // Placeholder file name; the folder src/temp_html/ must already exist
        File dest = new File("src/temp_html/" + "saved_page.html");
        try {
            URL temp = new URL(url);
            URLConnection uc = temp.openConnection();
            uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

            // Read from uc so the User-Agent header is actually sent;
            // try-with-resources closes both buffered streams even if the copy fails
            try (BufferedInputStream bis = new BufferedInputStream(uc.getInputStream());
                 BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(dest))) {
                int length;
                byte[] bytes = new byte[1024 * 20];
                while ((length = bis.read(bytes, 0, bytes.length)) != -1) {
                    // Write through the buffered stream, not the raw file stream
                    bos.write(bytes, 0, length);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("Stream download failed, falling back to jsoup get()");
            // If the method above fails,
            // fetch the page with the method below instead
            try {
                Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")
                        .timeout(3000)
                        .get();

                // FileOutputStream creates the file if needed; false means overwrite
                try (FileOutputStream out = new FileOutputStream(dest, false)) {
                    out.write(doc.toString().getBytes("utf-8"));
                }
            } catch (IOException E) {
                E.printStackTrace();
                System.out.println("jsoup get() failed, please check that the URL is correct");
            }
        }
    }
    
    // Parse the HTML files saved locally
    public static void Get_Localhtml(String path) {

        // The folder that holds the local HTML files
        File file = new File(path);
        // The files in that folder (null if path is not a readable directory)
        File[] array = file.listFiles();
        if (array == null) {
            System.out.println("Not a readable directory: " + path);
            return;
        }

        // Loop over the files
        for (int i = 0; i < array.length; i++) {
            try {
                if (array[i].isFile()) {
                    // File name
                    System.out.println("Parsing file: " + array[i].getName());
                    // File path plus file name
                    //System.out.println("#####" + array[i]);
                    // Same thing: file path plus file name
                    //System.out.println("*****" + array[i].getPath());

                    // Parse the local HTML
                    Document doc = Jsoup.parse(array[i], "UTF-8");
                    // The element whose id is "content"; if it is missing, the
                    // next line throws and the catch below reports the file
                    Element content = doc.getElementById("content");
                    // Every <a>...</a> element inside it
                    Elements links = content.getElementsByTag("a");
                    //Elements links = doc.select("a[href]");
                    // Images whose src ends in .png
                    Elements pngs = doc.select("img[src$=.png]");
                    // The first div with class "masthead"
                    Element masthead = doc.select("div.masthead").first();

                    for (Element link : links) {
                        // The URL in the <a> tag's href attribute
                        String linkHref = link.attr("href");
                        // The text between <a> and </a>
                        String linkText = link.text();
                        System.out.println(linkText);
                    }
                }
            } catch (Exception e) {
                // Report the failure and move on to the next file
                System.out.println("Failed to parse file: " + array[i].getName());
                e.printStackTrace();
            }
        }
    }
    // main function
    public static void main(String[] args) {
        String url = "http://www.cnblogs.com/TTyb/";
        String path = "src/temp_html/";
        // Save the page locally
        Save_Html(url);
        // Parse the locally saved pages
        Get_Localhtml(path);
    }
}
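
Stripped of the I/O details, the combination is a "try the first fetcher, fall back to the second, otherwise skip" pattern. Here is a generic sketch of just that control flow, with no network access so it is easy to try out. The class Fallback and the method firstSuccessful are hypothetical names invented for this example, and the two lambdas in main only stand in for the stream download and the jsoup download.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class Fallback {

    // Run the primary task; if it throws, log the error and try the backup.
    // Returns null when both fail, which the caller can treat as "skip this page".
    public static <T> T firstSuccessful(Callable<T> primary, Callable<T> backup) {
        try {
            return primary.call();
        } catch (Exception first) {
            System.out.println("primary failed: " + first.getMessage());
            try {
                return backup.call();
            } catch (Exception second) {
                System.out.println("backup failed: " + second.getMessage());
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the two download methods: the first always fails here
        String html = Fallback.<String>firstSuccessful(
                () -> { throw new IOException("stream download failed"); },
                () -> "<html>fetched by the backup</html>");
        System.out.println(html);
    }
}
```

Keeping the fallback logic in one place like this means each download method can stay a plain function that either returns the page or throws.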

All in all,

Java has far more ways to write a crawler than Python does.

Java's libraries are freaking insane.