Preface
The text and images in this article are taken from the internet and are for learning and exchange purposes only; they are not for any commercial use. Copyright remains with the original authors. If there is any problem, please contact us promptly so we can deal with it.
The site being scraped belongs to Wanbang International Group (萬邦國際集團). Founded in 2010 and headquartered in Zhengzhou, Henan Province, the group's stated mission is "rooted in agriculture, safeguarding people's livelihoods, serving the whole country". Its business spans the full agricultural value chain: integrated cold-chain logistics for agricultural products, high-efficiency ecological agriculture, fresh-food supermarket chains, cross-border e-commerce, and import/export trade. It has been recognized as a key leading enterprise, one of the national "Top 10 Comprehensive Markets" for agricultural products, a "Star Innovation Space", and an advanced private enterprise in the nationwide "10,000 Enterprises Helping 10,000 Villages" targeted poverty-alleviation program. The Wanbang agricultural products logistics park that the group builds and operates in Zhongmu County has received a cumulative investment of 10 billion RMB, covers 5,000 mu of land, and has 3.5 million square meters of floor space. It hosts more than 6,000 permanent merchants; in 2017 its trade in agricultural and sideline products reached 91.3 billion RMB and 17.2 million tons, among the highest in the country, realizing "buy globally, sell nationwide" for agricultural products.
Its price-information query is a plain GET request, the pages are fairly regular, and the layout is unlikely to change much in the short term, so it is easy to analyze. That is why this site was chosen.
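As a quick illustration of that GET interface, the query can be reproduced with requests by passing the URL parameters explicitly. This is only a minimal sketch; the parameter names (PageNo, ItemName, DateStart, DateEnd) are simply the ones that appear in the query URLs used later in this post:

import requests

# Minimal sketch of the price-query GET request. The parameter names are taken
# from the query URLs used elsewhere in this post.
params = {
    "PageNo": 1,                # result page number, starting from 1
    "ItemName": "白菜",          # product name to search for
    "DateStart": "2017/10/1",   # start of the date range
    "DateEnd": "2020/3/31",     # end of the date range
}
resp = requests.get("http://www.wbncp.com/PriceQuery.aspx", params=params)
print(resp.status_code, len(resp.text))  # quick sanity check of the response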
1. Scraping the data with requests
# _*_ coding:utf-8 _*_
# Developer: 未央
# Created: 2020/4/12 16:03
# File: Scrapy_lab1.py
# IDE: PyCharm
import csv
import requests  # requests handles the HTTP GET calls
from bs4 import BeautifulSoup  # BeautifulSoup parses the returned HTML
from datetime import datetime


class Produce:
    """Fetches and stores the price records for one product category."""

    def __init__(self, category):
        self.item_name = category  # product category name, e.g. 白菜
        self.price_data = []       # list of price rows scraped for this product

    # Read one page of data; page 1 by default.
    def get_price_page_data(self, page_index=1):
        url = ('http://www.wbncp.com/PriceQuery.aspx?PageNo=' + str(page_index)
               + '&ItemName=' + self.item_name
               + '&DateStart=2017/10/1&DateEnd=2020/3/31')
        strhtml = requests.get(url)  # fetch the page with a GET request
        soup = BeautifulSoup(strhtml.text, 'html.parser')  # parse the HTML document
        table_node = soup.find_all('table')
        all_price_table = table_node[21]  # the table holding the price data is the 22nd <table> on the page
        for tr in all_price_table.find_all('tr'):
            price_line = []
            for number, td in enumerate(tr.find_all('td'), start=1):
                if number < 8:
                    # columns 1-7: product name, origin, spec, unit, highest, lowest, average price
                    price_line.append(' '.join(td.get_text().split()))
                elif number == 8:
                    # column 8: trade date, e.g. 2020/3/31, parsed into a datetime
                    price_line.append(datetime.strptime(
                        td.get_text().strip().replace('/', '-'), '%Y-%m-%d'))
            if price_line:  # skip header rows and rows without <td> cells
                self.price_data.append(price_line)

    # Fetch every result page (pages 1 to 33 for this date range).
    def get_price_data(self):
        for i in range(1, 34):
            self.get_price_page_data(i)

    # Write the scraped rows to a CSV file at D:\Data_pytorch\<name>.csv.
    def data_write_csv(self):
        self.get_price_data()
        file_address = "D:\\Data_pytorch\\" + self.item_name + ".csv"
        with open(file_address, 'w', encoding='utf-8', newline='') as file_csv:
            writer = csv.writer(file_csv)
            # header row so the file can be read back with csv.DictReader
            writer.writerow(["品名", "產地", "規格", "單位", "最高價", "最低價", "均價", "日期"])
            for temp_data in self.price_data:
                writer.writerow(temp_data)
        print(self.item_name + ": scraped data saved to file.")

    # Read the CSV file back as a list of dicts from D:\Data_pytorch\<name>.csv.
    def data_reader_csv(self):
        file_address = "D:\\Data_pytorch\\" + self.item_name + ".csv"
        with open(file_address, 'r', encoding='utf-8') as fp:
            # csv.DictReader uses the header row written above as the dict keys
            data_list = [row for row in csv.DictReader(fp)]
        print(self.item_name + " data:")
        print(data_list)
        return data_list


categories = ["白菜", "包菜", "土豆", "菠菜", "蒜苔"]  # cabbage, round cabbage, potato, spinach, garlic shoots
for temp_name in categories:
    produce = Produce(temp_name)
    produce.data_write_csv()
    data = produce.data_reader_csv()
After running, the contents of the output file look like this:
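One fragile spot in the script above is all_price_table = table_node[21]: it picks the price table purely by its position among all <table> elements on the page, which silently breaks the moment the layout changes. A more defensive option is to locate the table by a distinctive cell it contains. The sketch below assumes the price table has a header cell whose text is 品名; that marker is an assumption about the markup and may need adjusting:

from bs4 import BeautifulSoup

def find_price_table(html):
    # Locate the price table by content rather than by index.
    # Assumes the table contains a cell whose text is exactly 品名 (product name);
    # change the marker string if the real page uses a different header.
    soup = BeautifulSoup(html, 'html.parser')
    for table in soup.find_all('table'):
        if table.find(string=lambda s: s and s.strip() == '品名'):
            return table
    return None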
2. Scraping the data with Scrapy
This follows the same pattern as the earlier study cases, so instead of walking through it step by step, here is the code directly.
The items.py code is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class PriceSpiderItemLoader(ItemLoader):
    # Custom ItemLoader used to populate the fields scraped by the spider.
    default_output_processor = TakeFirst()


class PriceSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()        # product name
    address = scrapy.Field()     # place of origin
    norms = scrapy.Field()       # specification
    unit = scrapy.Field()        # unit
    high = scrapy.Field()        # highest price
    low = scrapy.Field()         # lowest price
    price_ave = scrapy.Field()   # average price
    price_date = scrapy.Field()  # trade date
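The only non-obvious piece here is default_output_processor = TakeFirst(). Scrapy selectors always return lists of matches, and TakeFirst collapses each field to its first non-empty value, so the exported items hold plain strings instead of one-element lists. A tiny self-contained illustration (not part of the project code):

from scrapy.loader.processors import TakeFirst

# TakeFirst returns the first value in the list that is neither None nor an
# empty string, which is why each item field ends up as a single plain value.
take_first = TakeFirst()
print(take_first(["白菜", "包菜"]))    # -> 白菜
print(take_first(["", None, "2.5"]))  # -> 2.5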
The settings.py code is as follows:
# -*- coding: utf-8 -*-

# Scrapy settings for price_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy.exporters import JsonLinesItemExporter


# By default the exported Chinese text shows up as hard-to-read Unicode escapes.
# Define a subclass that writes the original characters instead
# (simply set the parent class's ensure_ascii attribute to False).
class CustomJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(CustomJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)


# Enable the newly defined exporter class.
FEED_EXPORTERS = {
    'json': 'price_spider.settings.CustomJsonLinesItemExporter',
}

BOT_NAME = 'price_spider'

SPIDER_MODULES = ['price_spider.spiders']
NEWSPIDER_MODULE = 'price_spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'price_spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
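Note that the FEED_EXPORTERS entry above only replaces the exporter registered for the json feed format, so it takes effect when exporting to a .json file (for example scrapy crawl spider -o price_data.json, using the spider name defined in spider.py below); the CSV export used later in this post goes through Scrapy's built-in CSV exporter and is unaffected. Setting ensure_ascii=False is what keeps the exported Chinese text readable instead of \uXXXX escapes.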
The spider logic (spider.py) is as follows:
# _*_ coding:utf-8 _*_
# Developer: 未央
# Created: 2020/4/16 14:55
# File: spider.py
# IDE: PyCharm
import scrapy

from price_spider.items import PriceSpiderItemLoader, PriceSpiderItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['www.wbncp.com']
    start_urls = ['http://www.wbncp.com/PriceQuery.aspx?PageNo=1&ItemName=%e7%99%bd%e8%8f%9c'
                  '&DateStart=2017/10/1&DateEnd=2020/3/31',
                  'http://www.wbncp.com/PriceQuery.aspx?PageNo=1&ItemName=土豆'
                  '&DateStart=2017/10/1&DateEnd=2020/3/31',
                  'http://www.wbncp.com/PriceQuery.aspx?PageNo=1&ItemName=芹菜'
                  '&DateStart=2017/10/1&DateEnd=2020/3/31']

    def parse(self, response):
        # Each price record sits in a <tr> with class "Center" or "Center Gray".
        item_nodes = response.xpath("//tr[@class='Center' or @class='Center Gray']")
        for item_node in item_nodes:
            item_loader = PriceSpiderItemLoader(item=PriceSpiderItem(), selector=item_node)
            item_loader.add_css("name", "td:nth-child(1) ::text")        # product name
            item_loader.add_css("address", "td:nth-child(2) ::text")     # place of origin
            item_loader.add_css("norms", "td:nth-child(3) ::text")       # specification
            item_loader.add_css("unit", "td:nth-child(4) ::text")        # unit
            item_loader.add_css("high", "td:nth-child(5) ::text")        # highest price
            item_loader.add_css("low", "td:nth-child(6) ::text")         # lowest price
            item_loader.add_css("price_ave", "td:nth-child(7)::text")    # average price
            item_loader.add_css("price_date", "td:nth-child(8)::text")   # trade date
            price_item = item_loader.load_item()
            yield price_item

        # The 10th <a> inside the pager is the "next page" link; follow it if present.
        next_page = response.xpath("//*[@id='cphRight_lblPage']/div/a[10]/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
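A small caveat on the pagination: the XPath above assumes the "next page" link is always the 10th <a> inside #cphRight_lblPage, which breaks if the pager renders a different number of links (for example on the first or last result pages). A more defensive variant, meant as a drop-in replacement for the last few lines of parse(), selects the link by its text instead; the link text 下一页 is an assumption about the site's pager markup and may need adjusting:

# Hypothetical, more defensive next-page lookup: pick the pager link by its
# text rather than by its position.
next_page = response.xpath(
    "//*[@id='cphRight_lblPage']//a[contains(text(), '下一页')]/@href"
).extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)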
The launcher script (price_scrapy_main.py), which replaces typing the crawl command manually, is as follows:
# _*_ coding:utf-8 _*_
# Developer: 未央
# Created: 2020/4/16 14:55
# File: price_scrapy_main.py
# IDE: PyCharm
from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl spider -o price_data.csv" on the command line.
execute(["scrapy", "crawl", "spider", "-o", "price_data.csv"])
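An alternative to going through scrapy.cmdline is to run the spider in-process with CrawlerProcess. This is only a sketch under a couple of assumptions: that the module path price_spider.spiders.spider matches the project layout described above, and that the older FEED_FORMAT / FEED_URI settings are acceptable (recent Scrapy versions prefer the FEEDS dictionary instead):

# Hypothetical alternative launcher that runs the spider in-process.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from price_spider.spiders.spider import SpiderSpider

settings = get_project_settings()           # picks up settings.py, including the custom exporter
settings.set("FEED_FORMAT", "csv")          # export format (use FEEDS in newer Scrapy versions)
settings.set("FEED_URI", "price_data.csv")  # output file

process = CrawlerProcess(settings)
process.crawl(SpiderSpider)
process.start()  # blocks until the crawl finishes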
After the run, importing the CSV data into Excel gives the following result:
3. Takeaways
1. Plain requests is certainly flexible, but once you need to scrape a lot of data it becomes inconvenient and the code gets long; Scrapy is the more comfortable choice. It is especially strong when crawling many pages, since it handles both horizontal crawling (page to page) and vertical crawling (into each record) remarkably well.
2. With Scrapy, the work is mainly in the settings file (settings.py) and the spider logic (spider.py in this article); the trickiest part is getting the selectors right.