Task
Crawl the product details from https://www.aliexpress.com/wholesale?SearchText=cartoon+case&d=y&origin=n&catId=0&initiative_id=SB_20200523214041. Since the page loads asynchronously, Selenium is needed to drive a browser and collect the product URLs; but locating page elements directly with Selenium is slow, so it is combined with re or BeautifulSoup to make the crawl more efficient.
Simulated login
Use Selenium to simulate the login and obtain the cookies once the login succeeds.
from selenium import webdriver
import time

def login(username, password, driver=None):
    driver.get('https://login.aliexpress.com/')
    driver.maximize_window()
    name = driver.find_element_by_id('fm-login-id')
    name.send_keys(username)
    name1 = driver.find_element_by_id('fm-login-password')
    name1.send_keys(password)
    submit = driver.find_element_by_class_name('fm-submit')
    time.sleep(1)
    submit.click()
    return driver

browser = webdriver.Chrome()
browser = login('[email protected]', 'ab123456', browser)
browser.get('https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page=')
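The function above only returns the logged-in driver; the cookies themselves are not actually extracted anywhere in this post. If they are needed, for example to hand the session over to plain requests, a minimal sketch (the requests hand-off is an assumption, not something the task requires):

import requests  # assumed extra dependency, not used elsewhere in this post

# get_cookies() is the standard Selenium WebDriver call for reading the
# current session's cookies as a list of dicts.
cookies = {c['name']: c['value'] for c in browser.get_cookies()}
session = requests.Session()
session.cookies.update(cookies)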
The site is not strict about its users: registering with an email address does not even require verification, and a throwaway address can be generated at http://www.fakemailgenerator.com/
In fact, the crawler was later run without logging in at all, and no anti-crawling measures were hit over ten pages of scraping.
Getting the product detail page URLs
The problem to solve here is that the page is loaded asynchronously with AJAX: not all of the data is loaded when the page is opened; new data is fetched and rendered as you scroll down, so requests cannot read the full page source in one go. The idea is to use Selenium to simulate the scrolling so that a whole page of results gets loaded, and then, for the moment, still use Selenium to locate the elements.
After logging in, opening the page the task needs brings up an ad popup, which has to be closed first:
def close_win(browser):
    time.sleep(10)
    try:
        closewindow = browser.find_element_by_class_name('next-dialog-close')
        browser.execute_script("arguments[0].click();", closewindow)
    except Exception as e:
        print(f"searchKey: there is no suspond Page1. e = {e}")
    return browser
Simulate the scrolling and collect the URLs of all products on one page:
def get_products(browser):
    wait = WebDriverWait(browser, 1)
    for i in range(30):
        browser.execute_script('window.scrollBy(0,230)')
        time.sleep(1)
        products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-info")))
        if len(products) >= 60:
            break
        else:
            print(len(products))
            continue
    products = browser.find_elements_by_class_name('product-info')
    return products
A senior student later pointed out that none of this is necessary: although the listings on the search page are only rendered by JavaScript as you scroll, the product data is actually already written into the HTML document and can be obtained like this:
import re
import json

url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
driver = webdriver.Chrome()
driver.get(url)
info = re.findall(r'window.runParams = (\{.*\})', driver.page_source)[-1]
infos = json.loads(info)
items = infos['items']
From there the individual fields can be matched out one by one.
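As a rough illustration, each entry of items should be a dict describing one listing; the key names used below (productDetailUrl, title) are guesses and need to be checked against an actual dump of infos['items']:

# Hypothetical field extraction: the real key names depend on the JSON that
# AliExpress embeds in window.runParams at crawl time.
for item in items:
    detail_url = item.get('productDetailUrl')  # assumed key name
    title = item.get('title')                  # assumed key name
    print(detail_url, title)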
Getting the details from each product page
The issue in this part is that there are a lot of pages to crawl, and continuing to use Selenium element lookups would make the crawler very slow; besides, the data on the product pages does not seem to be returned asynchronously. The solution is to open the product page with Selenium, download the whole page source, and then match the fields with regular expressions:
def get_pro_info(product):
    url = product.find_element_by_class_name('item-title').get_attribute('href')
    driver = webdriver.Chrome()
    driver.get(url)
    page = driver.page_source
    driver.close()
    material = re.findall(r'"skuAttr":".*?#(.*?);', page)
    color = re.findall(r'skuAttr":".*?#.*?#(.*?)"', page)
    stock = re.findall(r'skuAttr":".*?"availQuantity":(.*?),', page)
    price = re.findall(r'skuAttr":".*?"actSkuCalPrice":"(.*?)"', page)
    pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
    titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
    video = re.findall(r'id="item-video" src="(.*?)"', page)
    return material, color, stock, price, pics, titles, video
Connecting to MySQL
The crawled data has to be stored in a database, so MySQL is used here; the crawl database and the SKU table were created in advance:
conn = pymysql.connect(host='localhost', user='root', password='ab226690',db='crawl')
mycursor = conn.cursor()
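The schema itself is not shown in this post; judging from the INSERT statements used below, the three tables presumably look roughly like the following sketch (the column names come from the code, the types and lengths are guesses):

# Guessed table definitions; adjust the types to match the real schema.
ddl = [
    """CREATE TABLE IF NOT EXISTS SKU (
           skuID    VARCHAR(32) PRIMARY KEY,
           material VARCHAR(255),
           color    VARCHAR(255),
           stock    INT,
           price    DECIMAL(10,2),
           url      VARCHAR(64)
       )""",
    """CREATE TABLE IF NOT EXISTS image (
           url   VARCHAR(64),
           color VARCHAR(255),
           img   VARCHAR(512)
       )""",
    """CREATE TABLE IF NOT EXISTS product (
           url          VARCHAR(64),
           product_name VARCHAR(512),
           rating       VARCHAR(16),
           reviews      VARCHAR(16),
           video        VARCHAR(512),
           shipping     VARCHAR(32)
       )""",
]
for statement in ddl:
    mycursor.execute(statement)
conn.commit()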
The array data is written row by row in a loop. One gotcha here is that pymysql's parameter conversion is not exactly the same as Python's: every parameter can be bound with '%s', and there is no need to use integer or float placeholders for numeric columns:
# Write the SKU table
sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"  # this is the spot: even though some values are numeric, %s is still used for all of them
for i in range(len(skuID)):
    if titles:
        params = (skuID[i], material[i], color[i], stock[i], price[i], url)
    else:
        params = (skuID[i], material[i], ' ', stock[i], price[i], url)
    try:
        mycursor.execute(sql, params)
        conn.commit()
    except IntegrityError:
        # IntegrityError should be raised on a duplicate primary key; the intent is to
        # skip that record, but in practice this version still crashed. The lazy
        # workaround is to drop the primary key, which is not really reasonable;
        # to be updated once a proper fix is found.
        conn.rollback()
        continue
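One likely reason the except clause above never fires is that IntegrityError is imported from sqlalchemy.exc in the complete code below, while pymysql raises its own pymysql.err.IntegrityError, so the duplicate-key error falls through uncaught. A minimal sketch of two ways to handle duplicates without dropping the primary key, assuming skuID really is the SKU table's primary key:

import pymysql

# Option 1: catch the exception class that pymysql actually raises.
try:
    mycursor.execute(sql, params)
    conn.commit()
except pymysql.err.IntegrityError:
    conn.rollback()  # duplicate skuID: skip this record

# Option 2: let MySQL ignore duplicate keys itself, so nothing is raised at all.
sql_ignore = ("INSERT IGNORE INTO SKU(skuID,material,color,stock,price, url) "
              "VALUES (%s,%s,%s,%s,%s,%s)")
mycursor.execute(sql_ignore, params)
conn.commit()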
Another problem that came up when writing the data is that re returns an empty list when nothing matches, which cannot be written to MySQL and raises an error, so each variable to be written has to be checked and given a suitable default if it is an empty list:
sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)" if rating: pass else: rating = '0.0' if review: pass else: review = '0' if video: pass else: video = ' ' if shipping: pass else: shipping = '0.0' params = (url, pro_name, rating,review, video, shipping) mycursor.execute(sql,params)
Commit and close the database connection:
conn.commit()
conn.close()
Speeding things up
Besides fetching pages with Selenium and then switching to re for matching, as described above, there is another easy way to make the crawler more efficient:
browser = webdriver.Chrome()
browser.get(source_url)
browser = close_win(browser)
Repeatedly instantiating and closing browser drivers like this is very time-consuming, so the site should be visited with as few browser windows as possible.
In this task only two webdrivers are instantiated: one for the listing pages that show many products and one for the product detail pages. The trick is simply to keep these two drivers around after instantiating them and reuse them to get() every new page, whereas the original code initialised a fresh webdriver for every page it opened. After this change the running time dropped by half.
def scratch_page(source_url):
    browser = webdriver.Chrome()
    browser.get(source_url)
    browser.maximize_window()
    browser = close_win(browser)
    pros = get_products(browser)
    # Browser reserved for the product detail pages
    browser2 = webdriver.Chrome()
    error_file = open('ERROR.txt', 'a+', encoding='utf8')
    for pro in pros:
        # get_pro_info is the earlier function with a small modification: it reuses browser2
        url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review = get_pro_info(pro, browser2)
        if len(skuID) != len(color):
            error_file.write('url:' + url + '\n')
            continue
        save_data_to_sql(url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review)
    error_file.close()
    browser.close()
    browser2.close()
Complete code
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pymysql
from sqlalchemy.exc import IntegrityError  # catch the duplicate-primary-key exception

def login(username, password, driver=None):
    driver.get('https://login.aliexpress.com/')
    driver.maximize_window()
    name = driver.find_element_by_id('fm-login-id')
    name.send_keys(username)
    name1 = driver.find_element_by_id('fm-login-password')
    name1.send_keys(password)
    submit = driver.find_element_by_class_name('fm-submit')
    time.sleep(1)
    submit.click()
    return driver

def close_win(browser):
    time.sleep(5)
    try:
        closewindow = browser.find_element_by_class_name('next-dialog-close')
        closewindow.click()
    except Exception as e:
        print(f"searchKey: there is no suspond Page1. e = {e}")
    return browser

def get_products(browser):
    wait = WebDriverWait(browser, 1)
    for i in range(30):
        browser.execute_script('window.scrollBy(0,230)')
        time.sleep(1)
        products = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "product-info")))
        if len(products) >= 60:
            break
        else:
            continue
    products = browser.find_elements_by_class_name('product-info')
    return products

def get_pro_info(product, driver):
    url = product.find_element_by_class_name('item-title').get_attribute('href')
    driver.get(url)
    time.sleep(0.5)
    page = driver.page_source
    material = re.findall(r'"skuAttr":".*?#(.*?);', page)
    color = re.findall(r'"skuAttr":".*?#.*?#(.*?)"', page)
    stock = re.findall(r'"skuAttr":".*?"availQuantity":(.*?),', page)
    price = re.findall(r'"skuAttr":".*?"skuCalPrice":"(.*?)"', page)
    pics = re.findall(r'<div class="sku-property-image"><img class="" src="(.*?)"', page)
    titles = re.findall(r'<img class="" src=".*?" title="(.*?)">', page)
    video = re.findall(r'id="item-video" src="(.*?)"', page)
    skuID = re.findall(r'"skuId":(.*?),', page)
    pro_name = re.findall(r'"product-title-text">(.*?)</h1>', page)
    rating = re.findall(r'itemprop="ratingValue">(.*?)</span>', page)
    shipping = re.findall(r'<span class="bold">(.*?) ', page)
    review = re.findall(r'"reviewCount">(.*?) Reviews</span>', page)
    # When the product has no colour options the page structure changes and the fields must be re-matched
    if titles:
        pass
    else:
        material = re.findall(r'"skuAttr":".*?#(.*?)"', page)
        color = []
        pics = re.findall(r'"imagePathList":\["(.*?)",', page)
    return url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review

def save_data_to_sql(url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review):
    url = re.findall('/item/(.*?).html', url)
    conn = pymysql.connect(host='localhost', user='root', password='ab226690', db='crawl')
    mycursor = conn.cursor()
    # Write the SKU table
    sql = "INSERT INTO SKU(skuID,material,color,stock,price, url) VALUES (%s,%s,%s,%s,%s,%s)"
    for i in range(len(skuID)):
        if titles:
            params = (skuID[i], material[i], color[i], stock[i], price[i], url)
        else:
            params = (skuID[i], material[i], ' ', stock[i], price[i], url)
        try:
            mycursor.execute(sql, params)
            conn.commit()
        except IntegrityError:
            conn.rollback()
            continue
    # Write the image table
    sql = "INSERT INTO image(url, color, img) VALUES (%s,%s,%s)"
    i = 0
    if titles:
        for i in range(len(titles)):
            params = (url, titles[i], pics[i])
            try:
                mycursor.execute(sql, params)
                conn.commit()
            except IntegrityError:
                conn.rollback()
                continue
    else:
        params = (url, ' ', pics)
        try:
            mycursor.execute(sql, params)
            conn.commit()
        except IntegrityError:
            conn.rollback()
    # Write the product table
    sql = "INSERT INTO product(url, product_name, rating, reviews, video, shipping) VALUES (%s,%s,%s,%s,%s,%s)"
    if rating:
        pass
    else:
        rating = '0.0'
    if review:
        pass
    else:
        review = '0'
    if video:
        pass
    else:
        video = ' '
    if shipping:
        pass
    else:
        shipping = '0.0'
    params = (url, pro_name, rating, review, video, shipping)
    mycursor.execute(sql, params)
    conn.commit()
    conn.close()

def scratch_page(source_url):
    browser = webdriver.Chrome()
    browser.get(source_url)
    browser.maximize_window()
    browser = close_win(browser)
    pros = get_products(browser)
    # Browser reserved for the product detail pages
    browser2 = webdriver.Chrome()
    error_file = open('ERROR.txt', 'a+', encoding='utf8')
    for pro in pros:
        url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review = get_pro_info(pro, browser2)
        if len(skuID) != len(color):
            error_file.write('url:' + url + '\n')
            continue
        save_data_to_sql(url, pro_name, skuID, material, color, stock, price, pics, titles, video, rating, shipping, review)
    error_file.close()
    browser.close()
    browser2.close()

url = 'https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&SearchText=cartoon+case&ltype=wholesale&SortType=default&page='
for p in range(1, 11):
    url_ = url + str(p)
    start_time = time.time()
    scratch_page(url_)
    end_time = time.time()
    print('Page ' + str(p) + ' crawled successfully')
    print('Page ' + str(p) + ' took ' + str(end_time - start_time) + 's')