Web scraping with selenium


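Basic selenium operations

Concept: selenium is a module for browser automation
- Appium: a comparable module for automating mobile apps
Installation
- pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple
How does it relate to web scraping?
- It can simulate logging in
- It conveniently captures dynamically loaded data (whatever is visible on the page can be retrieved)
Basic operations
- Import: from selenium import webdriver (web = browser, driver = driver program)
- A driver program for the target browser must be provided (Chrome, Firefox...)
  - Chrome driver download page
- Instantiate a browser object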
```
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Chrome: the Google Chrome browser; executable_path: path to the browser driver
```
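- Locating tags
  - the find family of functions
- tag_object.send_keys(): type data into the given tag
- submit_tag.click(): click the submit tag
- JS injection: browser_object.execute_script("JS code")
- browser_object.page_source: returns the source of the current page, including dynamically loaded data
- Close the browser: browser_object.quit()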
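
The operations listed above can be wrapped into small helpers. The sketch below is only an illustration: `open_chrome` and `type_and_submit` are hypothetical names, and it assumes a newer Selenium 4 release, where the `find_element_by_*` helpers and the `executable_path` argument are no longer available and `find_element(by, value)` together with a `Service` object is used instead.

```python
def open_chrome(driver_path="./chromedriver.exe"):
    """Start Chrome through a Service object (Selenium 4 style)."""
    # selenium is imported lazily so that type_and_submit below can be
    # used and tested on a machine without selenium or a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(driver_path))


def type_and_submit(driver, box_id, button_xpath, text):
    """Fill a text box, click a button, return the rendered page source."""
    # "id" and "xpath" are the string values behind By.ID and By.XPATH.
    driver.find_element("id", box_id).send_keys(text)
    driver.find_element("xpath", button_xpath).click()
    return driver.page_source
```

A typical call would be `bro = open_chrome()`, then `bro.get(url)`, then `type_and_submit(bro, 'key', button_xpath, 'mac pro')`, and finally `bro.quit()`.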
Drawback
- Scraping is comparatively slow
When to use selenium
- When the data is loaded dynamically and cannot be fetched with the requests module

Example code

Open JD.com and search for a product

from selenium import webdriver
from time import sleep

# Instantiate the browser object
bro = webdriver.Chrome(executable_path='./chromedriver.exe')   # Chrome: Google Chrome; executable_path: path to the browser driver
# Define the automated actions

# Send the request
bro.get('https://www.jd.com')
# Locate the search box tag
search_tag = bro.find_element_by_id('key')
# Type data into the text box
search_tag.send_keys('mac pro')
sleep(2)
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Inject JS code: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
# page_source returns the source of the current page, including dynamically loaded data
print(bro.page_source)

# Close the browser
bro.quit()

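Case study: capturing dynamically loaded data from the drug administration (藥監總局) site with selenium

The data on this site is loaded dynamically, which makes it a good test of how conveniently selenium captures such data
URL: http://125.35.6.84:81/xk/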

from selenium import webdriver
from time import sleep
from lxml import etree

# Instantiate the browser object
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Send the request
bro.get('http://125.35.6.84:81/xk/')
sleep(1)
# Page source of the first page
page_text = bro.page_source
all_page_text = [page_text]
for i in range(1,5):
    # Find the "next page" tag
    a_tag = bro.find_element_by_xpath('//*[@id="pageIto_next"]')
    # Click the "next page" tag
    a_tag.click()
    sleep(1)
    # page_source returns the current page source (including dynamically loaded data)
    page_text = bro.page_source
    all_page_text.append(page_text)
for page_text in all_page_text:
    tree = etree.HTML(page_text)
    # Use xpath to find the tags that hold each name
    li_lst = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_lst:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
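
The click-then-read loop in the listing above can be factored into a reusable function. This is a minimal sketch rather than the author's method: `collect_pages` and its parameters are hypothetical names, and it uses the generic `find_element("xpath", ...)` call, where `"xpath"` is the string value of `By.XPATH`.

```python
from time import sleep


def collect_pages(driver, next_xpath, extra_pages, pause=1.0):
    """Return the source of the current page plus `extra_pages` further pages."""
    sources = [driver.page_source]
    for _ in range(extra_pages):
        # Click the "next page" control, then give the page time to render.
        driver.find_element("xpath", next_xpath).click()
        sleep(pause)
        sources.append(driver.page_source)
    return sources
```

With the site above, `collect_pages(bro, '//*[@id="pageIto_next"]', 4)` would gather the same five page sources as the listing.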

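Action chains

Action chain: a series of consecutive actions

Import: from selenium.webdriver import ActionChains
NoSuchElementException: the specified tag could not be located
- This can happen when the tag lives inside a nested child page (an iframe). To locate a tag inside the child page:
  - browser_object.switch_to.frame('id attribute of the iframe tag'): switch the current browser context into the given child page
Instantiate an action chain object for the given browser
- action = ActionChains(bro)
Click and hold the given tag
- action.click_and_hold(tagName)
Moving
- action.move_by_offset(xoffset, yoffset): move a little at a time
- action.move_to_element(to_element)
- action.move_to_element_with_offset(to_element, xoffset, yoffset)
move_call.perform(): run the action chain immediately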
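
The calls above can be combined into one gradual drag. The following is a sketch under assumptions: `drag_in_steps` is a hypothetical helper, `chain` is expected to be an `ActionChains(bro)` object, and the whole sequence (press, small moves, release) is queued first and executed by a single `perform()` call.

```python
def drag_in_steps(chain, element, steps, dx, dy, pause=0.0):
    """Queue press, `steps` small moves and release on `chain`, then run them."""
    chain.click_and_hold(element)
    for _ in range(steps):
        chain.move_by_offset(dx, dy)
        if pause:
            # Wait between moves so the drag happens gradually.
            chain.pause(pause)
    chain.release()
    # Nothing happens in the browser until perform() is called.
    chain.perform()
```

If the element sits inside an iframe, `bro.switch_to.frame(...)` has to be called before locating the element and building the chain.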

Example code

A tag nested inside a child page, using the 菜鳥教程 (runoob.com) demo page

from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

bro = webdriver.Chrome("./chromedriver.exe")
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

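# Locate the tag: it lives inside an iframe, so switch into that frame first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')

# Build the desired behavior with ActionChains

# Instantiate an action chain object for the current browser page
action = ActionChains(bro)
# Click and hold the given tag
action.click_and_hold(div_tag)

for i in range(1,7):
    # Move a little at a time
    # (alternatives: action.move_to_element, action.move_to_element_with_offset)
    action.move_by_offset(10,15).perform()  # perform() runs the action chain immediately
    sleep(0.5)

# Release the mouse button, then close the browser
action.release().perform()
bro.quit()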

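Headless browsers

Concept: a browser without a visible user interface
PhantomJS is a headless browser that is no longer updated or maintained and is rarely used now

Headless Chrome

This is the Chrome installed on the local machine; a few settings in code turn it into a headless browser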

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Set up the headless browser
# Instantiate an Options object
chrome_options = Options()
# Call add_argument to apply custom settings
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Pass the settings with options= (the chrome_options= keyword is deprecated)
bro = webdriver.Chrome(executable_path="./chromedriver.exe",options=chrome_options)
bro.get('https://www.baidu.com')
# Take a screenshot
bro.save_screenshot('./1.png')
print(bro.page_source)
bro.quit()

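Evading detection

How does a web server detect that a request comes from selenium?
- The site's JavaScript can read window.navigator.webdriver; you can check the value yourself in the Console of the browser's developer tools
  - true: the request was made through selenium (abnormal request)
  - undefined: the request was made by a regular browser (normal request)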
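
The same check can be run from Python through `execute_script`. This is a minimal sketch with a hypothetical helper name, `reports_webdriver`; it treats any falsy value as "not flagged", which covers `undefined` (returned to Python as `None`) and also `false`, which some browser builds may report instead.

```python
def reports_webdriver(driver):
    """Return True when the page sees navigator.webdriver as truthy."""
    # execute_script needs an explicit `return` to hand a value back to
    # Python; a JavaScript `undefined` arrives as None.
    value = driver.execute_script("return window.navigator.webdriver")
    return bool(value)
```

Calling `reports_webdriver(bro)` right after `bro.get(url)` shows what the site's own scripts would see for that session.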
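Environment setup
- Add the directory that contains the local Chrome program (chrome.exe) to the PATH environment variable
- Use the local Chrome program to start a browser
  - chrome.exe --remote-debugging-port=9222 --user-data-dir="D:\selenum\AutomationProfile"
    
    9222: the port (any free port)
    
    "D:\selenum\AutomationProfile": an empty directory that already exists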

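Using the takeover mechanism

When selenium attaches to a browser started this way, evaluating window.navigator.webdriver in the Console still returns true, but Chrome does not show the "disable developer mode extensions" prompt, so the session looks like a browser you opened yourself

# First run the following command in a terminal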
chrome.exe --remote-debugging-port=9222 --user-data-dir="D:\selenum\AutomationProfile"

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_experimental_option('debuggerAddress','127.0.0.1:9222')

# The code takes over the browser that is already open instead of starting a new one
driver = webdriver.Chrome(executable_path="./chromedriver.exe",options=chrome_options)
driver.get('http://www.taobao.com')

Evasion technique for older selenium versions
- This approach is currently detected

from selenium import webdriver
from selenium.webdriver import ChromeOptions
 
option = ChromeOptions()     # instantiate a ChromeOptions object
option.add_experimental_option('excludeSwitches', ['enable-automation'])  # add the setting as a key-value pair

bro = webdriver.Chrome(executable_path='./chromedriver.exe',options=option)  # passing options here was meant to make window.navigator.webdriver return undefined

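Simulated login

Simulated login to 12306

URL: 12306 login page
Analysis:
- The captcha image to recognize must be cropped from a screenshot of the page and saved locally
  - Each login attempt is tied one-to-one to a single captcha image, so fetching the image again would yield a different captcha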
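
Cropping the captcha out of a full-page screenshot comes down to computing a box from the element's `location` and `size`. The sketch below is an illustration only: `crop_box` is a hypothetical helper, and `scale` is an assumed parameter for displays where screenshot pixels and page coordinates differ, which is one possible cause of crops that come out shifted.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, upper, right, lower), the box format Image.crop expects."""
    left = location["x"] * scale
    upper = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    lower = (location["y"] + size["height"]) * scale
    # Image.crop works on whole pixels, so round each edge to an int.
    return tuple(int(round(v)) for v in (left, upper, right, lower))
```

The result can be passed straight to Pillow, for example `Image.open('./12306.png').crop(crop_box(img_tag.location, img_tag.size))`.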
Log in by recognizing the captcha with Chaojiying (超級鷹), captcha type 9004

# Chaojiying client code
import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

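# Wrap captcha recognition in a function
def transform_code(imgPath,imgType):
    chaojiying = Chaojiying_Client('Chaojiying username', 'password for that username', 'software ID')
    # Read the image bytes and close the file afterwards
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']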


# Simulated login implementation

from time import sleep
from PIL import Image  # pip install Pillow
from selenium import webdriver
from selenium.webdriver import ActionChains

# Instantiate a Chrome browser object
bro = webdriver.Chrome(executable_path="./chromedriver.exe")
# Send the request
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
# The login page shows the QR code first; click the account/password login tab
bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a').click()
sleep(2) # wait 2 seconds for the captcha image to load
# Locate the username and password boxes and fill them in
bro.find_element_by_id('J-userName').send_keys('xxxxxxxx')  # 12306 username
bro.find_element_by_id('J-password').send_keys('********')  # password for that 12306 username

# Clicking the captcha
bro.save_screenshot('./12306.png')  # save the whole page as a local image
# Locate the captcha image tag
img_tag = bro.find_element_by_id('J-loginImg')
# Position and size of the captcha
location = img_tag.location
size = img_tag.size

# Crop box; the -65 and -49 corrections were tuned by hand against the author's own screenshot, giving (699, 284, 1015, 472), so adjust them for yours
rangle = (int(location['x'])-65,int(location['y']),int(location['x']+size['width'])-49,int(location['y']+size['height']))

# Crop the captcha out of the screenshot with the Image class, using rangle as the box
i = Image.open('./12306.png')  # the full-page screenshot as an Image object
frame = i.crop(rangle)  # the cropped captcha image
frame.save('./code.png')
img_coor = transform_code('./code.png',9004)  # returns coordinates such as 274,146|37,147

# Convert the coordinate string into a nested list
all_lst = []  # [[274,146],[37,147]...]
if '|' in img_coor:
    lst_1 = img_coor.split("|")
    count_1 = len(lst_1)
    for i in range(count_1):
        xy_lst = []
        x = int(lst_1[i].split(',')[0])
        y = int(lst_1[i].split(',')[1])
        xy_lst.append(x)
        xy_lst.append(y)
        all_lst.append(xy_lst)
else:
    x = int(img_coor.split(',')[0])
    y = int(img_coor.split(',')[1])
    xy_lst = []
    xy_lst.append(x)
    xy_lst.append(y)
    all_lst.append(xy_lst)

for data in all_lst:
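    # Each data is a list with 2 elements
    x = data[0]
    y = data[1]
    # Create an action chain, move to (x, y) within the captcha tag, click, and run the chain immediately
    ActionChains(bro).move_to_element_with_offset(img_tag,x,y).click().perform()
    # Wait 0.5 seconds after each click so the clicks are not too fast
    sleep(0.5)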
    sleep(0.5)

# Click the login button to log in
bro.find_element_by_id('J-login').click()
sleep(2)
# Close the browser
bro.quit()
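
The coordinate handling in the listing above can be written more compactly. This is a sketch, not the author's code: `parse_points` and `to_center_offset` are hypothetical helpers. The second one rests on an assumption worth verifying against your installed version: to my understanding, newer Selenium 4 releases measure `move_to_element_with_offset` from the element's center rather than its top-left corner, so corner-based coordinates from the recognition service would need converting.

```python
def parse_points(pic_str):
    """Turn a string like '274,146|37,147' into [[274, 146], [37, 147]]."""
    return [[int(n) for n in pair.split(",")] for pair in pic_str.split("|")]


def to_center_offset(x, y, width, height):
    """Convert an offset from the top-left corner to one from the center."""
    # Integer division keeps the result in whole pixels (approximate center).
    return x - width // 2, y - height // 2
```

A single coordinate pair without a `|` separator goes through the same path, because `split("|")` then returns a one-item list.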

Pyppeteer

See 波曉張's blog