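Preface
In some scraping tasks we often run into situations that are hard to crack with plain HTTP requests alone, for example when the protocol carries encrypted parameters that would take a lot of time to reverse-engineer. In those cases we turn to Selenium and drive a real browser to scrape elements from the page.
Most of the time we use Selenium only to scrape element information that is visible on the page, or to simulate manual operations. But the data fields exposed by visible elements are limited, and many useful fields are hidden in API responses. So how do we get hold of the API response content?
Searching online for how to get Chrome's Network response data through Selenium turns up mostly Python or Java articles, and C# resources are very scarce. Even when you know the principle, the SDK implementations differ a lot between languages, and the C# SDK in particular deviates quite a bit from the others, so you have to work it out yourself.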
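Exploration
From the material I could find, there are roughly two approaches.
Approach 1: Point Selenium at a local proxy that intercepts all requests, which is the same principle that common packet-capture tools use. C# has no plugin for this (or perhaps I just did not find one), whereas Python and Java have a plugin called Browsermob-Proxy that integrates tightly with Selenium for proxy-based capture. I built a local proxy tool with FiddlerCore, but it did not work well: it cannot be tightly bound to Selenium, so the Selenium crawl and the request interception run asynchronously, and forcing them into sync in code is painful. It did not meet my requirements.
Approach 2: Selenium enables the browser's performance logging through chromedriver and records logs of type Performance. In Selenium this feature is called PerformanceLoggingPreferences. After the page has loaded, Selenium can retrieve summary information from the browser's performance logs, and the RequestId in each log entry can then be passed to a Chrome CDP command to fetch the full content from the browser. Selenium's CDP wrapper is still essentially an HTTP request; it simply carries the SessionId of the driven window when talking to Chrome's API.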
References for Approach 1:
https://blog.csdn.net/qq_32502511/article/details/101536325 (Python + Browsermob-Proxy)
https://blog.csdn.net/fontcolor0/article/details/103297635/ (Java + Browsermob-Proxy)
https://www.cnblogs.com/airoot/articles/14888284.html (C# + FiddlerCore)
References for Approach 2:
https://chromedevtools.github.io/devtools-protocol/ (introduction to the Chrome DevTools Protocol)
https://www.jianshu.com/p/615e3c0140a5 (Python + enabling PerfLoggingPref)
https://blog.csdn.net/weixin_49855251/article/details/112281901 (overview of the different log types)
https://blog.csdn.net/bigcarp/article/details/115065730 (Java + fetching log content with the native CDP protocol)
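Approach 2 says that Selenium's CDP wrapper is essentially an HTTP request that carries the session id. As a rough illustration, the Python sketch below only builds such a request and does not send it. The endpoint path and payload shape follow what the Selenium Python bindings use for Chrome; the chromedriver address and the session id are made-up placeholders.

```python
import json


def build_cdp_request(driver_url, session_id, cmd, params):
    """Return (method, url, body) for a CDP command routed through chromedriver."""
    url = f"{driver_url.rstrip('/')}/session/{session_id}/goog/cdp/execute"
    body = json.dumps({"cmd": cmd, "params": params})
    return "POST", url, body


method, url, body = build_cdp_request(
    "http://127.0.0.1:9515",  # assumed local chromedriver address
    "abc123",                 # hypothetical session id
    "Network.getResponseBody",
    {"requestId": "33814E6BD702343CF0A2A38C976C772F"},
)
print(method, url)
print(body)
```

In the C# SDK this request is hidden behind driver.ExecuteCdpCommand, so you normally never build it by hand.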
Getting Started
Enabling logging:
C# pseudocode example:
First install the latest NuGet package: Selenium.WebDriver (its namespace is OpenQA.Selenium)
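    using System;
    using System.Threading;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Chromium;

    var option = new ChromeOptions();
    // Enable the performance log; level Info is sufficient
    option.SetLoggingPreference("performance", OpenQA.Selenium.LogLevel.Info);
    option.PerformanceLoggingPreferences = new ChromiumPerformanceLoggingPreferences()
    {
        IsCollectingNetworkEvents = true // collect network request events
    };
    // driverPath: directory that contains the chromedriver executable
    using (ChromeDriver driver = new ChromeDriver(driverPath, option, TimeSpan.FromSeconds(5)))
    {
        driver.Navigate().GoToUrl("https://item.m.jd.com/product/10052422060501.html");
        Thread.Sleep(3 * 1000); // wait for the page to finish loading
        var logs = driver.Manage().Logs.GetLog("performance"); // fetch all performance log entries
    }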
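For comparison, here is the Python version:
    from selenium import webdriver

    # Selenium 3-style capabilities (legacy, non-W3C mode)
    caps = {
        'browserName': 'chrome',
        'loggingPrefs': {
            'performance': 'INFO',  # enable the performance log; level INFO is sufficient
        },
        'goog:chromeOptions': {
            'perfLoggingPrefs': {
                'enableNetwork': True,  # collect network request events
            },
            'w3c': False,
        },
    }
    driver = webdriver.Chrome(desired_capabilities=caps)
    # TODO: fetch the logs
    # ...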
Analyzing the logs:
Let's pick an arbitrary entry from the returned log list and look at its raw content:
{[2022-08-03T08:47:31Z] [Info] {"message":{"method":"Network.responseReceived","params":{"frameId":"78D9BC6F0CBE162DA6779F410AA1500C","hasExtraInfo":true,"loaderId":"33814E6BD702343CF0A2A38C976C772F","requestId":"33814E6BD702343CF0A2A38C976C772F","response":{"connectionId":242,"connectionReused":false,"encodedDataLength":546,"fromDiskCache":false,"fromPrefetchCache":false,"fromServiceWorker":false,"headers":{"access-control-allow-credentials":"true","access-control-allow-headers":"Origin, X-Requested-With, Content-Type, multipart/form-data, Accept, Authorization","access-control-allow-methods":"POST, GET, PATCH, DELETE, PUT, OPTIONS","access-control-allow-origin":"*","access-control-max-age":"3600","cache-control":"no-cache,no-store","content-encoding":"gzip","content-language":"zh-CN","content-type":"text/html;charset=UTF-8","date":"Wed, 03 Aug 2022 08:47:33 GMT","hit":"bj-9153118133147527","server":"jfe","strict-transport-security":"max-age=86400","vary":"Accept-Encoding"},"mimeType":"text/html","protocol":"h2","remoteIPAddress":"106.39.169.120","remotePort":443,"responseTime":1.659516451658764e+12,"securityDetails":{"certificateId":0,"certificateTransparencyCompliance":"compliant","cipher":"AES_256_GCM","issuer":"GlobalSign RSA OV SSL CA 2018","keyExchange":"","keyExchangeGroup":"X25519","protocol":"TLS 
1.3","sanList":["*.jd.com","*.360buy.com","*.360buyimg.com","*.3.cn","*.7fresh.com","*.baitiao.com","*.chinabank.com.cn","*.e.jd.com","*.jd.co.th","*.jddglobal.com","*.jd.hk","*.jd.id","*.jdpay.com","*.jd.ru","*.jdworldwide.com","*.jdx.com","*.joybuy.com","*.joybuy.es","*.jr.jd.com","*.k.jd.com","*.m.jd.com","*.m.yhd.com","*.shop.jd.com","*.wangyin.com","*.yhd.com","*.yiyaojd.com","360buy.com","360buyimg.com","3.cn","7fresh.com","baitiao.com","chinabank.com.cn","jd.co.th","jddglobal.com","jd.hk","jd.id","jdpay.com","jd.ru","jdworldwide.com","jdx.com","joybuy.com","joybuy.es","wangyin.com","yhd.com","yiyaojd.com","jd.com"],"signedCertificateTimestampList":[{"hashAlgorithm":"SHA-256","logDescription":"Sectigo 'Mammoth' CT log","logId":"6F5376AC31F03119D89900A45115FF77151C11D902C10029068DB2089A37D913","origin":"Embedded in certificate","signatureAlgorithm":"ECDSA","signatureData":"3045022017A2AC492303F50786758D0B4B63EEB8D031850832031FC5A43139C0CDA5EB4F0221009063F884220327718857A6897B87ED5D9F785FFC97F23BD45C84975A11DE721E","status":"Verified","timestamp":1.634109209345e+12},{"hashAlgorithm":"SHA-256","logDescription":"Google 'Argon2022' log","logId":"2979BEF09E393921F056739F63A577E5BE577D9C600AF8F94D5D265C255DC784","origin":"Embedded in certificate","signatureAlgorithm":"ECDSA","signatureData":"3045022071D6D51A59CAA1764E478598AA8EE34B628C8F856B09CDB8382E090CF5D6D97D022100A8FE17E028BB6D439387721974ED3629B9CC44C90AC524B505B9B2375C75CBB9","status":"Verified","timestamp":1.634109210136e+12},{"hashAlgorithm":"SHA-256","logDescription":"Sectigo 'Sabre' CT log","logId":"5581D4C2169036014AEA0B9B573C53F0C0E43878702508172FA3AA1D0713D30C","origin":"Embedded in 
certificate","signatureAlgorithm":"ECDSA","signatureData":"3045022015D94080264A3FCA83C0F3DE2B85A703384BB678FEBBE5408B4FF7D30BD40900022100DFBAB7992EE99420724CA35A5C2C252298B750994B0EAA98ACB22992F97D545E","status":"Verified","timestamp":1.634109209393e+12}],"subjectName":"*.jd.com","validFrom":1634109205,"validTo":1668410005},"securityState":"secure","status":200,"statusText":"","timing":{"connectEnd":96.834,"connectStart":19.521,"dnsEnd":19.521,"dnsStart":0,"proxyEnd":-1,"proxyStart":-1,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":217.707,"requestTime":2272960.91089,"sendEnd":97.25,"sendStart":97.03,"sslEnd":96.829,"sslStart":55.726,"workerFetchStart":-1,"workerReady":-1,"workerRespondWithSettled":-1,"workerStart":-1},"url":"https://item.m.jd.com/product/10052422060501.html"},"timestamp":2272961.1294,"type":"Document"}},"webview":"78D9BC6F0CBE162DA6779F410AA1500C"}}
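Once formatted (see below), we can see that the log entry already contains most of the descriptive information about the request. We need to iterate over the entries and keep those whose method is Network.responseReceived; you can of course filter further on other parameters to find the requests you want.
Each entry has a requestId, which we pass to the CDP command [Network.getResponseBody] to retrieve the actual content.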
{ "message": { "method": "Network.responseReceived", "params": { "frameId": "78D9BC6F0CBE162DA6779F410AA1500C", "hasExtraInfo": true, "loaderId": "33814E6BD702343CF0A2A38C976C772F", "requestId": "33814E6BD702343CF0A2A38C976C772F", "response": { "connectionId": 242, "connectionReused": false, "encodedDataLength": 546, "fromDiskCache": false, "fromPrefetchCache": false, "fromServiceWorker": false, "headers": { "access-control-allow-credentials": "true", "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, multipart/form-data, Accept, Authorization", "access-control-allow-methods": "POST, GET, PATCH, DELETE, PUT, OPTIONS", "access-control-allow-origin": "*", "access-control-max-age": "3600", "cache-control": "no-cache,no-store", "content-encoding": "gzip", "content-language": "zh-CN", "content-type": "text/html;charset=UTF-8", "date": "Wed, 03 Aug 2022 08:47:33 GMT", "hit": "bj-9153118133147527", "server": "jfe", "strict-transport-security": "max-age=86400", "vary": "Accept-Encoding" }, "mimeType": "text/html", "protocol": "h2", "remoteIPAddress": "106.39.169.120", "remotePort": 443, "responseTime": 1.659516451658764e+12, "securityDetails": { "certificateId": 0, "certificateTransparencyCompliance": "compliant", "cipher": "AES_256_GCM", "issuer": "GlobalSign RSA OV SSL CA 2018", "keyExchange": "", "keyExchangeGroup": "X25519", "protocol": "TLS 1.3", "sanList": ["*.jd.com", "*.360buy.com", "*.360buyimg.com", "*.3.cn", "*.7fresh.com", "*.baitiao.com", "*.chinabank.com.cn", "*.e.jd.com", "*.jd.co.th", "*.jddglobal.com", "*.jd.hk", "*.jd.id", "*.jdpay.com", "*.jd.ru", "*.jdworldwide.com", "*.jdx.com", "*.joybuy.com", "*.joybuy.es", "*.jr.jd.com", "*.k.jd.com", "*.m.jd.com", "*.m.yhd.com", "*.shop.jd.com", "*.wangyin.com", "*.yhd.com", "*.yiyaojd.com", "360buy.com", "360buyimg.com", "3.cn", "7fresh.com", "baitiao.com", "chinabank.com.cn", "jd.co.th", "jddglobal.com", "jd.hk", "jd.id", "jdpay.com", "jd.ru", "jdworldwide.com", "jdx.com", "joybuy.com", 
"joybuy.es", "wangyin.com", "yhd.com", "yiyaojd.com", "jd.com"], "signedCertificateTimestampList": [{ "hashAlgorithm": "SHA-256", "logDescription": "Sectigo 'Mammoth' CT log", "logId": "6F5376AC31F03119D89900A45115FF77151C11D902C10029068DB2089A37D913", "origin": "Embedded in certificate", "signatureAlgorithm": "ECDSA", "signatureData": "3045022017A2AC492303F50786758D0B4B63EEB8D031850832031FC5A43139C0CDA5EB4F0221009063F884220327718857A6897B87ED5D9F785FFC97F23BD45C84975A11DE721E", "status": "Verified", "timestamp": 1.634109209345e+12 }, { "hashAlgorithm": "SHA-256", "logDescription": "Google 'Argon2022' log", "logId": "2979BEF09E393921F056739F63A577E5BE577D9C600AF8F94D5D265C255DC784", "origin": "Embedded in certificate", "signatureAlgorithm": "ECDSA", "signatureData": "3045022071D6D51A59CAA1764E478598AA8EE34B628C8F856B09CDB8382E090CF5D6D97D022100A8FE17E028BB6D439387721974ED3629B9CC44C90AC524B505B9B2375C75CBB9", "status": "Verified", "timestamp": 1.634109210136e+12 }, { "hashAlgorithm": "SHA-256", "logDescription": "Sectigo 'Sabre' CT log", "logId": "5581D4C2169036014AEA0B9B573C53F0C0E43878702508172FA3AA1D0713D30C", "origin": "Embedded in certificate", "signatureAlgorithm": "ECDSA", "signatureData": "3045022015D94080264A3FCA83C0F3DE2B85A703384BB678FEBBE5408B4FF7D30BD40900022100DFBAB7992EE99420724CA35A5C2C252298B750994B0EAA98ACB22992F97D545E", "status": "Verified", "timestamp": 1.634109209393e+12 }], "subjectName": "*.jd.com", "validFrom": 1634109205, "validTo": 1668410005 }, "securityState": "secure", "status": 200, "statusText": "", "timing": { "connectEnd": 96.834, "connectStart": 19.521, "dnsEnd": 19.521, "dnsStart": 0, "proxyEnd": -1, "proxyStart": -1, "pushEnd": 0, "pushStart": 0, "receiveHeadersEnd": 217.707, "requestTime": 2272960.91089, "sendEnd": 97.25, "sendStart": 97.03, "sslEnd": 96.829, "sslStart": 55.726, "workerFetchStart": -1, "workerReady": -1, "workerRespondWithSettled": -1, "workerStart": -1 }, "url": 
"https://item.m.jd.com/product/10052422060501.html" }, "timestamp": 2272961.1294, "type": "Document" } }, "webview": "78D9BC6F0CBE162DA6779F410AA1500C" }
Complete code:
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using Newtonsoft.Json.Linq;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Chromium;

    var option = new ChromeOptions();
    // Enable the performance log; level Info is sufficient
    option.SetLoggingPreference("performance", OpenQA.Selenium.LogLevel.Info);
    option.PerformanceLoggingPreferences = new ChromiumPerformanceLoggingPreferences()
    {
        IsCollectingNetworkEvents = true // collect network request events
    };
    using (ChromeDriver driver = new ChromeDriver(driverPath, option, TimeSpan.FromSeconds(5)))
    {
        driver.Navigate().GoToUrl("https://item.m.jd.com/product/10052422060501.html");
        Thread.Sleep(3 * 1000); // wait for the page to finish loading
        // Fetch all performance log entries and keep only the Network.responseReceived events
        var logs = driver.Manage().Logs.GetLog("performance")
            .Where(o => o.Message.Contains("\"Network.responseReceived\""));
        foreach (var log in logs)
        {
            // The CDP call throws when the response body cannot be found, so the exception must be caught
            try
            {
                var json = JObject.Parse(log.Message);
                // Request URL; filter on it to pick out the requests you want
                var url = json["message"]["params"]["response"]["url"].ToString();
                var requestId = json["message"]["params"]["requestId"].ToString();
                // Use the requestId as the parameter of the CDP command that fetches the full content.
                // Pitfall: the result is a dictionary and has to be cast to Dictionary<string, object>
                var response = driver.ExecuteCdpCommand("Network.getResponseBody",
                    new Dictionary<string, object>() { { "requestId", requestId } }) as Dictionary<string, object>;
                if (response != null && response.TryGetValue("body", out object? bodyObj) && bodyObj != null)
                {
                    string body = bodyObj.ToString();
                    Console.WriteLine($"Body content: {body}");
                }
            }
            catch (Exception ex)
            {
                // Log the error
                Console.Error.WriteLine(ex.Message);
            }
        }
    }
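The filtering step in the code above does not depend on the browser, so it can be pulled out and tested on its own. The Python sketch below is one possible way to do that, assuming each log entry is a dict whose "message" field holds the JSON string shown earlier (the shape Python's driver.get_log('performance') returns). The function name response_received_events and the sample entries are made up for illustration.

```python
import json


def response_received_events(log_entries):
    """Yield (request_id, url, mime_type) for each Network.responseReceived entry."""
    for entry in log_entries:
        try:
            message = json.loads(entry["message"])["message"]
        except (KeyError, TypeError, ValueError):
            continue  # not a well-formed performance log entry
        if message.get("method") != "Network.responseReceived":
            continue
        params = message.get("params") or {}
        response = params.get("response") or {}
        yield params.get("requestId"), response.get("url"), response.get("mimeType")


# Hand-written sample entries, trimmed to the fields the function reads
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"requestId": "1"}}})},
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {
            "requestId": "33814E6BD702343CF0A2A38C976C772F",
            "response": {
                "url": "https://item.m.jd.com/product/10052422060501.html",
                "mimeType": "text/html"}}}})},
]
events = list(response_received_events(sample_logs))
print(events)
```

Comparing the parsed method field is a little stricter than the substring check in the C# code, which would also match an entry that merely mentions the method name somewhere in its payload.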
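Encapsulation
To make this easier to reuse later, I wrapped it in a class.
The input parameters of the filter can be adjusted to suit your situation. If you do not need to filter on the body content, you can drop that parameter; then you no longer have to wait for each response body before filtering, and performance improves considerably.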
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using Newtonsoft.Json.Linq;
    using NLog; // Logger is assumed to be NLog's Logger
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Chromium;

    /// <summary>
    /// Helper class for Selenium network request logs
    /// </summary>
    public class NetworkLoggingHelper
    {
        // NlogProvider is a project-specific logger factory that is not shown in this article
        static readonly Logger logger = NlogProvider.GetLogger();

        /// <summary>
        /// Enable network request logging
        /// </summary>
        /// <param name="option"></param>
        public static void OpenNetworkPerformanceLogging(ref ChromeOptions option)
        {
            option.SetLoggingPreference("performance", OpenQA.Selenium.LogLevel.Info);
            option.PerformanceLoggingPreferences = new ChromiumPerformanceLoggingPreferences()
            {
                IsCollectingNetworkEvents = true
            };
        }

        /// <summary>
        /// Get the network request data
        /// </summary>
        /// <param name="driver"></param>
        /// <param name="filter">Filter. Argument 1: request url | Argument 2: response mimeType | Argument 3: response body</param>
        /// <returns></returns>
        public static Dictionary<string, string> GetNetworkApiDatas(ChromeDriver driver, Func<string, string, string, bool> filter)
        {
            Dictionary<string, string> datas = new Dictionary<string, string>();
            try
            {
                var logs = driver.Manage().Logs.GetLog("performance")
                    ?.Where(o => o.Message.Contains("\"Network.responseReceived\""));
                if (logs == null)
                {
                    return datas; // no log entries available
                }
                foreach (var log in logs)
                {
                    try
                    {
                        var json = JObject.Parse(log.Message);
                        if (json["message"]["params"] == null || json["message"]["params"]["response"] == null)
                        {
                            continue;
                        }
                        var url = json["message"]["params"]["response"]["url"].ToString();
                        var mimeType = json["message"]["params"]["response"]["mimeType"].ToString();
                        var requestId = json["message"]["params"]["requestId"].ToString();
                        var response = driver.ExecuteCdpCommand("Network.getResponseBody",
                            new Dictionary<string, object>() { { "requestId", requestId } }) as Dictionary<string, object>;
                        if (response != null && response.Count > 0)
                        {
                            // Whether the body is base64-encoded
                            var isBase64Encode = false;
                            if (response.TryGetValue("base64Encoded", out object? base64Encoded) && base64Encoded != null)
                            {
                                isBase64Encode = (bool)base64Encoded;
                            }
                            // Get the response content
                            string body = string.Empty;
                            if (response.TryGetValue("body", out object? bodyObj) && bodyObj != null)
                            {
                                body = bodyObj.ToString();
                                if (isBase64Encode)
                                {
                                    body = Encoding.UTF8.GetString(Convert.FromBase64String(body));
                                }
                            }
                            // Apply the filter. If the body does not need to take part in the filter,
                            // this if statement can be moved above the ExecuteCdpCommand call, which is much faster
                            if (filter.Invoke(url, mimeType, body))
                            {
                                datas.TryAdd(url, body); // keep the first body when the same url appears more than once
                            }
                        }
                    }
                    catch (Exception ex)
                    {
                        logger.Error(ex.Message);
                    }
                }
            }
            catch (Exception ex)
            {
                logger.Error($"Failed to get logs: {ex.Message}");
            }
            return datas;
        }
    }
Usage example:
    var option = new ChromeOptions();
    NetworkLoggingHelper.OpenNetworkPerformanceLogging(ref option); // enable logging
    using (ChromeDriver driver = new ChromeDriver(driverPath, option, TimeSpan.FromSeconds(5)))
    {
        driver.Navigate().GoToUrl("https://item.m.jd.com/product/10052422060501.html");
        Thread.Sleep(3 * 1000); // wait for the page to finish loading
        var datas = NetworkLoggingHelper.GetNetworkApiDatas(driver, (url, mimeType, body) =>
        {
            return url.Contains("//item.m.jd.com/product/")
                || mimeType.Contains("application/json")
                || body.Contains("windows.itemInfo");
        });
        Console.WriteLine(datas.Count);
    }
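The helper's note about filtering before fetching the body can be made concrete. The Python sketch below shows the variant where only url and mimeType are checked, so no CDP round trip is spent on requests that will be discarded anyway, and where the body is base64-decoded only when CDP flags it. All names here (decode_body, collect_bodies, fake_fetch) are hypothetical, and fake_fetch stands in for the real Network.getResponseBody call.

```python
import base64


def decode_body(cdp_result):
    """Return the response body as text, decoding base64 when CDP flags it."""
    body = cdp_result.get("body") or ""
    if cdp_result.get("base64Encoded"):
        return base64.b64decode(body).decode("utf-8", errors="replace")
    return body


def collect_bodies(events, fetch_body, wanted):
    """Fetch bodies only for events that pass the cheap url/mimeType check."""
    results = {}
    for request_id, url, mime_type in events:
        if not wanted(url, mime_type):
            continue  # skipped before any CDP round trip
        try:
            result = fetch_body(request_id)
        except Exception:
            continue  # the body may no longer be available in the browser
        results.setdefault(url, decode_body(result))  # first body per url wins
    return results


# Stand-in for the browser: request id -> what Network.getResponseBody would return
fake_store = {
    "r1": {"body": base64.b64encode(b'{"ok": true}').decode("ascii"), "base64Encoded": True},
    "r2": {"body": "<html></html>", "base64Encoded": False},
}
calls = []


def fake_fetch(request_id):
    calls.append(request_id)
    return fake_store[request_id]  # raises KeyError for unknown ids, like a failed CDP call


events = [
    ("r1", "https://example.com/api", "application/json"),
    ("r2", "https://example.com/", "text/html"),
    ("r3", "https://example.com/api2", "application/json"),
]
bodies = collect_bodies(events, fake_fetch, lambda url, mime: mime == "application/json")
print(bodies)
print(calls)
```

The text/html request is never fetched at all, which is where the saving comes from; the trade-off is that the filter can no longer look inside the body.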
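How Selenium Works
Here is a walkthrough of how Selenium works:
1. The Selenium client (an automation script written in Python or another language) initializes a service and launches the browser driver program chromedriver.exe through WebDriver.
2. Through RemoteWebDriver it sends an HTTP request to the browser driver. The driver parses the request, opens the browser, and obtains a sessionId; any further operation on the browser must carry this id.
3. The browser is opened and bound to a specific port, and the launched browser then acts as WebDriver's remote server.
4. Once the browser is open, every Selenium operation (visiting a URL, finding an element, and so on) connects to the remote server through RemoteConnection, and the execute method then calls the _request method to send the request to the remote server via urllib3.
5. The browser performs the corresponding action based on the content of the request.
6. The browser returns the result of the action to the test script through the browser driver.
WebDriver is a W3C standard protocol. It provides a set of interfaces for discovering and manipulating DOM elements in web documents.
WebDriver is a set of APIs that gives test code the ability to locate and operate on web elements. Each programming language has its own WebDriver binding.
↑↑↑↑ The section above comes from: https://www.cnblogs.com/xrxc/p/14776895.html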
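To make the session-id mechanics in the steps above more tangible, here is a small Python sketch (my own illustration, not part of the quoted source) that builds the HTTP requests a client would send to the browser driver. The routes are the standard W3C WebDriver endpoints; the function name webdriver_request, the driver address, and the session id are placeholders, and nothing is actually sent.

```python
import json

# A few W3C WebDriver endpoints: command -> (HTTP method, path template)
ROUTES = {
    "new_session": ("POST", "/session"),
    "navigate": ("POST", "/session/{sid}/url"),
    "find_element": ("POST", "/session/{sid}/element"),
    "quit": ("DELETE", "/session/{sid}"),
}


def webdriver_request(base_url, command, session_id=None, **payload):
    """Return (method, url, body) for one WebDriver command."""
    method, path = ROUTES[command]
    url = base_url.rstrip("/") + path.format(sid=session_id)
    body = json.dumps(payload) if method == "POST" else None
    return method, url, body


base = "http://127.0.0.1:9515"  # assumed local chromedriver address
sid = "abc123"                  # hypothetical id taken from the new-session response

print(webdriver_request(base, "new_session",
                        capabilities={"alwaysMatch": {"browserName": "chrome"}}))
print(webdriver_request(base, "navigate", sid,
                        url="https://item.m.jd.com/product/10052422060501.html"))
print(webdriver_request(base, "find_element", sid, using="css selector", value="body"))
print(webdriver_request(base, "quit", sid))
```

Every command after the first embeds the session id in its path, which is how the driver knows which browser window a request belongs to.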
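Author: Harry
Original source: https://www.cnblogs.com/simendancer/articles/16546199.html
Some of the text and images come from the internet; if anything infringes your rights, please let me know by private message.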