Chapter 5: Scraping a Well-Known Q&A Site with Scrapy

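Chapter 5 feels like a practice project for Chapter 4; the only real addition is a simulated login.

I have not split these notes into sections. They go straight to the key points, so they may be a little disorganized.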

1. Common HTTP status codes:
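As a quick reference, the status codes a crawler most often has to handle can be looked up in Python's standard `http.HTTPStatus` enum. The selection in `COMMON_CODES` is an assumption about which codes matter most when scraping, not a list taken from the course.

```python
from http import HTTPStatus

# Codes chosen for illustration; the selection itself is an assumption.
COMMON_CODES = [200, 301, 302, 403, 404, 500, 503]

def describe(code):
    """Return 'code phrase' using the standard library's names."""
    status = HTTPStatus(code)
    return "{0} {1}".format(status.value, status.phrase)

for code in COMMON_CODES:
    print(describe(code))
```

The redirect codes are the ones the login check in these notes relies on: with redirects disabled, any status other than 200 is treated as "not logged in".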

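2. How do you find the POST parameters?

Open the login page, open Firebug, submit a wrong account and password, and watch the request that is sent to the post URL. The parameters can be read from that request.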
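Once the failed login request has been captured, its form body can be split into parameter names. A minimal sketch, assuming the body was copied from the browser's network panel; the `captured_body` string is made up for illustration and is not Zhihu's real payload.

```python
from urllib.parse import parse_qs

# Hypothetical request body copied from the browser's network panel.
captured_body = "_xsrf=abc123&phone_num=13800000000&password=wrong"

def list_post_params(body):
    """Return the parameter names of a URL-encoded form body, in order."""
    parsed = parse_qs(body, keep_blank_values=True)
    return list(parsed.keys())

print(list_post_params(captured_body))
```

The names returned this way are the keys the login function has to send in its own POST data.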

3. Read a local file to generate cookies.

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3
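The cookie jar module imported above can write cookies to a local file and read them back. A minimal Python 3 sketch of that round trip using only the standard library; the cookie name, value, and the `make_cookie` and `round_trip` helpers are made up for illustration, and a temporary file stands in for `cookies.txt`.

```python
import http.cookiejar as cookielib
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie; all field values here are illustrative."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

def round_trip(path):
    """Save one cookie to `path`, load it into a fresh jar, return the names."""
    jar = cookielib.LWPCookieJar(filename=path)
    jar.set_cookie(make_cookie("session_id", "token-value", "www.zhihu.com"))
    # ignore_discard=True keeps session cookies that have no expiry date.
    jar.save(ignore_discard=True)

    loaded = cookielib.LWPCookieJar(filename=path)
    loaded.load(ignore_discard=True)
    return [cookie.name for cookie in loaded]

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
print(round_trip(path))
```

Without `ignore_discard=True` on both calls, a cookie that has no expiry date would be skipped when saving or loading.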

4. Logging in to Zhihu with requests

# -*- coding: utf-8 -*-
__author__ = 'jinxiao'

import re

import requests

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

# Create one session; every requests call below goes through it so cookies persist.
session = requests.session()
# Store cookies in an LWPCookieJar so they can be saved to cookies.txt.
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
# Load previously saved cookies, if there are any.
try:
    session.cookies.load(ignore_discard=True)
except (IOError, cookielib.LoadError):
    print("Could not load cookies")

# Zhihu requires a browser User-Agent header; other sites may not, but most do.
agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    "User-Agent": agent,
}


def is_login():
    # Decide whether we are logged in from the status code of the returned page.
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    # Redirects are disabled, so a redirect to the login page shows up as a non-200 status.
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    return response.status_code == 200


def get_xsrf():
    # Fetch the home page and extract the _xsrf token.
    response = session.get("https://www.zhihu.com", headers=header)
    # re.search scans the whole multi-line page; re.match with a leading ".*"
    # would stop at the first newline.
    match_obj = re.search(r'name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    return ""


def get_index():
    # Save the home page to disk to inspect what the session actually receives.
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")


def zhihu_login(account, password):
    # Log in to Zhihu with a phone number or an email address.
    if re.match(r"^1\d{10}", account):
        print("Logging in with phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
        }
    elif "@" in account:
        # The account looks like an email address.
        print("Logging in with email")
        post_url = "https://www.zhihu.com/login/email"
        post_data = {
            "_xsrf": get_xsrf(),
            "email": account,
            "password": password,
        }
    else:
        # Without this branch post_url would be undefined below.
        raise ValueError("account must be a phone number or an email address")

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()


zhihu_login("18782902568", "admin123")
# get_index()
print(is_login())

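Extracting the `_xsrf` token depends on how the regular expression is anchored. `re.match` only matches at the start of the string, and `.` does not cross newlines by default, so on a multi-line page a pattern that begins with `.*` finds nothing unless `re.DOTALL` is passed. The sketch below shows the difference on a made-up HTML fragment; the `html` string and the `extract` helper are hypothetical.

```python
import re

# Hypothetical multi-line page fragment containing a hidden _xsrf field.
html = (
    '<html>\n<form>\n'
    '<input type="hidden" name="_xsrf" value="abc123"/>\n'
    '</form>\n</html>'
)

pattern = r'.*name="_xsrf" value="(.*?)"'

def extract(match_obj):
    """Return the captured token, or an empty string when nothing matched."""
    return match_obj.group(1) if match_obj else ""

# re.match anchors at position 0 and '.' stops at the first newline.
plain = extract(re.match(pattern, html))
# re.DOTALL lets '.' cross newlines, so the same pattern can reach the field.
dotall = extract(re.match(pattern, html, re.DOTALL))
# re.search scans forward, so neither the leading '.*' nor the flag is needed.
searched = extract(re.search(r'name="_xsrf" value="(.*?)"', html))

print(repr(plain), repr(dotall), repr(searched))
```

If `get_xsrf()` returns an empty string on a page that does contain the field, this anchoring behavior is a likely cause.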

5. Adding a User-Agent when debugging in the Scrapy shell

scrapy shell -s USER_AGENT='...' url

6. The JSONView plugin

A browser plugin that displays JSON in a readable, formatted view.

7. Writing the response to an HTML file

with open("e:/zhihu.html", "wb") as f:
    f.write(response.text.encode("utf-8"))

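8. Understanding yield

If a callback yields an item, the item is passed to the pipelines for processing.

If a callback yields a Request, the request is passed to the downloader to be fetched.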
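The routing described above can be imitated with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's actual engine: the `Request` class, the `parse` callback, the `dispatch` function, and the URL are stand-ins invented for illustration.

```python
class Request:
    """Stand-in for scrapy.Request; it holds only a URL."""
    def __init__(self, url):
        self.url = url

def parse(page_url):
    """Toy callback: yields one follow-up request and one item."""
    yield Request(page_url + "/answers")
    yield {"url": page_url, "title": "example question"}

def dispatch(results):
    """Route each yielded object by type, as the note describes."""
    to_download, to_pipeline = [], []
    for obj in results:
        if isinstance(obj, Request):
            to_download.append(obj.url)  # would go to the downloader
        else:
            to_pipeline.append(obj)      # would go to the item pipelines
    return to_download, to_pipeline

urls, items = dispatch(parse("https://www.zhihu.com/question/1"))
print(urls)
print(items)
```

Because `parse` is a generator, it can mix both kinds of object in a single callback, and each one is routed as soon as it is yielded.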

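9. Deduplicating in MySQL: make the deduplication key the primary key, then handle the primary-key conflict that a repeated insert causes.

Fix: append ON DUPLICATE KEY UPDATE content=VALUES(content) to the INSERT statement, where content is the column that should be updated.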
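A statement of that shape can be assembled with a small helper. This is a minimal sketch, not the course's code: the `build_upsert` function, the table name `zhihu_answer`, and the column names are hypothetical, and `answer_id` is assumed to be the primary key.

```python
def build_upsert(table, columns, update_columns):
    """Build an INSERT ... ON DUPLICATE KEY UPDATE statement with %s placeholders."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        "{0}=VALUES({0})".format(col) for col in update_columns
    )
    return (
        "INSERT INTO {0} ({1}) VALUES ({2}) "
        "ON DUPLICATE KEY UPDATE {3}"
    ).format(table, column_list, placeholders, updates)

# Hypothetical table and columns; answer_id is assumed to be the primary key.
sql = build_upsert("zhihu_answer", ["answer_id", "content"], ["content"])
print(sql)
```

Only trusted table and column names should be formatted into the string. The row values themselves would be passed separately as parameters to the database driver's `execute` call, assuming a driver that uses the `%s` placeholder style.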

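10. Entering the captcha manually (zhihu.login_requests.py)

def get_captcha():
    import time
    # Current time in milliseconds, passed as the "r" parameter.
    t = str(int(time.time() * 1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    # session and header are the module-level objects from the login script.
    response = session.get(captcha_url, headers=header)
    with open("captcha.jpg", "wb") as f:
        f.write(response.content)
    captcha = input("Enter the captcha: ")
    return captcha
# Why does the captcha request use session rather than requests?
# Because requests would open a new session that does not match the later login request, so the captcha you type in would not be the current one.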

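Author: 今孝

Source: http://www.cnblogs.com/jinxiao-pu/p/6749332.html

Copyright of this article is shared by the author and cnblogs (博客園). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original must appear in a prominent place on the page.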
