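Sharing a Python program that scrapes student grades from HUST (Harbin University of Science and Technology), with automatic CAPTCHA recognition via OCR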


Python version: 3.5.2
Date: 2018/1/21

# -*- coding: utf-8 -*-

__Author__ = "Lance#"

from urllib import request
from urllib import parse
from http import cookiejar
from aip.ocr import AipOcr
import re

class Hust(object):
    def __init__(self, stu_id, passwd):
        # CAPTCHA URL, login URL, and grade-query URL
        self.__url_check = "http://jwzx.hrbust.edu.cn/academic/getCaptcha.do"
        self.__url_login = "http://jwzx.hrbust.edu.cn/academic/j_acegi_security_check"
        self.__url_scoal = "http://jwzx.hrbust.edu.cn/academic/manager/score/studentOwnScore.do"
        # Request headers that mimic a browser
        self.__headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
        }

        self.__captcha = ''
        # Your own APP ID and keys, obtained by registering on the AI platform
        self.__APP_ID = 'xxxxxx'
        self.__API_KEY = 'xxxxxx'
        self.__SECRET_KEY = 'xxxxxx'

        # POST parameters; these can be captured in the browser
        self.__post_data = {
            "groupId": "",
            "j_username": stu_id,
            "j_password": passwd,
            "j_captcha" : ''
        }

        # Create a CookieJar instance
        self.__cookie = cookiejar.CookieJar()
        # Build a cookie handler with HTTPCookieProcessor
        self.__cookieProc = request.HTTPCookieProcessor(self.__cookie)
        # Build an opener from the handler
        self.__opener = request.build_opener(self.__cookieProc)
        # Install the opener globally so that request.urlopen() uses it
        request.install_opener(self.__opener)

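    def ocr_captcha(self):
        '''Fetch the CAPTCHA image and recognize it with OCR'''

        Req = request.Request(self.__url_check, headers=self.__headers)
        captcha = request.urlopen(Req).read()

        # OCR client from the AI SDK
        client = AipOcr(self.__APP_ID, self.__API_KEY, self.__SECRET_KEY)
        res = client.basicGeneral(captcha)
        # The response may contain no results (e.g. on an API error); fall back to ''
        words = res.get('words_result') or [{}]
        self.__captcha = words[0].get('words', '')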

    def get_captcha(self):
        '''Return the recognized CAPTCHA text'''

        return self.__captcha

    def set_postdata(self):
        '''Update the POST parameters, i.e. fill in the recognized CAPTCHA'''

        self.__post_data["j_captcha"] = self.__captcha

    def login(self):
        '''Simulate the login'''

        # urlencode turns the parameters into a URL-encoded string; encode() converts it to bytes
        data = parse.urlencode(self.__post_data).encode()
        Req = request.Request(self.__url_login, headers=self.__headers)
        html = request.urlopen(Req, data=data)
        # Note: the login page is GBK-encoded
        return html.read().decode("GBK")

    def get_scoal(self):
        '''Fetch the grade page and split it into cells with a regex'''

        Req = request.Request(self.__url_scoal, headers=self.__headers)
        res = request.urlopen(Req).read().decode()

        # Regex used to pull the cell contents out of the HTML
        pat = re.compile('<td>(.*?)</td>', re.S)
        cells = re.findall(pat, res)

        # Tidy up the captured data
        for i, con in enumerate(cells):
            cells[i] = con.replace("\n        ", "")

        return cells

    def display(self, cells):
        '''Print the grade information'''

        # Skip the 3 leading cells; each course record then spans 13 cells
        y = (len(cells) - 3) // 13
        new_list = []

        for m in range(y):
            new_list.append(cells[3 + m * 13:16 + m * 13])

        print("Year   Term   Pass    Score       Credits           Course")

        for item in new_list:
            print("{}   {}    {:>5s}      {:5s}    {:^5s}  {:^20s}".format(
                item[0], item[1], item[12], item[6].replace('<span style=" color:#FF0000">', "").replace("</span>", ""),
                item[7], item[3]))

if __name__ == '__main__':
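    cnt = 1
    # Error text the site shows for a wrong CAPTCHA ("The CAPTCHA entered is incorrect!");
    # it must match the page verbatim, so it is left untranslated
    err_str = "輸入的驗證碼不正確！"

    # Put your own student ID and password here
    stu = Hust("xxxxxx", "xxxxxx")
    while True:
        stu.ocr_captcha()
        print("Recognized CAPTCHA: %s     ------      " % stu.get_captcha(), end="")
        stu.set_postdata()
        html = stu.login()
        if err_str not in html:
            print("CAPTCHA accepted")
            break
        cnt += 1
        print("CAPTCHA rejected, starting attempt %d" % cnt)
    print()
    print("Score Info".center(70, "-"))
    scores = stu.get_scoal()
    stu.display(scores)
    print("End".center(70, "-"))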
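
The regex extraction in get_scoal and the row grouping in display can be tried offline, without logging in. What follows is only a minimal sketch under stated assumptions: SAMPLE_HTML is made-up markup rather than the real grade page, and extract_cells and group_rows are hypothetical helper names that do not appear in the program above. It keeps the same layout assumption as display: 3 leading cells, then 13 cells per course record.

```python
import re

# Made-up markup (an assumption): 3 leading cells, then one 13-cell record.
_values = ["h0", "h1", "h2"] + ["c{}".format(i) for i in range(13)]
SAMPLE_HTML = "".join("<td>\n        {}</td>".format(v) for v in _values)


def extract_cells(html):
    """Return the whitespace-stripped content of every <td>...</td> cell."""
    # re.S lets '.' match newlines, so cells spanning several lines are captured.
    return [cell.strip() for cell in re.findall(r"<td>(.*?)</td>", html, re.S)]


def group_rows(cells, skip=3, width=13):
    """Drop the first `skip` cells and split the rest into rows of `width` cells.

    A trailing partial row is discarded, as in the original display method.
    """
    body = cells[skip:]
    full_rows = len(body) // width
    return [body[i * width:(i + 1) * width] for i in range(full_rows)]


rows = group_rows(extract_cells(SAMPLE_HTML))
for row in rows:
    print(row[0], row[12])
```

Slicing with a step of 13 replaces the index arithmetic in display, and stripping each cell avoids depending on the exact amount of indentation in the page source.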

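Screenshot of the finished result:

Please ignore the fact that this person failed a course, 0.0

My understanding may be incomplete, so please use this with caution. I will improve it later. Thanks for your support!
Discussion is welcome.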
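
One possible refinement, offered only as a sketch: the main loop retries CAPTCHA recognition without any upper bound, so it never stops if the OCR keeps failing. The helper below caps the number of attempts. The name login_with_retries and the max_tries parameter are hypothetical and are not part of the program above; attempt_login stands in for one round of ocr_captcha, set_postdata and login.

```python
def login_with_retries(attempt_login, max_tries=5):
    """Call attempt_login() until it returns True or max_tries is reached.

    Returns the 1-based number of the successful attempt, or None if
    every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        if attempt_login():
            return attempt
    return None


# Stand-in for the real login round: fails twice, then succeeds.
_outcomes = iter([False, False, True])
result = login_with_retries(lambda: next(_outcomes))
print(result)
```

In the real program, attempt_login would run one recognition and login round and return whether the error message was absent from the response.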
