Hey folks, today we're going to scrape the latest qcwu (51job) job-posting data.
Sites like this get crawled by everyone, so they update often, and that means the code has to be rewritten just as often.
If you know what you're doing, no big deal: let them update, it won't stop you. But if you don't, one site update and you're stuck on the spot.
So this issue is for the folks who don't know how yet. I've also recorded a video with a detailed walkthrough, packaged together with the source code.
Software and tools

First, let's see what we need to prepare.

Environment
- Python 3.8
- PyCharm

Modules
```python
# Third-party module that needs installing: requests  >>> pip install requests
# csv is part of the Python standard library, so it needs no installation
```
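If you're not sure whether requests is available in your environment, a quick check (my addition, not from the original article):

```python
# Quick import check; prints a hint if requests is missing
try:
    import requests
    print('requests', requests.__version__, 'is ready')
except ImportError:
    print('requests is missing; run: pip install requests')
```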
The basic crawler workflow
1. Data source analysis (the thinking never changes)

- Clarify the requirements:
  - Pin down the target site and the data you want
  - Site: 51job
  - Data: job postings
- Use the browser developer tools to capture packets and find where the data actually comes from:
  I. Open the developer tools: press F12, or right-click the page, choose Inspect, and switch to the Network tab
  II. Refresh the page so the data is loaded again
  III. Use the search box to search for a piece of the data you want; that points you to the packet that carries it

Job-listing data packet: https://we.***.com/api/job/search-pc?api_key=51job&timestamp=1688645783&keyword=python&searchType=2&function=&industry=&jobArea=010000%2C020000%2C030200%2C040000%2C090200&jobArea2=&landmark=&metro=&salary=&workYear=&degree=&companyType=&companySize=&jobType=&issueDate=&sortType=0&pageNum=1&requestId=&pageSize=20&source=1&accountId=&pageCode=sou%7Csou%7Csoulb
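Once you have the packet URL, you can split its query string into the params dict used by the code below instead of retyping it by hand. A small sketch using the standard library (urllib.parse is my choice here, not something from the article; the host is masked as we.***.com throughout, so substitute the domain from your own capture):

```python
from urllib.parse import urlsplit, parse_qsl

# A shortened copy of the captured packet URL (masked host)
packet = ('https://we.***.com/api/job/search-pc?api_key=51job&timestamp=1688645783'
          '&keyword=python&searchType=2&sortType=0&pageNum=1&pageSize=20')
# keep_blank_values=True preserves the empty parameters the site sends
params = dict(parse_qsl(urlsplit(packet).query, keep_blank_values=True))
print(params)  # {'api_key': '51job', 'timestamp': '1688645783', 'keyword': 'python', ...}
```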
2. Code implementation steps (the steps never change)

- Send the request: have the code mimic a browser and request the URL (the job-listing packet URL above)
- Get the data: grab the server's response, i.e. everything it sends back (developer tools: Response)
- Parse the data: extract just the fields we care about (the basic posting info)
- Save the data: write those fields into a spreadsheet file
Code walkthrough

Modules
```python
# Import the HTTP request module
import requests
# Import the pretty-print module (handy for inspecting JSON)
from pprint import pprint
# Import the csv module
import csv
```
Send the request: mimic a browser requesting the URL
```python
# Request headers: copied from the browser so the request looks like a normal visit
headers = {
    'Cookie': 'guid=54b7a6c4c43a33111912f2b5ac6699e2; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2254b7a6c4c43a33111912f2b5ac6699e2%22%2C%22first_id%22%3A%221892b08f9d11c8-09728ce3464dad8-26031d51-3686400-1892b08f9d211e7%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMTg5MmIwOGY5ZDExYzgtMDk3MjhjZTM0NjRkYWQ4LTI2MDMxZDUxLTM2ODY0MDAtMTg5MmIwOGY5ZDIxMWU3IiwiJGlkZW50aXR5X2xvZ2luX2lkIjoiNTRiN2E2YzRjNDNhMzMxMTE5MTJmMmI1YWM2Njk5ZTIifQ%3D%3D%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%24identity_login_id%22%2C%22value%22%3A%2254b7a6c4c43a33111912f2b5ac6699e2%22%7D%2C%22%24device_id%22%3A%221892b08f9d11c8-09728ce3464dad8-26031d51-3686400-1892b08f9d211e7%22%7D; nsearch=jobarea%3D%26%7C%26ord_field%3D%26%7C%26recentSearch0%3D%26%7C%26recentSearch1%3D%26%7C%26recentSearch2%3D%26%7C%26recentSearch3%3D%26%7C%26recentSearch4%3D%26%7C%26collapse_expansion%3D; search=jobarea%7E%60010000%2C020000%2C030200%2C040000%2C090200%7C%21recentSearch0%7E%60010000%2C020000%2C030200%2C040000%2C090200%A1%FB%A1%FA000000%A1%FB%A1%FA0000%A1%FB%A1%FA00%A1%FB%A1%FA99%A1%FB%A1%FA%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA99%A1%FB%A1%FA9%A1%FB%A1%FA99%A1%FB%A1%FA%A1%FB%A1%FA0%A1%FB%A1%FApython%A1%FB%A1%FA2%A1%FB%A1%FA1%7C%21; privacy=1688644161; Hm_lvt_1370a11171bd6f2d9b1fe98951541941=1688644162; Hm_lpvt_1370a11171bd6f2d9b1fe98951541941=1688644162; JSESSIONID=BA027715BD408799648B89C132AE93BF; acw_tc=ac11000116886495592254609e00df047e220754059e92f8a06d43bc419f21; ssxmod_itna=Qqmx0Q0=K7qeqD5itDXDnBAtKeRjbDce3=e8i=Ax0vTYPGzDAxn40iDtrrkxhziBemeLtE3Yqq6j7rEwPeoiG23pAjix0aDbqGkPA0G4GG0xBYDQxAYDGDDPDocPD1D3qDkD7h6CMy1qGWDm4kDWPDYxDrjOKDRxi7DDvQkx07DQ5kQQGxjpBF=FHpu=i+tBDkD7ypDlaYj9Om6/fxMp7Ev3B3Ix0kl40Oya5s1aoDUlFsBoYPe723tT2NiirY6QiebnnDsAhWC5xyVBDxi74qTZbKAjtDirGn8YD===; ssxmod_itna2=Qqmx0Q0=K7qeqD5itDXDnBAtKeRjbDce3=e8i=DnIfwqxDstKhDL0iWMKV3Ekpun3DwODKGcDYIxxD==; acw_sc__v2=64a6bf58f0b7feda5038718459a3b1e625849fa8',
    'Referer': 'https://we.51job.com/pc/search?jobArea=010000,020000,030200,040000,090200&keyword=python&searchType=2&sortType=0&metro=',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
}
# Request URL
url = 'https://we.***.com/api/job/search-pc'
# Query parameters
data = {
    'api_key': '51job',
    'timestamp': '*****',  # masked in the original article
    'keyword': '****',     # masked in the original article; the captured URL used keyword=python
    'searchType': '2',
    'function': '',
    'industry': '',
    'jobArea': '010000,020000,030200,040000,090200',
    'jobArea2': '',
    'landmark': '',
    'metro': '',
    'salary': '',
    'workYear': '',
    'degree': '',
    'companyType': '',
    'companySize': '',
    'jobType': '',
    'issueDate': '',
    'sortType': '0',
    'pageNum': '1',
    'requestId': '',
    'pageSize': '20',
    'source': '1',
    'accountId': '',
    'pageCode': 'sou|sou|soulb',
}
# Send the request
response = requests.get(url=url, params=data, headers=headers)
```
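A small sanity check I'd add right after sending the request (not in the original): the cookie string above contains anti-bot tokens (acw_sc__v2, ssxmod_itna) that expire, and when they do the server tends to answer with an HTML challenge page rather than JSON.

```python
# Confirm we actually got JSON back before trying to parse it;
# if this prints an HTML content type, re-copy fresh cookies from your browser
print(response.status_code)
print(response.headers.get('Content-Type'))
```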
Get the data

Grab the server's response, i.e. everything it sends back.
Developer tools: Response
- response.json() parses the response body as JSON
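Before writing the parsing code, it helps to see what the JSON actually looks like. Pretty-print the first posting (this assumes the `response` object and the `pprint` import from the code above; the `resultbody`/`job`/`items` keys are the ones the parsing loop below uses):

```python
# Pretty-print one posting so the available keys are easy to read
pprint(response.json()['resultbody']['job']['items'][0])
```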
Parse the data

Extract just the fields we want by looping over the postings with a for loop:

```python
for index in response.json()['resultbody']['job']['items']:
    # index is a single posting --> a dictionary
    dit = {
        '職位': index['jobName'],
        '公司': index['fullCompanyName'],
        '薪資': index['provideSalaryString'],
        '城市': index['jobAreaString'],
        '經驗': index['workYearString'],
        '學歷': index['degreeString'],
        '公司性質': index['companyTypeString'],
        '公司規模': index['companySizeString'],
        '職位詳情頁': index['jobHref'],
        '公司詳情頁': index['companyHref'],
    }
```
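One caveat: `index['jobName']` raises a KeyError if a posting happens to lack that field. If you run into that, a defensive variant using dict.get() (my suggestion, same keys as above) falls back to an empty string instead of crashing:

```python
    # Same mapping, but missing fields become '' instead of raising KeyError
    dit = {
        '職位': index.get('jobName', ''),
        '公司': index.get('fullCompanyName', ''),
        '薪資': index.get('provideSalaryString', ''),
        # ...apply the same pattern to the remaining seven fields
    }
```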
Save each row as a dictionary (these two lines sit inside the for loop above):

```python
    csv_writer.writerow(dit)
    print(dit)
```
Save to a spreadsheet

```python
f = open('python.csv', mode='w', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '職位',
    '公司',
    '薪資',
    '城市',
    '經驗',
    '學歷',
    '公司性質',
    '公司規模',
    '職位詳情頁',
    '公司詳情頁',
])
csv_writer.writeheader()
```
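Note that the snippets above are shown in explanation order, not execution order: the file and DictWriter must exist before the loop calls csv_writer.writerow(). Here is the whole thing assembled in execution order, as a sketch under a few assumptions of mine: a simple pageNum loop is my addition for crawling several pages, the Cookie header is omitted for brevity (paste in the full headers dict from above if the site rejects bare requests), and utf-8-sig is swapped in because it tends to open more cleanly in Excel than plain utf-8.

```python
import csv
import requests

# Headers trimmed to the User-Agent from the section above
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
}
url = 'https://we.***.com/api/job/search-pc'  # masked host, as in the article

# Create the file and writer BEFORE the crawl loop
f = open('python.csv', mode='w', encoding='utf-8-sig', newline='')
fields = ['職位', '公司', '薪資', '城市', '經驗', '學歷',
          '公司性質', '公司規模', '職位詳情頁', '公司詳情頁']
csv_writer = csv.DictWriter(f, fieldnames=fields)
csv_writer.writeheader()

for page in range(1, 6):  # first 5 pages; adjust as needed
    data = {
        'api_key': '51job',
        'keyword': 'python',
        'searchType': '2',
        'sortType': '0',
        'pageNum': str(page),
        'pageSize': '20',
        'source': '1',
        'pageCode': 'sou|sou|soulb',
    }
    response = requests.get(url=url, params=data, headers=headers)
    for index in response.json()['resultbody']['job']['items']:
        dit = {
            '職位': index.get('jobName', ''),
            '公司': index.get('fullCompanyName', ''),
            '薪資': index.get('provideSalaryString', ''),
            '城市': index.get('jobAreaString', ''),
            '經驗': index.get('workYearString', ''),
            '學歷': index.get('degreeString', ''),
            '公司性質': index.get('companyTypeString', ''),
            '公司規模': index.get('companySizeString', ''),
            '職位詳情頁': index.get('jobHref', ''),
            '公司詳情頁': index.get('companyHref', ''),
        }
        csv_writer.writerow(dit)
        print(dit)

f.close()
```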
Visualization

The charts below are built with pandas and pyecharts inside Jupyter.
Read the crawled data with pandas:

```python
import pandas as pd

# Read the crawled data (note: the crawl section above saved python.csv;
# rename the file or adjust this path so the two match)
df = pd.read_csv('data.csv')
df.head()

# Fill missing education entries, then count each category
df['學歷'] = df['學歷'].fillna('不限學歷')
edu_type = df['學歷'].value_counts().index.to_list()
edu_num = df['學歷'].value_counts().to_list()
```

Pie chart of education requirements:

```python
from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.faker import Faker
from pyecharts.globals import CurrentConfig, NotebookType

# Tell pyecharts we are rendering inside JupyterLab
CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

c = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(edu_type, edu_num)],
        center=["40%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Python學歷要求"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
c.load_javascript()
c.render_notebook()
```

Pie chart of the city distribution:

```python
# Keep only the city name (drop the district after '·'), then count
df['城市'] = df['城市'].str.split('·').str[0]
city_type = df['城市'].value_counts().index.to_list()
city_num = df['城市'].value_counts().to_list()

c = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(city_type, city_num)],
        center=["40%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Python招聘城市分佈"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
c.render_notebook()
```

Normalise the salary strings into monthly lower and upper bounds, then bucket them into ranges:

```python
def LowMoney(i):
    # Lower bound of the salary range, normalised to yuan per month
    if '萬' in i:
        low = i.split('-')[0]
        if '千' in low:
            low_num = low.replace('千', '')
            low_money = int(float(low_num) * 1000)
        else:
            low_money = int(float(low) * 10000)
    else:
        low = i.split('-')[0]
        if '元/天' in low:
            low_num = low.replace('元/天', '')
            low_money = int(low_num) * 30
        else:
            low_money = int(float(low) * 1000)
    return low_money

df['最低薪資'] = df['薪資'].apply(LowMoney)

def MaxMoney(j):
    # Upper bound of the salary range, normalised to yuan per month
    Max = j.split('-')[-1].split('·')[0]
    if '萬' in Max and '萬/年' not in Max:
        max_num = int(float(Max.replace('萬', '')) * 10000)
    elif '千' in Max:
        max_num = int(float(Max.replace('千', '')) * 1000)
    elif '元/天' in Max:
        max_num = int(Max.replace('元/天', '')) * 30
    else:
        max_num = int((int(Max.replace('萬/年', '')) * 10000) / 12)
    return max_num

df['最高薪資'] = df['薪資'].apply(MaxMoney)

def tranform_price(x):
    # Bucket a monthly salary into a labelled range
    if x <= 5000.0:
        return '0~5000元'
    elif x <= 8000.0:
        return '5001~8000元'
    elif x <= 15000.0:
        return '8001~15000元'
    elif x <= 25000.0:
        return '15001~25000元'
    else:
        return '25000以上'

df['最低薪資分級'] = df['最低薪資'].apply(lambda x: tranform_price(x))
price_1 = df['最低薪資分級'].value_counts()
datas_pair_1 = [(i, int(j)) for i, j in zip(price_1.index, price_1.values)]

df['最高薪資分級'] = df['最高薪資'].apply(lambda x: tranform_price(x))
price_2 = df['最高薪資分級'].value_counts()
datas_pair_2 = [(i, int(j)) for i, j in zip(price_2.index, price_2.values)]
```
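A quick spot-check of the two parsers on typical 51job salary strings (my own examples, not from the article); this is worth doing before trusting the charts, since the unit handling is easy to get wrong:

```python
# Expected conversions, all normalised to yuan per month
print(LowMoney('1.5-2萬'))    # 15000
print(MaxMoney('1.5-2萬'))    # 20000
print(LowMoney('8千-1.2萬'))  # 8000
print(MaxMoney('150元/天'))   # 4500  (150 yuan/day * 30)
```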
Pie charts for the lowest- and highest-salary brackets:

```python
pie1 = (
    Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
    .add('', datas_pair_1, radius=['35%', '60%'])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Python工作薪資\n\n最低薪資區間",
            pos_left='center',
            pos_top='center',
            title_textstyle_opts=opts.TextStyleOpts(
                color='#F0F8FF', font_size=20, font_weight='bold'
            ),
        )
    )
    .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
)
pie1.render_notebook()

pie1 = (
    Pie(init_opts=opts.InitOpts(theme='dark', width='1000px', height='600px'))
    .add('', datas_pair_2, radius=['35%', '60%'])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Python工作薪資\n\n最高薪資區間",
            pos_left='center',
            pos_top='center',
            title_textstyle_opts=opts.TextStyleOpts(
                color='#F0F8FF', font_size=20, font_weight='bold'
            ),
        )
    )
    .set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF', '#7FFFAA'])
)
pie1.render_notebook()
```

Pie chart of the experience requirements:

```python
exp_type = df['經驗'].value_counts().index.to_list()
exp_num = df['經驗'].value_counts().to_list()

c = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(exp_type, exp_num)],
        center=["40%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Python招聘經驗要求"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
)
c.render_notebook()
```

Bar charts of average salary per city:

```python
# Group by city and take the mean of the lower / upper salary bounds
avg_salary = df.groupby('城市')['最低薪資'].mean()
CityType = avg_salary.index.tolist()
CityNum = [int(a) for a in avg_salary.values.tolist()]

avg_salary_1 = df.groupby('城市')['最高薪資'].mean()
CityType_1 = avg_salary_1.index.tolist()
CityNum_1 = [int(a) for a in avg_salary_1.values.tolist()]

from pyecharts.charts import Bar

# Bar chart of the average lower-bound salary per city
c = (
    Bar()
    .add_xaxis(CityType)
    .add_yaxis("", CityNum)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各大城市Python低平均薪資"),
        visualmap_opts=opts.VisualMapOpts(
            dimension=1,
            pos_right="5%",
            max_=30,
            is_inverse=True,
        ),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)),  # rotate x-axis labels 45 degrees
    )
    .set_series_opts(
        label_opts=opts.LabelOpts(is_show=False),
        markline_opts=opts.MarkLineOpts(
            data=[
                opts.MarkLineItem(type_="min", name="最小值"),
                opts.MarkLineItem(type_="max", name="最大值"),
                opts.MarkLineItem(type_="average", name="平均值"),
            ]
        ),
    )
)
c.render_notebook()

# Bar chart of the average upper-bound salary per city
c = (
    Bar()
    .add_xaxis(CityType_1)
    .add_yaxis("", CityNum_1)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="各大城市Python高平均薪資"),
        visualmap_opts=opts.VisualMapOpts(
            dimension=1,
            pos_right="5%",
            max_=30,
            is_inverse=True,
        ),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)),  # rotate x-axis labels 45 degrees
    )
    .set_series_opts(
        label_opts=opts.LabelOpts(is_show=False),
        markline_opts=opts.MarkLineOpts(
            data=[
                opts.MarkLineItem(type_="min", name="最小值"),
                opts.MarkLineItem(type_="max", name="最大值"),
                opts.MarkLineItem(type_="average", name="平均值"),
            ]
        ),
    )
)
c.render_notebook()
```

Conclusions

1. Most postings ask for at least a junior-college (大專) degree
2. Pay mostly falls in the 8,000-25,000 yuan/month range
3. Beijing, Shanghai and Guangzhou pay somewhat more

How to do a simple visual analysis

1. Collect the complete data with a crawler --> spreadsheet / database
2. Read the file contents
3. Count the figures for each category
4. Draw with a visualization module (start from the code templates in the official docs)

```python
import pandas as pd

# Read the data and show the first five rows
df = pd.read_csv('data.csv')
df.head()

c_type = df['公司性質'].value_counts().index.to_list()  # category names
c_num = df['公司性質'].value_counts().to_list()         # counts per category
c_type
```

```python
from pyecharts.charts import Bar          # bar chart from pyecharts
from pyecharts.faker import Faker         # random demo data
from pyecharts.globals import ThemeType   # theme settings

c = (
    Bar({"theme": ThemeType.MACARONS})  # set the theme
    .add_xaxis(c_type)                  # x-axis data
    .add_yaxis("", c_num)               # y-axis data
    .set_global_opts(
        # title and subtitle
        title_opts={"text": "Python招聘企業公司性質分佈",
                    "subtext": "民營 / 已上市 / 外資(非歐美) / 合資 / 國企 / 外資(歐美) / 事業單位"}
    )
    # save to an html file
    # .render("bar_base_dict_config.html")
)
# print(Faker.choose())  # e.g. ['小米', '三星', '華為', '蘋果', '魅族', 'VIVO', 'OPPO'] category names
# print(Faker.values())  # e.g. [38, 54, 20, 85, 71, 22, 38] counts
c.render_notebook()  # render directly in Jupyter
```