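Preface
Hi everyone! This is Qianqian, who loves looking at pretty girls.
It's Python study time again~ Today we're going to scrape some comment data!
Weibo (WB) dynamic data packet capture + all the data extraction methods + word cloud visualization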
Development environment:
- python 3.8: interpreter
- pycharm: code editor
- requests: third-party module
Comment scraping code
    # Import modules
    import requests
    import parsel
    import re
    import csv
    import time

    headers = {
        # Cookie copied from a logged-in browser session; replace it with your own
        'cookie': 'XSRF-TOKEN=V48EJHd1wO3DP9ffnlwgfvQr; WBPSESS=yr8Ogb3qBlrorv2L6-ukSsE1SdVJvjLsi6ub0yOZpfazK2TqOMmvxlay7kNrt6LGuwSQINF-zpQWhR5GxHKCX1k4G2jaPAJoABJpxykZAJt4WAVgjdO_FFGWKvaHbvCJoOFzEoJ5rXkc31Ex4pDEylNKVb9H913jTpjFGBoBha4=; login_sid_t=8f13cfe80a400ba04cd5d9094175b145; cross_origin_proto=SSL; WBStorage=4d96c54e|undefined; _s_tentry=weibo.com; Apache=9429320084537.793.1662010843614; SINAGLOBAL=9429320084537.793.1662010843614; ULV=1662010843618:1:1:1:9429320084537.793.1662010843614:; wb_view_log=1920*10801; SSOLoginState=1662010869; SUB=_2A25OFDZPDeRhGeFI6lsT-CnPyDqIHXVtYCCHrDV8PUNbmtANLXDXkW9NfV7QbU7-nuy6Ejf4yBGzw8ymJY1CysT9; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WWGGcL4DsCvRg-RQA6cXEKN5JpX5KzhUgL.FoMceK.E1hM0e0q2dJLoIp7LxKML1KBLBKnLxKqL1hnLBoMNSo24eonNe0ec; ALF=1693546910; wvr=6; wb_view_log_7619287336=1920*10801; webim_unReadCount=%7B%22time%22%3A1662011426887%2C%22dm_pub_total%22%3A0%2C%22chat_group_client%22%3A0%2C%22chat_group_notice%22%3A0%2C%22allcountNum%22%3A23%2C%22msgbox%22%3A0%7D; PC_TOKEN=b6ef7633b7',
        # [the remaining header entries are cut off in the source]
    }

    # [The source is truncated at this point. The CSV writer setup and the first
    # lines of get_next() are reconstructed from how they are used below and in
    # the word cloud code. The file mode, encoding, header row and the parameter
    # name next_params are assumptions; the request URL is missing from the source.]
    f = open('微博評論.csv', mode='w', encoding='utf-8', newline='')
    csv_writer = csv.writer(f)
    csv_writer.writerow(['user', 'content', 'time_', 'imgUrl'])


    def get_next(next_params):
        url = ...  # missing in the source; presumably built from next_params
        response = requests.get(url=url, headers=headers)
        html_data = response.json()['data']['html']
        selector = parsel.Selector(html_data)
        # .list_box > .list_ul > div .list_con .WB_text:nth-child(1)
        divs = selector.css('.list_box > .list_ul > div')
        # Query string for the next page; empty when there are no more comments
        try:
            sub_ = re.findall('action-data="(id=4808806519278561.*?)"', html_data)[0]
        except IndexError:
            sub_ = ''
        print(sub_)
        # The last div is skipped; it is not a comment entry
        for div in divs[0: -1]:
            content = div.css('.list_con .WB_text:nth-child(1)::text').getall()[1].replace(':', '').replace(' ', '')
            imgUrl = div.css('.WB_face.W_fl img::attr(src)').get()
            user = div.css('.WB_text a:nth-child(1)::text').get()
            time_ = div.css('.WB_from.S_txt2::text').get()
            print(user, content, time_, imgUrl)
            csv_writer.writerow([user, content, time_, imgUrl])
        # Stop when there is no next page, otherwise fetch it
        if sub_ == '':
            return 0
        get_next(sub_)


    get_next('id=4808806519278561&from=singleWeiBo&__rnd=1662011439459')
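The paging step in the scraper is easy to miss, so here is a small stand-alone sketch of just that part. It only uses the standard library. `SAMPLE_HTML` and the helper name `extract_next_params` are made up for illustration, and the attribute values inside the sample are assumptions, not real Weibo output. The idea is the same as in the scraper: look for an `action-data` attribute that starts with the post id, and treat an empty result as "no more pages".

```python
import re

# Hypothetical fragment standing in for response.json()['data']['html'].
# The attribute values are invented for this sketch.
SAMPLE_HTML = (
    '<div class="list_ul">'
    '<a action-type="click_more_comment" '
    'action-data="id=4808806519278561&root_comment_max_id=123&page=2">more</a>'
    '</div>'
)


def extract_next_params(html_data, post_id="4808806519278561"):
    """Return the query string for the next page, or '' when there is none."""
    pattern = r'action-data="(id=' + re.escape(post_id) + r'.*?)"'
    match = re.search(pattern, html_data)
    return match.group(1) if match else ""


# A page that still has a "more comments" link
print(extract_next_params(SAMPLE_HTML))
# The last page: nothing to follow, so the scraper would stop here
print(repr(extract_next_params("<div>no more comments</div>")))
```

Using `re.search` and checking for `None` avoids the try/except around `re.findall(...)[0]`, but both versions behave the same way on these inputs.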
Word cloud code
    import jieba
    import pandas as pd
    import stylecloud

    # Read the CSV file
    df_wb = pd.read_csv('微博評論.csv')


    def get_cut_words(content_series):
        # Load the stop-word list
        stop_words = []
        with open("stop_words.txt", 'r', encoding='utf-8') as f:
            lines = f.readlines()
            for line in lines:
                stop_words.append(line.strip())

        # Add custom keywords so jieba keeps them as single tokens
        my_words = ['沒有欲望', '便宜點']
        for i in my_words:
            jieba.add_word(i)

        # Custom stop words
        my_stop_words = []
        stop_words.extend(my_stop_words)

        # Word segmentation
        word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)

        # Keep tokens that are not stop words and have at least 2 characters
        word_num_selected = [i for i in word_num if i not in stop_words and len(i) >= 2]
        return word_num_selected


    text = get_cut_words(content_series=df_wb['content'])

    # Draw the word cloud
    stylecloud.gen_stylecloud(
        text=' '.join(text),
        collocations=False,
        font_path=r'C:\Windows\Fonts\msyh.ttc',  # a font with Chinese glyphs (Windows path)
        icon_name='fab fa-apple',
        size=768,
        output_name='iPhone.png',
    )
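To see what the filtering step in `get_cut_words` does without installing jieba, here is a minimal sketch that uses only the standard library. The token list stands in for the output of `jieba.lcut`, and both the tokens and the stop words are made-up sample data. The helper name `filter_tokens` is hypothetical as well.

```python
from collections import Counter


def filter_tokens(tokens, stop_words, min_len=2):
    """Drop stop words and tokens shorter than min_len characters."""
    stop_set = set(stop_words)  # set lookup is faster than scanning a list
    return [t for t in tokens if t not in stop_set and len(t) >= min_len]


# Stand-in for jieba.lcut(...) output; sample data invented for this sketch
tokens = ["手機", "的", "價格", "便宜點", "了", "價格", "沒有欲望"]
stop_words = ["的", "了", "手機"]

kept = filter_tokens(tokens, stop_words)
print(kept)
# Word frequencies are what ends up driving the word sizes in a word cloud
print(Counter(kept).most_common(1))
```

Single-character tokens such as "的" would be removed by the length check even if they were missing from the stop-word file, which is why the original code combines both conditions.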
That's all for today's share.