Scraping Jay Chou's Latest MV and Fan Comments from Bilibili with Python, Plus a Word Cloud Analysis


Preface

Good morning, good afternoon, or good evening, everyone!

Environment:

Python 3.8
PyCharm 2021.2
ffmpeg (must be added to the PATH environment variable)

Modules used:

import requests >>> pip install requests

Built-in modules ship with Python, so they need no separate installation:

import re
import json
import subprocess

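To install a third-party Python module, use either of these:

Press Win + R, type cmd, click OK, then enter the install command pip install <module name> (for example, pip install requests) and press Enter.
In PyCharm, click Terminal and enter the same install command.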

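Basic approach:

Data to collect from the video page: 1. the video title 2. the video content

1. Right-click the page, choose View Page Source, then press Ctrl + F and search for playinfo.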

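Implementation steps (a general pattern):

Send the request: imitate a browser and request the URL.
Get the data: read the page source, since the data we want is embedded in it.
Parse the data: extract the pieces we want.
Save the data: write the complete video to a local folder.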
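
The four steps above can be sketched as one small function. This is only an illustration under stated assumptions: scrape, fake_fetch, the demo URL, and the demo pattern are all hypothetical names made up here, and the fetch step is a stand-in so the sketch runs without network access.

```python
import re
import tempfile
from pathlib import Path
from typing import Callable, List


def scrape(url: str, fetch: Callable[[str], str], pattern: str, out_file: Path) -> List[str]:
    """Run the four steps in order and return whatever the pattern matched."""
    html = fetch(url)                    # steps 1 and 2: send the request, get the page source
    matches = re.findall(pattern, html)  # step 3: parse out the wanted pieces
    out_file.write_text("\n".join(matches), encoding="utf-8")  # step 4: save locally
    return matches


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request, so the sketch runs offline."""
    return '<h1 class="demo-title">Demo clip</h1>'


demo_dir = Path(tempfile.mkdtemp())
found = scrape(
    "https://example.com/video",
    fake_fetch,
    r'<h1 class="demo-title">(.*?)</h1>',
    demo_dir / "titles.txt",
)
print(found)
```

In a real run, the fetch argument could be something like lambda u: requests.get(u, headers=headers).text, which is what the code in this article does step by step.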

Code

Collecting the video

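# Import the HTTP request module
import requests
# Import the regular expression module
import re
# Import the JSON module
import json
# Import the pretty-print module
import pprint
# Import the subprocess module
import subprocess

"""

Send the request

Imitate a browser and send a request to the URL
"""

# Target URL
url = 'https://www.bilibili.com/video/BV1ua411p7iA?vd_source=b2da3931eefc454d41eb6bb5b34749d1'
# How does Python code imitate a browser? With request headers, which disguise the script
headers = {
    # referer: anti-hotlinking check; tells the server which page the request came from
    'referer': 'https://www.bilibili.com/',
    # user-agent: identifies the browser making the request
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}
# Send the request ---> get a response object; status code 200 means the request succeeded
response = requests.get(url=url, headers=headers)
# <Response [200]>
print(response)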

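Getting the data

# Get the response body as text ---> a str
# print(response.text)

"""

Parse the data

Extract the content we want
Regular expressions (re) ---> extract data from a string
Indexing: 0 1 2 counts from the front, -3 -2 -1 counts from the back
lis = ['a', 'b', 'c']  lis[0] and lis[-3] both give 'a'

re.findall() --> returns the matches as a list; pick an item by its index
re is the module name
.findall() calls the findall() function in the re module --> finds every match of the data we want
It needs to know where to look and what to look for:
search response.text for "title":"(.*?)","pubdate", where (.*?) is the part we want
"""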

# Get the title
title = re.findall('"title":"(.*?)","pubdate"', response.text)[0].replace(' ', '')
# Strip characters that are not allowed in Windows file names (including the backslash)
title = re.sub(r'[\\/:*?"<>|]', '', title)
# Get the video (playinfo) data
html_data = re.findall('<script>window.__playinfo__=(.*?)</script>', response.text)[0]
# Convert the JSON text into a dict
json_data = json.loads(html_data)
# Dict lookup --> use the key (left of the colon) to get the value (right of the colon)
audio_url = json_data['data']['dash']['audio'][0]['baseUrl']
video_url = json_data['data']['dash']['video'][0]['baseUrl']
print(audio_url)
print(video_url)
print(title)
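
To try the extraction step without hitting the site, here is a rough offline sketch. The sample_html string is made up for illustration and only imitates the two spots the patterns look for; the real page is far larger and its layout may change. first_match is a hypothetical helper, not something from the article, added so that a missing match fails with a readable message instead of an IndexError.

```python
import json
import re

# Made-up page source; it only imitates the parts the two patterns search for.
sample_html = (
    '<script>window.__INITIAL_STATE__={"title":"Demo / clip?","pubdate":1657000000}</script>'
    '<script>window.__playinfo__={"data":{"dash":{'
    '"audio":[{"baseUrl":"https://example.com/a.m4s"}],'
    '"video":[{"baseUrl":"https://example.com/v.m4s"}]}}}</script>'
)


def first_match(pattern: str, text: str) -> str:
    """Return the first captured group, or fail with a readable message."""
    found = re.findall(pattern, text)
    if not found:
        raise ValueError(f"pattern not found: {pattern}")
    return found[0]


# (.*?) is non-greedy: it stops at the first place the rest of the pattern fits.
title = first_match(r'"title":"(.*?)","pubdate"', sample_html).replace(" ", "")
title = re.sub(r'[\\/:*?"<>|]', "", title)  # drop characters Windows forbids in file names

play_info = json.loads(first_match(r"<script>window.__playinfo__=(.*?)</script>", sample_html))
audio_url = play_info["data"]["dash"]["audio"][0]["baseUrl"]
video_url = play_info["data"]["dash"]["video"][0]["baseUrl"]
print(title, audio_url, video_url)
```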

Saving the data

--> 403 Forbidden means access was denied --> send the headers (with the referer) to get past the anti-hotlinking check

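# Import the os module to create the output folder
import os

# Make sure the output folder exists before writing to it
os.makedirs('video', exist_ok=True)
# Send a request and get the audio as binary data
audio_content = requests.get(url=audio_url, headers=headers).content
# Send a request and get the video as binary data
video_content = requests.get(url=video_url, headers=headers).content
# Save both streams; ffmpeg needs them on disk before it can merge them
with open('video\\' + title + '.mp3', mode='wb') as a:
    a.write(audio_content)
with open('video\\' + title + '.mp4', mode='wb') as v:
    v.write(video_content)

# Merge the audio and video with the ffmpeg command-line tool
cmd = f'ffmpeg -i "video\\{title}.mp4" -i "video\\{title}.mp3" -c:v copy -c:a aac -strict experimental "video\\{title}output.mp4"'
subprocess.run(cmd, shell=True)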
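
As an alternative to building one shell string, the merge command can be passed to subprocess.run as a list of arguments, which avoids quoting trouble when a title contains spaces. The sketch below is an assumption-laden variant, not the article's method: build_merge_command and merge are hypothetical helper names, the file names are made up, and the -y flag (overwrite the output without asking) is an addition.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_merge_command(video: Path, audio: Path, output: Path) -> List[str]:
    """Build the ffmpeg call as an argument list, so no shell quoting is needed."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-i", str(audio),
        "-c:v", "copy",  # keep the video stream as it is
        "-c:a", "aac",   # re-encode the audio as AAC
        str(output),
    ]


def merge(video: Path, audio: Path, output: Path) -> bool:
    """Run ffmpeg if it is on PATH; return True only when it exits cleanly."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(build_merge_command(video, audio, output), check=False)
    return result.returncode == 0


folder = Path("video")
command = build_merge_command(folder / "My MV.mp4", folder / "My MV.mp3", folder / "My MVoutput.mp4")
print(command)
```

Calling merge(...) with the real file paths would then run the command without shell=True.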

Collecting the comments

# Import the time module (used to pause between requests)
import time
# Import the HTTP request module
import requests

for page in range(1, 11):
    # Pause for one second so the requests are not sent too quickly
    time.sleep(1)
    # Request URL
    url = f'https://api.bilibili.com/x/v2/reply/main?csrf=9b972b9803693b4f5c0d6a042b2d0c0e&mode=3&next={page}&oid=215631694&plat=1&type=1'
    # Request headers
    headers = {
        # 'origin': 'https://www.bilibili.com',
        'referer': 'https://www.bilibili.com/video/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36',
    }
    # Send the request
    response = requests.get(url=url, headers=headers)
    # Get the data: the text of each comment on this page (empty list if the page has no replies)
    content_list = [i['content']['message'] for i in response.json()['data']['replies'] or []]
    print(content_list)
    # Loop over the comments and append each one to the file
    for content in content_list:
        with open('評論.txt', mode='a', encoding='utf-8') as f:
            f.write(content)
            f.write('\n')
        print(content)
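
The comment extraction relies on every page having a list under data -> replies. A slightly more defensive version might look like the sketch below. The payload shape is taken from the lookup path used in the code above; extract_messages is a hypothetical helper name, and the two sample payloads are invented for the demo.

```python
from typing import Any, Dict, List


def extract_messages(payload: Dict[str, Any]) -> List[str]:
    """Pull the comment text out of one page of the reply JSON."""
    data = payload.get("data") or {}      # tolerate a missing or null "data"
    replies = data.get("replies") or []   # tolerate a missing or null "replies"
    return [reply["content"]["message"] for reply in replies]


# Invented payloads that imitate a page with comments and a page without any.
full_page = {"data": {"replies": [
    {"content": {"message": "first comment"}},
    {"content": {"message": "second comment"}},
]}}
empty_page = {"data": {"replies": None}}

print(extract_messages(full_page))
print(extract_messages(empty_page))
```

With a helper like this, content_list = extract_messages(response.json()) could replace the list comprehension in the loop.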

Building the word cloud

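# Import the jieba word segmentation module
import jieba
# Import the word cloud module
import wordcloud
# Read the file contents
with open('評論.txt', encoding='utf-8') as f:
    txt = f.read()
print(txt)
# Segment the text into words and join them with spaces
string = ' '.join(jieba.lcut(txt))
print(string)
wc = wordcloud.WordCloud(
    width=700,   # width
    height=700,  # height
    background_color='white',  # background color
    font_path='msyh.ttc',  # font (Microsoft YaHei, needed to render Chinese characters)
    scale=15,  # scale factor for the output image
)
wc.generate(string)
wc.to_file('評論詞雲.png')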
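
If punctuation and filler words crowd the picture, one option is to count the segmented words yourself before drawing. The following is a minimal sketch using only the standard library: top_words is a hypothetical helper, the token list is hand-written to stand in for the output of jieba.lcut, and the stop word set is just an example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_words(tokens: Iterable[str], stop_words: Set[str], limit: int = 5) -> List[Tuple[str, int]]:
    """Count tokens, skipping single characters (mostly punctuation and particles) and stop words."""
    kept = []
    for token in tokens:
        word = token.strip()
        if len(word) > 1 and word not in stop_words:
            kept.append(word)
    return Counter(kept).most_common(limit)


# Hand-written tokens standing in for jieba.lcut(txt).
tokens = ["周杰倫", "的", "新歌", "好聽", "周杰倫", "，", "好聽", "周杰倫", "這個"]
ranking = top_words(tokens, stop_words={"這個"})
print(ranking)
```

The resulting counts could then be drawn with wc.generate_from_frequencies(dict(ranking)) instead of wc.generate(string).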

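Closing

That's the end of this article!

If you have suggestions or questions, leave a comment or send me a private message. Let's keep at it together (ง •_•)ง

If you enjoyed it, follow me, or like, bookmark, and comment on the article!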
