Scraping Douban Movies with Python's Scrapy Framework and Visualizing the Results


1. Introduction to the Scrapy framework

The main components are the spiders, the engine, the scheduler, the downloader, and the item pipeline.

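The most commonly used Scrapy commands are scrapy startproject (create a project), scrapy genspider (generate a spider file), and scrapy crawl (run a spider).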

In the generated project, Scrapy creates the items, pipelines, and settings files for you; the only file you add yourself is the spider.

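The items file declares the names of the attributes to scrape, and the pipelines file handles the data flow, such as writing to a file or loading into a database. The spider file defines the allowed domain and the XPath for each attribute, and fills in the corresponding values for every page.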

import scrapy

class DoubanItem(scrapy.Item):
    # One field per attribute scraped for each movie.
    movieRank = scrapy.Field()
    movieName = scrapy.Field()
    Director = scrapy.Field()
    movieDesc = scrapy.Field()
    movieRate = scrapy.Field()
    peopleCount = scrapy.Field()
    movieDate = scrapy.Field()
    movieCountry = scrapy.Field()
    movieCategory = scrapy.Field()
    moviePost = scrapy.Field()

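import json

class DoubanPipeline(object):
    """Write each scraped item to douban.json, one JSON object per line."""
    # The pipeline only runs if it is enabled through ITEM_PIPELINES in the settings file.

    def __init__(self):
        self.f = open("douban.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; close the output file.
        self.f.close()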
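
The pipeline writes one JSON object per line, which is why the analysis later reads the file line by line. As a minimal standalone sketch of that file format (not part of the Scrapy project itself), the following writes a couple of hypothetical sample records to a temporary file and reads them back. The helper names write_json_lines and read_json_lines and the sample values are assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for scraped items.
records = [
    {"movieRank": "1", "movieName": "肖申克的救赎", "movieRate": "9.6"},
    {"movieRank": "2", "movieName": "霸王别姬", "movieRate": "9.6"},
]


def write_json_lines(path, rows):
    """Write each row as one JSON object per line, keeping non-ASCII text as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "douban_sample.json")
write_json_lines(path, records)
loaded = read_json_lines(path)
print(loaded == records)
```

Because every line is a complete JSON document, a crawl that stops halfway still leaves a file whose finished lines can be parsed.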

For working out the XPath expressions, I recommend the Chrome extension XPath Helper.

    # Body of the spider class (a scrapy.Spider subclass). The module also needs
    # "import scrapy" and an import of DoubanItem from the project's items module.
    allowed_domains = ['douban.com']
    baseURL = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Each movie on the page sits in its own <div class="item">.
        node_list = response.xpath("//div[@class='item']")

        for node in node_list:
            item = DoubanItem()
            item['movieName'] = node.xpath("./div[@class='info']/div[1]/a/span/text()").extract()[0]
            item['movieRank'] = node.xpath("./div[@class='pic']/em/text()").extract()[0]
            item['Director'] = node.xpath("./div[@class='info']/div[@class='bd']/p[1]/text()[1]").extract()[0]
            # Some movies have no one-line quote, so fall back to an empty string.
            item['movieDesc'] = node.xpath("./div[@class='info']/div[@class='bd']/p[@class='quote']/span[@class='inq']/text()").get(default="")
            item['movieRate'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item['peopleCount'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[4]/text()").extract()[0]
            # The second text node of the first <p> holds "year / country / category",
            # separated by non-breaking spaces (\xa0); extract it once and split it.
            info = node.xpath("./div[2]/div[2]/p[1]/text()[2]").extract()[0].strip().split('\xa0/\xa0')
            item['movieDate'] = info[0]
            item['movieCountry'] = info[1]
            item['movieCategory'] = info[2]
            item['moviePost'] = node.xpath("./div[@class='pic']/a/img/@src").extract()[0]
            yield item

        # The last page starts at offset 225, so stop once it has been requested.
        if self.offset < 225:
            self.offset += 25
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

At this point the spider runs and produces the JSON file we need.
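
The string handling and the paging arithmetic in parse() can be checked without running a crawl. The sketch below is an illustration under assumptions: raw_info is a hypothetical sample of the text node the spider reads, and split_info and page_urls are helper names invented here, not part of the spider. It splits the "year / country / category" line on the non-breaking-space separator and lists the start offsets that cover 250 entries at 25 per page.

```python
# Hypothetical sample of the text node read for each movie:
# "year / country / category", separated by non-breaking spaces (\xa0).
raw_info = "\n        1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n    "


def split_info(text):
    """Return (date, country, category) from the raw info line."""
    parts = [part.strip() for part in text.strip().split("\xa0/\xa0")]
    return parts[0], parts[1], parts[2]


def page_urls(base_url, total=250, page_size=25):
    """Build one URL per page; the offsets run 0, 25, ..., total - page_size."""
    return [base_url + str(offset) for offset in range(0, total, page_size)]


print(split_info(raw_info))
urls = page_urls("https://movie.douban.com/top250?start=")
print(len(urls), urls[-1])
```

The list has ten pages and the last one starts at 225, so a request for start=250 would only return an empty page.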

Next comes the visualization.

First, let's take stock of the data we have.

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')  # one JSON object per line
douban.info()

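From this data we can analyze the movies' countries of origin, release years, and categories, as well as the influence of particular directors within the Top 250.

For a quick first look, we can use the value_counts() function.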

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# Key each movie by the first character of its country field
# (for example "美" for 美国), then count the occurrences.
df_Country = douban['movieCountry'].str.strip().str[0]
print(df_Country.value_counts())

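American films account for about half of the list, 122 of 250, which reflects the strength of the Hollywood film industry. Japanese and Hong Kong films also hold an important place with audiences in China. Surprisingly, the number of films from mainland China is underwhelming; Douban users are quite picky about domestic films.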
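
Note that the tally above keys on the first character of the country field. As a rough sketch of what that means, the following uses collections.Counter in place of pandas and a hypothetical handful of sample values (not the scraped data) to compare keying on the first character with keying on the first listed country.

```python
from collections import Counter

# Hypothetical movieCountry values; a field may list several countries.
countries = ["美国", "美国 英国", "日本", "香港", "中国大陆 香港", "美国"]

# Key on the first character, as the analysis code does.
by_first_char = Counter(value.strip()[0] for value in countries)

# Key on the first whitespace-separated token, i.e. the first listed country.
by_first_country = Counter(value.split()[0] for value in countries)

print(by_first_char.most_common())
print(by_first_country.most_common())
```

The two groupings give the same counts whenever no two country names share a first character; keying on the first token simply yields more readable labels.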

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# The third character of a four-digit year is its decade digit:
# "9" for the 1990s, "0" for the 2000s, "1" for the 2010s.
df_Date = douban['movieDate'].astype(str).str.strip().str[2]
print(df_Date.value_counts())

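Films from 2000 onward make up more than 70% of the list. Considering that only nine years of the 2010s have passed and that ratings lag behind releases, newer films generally appear to be better liked by the audience. This may be related to how the Douban Top 250 is selected: a film needs more than a certain number of ratings to qualify.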
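
The code above groups films by the third character of the year, which is the decade digit. A minimal sketch that makes the decade explicit and computes the share of films from 2000 onward is shown below. The years are hypothetical sample values, not the scraped data, and the helper name decade is an assumption for illustration.

```python
from collections import Counter

# Hypothetical release years as scraped strings.
years = ["1994", "1993", "2010", "2001", "1972", "2014", "2008", "1997"]


def decade(year_text):
    """Map a year string such as "1994" to a label such as "1990s"."""
    year = int(year_text.strip()[:4])
    return f"{year // 10 * 10}s"


decade_counts = Counter(decade(y) for y in years)
share_since_2000 = sum(1 for y in years if int(y.strip()[:4]) >= 2000) / len(years)

print(decade_counts.most_common())
print(f"{share_since_2000:.0%} of the sample is from 2000 or later")
```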

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')

# Key each movie by the first character of its category field
# (for example "剧" for 剧情), then count the occurrences.
df_Cate = douban['movieCategory'].str.strip().str[0]
print(df_Cate.value_counts())

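Drama films, with their rising and falling plots, win audience approval more easily.

Below are a few of the visualizations.

I'm not very good at presenting data with Python, so the charts are a bit ugly. In practice, I'd recommend a plugin such as ECharts, or Excel or a BI tool, for producing the charts; they are more convenient and better looking.

This is my first attempt at this kind of scraping and visualization, so there are bound to be shortcomings; corrections are welcome.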
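
One way to hand the counts over to ECharts is to build its option object in Python and dump it as JSON. The sketch below assumes a plain bar chart and uses hypothetical category labels rather than the scraped data. The keys (title, xAxis, yAxis, series) follow the ECharts option format, but the chart type and title text are choices made here for illustration.

```python
import json
from collections import Counter

# Hypothetical category labels standing in for the scraped movieCategory column.
categories = ["剧情", "剧情", "喜剧", "动画", "剧情", "喜剧"]
counts = Counter(categories).most_common()

# ECharts option for a bar chart: category names on the x axis, counts as bars.
option = {
    "title": {"text": "Douban Top 250 by category (sample data)"},
    "xAxis": {"type": "category", "data": [name for name, _ in counts]},
    "yAxis": {"type": "value"},
    "series": [{"type": "bar", "data": [n for _, n in counts]}],
}

option_json = json.dumps(option, ensure_ascii=False, indent=2)
print(option_json)
```

The resulting JSON can be pasted into a page that passes it to an ECharts instance's setOption call.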
