Figure out in one minute why nobody reads your blog


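Many factors affect how much traffic a blog gets: your account's weight on the site, how many posts you have written, and whether your titles catch the eye. Those all take time to build up, so today we look at just one dimension: posting time. Every platform, whether a technical blog site such as CSDN or cnblogs (博客園) or an entertainment platform such as Toutiao, Baidu Tieba, Douyin or Kuaishou, has its own peak hours. Take Baidu Tieba: its users skew young, and night owls make up the bulk of them. Its peak is 21:00-23:00, when views and comments are at their highest, so content creators naturally post in that window to get more of both.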

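What about cnblogs? We don't know yet. Since the site is full of programmers, the statistics ought to be done the way engineers would do them, so let's write a crawler that collects the posting time and view count of every post.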

Languages needed:

    Python

    SQL Server (plus a little C# to load the data into it)

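Crawling the data

Open the cnblogs home page: the post list shows each post's publish time and view count. The site exposes at most 200 pages, so all we need is the view count and publish time of every post on those 200 pages.

We'll write the crawler with Python and Scrapy.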

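Environment setup:

pip install scrapy installs the crawler framework. Installing Scrapy has its share of pitfalls; if Scrapy is new to you, see "Scrapy tutorial and common pitfalls".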

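scrapy startproject cnblogs creates the project (the spider below imports from cnblogs.items, so the project has to be named cnblogs)

scrapy genspider csblog "cnblogs.com" creates the spider file

Edit items.py under cnblogs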

title: post title

read: view count

date: publish time

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogsItem(scrapy.Item):
    title = scrapy.Field()
    read = scrapy.Field()
    date = scrapy.Field()

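Next we write the spider. First, inspect the HTML structure of the home page.

To begin with, a gripe about a pitfall with paging: in https://www.cnblogs.com/#p4 the #p4 looks like the page number, but no matter how many times I changed it, the crawl always returned the first page.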
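One reason the #p4 suffix has no effect on a crawler: everything after # is a URL fragment, which stays on the client and is not part of the HTTP request. A minimal sketch with the standard library's urllib.parse shows this (the helper name request_target is made up for illustration):

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query string) that is actually sent to the server."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# The fragment is parsed out separately and never reaches the server,
# so page 1 and page 4 produce the same request.
print(urlsplit("https://www.cnblogs.com/#p4").fragment)  # p4
print(request_target("https://www.cnblogs.com/#p4"))     # /
print(request_target("https://www.cnblogs.com/#p1"))     # /
```

In a browser the page's own JavaScript reads the fragment and fetches the right page; a crawler has to issue that underlying request itself.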

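Inspecting the requests in the browser's developer tools revealed why: the URL is rewritten, and to turn the page you have to POST the page index as form data to https://www.cnblogs.com/mvc/AggSite/PostList.aspx.

From there it's easy: send requests to that address and pull the data out of the returned HTML. Here's the code.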

# -*- coding: utf-8 -*-
import re

import scrapy
from cnblogs.items import CnblogsItem


class CsblogSpider(scrapy.Spider):
    name = 'csblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/mvc/AggSite/PostList.aspx']

    def start_requests(self):
        url = self.start_urls[0]
        # cnblogs only exposes 200 pages; range() excludes its end value,
        # so 201 is needed to include page 200
        for page in range(1, 201):
            self.logger.info("Crawling page %d", page)
            post_data = {
                'CategoryId': '808',
                'CategoryType': "SiteHome",
                'ItemListActionName': "PostList",
                'PageIndex': str(page),
                'ParentCategoryId': '0',
                'TotalPostCount': '400',
            }
            yield scrapy.FormRequest(url=url, formdata=post_data)

    def parse(self, response):
        # Every post sits in a <div class="post_item">
        for each in response.xpath("/html/body/div[@class='post_item']"):
            body = each.xpath('div[@class="post_item_body"]')
            # Post title (None if the link is missing)
            title = body.xpath('h3/a/text()').extract_first()
            # Publish date: join the div's text nodes, then strip the
            # "發佈於" ("posted on") label and the surrounding whitespace
            date = ''.join(body.xpath('div/text()').extract())
            date = date.replace("發佈於", "").strip()
            # View count: keep only the digits of a string like "閱讀(123)"
            read = body.xpath(
                'div/span[@class="article_view"]/a/text()'
            ).extract_first(default='')
            match = re.search(r'\d+', read)

            item = CnblogsItem()
            item['title'] = title
            item['read'] = match.group() if match else '0'
            item['date'] = date
            yield item

The spider is short, which is one of Python's strengths.
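The string clean-up is the fiddly part of the spider, so it can be worth trying in isolation. The sketch below assumes the raw text looks like "閱讀(128)" for the view count and carries a "發佈於" label in front of the date; the helper names and the sample date are made up for illustration, not taken from the site.

```python
import re
from typing import Iterable


def clean_read(raw: str) -> int:
    """Pull the number out of a string such as '閱讀(128)'; 0 if none is found."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else 0


def clean_date(text_nodes: Iterable[str], label: str = "發佈於") -> str:
    """Join the text nodes, drop the label, and trim whitespace and line breaks."""
    joined = "".join(text_nodes)
    return joined.replace(label, "").strip()


# Hypothetical raw values shaped like what the XPath queries return
print(clean_read("閱讀(128)"))  # 128
print(clean_date(["      \r\n    ", " \r\n發佈於 2019-01-02 09:30 \r\n    "]))  # 2019-01-02 09:30
```

Keeping these as small pure functions makes them easy to check without running a crawl.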

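Run scrapy crawl csblog -o data.xml to save the crawled data as XML.

The crawled data is now in a local XML file, and what remains is the statistics. Every trade has its specialty, and for this kind of counting SQL is hard to beat, so Python's job ends here.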
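Before handing the file to another tool, it can help to peek at what the export contains. This is a minimal sketch that assumes Scrapy's default XML feed layout (an items root with one item element per post, and one child element per field); the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed export layout
SAMPLE_XML = """<items>
<item><title>First post</title><read>128</read><date>2019-01-02 09:30</date></item>
<item><title>Second post</title><read>42</read><date>2019-01-05 22:05</date></item>
</items>"""


def parse_items(xml_text: str) -> list:
    """Turn the exported XML into a list of dicts with an integer read count."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title", default=""),
            # findtext may return None or '' for a missing/empty element
            "read": int(item.findtext("read") or 0),
            "date": item.findtext("date", default=""),
        })
    return rows


rows = parse_items(SAMPLE_XML)
print(len(rows))        # 2
print(rows[0]["read"])  # 128
```

For the real file, ET.parse("data.xml").getroot() could replace ET.fromstring.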

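Storing the data

To make the data easy to query, we load the XML into MS SQL Server, and nothing suits that job better than SQL Server's old partner C#. There isn't much to explain, just a few simple methods.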

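        // Requires: System, System.Collections.Generic, System.Data, System.IO,
        // System.Reflection, System.Xml.Serialization.
        // The data class and SqlHelper are helper types not shown in this post.
        static void Main(string[] args)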
        {
            data d = (data)Deserialize(typeof(data), File.OpenRead(@"D:/MyCode/cnblogs/cnblogs/data.xml"));
            DataTable dt = ToDataTable<data.item>(d.items);
            dt.TableName = "t_article";
            dt.Columns.Remove("date");
            SqlHelper.ExecuteNonQuery(dt);
        }

        /// <summary>
        /// Convert a List{T} to a DataTable.
        /// </summary>
        private static DataTable ToDataTable<T>(List<T> items)
        {
            var tb = new DataTable(typeof(T).Name);
            PropertyInfo[] props = typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance);
            foreach (PropertyInfo prop in props)
            {
                Type t = GetCoreType(prop.PropertyType);
                tb.Columns.Add(prop.Name, t);
            }
            foreach (T item in items)
            {
                var values = new object[props.Length];

                for (int i = 0; i < props.Length; i++)
                {
                    values[i] = props[i].GetValue(item, null);
                }

                tb.Rows.Add(values);
            }
            return tb;
        }

        /// <summary>
        /// Determine if the specified type is nullable
        /// </summary>
        public static bool IsNullable(Type t)
        {
            return !t.IsValueType || (t.IsGenericType && t.GetGenericTypeDefinition() == typeof(Nullable<>));
        }

        /// <summary>
        /// Return underlying type if type is Nullable otherwise return the type
        /// </summary>
        public static Type GetCoreType(Type t)
        {
            if (t != null && IsNullable(t))
            {
                if (!t.IsValueType)
                {
                    return t;
                }
                else
                {
                    return Nullable.GetUnderlyingType(t);
                }
            }
            else
            {
                return t;
            }
        }
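        /// <summary>
        /// Deserialize an object of the given type from an XML stream.
        /// </summary>
        /// <param name="type">Target type to deserialize into.</param>
        /// <param name="stream">Stream containing the XML.</param>
        /// <returns>The deserialized object.</returns>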
        public static object Deserialize(Type type, Stream stream)
        {
            XmlSerializer xmldes = new XmlSerializer(type);
            return xmldes.Deserialize(stream);
        }

The data is now safely stored in SQL Server; next comes the main event, the statistics.

Analyzing the data
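For readers without a C# and SQL Server setup, the same load step could be sketched in Python with the standard library's sqlite3 module. This is a stand-in rather than the approach used in this post: SQLite replaces SQL Server, the table layout (title, read, time) is inferred from the queries that follow, and the sample rows are invented.

```python
import sqlite3


def load_articles(rows):
    """Create an in-memory t_article table and bulk-insert (title, read, time) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE t_article (title TEXT, "read" INTEGER, "time" TEXT)'
    )
    conn.executemany(
        'INSERT INTO t_article (title, "read", "time") VALUES (?, ?, ?)',
        rows,
    )
    conn.commit()
    return conn


# Hypothetical sample rows
conn = load_articles([
    ("First post", 128, "2019-01-02 09:30"),
    ("Second post", 42, "2019-01-05 22:05"),
])
print(conn.execute("SELECT COUNT(*) FROM t_article").fetchone()[0])  # 2
```

Passing a file path instead of ":memory:" would keep the table on disk between runs.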

-- Total number of posts across the 200 pages
select COUNT(*) from t_article

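-- Which posting hour collects the most reads?
-- The results show 9 a.m. has the most reads, which is no surprise.
-- Reads at 6 a.m. (5180) and 7 a.m. (55144) differ by roughly 10x,
-- and 8 a.m. is about three times 7 a.m., which suggests programmers
-- are gradually arriving at work; working hours are surely the peak
-- time for looking things up.
-- Sure enough, 8, 9, 10, 11, 15 and 16 o'clock are the peak hours,
-- all within working hours. Unexpectedly, reads at 22:00 are not low
-- either; it seems programmers keep at it after getting home too
-- (probably working overtime).
select 
CONVERT(INT, CONVERT(varchar(2),time, 108)) as [hour],
SUM([read]) as [read]
from t_article 
group by 
CONVERT(INT, CONVERT(varchar(2),time, 108)) 
order by [read] desc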

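-- Distribution of reads across the days of the week.
-- No surprise at all: Wednesday is far higher than the other six days.
-- Monday to Friday are working days and reads are high every day;
-- on weekends reads drop sharply, because people are off
-- (and, for once, not working overtime).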
select 
datename(weekday, time) as weekday,
SUM([read]) as [read]
from t_article 
group by 
datename(weekday, time) 
order by [read] desc

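-- Reads ranked by publish date.
-- Read counts are roughly proportional to how long a post has been up,
-- which means that if nobody reads the post you worked so hard on,
-- don't worry: time will not let you down.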
select 
CONVERT(varchar(100), time, 111),
sum([read])
from t_article 
group by CONVERT(varchar(100), time, 111)
order by sum([read])
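The hour and weekday groupings can also be cross-checked outside the database. The following is a rough Python sketch of the same GROUP BY and ORDER BY logic; the function names and sample rows are invented for illustration and are not the crawled data.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def sum_reads(rows, key):
    """Sum reads per group, highest first (GROUP BY ... ORDER BY [read] DESC)."""
    totals = defaultdict(int)
    for posted_at, reads in rows:
        totals[key(posted_at)] += reads
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


def reads_by_hour(rows):
    return sum_reads(rows, lambda ts: ts.hour)


def reads_by_weekday(rows):
    # weekday() is 0 for Monday, which avoids locale-dependent day names
    return sum_reads(rows, lambda ts: WEEKDAYS[ts.weekday()])


# Hypothetical sample rows: (publish time, read count)
SAMPLE_ROWS = [
    (datetime(2019, 1, 2, 9, 15), 120),
    (datetime(2019, 1, 2, 9, 40), 80),
    (datetime(2019, 1, 5, 22, 5), 30),
]

print(reads_by_hour(SAMPLE_ROWS))     # [(9, 200), (22, 30)]
print(reads_by_weekday(SAMPLE_ROWS))  # [('Wednesday', 200), ('Saturday', 30)]
```

Note that both groupings bucket posts by their publish time, just as the SQL does, so they describe which posting slots accumulate reads rather than when readers were online.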