Study Notes on "Python For Data Analysis", Part 1

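The introductory chapter walks through an example based on the MovieLens 1M dataset. The book says the dataset comes from GroupLens Research (http://www.groupLens.org/node/73); that address redirects to https://grouplens.org/datasets/movielens/, which hosts the various rating datasets collected from the MovieLens website as downloadable archives. The MovieLens 1M dataset we need is among them.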

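After downloading and extracting the archive, the folder contains three .dat files: users.dat, ratings.dat and movies.dat.

All three tables are used in the example. The Chinese edition of "Python For Data Analysis" that I am reading (a PDF) is the 2014 first edition, and all of its examples were written against Python 2.7 and pandas 0.8.2, whereas I have Python 3.5.2 and pandas 0.20.2 installed. Some functions and methods differ considerably between the two: in some cases the parameters changed in the newer version, in others an old function has been deprecated or removed. As a result, running the book's sample code produces a number of errors and warnings. With a setup like mine, the MovieLens 1M code runs into the following problems.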

When reading the .dat files into pandas DataFrame objects, the book gives this code:

import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

Running it as is produces warnings:

F:/python/HelloWorld/DataAnalysisByPython-1.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
F:/python/HelloWorld/DataAnalysisByPython-1.py:7: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
F:/python/HelloWorld/DataAnalysisByPython-1.py:10: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)

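The code still runs, but being a perfectionist I wanted to get rid of the warning. It says that the 'c' engine does not support regex separators (any separator longer than one character, other than '\s+', is treated as a regex), so pandas falls back to the 'python' engine. Conveniently, pandas.read_table has an engine parameter that selects the parser engine, with 'c' and 'python' as the two options. Since the 'c' engine cannot handle the '::' separator, we only need to set engine='python' explicitly.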

users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
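
To try the engine setting without downloading the dataset, here is a minimal sketch that parses two made-up rows from an in-memory string. The sample rows and the name users_demo are my own assumptions for illustration, and I use pd.read_csv (which accepts the same sep and engine arguments) rather than read_table.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same '::'-separated layout as users.dat.
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is treated as a regex, which only
# the 'python' engine supports; naming it explicitly avoids the fallback warning.
users_demo = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                         names=unames, engine='python')
print(users_demo)
```

The same call shape should apply to the real files by replacing the StringIO object with a path such as 'ml-1m/users.dat'.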

Next, the pivot_table method is used on the merged data to compute each movie's mean rating by gender. The book's code is:
```
mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
```
Running it as is raises an error, so this code does not work:
```
Traceback (most recent call last):
  File "F:/python/HelloWorld/DataAnalysisByPython-1.py", line 19, in <module>
    mean_ratings = data.pivot_table('rating', rows='title', cols='gender', aggfunc='mean')
TypeError: pivot_table() got an unexpected keyword argument 'rows'
```
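The TypeError says that 'rows' is not a keyword argument the method accepts. What happened? Checking the pandas API documentation on the official site (http://pandas.pydata.org/pandas-docs/stable/api.html) shows that the keyword arguments of pivot_table are different in pandas 0.20.2. To get the same result, replace rows with index; likewise there is no cols parameter any more, so use columns instead.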
```
mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
```
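As a quick check of the index/columns keywords, the sketch below runs the same kind of pivot on a tiny hand-made table. The data and the names data_demo and mean_demo are assumptions for illustration only, not the MovieLens data.

```python
import pandas as pd

# Hypothetical miniature of the merged table: one row per rating.
data_demo = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# index= replaces the old rows=, columns= replaces the old cols=.
mean_demo = data_demo.pivot_table(values='rating', index='title',
                                  columns='gender', aggfunc='mean')
print(mean_demo)
```

Each row of the result is a title, each column a gender, and each cell the mean rating for that pair; for example, title A has a mean of 3.0 for F because its two F ratings are 4 and 2.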
To find the movies that female viewers like most, the DataFrame is sorted by the F column in descending order. The book's sample code is:
```
top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
```
This one only produces a warning and does not stop the program:
```
F:/python/HelloWorld/DataAnalysisByPython-1.py:32: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
```
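The FutureWarning says that the by argument of sort_index is deprecated and may stop working in a future release of the library, and it recommends sort_values instead. The API documentation describes pandas.DataFrame.sort_index as “Sort object by labels (along an axis)” and pandas.DataFrame.sort_values as “Sort by the values along either axis”. Sorting by the values in column F is exactly what sort_values is for, so I simply replaced the call with sort_values. The later "Measuring rating disagreement" part also uses sort_index, and it can be replaced with sort_values in the same way.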
```
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
```
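To make the difference between the two methods concrete, here is a small sketch on a made-up table; the values and the names ratings_demo, by_value and by_label are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mean ratings, with row labels deliberately out of order.
ratings_demo = pd.DataFrame({'F': [3.0, 4.5, 4.0], 'M': [5.0, 1.0, 2.0]},
                            index=['B', 'C', 'A'])

# sort_values orders the rows by the values in column F.
by_value = ratings_demo.sort_values(by='F', ascending=False)

# sort_index orders the rows by their labels and ignores the values.
by_label = ratings_demo.sort_index()

print(by_value)
print(by_label)
```

Here by_value puts row C first because it has the highest F value, while by_label puts row A first because it sorts alphabetically by label.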

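The last error is also related to sorting. In "Measuring rating disagreement", after computing the standard deviation of the ratings, the filtered Series is sorted in descending order. The book's code is: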

print(rating_std_by_title.order(ascending=False)[:10])

The error here is:

Traceback (most recent call last):
  File "F:/python/HelloWorld/DataAnalysisByPython-1.py", line 47, in <module>
    print(rating_std_by_title.order(ascending=False)[:10])
  File "E:\Program Files\Python35\lib\site-packages\pandas\core\generic.py", line 2970, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'order'

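The order method no longer exists, so I had to look for a replacement in the API documentation. There are two candidates, sort_index and sort_values, the same pair as for DataFrame. Since the goal is to sort by value rather than by label, I chose sort_values: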

print(rating_std_by_title.sort_values(ascending=False)[:10])

The result matches the output shown in the book, so the replacement is safe to use.
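
The whole step can also be sketched on a few made-up ratings: compute the per-title standard deviation, then take the largest values with Series.sort_values. The data and the names data_demo, std_demo and top are assumptions for illustration, and the filtering by number of ratings is left out.

```python
import pandas as pd

# Hypothetical ratings: A is divisive, B is unanimous, C is in between.
data_demo = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'C', 'C'],
    'rating': [1, 5, 3, 3, 2, 4],
})

# Sample standard deviation of the ratings, grouped by title.
std_demo = data_demo.groupby('title')['rating'].std()

# Series.sort_values replaces the removed Series.order.
top = std_demo.sort_values(ascending=False)[:2]
print(top)
```

Title A comes first because its ratings (1 and 5) are the most spread out, followed by C.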

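Differences between versions of a third-party library can be quite noticeable. My advice is to use the latest version and to keep the API documentation on the official site at hand, which makes problems like these easy to resolve.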
