The eight lifecycle stages of data processing in pandas, Python's data analysis workhorse

This article walks through eight stages of the data-processing lifecycle in pandas and summarizes how the library handles data at each stage.

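The eight stages run from the DataFrame object itself, through structure inspection, cleaning, preprocessing, extraction, and filtering, to aggregation and statistics.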

First, import the third-party libraries. Besides pandas, data analysis work usually also relies on numpy, the scientific computing library.

# Importing the pandas library and giving it the alias pd.
import pandas as pd

# Importing the numpy library and giving it the alias np.
import numpy as np

1. The data table object (DataFrame)

Data analysis in pandas centers on the DataFrame object, which is used to extract, aggregate, and summarize data.

There are two ways to create a DataFrame. The first is to read an Excel or CSV file; the reader functions return a DataFrame directly, so no extra wrapping is needed.

# Read the CSV file; pd.read_csv already returns a DataFrame.
dataframe_csv = pd.read_csv('./data.csv')

# Read the Excel file; pd.read_excel returns a DataFrame for a single sheet.
dataframe_xlsx = pd.read_excel('./data.xlsx')
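
To try the file-reading path without a file on disk, a minimal sketch can feed `pd.read_csv` an in-memory buffer. The CSV content below is made up for illustration and simply mirrors the column names used in this article.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for ./data.csv.
csv_text = "編程語言,已誕生多少年\nJava,23\nPython,20\nC++,28\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
dataframe_csv = pd.read_csv(io.StringIO(csv_text))

print(type(dataframe_csv).__name__)
print(dataframe_csv.shape)
```

The first row of the text is used as the header, so the frame has three rows and two columns.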

The second is to build the DataFrame yourself from Python objects such as a dictionary.

# Create a DataFrame with two columns: `編程語言` (programming language) and `已誕生多少年` (years since the language appeared).
dataframe = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                          "已誕生多少年": [23, 20, 28]},
                         columns=['編程語言', '已誕生多少年'])

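2. DataFrame structure information

The DataFrame's built-in methods and attributes report its dimensions, column names, dtypes, and so on.

# Create a DataFrame with two columns: `編程語言` (programming language) and `已誕生多少年` (years since the language appeared).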
dataframe = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                          "已誕生多少年": [23, 20, 28]},
                         columns=['編程語言', '已誕生多少年'])

dataframe.info()

Prints a summary of the table: the number of rows and columns, column names, non-null counts, dtypes, and memory usage.

dataframe.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   編程語言    3 non-null      object
#  1   已誕生多少年  3 non-null      int64
# dtypes: int64(1), object(1)
# memory usage: ... (the exact figure depends on the pandas version)

dataframe.columns

Returns the names of all columns as an Index object.

print('Column names: {0}'.format(dataframe.columns))

# Column names: Index(['編程語言', '已誕生多少年'], dtype='object')

dataframe['column name'].dtype

Returns the dtype of a single column.

print('dtype of column 編程語言: {0}'.format(dataframe[u'編程語言'].dtype))

# dtype of column 編程語言: object

dataframe.shape

The shape attribute gives the dimensions as a (rows, columns) tuple.

print('Shape of dataframe: {0}'.format(dataframe.shape))

# Shape of dataframe: (3, 2)

dataframe.values

The values attribute returns all of the data as a NumPy array.

# Import pprint from the standard library.
from pprint import pprint

pprint('Values of dataframe: {0}'.format(dataframe.values))

# "Values of dataframe: [['Java' 23]\n ['Python' 20]\n ['C++' 28]]"

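3. Data cleaning

Data cleaning means normalizing the data in a DataFrame: filling missing values, removing duplicate rows, converting columns to a consistent dtype, and so on.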

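dataframe.fillna()

# Fill every missing value with 0. fillna returns a new object, so assign the result.
dataframe = dataframe.fillna(value=0)

# Fill missing values in one column with that column's mean.
dataframe[u'已誕生多少年'] = dataframe[u'已誕生多少年'].fillna(dataframe[u'已誕生多少年'].mean())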

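map(str.strip)

# Strip leading and trailing whitespace from a column and assign the result back.
# (dataframe[u'編程語言'].str.strip() does the same and tolerates missing values.)

dataframe[u'編程語言'] = dataframe[u'編程語言'].map(str.strip)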

dataframe.astype

# Convert a column to another dtype. astype returns a new Series, so assign it back.

dataframe[u'已誕生多少年'] = dataframe[u'已誕生多少年'].astype('int')

dataframe.rename

# Rename a column. rename returns a new DataFrame and leaves the original unchanged.

renamed = dataframe.rename(columns={u'已誕生多少年': u'語言年齡'})

dataframe.drop_duplicates

# Drop rows that repeat a value in the given column, keeping the first occurrence.

deduplicated = dataframe.drop_duplicates(subset=[u'編程語言'])

dataframe.replace

# Replace a specific value in one column. replace returns a new Series.

replaced = dataframe[u'編程語言'].replace('Java', 'C#')
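
The sample frame used so far has no missing values, padded strings, or duplicates, so the cleaning calls have nothing to do on it. The following sketch runs the same steps on deliberately messy data; the `raw` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: padded strings, a missing value, a duplicate row.
raw = pd.DataFrame({
    "編程語言": [" Java", "Python ", "C++", "C++"],
    "已誕生多少年": [23.0, np.nan, 28.0, 28.0],
})

cleaned = raw.copy()
# Strip surrounding whitespace with the vectorized string accessor.
cleaned["編程語言"] = cleaned["編程語言"].str.strip()
# Fill the missing age with the column mean, then convert to integers.
mean_age = cleaned["已誕生多少年"].mean()
cleaned["已誕生多少年"] = cleaned["已誕生多少年"].fillna(mean_age).astype("int64")
# Drop rows that repeat a language, keeping the first occurrence.
cleaned = cleaned.drop_duplicates(subset=["編程語言"])
# Rename a column and replace a value; both return new objects.
cleaned = cleaned.rename(columns={"已誕生多少年": "語言年齡"})
cleaned["編程語言"] = cleaned["編程語言"].replace("Java", "C#")

print(cleaned)
```

Every step assigns its result, because these methods return new objects instead of modifying the frame in place. The mean of 23, 28, and 28 is about 26.33, which `astype("int64")` truncates to 26.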

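4. Data preprocessing

Data preprocessing refers to the processing applied to data before the main analysis.

For example, most areal geophysical survey data are first interpolated from an irregularly distributed survey grid onto a regular grid before any transformation or enhancement, which makes the computation easier.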

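Merging data

A DataFrame can be combined with another in four ways: merge, append, join, and concat. Each produces a different result.

The examples below demonstrate three of them: append, concat, and join.

append combines the rows of two DataFrame objects. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this example only runs on older versions.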

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeA = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeB = pd.DataFrame({"編程語言": ['Scala', 'C#', 'Go'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Append dataframeB to dataframeA (requires pandas < 2.0).
res = dataframeA.append(dataframeB)

# Printing the result of the append operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

concat combines the rows of two DataFrame objects in the same way.

# Concatenate the two DataFrames row-wise.
res = pd.concat([dataframeA, dataframeB])

# Print the result of the concat operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28
# 0   Scala      23
# 1      C#      20
# 2      Go      28

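concat produces the same result as append here: both stack the rows vertically and keep each frame's original index labels.

join combines two DataFrame objects horizontally, aligning their rows on the index and adding the other frame's columns.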

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeC = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Creating a dataframe with one column called `歷史表現` and three rows.
dataframeD = pd.DataFrame({"歷史表現": ['A', 'A', 'A']})

# Joining the two dataframes together.
res = dataframeC.join(dataframeD, on=None)

# Print the result of the join operation.
print(res)

#      編程語言  已誕生多少年 歷史表現
# 0    Java      23    A
# 1  Python      20    A
# 2     C++      28    A

After the join, the column from dataframeD has been added to the result, and each row is filled with the value A from the row with the same index.
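
merge is listed above but not demonstrated. As a rough sketch, it matches rows on the values of a key column instead of on the index. The `ratings` lookup table below is made up for illustration.

```python
import pandas as pd

languages = pd.DataFrame({"編程語言": ["Java", "Python", "C++"],
                          "已誕生多少年": [23, 20, 28]})
# Hypothetical lookup table: it has no row for C++ and an extra one for Go.
ratings = pd.DataFrame({"編程語言": ["Python", "Java", "Go"],
                        "歷史表現": ["A", "B", "C"]})

# Inner merge keeps only keys present in both frames.
inner = pd.merge(languages, ratings, on="編程語言", how="inner")
# Left merge keeps every row of `languages`; unmatched rows get a missing value.
left = pd.merge(languages, ratings, on="編程語言", how="left")

print(inner)
print(left)
```

The inner result contains only Java and Python, while the left result keeps C++ with a missing value in `歷史表現`.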

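Setting the index

Setting an index is straightforward: call the DataFrame's set_index method with the name of the column to use. set_index returns a new DataFrame and leaves the original unchanged.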

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeE = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# Use the column `編程語言` as the index. The result is a new DataFrame;
# dataframeE itself keeps its default integer index for the examples that follow.
res = dataframeE.set_index(u'編程語言')

# Print the re-indexed DataFrame.
print(res)

#         已誕生多少年
# 編程語言
# Java        23
# Python      20
# C++         28

Sorting data

A DataFrame is sorted either by its index or by the values of one or more columns; in both cases entire rows are reordered.

# Sorting the dataframeE by the index.
res = dataframeE.sort_index()

# Printing the res.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
# 2     C++      28

# Sorting the dataframeE by the column `已誕生多少年`.
res = dataframeE.sort_values(by=['已誕生多少年'], ascending=False)

# Printing the res.
print(res)

#      編程語言  已誕生多少年
# 2     C++      28
# 0    Java      23
# 1  Python      20

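sort_index sorts by the DataFrame's index, while sort_values sorts by the values of one or more columns, in ascending or descending order.

Grouping data

Grouping at the preprocessing stage means tagging rows with a label so they can be categorized later.

Simple cases can be handled with numpy: here the where function assigns a label based on a condition.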

# Creating a new column called `分組標記（高齡/低齡）` and setting the value to `高` if the value in the column `已誕生多少年` is greater
# than or equal to 23, otherwise it is setting the value to `低`.
dataframeE['分組標記（高齡/低齡）'] = np.where(dataframeE[u'已誕生多少年'] >= 23, '高', '低')

# Printing the dataframeE.
print(dataframeE)

#      編程語言  已誕生多少年 分組標記（高齡/低齡）
# 0    Java      23           高
# 1  Python      20           低
# 2     C++      28           高

More involved cases can combine several conditions to pick out the rows to label.

# Creating a new column called `分組標記（高齡/低齡,是否是Java）` and setting the value to `高/是` if the value in the column `已誕生多少年` is
# greater than or equal to 23 and the value in the column `編程語言` is equal to `Java`, otherwise it is setting the value to
# `低/否`.
dataframeE['分組標記（高齡/低齡,是否是Java）'] = np.where((dataframeE[u'已誕生多少年'] >= 23) & (dataframeE[u'編程語言'] == 'Java'), '高/是',
                                             '低/否')

# Printing the dataframeE.
print(dataframeE)

#      編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否
# 2     C++      28           高                 低/否
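
np.where covers two-way labels. For more than two groups, one possible approach is `np.select`, which checks a list of conditions in order. The three-level labels and the thresholds below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"編程語言": ["Java", "Python", "C++"],
                      "已誕生多少年": [23, 20, 28]})

# Conditions are checked in order; the first match wins.
conditions = [
    frame["已誕生多少年"] >= 25,
    frame["已誕生多少年"] >= 22,
]
labels = ["高", "中"]  # hypothetical labels for the two conditions

# Rows matching no condition receive the default label.
frame["分組標記"] = np.select(conditions, labels, default="低")

print(frame)
```

With these thresholds Java (23) is labelled 中, Python (20) is labelled 低, and C++ (28) is labelled 高.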

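5. Extracting data

Extraction means selecting the rows and columns that meet a requirement. A DataFrame supports selection by label, by integer position, or by a mix of both.

loc selects by index label and iloc selects by integer position. With the default integer index used here, labels and positions coincide.

Almost every selection can be expressed with these two indexers.

Select the row whose index label is 2.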

# Selecting the row with the index of 2.
res = dataframeE.loc[2]

# Printing the result of the operation.
print(res)

# 編程語言                   C++
# 已誕生多少年                  28
# 分組標記（高齡/低齡）              高
# 分組標記（高齡/低齡,是否是Java）    低/否
# Name: 2, dtype: object

Select the rows with index labels 0 through 1. A loc slice includes its end label.

# Selecting the rows with the index of 0 and 1.
res = dataframeE.loc[0:1]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0    Java      23           高                 高/是
# 1  Python      20           低                 低/否

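Select the region made up of the first two rows and the first two columns.

# Note that an iloc slice works on positions and excludes its end, unlike the loc slice above.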

# Selecting the first two rows and the first two columns.
res = dataframeE.iloc[:2, :2]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20
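
The difference between the two slices is easier to see when the index labels are not 0, 1, 2. A small sketch, with index labels made up for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"編程語言": ["Java", "Python", "C++"],
                      "已誕生多少年": [23, 20, 28]},
                     index=[10, 20, 30])  # hypothetical non-default labels

# loc slices by label and includes the end label.
by_label = frame.loc[10:20]
# iloc slices by position and excludes the end position.
by_position = frame.iloc[0:1]

print(by_label)
print(by_position)
```

Here `loc[10:20]` returns two rows (labels 10 and 20), while `iloc[0:1]` returns only the first row.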

Select rows by condition, here by matching specific values in one column.

# Select the rows whose `編程語言` value is either `Java` or `C++`.
res = dataframeE.loc[dataframeE[u'編程語言'].isin(['Java', 'C++'])]

# Printing the result of the operation.
print(res)

#    編程語言  已誕生多少年 分組標記（高齡/低齡） 分組標記（高齡/低齡,是否是Java）
# 0  Java      23           高                 高/是
# 2   C++      28           高                 低/否

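6. Filtering data

Filtering is the last step in the lifecycle that selects from the original data; it combines logical conditions to narrow the rows down.

The examples below apply the three common logical operations, AND, OR, and NOT, to a DataFrame.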

# Creating a dataframe with two columns, one called `編程語言` and the other called `已誕生多少年`.
dataframeF = pd.DataFrame({"編程語言": ['Java', 'Python', 'C++'],
                           "已誕生多少年": [23, 20, 28]}, columns=['編程語言', '已誕生多少年'])

# AND: languages older than 25 years that are also named C++.
res = dataframeF.loc[(dataframeF[u'已誕生多少年'] > 25) & (dataframeF[u'編程語言'] == 'C++'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#   編程語言  已誕生多少年
# 2  C++      28

# OR: languages older than 23 years, or named Java.
res = dataframeF.loc[(dataframeF[u'已誕生多少年'] > 23) | (dataframeF[u'編程語言'] == 'Java'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#    編程語言  已誕生多少年
# 0  Java      23
# 2   C++      28

# NOT: every language except Java.
res = dataframeF.loc[(dataframeF[u'編程語言'] != 'Java'), [u'編程語言', u'已誕生多少年']]

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 1  Python      20
# 2     C++      28
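
The NOT case above is written with `!=`. A more general form, sketched here, negates any boolean mask with the `~` operator, which also works on combined conditions. The variable names are chosen for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"編程語言": ["Java", "Python", "C++"],
                      "已誕生多少年": [23, 20, 28]})

# Build each condition once as a boolean Series (a mask).
is_java = frame["編程語言"] == "Java"
is_old = frame["已誕生多少年"] > 25

# ~ negates a mask element-wise: every row that is not Java.
not_java = frame.loc[~is_java]

# Masks combine with & and |, and the combined mask can be negated as a whole.
neither = frame.loc[~(is_java | is_old)]

print(not_java)
print(neither)
```

When comparisons are written inline, each one needs its own parentheses, because `&`, `|`, and `~` bind more tightly than `==` and `>`.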

7. Aggregating data

Aggregation typically uses groupby to group rows by one or more columns and then count to count the rows in each group.

res = dataframeF.groupby(u'編程語言').count()

# Printing the result of the operation.
print(res)

#         已誕生多少年
# 編程語言
# C++          1
# Java         1
# Python       1

res = dataframeF.groupby(u'編程語言')[u'已誕生多少年'].count()

# Printing the result of the operation.
print(res)

# 編程語言
# C++       1
# Java      1
# Python    1
# Name: 已誕生多少年, dtype: int64

res = dataframeF.groupby([u'編程語言',u'已誕生多少年'])[u'已誕生多少年'].count()

# Printing the result of the operation.
print(res)

# 編程語言    已誕生多少年
# C++     28        1
# Java    23        1
# Python  20        1
# Name: 已誕生多少年, dtype: int64
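
Every group above contains a single row, so each count is 1. The sketch below uses data with repeated keys and computes several aggregates in one pass with `agg`. The `typing` column and the extra Ruby row are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a repeated category so each group has two rows.
frame = pd.DataFrame({
    "typing": ["static", "dynamic", "static", "dynamic"],
    "編程語言": ["Java", "Python", "C++", "Ruby"],
    "已誕生多少年": [23, 20, 28, 26],
})

# agg accepts a list of aggregation names and returns one column per aggregate.
summary = frame.groupby("typing")["已誕生多少年"].agg(["count", "mean", "max"])

print(summary)
```

The result has one row per group: the static group has a mean of 25.5 and a maximum of 28, and the dynamic group has a mean of 23.0 and a maximum of 26.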

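8. Data statistics

The statistics here follow the usual mathematical approach: draw a sample from the data, then compute measures such as the standard deviation and covariance.

# Draw two rows at random from the DataFrame, without replacement.
res = dataframeF.sample(n=2, replace=False)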

# Printing the result of the operation.
print(res)

#      編程語言  已誕生多少年
# 0    Java      23
# 1  Python      20

Each run draws two rows at random from the DataFrame, so the output above is only one possible result.

To sample with replacement, set the replace parameter to True.
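
When a repeatable sample is needed, for example in a test, `sample` accepts a `random_state` seed. A minimal sketch; the seed value 0 is arbitrary.

```python
import pandas as pd

frame = pd.DataFrame({"編程語言": ["Java", "Python", "C++"],
                      "已誕生多少年": [23, 20, 28]})

# The same seed returns the same rows on every call.
first = frame.sample(n=2, replace=False, random_state=0)
second = frame.sample(n=2, replace=False, random_state=0)

# With replacement, n may exceed the number of rows and rows may repeat.
bootstrap = frame.sample(n=5, replace=True, random_state=0)

print(first.equals(second))
print(len(bootstrap))
```

Without replacement, asking for more rows than the frame contains raises an error; with `replace=True` it does not.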

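# Compute the covariance of the numeric columns.
# numeric_only (pandas >= 1.5) skips the string column, which would otherwise raise an error on pandas 2.0+.
res = dataframeF.cov(numeric_only=True)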

# Printing the result of the operation.
print(res)

#            已誕生多少年
# 已誕生多少年  16.333333

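# Compute the pairwise correlation of the numeric columns.
# numeric_only (pandas >= 1.5) skips the string column, which would otherwise raise an error on pandas 2.0+.
res = dataframeF.corr(numeric_only=True)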

# Printing the result of the operation.
print(res)

#         已誕生多少年
# 已誕生多少年     1.0
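
With a single numeric column, the covariance and correlation matrices are 1x1 and say little. The sketch below adds a second numeric column so the off-diagonal entries mean something, and also computes the standard deviation mentioned above. The `scaled` column is made up for illustration, and `select_dtypes` is used here as one way to keep only numeric columns.

```python
import pandas as pd

frame = pd.DataFrame({
    "編程語言": ["Java", "Python", "C++"],
    "已誕生多少年": [23, 20, 28],
})
# Hypothetical second numeric column, an exact linear function of the first.
frame["scaled"] = frame["已誕生多少年"] * 2 + 1

# Keep only numeric columns so the string column is not passed to cov/corr.
numeric = frame.select_dtypes(include="number")

std = numeric["已誕生多少年"].std()  # sample standard deviation (ddof=1)
cov = numeric.cov()                  # pairwise sample covariance
corr = numeric.corr()                # pairwise Pearson correlation

print(std)
print(cov)
print(corr)
```

Because `scaled` is a linear function of `已誕生多少年` with a positive slope, their correlation is 1, and their covariance is twice the variance of `已誕生多少年`.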

That covers the whole data-processing lifecycle in Python, along with the common operations at each stage.

Thank you for reading. Python集中營 will keep working to produce better content.
