This walkthrough uses the ID3 algorithm.
(Information entropy: $H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$)
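As a quick check of the entropy formula, here is a minimal sketch that evaluates it for a label column with 9 "yes" and 5 "no" entries, the class counts of the 14-row table used later in this post. The function name `entropy` is only an illustrative choice.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# 9 "yes" and 5 "no", matching the class counts of the table in this post
labels = ["yes"] * 9 + ["no"] * 5
h = entropy(labels)
print(round(h, 3))  # prints 0.94
```

A column that contains a single class has entropy 0, and an evenly split two-class column has entropy 1 bit, so 0.94 means the labels are fairly mixed.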
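Download a decision tree visualization tool: Graphviz
(Remember to add its bin directory to the Path environment variable: C:\Program Files (x86)\Graphviz2.38\bin)
Code:
Import the required libraries: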
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
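Read the table:
The table holds a few attributes for each customer that determine whether the customer buys the computer.
Read the table and do some simple preprocessing: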
# Column 0 is skipped and the last column is used as the class label.
with open(r'D:\demo.csv', 'rt', newline='') as allElectronicsData:
    reader = csv.reader(allElectronicsData)
    headers = next(reader)
    featureList = []
    labelList = []
    for row in reader:
        labelList.append(row[-1])
        rowDict = {}
        for i in range(1, len(row) - 1):
            rowDict[headers[i]] = row[i]
        featureList.append(rowDict)
print(featureList)
Here is the result:
[
{'age': 'youth', 'student': 'no', 'income': 'high', 'credit_rating': 'fair'},
{'age': 'youth', 'student': 'no', 'income': 'high', 'credit_rating': 'excellent'},
{'age': 'middle_aged', 'student': 'no', 'income': 'high', 'credit_rating': 'fair'},
{'age': 'senior', 'student': 'no', 'income': 'medium', 'credit_rating': 'fair'},
{'age': 'senior', 'student': 'yes', 'income': 'low', 'credit_rating': 'fair'},
{'age': 'senior', 'student': 'yes', 'income': 'low', 'credit_rating': 'excellent'},
{'age': 'middle_aged', 'student': 'yes', 'income': 'low', 'credit_rating': 'excellent'},
{'age': 'youth', 'student': 'no', 'income': 'medium', 'credit_rating': 'fair'},
{'age': 'youth', 'student': 'yes', 'income': 'low', 'credit_rating': 'fair'},
{'age': 'senior', 'student': 'yes', 'income': 'medium', 'credit_rating': 'fair'},
{'age': 'youth', 'student': 'yes', 'income': 'medium', 'credit_rating': 'excellent'},
{'age': 'middle_aged', 'student': 'no', 'income': 'medium', 'credit_rating': 'excellent'},
{'age': 'middle_aged', 'student': 'yes', 'income': 'high', 'credit_rating': 'fair'},
{'age': 'senior', 'student': 'no', 'income': 'medium', 'credit_rating': 'excellent'}
]
The rows were parsed correctly.
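To try the parsing step without a file on disk, the following sketch feeds the same logic from an in-memory string. The column names (`RID`, `class_buys_computer`) and the layout of `SAMPLE_CSV` are assumptions made for illustration; only the attribute values of the first three rows are taken from the output above. The helper name `load_rows` is hypothetical as well.

```python
import csv
import io

# Hypothetical stand-in for demo.csv: the header names are assumed,
# the three data rows mirror the first three rows shown in this post.
SAMPLE_CSV = """RID,age,income,student,credit_rating,class_buys_computer
1,youth,high,no,fair,no
2,youth,high,no,excellent,no
3,middle_aged,high,no,fair,yes
"""


def load_rows(text_file):
    """Split each CSV row into an attribute dict and a class label."""
    reader = csv.reader(text_file)
    headers = next(reader)
    features, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        labels.append(row[-1])  # last column is the label
        # columns 1..n-2 are attributes; column 0 (the row ID) is dropped
        features.append(dict(zip(headers[1:-1], row[1:-1])))
    return features, labels


features, labels = load_rows(io.StringIO(SAMPLE_CSV))
print(features[0])
print(labels)
```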
Call sklearn functions to process the data further:
# One-hot encode the categorical attributes
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
# Encode the yes/no labels as 1/0
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
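Inspect the processed data:
print("dummyX: \n" + str(dummyX))
# On scikit-learn 1.2 and later, get_feature_names() has been removed; use get_feature_names_out() instead.
print(vec.get_feature_names())
print("labelList: " + str(labelList))
print("dummyY: \n" + str(dummyY))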
Result:
Note that the data has to be converted into a numeric matrix so the model can learn from it.
dummyX:
[[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
 [0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
 [0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
 [0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
 [1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
['age=middle_aged', 'age=senior', 'age=youth', 'credit_rating=excellent', 'credit_rating=fair', 'income=high', 'income=low', 'income=medium', 'student=no', 'student=yes']
labelList: ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
dummyY:
[[0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]]
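To see where the ten columns come from, here is a rough pure-Python sketch of one-hot encoding. It is not how DictVectorizer is implemented; it only imitates the visible behaviour (one "key=value" column per category, columns sorted by name). The function `one_hot_encode` is a hypothetical helper, and the three input rows are rows 1, 6 and 12 of the table, chosen because together they contain every category.

```python
def one_hot_encode(rows):
    """Encode dicts of categorical values as 0/1 vectors.

    One column per "key=value" pair, columns sorted alphabetically,
    which matches the feature names printed above.
    """
    names = sorted({f"{k}={v}" for row in rows for k, v in row.items()})
    index = {name: i for i, name in enumerate(names)}
    matrix = []
    for row in rows:
        vec = [0.0] * len(names)
        for k, v in row.items():
            vec[index[f"{k}={v}"]] = 1.0  # switch on the matching column
        matrix.append(vec)
    return names, matrix


rows = [
    {"age": "youth", "student": "no", "income": "high", "credit_rating": "fair"},
    {"age": "senior", "student": "yes", "income": "low", "credit_rating": "excellent"},
    {"age": "middle_aged", "student": "no", "income": "medium", "credit_rating": "excellent"},
]
names, matrix = one_hot_encode(rows)
print(names)
print(matrix[0])
```

The first encoded row has ones in the columns `age=youth`, `credit_rating=fair`, `income=high` and `student=no`, the same pattern as the first row of dummyX.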
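Fit a decision tree classifier to the training data using the ID3 splitting criterion (criterion='entropy' selects splits by information gain; scikit-learn's tree itself is an optimized CART variant):
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)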
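For intuition about what the entropy criterion measures, the sketch below computes the ID3 information gain of each original attribute by hand on the 14 rows of this post. The names `DATA`, `ATTRS` and `information_gain` are illustrative. Note the assumption built into the comparison: textbook ID3 splits on a whole multi-valued attribute, whereas the scikit-learn tree splits on individual one-hot columns, so the fitted tree need not have the same shape.

```python
from collections import Counter, defaultdict
from math import log2

ATTRS = ("age", "income", "student", "credit_rating")
# The 14 training rows: (age, income, student, credit_rating, label)
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(data, col):
    """Entropy of the labels minus the weighted entropy after splitting on column col."""
    groups = defaultdict(list)
    for row in data:
        groups[row[col]].append(row[-1])
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in data]) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best attribute:", best)
```

On this data `age` has the largest gain (about 0.247), so ID3 would place it at the root.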
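The tree can be drawn with the visualization tool downloaded earlier. First export it to a .dot file:
with open(r"D:\demo.dot", 'w') as f:
    # export_graphviz writes to f and returns None, so its result is not assigned
    tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)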
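Then open cmd and convert the .dot file to a PDF with Graphviz, for example: dot -Tpdf D:\demo.dot -o D:\demo.pdf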
The drawing is saved as a PDF file, which can then be opened to view the tree.
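With the model built, we can make a prediction.
Take the first row, modify it slightly, and predict whether this customer buys the computer:
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
# Copy first: a plain assignment would alias dummyX[0] and overwrite the training row
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))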
Result:
oneRowX: [0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
newRowX: [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
predictedY: [1]
Conclusion: the model predicts 1 (yes), so this customer will buy the computer.
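To double-check which customer profile the modified vector describes, the following sketch maps a one-hot row back to attribute values using the feature names printed earlier. The helper `decode_row` is hypothetical and not part of scikit-learn.

```python
# Feature names in the column order printed earlier in this post
FEATURE_NAMES = [
    "age=middle_aged", "age=senior", "age=youth",
    "credit_rating=excellent", "credit_rating=fair",
    "income=high", "income=low", "income=medium",
    "student=no", "student=yes",
]


def decode_row(vector, feature_names):
    """Turn a one-hot vector back into an {attribute: category} dict."""
    profile = {}
    for value, name in zip(vector, feature_names):
        if value == 1:
            attribute, category = name.split("=", 1)
            profile[attribute] = category
    return profile


new_row = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
profile = decode_row(new_row, FEATURE_NAMES)
print(profile)
```

The modified row changes the age from youth to middle_aged and leaves the other attributes alone, which is the same profile as the third training row (labelled "yes"), so a prediction of 1 is what one would expect.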