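In the previous post, where I selected variables and evaluated classification models with LOOCV (link: https://hermitlin.netlify.com/post/2020/02/14/cre-features-selection/), the highest accuracy came from AdaBoost, using roughly the top 60 to 70 variables ranked by random forest importance. Several of those models had an accuracy of 0.989473684, which means exactly one sample was misclassified. That made me wonder whether the misclassified sample is the same one in all of these models. In this post I first pull out the prediction results and then tabulate the indices of the misclassified samples, to see which samples are most often predicted wrongly.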
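As a quick sanity check on that figure, here is a small sketch of the arithmetic. It assumes the data set has 95 samples, which is inferred from the 46 + 49 split used in the code later in this post rather than stated explicitly.

```python
# Assumption: 95 samples in total, inferred from the 46 + 49 split used later in this post.
n_samples = 95
accuracy = 0.989473684

# With LOOCV each sample is predicted exactly once, so accuracy = n_correct / n_samples.
n_wrong = round(n_samples * (1 - accuracy))
n_correct = n_samples - n_wrong

print(n_correct, n_wrong)
```

Under that assumption the accuracy corresponds to 94 correct predictions and a single miss.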

Part 1. Import the data and R's random forest importance

First, read in the data and the importance results computed earlier in R:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/User/OneDrive - student.nsysu.edu.tw/Educations/NSYSU/fu_chung/bacterial/123.csv')
impor = pd.read_csv('C:/Users/User/OneDrive - student.nsysu.edu.tw/Educations/NSYSU/fu_chung/bacterial - PCA/A.csv')
impo = np.array(impor['names'])
impo
array(['V994', 'V1428', 'V1426', ..., 'V1469', 'V1470', 'V1471'],
      dtype=object)

Part 2. Building the Classifier

Define the AdaBoost function used in this post, with added code that finds the indices of the misclassified samples:

※ Testing the index-finding code

bbc = [1,2,3,4,5,6,7,9,8,1]
bbc.index(1)
[i for i,v in enumerate(bbc) if v==1]
[0, 9]
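The reason for the list comprehension rather than `list.index` is that `index` stops at the first match, while the comprehension returns every match. A minimal sketch on toy LOOCV scores; the helper name `all_indices` and the scores are made up for illustration:

```python
def all_indices(values, target):
    """Return every position at which `target` occurs in `values`."""
    return [i for i, v in enumerate(values) if v == target]

# Toy LOOCV scores: 1.0 means the held-out sample was predicted correctly, 0.0 means it was not.
scores = [1.0, 0.0, 1.0, 1.0, 0.0]

first_wrong = scores.index(0.0)       # only the first miss
all_wrong = all_indices(scores, 0.0)  # every miss
print(first_wrong, all_wrong)
```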
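#ADABOOST
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier

def adaloocv(ldf):
    # Leave-one-out CV; the label column "CRE" must be the last column of ldf.
    ldf = ldf.reset_index(drop=True)
    cv = []
    for i in range(len(ldf)):
        # Hold out row i as the test set and train on the remaining rows.
        dtrain = ldf.drop([i])
        dtest = ldf.iloc[i:i+1,:]
        train_X = dtrain.iloc[:,0:ldf.shape[1]-1]
        test_X = dtest.iloc[:,0:ldf.shape[1]-1]
        train_y = dtrain["CRE"]
        test_y = dtest["CRE"]
        clf = AdaBoostClassifier(n_estimators=100)
        clf.fit(train_X, train_y)
        test_y_predicted = clf.predict(test_X)
        # With a single test row the accuracy is 1.0 (correct) or 0.0 (wrong).
        accuracy_rf = metrics.accuracy_score(test_y, test_y_predicted)
        cv += [accuracy_rf]
    loocv = np.mean(cv)
    # Indices of the samples that were predicted wrongly.
    av = [i for i,v in enumerate(cv) if v==0]
    
    # Returns: model name, number correct, LOOCV accuracy, number correct in rows 0-45 and
    # in rows 46-94, the accuracy of each of those two groups, and the wrong indices.
    return "adaboost",sum(cv),loocv,sum(cv[0:46]),sum(cv[46:95]),sum(cv[0:46])/46,sum(cv[46:95])/49,av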
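The function above hard-codes the class boundaries: rows 0 to 45 are treated as one group and rows 46 to 94 as the other. If the rows were ever reordered, those slices would quietly report the wrong per-group numbers. A minimal sketch of computing the same per-class accuracy from the labels themselves follows; `per_class_accuracy` and the toy data are hypothetical and not part of the original analysis.

```python
from collections import defaultdict

def per_class_accuracy(labels, scores):
    """Return {label: fraction correct} for per-sample LOOCV scores (1.0 correct, 0.0 wrong)."""
    correct = defaultdict(float)
    total = defaultdict(int)
    for label, score in zip(labels, scores):
        correct[label] += score
        total[label] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy data, not the real CRE data set: labels and scores are made up for illustration.
labels = [0, 0, 0, 1, 1]
scores = [1.0, 0.0, 1.0, 1.0, 1.0]
result = per_class_accuracy(labels, scores)
print(result)
```

In `adaloocv` this could be called with `ldf["CRE"]` and `cv`, assuming the labels are hashable values such as 0 and 1.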

Part 3. Processing

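This time the number of variables ranges from the top 51 to the top 150 by importance (the loop below runs 100 times, with 51+i variables for i from 0 to 99). Run the models, then tabulate and plot the results: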

#ADABOOST
import sys


lada = []
for i in range (100):  
    # Top 51+i variables by importance; .copy() avoids pandas' SettingWithCopyWarning
    # when the CRE label column is added below.
    ldf = df.loc[:,impo[0:51+i]].copy()
    ldf['CRE'] = df['CRE']
    lada += [adaloocv(ldf)]
    # Simple text progress bar.
    sys.stdout.write('\r')
    sys.stdout.write("[%-50s] %d%%" % ('='*i, (100/(100-1))*i))
    sys.stdout.flush()
[================================================] 100%
data = pd.DataFrame(lada)
data
0 1 2 3 4 5 6 7
0 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
1 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
2 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
3 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
4 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
5 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
6 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
7 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
8 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 38]
9 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 38]
10 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [11, 12, 33, 35]
11 adaboost 91.0 0.957895 43.0 48.0 0.934783 0.979592 [12, 33, 35, 82]
12 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
13 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
14 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
15 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 33, 35]
16 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
17 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
18 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
19 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
20 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [33, 35]
21 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [33, 35]
22 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
23 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [12, 35, 41]
24 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
25 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [35, 41, 82]
26 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [33, 35, 41, 43]
27 adaboost 91.0 0.957895 43.0 48.0 0.934783 0.979592 [35, 41, 43, 59]
28 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
29 adaboost 91.0 0.957895 42.0 49.0 0.913043 1.000000 [12, 33, 35, 41]
.. ... ... ... ... ... ... ... ...
70 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
71 adaboost 94.0 0.989474 45.0 49.0 0.978261 1.000000 [35]
72 adaboost 92.0 0.968421 45.0 47.0 0.978261 0.959184 [35, 58, 59]
73 adaboost 92.0 0.968421 45.0 47.0 0.978261 0.959184 [35, 58, 59]
74 adaboost 92.0 0.968421 45.0 47.0 0.978261 0.959184 [35, 58, 59]
75 adaboost 92.0 0.968421 45.0 47.0 0.978261 0.959184 [35, 58, 59]
76 adaboost 90.0 0.947368 43.0 47.0 0.934783 0.959184 [35, 41, 43, 58, 59]
77 adaboost 90.0 0.947368 43.0 47.0 0.934783 0.959184 [35, 41, 43, 58, 59]
78 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [35, 43, 58]
79 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [35, 43, 59]
80 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [35, 43, 59]
81 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [35, 43, 59]
82 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
83 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
84 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
85 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
86 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
87 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
88 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
89 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
90 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
91 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
92 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [35, 43]
93 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 43]
94 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 43]
95 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 43]
96 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 43]
97 adaboost 92.0 0.968421 43.0 49.0 0.934783 1.000000 [33, 35, 43]
98 adaboost 93.0 0.978947 44.0 49.0 0.956522 1.000000 [33, 35]
99 adaboost 92.0 0.968421 44.0 48.0 0.956522 0.979592 [33, 35, 84]

100 rows × 8 columns
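The numeric column headers come from the plain tuple returned by `adaloocv`. Reading them off the return statement: column 0 is the model name, 1 the number of correct predictions, 2 the LOOCV accuracy, 3 and 4 the number correct in rows 0 to 45 and rows 46 to 94, 5 and 6 the accuracy of those two groups, and 7 the list of misclassified indices. One way to make that self-describing is a `NamedTuple`; the class and field names below are my own illustration, not part of the original code.

```python
from typing import List, NamedTuple

class LoocvResult(NamedTuple):
    model: str
    n_correct: float
    accuracy: float
    first46_correct: float
    last49_correct: float
    first46_accuracy: float
    last49_accuracy: float
    wrong_index: List[int]

# Values copied from row 16 of the table above.
row = LoocvResult("adaboost", 94.0, 0.989474, 45.0, 49.0, 0.978261, 1.0, [35])
print(row.accuracy, row.wrong_index)
```

As far as I know, pandas uses the field names as column names when a DataFrame is built from a list of named tuples; in that case the later positional lookup `c[7]` would need to use the column name instead.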

Collect the wrong-index lists from every model and concatenate them into a single list:

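# Column 7 holds the list of misclassified indices for each of the 100 models.
c = data.iloc[:,7:8]
c1 = []
for i in range(100):    
    # Append this model's wrong indices to the running list.
    c1 = c1+c[7][i]
c1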

[12, 33, 35, 41, 12, 33, 35, 41, 12, 33, 35, 41, 12, 33, 35, 12, 33, 35, 12, 33, 35, 41, 12, 33, 35, 41, 12, 33, 35, 33, 35, 38, 12, 33, 35, 38, 11, 12, 33, 35, 12, 33, 35, 82, 12, 33, 35, 12, 33, 35, 12, 33, 35, 12, 33, 35, 35, 35, 35, 35, 33, 35, 33, 35, 35, 12, 35, 41, 35, 35, 41, 82, 33, 35, 41, 43, 35, 41, 43, 59, 12, 33, 35, 41, 12, 33, 35, 41, 12, 33, 35, 41, 43, 33, 35, 43, 12, 33, 35, 43, 12, 33, 35, 43, 12, 33, 35, 43, 33, 35, 43, 33, 35, 43, 35, 33, 35, 41, 33, 35, 41, 33, 35, 41, 33, 35, 41, 33, 35, 41, 35, 41, 33, 35, 41, 35, 41, 35, 41, 35, 35, 41, 35, 41, 35, 35, 41, 35, 41, 35, 41, 58, 82, 85, 58, 82, 85, 58, 82, 85, 58, 82, 85, 35, 41, 82, 35, 41, 35, 41, 35, 41, 48, 58, 85, 35, 41, 48, 58, 85, 35, 41, 48, 58, 85, 35, 41, 48, 58, 11, 35, 41, 58, 11, 35, 41, 58, 11, 35, 33, 35, 41, 33, 35, 41, 35, 35, 35, 58, 59, 35, 58, 59, 35, 58, 59, 35, 58, 59, 35, 41, 43, 58, 59, 35, 41, 43, 58, 59, 35, 43, 58, 35, 43, 59, 35, 43, 59, 35, 43, 59, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 35, 43, 33, 35, 43, 33, 35, 43, 33, 35, 43, 33, 35, 43, 33, 35, 43, 33, 35, 33, 35, 84]
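Repeated `c1 = c1 + ...` builds a new list on every pass, which is fine for 100 short lists. An alternative sketch uses `itertools.chain.from_iterable` from the standard library; the variable names here are illustrative, and the input is just the first four rows of column 7 from the table above.

```python
from itertools import chain

# First four rows of column 7 from the table above.
wrong_per_model = [[12, 33, 35, 41], [12, 33, 35, 41], [12, 33, 35, 41], [12, 33, 35]]

# Flatten the list of lists into one list, preserving order.
flat = list(chain.from_iterable(wrong_per_model))
print(flat)
```

With the real results this would presumably be `list(chain.from_iterable(data[7]))`.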

Count how often each index occurs, using a dict:

values = c1
value_cnt = {}  # store the counts in a dict: sample index -> number of misclassifications
for value in values:
    # dict.get(value, 0) returns the count stored under key `value`, or 0 if the key is not present yet.
    value_cnt[value] = value_cnt.get(value, 0) + 1

# Print the result
print(value_cnt)
#print([key for key in value_cnt.keys()])
#print([value for value in value_cnt.values()])

{12: 22, 33: 43, 35: 96, 41: 39, 38: 2, 11: 4, 82: 7, 43: 31, 59: 10, 58: 17, 85: 7, 48: 4, 84: 1}

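Across the 100 models built with the top 51 to the top 150 variables, the number of times each sample was misclassified is:
Sample 35: 96 times
Sample 33: 43 times
Sample 41: 39 times
Sample 43: 31 times
Sample 12: 22 times
Sample 58: 17 times
Sample 59: 10 times
Sample 82: 7 times
Sample 85: 7 times
Sample 11: 4 times
Sample 48: 4 times
Sample 38: 2 times
Sample 84: 1 time
Sample 35 was misclassified in 96 of the 100 models, and in every single-error row shown in the table above the misclassified sample is 35.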
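The ranking above can also be produced directly with `collections.Counter` from the standard library. This is a sketch rather than the code used for this post; the counts are copied from the dict printed above, and `n_models = 100` is taken from the loop length.

```python
from collections import Counter

# Counts copied from the dict printed above.
value_cnt = {12: 22, 33: 43, 35: 96, 41: 39, 38: 2, 11: 4, 82: 7,
             43: 31, 59: 10, 58: 17, 85: 7, 48: 4, 84: 1}

n_models = 100  # one model per feature-set size in the loop above

# most_common() sorts by count, highest first.
ranked = Counter(value_cnt).most_common()
for sample, count in ranked:
    print(f"sample {sample}: {count} times ({count / n_models:.0%} of models)")
```

`Counter(c1)` would give the same counts straight from the flattened list, without the manual loop.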

Plot the counts stored in the dict:

import matplotlib.pyplot as plt
plt.bar(range(len(value_cnt)), list(value_cnt.values()), align='center')
plt.xticks(range(len(value_cnt)), list(value_cnt.keys()))
# # for python 2.x:
# plt.bar(range(len(D)), D.values(), align='center')  # python 2.x
# plt.xticks(range(len(D)), D.keys())  # in python 2.x

plt.show()