Intro: GML

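This post tries out an automated machine learning package, Ghalat Machine Learning (GML),
which targets automated learning for regression and classification problems.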

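The package currently offers the following features:
1. Automatic feature engineering
2. Automatic selection of machine learning and neural network models
3. Automatic hyperparameter tuning
4. Ranking of model performance (by cross-validation score)
5. Recommendation of the best model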

I will use the UCI breast cancer dataset (the copy bundled with sklearn) to test how well the package handles classification and what it is like to use.

Package author's GitHub: https://github.com/Muhammad4hmed/Ghalat-Machine-Learning
PyPI package page: https://s0pypi0org.icopy.site/project/GML/2.0.2/

Colab Setting

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Import and Install Modules

# install some modules
!pip install GML
!pip install category_encoders
Collecting GML
  Downloading https://files.pythonhosted.org/packages/b0/91/3580e3e1f4151fed64cf37840bae994ea7d1409a58a527b8fd010c31c909/GML-2.0.4-py3-none-any.whl
Collecting catboost
  Downloading https://files.pythonhosted.org/packages/90/86/c3dcb600b4f9e7584ed90ea9d30a717fb5c0111574675f442c3e7bc19535/catboost-0.24.1-cp36-none-manylinux1_x86_64.whl (66.1MB)
     |████████████████████████████████| 66.1MB 56kB/s
Requirement already satisfied: lightgbm in /usr/local/lib/python3.6/dist-packages (from GML) (2.2.3)
Requirement already satisfied: xgboost in /usr/local/lib/python3.6/dist-packages (from GML) (0.90)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from GML) (0.22.2.post1)
Collecting autofeat
  Downloading https://files.pythonhosted.org/packages/3a/63/c5aa2e38f50c9dedb1cf1bf6e0b5ab520e2c8747627ca7318a827b618d10/autofeat-1.1.3-py3-none-any.whl
Requirement already satisfied: Keras in /usr/local/lib/python3.6/dist-packages (from GML) (2.4.3)
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (1.0.5)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (1.15.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (0.10.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (1.18.5)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (3.2.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost->GML) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->GML) (0.16.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.6/dist-packages (from autofeat->GML) (1.1.1)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from autofeat->GML) (0.16.0)
Collecting pint
  Downloading https://files.pythonhosted.org/packages/7f/72/4ea7d219a2d6624fd22c3d8fd5eea183af4f5ece03e3a4726c1c864bb213/Pint-0.15-py2.py3-none-any.whl (200kB)
     |████████████████████████████████| 204kB 41.6MB/s
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from Keras->GML) (2.10.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.6/dist-packages (from Keras->GML) (3.13)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost->GML) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost->GML) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost->GML) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost->GML) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost->GML) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost->GML) (2.4.7)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.6/dist-packages (from sympy->autofeat->GML) (1.1.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from pint->autofeat->GML) (20.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from pint->autofeat->GML) (49.6.0)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from pint->autofeat->GML) (1.7.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->pint->autofeat->GML) (3.1.0)
Installing collected packages: catboost, pint, autofeat, GML
Successfully installed GML-2.0.4 autofeat-1.1.3 catboost-0.24.1 pint-0.15
Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/44/57/fcef41c248701ee62e8325026b90c432adea35555cbc870aff9cfba23727/category_encoders-2.2.2-py2.py3-none-any.whl (80kB)
     |████████████████████████████████| 81kB 2.4MB/s
Requirement already satisfied: pandas>=0.21.1 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (1.0.5)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (1.18.5)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.22.2.post1)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.5.1)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (0.10.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from category_encoders) (1.4.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.21.1->category_encoders) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.21.1->category_encoders) (2.8.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.20.0->category_encoders) (0.16.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from patsy>=0.5.1->category_encoders) (1.15.0)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import datasets
from GML.Ghalat_Machine_Learning import Ghalat_Machine_Learning
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Data Analysis

Import Data

We experiment with the breast cancer data that ships with sklearn. I use 30% of the data for training and hold out the remaining 70% as the test set.

breast_cancer = datasets.load_breast_cancer()
df_x = pd.DataFrame(breast_cancer.data)
df_y = pd.DataFrame(breast_cancer.target)
df_x.columns = breast_cancer.feature_names
df_y.columns = np.array(['label'])
df = pd.concat([df_x,df_y],axis=1)
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.7, random_state=777)
print(X_train.shape)
print(X_test.shape)
(170, 30)
(399, 30)
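
With only 170 training rows, it can be worth checking that both parts keep a similar class ratio. The following is a minimal sketch, not part of the original notebook: it repeats the same 30/70 split but adds `stratify=y`, which is my own addition, and prints the benign share in each part.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load features and labels as arrays (569 rows, 30 features).
X, y = datasets.load_breast_cancer(return_X_y=True)

# Same 30/70 split as above, but stratified on the label so both parts
# keep roughly the same malignant/benign ratio (stratify is an addition).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=777, stratify=y
)

# Label 1 means benign, so the mean of the labels is the benign share.
train_benign_share = y_train.mean()
test_benign_share = y_test.mean()
print(X_train.shape, X_test.shape)
print(round(train_benign_share, 3), round(test_benign_share, 3))
```

The split sizes stay at 170 and 399 rows; only the way rows are assigned changes.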

Data Visualization

Label 0 marks malignant tumors and label 1 marks benign tumors. The counts are 212 malignant to 357 benign.

%matplotlib inline
# Count the samples per label and draw them as a bar chart
by_label = df_y.groupby('label')
by_label.size().plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f000c2abb00>
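
The class counts can also be read off without a plot. This small sketch (an addition, using only `np.bincount` and the dataset's own `target_names`) prints the number of samples per class.

```python
import numpy as np
from sklearn import datasets

data = datasets.load_breast_cancer()

# bincount returns one count per integer label: index 0 and index 1.
counts = np.bincount(data.target)
for name, n in zip(data.target_names, counts):
    print(f"{name}: {n}")
```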

The box plot below shows that malignant tumors tend to have a larger mean radius than benign tumors.

fig,axes = plt.subplots()

df.boxplot(column='mean radius',by=['label'],ax=axes)
axes.set_title(' ')
axes.set_ylabel('mean radius')
plt.show()
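
To put numbers next to the box plot, the per-label mean and median of `mean radius` can be computed directly. This is a small sketch added for illustration; it rebuilds the DataFrame on its own so it runs independently of the cells above.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["label"] = data.target  # 0 = malignant, 1 = benign

# Mean and median of the mean radius within each label group.
radius_by_label = df.groupby("label")["mean radius"].agg(["mean", "median"])
print(radius_by_label)
```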

GML: Auto Feature Engineering

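This step performs feature engineering automatically: it fills missing values (in a simple way) and adds interaction terms as new features.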
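
As a rough picture of what median filling and an interaction term look like, here is a hand-written pandas sketch on a toy frame. It is an assumption about the general idea, not GML's actual implementation; in particular, taking the medians from the training rows only is my own choice, and the column names `a`, `b`, and `a/b` are made up.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps; the breast cancer data itself has no missing values.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 10.0], "b": [2.0, 4.0, np.nan, 6.0]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [1.0, np.nan]})

# Medians come from the training rows only, then fill both frames.
medians = train.median()
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)

# One hand-made interaction term, similar in spirit to generated features.
train_filled["a/b"] = train_filled["a"] / train_filled["b"]
test_filled["a/b"] = test_filled["a"] / test_filled["b"]
print(train_filled)
print(test_filled)
```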

gml = Ghalat_Machine_Learning()
# This step may take quite a long time
new_X, y, new_Test_X = gml.Auto_Feature_Engineering(
    X_train, y_train,
    type_of_task='Classification',
    test_data=X_test,
    splits=6,
    fill_na_='median',
    ratio_drop=0.2,
    generate_features=True,
    feateng_steps=2,
)
Welcome to Ghalat Machine Learning!

All models are set to train
         Have a tea and leave everything on us ;-)
************************************************************ 
Successfully dealt with missing data!

X:

      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
527       12.340         12.27  ...          0.3110                  0.07592
43        13.280         20.28  ...          0.3739                  0.10270
371       15.190         13.21  ...          0.2487                  0.06766
82        25.220         24.91  ...          0.2355                  0.10510
534       10.960         17.62  ...          0.2289                  0.08278
..           ...           ...  ...             ...                      ...
506       12.220         20.04  ...          0.2709                  0.08839
423       13.660         19.13  ...          0.2744                  0.08839
116        8.950         15.76  ...          0.1652                  0.07722
71         8.888         14.64  ...          0.2254                  0.10840
103        9.876         19.40  ...          0.2622                  0.08490

[170 rows x 30 columns] 
Test Data:

      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
416        9.405         21.70  ...          0.2872                  0.08304
522       11.260         19.83  ...          0.2557                  0.07613
503       23.090         19.83  ...          0.2908                  0.07277
111       12.630         20.76  ...          0.2226                  0.08486
132       16.160         21.54  ...          0.3480                  0.07619
..           ...           ...  ...             ...                      ...
105       13.110         15.56  ...          0.3147                  0.14050
530       11.750         17.56  ...          0.2478                  0.07757
507       11.060         17.12  ...          0.2780                  0.11680
11        15.780         17.89  ...          0.3792                  0.10480
319       12.430         17.00  ...          0.1901                  0.05932

[399 rows x 30 columns] 

 ************************************************************

 ************************************************************ 
Successfully encoded categorical data with Target Mean Encoding using Stratified KFolds technique!

 X:

      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
527       12.340         12.27  ...          0.3110                  0.07592
43        13.280         20.28  ...          0.3739                  0.10270
371       15.190         13.21  ...          0.2487                  0.06766
82        25.220         24.91  ...          0.2355                  0.10510
534       10.960         17.62  ...          0.2289                  0.08278
..           ...           ...  ...             ...                      ...
506       12.220         20.04  ...          0.2709                  0.08839
423       13.660         19.13  ...          0.2744                  0.08839
116        8.950         15.76  ...          0.1652                  0.07722
71         8.888         14.64  ...          0.2254                  0.10840
103        9.876         19.40  ...          0.2622                  0.08490

[170 rows x 30 columns] 

Test Data:

      mean radius  mean texture  ...  worst symmetry  worst fractal dimension
416        9.405         21.70  ...          0.2872                  0.08304
522       11.260         19.83  ...          0.2557                  0.07613
503       23.090         19.83  ...          0.2908                  0.07277
111       12.630         20.76  ...          0.2226                  0.08486
132       16.160         21.54  ...          0.3480                  0.07619
..           ...           ...  ...             ...                      ...
105       13.110         15.56  ...          0.3147                  0.14050
530       11.750         17.56  ...          0.2478                  0.07757
507       11.060         17.12  ...          0.2780                  0.11680
11        15.780         17.89  ...          0.3792                  0.10480
319       12.430         17.00  ...          0.1901                  0.05932

[399 rows x 30 columns] 

 ************************************************************

 ************************************************************ 
 Generating new features !
 ************************************************************
[AutoFeat] The 2 step feature engineering process could generate up to 22155 features.
[AutoFeat] With 170 data points this new feature matrix would use about 0.02 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 155 transformed features from 30 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 16575 feature combinations from 17020 original feature tuples - done.
[feateng] Generated altogether 17103 new features in 2 steps
[feateng] Removing correlated features, as well as additions at the highest level
[feateng] Generated a total of 7659 additional features
[featsel] Scaling data...done.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:  6.3min remaining:  4.2min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  7.9min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  7.9min finished
[featsel] 23 features after 5 feature selection runs
[featsel] 8 features after correlation filtering
[featsel] 6 features after noise filtering
[AutoFeat] Computing 6 new features.
[AutoFeat]     6/    6 new features ...done.
[AutoFeat] Final dataframe with 36 feature columns (6 new).
[AutoFeat] Training final classification model.
[AutoFeat] Trained model: largest coefficients:
[38.95073469]
8.246416 * meanconcavepoints/meancompactness
5.199850 * sqrt(worstsmoothness)*log(worstarea)
4.653543 * log(worstcompactness)/areaerror
2.491452 * log(worstradius)*log(worsttexture)
1.975475 * worstconcavepoints**2/meancompactness
0.115434 * log(radiuserror)/worstarea
[AutoFeat] Final score: 0.9765
[AutoFeat] Computing 6 new features.
[AutoFeat]     6/    6 new features ...done.

 ************************************************************ 
Successfully generated new features! and selected the best features

 X:

      mean radius  ...  meanconcavepoints/meancompactness
0         12.340  ...                           0.419692
1         13.280  ...                           0.428830
2         15.190  ...                           0.383184
3         25.220  ...                           0.692308
4         10.960  ...                           0.285890
..           ...  ...                                ...
165       12.220  ...                           0.188021
166       13.660  ...                           0.419529
167        8.950  ...                           0.185680
168        8.888  ...                           0.187590
169        9.876  ...                           0.312365

[170 rows x 36 columns] 

Test Data:

      mean radius  ...  meanconcavepoints/meancompactness
0          9.405  ...                           0.204092
1         11.260  ...                           0.128348
2         23.090  ...                           0.786667
3         12.630  ...                           0.498015
4         16.160  ...                           0.437150
..           ...  ...                                ...
394       13.110  ...                           0.543966
395       11.750  ...                           0.457119
396       11.060  ...                           0.398506
397       15.780  ...                           0.511300
398       12.430  ...                           0.491893

[399 rows x 36 columns] 

 ************************************************************

The gml.Auto_Feature_Engineering function automatically added 6 features to new_X and new_Test_X.

print(new_X.shape)
print(new_Test_X.shape)
(170, 36)
(399, 36)

A quick preview of which variables it added.

new_X.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension radius error texture error perimeter error area error smoothness error compactness error concavity error concave points error symmetry error fractal dimension error worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension sqrt(worstsmoothness)*log(worstarea) worstconcavepoints**2/meancompactness log(worstradius)*log(worsttexture) log(radiuserror)/worstarea log(worstcompactness)/areaerror meanconcavepoints/meancompactness
0 12.34 12.27 78.94 468.5 0.09003 0.06307 0.02958 0.02647 0.1689 0.05808 0.1166 0.4957 0.7714 8.955 0.003681 0.009169 0.008732 0.005740 0.01129 0.001366 13.61 19.27 87.22 564.9 0.1292 0.2074 0.1791 0.10700 0.3110 0.07592 2.277670 0.181528 7.724195 -0.003804 -0.175668 0.419692
1 13.28 20.28 87.32 545.2 0.10410 0.14360 0.09847 0.06158 0.1974 0.06782 0.3704 0.8249 2.4270 31.330 0.005072 0.021470 0.021850 0.009560 0.01719 0.003317 17.38 28.00 113.10 907.2 0.1530 0.3724 0.3664 0.14920 0.3739 0.10270 2.663888 0.155018 9.514511 -0.001095 -0.031528 0.428830
2 15.19 13.21 97.65 711.8 0.07963 0.06934 0.03393 0.02657 0.1721 0.05544 0.1783 0.4125 1.3380 17.720 0.005012 0.014850 0.015510 0.009155 0.01647 0.001767 16.20 15.73 104.50 819.1 0.1126 0.1737 0.1362 0.08178 0.2487 0.06766 2.251001 0.096452 7.674293 -0.002105 -0.098782 0.383184
3 25.22 24.91 171.50 1878.0 0.10630 0.26650 0.33390 0.18450 0.1829 0.06782 0.8973 1.4740 7.3820 120.000 0.008166 0.056930 0.057300 0.020300 0.01065 0.005893 30.00 33.62 211.70 2562.0 0.1573 0.6076 0.6476 0.28670 0.2355 0.10510 3.112816 0.308431 11.955621 -0.000042 -0.004152 0.692308
4 10.96 17.62 70.79 365.6 0.09687 0.09752 0.05263 0.02788 0.1619 0.06408 0.1507 1.5830 1.1650 10.090 0.009501 0.033780 0.044010 0.013460 0.01322 0.003534 11.62 26.51 76.43 407.5 0.1428 0.2510 0.2123 0.09861 0.2289 0.08278 2.271128 0.099712 8.038869 -0.004644 -0.136997 0.285890
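
The generated column names are formulas over the original columns, so they can be checked by hand. The sketch below recomputes three of them for the first row of the preview, using the raw values printed there and assuming `log` means the natural logarithm (the results agree with the preview to the printed precision).

```python
import math

# Raw values of the first training row, copied from the preview above.
mean_compactness = 0.06307
mean_concave_points = 0.02647
worst_radius = 13.61
worst_texture = 19.27
worst_area = 564.9
worst_smoothness = 0.1292

# meanconcavepoints/meancompactness
ratio = mean_concave_points / mean_compactness
# sqrt(worstsmoothness)*log(worstarea)
sqrt_log = math.sqrt(worst_smoothness) * math.log(worst_area)
# log(worstradius)*log(worsttexture)
log_log = math.log(worst_radius) * math.log(worst_texture)

print(round(ratio, 6), round(sqrt_log, 6), round(log_log, 6))
```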

GML: Auto Machine Learning (Classification)

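GML's Auto Machine Learning runs two rounds of competition. In round one, all models compete for the top 5 places based on 5-fold cross-validation accuracy. In round two, those top 5 models compete once more, and the model ranked first is recommended.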
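
The two-round idea can be imitated with plain sklearn. The following is only a sketch under my own assumptions, not GML's code: the candidate list, the helper `rank_by_cv`, and the choice of 3 finalists are all hypothetical, and it ranks purely by mean 5-fold accuracy.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_breast_cancer(return_X_y=True)

# Hypothetical candidate pool; scale-sensitive models get a scaler in front.
candidates = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
    "GaussianNB": GaussianNB(),
}


def rank_by_cv(models, X, y, folds=5):
    """Return a table of models sorted by mean cross-validation accuracy."""
    rows = [
        {"Model": name,
         "CV": cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()}
        for name, model in models.items()
    ]
    table = pd.DataFrame(rows).sort_values("CV", ascending=False)
    return table.reset_index(drop=True)


# Round one: every candidate competes. Round two: only the top 3 compete again.
round_one = rank_by_cv(candidates, X, y)
finalists = {name: candidates[name] for name in round_one["Model"].head(3)}
round_two = rank_by_cv(finalists, X, y)

print(round_one)
print("Recommended:", round_two.loc[0, "Model"])
```

Because the folds here are deterministic, round two reproduces the round-one scores for the finalists. GML's log additionally reports a separate validation accuracy, which this sketch leaves out.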

from sklearn.neural_network import MLPClassifier
best_model = gml.GMLClassifier(new_X,y,neural_net='Yes',epochs=100,models=[MLPClassifier()],verbose=False)
Model  LogisticRegressionCV  got validation accuracy of  0.9411764705882353
Model  LogisticRegression  got validation accuracy of  0.9607843137254902
Model  SVC  got validation accuracy of  0.9803921568627451
Model  DecisionTreeClassifier  got validation accuracy of  0.9411764705882353
Model  KNeighborsClassifier  got validation accuracy of  1.0
Model  SGDClassifier  got validation accuracy of  0.9411764705882353
Model  RandomForestClassifier  got validation accuracy of  0.9803921568627451
Model  AdaBoostClassifier  got validation accuracy of  0.9607843137254902
Model  ExtraTreesClassifier  got validation accuracy of  1.0
Model  XGBClassifier  got validation accuracy of  0.9607843137254902
Model  LGBMClassifier  got validation accuracy of  1.0
Model  CatBoostClassifier  got validation accuracy of  1.0
Model  GradientBoostingClassifier  got validation accuracy of  0.9607843137254902
Model  NaiveBayesGaussian  got validation accuracy of  0.9803921568627451
Model  MLPClassifier  got validation accuracy of  0.9607843137254902

 **************************************** 
Training Neural Network
 ****************************************
Neural Network got validation accuracy of  0.9607843137254902
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 256)               9472      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 128)               32896     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 130       
=================================================================
Total params: 50,754
Trainable params: 50,754
Non-trainable params: 0
_________________________________________________________________
None

 ************************************************************ 
Round One Results
 ************************************************************ 
                         Model  Val_Accuracy  CV on 5 folds
0                         SVC      0.980392       0.970588
0               MLPClassifier      0.960784       0.970588
0              LGBMClassifier      1.000000       0.964706
0      RandomForestClassifier      0.980392       0.964706
0        ExtraTreesClassifier      1.000000       0.964706
0          CatBoostClassifier      1.000000       0.964706
0              Neural Network      0.960784       0.960784
0        LogisticRegressionCV      0.941176       0.958824
0          LogisticRegression      0.960784       0.958824
0        KNeighborsClassifier      1.000000       0.958824
0               XGBClassifier      0.960784       0.952941
0  GradientBoostingClassifier      0.960784       0.947059
0          NaiveBayesGaussian      0.980392       0.935294
0               SGDClassifier      0.941176       0.929412
0          AdaBoostClassifier      0.960784       0.923529
0      DecisionTreeClassifier      0.941176       0.917647 
 ************************************************************
Model  SVC  got validation accuracy of  0.9607843137254902
Model  Sequential  got validation accuracy of  0.9607843137254902
Model  LGBMClassifier  got validation accuracy of  0.9215686274509803
Model  RandomForestClassifier  got validation accuracy of  0.9411764705882353
Model  ExtraTreesClassifier  got validation accuracy of  0.9803921568627451

 ************************************************************ 
Round Two Results
 ************************************************************ 
                     Model  Val_Accuracy  CV on 5 folds
0              Sequential      0.960784       0.970588
0                     SVC      0.960784       0.970588
0          LGBMClassifier      0.921569       0.964706
0  RandomForestClassifier      0.941176       0.964706
0    ExtraTreesClassifier      0.980392       0.964706 
 ************************************************************


 **************************************** 
Suggested Models for Stacking
 **************************************** 
 0        Sequential
0               SVC
0    LGBMClassifier
Name: Model, dtype: object
**************************************** 
 PLEASE NOTE: these results are calculated using  <function accuracy_score at 0x7f002cd11048>

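In the end GML suggests Sequential, SVC, and LGBMClassifier for building a stacking model.
Let's look at the hyperparameters of the returned best_model. The results table labels the top model Sequential, but the object that comes back is an sklearn MLPClassifier.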

best_model
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)
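
The stacking suggestion is not followed up in this notebook. One way it could be tried is sklearn's `StackingClassifier`, sketched below under several assumptions of mine: it runs on the raw 30 features rather than GML's engineered ones, uses `MLPClassifier` for the neural network, and swaps in `RandomForestClassifier` as a stand-in for LGBMClassifier to avoid the extra dependency.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=777
)

# Base learners; the random forest is a stand-in, not GML's suggestion.
base_models = [
    ("svc", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("mlp", make_pipeline(StandardScaler(),
                          MLPClassifier(max_iter=1000, random_state=0))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# A logistic regression combines the base learners' out-of-fold predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(stack_accuracy)
```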

GML Test

Train the model on the data with GML's added features, then predict on the test set (which also has GML's added features) and see how it performs.

clf = best_model.fit(new_X,y)

from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf.predict(new_Test_X))
0.9423558897243107

This accuracy is only so-so. Let's see how models built on the original training data, or with another method, compare.

# Train data after GML processing (imputation and added features)
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(max_depth=2)
rf_clf = rf_model.fit(new_X,y)
accuracy_score(y_test, rf_clf.predict(new_Test_X))
0.9699248120300752
# Original train data
rf_model = RandomForestClassifier(max_depth=2)
rf_clf = rf_model.fit(X_train,y)
accuracy_score(y_test, rf_clf.predict(X_test))
0.9448621553884712
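
The two random forests above are single runs without a fixed seed, so part of the gap may be noise. The sketch below is one way to make the comparison steadier, under my own assumptions: it rebuilds the six generated columns by hand from the formulas printed in the AutoFeat log (natural log assumed) instead of calling GML, and averages accuracy over 5 seeds. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
raw = pd.DataFrame(data.data, columns=data.feature_names)
labels = pd.Series(data.target, name="label")


def add_autofeat_like_columns(frame):
    """Append the six formulas printed in the AutoFeat log (natural log)."""
    out = frame.copy()
    out["sqrt(worstsmoothness)*log(worstarea)"] = (
        np.sqrt(frame["worst smoothness"]) * np.log(frame["worst area"]))
    out["worstconcavepoints**2/meancompactness"] = (
        frame["worst concave points"] ** 2 / frame["mean compactness"])
    out["log(worstradius)*log(worsttexture)"] = (
        np.log(frame["worst radius"]) * np.log(frame["worst texture"]))
    out["log(radiuserror)/worstarea"] = (
        np.log(frame["radius error"]) / frame["worst area"])
    out["log(worstcompactness)/areaerror"] = (
        np.log(frame["worst compactness"]) / frame["area error"])
    out["meanconcavepoints/meancompactness"] = (
        frame["mean concave points"] / frame["mean compactness"])
    return out


engineered = add_autofeat_like_columns(raw)

# Split row labels once so both feature sets use identical train/test rows.
idx_train, idx_test = train_test_split(raw.index, test_size=0.7, random_state=777)


def mean_rf_accuracy(features, seeds=range(5)):
    """Average test accuracy of a shallow random forest over several seeds."""
    scores = []
    for seed in seeds:
        model = RandomForestClassifier(max_depth=2, random_state=seed)
        model.fit(features.loc[idx_train], labels.loc[idx_train])
        predictions = model.predict(features.loc[idx_test])
        scores.append(accuracy_score(labels.loc[idx_test], predictions))
    return float(np.mean(scores))


acc_raw = mean_rf_accuracy(raw)
acc_engineered = mean_rf_accuracy(engineered)
print(f"raw: {acc_raw:.4f}  engineered: {acc_engineered:.4f}")
```

The averaged numbers will not match the single-run figures above exactly, and whether the engineered columns still come out ahead depends on the seeds.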

Conclusion

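Because I did not build the stacking model, it is hard to conclude that the recommended best_model performs poorly. The package is certainly convenient, though, and the interaction terms produced by automatic feature engineering worked well in this experiment. Taking the random forest as an example, the model trained on the new data reached a test accuracy of about 0.970, against about 0.945 on the original data. That is a gain of roughly 2.5 percentage points (10 more correct predictions out of 399) in this single run.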

So it can quickly process data, build models, and pick hyperparameters, while reaching a reasonable level of predictive accuracy.
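
For reference, the size of the random forest gain can be worked out from the two accuracy values printed earlier and the 399 test rows. This is plain arithmetic on the reported numbers, added as a check.

```python
n_test = 399
acc_engineered = 0.9699248120300752  # random forest on GML-processed features
acc_raw = 0.9448621553884712         # random forest on the original features

# Convert accuracies back into counts of correctly classified test rows.
correct_engineered = round(acc_engineered * n_test)
correct_raw = round(acc_raw * n_test)

# Difference expressed in percentage points.
gain_points = (acc_engineered - acc_raw) * 100
print(correct_engineered, correct_raw, round(gain_points, 2))
```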