這次使用之前分析過的CRE資料,來嘗試使用OCGAN,但因原先資料CRE:NON比數為46:49,為了達到不平衡的效果,因此最後採用16:49的比例,從46個CRE中取隨機16個,而資料的訓練集以及驗證集比例為下:
Train Set(CRE:Non): 6:19
Validation Set(CRE:Non): 10:30

接著我們將分為有使用OCGAN平衡數據集的資料以及未平衡數據集的資料進行分別建模,兩者皆使用SVM的方式建模,並且先透過LOOCV的方式在數據集Tuning模型,最後套入驗證集計算總準確率、f1 score、auc等,比較有無平衡的效果。

讀入資料

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/User/OneDrive - student.nsysu.edu.tw/Educations/NSYSU/fu_chung/bacterial/123.csv')
%matplotlib inline
by_fraud = df.groupby('CRE')
by_fraud.size().plot(kind = 'bar')
<matplotlib.axes._subplots.AxesSubplot at 0x238556a6dd8>

Train vs Test set

train set (0:1):20:6
test set(0:1):2.9:10

from sklearn.model_selection import train_test_split
import random
cre = df[df['CRE'].isin([1])].iloc[0:16,:]
cre['CRE'] = 1
normal = df[df['CRE'].isin([0])].iloc[:,:]
normal['CRE'] = 0

random.seed(3)
train_nor, test_nor = train_test_split(normal, test_size = 0.6)
train_cre, test_cre = train_test_split(cre, test_size = 0.6)
data_train = pd.concat([train_nor,train_cre], axis=0)
data_test = pd.concat([test_nor,test_cre], axis=0) 
C:\Users\User\Anaconda3\envs\Tensorflow-gpu\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

Variable Selection

這次使用Best Subsets feature selection ,因為變數大約1400個,因此運算量極大,最終挑選169個變數,並依此建立後續模型。

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
fix_data = pd.DataFrame(sel.fit_transform(data_train))
fix_data
0 1 2 3 4 5 6 7 8 9 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121
0 0.00 1373136.00 55669.65 29413.53 0.00 0.00 0.00 0.00 0.00 1849143.63 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 0.00 69101.06 3312204.50 43936.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 0.00 55406.89 37459.26 567906.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
3 0.00 0.00 0.00 574.92 2521.91 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
4 0.00 2209497.50 0.00 332850.19 0.00 0.00 0.00 0.00 0.00 88801.86 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
5 0.00 0.00 386523.69 19751.14 0.00 0.00 0.00 0.00 0.00 474353.41 98261.75 0.0 0.0 0.00 0.00 0.00 445534.22 57997.06 123774.34 0.0
6 0.00 410000.25 0.00 3239424.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
7 0.00 0.00 22582.54 275360.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
8 0.00 1656135.25 578440.81 444822.34 0.00 0.00 0.00 0.00 0.00 835118.25 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
9 0.00 7308348.00 0.00 428864.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 267987.5 25078.88 143299.28 0.00 0.00 0.00 0.00 0.0
10 0.00 6349.39 44542.15 12410.13 0.00 0.00 0.00 0.00 0.00 509739.94 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
11 0.00 0.00 8855.12 617930.44 0.00 4434.14 535930.94 355806.91 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
12 0.00 77937.30 42382.20 5088.32 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
13 0.00 510357.69 8396.52 76988.38 0.00 0.00 81731.96 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
14 0.00 0.00 0.00 1980458.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
15 0.00 43712.79 0.00 9753.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
16 0.00 13922.24 0.00 19162.79 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4851091.5 0.0 0.00 0.00 1199.36 0.00 0.00 0.00 0.0
17 0.00 7889.80 215028.61 10020.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
18 0.00 1614557.63 234533.52 496465.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.0
19 0.00 237456.30 489450.28 317787.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.0
20 850345.94 0.00 1142041.75 0.00 22913.52 83391.35 132192.95 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.0
21 0.00 60298.57 242256.88 0.00 0.00 0.00 28529.95 0.00 577876.63 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.0
22 404748.19 97165.77 800137.44 134355.13 0.00 285973.69 152992.25 0.00 0.00 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.0
23 458382.13 136389.20 412460.16 294669.03 46850.84 470360.91 70784.13 52197.92 0.00 0.00 0.00 0.0 0.0 3268253.75 0.00 0.00 0.00 0.00 0.00 1.0
24 0.00 0.00 39572.38 102085.48 0.00 0.00 0.00 0.00 37461.19 0.00 0.00 0.0 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.0

25 rows × 1122 columns

from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import statsmodels.api as sm

def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.argmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.argmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included
X = data_train.iloc[:,0:1471]
y = data_train.iloc[:,1471:1472]
result = stepwise_selection(X, y)

print('resulting features:')
result

[‘V993’, ‘V322’, ‘V864’, ‘V689’, ‘V598’, ‘V1156’, ‘V240’, ‘V395’, ‘V1255’, ‘V1218’, ‘V634’, ‘V529’, ‘V869’, ‘V410’, ‘V521’, ‘V32’, ‘V1201’, ‘V478’, ‘V306’, ‘V964’, ‘V1122’, ‘V485’, ‘V690’, ‘V947’, ‘V677’, ‘V1444’, ‘V832’, ‘V1’, ‘V517’, ‘V351’, ‘V9’, ‘V109’, ‘V872’, ‘V518’, ‘V1239’, ‘V270’, ‘V695’, ‘V147’, ‘V524’, ‘V679’, ‘V320’, ‘V356’, ‘V232’, ‘V687’, ‘V112’, ‘V983’, ‘V146’, ‘V345’, ‘V520’, ‘V198’, ‘V59’, ‘V408’, ‘V110’, ‘V250’, ‘V1275’, ‘V60’, ‘V1253’, ‘V459’, ‘V522’, ‘V889’, ‘V403’, ‘V269’, ‘V87’, ‘V530’, ‘V839’, ‘V399’, ‘V861’, ‘V242’, ‘V823’, ‘V58’, ‘V627’, ‘V84’, ‘V321’, ‘V50’, ‘V483’, ‘V475’, ‘V1396’, ‘V1411’, ‘V1285’, ‘V1093’, ‘V1378’, ‘V413’, ‘V525’, ‘V671’, ‘V30’, ‘V95’, ‘V1199’, ‘V767’, ‘V809’, ‘V1404’, ‘V1401’, ‘V113’, ‘V1198’, ‘V1405’, ‘V1398’, ‘V1209’, ‘V1407’, ‘V1352’, ‘V271’, ‘V528’, ‘V805’, ‘V1397’, ‘V753’, ‘V200’, ‘V1400’, ‘V1408’, ‘V1394’, ‘V593’, ‘V1157’, ‘V233’, ‘V268’, ‘V576’, ‘V181’, ‘V1395’, ‘V820’, ‘V1257’, ‘V514’, ‘V669’, ‘V943’, ‘V489’, ‘V937’, ‘V486’, ‘V513’, ‘V1143’, ‘V966’, ‘V980’, ‘V1274’, ‘V1403’, ‘V343’, ‘V686’, ‘V653’, ‘V1281’, ‘V234’, ‘V1279’, ‘V523’, ‘V870’, ‘V959’, ‘V1278’, ‘V871’, ‘V5’, ‘V775’, ‘V845’, ‘V1211’, ‘V1110’, ‘V1273’, ‘V995’, ‘V1276’, ‘V873’, ‘V595’, ‘V1280’, ‘V1034’, ‘V1228’, ‘V1012’, ‘V1226’, ‘V1094’, ‘V511’, ‘V944’, ‘V1068’, ‘V1146’, ‘V313’, ‘V821’, ‘V122’, ‘V1227’, ‘V386’, ‘V771’, ‘V551’, ‘V538’, ‘V1220’, ‘V1179’]

result = ['V993', 'V322', 'V864', 'V689', 'V598', 'V1156', 'V240', 'V395', 'V1255', 'V1218', 'V634', 'V529', 'V869', 'V410', 'V521', 'V32', 'V1201', 'V478', 'V306', 'V964', 'V1122', 'V485', 'V690', 'V947', 'V677', 'V1444', 'V832', 'V1', 'V517', 'V351', 'V9', 'V109', 'V872', 'V518', 'V1239', 'V270', 'V695', 'V147', 'V524', 'V679', 'V320', 'V356', 'V232', 'V687', 'V112', 'V983', 'V146', 'V345', 'V520', 'V198', 'V59', 'V408', 'V110', 'V250', 'V1275', 'V60', 'V1253', 'V459', 'V522', 'V889', 'V403', 'V269', 'V87', 'V530', 'V839', 'V399', 'V861', 'V242', 'V823', 'V58', 'V627', 'V84', 'V321', 'V50', 'V483', 'V475', 'V1396', 'V1411', 'V1285', 'V1093', 'V1378', 'V413', 'V525', 'V671', 'V30', 'V95', 'V1199', 'V767', 'V809', 'V1404', 'V1401', 'V113', 'V1198', 'V1405', 'V1398', 'V1209', 'V1407', 'V1352', 'V271', 'V528', 'V805', 'V1397', 'V753', 'V200', 'V1400', 'V1408', 'V1394', 'V593', 'V1157', 'V233', 'V268', 'V576', 'V181', 'V1395', 'V820', 'V1257', 'V514', 'V669', 'V943', 'V489', 'V937', 'V486', 'V513', 'V1143', 'V966', 'V980', 'V1274', 'V1403', 'V343', 'V686', 'V653', 'V1281', 'V234', 'V1279', 'V523', 'V870', 'V959', 'V1278', 'V871', 'V5', 'V775', 'V845', 'V1211', 'V1110', 'V1273', 'V995', 'V1276', 'V873', 'V595', 'V1280', 'V1034', 'V1228', 'V1012', 'V1226', 'V1094', 'V511', 'V944', 'V1068', 'V1146', 'V313', 'V821', 'V122', 'V1227', 'V386', 'V771', 'V551', 'V538', 'V1220', 'V1179']
train = data_train.loc[:,result]
test = data_test.loc[:,result]

train_X = new_data_train.iloc[:,:]
test_X = data_test.iloc[:,:]
train_y = new_data_train["CRE"]
test_y = data_test["CRE"]

d_nor = train_nor.loc[:,result]
d_cre = train_cre.loc[:,result]
d_test = pd.concat([test_nor.loc[:,result],test_cre.loc[:,result]], axis=0)

non sampling

loocv

def loocv(ldf):
    ldf = ldf.reset_index(drop=True)
    cv = []
    for i in range(len(ldf)):
        dtrain = ldf.drop([i])
        dtest = ldf.iloc[i:i+1,:]
        train_X = dtrain.iloc[:,0:ldf.shape[1]-1]
        test_X = dtest.iloc[:,0:ldf.shape[1]-1]
        train_y = dtrain["CRE"]
        test_y = dtest["CRE"]
        clf = svm.SVC(kernel = 'linear') #SVM模組,svc,線性核函式 
        clf_fit = clf.fit(train_X, train_y)
        test_y_predicted = clf.predict(test_X)
        accuracy_rf = metrics.accuracy_score(test_y, test_y_predicted)
        cv += [accuracy_rf]
    loocv = np.mean(cv)
    return loocv
df1 = train
df1["CRE"] = data_train["CRE"]
loocv(df1)

0.96

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn import metrics
from sklearn import svm 

train_X = train
test_X = test
train_y = data_train["CRE"]
test_y = data_test["CRE"]


#forest = ensemble.RandomForestClassifier(n_estimators = 10)
#forest_fit = forest.fit(train_X, train_y)
clf = svm.SVC(kernel = 'linear') #SVM模組,svc,線性核函式 
clf_fit = clf.fit(train_X, train_y)
test_y_predicted = clf.predict(test_X)

accuracy_rf = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy_rf)


test_auc = metrics.roc_auc_score(test_y, test_y_predicted)
print (test_auc)

import sklearn
f1 = sklearn.metrics.f1_score(test_y, test_y_predicted)
print(f1)

1.0
1.0
1.0

OCGAN for balance data

### import modules
%matplotlib inline
import os
import random
import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm_notebook as tqdm
from keras.models import Model
from keras.layers import Input, Reshape
from keras.layers.core import Dense, Activation, Dropout, Flatten
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import UpSampling1D, Conv1D
from keras.layers.advanced_activations import LeakyReLU
from keras.optimizers import Adam, SGD,RMSprop
from keras.callbacks import TensorBoard
from sklearn.preprocessing import StandardScaler

# set parameters
dim = 169
num = 6
g_data = d_cre 

# Standard Scaler
ss = StandardScaler()
g_data = pd.DataFrame(ss.fit_transform(g_data))

# wasserstein_loss
from keras import backend 
# implementation of wasserstein loss
def wasserstein_loss(y_true, y_pred):
    return backend.mean(y_true * y_pred)

# generator
def get_generative(G_in, dense_dim=200, out_dim= dim, lr=1e-3):
    x = Dense(dense_dim)(G_in)
    x = Activation('tanh')(x)
    G_out = Dense(out_dim, activation='tanh')(x)
    G = Model(G_in, G_out)
    opt = keras.optimizers.RMSprop(lr=lr)#原先為SGD
    G.compile(loss=wasserstein_loss, optimizer=opt)#原loss為binary_crossentropy
    return G, G_out

G_in = Input(shape=[10])
G, G_out = get_generative(G_in)
G.summary()

# discriminator
def get_discriminative(D_in, lr=1e-3, drate=.25, n_channels= dim, conv_sz=5, leak=.2):#lr=1e-3, drate=.25, n_channels= dim, conv_sz=5, leak=.2
    x = Reshape((-1, 1))(D_in)
    x = Conv1D(n_channels, conv_sz, activation='relu')(x)
    x = Dropout(drate)(x)
    x = Flatten()(x)
    x = Dense(n_channels)(x)
    D_out = Dense(2, activation='linear')(x)#sigmoid
    D = Model(D_in, D_out)
    dopt = keras.optimizers.RMSprop(lr=lr)#原先為Adam
    D.compile(loss=wasserstein_loss, optimizer=dopt)
    return D, D_out

D_in = Input(shape=[dim])
D, D_out = get_discriminative(D_in)
D.summary()

# set up gan
def set_trainability(model, trainable=False):
    model.trainable = trainable
    for layer in model.layers:
        layer.trainable = trainable
        
def make_gan(GAN_in, G, D):
    set_trainability(D, False)
    x = G(GAN_in)
    GAN_out = D(x)
    GAN = Model(GAN_in, GAN_out)
    GAN.compile(loss=wasserstein_loss, optimizer=G.optimizer)#元loss為binary_crossentropy
    return GAN, GAN_out

GAN_in = Input([10])
GAN, GAN_out = make_gan(GAN_in, G, D)
GAN.summary()

# pre train
def sample_data_and_gen(G, noise_dim=10, n_samples= num):
    XT = np.array(g_data)
    XN_noise = np.random.uniform(0, 1, size=[n_samples, noise_dim])
    XN = G.predict(XN_noise)
    X = np.concatenate((XT, XN))
    y = np.zeros((2*n_samples, 2))
    y[:n_samples, 1] = 1
    y[n_samples:, 0] = 1
    return X, y

def pretrain(G, D, noise_dim=10, n_samples = num, batch_size=32):
    X, y = sample_data_and_gen(G, n_samples=n_samples, noise_dim=noise_dim)
    set_trainability(D, True)
    D.fit(X, y, epochs=1, batch_size=batch_size)
    
pretrain(G, D)

def sample_noise(G, noise_dim=10, n_samples=num):
    X = np.random.uniform(0, 1, size=[n_samples, noise_dim])
    y = np.zeros((n_samples, 2))
    y[:, 1] = 1
    return X, y

# one class detector
def oneclass(data,kernel = 'rbf',gamma = 'auto'):
    num1 = int(len(data)/2)
    num2 = int(len(data)+1)
    from sklearn import svm
    clf = svm.OneClassSVM(kernel=kernel, gamma=gamma).fit(data[0:num1])
    origin = pd.DataFrame(clf.score_samples(data[0:num1]))
    new = pd.DataFrame(clf.score_samples(data[num1:num2]))

    occ = pd.concat([pd.DataFrame(new[0] < origin[0].min()),pd.DataFrame(new[0] > origin[0].max())], axis=1)
    occ['ava'] = pd.DataFrame(occ.iloc[:,1:2] == occ.iloc[:,0:1])
    err = sum(occ['ava'] == False)/len(occ['ava'])
    return err

# productor
def gen(GAN, G, D, times=50, n_samples= num, noise_dim=10, batch_size=32, verbose=False, v_freq=dim,):
    data = pd.DataFrame()
    for epoch in range(times):
        X, y = sample_data_and_gen(G, n_samples=n_samples, noise_dim=noise_dim)
        set_trainability(D, True)   
        xx,yy = X,y
        err = oneclass(xx)        
        num1 = int(len(xx)/2)
        num2 = int(len(xx)+1)
        
        data = pd.concat([data,pd.DataFrame(ss.inverse_transform(xx[num1:num2]))],axis = 0)              
        print("The %d times generator one class svm Error Rate=%f" %(epoch, err))           
            
    return data

# training
def train(GAN, G, D, epochs=1, n_samples= num, noise_dim=10, batch_size=32, verbose=False, v_freq=dim,):
    d_loss = []
    g_loss = []
    e_range = range(epochs)
    if verbose:
        e_range = tqdm(e_range)
    for epoch in e_range:
        X, y = sample_data_and_gen(G, n_samples=n_samples, noise_dim=noise_dim)
        set_trainability(D, True)
        d_loss.append(D.train_on_batch(X, y))
        xx,yy = X,y
        err = oneclass(xx) 
        print("The %d times epoch one class svm Error Rate=%f" %(epoch, err))
        
        X, y = sample_noise(G, n_samples=n_samples, noise_dim=noise_dim)
        set_trainability(D, False)
        g_loss.append(GAN.train_on_batch(X, y))
        if verbose and (epoch + 1) % v_freq == 0:
            print("Epoch #{}: Generative Loss: {}, Discriminative Loss: {}".format(epoch + 1, g_loss[-1], d_loss[-1]))       
            
    return d_loss, g_loss, xx, yy

d_loss, g_loss ,xx,yy= train(GAN, G, D, verbose=True)

Layer (type) Output Shape Param #
================================================================= input_22 (InputLayer) (None, 10) 0
_________________________________________________________________ dense_29 (Dense) (None, 200) 2200
_________________________________________________________________ activation_8 (Activation) (None, 200) 0
_________________________________________________________________ dense_30 (Dense) (None, 169) 33969
================================================================= Total params: 36,169 Trainable params: 36,169 Non-trainable params: 0 _________________________________________________________________ _________________________________________________________________ Layer (type) Output Shape Param #
================================================================= input_23 (InputLayer) (None, 169) 0
_________________________________________________________________ reshape_8 (Reshape) (None, 169, 1) 0
_________________________________________________________________ conv1d_8 (Conv1D) (None, 165, 169) 1014
_________________________________________________________________ dropout_8 (Dropout) (None, 165, 169) 0
_________________________________________________________________ flatten_8 (Flatten) (None, 27885) 0
_________________________________________________________________ dense_31 (Dense) (None, 169) 4712734
_________________________________________________________________ dense_32 (Dense) (None, 2) 340
================================================================= Total params: 4,714,088 Trainable params: 4,714,088 Non-trainable params: 0 _________________________________________________________________ _________________________________________________________________ Layer (type) Output Shape Param #
================================================================= input_24 (InputLayer) (None, 10) 0
_________________________________________________________________ model_22 (Model) (None, 169) 36169
_________________________________________________________________ model_23 (Model) (None, 2) 4714088
================================================================= Total params: 4,750,257 Trainable params: 36,169 Non-trainable params: 4,714,088 _________________________________________________________________ Epoch 1/1 12/12 [==============================] - 1s 48ms/step - loss: -0.0091

HBox(children=(IntProgress(value=0, max=1), HTML(value=’’)))

The 0 times epoch one class svm Error Rate=1.000000

d_loss, g_loss ,xx,yy= train(GAN, G, D, epochs=300, verbose=True)

loss plot

ax = pd.DataFrame(
    {
        'Generative Loss': g_loss,
        'Discriminative Loss': d_loss,
    }
).plot(title='Training loss', logy=False)
ax.set_xlabel("Epochs")
ax.set_ylabel("Loss")

Text(0, 0.5, ‘Loss’)

generate and bulid models

new_data = gen(GAN, G, D, times = 2,verbose=True)
new_data.columns = d_nor.columns
new_data['CRE'] = 1
d_train = pd.concat([d_nor,d_cre],axis = 0)
d_train["CRE"] = data_train["CRE"]
new_data_train = pd.concat([d_train,new_data],axis = 0)
new_data_train

The 0 times generator one class svm Error Rate=1.000000 The 1 times generator one class svm Error Rate=1.000000

V993 V322 V864 V689 V598 V1156 V240 V395 V1255 V1218 V821 V122 V1227 V386 V771 V551 V538 V1220 V1179 CRE
84 0.000000e+00 0.000000 1616.050000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 1524.790000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 576.870000 0.000000 0.000000 0.000000 0
59 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
93 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
49 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
89 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
46 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 532747.500000 0.000000 0
50 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
73 0.000000e+00 0.000000 0.000000 0.000000 3.789579e+06 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
53 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 75932.230000 0.000000 0
90 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 120355.780000 0
72 0.000000e+00 0.000000 0.000000 0.000000 6.081896e+04 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
55 0.000000e+00 43165.340000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 1.610336e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
80 2.226269e+05 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 58034.680000 0
62 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 47390.040000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 107487.350000 0.000000 0.000000 0.000000 0.000000 0.000000 0
78 0.000000e+00 4350.400000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 47365.650000 0.000000 0.000000 0
63 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
79 7.658691e+05 0.000000 0.000000 0.000000 0.000000e+00 6783.660000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 385997.310000 0.000000 23789.800000 0.000000 0.000000 17489.530000 0.000000 0.000000 0
75 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 31978.320000 0.000000 0.000000 0
88 0.000000e+00 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 4.644874e+04 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0
12 4.961734e+06 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 145406.670000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1
3 2.020681e+07 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 204892.950000 0.000000 1
2 4.932036e+06 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 72948.780000 0.000000 1
8 9.387160e+06 79444.560000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 5.207452e+04 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1
6 6.053306e+06 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1
13 1.272117e+07 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1
0 1.512491e+07 -16218.833383 -0.993609 0.995566 -9.963942e-01 -0.996621 0.994142 0.998140 -0.995715 -0.996149 -1.062502e+04 -29740.948329 -0.998415 0.997263 0.994520 -0.997255 -0.993365 121635.468003 -0.993095 1
1 1.506312e+07 -16018.118402 -0.985025 0.989016 -9.891991e-01 -0.990969 0.981192 0.993236 -0.987260 -0.986484 -1.056079e+04 -29401.546549 -0.991166 0.994364 0.985471 -0.986429 -0.985721 120628.955686 -0.984401 1
2 1.511217e+07 -16203.647890 -0.992818 0.993807 -9.955698e-01 -0.995253 0.990759 0.996745 -0.993857 -0.993840 -1.061835e+04 -29667.786330 -0.996489 0.996588 0.992731 -0.993555 -0.990812 121295.633443 -0.992025 1
3 1.496042e+07 -15581.279153 -0.966205 0.968444 -9.806563e-01 -0.977659 0.957520 0.980549 -0.969081 -0.972953 -1.025045e+04 -28607.187664 -0.987470 0.981976 0.964365 -0.969920 -0.963239 118564.882265 -0.971673 1
4 1.513735e+07 -16285.360136 -0.997202 0.997497 -9.974456e-01 -0.997892 0.997145 0.998738 -0.998016 -0.997765 -1.067976e+04 -29855.686473 -0.999103 0.998467 0.996284 -0.997739 -0.997499 121800.363942 -0.996244 1
5 1.507087e+07 -15967.363031 -0.976027 0.983896 -9.862480e-01 -0.986026 0.980322 0.993358 -0.987903 -0.982607 -1.038747e+04 -29348.452340 -0.988892 0.990501 0.981182 -0.988225 -0.981214 120886.994780 -0.974020 1
0 1.508056e+07 -15925.283069 -0.987930 0.987971 -9.846137e-01 -0.990841 0.990818 0.993492 -0.989569 -0.989493 -1.049298e+04 -29527.114775 -0.993623 0.992465 0.983135 -0.987305 -0.993185 121117.700984 -0.977792 1
1 1.514196e+07 -16296.326162 -0.996452 0.996435 -9.980265e-01 -0.998247 0.996632 0.998864 -0.996634 -0.997642 -1.066461e+04 -29840.008210 -0.998694 0.998259 0.995150 -0.998254 -0.996857 121828.771543 -0.995327 1
2 1.512663e+07 -16255.855632 -0.994796 0.996507 -9.964558e-01 -0.997275 0.993639 0.998080 -0.995786 -0.996104 -1.067495e+04 -29788.474074 -0.997894 0.998463 0.995382 -0.994885 -0.995674 121512.002636 -0.994164 1
3 1.512467e+07 -16237.220094 -0.993826 0.994421 -9.969582e-01 -0.996626 0.993722 0.997794 -0.995277 -0.995618 -1.063710e+04 -29764.659522 -0.998138 0.997284 0.993458 -0.995569 -0.994352 121550.172951 -0.993698 1
4 1.514689e+07 -16320.480008 -0.997326 0.998062 -9.979978e-01 -0.998915 0.997715 0.999290 -0.997620 -0.998389 -1.069688e+04 -29884.836932 -0.999016 0.999236 0.997096 -0.998345 -0.998637 121884.855218 -0.995719 1
5 1.513991e+07 -16296.945582 -0.996303 0.997118 -9.973860e-01 -0.998098 0.996116 0.998822 -0.996933 -0.997412 -1.067624e+04 -29835.214938 -0.998451 0.998534 0.995529 -0.998190 -0.996817 121807.697267 -0.995231 1

37 rows × 170 columns

loocv(new_data_train)

1.0

from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn import metrics

train_X = new_data_train.iloc[:,0:169]
test_X = test.iloc[:,:]
train_y = new_data_train["CRE"]
test_y = data_test["CRE"]


forest = ensemble.RandomForestClassifier(n_estimators = 10)
forest_fit = forest.fit(train_X, train_y)

test_y_predicted = forest.predict(test_X)
accuracy_rf = metrics.accuracy_score(test_y, test_y_predicted)
print(accuracy_rf)


test_auc = metrics.roc_auc_score(test_y, test_y_predicted)
print (test_auc)

import sklearn
f1 = sklearn.metrics.f1_score(test_y, test_y_predicted)
print(f1)

1.0
1.0
1.0