In this Python project, I use fictitious customer data from a bank to build a model that predicts which clients are likely to churn.

Machine Learning Bank Customer Churn Prediction Project

1) Data Preprocessing

Import all the necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

Read the Data set and store it in a pandas Data Frame:

In [2]:
# Forward slashes avoid the invalid '\d' escape sequence and work on all platforms
df = pd.read_csv('./datata/Churn_Modelling.csv')

Suppress warnings:

In [3]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
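
Replacing warnings.warn silences every warning for the rest of the session, including ones that may matter. A narrower option, sketched below with the standard library only, is to ignore a single warning category inside a limited block. The function noisy_function is a hypothetical stand-in for a library call that warns.

```python
import warnings


def noisy_function():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# Ignore only FutureWarning, and only inside this block.
# record=True collects any warning that still gets through the filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_function()

print(result)       # the function still returns its value
print(len(caught))  # number of warnings that passed the filter
```

Outside the with block the previous warning filters are restored, so unrelated warnings remain visible.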

Explore the names of columns in the data frame:

In [4]:
df.columns
Out[4]:
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

Number of rows and columns in the data frame:

In [5]:
df.shape
Out[5]:
(10000, 14)

General information about the columns of the data frame:

This shows the data type of each column, the non-null counts, and the memory used by the data set.
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

What are the data types of the columns in the data frame:

In [7]:
df.dtypes
Out[7]:
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

Display the first 10 rows of data frame:

In [8]:
df.head(10)
Out[8]:
   RowNumber  CustomerId   Surname  CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619     France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502     France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699     France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850      Spain  Female   43       2  125510.82              1          1               1         79084.10       0
5          6    15574012       Chu          645      Spain    Male   44       8  113755.78              2          1               0        149756.71       1
6          7    15592531  Bartlett          822     France    Male   50       7       0.00              2          1               1         10062.80       0
7          8    15656148    Obinna          376    Germany  Female   29       4  115046.74              4          1               0        119346.88       1
8          9    15792365        He          501     France    Male   44       4  142051.07              2          0               1         74940.50       0
9         10    15592389        H?          684     France    Male   27       2  134603.88              1          1               1         71725.73       0

Count the missing values in the columns of the data frame:

In [9]:
df.isnull().sum()
Out[9]:
RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Unique values of the categorical and discrete numeric variables:

In [10]:
df["Geography"].unique(),df["Gender"].unique(), df.NumOfProducts.unique(), df.HasCrCard.unique(), df.IsActiveMember.unique()
Out[10]:
(array(['France', 'Spain', 'Germany'], dtype=object),
 array(['Female', 'Male'], dtype=object),
 array([1, 3, 2, 4], dtype=int64),
 array([1, 0], dtype=int64),
 array([1, 0], dtype=int64))

Drop the unnecessary columns from data frame:

In [11]:
df = df.drop(['CustomerId','RowNumber','Surname'], axis = "columns")
df.head()
Out[11]:
   CreditScore  Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          619     France  Female   42       2       0.00              1          1               1        101348.88       1
1          608      Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          502     France  Female   42       8  159660.80              3          1               0        113931.57       1
3          699     France  Female   39       1       0.00              2          0               0         93826.63       0
4          850      Spain  Female   43       2  125510.82              1          1               1         79084.10       0

General statistics of the data (count, mean, std, min, Q1, median, Q3, max):

In [12]:
df.describe().T
Out[12]:
                   count           mean           std     min       25%         50%          75%        max
CreditScore      10000.0     650.528800     96.653299  350.00    584.00     652.000     718.0000     850.00
Age              10000.0      38.921800     10.487806   18.00     32.00      37.000      44.0000      92.00
Tenure           10000.0       5.012800      2.892174    0.00      3.00       5.000       7.0000      10.00
Balance          10000.0   76485.889288  62397.405202    0.00      0.00   97198.540  127644.2400  250898.09
NumOfProducts    10000.0       1.530200      0.581654    1.00      1.00       1.000       2.0000       4.00
HasCrCard        10000.0       0.705500      0.455840    0.00      0.00       1.000       1.0000       1.00
IsActiveMember   10000.0       0.515100      0.499797    0.00      0.00       1.000       1.0000       1.00
EstimatedSalary  10000.0  100090.239881  57510.492818   11.58  51002.11  100193.915  149388.2475  199992.48
Exited           10000.0       0.203700      0.402769    0.00      0.00       0.000       0.0000       1.00

Value counts for Class column(Exited):

In [13]:
ValueCounts = df['Exited'].value_counts()
print(ValueCounts)
0    7963
1    2037
Name: Exited, dtype: int64

A bar graph showing the value counts for the class column (Exited):

The dataset is imbalanced: only about 20% of the customers (2,037 of 10,000) exited. To handle the imbalance, we may need to resample the data.

In [14]:
ax = ValueCounts.plot(kind='bar',figsize=(14,8), width=0.40 ,color=['lightblue','green'])
ax.set_xlabel("Exited values",fontsize=15)      
ax.set_ylabel("Frequency Count",fontsize=15)
ax.set_title( 'A Bar graph showing values Class variable (Exited)' ,fontsize = 15)
plt.show()
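
To put a number on the imbalance, the short sketch below recomputes the class shares from the counts printed above. The Series is rebuilt by hand here so the snippet runs on its own; in the notebook the same arithmetic could be applied to ValueCounts directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 7963, 1: 2037}, name="Exited")

share = counts / counts.sum()                   # fraction of rows in each class
imbalance_ratio = counts.max() / counts.min()   # majority rows per minority row

print(share.to_dict())
print(round(float(imbalance_ratio), 2))
```

The majority share is also the accuracy of a model that always predicts "stayed" on the original data, which is a useful baseline to keep in mind when reading accuracy figures.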

Convert the categorical columns to numeric using get dummies:

In [15]:
df1 = pd.get_dummies(df,columns = ['Geography','Gender'])
df1.head()
Out[15]:
   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited  Geography_France  Geography_Germany  Geography_Spain  Gender_Female  Gender_Male
0          619   42       2       0.00              1          1               1        101348.88       1                 1                  0                0              1            0
1          608   41       1   83807.86              1          0               1        112542.58       0                 0                  0                1              1            0
2          502   42       8  159660.80              3          1               0        113931.57       1                 1                  0                0              1            0
3          699   39       1       0.00              2          0               0         93826.63       0                 1                  0                0              1            0
4          850   43       2  125510.82              1          1               1         79084.10       0                 0                  0                1              1            0
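
Note that Gender_Female and Gender_Male carry the same information (one is always 1 minus the other), and likewise one Geography column is implied by the other two. This redundancy is harmless for tree models but can matter for linear models. As a small sketch on a hypothetical toy frame, drop_first=True keeps one column fewer per category:

```python
import pandas as pd

# Tiny hypothetical frame with the same two categorical columns.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
})

full = pd.get_dummies(toy, columns=["Geography", "Gender"])
reduced = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)

print(list(full.columns))     # one column per category level
print(list(reduced.columns))  # first level of each category dropped
```

The notebook keeps all dummy columns, so this is only an alternative encoding, not what the results below are based on.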

Heat map (Correlation Matrix) before Resampling of imbalanced data:

In [16]:
plt.figure(figsize=(16,10))  
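# Correlations are only defined for numeric columns; Geography and Gender are still strings in df
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True,linewidths=.5, cmap="RdYlGn")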
plt.title('Heatmap showing correlations among columns',fontsize = 15)
plt.show()

Storing Features into "X" matrix and Response Class into "y" vector:

In [17]:
#Features
X = df1.loc[:,df1.columns != 'Exited']
#Response
y = df1['Exited']  

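Import SMOTE and fit it on X and y to oversample the minority class, so that both response classes are equally represented:

In [18]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
# fit_resample is the current name; the older fit_sample was removed from recent imbalanced-learn releases
X1, y1 = sm.fit_resample(X, y)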
Using TensorFlow backend.
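
For intuition about what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic row is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function smote_like and its parameters are hypothetical and written for illustration; it is not the imbalanced-learn implementation.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)  # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Four hypothetical minority samples at the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6, k=2)
print(synthetic.shape)
print(synthetic.round(2))
```

Because the new rows are interpolated, 0/1 columns such as HasCrCard or the dummy variables can take fractional values in the synthetic samples.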

A bar graph showing the value counts for the Exited column after resampling:

We can observe that we now have an equal number of exited and non-exited customers in the data set.
In [19]:
ValueCounts = pd.Series(np.array(y1)).value_counts()
ax = ValueCounts.plot(kind='bar',figsize=(14,8), width=0.40 ,fontsize=15,color=['lightblue','green'], title='Graph After Resampling Exited Column Values' )
ax.set_xlabel("Exited values",fontsize=15)      
ax.set_ylabel("Frequency Count",fontsize=15)
Out[19]:
Text(0, 0.5, 'Frequency Count')

Combine the resampled features and response:

In [20]:
df2 = pd.concat([pd.DataFrame(X1), pd.Series(y1)], axis=1)
# X1 holds the 13 feature columns and y1 is appended last, so the labels must follow that order
df2.columns = list(X.columns) + ['Exited']
df2.head()
Out[20]:
   CreditScore   Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Geography_France  Geography_Germany  Geography_Spain  Gender_Female  Gender_Male  Exited
0        619.0  42.0     2.0       0.00            1.0        1.0             1.0        101348.88               1.0                0.0              0.0            1.0          0.0       1
1        608.0  41.0     1.0   83807.86            1.0        0.0             1.0        112542.58               0.0                0.0              1.0            1.0          0.0       0
2        502.0  42.0     8.0  159660.80            3.0        1.0             0.0        113931.57               1.0                0.0              0.0            1.0          0.0       1
3        699.0  39.0     1.0       0.00            2.0        0.0             0.0         93826.63               1.0                0.0              0.0            1.0          0.0       0
4        850.0  43.0     2.0  125510.82            1.0        1.0             1.0         79084.10               0.0                0.0              1.0            1.0          0.0       0
In [21]:
plt.figure(figsize=(15,10))  
sns.heatmap(df2.corr(),annot=True,linewidths=.5, cmap="RdYlGn")
plt.title('Heatmap After Resampling')
plt.show()

2) Data Evaluation and Exploratory Data Analysis (EDA)

Explore the distribution of columns and visualize them:

In [22]:
fig, axes = plt.subplots(2,4,figsize=(12,6))
feats = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
for i, ax in enumerate(axes.flatten()):
    ax.hist(df[feats[i]], bins=25, color='green')
    ax.set_title(str(feats[i])+' Distribution', color='brown')
    ax.set_yscale('log')
plt.tight_layout()

A count plot comparing Geography and Exited:

In [23]:
plt.figure(figsize=(14,8))
ax = sns.countplot(x="Geography", hue="Exited", data=df)
ax.set_xlabel("Geography",fontsize=15)  
ax.set_ylabel("Count",fontsize=15)
ax.set_title('Comparison of Geography and Exited columns ',fontsize=15)
plt.show()

A count plot comparing Gender and Exited:

In [24]:
plt.figure(figsize=(14,8))
ax = sns.countplot(x="Gender", hue="Exited", data=df)
ax.set_xlabel("Gender",fontsize=15)  
ax.set_ylabel("Count",fontsize=15)
ax.set_title('Comparison of Gender and Exited columns ',fontsize=15)
plt.show()

A count plot comparing HasCrCard and Exited:

In [25]:
plt.figure(figsize=(14,8))
ax = sns.countplot(x="HasCrCard", hue="Exited", data=df)
ax.set_xlabel("HasCrCard",fontsize=15)  
ax.set_ylabel("Count",fontsize=15)
ax.set_title('Comparison of HasCrCard and Exited columns ',fontsize=15)
plt.show()

A count plot comparing NumOfProducts and Exited:

In [26]:
plt.figure(figsize=(14,8))
ax = sns.countplot(x="NumOfProducts", hue="Exited", data=df)
ax.set_xlabel("NumOfProducts",fontsize=15)  
ax.set_ylabel("Count",fontsize=15)
ax.set_title('Comparison of NumOfProducts and Exited columns ',fontsize=15)
plt.show()

A bar graph showing the exit rate for each gender:

In [27]:
Gender_exite_rate = df.groupby('Gender').Exited.mean()
Gender_exite_rate
Out[27]:
Gender
Female    0.250715
Male      0.164559
Name: Exited, dtype: float64
In [28]:
ax = Gender_exite_rate.plot(kind='bar',figsize=(14,8), width=0.40 ,color=['gray','brown'])
ax.set_xlabel("Gender",fontsize=15)      
ax.set_ylabel("Exit Rate",fontsize=15)
ax.set_title( 'A Bar graph showing Exit Rate for each gender' ,fontsize = 15)
plt.show()

A bar graph showing the exit rate for each tenure value:

In [29]:
Tenure_exite_rate = df.groupby('Tenure').Exited.mean()
Tenure_exite_rate
Out[29]:
Tenure
0     0.230024
1     0.224155
2     0.191794
3     0.211100
4     0.205258
5     0.206522
6     0.202689
7     0.172179
8     0.192195
9     0.216463
10    0.206122
Name: Exited, dtype: float64
In [30]:
ax = Tenure_exite_rate.plot(kind='bar',figsize=(14,8), width=0.40 ,color=['gray','brown'])
ax.set_xlabel("Tenure",fontsize=15)      
ax.set_ylabel("Exit Rate",fontsize=15)
ax.set_title( 'A Bar graph showing Exit Rate for each Tenure' ,fontsize = 15)
plt.show()

A scatter plot showing the exit rate for each credit score:

In [31]:
CreditScore_Exite_rate = df.groupby('CreditScore').Exited.mean()
CreditScore_Exite_rate
Out[31]:
CreditScore
350    1.000000
351    1.000000
358    1.000000
359    1.000000
363    1.000000
         ...   
846    0.400000
847    0.333333
848    0.000000
849    0.250000
850    0.184549
Name: Exited, Length: 460, dtype: float64
In [32]:
plt.figure(figsize = (14,7))
plt.scatter(x=CreditScore_Exite_rate.index , y= CreditScore_Exite_rate.values,color = 'blue',marker = 'o',)
plt.xlabel('CreditScore',fontsize = 15)
plt.ylabel('Exit Rate',fontsize = 15)
plt.title('Scatter plot showing Exit Rate for each CreditScore',fontsize = 15)
plt.show()

A scatter plot showing the exit rate for each age:

In [33]:
Age_Exite_rate = df.groupby('Age').Exited.mean()
Age_Exite_rate
Out[33]:
Age
18    0.090909
19    0.037037
20    0.050000
21    0.056604
22    0.142857
        ...   
83    0.000000
84    0.500000
85    0.000000
88    0.000000
92    0.000000
Name: Exited, Length: 70, dtype: float64
In [34]:
plt.figure(figsize = (14,7))
plt.scatter(x=Age_Exite_rate.index , y= Age_Exite_rate.values,color = 'blue',marker = 'o',)
plt.xlabel('Age',fontsize = 15)
plt.ylabel('Exit Rate',fontsize = 15)
plt.title('Scatter plot showing Exit Rate for Age',fontsize = 15)
plt.show()

General Statistics for the data before modelling:

In [35]:
df1.describe()
Out[35]:
        CreditScore           Age        Tenure        Balance  NumOfProducts    HasCrCard  IsActiveMember
count  10000.000000  10000.000000  10000.000000   10000.000000   10000.000000  10000.00000    10000.000000
mean     650.528800     38.921800      5.012800   76485.889288       1.530200      0.70550        0.515100
std       96.653299     10.487806      2.892174   62397.405202       0.581654      0.45584        0.499797
min      350.000000     18.000000      0.000000       0.000000       1.000000      0.00000        0.000000
25%      584.000000     32.000000      3.000000       0.000000       1.000000      0.00000        0.000000
50%      652.000000     37.000000      5.000000   97198.540000       1.000000      1.00000        1.000000
75%      718.000000     44.000000      7.000000  127644.240000       2.000000      1.00000        1.000000
max      850.000000     92.000000     10.000000  250898.090000       4.000000      1.00000        1.000000

       EstimatedSalary        Exited  Geography_France  Geography_Germany  Geography_Spain  Gender_Female   Gender_Male
count     10000.000000  10000.000000      10000.000000       10000.000000     10000.000000   10000.000000  10000.000000
mean     100090.239881      0.203700          0.501400           0.250900         0.247700       0.454300      0.545700
std       57510.492818      0.402769          0.500023           0.433553         0.431698       0.497932      0.497932
min          11.580000      0.000000          0.000000           0.000000         0.000000       0.000000      0.000000
25%       51002.110000      0.000000          0.000000           0.000000         0.000000       0.000000      0.000000
50%      100193.915000      0.000000          1.000000           0.000000         0.000000       0.000000      1.000000
75%      149388.247500      0.000000          1.000000           1.000000         0.000000       1.000000      1.000000
max      199992.480000      1.000000          1.000000           1.000000         1.000000       1.000000      1.000000

3) Model Selection

Divide the dataset into train set and test set (80 percent train and 20 percent test):

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.20, random_state=0)  

Shapes for training and testing sets:

In [37]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[37]:
((12740, 13), (3186, 13), (12740,), (3186,))
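
The training and test sets above come from the resampled data, so the test set also contains synthetic rows. A common alternative is to split the original data first and resample only the training part, which keeps the test set made of real customers. The sketch below illustrates that order on hypothetical data from make_classification, and it uses plain random oversampling from sklearn.utils.resample as a simple stand-in for SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data (about 80% / 20%), standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# 1) Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0
)

# 2) Oversample the minority class in the training part only.
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print(np.bincount(y_tr_bal))  # class counts in the balanced training set
print(X_te.shape)             # test set is left untouched
```

Accuracy measured this way reflects the original class balance, so it is usually lower than accuracy measured on a resampled test set.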

Baseline Models:

Model 1 : Decision Tree

Importing DecisionTreeClassifier from tree in sklearn:

In [38]:
from sklearn.tree import DecisionTreeClassifier

Build Decision Tree classifier using default parameters :

In [39]:
Dt_classifier = DecisionTreeClassifier()

Call 'fit' function of the created DT model:

In [40]:
Dt_classifier.fit(X_train, y_train)
Out[40]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Make predictions using test data on Decision Tree model:

In [41]:
y_pred = Dt_classifier.predict(X_test)

Accuracy and Confusion matrix of Decision Tree Model:

In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix Decision Tree:\n',confusion_matrix(y_test,y_pred))
accu_dt = accuracy_score(y_test,y_pred)
print(' Accuracy Decision Tree:\n',accu_dt,'\n')
Confusion Matrix Decision Tree:
 [[1345  245]
 [ 195 1401]]
 Accuracy Decision Tree:
 0.8618957940991839 

Model 2 : Support Vector Machine

Importing SVC from the sklearn.svm module:

In [43]:
from sklearn.svm import SVC

Build an SVM classifier using the default RBF kernel:

In [44]:
Svc_classifier = SVC()

Call 'fit' function of the SVM model to train it:

In [45]:
Svc_classifier.fit(X_train, y_train)
Out[45]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

Make predictions using test data on SVM model:

In [46]:
y_pred = Svc_classifier.predict(X_test)

Accuracy and Confusion matrix of SVM Model:

In [47]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix SVM:\n',confusion_matrix(y_test,y_pred))
accu_svm = accuracy_score(y_test,y_pred)
print(' Accuracy SVM:\n',accu_svm,'\n')
Confusion Matrix SVM:
 [[1590    0]
 [1552   44]]
 Accuracy SVM:
 0.5128688010043942 
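
The confusion matrix shows that the SVM predicts class 0 for almost every test row. One likely reason is that the features were never scaled: StandardScaler is imported at the top but not applied, and an RBF kernel is sensitive to large-valued columns such as Balance and EstimatedSalary. The sketch below shows the effect on hypothetical data with one small informative feature and one large noise feature; the numbers it prints are not results for the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)             # small-scale, informative feature
noise = rng.normal(scale=1e5, size=n)   # large-scale, uninformative feature
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same classifier, with and without standardising the features first.
raw_svm = SVC().fit(X_tr, y_tr)
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svm.score(X_te, y_te)
scaled_acc = scaled_svm.score(X_te, y_te)
print("unscaled:", raw_acc)
print("scaled:  ", scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused on the test data. The same idea applies to the KNN model, which is also distance based.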

Model 3 : Random Forest

Importing RandomForestClassifier from the sklearn.ensemble module:

In [48]:
from sklearn.ensemble import RandomForestClassifier 

Build RandomForestClassifier using default parameters :

In [49]:
Rf_classifier = RandomForestClassifier()  

Train the random forest on training data:

In [50]:
Rf_classifier.fit(X_train, y_train)  
Out[50]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Make predictions using test data on Random forest model:

In [51]:
y_pred = Rf_classifier.predict(X_test)

Accuracy and Confusion matrix of random forest Model:

In [52]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix Random Forest:\n',confusion_matrix(y_test,y_pred))
accu_rf = accuracy_score(y_test,y_pred)
print(' Accuracy Random Forest:\n',accu_rf,'\n')
Confusion Matrix Random Forest:
 [[1492   98]
 [ 226 1370]]
 Accuracy Random Forest:
 0.8983050847457628 

Model 4 : Logistic Regression

Importing LogisticRegression from sklearn linear_model:

In [53]:
from sklearn.linear_model import LogisticRegression

Build Logistic Regression classifier using default parameters:

In [54]:
Lg_classifier = LogisticRegression()  

Call 'fit' function of the created model:

In [55]:
Lg_classifier.fit(X_train, y_train) 
Out[55]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Create predictions by calling 'predict' function of the fitted model:

In [56]:
y_pred = Lg_classifier.predict(X_test) 
y_pred[0:30]
Out[56]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 0], dtype=int64)

Accuracy and Confusion matrix of Logistic Regression Model:

In [57]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix Logistic Regression:\n',confusion_matrix(y_test,y_pred))
accu_lg = accuracy_score(y_test,y_pred)
print(' Accuracy Logistic Regression:\n',accu_lg,'\n')
Confusion Matrix Logistic Regression:
 [[1094  496]
 [ 440 1156]]
 Accuracy Logistic Regression:
 0.7062146892655368 

Model 5 : K-Nearest Neighbor Model

Importing KNeighborsClassifier from the sklearn.neighbors module:

In [58]:
from sklearn.neighbors import KNeighborsClassifier

Build KNeighborsClassifier using n_neighbors = 5 parameter:

In [59]:
knn_classifier = KNeighborsClassifier(n_neighbors=5)

Call 'fit' function of the created KNN model:

In [60]:
knn_classifier.fit(X_train, y_train) 
Out[60]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

Make predictions by calling 'predict' function of the fitted KNN model:

In [61]:
y_pred = knn_classifier.predict(X_test) 

Accuracy and Confusion matrix of KNN Model:

In [62]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix KNN Model:\n',confusion_matrix(y_test,y_pred))
accu_knn = accuracy_score(y_test,y_pred)
print(' Accuracy KNN Model:\n',accu_knn,'\n')
Confusion Matrix KNN Model:
 [[ 952  638]
 [ 379 1217]]
 Accuracy KNN Model:
 0.6807909604519774 

Model 6 : Naive Bayes Model:

Importing GaussianNB from the sklearn.naive_bayes module:

In [63]:
from sklearn.naive_bayes import GaussianNB

Build GaussianNB using default parameters:

In [64]:
NB_classifier = GaussianNB()

Call 'fit' function of the created Naive bayes model:

In [65]:
NB_classifier.fit(X_train, y_train)
Out[65]:
GaussianNB(priors=None, var_smoothing=1e-09)

Make predictions by calling 'predict' function of the fitted Naive Bayes model:

In [66]:
y_pred = NB_classifier.predict(X_test) 

Accuracy and Confusion matrix of NaiveBayes Model:

In [67]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix Naive Bayes:\n',confusion_matrix(y_test,y_pred))
accu_nb = accuracy_score(y_test,y_pred)
print(' Accuracy Naive Bayes Model:\n',accu_nb,'\n')
Confusion Matrix Naive Bayes:
 [[1069  521]
 [ 361 1235]]
 Accuracy Naive Bayes Model:
 0.7231638418079096 

Comparing different machine learning models (model selection) in terms of accuracy:

In [68]:
results = pd.Series([accu_dt , accu_svm, accu_rf, accu_lg, accu_knn, accu_nb  ])
names = ['Decision Tree','SVM','Random Forest','Logistic Regression','KNN','Naive Bayes']
ax = results.plot(kind = 'bar',figsize=(13,7),color=['black','gray','brown','blue','pink','green'])
ax.set_title('Comparison of Models',fontsize=15)
ax.set_yticks([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
ax.set_xticklabels(names ,fontsize=15,rotation = 45)
ax.set_xlabel("Models",fontsize=15)
ax.set_ylabel("Accuracy",fontsize=15)
Out[68]:
Text(0, 0.5, 'Accuracy')

From the above comparison, Random Forest gave the highest accuracy (about 0.90), so we select the Random Forest model.
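
The six baseline models all follow the same fit, predict, and score steps, so the comparison can also be written as one loop over a dictionary of models. This is a minimal sketch on hypothetical data with three of the models; the names X_demo, models, and scores are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data standing in for the churn features and labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and record its test accuracy under its name.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)
```

Keeping the scores in a dictionary also ties each accuracy to its label, which avoids maintaining a separate list of names for the bar chart.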

4) Model Evaluation

Train & Evaluate Chosen Model:

Fit the selected model (Random Forest in this case) on the training dataset and evaluate the results.
In [69]:
Selected_classifier = RandomForestClassifier(random_state = 0)
Selected_classifier.fit(X_train, y_train)
# Predict the Test set results
y_pred = Selected_classifier.predict(X_test)
#Evaluate Model Results on Test Set:
from sklearn.metrics import precision_score,recall_score,f1_score
acc = accuracy_score(y_test, y_pred )
prec = precision_score(y_test, y_pred )
rec = recall_score(y_test, y_pred )
f1 = f1_score(y_test, y_pred )
results = pd.DataFrame([['Random Forest', acc, prec, rec, f1, ]],columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score',])
print (results)
           Model  Accuracy  Precision    Recall  F1 Score
0  Random Forest  0.896422   0.928281  0.859649  0.892648

k-Fold Cross-Validation for Random Forest:

In [70]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = Selected_classifier, X = X_train, y = y_train, cv = 10)
print("Random Forest Classifier Accuracy: %0.2f (+/- %0.2f)"  % (accuracies.mean(),  accuracies.std() * 2))
Random Forest Classifier Accuracy: 0.90 (+/- 0.01)

Therefore, our k-fold cross-validation results indicate an accuracy of roughly 89% to 91% (the mean plus or minus two standard deviations) on held-out folds of the resampled training data.

Evaluate with Confusion Matrix:

In [71]:
cm = confusion_matrix(y_test,y_pred)
print(cm) 
[[1484  106]
 [ 224 1372]]
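
As a worked check, the four metrics reported for the selected model can be recomputed by hand from this confusion matrix. In scikit-learn's layout the rows are actual classes and the columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]].

```python
# Confusion matrix from the cell above: rows are actual, columns are predicted.
tn, fp = 1484, 106
fn, tp = 224, 1372

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who churned
recall = tp / (tp + fn)                      # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

The values agree with the accuracy, precision, recall, and F1 score printed for the Random Forest model earlier in this section.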
In [72]:
plt.figure(figsize=(5,5))
sns.heatmap(data=cm, linewidths=.5,annot=True,square=True, cmap='Blues')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

We obtained a test accuracy of about 90%, which indicates a reasonably good model.

5) Model Improvement

Prepare the grid values for GridSearchCV for Random Forest Classifier:

In [73]:
# Candidate numbers of trees (n_estimators) for the random forest
grid_values = {'n_estimators': [5, 10,15,20,25,30,35,40,45,50,55,60],
# Two split criteria to compare: 'gini' and 'entropy'
               'criterion':['gini','entropy'], 
# Candidate min_samples_split values (floats are read as a fraction of the training samples)
               'min_samples_split': [1e-20, 5e-20, 1e-10, 5e-10, 1e-5, 5e-5, 1e-2, 5e-2],
              }
# Initialize GridSearchCV with the random forest classifier, 10-fold cross-validation, and accuracy as the metric
Grid_classifier_rf = GridSearchCV(Selected_classifier,grid_values,cv=10, scoring='accuracy')
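
It helps to translate the min_samples_split candidates into sample counts. scikit-learn documents that a float value means ceil(min_samples_split * n_samples) samples; the floor of 2 used below is my understanding of the implementation and should be read as an assumption. The helper effective_min_samples_split is hypothetical and exists only for this illustration.

```python
import math


def effective_min_samples_split(value, n_samples):
    """Translate a min_samples_split setting into a sample count.

    Assumes the documented rule for floats, ceil(value * n_samples),
    combined with an assumed lower bound of 2 samples.
    """
    if isinstance(value, int):
        return value
    return max(2, math.ceil(value * n_samples))


n_train = 12740  # rows in X_train
grid = [1e-20, 5e-20, 1e-10, 5e-10, 1e-5, 5e-5, 1e-2, 5e-2]
effective = {v: effective_min_samples_split(v, n_train) for v in grid}
for value, count in effective.items():
    print(value, "->", count)
```

Under that rule the six smallest candidates all behave like the default of 2, so only the last two values change the model. During cross-validation each fit sees fewer rows than n_train, which shifts the counts but not this pattern.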

Fit the Grid_classifier_rf with features and responses:

In [74]:
Grid_classifier_rf.fit(X_train,y_train)
Out[74]:
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=10, n_jobs=None,
                                              oob_score=False, random_state=0,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'min_samples_split': [1e-20, 5e-20, 1e-10, 5e-10,
                                               1e-05, 5e-05, 0.01, 0.05],
                         'n_estimators': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
                                          55, 60]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)

The best parameters given by GridSearchCV on Random Forest Model:

In [75]:
Grid_classifier_rf.best_params_
Out[75]:
{'criterion': 'gini', 'min_samples_split': 1e-20, 'n_estimators': 55}

The best accuracy score given by GridSearchCV on Random Forest Model:

In [76]:
Grid_classifier_rf.best_score_
Out[76]:
0.9097331240188383

The best estimator given by GridSearchCV :

In [77]:
Grid_classifier_rf.best_estimator_
Out[77]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=1e-20,
                       min_weight_fraction_leaf=0.0, n_estimators=55,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Build the Random Forest model again with the best parameters given by GridSearchCV (tuned model):

In [78]:
Tuned_classifier = RandomForestClassifier(criterion  = 'gini', min_samples_split = 1e-20, n_estimators =  55)

Fit the model with best parameters:

In [79]:
Tuned_classifier.fit(X_train,y_train)
Out[79]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=1e-20,
                       min_weight_fraction_leaf=0.0, n_estimators=55,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
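
Rebuilding the model by hand is not strictly needed. With refit=True (the default shown in the GridSearchCV output above), the search already holds a fitted copy of the best model in best_estimator_, including the random_state that the hand-built Tuned_classifier leaves unset. A minimal sketch on hypothetical data and a much smaller grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical data standing in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_estimator_ is refitted on all data passed to fit, using the
# winning parameters and the original random_state.
tuned = search.best_estimator_
first_predictions = tuned.predict(X_demo[:5])
print(search.best_params_)
print(first_predictions)
```

Using best_estimator_ directly also guarantees that the model being evaluated is the one the search actually scored.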

6) Future Predictions

Make predictions on test data and show first 60 predicted values:

In [80]:
y_pred = Tuned_classifier.predict(X_test)
y_pred[0:60]
Out[80]:
array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 0], dtype=int64)

Compare the true values and predicted values of Exited column for tuned model:

In [81]:
df_comp = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df_comp.head(20)  
Out[81]:
ActualPredicted
011
111
211
311
411
511
600
711
810
900
1000
1100
1200
1311
1400
1500
1611
1710
1800
1911

Accuracy and Confusion matrix for Tuned Random Forest Model:

In [82]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Confusion Matrix Tuned Random Forest:\n',confusion_matrix(y_test,y_pred))
tuned_accu = accuracy_score(y_test,y_pred)
print(' Accuracy Tuned Random Forest:\n',tuned_accu,'\n')
Confusion Matrix Tuned Random Forest:
 [[1489  101]
 [ 196 1400]]
 Accuracy Tuned Random Forest:
 0.9067796610169492 

7) Model Deployment

Persist the tuned model to disk with the joblib library so that it can be loaded later, for example by a server-side application, as part of an end-to-end machine learning workflow. The saved model can then be run on new customer data to predict each customer's probability of churning in the coming months.

In [ ]:
# pip install joblib 
# to install the package
import joblib
filename = 'final_model.model'
i = [Tuned_classifier]
joblib.dump(i,filename)
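
Loading the model back and asking for churn probabilities could look like the sketch below. It trains a small hypothetical model on generated data and writes to a temporary directory so that it runs on its own; in the notebook, joblib.load('final_model.model') would return the one-element list saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical small model standing in for Tuned_classifier.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_model.model")
    joblib.dump(model, path)        # save the fitted model
    restored = joblib.load(path)    # load it back, e.g. in another process

# Column 1 of predict_proba is the estimated probability of class 1 (churn).
churn_probability = restored.predict_proba(X_demo[:3])[:, 1]
print(churn_probability)
```

New data must have the same columns, in the same order and with the same encoding, as the data the model was trained on.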