Using a PyTorch neural network in Python to classify and predict bank customer churn

Original source: Tuoduan Data Tribe Official Account


Classification problems are a category of machine learning problems in which, given a set of features, the task is to predict a discrete value. Common examples are predicting whether a tumor is cancerous or whether a student is likely to pass an exam. In this article, given certain characteristics of bank customers, we will predict whether a customer is likely to leave the bank after 6 months. Customers leaving an organization is also called customer churn, so our task is to predict customer churn from various customer characteristics.

$ pip install torch

Data set

Let's import the required libraries and data sets into our Python application:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We can use the read_csv() method of the pandas library to import the CSV file containing our data set.

dataset = pd.read_csv(r'E:Datasets\customer_data.csv')

Let's print the shape of the data set:

dataset.shape

(10000, 14)

The output shows that the data set has 10,000 records and 14 columns. We can use the head() method of the data frame to print the first five rows of the data set.

dataset.head()

You can see 14 columns in our data set. Based on the first 13 columns, our task is to predict the value of the 14th column, that is, Exited.


Exploratory data analysis

Let's do some exploratory data analysis on the data set. We will first plot the percentage of customers who actually left the bank after 6 months, using a pie chart. Let's first increase the default figure size:

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size

The following script draws a pie chart of the Exited column.

dataset.Exited.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=['skyblue', 'orange'], explode=(0.05, 0.05))


The output shows that 20% of the customers in our data set left the bank. Here 1 means that the customer left the bank, and 0 means that the customer did not. Let's plot the number of customers from each geographic location in the data set:

The output shows that almost half of the customers are from France, while Spain and Germany each account for about 25% of the customers.

Now, let's plot the number of customers from each geographic location together with the churn information. We can use the seaborn library to do this.

The output shows that although the total number of French customers is twice that of the Spanish or the German customers, the number of customers leaving the bank is the same for French and German customers. Similarly, the total numbers of German and Spanish customers are the same, but twice as many German customers as Spanish customers left the bank, which indicates that German customers are more likely to leave the bank after 6 months.
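As a rough numerical companion to the plots described above, the sketch below counts customers per country and computes the share of leavers within each country. The eight rows are invented purely for illustration; only the column names Geography and Exited come from the data set.

```python
import pandas as pd

# Toy stand-in for the real data set: these rows are made up for illustration.
toy = pd.DataFrame({
    "Geography": ["France", "France", "France", "France",
                  "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [0, 0, 0, 1, 0, 1, 0, 0],
})

# Number of customers per country (what a count plot draws as bars).
customers_per_country = toy["Geography"].value_counts()

# Share of customers who stayed (0) or left (1) within each country.
churn_share = pd.crosstab(toy["Geography"], toy["Exited"], normalize="index")

print(customers_per_country)
print(churn_share)
```

Comparing the column for class 1 across rows separates the raw number of leavers from the churn rate, which is the distinction the paragraph above relies on.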


Data preprocessing

Before training the PyTorch model, we need to preprocess the data. If you look at the data set, you will see that it has two types of columns: numeric columns and categorical columns. The numeric columns contain numeric information, such as CreditScore, Balance, Age, and so on. Similarly, Geography and Gender are categorical columns because they contain categorical information, such as the location and gender of the customer. A few columns can be treated as either numeric or categorical. For example, the HasCrCard column takes the values 1 or 0, but these values only record whether the customer has a credit card, so the column is better treated as categorical.

Let's print all the columns in the data set again and decide which should be treated as numeric and which as categorical. The columns attribute of the data frame displays all column names:

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')

Of these columns, we will not use RowNumber, CustomerId, and Surname, because their values are arbitrary and have nothing to do with the output. For example, a customer's surname has no effect on whether the customer leaves the bank. Among the remaining columns, Geography, Gender, HasCrCard, and IsActiveMember can be treated as categorical columns; their names are kept in a list, categorical_columns. All the other columns, with the exception of Exited, can be treated as numeric columns.

numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

Finally, the output (the values of the Exited column) is stored in its own list.

We have now created lists of the categorical, numeric, and output columns. However, the categorical columns do not yet have the category data type. You can check the types of all the columns in the data set through the dtypes attribute of the data frame:

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

You can see that the type of the Geography and Gender columns is object, while the type of the HasCrCard and IsActiveMember columns is int64. We need to convert the type of the categorical columns to category. We can use the astype() function to do this.

Now, if you print the types of the columns in the data set again, you will see the following results:

RowNumber             int64
CustomerId            int64
Surname              object
CreditScore           int64
Geography          category
Gender             category
Age                   int64
Tenure                int64
Balance             float64
NumOfProducts         int64
HasCrCard          category
IsActiveMember     category
EstimatedSalary     float64
Exited                int64
dtype: object

Now let's check all the categories in the Geography column:

Index(['France', 'Germany', 'Spain'], dtype='object')

When you change the data type of a column to category, each category in the column is assigned a unique code. For example, let's print the first five rows of the Geography column, followed by the code values for those rows:

0    France
1     Spain
2    France
3    France
4     Spain
Name: Geography, dtype: category
Categories (3, object): [France, Germany, Spain]

The codes of the values in the first five rows of the Geography column are:

0    0
1    2
2    0
3    0
4    2
dtype: int8

The output shows that France has been coded as 0 and Spain has been coded as 2.
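A minimal sketch of this conversion, using a hypothetical five-row series instead of the real Geography column, might look as follows. It relies only on the pandas astype() method and the cat accessor.

```python
import pandas as pd

# Hypothetical five-row column, used only to illustrate the encoding.
geo = pd.Series(["France", "Spain", "France", "Germany", "Spain"])

# Converting to the category dtype assigns each distinct value an integer code.
geo = geo.astype("category")

# Categories of string values are stored in sorted order.
print(list(geo.cat.categories))  # ['France', 'Germany', 'Spain']

# The code of each row is the position of its value in the category list.
print(geo.cat.codes.tolist())    # [0, 2, 0, 1, 2]
```

Because the categories are sorted, Germany receives code 1 even though it appears after Spain in the data.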

The basic reason for separating the categorical columns from the numeric columns is that the values in the numeric columns can be fed directly into the neural network, whereas the values of the categorical columns must first be converted to a numeric type. Encoding the values of the categorical columns as integer codes partially solves this conversion task.

Since we will use PyTorch for model training, we need to convert the categorical and numeric columns into tensors. First, let's convert the categorical columns to a tensor. In PyTorch, you can create tensors from numpy arrays. We will first convert the data in the four categorical columns into numpy arrays, and then stack all the columns horizontally, as shown in the following script:

geo = dataset['Geography'] ...

The script above prints the first ten records of the categorical columns, stacked horizontally. The output is as follows:

array([[0, 0, 1, 1],
       [2, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [2, 0, 1, 1],
       [2, 1, 1, 0],
       [0, 1, 1, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 1, 1, 1]], dtype=int8)

Now, to create a tensor from the numpy array above, you just need to pass the array to the torch module's tensor() function.

tensor([[0, 0, 1, 1],
        [2, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [2, 0, 1, 1],
        [2, 1, 1, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 1, 1, 1]])

In the output, you can see that the numpy array of categorical data has now been converted to a tensor object. Similarly, we can convert the numeric columns to a tensor:

numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1) ...

tensor([[6.1900e+02, 4.2000e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00, 1.0135e+05],
        [6.0800e+02, 4.1000e+01, 1.0000e+00, 8.3808e+04, 1.0000e+00, 1.1254e+05],
        [5.0200e+02, 4.2000e+01, 8.0000e+00, 1.5966e+05, 3.0000e+00, 1.1393e+05],
        [6.9900e+02, 3.9000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 9.3827e+04],
        [8.5000e+02, 4.3000e+01, 2.0000e+00, 1.2551e+05, 1.0000e+00, 7.9084e+04]])

In the output, you can see the first five rows, which contain the values of the six numeric columns in our data set. The last step is to convert the output numpy array to a tensor object. Output:

tensor([1, 0, 1, 0, 0])

Now, let us print the shapes of the categorical data, the numerical data, and the corresponding output:

torch.Size([10000, 4])
torch.Size([10000, 6])
torch.Size([10000])
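The three conversions can be sketched end to end on a small invented data frame. The column names follow the article, but the three rows, the reduced column lists, and the choice of torch.tensor() with explicit dtypes are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import torch

# Invented three-row frame; the values are not taken from the real data set.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "CreditScore": [619, 608, 502],
    "Age": [42, 41, 42],
    "Exited": [1, 0, 1],
})
categorical_columns = ["Geography", "Gender"]
numerical_columns = ["CreditScore", "Age"]

for col in categorical_columns:
    toy[col] = toy[col].astype("category")

# One column of integer codes per categorical feature, stacked side by side.
categorical_data = np.stack([toy[col].cat.codes.values for col in categorical_columns], 1)
categorical_data = torch.tensor(categorical_data, dtype=torch.int64)

# Numeric features become a float tensor.
numerical_data = np.stack([toy[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)

# Class labels as a flat int64 tensor, the label type nn.CrossEntropyLoss expects.
outputs = torch.tensor(toy["Exited"].values, dtype=torch.int64)

print(categorical_data.shape, numerical_data.shape, outputs.shape)
```

The categorical codes are kept as int64 because embedding layers look rows up by integer index, while the numeric features are cast to float so they can pass through linear layers.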

One very important step remains before training the model. We have converted the categorical columns into numeric values, where each unique value is represented by a single integer. For example, in the Geography column, France is represented by 0 and Germany by 1. We could use these values to train our model. However, a better approach is to represent each value of a categorical column as an N-dimensional vector rather than a single integer.

We need to define the vector size for every categorical column. There are no strict rules regarding the number of dimensions. A good rule of thumb for the embedding size of a column is to divide the number of unique values in the column by 2 (but not to exceed 50). For example, for the Geography column, the number of unique values is 3, so the corresponding embedding size is 3/2 = 1.5, rounded up to 2. The following script creates a list of tuples, each containing the number of unique values and the embedding size of one categorical column:

categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns] ...

[(3, 2), (2, 1), (2, 1), (2, 1)]
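The rule of thumb can be written out as a short sketch. The unique-value counts are hard-coded from the output above, and the name categorical_embedding_sizes is a hypothetical choice for this example. The second half shows what an embedding of size 2 means for the Geography codes.

```python
import torch
import torch.nn as nn

# Unique-value counts per categorical column: Geography has 3, the other three have 2.
categorical_column_sizes = [3, 2, 2, 2]

# Rule of thumb from the text: half the number of categories, rounded up, capped at 50.
categorical_embedding_sizes = [
    (size, min(50, (size + 1) // 2)) for size in categorical_column_sizes
]
print(categorical_embedding_sizes)  # [(3, 2), (2, 1), (2, 1), (2, 1)]

# An embedding layer for Geography maps each integer code to a learned 2-dimensional vector.
geography_embedding = nn.Embedding(3, 2)
codes = torch.tensor([0, 2, 1])  # France, Spain, Germany
vectors = geography_embedding(codes)
print(vectors.shape)  # torch.Size([3, 2])
```

The vectors start out random and are adjusted during training, so categories that behave alike can end up with similar vectors, which a single integer code cannot express.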

A supervised deep learning model, such as the one we develop in this article, is trained on training data and its performance is evaluated on a test data set. Therefore, we need to divide the data set into a training set and a test set, as shown in the following script:

total_records = 10000 ...

Our data set has 10,000 records, of which 80% (i.e., 8,000 records) will be used to train the model, and the remaining 20% will be used to evaluate its performance. Note that in the script above, the categorical data, the numerical data, and the outputs have each been divided into a training set and a test set. To verify that we have divided the data correctly, let's print the lengths of the resulting sets:

print(len(categorical_train_data))
print(len(numerical_train_data))
print(len(train_outputs))
print(len(categorical_test_data))
print(len(numerical_test_data))
print(len(test_outputs))

8000
8000
8000
2000
2000
2000
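One simple way to produce such a split is to slice the tensors, as sketched below. The tensors here are random stand-ins with the shapes reported earlier, and holding out the last 20% of rows without shuffling is an assumption of this sketch, not necessarily what the elided script does.

```python
import torch

# Random stand-in tensors with the same shapes as the article's data.
total_records = 10000
categorical_data = torch.randint(0, 2, (total_records, 4))
numerical_data = torch.rand(total_records, 6)
outputs = torch.randint(0, 2, (total_records,))

# Hold out the last 20% of the rows for testing (a simple, unshuffled split).
test_records = int(total_records * 0.2)
train_records = total_records - test_records

categorical_train_data = categorical_data[:train_records]
categorical_test_data = categorical_data[train_records:]
numerical_train_data = numerical_data[:train_records]
numerical_test_data = numerical_data[train_records:]
train_outputs = outputs[:train_records]
test_outputs = outputs[train_records:]

print(len(categorical_train_data), len(categorical_test_data))  # 8000 2000
```

Slicing all three tensors at the same index keeps every row's categorical features, numeric features, and label aligned.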

Create a predictive model

We have divided the data into a training set and a test set; now it is time to define the model. For this, we can define a class called Model, which will be used to train the model. Look at the following script:

class Model(nn.Module):
    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)
        ...

    def forward(self, x_categorical, x_numerical):
        ...
        return x

Next, to find the size of the input layer, the total embedding dimension of the categorical columns and the number of numeric columns are added together and stored in a variable. After that, a for loop iterates over the hidden layer sizes and appends the corresponding layers to a list. The added layers are:

  • Linear: used to calculate the dot product between the input and the weight matrix
  • ReLU: used as the activation function
  • BatchNorm1d: used to apply batch normalization
  • Dropout: used to avoid overfitting

After the for loop, the output layer is appended to the list of layers. Since we want all the layers in the neural network to be executed in order, we pass the list of layers to the nn.Sequential class.

Next, in the forward method, both the categorical columns and the numeric columns are passed as input. The embedding of the categorical columns takes place in the following lines.

embeddings = [] ...

Batch normalization of the numeric columns is applied by the following line:

x_numerical = self.batch_norm_num(x_numerical)

Finally, the embedded categorical columns x and the numeric columns x_numerical are concatenated and passed to the sequential layers.
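Putting the description together, a complete class could look like the sketch below. The constructor signature and the first three attributes come from the fragment shown earlier; everything else, including the names input_size, all_layers, self.layers, and x_categorical and the exact order of operations, is a reconstruction from the prose and should be read as an assumption.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding table per categorical column: (number of categories, vector size).
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        # Input width = total embedding dimensions + number of numeric columns.
        num_categorical_cols = sum(nf for ni, nf in embedding_size)
        input_size = num_categorical_cols + num_numerical_cols

        # Hidden blocks: Linear -> ReLU -> BatchNorm1d -> Dropout, one block per entry in layers.
        all_layers = []
        for hidden_size in layers:
            all_layers.append(nn.Linear(input_size, hidden_size))
            all_layers.append(nn.ReLU())
            all_layers.append(nn.BatchNorm1d(hidden_size))
            all_layers.append(nn.Dropout(p))
            input_size = hidden_size

        # Output layer: one raw score per class.
        all_layers.append(nn.Linear(input_size, output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding vector of each categorical column and join them.
        embeddings = [emb(x_categorical[:, i]) for i, emb in enumerate(self.all_embeddings)]
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Normalize the numeric columns, then append them to the embedded columns.
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        return self.layers(x)


# Small smoke test on made-up inputs: 4 rows, 4 categorical codes and 6 numeric values each.
model = Model([(3, 2), (2, 1), (2, 1), (2, 1)], 6, 2, [200, 100, 50], p=0.4)
x_cat = torch.tensor([[0, 0, 1, 1], [2, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]])
x_num = torch.rand(4, 6)
logits = model(x_cat, x_num)
print(model.layers[0].in_features, logits.shape)  # 11 torch.Size([4, 2])
```

Wrapping the layers in nn.Sequential lets forward run them all in order with a single call.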

Training model

To train the model, we must first create an object of the Model class defined in the previous section.

The constructor receives the embedding sizes of the categorical columns, the number of numeric columns, the output size (2 in our example), and the number of neurons in each hidden layer. We have three hidden layers with 200, 100, and 50 neurons, respectively.
Let's print the model:

print(model)

Model( (all_embeddings): ModuleList( ... ) )

As you can see, in the first linear layer, the value of in_features is 11, because we have 6 numeric columns and the embedding dimensions of the categorical columns sum to 5, so 6 + 5 = 11. Similarly, in the last layer, the value of out_features is 2, because we have only 2 possible outputs.

Before actually training the model, we need to define the loss function and the optimizer that will be used to train the model. The following script defines the loss function and optimizer:

loss_function = nn.CrossEntropyLoss()

Now, we train the model. The following script trains the model:

epochs = 300
aggregated_losses = []
for i in range(epochs):
    ...
print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

The number of epochs is set to 300, which means that the complete training set will be used 300 times to train the model.

During each iteration of the loop, the loss is calculated using the loss function. The loss from each iteration is appended to the aggregated_losses list.

The output of the above script is as follows:

epoch:   1 loss: 0.71847951
epoch:  26 loss: 0.57145703
epoch:  51 loss: 0.48110831
epoch:  76 loss: 0.42529839
epoch: 101 loss: 0.39972275
epoch: 126 loss: 0.37837571
epoch: 151 loss: 0.37133673
epoch: 176 loss: 0.36773482
epoch: 201 loss: 0.36305946
epoch: 226 loss: 0.36079505
epoch: 251 loss: 0.35350436
epoch: 276 loss: 0.35540250
epoch: 300 loss: 0.3465710580
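For readers who want to see a whole training loop of this kind, here is a self-contained sketch on toy data. The tiny network, the synthetic features and labels, the Adam optimizer, and its learning rate are all assumptions for illustration; only the cross-entropy loss and the names loss_function, epochs, aggregated_losses, and single_loss come from the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: 200 rows, 6 numeric features, and a label that depends on the first feature.
features = torch.randn(200, 6)
labels = (features[:, 0] > 0).long()

# Small stand-in network with two output scores, one per class.
model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 2))

loss_function = nn.CrossEntropyLoss()
# Assumed optimizer and learning rate; the article's settings are not shown.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 100
aggregated_losses = []
for i in range(1, epochs + 1):
    y_pred = model(features)                      # forward pass on the full training set
    single_loss = loss_function(y_pred, labels)   # cross-entropy between scores and labels
    aggregated_losses.append(single_loss.item())  # keep a plain float for plotting

    optimizer.zero_grad()   # clear gradients left over from the previous epoch
    single_loss.backward()  # backpropagate the loss
    optimizer.step()        # update the weights

    if i % 25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
```

Storing single_loss.item() rather than the tensor itself keeps the list free of autograd history and makes it directly plottable.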

The following script plots the loss for each epoch:

plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');

The output shows that the loss decreases rapidly at first. After about 250 epochs, it hardly decreases any further.

Make predictions

The last step is to make predictions on the test data. To do this, we only need to pass categorical_test_data and numerical_test_data to the model. The returned values can then be compared with the actual test outputs. The following script makes predictions for the test set and prints the cross-entropy loss on the test data.

with torch.no_grad(): ...

Loss: 0.36855841

The loss on the test set is 0.3685, slightly higher than the 0.3465 obtained on the training set, which indicates that our model is overfitting slightly. Since we specified that the output layer contains 2 neurons, each prediction contains 2 values. For example, the first 5 predictions are as follows:

print(y_val[:5])

tensor([[ 1.2045, -1.3857],
        [ 1.3911, -1.5957],
        [ 1.2781, -1.3598],
        [ 0.6261, -0.5429],
        [ 2.5430, -1.9991]])

The idea behind these predictions is that if the actual output is 0, the value at index 0 should be greater than the value at index 1, and vice versa. We can use the following script to retrieve the index of the largest value in each prediction:

y_val = np.argmax(y_val, axis=1)

Now let's print the first five values of the y_val list again:

print(y_val[:5])

tensor([0, 0, 0, 0, 0])

Since, for the first five records in the original predictions, the value at index 0 is greater than the value at index 1, the processed output contains 0 for all five rows.
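The same step can be reproduced with the five prediction pairs printed above. This sketch uses torch.argmax, which is an alternative to the np.argmax call in the article, and adds a softmax only to show how the raw scores relate to class probabilities; the names predicted_classes and probabilities are hypothetical.

```python
import torch

# The five raw prediction pairs printed above, copied by hand.
y_val = torch.tensor([[1.2045, -1.3857],
                      [1.3911, -1.5957],
                      [1.2781, -1.3598],
                      [0.6261, -0.5429],
                      [2.5430, -1.9991]])

# Index of the larger score in each row is the predicted class.
predicted_classes = torch.argmax(y_val, dim=1)
print(predicted_classes)  # tensor([0, 0, 0, 0, 0])

# Softmax rescales each row of scores into probabilities that sum to 1.
probabilities = torch.softmax(y_val, dim=1)
print(probabilities[:, 1])  # estimated probability of churn for each of the five customers
```

Taking the argmax of the raw scores gives the same class as taking the argmax of the probabilities, because softmax preserves the ordering within each row.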

Finally, we can use the confusion_matrix, classification_report, and accuracy_score functions from the sklearn.metrics module to obtain the confusion matrix along with the accuracy, precision, and recall values.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs, y_val))
print(classification_report(test_outputs, y_val))
print(accuracy_score(test_outputs, y_val))

[[1527   83]
 [ 224  166]]
              precision    recall  f1-score   support

           0       0.87      0.95      0.91      1610
           1       0.67      0.43      0.52       390

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.77      0.69      0.71      2000
weighted avg       0.83      0.85      0.83      2000

0.8465

The output shows that our model achieves an accuracy of 84.65%, which is quite good considering that we chose all the hyperparameters of the neural network arbitrarily. I suggest you try changing the model parameters, such as the train/test ratio and the number and size of the hidden layers, to see if you can get better results.
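The reported figures can be re-derived by hand from the confusion matrix, which makes clear what each metric measures. The sketch below hard-codes the four counts from the output above; the variable names are chosen for this example.

```python
# Confusion matrix from the output above: rows are actual classes, columns are predicted classes.
tn, fp = 1527, 83   # customers who stayed: predicted to stay / predicted to leave
fn, tp = 224, 166   # customers who left:   predicted to stay / predicted to leave

total = tn + fp + fn + tp
accuracy = (tn + tp) / total              # share of all predictions that are correct
precision_churn = tp / (tp + fp)          # of the predicted leavers, how many really left
recall_churn = tp / (tp + fn)             # of the real leavers, how many were found
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f'accuracy:  {accuracy:.4f}')         # 0.8465
print(f'precision: {precision_churn:.2f}')  # 0.67
print(f'recall:    {recall_churn:.2f}')     # 0.43
print(f'f1-score:  {f1_churn:.2f}')         # 0.52
```

The recall of 0.43 for class 1 shows that the model misses more than half of the customers who actually leave, something the overall accuracy hides because only about 20% of customers churn.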


PyTorch is a commonly used deep learning library developed by Facebook that can be used for various tasks such as classification, regression, and clustering. This article describes how to use the PyTorch library to classify tabular data.
