## Original link: tecdat.cn/?p=8522

## Original source: Tuoduan Data Tribe Official Account

Classification problems belong to the category of machine learning problems where, given a set of features, the task is to predict a discrete value. Common examples of classification problems are predicting whether a tumor is cancerous or whether a student is likely to pass an exam. In this article, given certain characteristics of bank customers, we will predict whether a customer is likely to leave the bank after 6 months. The phenomenon of customers leaving an organization is also called customer churn, so our task is to predict customer churn based on various customer characteristics.

```
$ pip install torch
```

**Data set**

Let's import the required libraries and data sets into our Python application:

```python
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

We can use the `read_csv()` method of the pandas library to import the CSV file containing our data set:

```python
dataset = pd.read_csv(r'E:\Datasets\customer_data.csv')
```

Let's output the shape of the data set:

```python
dataset.shape
```

**Output:**

```
(10000, 14)
```

The output shows that the data set has 10,000 records and 14 columns. We can use the `head()` method to print the first five rows of the data set:

```python
dataset.head()
```

**Output:**

You can see 14 columns in our data set. Based on the first 13 columns, our task is to predict the value of the 14th column, `Exited`.

**Exploratory data analysis**

Let's do some exploratory data analysis on the data set. We will first plot the percentage of customers who actually left the bank after 6 months, visualized as a pie chart. Let's first increase the default plot size:

```python
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
```

The following script draws the pie chart for the `Exited` column:

```python
dataset.Exited.value_counts().plot(kind='pie', autopct='%1.0f%%',
                                   colors=['skyblue', 'orange'],
                                   explode=(0.05, 0.05))
```

**Output:**

The output shows that 20% of the customers in our data set left the bank. Here 1 represents a customer who left the bank and 0 a customer who did not. Let's plot the number of customers from each geographic location in the data set:

The output shows that almost half of the customers are from France, while Spain and Germany each account for roughly 25%.

Now, let's plot the number of customers from each unique geographic location together with the customer churn information. We can use the seaborn library for this.
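The snippet itself is not shown in the excerpt; seaborn's `countplot()` is the usual choice for this kind of grouped count plot. A toy sketch (not the author's verbatim code, and with a tiny stand-in DataFrame instead of the bank data):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import pandas as pd
import seaborn as sns

# Toy stand-in for the bank dataset
df = pd.DataFrame({'Geography': ['France', 'France', 'Spain', 'Germany'],
                   'Exited':    [0, 1, 0, 1]})

# One bar per (Geography, Exited) combination
ax = sns.countplot(x='Geography', hue='Exited', data=df)
```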

The output shows that although the total number of French customers is twice the number of Spanish or German customers, the number of customers who left the bank is the same for France and Germany. Similarly, the total numbers of German and Spanish customers are the same, but the number of German customers who left the bank is twice that of Spanish customers, which shows that German customers are more likely to leave the bank after 6 months.

**Data preprocessing**

Before training the PyTorch model, we need to preprocess the data. If you look at the data set, you will see that it has two types of columns: numerical and categorical. The numerical columns contain numerical information, such as the credit score or account balance, while the categorical columns contain discrete values, such as geography or gender.

Let's print all the columns of the data set again, via the `columns` attribute, and decide which should be treated as numerical and which as categorical:

```
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')
```

Of these columns, `RowNumber`, `CustomerId`, and `Surname` will not be used, since their values are unique for every record and carry no predictive information. The remaining feature columns are divided into categorical and numerical lists:

```python
categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
numerical_columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
```

Finally, the output (`Exited`) column is stored separately from the feature columns.

We have created lists of the categorical, numerical, and output columns. However, at the moment the categorical columns do not yet have the `category` data type. You can check the types of all the columns in the data set with the `dtypes` attribute:

**Output:**

```
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
```

You can see that `Geography` and `Gender` have the `object` type, while `HasCrCard` and `IsActiveMember` are plain `int64`. We need to convert the four categorical columns to the `category` type, which can be done with the `astype('category')` method.

Now, if you print the types of the columns in the data set again, you will see the following results:

**Output:**

```
RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography         category
Gender            category
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard         category
IsActiveMember    category
EstimatedSalary    float64
Exited               int64
dtype: object
```

Now let's check the unique categories of the `Geography` column via `dataset['Geography'].cat.categories`:

```
Index(['France', 'Germany', 'Spain'], dtype='object')
```

When you change the data type of a column to `category`, each category in the column is assigned a unique code. For example, let's print the first five rows of the `Geography` column with `dataset['Geography'].head()`:

**Output:**

```
0    France
1     Spain
2    France
3    France
4     Spain
Name: Geography, dtype: category
Categories (3, object): [France, Germany, Spain]
```

The following script prints the codes of the values in the first five rows of the column, using the `cat.codes` attribute:

**Output:**

```
0    0
1    2
2    0
3    0
4    2
dtype: int8
```

The output shows that France has been coded as 0, Germany as 1 (not visible in these first rows), and Spain as 2, following the alphabetical order of the categories.

The basic purpose of separating the categorical columns from the numerical columns is that values in the numerical columns can be fed directly into a neural network, whereas the values of the categorical columns must first be converted to a numeric type. Coding the values in the categorical columns partially solves this conversion task.
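The encoding mechanics described above can be shown end to end on a toy column (an illustration on made-up data, not the bank data set):

```python
import pandas as pd

# Converting a string column to the `category` dtype assigns each unique
# value an integer code, in alphabetical order of the categories.
df = pd.DataFrame({'Geography': ['France', 'Spain', 'France', 'Germany']})
df['Geography'] = df['Geography'].astype('category')

print(list(df['Geography'].cat.categories))   # ['France', 'Germany', 'Spain']
print(df['Geography'].cat.codes.tolist())     # [0, 2, 0, 1]
```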

Since we will use PyTorch for model training, we need to convert the categorical and numerical columns into tensors. First, let's convert the categorical columns. In PyTorch, tensors can be created from numpy arrays. We will first convert the data in the four categorical columns into numpy arrays and then stack all the columns horizontally, as shown in the following script:

```python
geo = dataset['Geography'].cat.codes.values
...
```

The above script outputs the first ten records of the stacked categorical columns:

**Output:**

```
array([[0, 0, 1, 1],
       [2, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [2, 0, 1, 1],
       [2, 1, 1, 0],
       [0, 1, 1, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 1, 1, 1]], dtype=int8)
```
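The stacking step can be sketched on toy arrays (hypothetical values, just to show the shape semantics of `np.stack` with `axis=1`):

```python
import numpy as np

# Two toy "columns" of category codes, stacked side by side into a 2-D array:
# one row per record, one column per feature.
geo = np.array([0, 2, 0], dtype=np.int8)
gen = np.array([1, 0, 1], dtype=np.int8)
categorical_data = np.stack([geo, gen], axis=1)

print(categorical_data.shape)  # (3, 2)
```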

Now, to create a tensor from the above numpy array, you just need to pass the array to the `tensor` function of the `torch` module:

**Output:**

```
tensor([[0, 0, 1, 1],
        [2, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [2, 0, 1, 1],
        [2, 1, 1, 0],
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 1, 1, 1]])
```

In the output, you can see that the numpy array of categorical data has now been converted into a `tensor` object.
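The numpy-to-tensor conversion can be sketched as follows, on toy values, with the dtypes implied by the printouts in this section (integer codes for categorical data, floats for numerical data):

```python
import numpy as np
import torch

# Categorical codes: small int8 arrays, widened to int64 for embedding lookups
cat_np = np.array([[0, 0, 1, 1],
                   [2, 0, 0, 1]], dtype=np.int8)
categorical_data = torch.tensor(cat_np, dtype=torch.int64)

# Numerical features: converted to 32-bit floats for the network
num_np = np.array([[619.0, 42.0],
                   [608.0, 41.0]])
numerical_data = torch.tensor(num_np, dtype=torch.float)

print(categorical_data.dtype, numerical_data.dtype)  # torch.int64 torch.float32
```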

```python
numerical_data = np.stack([dataset[col].values for col in numerical_columns], 1)
...
```

**Output:**

```
tensor([[6.1900e+02, 4.2000e+01, 2.0000e+00, 0.0000e+00, 1.0000e+00, 1.0135e+05],
        [6.0800e+02, 4.1000e+01, 1.0000e+00, 8.3808e+04, 1.0000e+00, 1.1254e+05],
        [5.0200e+02, 4.2000e+01, 8.0000e+00, 1.5966e+05, 3.0000e+00, 1.1393e+05],
        [6.9900e+02, 3.9000e+01, 1.0000e+00, 0.0000e+00, 2.0000e+00, 9.3827e+04],
        [8.5000e+02, 4.3000e+01, 2.0000e+00, 1.2551e+05, 1.0000e+00, 7.9084e+04]])
```

In the output, you can see the first five rows, which contain the values of the six numerical columns of our data set. The last step is to convert the output labels into a tensor as well:

**Output:**

```
tensor([1, 0, 1, 0, 0])
```

Now, let's print the shapes of the categorical data, the numerical data, and the corresponding outputs:

**Output:**

```
torch.Size([10000, 4])
torch.Size([10000, 6])
torch.Size([10000])
```

Before training the model, there is one more very important step. We converted the categorical columns to numeric values, where each unique value is represented by a single integer. For example, in the `Geography` column, France is represented by 0 and Spain by 2. Plain integer codes, however, impose an artificial ordering on the categories. A better approach is to represent each category as an N-dimensional vector, called an embedding, whose values are learned during training.

We need to define the embedding size (the vector dimension) for each categorical column. There are no strict rules regarding the number of dimensions, but a good rule of thumb is half the number of unique values in the column, rounded up, capped at 50. For example, for the `Geography` column with its 3 unique values, the embedding size is 2. The following script creates a `(number of categories, embedding size)` tuple for each categorical column:

```python
categorical_column_sizes = [len(dataset[column].cat.categories) for column in categorical_columns]
...
```

**Output:**

```
[(3, 2), (2, 1), (2, 1), (2, 1)]
```
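The rule of thumb above can be written as a small helper (a sketch; the function name is hypothetical, but the formula reproduces the output shown above):

```python
# Embedding size: half the number of categories (rounded up), capped at 50.
def embedding_size(n_categories, cap=50):
    return min(cap, (n_categories + 1) // 2)

# Geography has 3 categories; Gender, HasCrCard and IsActiveMember have 2 each.
sizes = [(n, embedding_size(n)) for n in (3, 2, 2, 2)]
print(sizes)  # [(3, 2), (2, 1), (2, 1), (2, 1)]
```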

A supervised deep learning model, such as the one we develop in this article, is trained on training data and evaluated on a held-out test data set. Therefore, we need to divide the data set into a training set and a test set, as shown in the following script:

```python
total_records = 10000
...
```
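The elided part presumably slices each tensor at the 80% mark. A toy-sized sketch of that pattern (variable names assumed, not the author's verbatim code):

```python
import numpy as np

total_records = 10                        # toy size; the article uses 10000
test_records = int(total_records * .2)    # 20% held out for testing

data = np.arange(total_records)           # stand-in for one of the tensors
train_data = data[:total_records - test_records]
test_data = data[total_records - test_records:]

print(len(train_data), len(test_data))    # 8 2
```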

There are 10,000 records in our data set, of which 80% (8,000 records) are used to train the model, and the remaining 20% are used to evaluate its performance. Note that in the above script the categorical data, the numerical data, and the outputs have each been split into a training and a test set. To verify that we have divided the data correctly:

```python
print(len(categorical_train_data))
print(len(numerical_train_data))
print(len(train_outputs))

print(len(categorical_test_data))
print(len(numerical_test_data))
print(len(test_outputs))
```

**Output:**

```
8000
8000
8000
2000
2000
2000
```


**Create a predictive model**

We have divided the data into training and test sets; now it is time to define the model. For this, we can define a class called `Model` that inherits from PyTorch's `nn.Module` class:

```python
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)
        ...
```

Next, to find the size of the input layer, the total number of embedding dimensions of the categorical columns is added to the number of numerical columns. The hidden layers are then built from four types of modules:

- Linear: Used to calculate the dot product between the input and the weight matrix
- ReLU: Used as the activation function
- BatchNorm1d: Used to apply batch normalization to numeric columns
- Dropout: Used to avoid overfitting

Finally, a linear output layer is appended after the hidden layers.

Next, in the `forward` method, the categorical columns are passed through their embedding layers:

```python
embeddings = []
...
```

Batch normalization is applied to the numerical columns with the following script:

```python
x_numerical = self.batch_norm_num(x_numerical)
```

Finally, the embedded categorical columns and the numerical columns are concatenated and passed through the sequential layers to produce the output.
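Putting the fragments in this section together, the complete class plausibly looks as follows (a reconstruction from the pieces quoted above, not the author's verbatim code):

```python
import torch
import torch.nn as nn

class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding layer per categorical column: (num_categories, embedding_dim)
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        # Input size = total embedding dimensions + number of numerical columns
        num_categorical_cols = sum(nf for ni, nf in embedding_size)
        input_size = num_categorical_cols + num_numerical_cols

        all_layers = []
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        # Output layer appended after the hidden layers
        all_layers.append(nn.Linear(layers[-1], output_size))
        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding of each categorical column, then concatenate
        embeddings = [e(x_categorical[:, i]) for i, e in enumerate(self.all_embeddings)]
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Batch-normalize the numerical columns and concatenate them as well
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        return self.layers(x)
```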

**Training model**

To train the model, first we must create an instance of the `Model` class we defined above.

You can see that we pass the embedding sizes of the categorical columns, the number of numerical columns, the output size (2 in our case, since a customer either leaves or stays), and the hidden-layer sizes. Here we use three hidden layers with 200, 100, and 50 neurons respectively.

Let's print the model and inspect it:

```python
print(model)
```

**Output:**

```
Model(
  (all_embeddings): ModuleList(
    ...
  )
)
```

As you can see, the first linear layer has 11 input features, since we have 6 numerical columns and the total embedding dimension of the categorical columns is 5 (2 + 1 + 1 + 1).

Before actually training the model, we need to define the loss function and the optimizer that will be used to train the model. The following script defines the loss function and optimizer:

```python
loss_function = nn.CrossEntropyLoss()
# The optimizer is not shown in the excerpt; Adam with a small learning
# rate is a reasonable assumption for this setup.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```

Now, we train the model. The following script trains the model:

```python
epochs = 300
aggregated_losses = []

for i in range(epochs):
    i += 1
    # Forward pass, loss, backward pass, parameter update. The loop body is
    # elided in the excerpt; this is the standard pattern implied by the
    # surrounding text and the printed output below.
    y_pred = model(categorical_train_data, numerical_train_data)
    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss)

    if i % 25 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
```

The number of epochs is set to 300, which means that the complete data set will be used 300 times to train the model.

The output of the above script is as follows:

```
epoch:   1 loss: 0.71847951
epoch:  26 loss: 0.57145703
epoch:  51 loss: 0.48110831
epoch:  76 loss: 0.42529839
epoch: 101 loss: 0.39972275
epoch: 126 loss: 0.37837571
epoch: 151 loss: 0.37133673
epoch: 176 loss: 0.36773482
epoch: 201 loss: 0.36305946
epoch: 226 loss: 0.36079505
epoch: 251 loss: 0.35350436
epoch: 276 loss: 0.35540250
epoch: 300 loss: 0.3465710580
```

The following script plots the loss against the epochs:

```python
plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');
```

**Output:**

The output shows that the loss decreases rapidly at first; after about epoch 250 it barely decreases further.

**Make predictions**

The last step is to make predictions on the test data. To do this, we simply pass the test data through the model inside a `torch.no_grad()` block, so that no gradients are computed:

```python
with torch.no_grad():
    ...
```

**Output:**

```
Loss: 0.36855841
```
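The elided body presumably runs the model on the test tensors and computes the loss. A self-contained toy sketch of that pattern (the model and data below are small stand-ins, not the trained model from this article):

```python
import torch
import torch.nn as nn

loss_function = nn.CrossEntropyLoss()
model = nn.Linear(4, 2)                    # stand-in for the trained model
test_features = torch.randn(5, 4)          # stand-in for the test tensors
test_outputs = torch.randint(0, 2, (5,))   # stand-in for the test labels

with torch.no_grad():
    y_val = model(test_features)           # raw scores, one row per record
    loss = loss_function(y_val, test_outputs)

print(f'Loss: {loss:.8f}')
```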

The loss on the test set is 0.3685, slightly higher than the 0.3465 obtained on the training set, which indicates that our model is somewhat overfitting. Since we specified an output layer with 2 neurons, each prediction contains 2 values. For example, the first 5 predicted values look as follows:

```python
print(y_val[:5])
```

**Output:**

```
tensor([[ 1.2045, -1.3857],
        [ 1.3911, -1.5957],
        [ 1.2781, -1.3598],
        [ 0.6261, -0.5429],
        [ 2.5430, -1.9991]])
```

The idea behind these predictions is that, if the actual output is 0, the value at index 0 should be greater than the value at index 1, and vice versa. We can retrieve the index of the largest value in each row with the following script:

```python
y_val = np.argmax(y_val, axis=1)
```

Now let's print the first five values of `y_val` again:

```python
print(y_val[:5])
```

**Output:**

```
tensor([0, 0, 0, 0, 0])
```

Since in the initially predicted outputs the value at index zero is greater than the value at index one for each of the first five records, the processed output contains 0 in its first five positions.

Finally, we can use the `confusion_matrix`, `classification_report`, and `accuracy_score` functions from the `sklearn.metrics` module to compute accuracy, precision, and recall on the test set:

```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs, y_val))
print(classification_report(test_outputs, y_val))
print(accuracy_score(test_outputs, y_val))
```

**Output:**

```
[[1527   83]
 [ 224  166]]
              precision    recall  f1-score   support

           0       0.87      0.95      0.91      1610
           1       0.67      0.43      0.52       390

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.77      0.69      0.71      2000
weighted avg       0.83      0.85      0.83      2000

0.8465
```

The output shows that our model achieves an accuracy of 84.65%, which is quite impressive considering that we more or less arbitrarily selected all the parameters of the neural network. I suggest you try changing the model parameters, such as the train/test split or the number and size of the hidden layers, to see whether you can get better results.

**Conclusion**

PyTorch is a commonly used deep learning library developed by Facebook that can be used for various tasks such as classification, regression, and clustering. This article describes how to use the PyTorch library to classify tabular data.
