# Feature Engineering Manual Based on Jupyter: Data Preprocessing (1)

Author: Yingxiang Chen & Zihan Yang | Editor: Red Stone

The importance of feature engineering in machine learning is self-evident, and proper feature engineering can significantly improve the performance of machine learning models. We have compiled a systematic feature engineering tutorial on GitHub for your reference and study.

github.com/YC-Coder-Ch...

This article discusses the data preprocessing part: how to use scikit-learn to process static continuous variables, Category Encoders to process static categorical variables, and Featuretools to process common time series variables.

The data preprocessing part of feature engineering will be introduced in three parts:

• Static continuous variables
• Static categorical variables
• Time series variables

This article introduces Section 1.1, the data preprocessing of static continuous variables, explained in detail in Jupyter using sklearn.

1.1 Static continuous variables

1.1.1 Discretization

Discretizing continuous variables can make the model more robust. For example, when predicting the purchase behavior of a customer, a customer who has made 30 purchases may behave very similarly to a customer who has made 32 purchases. Sometimes such over-precision in a feature is just noise, which is why LightGBM uses a histogram algorithm to prevent over-fitting. There are two common methods for discretizing continuous variables.

1.1.1.1 Binarization

Binarize numerical features.

```python
# load the sample data
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
X, Y = dataset.data, dataset.target # we will take the first column as an example later
```
```python
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.distplot(X[:, 0], hist=True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution
```

```python
from sklearn.preprocessing import Binarizer

sample_columns = X[0:10,0] # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

model = Binarizer(threshold=6) # set 6 to be the threshold
# if value <= 6, then return 0, else return 1
result = model.fit_transform(sample_columns.reshape(-1,1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])
```

1.1.1.2 Binning

Bin the numerical features.

Uniform binning:

```python
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the real-world operation, we fit the KBinsDiscretizer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # set 5 bins
# return ordinal bin numbers, set all bins to have identical widths

model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])
bin_edge = model.bin_edges_[0]
# return array([0.4999, 3.39994, 6.29998, 9.20002, 12.10006, 15.0001]), the bin edges
```
```python
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)

for edge in bin_edge: # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
```

Quantile binning:

```python
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the real-world operation, we fit the KBinsDiscretizer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # set 5 bins
# return ordinal bin numbers, set all bins based on quantiles

model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])
bin_edge = model.bin_edges_[0]
# return array([0.4999, 2.3523, 3.1406, 3.9667, 5.10824, 15.0001]), the bin edges
# 2.3523 is the 20% quantile
# 3.1406 is the 40% quantile, etc.
```
```python
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)

for edge in bin_edge: # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantile Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
```

1.1.2 Scaling

It is difficult to compare features of different scales, especially in linear models such as linear regression and logistic regression. In k-means clustering or KNN models based on Euclidean distance, feature scaling is required, otherwise the distance measurement is meaningless. And for any algorithm that uses gradient descent, scaling also speeds up convergence.
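The distance argument can be made concrete with a small sketch on synthetic data (the two toy features and their magnitudes below are our own assumptions, not from the tutorial):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
income = rng.normal(50000, 10000, size=(100, 1)) # magnitude ~10^4
age = rng.normal(40, 10, size=(100, 1))          # magnitude ~10^1
data = np.hstack([income, age])

# per-feature contribution to the squared Euclidean distance between two samples
contrib = (data[0] - data[1]) ** 2
income_share = contrib[0] / contrib.sum() # nearly 1: income dominates the distance

scaled = StandardScaler().fit_transform(data)
contrib_scaled = (scaled[0] - scaled[1]) ** 2 # after scaling, both features can matter
```

Before scaling, almost the entire distance comes from the large-magnitude feature; after scaling, both features live on comparable scales.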

Some commonly used models:

Note: Skewness affects the PCA model, so it is better to use power transformation to eliminate skewness.

1.1.2.1 Standard scaling (Z-score standardization)

Formula:

X' = (X - μ) / σ

where X is the variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.
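A quick numpy check of that sensitivity (the toy numbers below are our own, chosen for illustration):

```python
import numpy as np

# a single outlier shifts both statistics that z-scoring depends on
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

clean_stats = (clean.mean(), clean.std())                  # (3.0, ~1.414)
outlier_stats = (with_outlier.mean(), with_outlier.std())  # (22.0, ~39.0)
```

Replacing one value moves the mean from 3.0 to 22.0 and inflates the standard deviation by a factor of more than 25, so every z-score in the column changes.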

```python
from sklearn.preprocessing import StandardScaler

# in order to mimic the real-world operation, we fit the StandardScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = StandardScaler()

model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.34539745, 2.33286782, 1.78324852, 0.93339178, -0.0125957,
#               0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as (X[0:10,0]-X[10:,0].mean())/X[10:,0].std()
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution

model = StandardScaler()
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but the scale changes
fig.tight_layout()
```

1.1.2.2 MinMaxScaler (scale according to the numerical range)

Assume that the range we want to scale the feature values to is (a, b).

Formula:

X' = a + (X - Min) * (b - a) / (Max - Min)

where Min is the minimum value of X and Max is the maximum value of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.

```python
from sklearn.preprocessing import MinMaxScaler

# in order to mimic the real-world operation, we fit the MinMaxScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = MinMaxScaler(feature_range=(0,1)) # set the range to be (0,1)

model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
#               0.24392077, 0.21787286, 0.18069406, 0.1089985, 0.22008662])
# result is the same as (X[0:10,0]-X[10:,0].min())/(X[10:,0].max()-X[10:,0].min())
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution

model = MinMaxScaler(feature_range=(0, 1))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but the scale changes to [0,1]
fig.tight_layout()
```

1.1.2.3 RobustScaler (anti-outlier scaling)

Use statistics that are robust to outliers (quantiles) to scale features. Suppose we want to scale using the quantile range (a, b).

Formula:

X' = (X - median(X)) / (Q_b(X) - Q_a(X))

where median(X) is the median of X, and Q_a(X), Q_b(X) are the a-th and b-th quantiles of X. This method is more robust to outliers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# in order to mimic the real-world operation, we fit the RobustScaler
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
# with_centering=True => recenter the feature by setting X' = X - X.median()
# with_scaling=True => rescale the feature by the quantile range set by the user
# set the quantile range to (25%, 75%)

model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.19755974, 2.18664281, 1.7077657, 0.96729508, 0.14306683,
#               0.23049401, 0.05724508, -0.19003715, -0.66689601, 0.07196918])
# result is the same as (X[0:10,0]-np.quantile(X[10:,0], 0.5))/(np.quantile(X[10:,0], 0.75)-np.quantile(X[10:,0], 0.25))
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but the scale changes
fig.tight_layout()
```

1.1.2.4 Power transformation (non-linear transformation)

All of the scaling methods described above preserve the shape of the original distribution. But normality is an important assumption of many statistical models, so we can use a power transformation to make the original distribution closer to a normal distribution.

Box-Cox transformation:

The Box-Cox transformation is only applicable to positive values and takes the following form:

X' = (X^λ - 1) / λ, if λ ≠ 0
X' = ln(X), if λ = 0

Considering all candidate values of λ, the optimal λ that stabilizes the variance and minimizes the skewness is selected through maximum likelihood estimation.
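The fitted λ can be inspected on the transformer through its `lambdas_` attribute. A small sketch on synthetic skewed data (the log-normal sample below is our own assumption, chosen because its MLE-optimal λ should land near 0, i.e. close to a plain log transform):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1)) # right-skewed, strictly positive

model = PowerTransformer(method='box-cox', standardize=True)
result = model.fit_transform(data)

optimal_lambda = model.lambdas_[0] # the lambda chosen by maximum likelihood
# for log-normal data this is close to 0, i.e. roughly a log transform
```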

```python
from sklearn.preprocessing import PowerTransformer

# in order to mimic the real-world operation, we fit the PowerTransformer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = PowerTransformer(method='box-cox', standardize=True)
# apply the box-cox transformation

model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.91669292, 1.91009687, 1.60235867, 1.0363095, 0.19831579,
#               0.30244247, 0.09143411, -0.24694006, -1.08558469, 0.11011933])
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution

model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()
```

Yeo-Johnson transformation:

The Yeo-Johnson transformation applies to both positive and negative values and takes the following form:

X' = ((X + 1)^λ - 1) / λ, if λ ≠ 0, X ≥ 0
X' = ln(X + 1), if λ = 0, X ≥ 0
X' = -((-X + 1)^(2-λ) - 1) / (2-λ), if λ ≠ 2, X < 0
X' = -ln(-X + 1), if λ = 2, X < 0

Considering all candidate values of λ, the optimal λ that stabilizes the variance and minimizes the skewness is selected through maximum likelihood estimation.

```python
from sklearn.preprocessing import PowerTransformer

# in order to mimic the real-world operation, we fit the PowerTransformer
# on the train set and transform the test set
# we take the top ten samples in the first column as the test set
# and the rest of the samples in the first column as the train set

test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]

model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply the yeo-johnson transformation

model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.90367888, 1.89747091, 1.604735, 1.05166306, 0.20617221,
#               0.31245176, 0.09685566, -0.25011726, -1.10512438, 0.11598074])
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature

import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has a long-tail distribution

model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()
```

1.1.3 Normalization

All of the scaling methods above operate column-wise. Normalization, in contrast, operates on each row: it rescales every sample to have unit norm. Since normalization works row by row, it distorts the relationships between features, so it is not common. But normalization methods are very useful in the context of text classification and clustering.

Suppose X[i][j] represents the value of feature j in sample i.

L1 normalization formula:

X'[i][j] = X[i][j] / Σ_j |X[i][j]|

L2 normalization formula:

X'[i][j] = X[i][j] / sqrt(Σ_j X[i][j]²)

L1 normalization:

```python
from sklearn.preprocessing import Normalizer

# Normalizer performs the operation on each row independently
# so the train set and the test set are processed independently

###### for L1 norm
sample_columns = X[0:2,0:3] # select the first two samples, and the first three features
# return array([[ 8.3252, 41., 6.98412698],
#               [ 8.3014, 21., 6.23813708]])

model = Normalizer(norm='l1')
# use the L1 norm to normalize each sample

model.fit(sample_columns)

result = model.transform(sample_columns) # the test set is processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
#               [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns/np.sum(np.abs(sample_columns), axis=1).reshape(-1,1)
```

L2 normalization:

```python
###### for L2 norm
sample_columns = X[0:2,0:3] # select the first two samples, and the first three features
# return array([[ 8.3252, 41., 6.98412698],
#               [ 8.3014, 21., 6.23813708]])

model = Normalizer(norm='l2')
# use the L2 norm to normalize each sample

model.fit(sample_columns)

result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
#               [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns/np.sqrt(np.sum(sample_columns**2, axis=1)).reshape(-1,1)
```
```python
# visualize the difference in the distribution after normalization
# compare it with the distribution after RobustScaling
# fit and transform the entire first & second feature

import seaborn as sns
import matplotlib.pyplot as plt

# RobustScaler
fig, ax = plt.subplots(2, 1, figsize=(13, 9))

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
model.fit(X[:, 0:2])
result = model.transform(X[:, 0:2])

sns.scatterplot(result[:, 0], result[:, 1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);

model = Normalizer(norm='l2')

model.fit(X[:, 0:2])
result = model.transform(X[:, 0:2])

sns.scatterplot(result[:, 0], result[:, 1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout() # normalization distorts the original distribution
```

1.1.4 Missing value imputation

In practice, there may be missing values in the data set. However, such incomplete data sets are incompatible with most scikit-learn models, which assume that all features are numerical and contain no missing values. So before applying scikit-learn models, we need to impute the missing values.

But some newer models, such as XGBoost, LightGBM, and CatBoost implemented in other packages, provide native support for missing values in the data set. So when applying these models, we no longer need to fill in the missing values.

1.1.4.1 Univariate feature imputation

Assuming there are missing values in the i-th column, we estimate them with a constant or a statistic (mean, median, or mode) of the i-th column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan

imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1,1))
result = imputer.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([8.3252, 8.3014, 7.2574, 3.87023658, 3.8462,
#               4.0368, 3.87023658, 3.12, 2.0804, 3.6912])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set)
# which is the mean of the train set ignoring missing values
```

1.1.4.2 Multivariate feature imputation

Multivariate feature imputation uses the information of the entire data set to estimate and impute missing values. In scikit-learn, it is implemented in an iterative round-robin fashion.

In each step, one feature column is designated as the output y, and the other feature columns are treated as the inputs X. A regressor is fit on (X, y) for the samples where y is known. Then the regressor is used to predict the missing values of y. This is done for each feature in turn, and the whole process is repeated for max_iter imputation rounds.

Use a linear model (take BayesianRidge as an example):

```python
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter=10,
                           random_state=0,
                           estimator=impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.6237195, 3.8462,
#               4.0368, 4.00258149, 3.12, 2.0804, 3.6912])
```

Use a tree-based model (take ExtraTrees as an example):

```python
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned in CV through the sklearn pipeline
imputer = IterativeImputer(max_iter=10,
                           random_state=0,
                           estimator=impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.63813, 3.8462, 4.0368, 3.24721,
#               3.12, 2.0804, 3.6912])
```

Use K Nearest Neighbor (KNN):

```python
from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

test_set = X[0:10,:].copy() # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan

impute_estimator = KNeighborsRegressor(n_neighbors=10,
                                       p=1) # set p=1 to use manhattan distance
# use manhattan distance to reduce the effect of outliers

# parameters can be tuned in CV through the sklearn pipeline
imputer = IterativeImputer(max_iter=10,
                           random_state=0,
                           estimator=impute_estimator)

imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052, 3.12,
#               2.0804, 3.6912])
```

1.1.4.3 Marking missing values

Sometimes the fact that a value is missing may itself be useful information. Therefore, scikit-learn also provides a function that converts a data set with missing values into a corresponding binary matrix indicating where missing values occur in the data set.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# illustrate this function on the train set only
# since the process is independent for the train set and the test set
train_set = X[10:,:].copy() # select all features
train_set[3,0] = np.nan # manually create some missing values
train_set[6,0] = np.nan
train_set[3,1] = np.nan

indicator = MissingIndicator(missing_values=np.nan, features='all')
# show the results on all the features
result = indicator.fit_transform(train_set) # result has the same shape as train_set
# contains only True & False, True corresponds to a missing value

result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value
```
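A related shortcut, assuming scikit-learn >= 0.21: SimpleImputer can append the same indicator columns itself through its add_indicator parameter, so the model sees both the imputed values and the missingness pattern (the toy array below is our own, chosen for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, np.nan]])

imputer = SimpleImputer(strategy='mean', add_indicator=True)
result = imputer.fit_transform(data)
# result has 4 columns: 2 mean-imputed feature columns followed by
# 2 binary indicator columns (one per column that had missing values)
# array([[1. , 2. , 0. , 0. ],
#        [2.5, 3. , 1. , 0. ],
#        [4. , 2.5, 0. , 1. ]])
```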

1.1.5 Feature transformation

1.1.5.1 Polynomial transformation

Sometimes we want to introduce nonlinear features into the model, thereby increasing the complexity of the model. For simple linear models, this greatly increases their expressive power. But more complex models, such as tree-based ML models, already capture non-linear relationships in their non-parametric tree structure, so this feature transformation may not be very helpful for them.

For example, if we set the degree to 3, for two features X1 and X2 the transformed features take the form:

(1, X1, X2, X1^2, X1*X2, X2^2, X1^3, X1^2*X2, X1*X2^2, X2^3)

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])

poly = PolynomialFeatures(degree=3, interaction_only=False)
# the highest degree is set to 3, and we want more than just interaction terms

result = poly.fit_transform(train_set) # has shape (1, 10)
# array([[ 1., 2., 3., 4., 6., 9., 8., 12., 18., 27.]])
```

1.1.5.2 Custom transformation

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])

transformer = FunctionTransformer(func=np.log1p, validate=True)
# perform the log transformation, X' = log(1 + X)
# func can be any numpy function such as np.exp
result = transformer.transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)
```

This concludes the introduction to the data preprocessing of static continuous variables. Readers are encouraged to work through the code themselves in Jupyter.
