Industrial steam forecast

Preface

In thermal power generation, fuel is burned to heat water into steam, the steam drives a turbine, and the turbine drives a generator to produce electricity. In this process, the combustion efficiency of the boiler is the key factor determining generation efficiency. Many variables affect combustion efficiency, including adjustable boiler settings such as fuel feed rate, primary and secondary air, induced draft and return air, as well as boiler operating conditions such as bed temperature, furnace temperature and bed pressure. This article uses an industrial steam volume data set to predict the amount of steam produced and thereby analyze the efficiency of thermal power generation.

1.1 Jupyter settings, package imports and data set loading

Import related modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sklearn
from sklearn.exceptions import ConvergenceWarning
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

To prevent garbled Chinese characters in plots, set a Chinese font for matplotlib and seaborn.

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')

Set the number of rows and columns displayed by pandas in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_rows', 10)
pd.set_option('max_columns', 30)

Load the data sets.

df_train = pd.read_csv('zhengqi_train.txt', sep='\t', encoding='utf-8')
df_test = pd.read_csv('zhengqi_test.txt', sep='\t', encoding='utf-8')

1.2 Exploratory analysis

1.2.1 Analyze the data set

1.2.1.1 Preview data set
  • Preview the data set.
df_train.head(5).append(df_train.tail(5))
(Output: the first 5 and last 5 rows of the training set, 10 rows × 39 columns, features V0–V37 plus target.)
1.2.1.2 Preview relevant statistics
df_train.describe()
(Output: count, mean, std, min, 25%, 50%, 75% and max for all 39 columns; every column has 2888 non-null values.)
1.2.1.3 Preview data type
  • All columns are of float type
df_train.info()
dtypes: float64(39)
memory usage: 880.1 KB
1.2.1.4 Preview the dimensions of training set and test set
df_train.shape, df_test.shape
((2888, 39), (1925, 38))
1.2.1.5 The number and distribution of missing values
  • No missing values found in the training set
df_train.isnull().sum()

# missing_pct = df_train.isnull().sum() * 100 / len(df_train)   # percentage of missing values per column
# missing = pd.DataFrame({'name': df_train.columns, 'missing_pct': missing_pct})
# missing.sort_values(by='missing_pct', ascending=False).head()

(Output: the missing-value count is 0 for every column.)
1.2.1.6 Target value distribution
  • Mean, median, maximum and minimum of the target value
df_train['target'].mean(), df_train['target'].median(), df_train['target'].max(), df_train['target'].min()
(0.12635283933517938, 0.313, 2.5380000000000003, -3.0439999999999996)
  • Curve of the target value over the sample index
plt.figure()
df_train['target'].plot()
plt.ylabel('target')
plt.xlabel('id')
plt.show()

1.2.1.7 Generate data report
pfr = pandas_profiling.ProfileReport(df_train)
pfr.to_file("./example.html")

1.2.2 Feature distribution curves

  • From the plots, you can check whether each feature is approximately normally distributed
  • The value distributions of V9, V17, V18, V22, V23, V24, V28, V35 are very uneven (a numerical skewness check is sketched after the plotting code below)
  • The plots of V14, V17, V19, V22, V24, V28 show multiple extreme values
df = pd.melt(df_train, value_vars=df_train.columns)
sp = sns.FacetGrid(df, col='variable', col_wrap=5, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='m', rug=True)
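As a rough numerical complement to the plots, a sketch like the one below ranks features by skewness and kurtosis; the cut-off values are illustrative assumptions, not part of the original analysis.

# Sketch: quantify how far each feature is from a normal shape.
# The thresholds below are illustrative assumptions added here, not values from the original analysis.
skew_kurt = pd.DataFrame({
    'skew': df_train.drop('target', axis=1).skew(),
    'kurtosis': df_train.drop('target', axis=1).kurtosis(),
})
# Features with strong skew or heavy tails roughly match the "uneven" columns noted above
print(skew_kurt[(skew_kurt['skew'].abs() > 1) | (skew_kurt['kurtosis'] > 5)]
      .sort_values('skew', ascending=False))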

1.3 Feature Engineering

  • Concatenate the test set with the training set so that feature engineering is applied consistently to both.
target = df_train['target']
combined = df_train.drop('target', axis=1).append(df_test)
combined.head()
(Output: the first 5 rows of the combined data, 38 feature columns V0–V37.)

1.3.1 Feature correlation analysis

  • The figure shows the pairwise relationships of all features; by inspection they can be grouped into several clusters
  • Roughly linear with V0: V1, V4, V8, V12, V27, V31, target
  • Roughly linear with V2: V6, V7, V16
  • Roughly linear with V5: V11
  • Roughly linear with V10: V36
  • Roughly linear with V15: V29
  • Roughly linear with V33: V34
features = ['V0','V1','V2','V3','V4','V5','V6','V7','V8','V9','V10','V11','V12',
            'V13','V14','V15','V16','V17','V18','V19','V20','V21','V22','V23','V24',
            'V25','V26','V27','V28','V29','V30','V31','V32','V33','V34','V35','V36',
            'V37','target']
sns.pairplot(df_train[features])

  • After the preliminary analysis of feature relationships on the training set, the results are now used to remove highly correlated features from combined.
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V0, V1, V4, V8, V12, V27, V31
features = ['V0','V1','V4','V8','V12','V27','V31','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Process V2, V6, V7, V16
features = ['V2','V6','V7','V16']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

  • Delete the features whose Pearson correlation coefficient with another feature is greater than 0.75: V1, V4, V5, V6, V8, V16, V29, V36 (a programmatic cross-check is sketched after the code below)

features = ['V0','V2','V7','V10','V11','V12','V15','V27','V31','V33','V34']
plt.figure(figsize=(15, 8))
sns.heatmap(combined[features].corr(), annot=True)
plt.show()

rv_features = ['V1','V4','V5','V6','V8','V16','V29','V36']
combined.drop(rv_features, axis=1, inplace=True)
combined.head()
(Output: the first 5 rows of combined with the 30 remaining feature columns.)
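As a cross-check on the manually chosen list, a sketch like the following can print all feature pairs whose absolute Pearson correlation exceeds 0.75; the 0.75 threshold follows the rule above, everything else is illustrative.

# Sketch: list feature pairs with |Pearson correlation| > 0.75 (illustrative cross-check,
# not the original selection procedure).
corr = df_train.drop('target', axis=1).corr().abs()
pairs = (corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # keep the upper triangle only
             .stack()
             .sort_values(ascending=False))
print(pairs[pairs > 0.75])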

1.3.2 Analysis of feature distribution consistency

  • If a feature is distributed differently in the training set and the test set, it hurts the generalization ability of the model, so features with inconsistent distributions are removed; here V17 and V22 are dropped (a KS-test sketch for quantifying the drift follows the plotting code below).
plt.figure(figsize=(42, 36))
i = 1
for feature in combined.columns:
    ax = plt.subplot(6, 6, i)
    ax = sns.kdeplot(df_train[feature], color='r', shade=True)
    ax = sns.kdeplot(df_test[feature], color='k', shade=True)
    ax = ax.legend(['train', 'test'])
    i = i + 1
plt.show()

combined.drop(['V17','V22'], axis=1, inplace=True)
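The visual comparison can be supplemented with a two-sample Kolmogorov–Smirnov test. The sketch below is an illustrative addition; scipy and the 0.05 significance level are assumptions, not part of the original workflow.

# Sketch: quantify train/test distribution drift per feature with a two-sample KS test.
# scipy and the 0.05 cut-off are assumptions added for illustration.
from scipy import stats

for feature in df_test.columns:
    ks_stat, p_value = stats.ks_2samp(df_train[feature], df_test[feature])
    if p_value < 0.05:  # distributions differ significantly
        print(f'{feature}: KS statistic={ks_stat:.3f}, p-value={p_value:.3g}')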

1.3.3 Handling outliers

  • For roughly normally distributed data, values within three standard deviations of the mean are treated as normal, and values outside this range are treated as outliers.
def find_outliers_by_3sigma(data, feature):
    data_std = np.std(data[feature])
    data_mean = np.mean(data[feature])
    # 3 standard deviations
    outliers_cut_off = data_std * 3
    # lower boundary
    lower_rule = data_mean - outliers_cut_off
    # upper boundary
    upper_rule = data_mean + outliers_cut_off
    # flag outliers
    data[feature + '_outliers'] = data[feature].apply(
        lambda x: 'outliers' if x > upper_rule or x < lower_rule else 'normal values')
    return data

# inspect outliers per feature
for feature in df_train.columns:
    df_train = find_outliers_by_3sigma(df_train, feature)
    print(df_train[feature + '_outliers'].value_counts())
    print('=' * 50)

  • Remove outliers (optional)
outlier_cols = [col for col in df_train.columns if col.endswith('_outliers')]
for col in outlier_cols:
    df_train = df_train[df_train[col] == 'normal values']
# keep only the original columns, dropping the helper *_outliers flags
df_train = df_train.iloc[:, :29]
df_train.shape
((2447, 29))

1.4 Model training

  • Import related modules

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, PolynomialFeatures
from sklearn.pipeline import Pipeline
import time
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score, StratifiedKFold
  • Divide the data set
train = combined[:2888]
test = combined[2888:]
x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=42)

1.4.1 Using basic models and scoring

  • Create regression model
lr = LinearRegression()
rgcv = RidgeCV()
eltcv = ElasticNetCV()
lasso = LassoCV()
rf = RandomForestRegressor()
gbdt = GradientBoostingRegressor()
xgb = XGBRegressor()
lgbm = LGBMRegressor()
models = [lr, rgcv, eltcv, lasso, rf, gbdt, xgb, lgbm]
  • Data transformation: standardization, normalization and polynomial expansion were tried; after experimenting, polynomial expansion was chosen (a cross-validation variant of the scoring loop is sketched after the code below)
# ss = StandardScaler()
# x_train = ss.fit_transform(x_train)
# x_val = ss.transform(x_val)

poly = PolynomialFeatures()
x_train = poly.fit_transform(x_train)
x_val = poly.transform(x_val)

# norm = Normalizer()
# x_train = norm.fit_transform(x_train)
# x_val = norm.transform(x_val)

# mms = MinMaxScaler()
# x_train = mms.fit_transform(x_train)
# x_val = mms.transform(x_val)

for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val r2_score:', metrics.r2_score(y_val, predict_val))
    print('val mean_squared_error:', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')

(Output, validation scores:)
LinearRegression             r2: 0.8256   mse: 0.10568
RidgeCV                      r2: 0.8256   mse: 0.10567
ElasticNetCV                 r2: 0.8260   mse: 0.10542
LassoCV                      r2: 0.8260   mse: 0.10541
RandomForestRegressor        r2: 0.8143   mse: 0.11254
GradientBoostingRegressor    r2: 0.8214   mse: 0.10819
XGBRegressor                 r2: 0.8221   mse: 0.10777
LGBMRegressor                r2: 0.8316   mse: 0.10206
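A single train/validation split can be noisy. As a hedged alternative, not part of the original workflow, a sketch like the following compares the same baseline models with 5-fold cross-validation; the fold count and scoring choice are assumptions.

# Sketch: compare the baseline models with 5-fold cross-validation instead of a single split.
# Illustrative addition; cv=5 and neg_mean_squared_error are assumed choices.
from sklearn.model_selection import cross_val_score

for model in models:
    scores = cross_val_score(model, x_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f'{model.__class__.__name__}: mean CV MSE = {-scores.mean():.5f} (+/- {scores.std():.5f})')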

1.4.2 Adjusting hyperparameters

1.4.2.1 Ridge and Lasso model tuning
  • Based on the R-squared and mean squared error of the baseline models, several models are selected for hyperparameter tuning; a pipeline of standardization, polynomial expansion and the regression model is tuned with grid search.
## A Pipeline chains preprocessing and the model so that they can be tuned together
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10)))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures()),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10)))]),
]

# Parameter dictionary: keys are '<step>__<parameter>', values are candidate lists
parameters = {
    "poly__degree": [3, 2, 1],
    "poly__interaction_only": [True, False],
    "poly__include_bias": [True, False],
    "linear__fit_intercept": [True, False],
}

for mode in models:
    model = GridSearchCV(mode, param_grid=parameters, cv=5, scoring='neg_mean_squared_error')
    model.fit(x_train, y_train)
    print(mode[2])
    print("Optimal parameters:", model.best_params_)
    print("Best score:", model.best_score_)
    print('**************************************************')

RidgeCV(alphas=array([1.00000000e-03, ..., 1.00000000e+01]))
Optimal parameters: {'linear__fit_intercept': True, 'poly__degree': 1, 'poly__include_bias': True, 'poly__interaction_only': True}
Best score: -0.10341087538163592
**************************************************
LassoCV(alphas=array([1.00000000e-03, ..., 1.00000000e+01]))
Optimal parameters: {'linear__fit_intercept': False, 'poly__degree': 2, 'poly__include_bias': True, 'poly__interaction_only': False}
Best score: -0.10087217338993411
**************************************************
  • Bring the optimal parameters into the model for training and evaluation
models = [
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=1, include_bias=True, interaction_only=True)),
              ('linear', RidgeCV(alphas=np.logspace(-3, 1, 10), fit_intercept=True))]),
    Pipeline([('ss', StandardScaler()),
              ('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)),
              ('linear', LassoCV(alphas=np.logspace(-3, 1, 10), fit_intercept=False))]),
]

for mode in models:
    model = mode.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val mean_squared_error:', metrics.mean_squared_error(y_val, predict_val))
    print('**********************************')

val mean_squared_error: 0.10561696740544291
**********************************
val mean_squared_error: 0.10668960584119808
**********************************
1.4.2.2 lightgbm tuning
model_lgb = LGBMRegressor(random_state=2021)
params_dic = dict(learning_rate=[0.01, 0.1, 1],
                  n_estimators=[20, 50, 120, 300],
                  num_leaves=[10, 30],
                  max_depth=[-1, 4, 10])
grid_search = GridSearchCV(model_lgb, cv=5, param_grid=params_dic, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
print(f'The best parameters are: {grid_search.best_params_}')
print(f'The best score is: {-grid_search.best_score_}')
# The best parameters are: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300, 'num_leaves': 30}
  • Evaluate the final lgb model on the validation set
lgb_final = LGBMRegressor(random_state=2021, learning_rate=0.1, max_depth=5,
                          n_estimators=200, num_leaves=50)
lgb_final.fit(x_train, y_train)
val_pred = lgb_final.predict(x_val)
print(f'mean_squared_error: {metrics.mean_squared_error(y_val, val_pred)}')

mean_squared_error: 0.10652795234920218
1.4.2.3 xgb tuning
xgb_re = XGBRegressor(seed=27, learning_rate=0.1, n_estimators=300, silent=0,
                      objective='reg:linear', gamma=0, subsample=0.8,
                      colsample_bytree=0.8, nthread=4, scale_pos_weight=1)
xgb_params = {'n_estimators': [50, 100, 120], 'min_child_weight': list(range(1, 4, 2))}
best_model = GridSearchCV(xgb_re, param_grid=xgb_params, refit=True, cv=5,
                          scoring='neg_mean_squared_error')
best_model.fit(x_train, y_train)
print('best_parameters:', best_model.best_params_)
print(f'The best score is: {-best_model.best_score_}')
# best_parameters: {'min_child_weight': 3, 'n_estimators': 120}
  • Use the validation set to verify the final xgb model
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1, scale_pos_weight=1,
                         min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
val_pred = xgb_final.predict(x_val)
print(f'mean_squared_error: {metrics.mean_squared_error(y_val, val_pred)}')

mean_squared_error: 0.10342130385311427

1.4.3 Model prediction and output result file

  • Model prediction and output txt file
xgb_final = XGBRegressor(seed=27, learning_rate=0.1, objective='reg:linear', gamma=0.2,
                         subsample=0.5, colsample_bytree=0.8, nthread=1, scale_pos_weight=1,
                         min_child_weight=0.3, n_estimators=300)
xgb_final.fit(x_train, y_train)
# apply the same polynomial expansion to the test set that was applied to x_train
pre_test = xgb_final.predict(poly.transform(test))
pred = pd.Series(pre_test)
pred.to_csv('submit.txt', sep='\t', index=False)