Original link: tecdat.cn/?p=9913
Original source: Tuoduan Data Tribe Official Account
Overview and definition
In this article, we consider some alternative fitting methods for linear models beyond the usual ordinary least squares (OLS) approach. These alternatives can sometimes provide better prediction accuracy and better model interpretability.
Prediction accuracy: if the true relationship is approximately linear, the ordinary least squares estimates have low bias. OLS also performs well when n >> p. However, if n is not much larger than p, there can be a lot of variability in the fit, resulting in overfitting and/or poor predictions. If p > n, there is no longer a unique least squares estimate and the method cannot be used at all.
This issue is one aspect of the curse of dimensionality. As p increases, an observation x tends to lie closer to the boundary of the sample space than to its nearest neighboring observations, which poses a major problem for prediction. In addition, with many predictors the training samples are often sparse, making it difficult to identify trends and make predictions.
By constraining and shrinking the estimated coefficients, we can often reduce the variance substantially at the cost of a negligible increase in bias, which usually leads to a significant improvement in prediction accuracy.

Interpretability of the model: irrelevant variables add unnecessary complexity to the resulting model. By removing them (setting their coefficients to zero), we obtain a model that is easier to interpret. With OLS, however, it is extremely unlikely that any coefficient will be exactly zero.
Subset selection: we identify a subset of the p predictors believed to be related to the response, and then fit a least squares model using that reduced set of features.
Although we discussed the application of these techniques in linear models, they are also applicable to other methods, such as classification.
Detailed method
Subset selection
Best subset selection
Here, we fit a separate OLS regression for every possible combination of the p predictors and then examine the resulting fits. The problem with this approach is that the best model is hidden among 2^p possibilities. The algorithm has two stages: (1) for each k = 1, ..., p, fit all models that contain exactly k predictors and keep the best one (smallest RSS); (2) choose among these candidate models using cross-validated prediction error, or a criterion such as Cp, AIC or BIC, discussed below.
This approach also applies to other types of models, such as logistic regression, although the score used in step (1) changes accordingly: for logistic regression we would use the deviance instead of RSS and R^2.
Choose the best model
Each of the subset selection algorithms requires us to decide which of the candidate models is best. As mentioned earlier, the model containing the most predictors will always have the smallest training RSS and the largest training R^2, so training error alone cannot be used. To select the model with the smallest test error, we need to estimate the test error. There are two ways to do this:
Indirectly, by adjusting the training error to account for the bias due to overfitting.
Directly, using a validation set or cross-validation.
Validation and cross-validation
In general, cross-validation provides a more direct estimate of the test error and makes fewer assumptions about the underlying model. It can also be used with a wider range of model types.
Ridge regression
Ridge regression is similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that make the RSS small, but it adds a shrinkage penalty that grows as the coefficients move away from zero. The effect of this penalty is to shrink the coefficient estimates toward zero. A tuning parameter lambda controls the amount of shrinkage; with lambda = 0 the behavior is exactly the same as OLS regression. Choosing a good value of lambda is important and should be done with cross-validation. Ridge regression also requires the predictors X to be centered (and, in practice, standardized) before fitting.
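For reference, the standard ridge criterion (with tuning parameter $\lambda \ge 0$) is to minimize
$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \;=\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$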
Why is ridge regression better than least squares?
The advantage is rooted in the bias-variance trade-off. As lambda increases, the flexibility of the ridge regression fit decreases, which reduces the variance while increasing the bias only slightly. The OLS fit (lambda = 0) has high variance but little bias. The lowest test MSE typically occurs where the decrease in variance and the increase in bias balance, so by tuning lambda appropriately we can reach a lower test MSE.
Ridge regression is most effective when the least squares estimates have high variance. It is also far more computationally efficient than best subset selection, because the fits for all values of lambda can, in effect, be computed at once.
Lasso
Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty shrinks many coefficients toward zero, but never sets them exactly to zero. This is usually not a problem for prediction accuracy, but it makes the model harder to interpret. The lasso overcomes this shortcoming: when its constraint budget s is made small enough, some coefficients are forced to be exactly zero. With s = 1 (no effective constraint) we recover ordinary OLS regression, and as s approaches 0 the coefficients shrink toward zero. The lasso therefore also performs variable selection.
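In the same notation, the lasso minimizes
$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|,$$
or, equivalently, minimizes RSS subject to the budget constraint $\sum_{j=1}^{p} |\beta_j| \le s$. It is this $\ell_1$ penalty that can set some coefficients exactly to zero.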
Dimensionality reduction method
So far, the methods we have discussed have controlled variance either by using a subset of the original variables or by shrinking their coefficients toward zero. We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. Dimensionality reduction turns the problem of estimating p + 1 coefficients into the simpler problem of estimating M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.
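In symbols, the standard formulation is: form transformed variables $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$ for $m = 1, \dots, M$, and then fit the least squares model $y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i$.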
Principal component regression (PCR)
PCA is a technique for deriving a low-dimensional set of features from a large number of variables.
In regression, we construct the first M principal components and then use these components as the predictors in a linear regression fitted by least squares. Compared with ordinary least squares on all p predictors, this often gives a better model because it reduces the risk of overfitting.
Partial Least Squares
The PCR approach described above identifies the linear combinations (directions) of X that best represent the predictors, without using the response to help choose them.
PLS, in contrast, chooses these directions in a supervised way, placing higher weight on the variables most strongly related to the response; the first PLS direction is obtained by setting each variable's weight proportional to the coefficient from the simple linear regression of Y onto that variable.
In practice, PLS often performs no better than ridge regression or PCR: although it can reduce bias, it can also increase variance, so the overall benefit is frequently a wash.
Interpret highdimensional results
We must always be cautious in how we report results from fitted models, especially in high-dimensional settings. In that situation multicollinearity is extreme: any variable in the model can be written as a linear combination of the other variables, so we can never be sure which variables are genuinely predictive.
Example
Subset selection method
Best subset selection
We wish to predict baseball players' salaries on the basis of various statistics from the previous season.
library(ISLR)
attach(Hitters)
names(Hitters)

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division"
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"

dim(Hitters)

## [1] 322  20

str(Hitters)

## 'data.frame': 322 obs. of 20 variables:
##  $ AtBat    : int 293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int 66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int 1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int 30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int 29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int 14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int 1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int 66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int 1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int 30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int 29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int 14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int 446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int 33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int 20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

# Check for missing values
sum(is.na(Hitters$Salary)) / length(Hitters[, 1]) * 100

## [1] 18.32
It turns out that about 18% of the salary values are missing. We will omit the rows with missing data.
Hitters <- na.omit(Hitters)
dim(Hitters)

## [1] 263  20
Perform best subset selection, using RSS to compare the models of each size.
library(leaps)
regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept); none forced in or out
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##
## Variables selected for the best model of each size:
## 1: CRBI
## 2: Hits, CRBI
## 3: Hits, CRBI, PutOuts
## 4: Hits, CRBI, DivisionW, PutOuts
## 5: AtBat, Hits, CRBI, DivisionW, PutOuts
## 6: AtBat, Hits, Walks, CRBI, DivisionW, PutOuts
## 7: Hits, Walks, CAtBat, CHits, CHmRun, DivisionW, PutOuts
## 8: AtBat, Hits, Walks, CHmRun, CRuns, CWalks, DivisionW, PutOuts
The summary indicates which variables are included in the best model of each size; for example, the best two-variable model contains only Hits and CRBI.
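The 19 R^2 values below require allowing models of up to 19 variables (the regsubsets default is 8), so they were presumably produced by something like the following sketch (the object names regfit.full and reg.summary are illustrative):

# Refit, allowing models with up to all 19 predictors (illustrative object names)
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)
# Training R^2 of the best model of each size
reg.summary$rsq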
##  [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461
The training R^2 increases monotonically as more variables are added, from about 0.32 with one variable to about 0.55 with all 19.
We can use the built-in plotting functions to plot RSS, adjusted R^2, Cp, AIC and BIC against the number of variables.
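A sketch of the plotting calls, assuming the reg.summary object from the sketch above:

par(mfrow = c(2, 2))
plot(reg.summary$rss,   xlab = "Number of Variables", ylab = "RSS",          type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
plot(reg.summary$cp,    xlab = "Number of Variables", ylab = "Cp",           type = "l")
plot(reg.summary$bic,   xlab = "Number of Variables", ylab = "BIC",          type = "l")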
Note: all of the goodness-of-fit measures above are estimates of the test error, except R^2, which measures training fit only.
Forward and backward stepwise selection
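The forward stepwise fit summarized below was produced by a call of the form shown in the output's Call line (the object name regfit.fwd is illustrative):

regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)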
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept); none forced in or out
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##
## Order in which variables enter the model:
## CRBI, Hits, PutOuts, DivisionW, AtBat, Walks, CWalks, CRuns, CAtBat,
## Assists, LeagueN, Runs, Errors, HmRun, CHits, RBI, NewLeagueN, Years, CHmRun
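The backward stepwise fit is analogous (a sketch; the object name regfit.bwd is illustrative):

regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)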
## Subset selection object
## 19 Variables (and intercept); none forced in or out
## 1 subsets of each size up to 19
## Selection Algorithm: backward
##
## Variables retained as the model size grows (reverse order of removal):
## CRuns, Hits, PutOuts, AtBat, Walks, DivisionW, CWalks, CRBI, CAtBat,
## Assists, LeagueN, Runs, Errors, HmRun, CHits, RBI, NewLeagueN, Years, CHmRun
We can see, for example, that the 16-variable model is the same under forward and backward stepwise selection, although the smaller models differ.
Ridge regression and lasso
Cross-validation methods
We will also use cross-validation when fitting the regularization methods.
Validation set
Rather than using adjusted R^2, Cp and BIC to estimate the test error, we can estimate it directly with a validation-set or cross-validation approach. We must use only the training observations for every step of model fitting and variable selection; the test error is then obtained by applying the fitted model to the held-out test (validation) data.
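The output below (133 training samples, bootstrap resampling, a small lambda grid) looks like a caret fit on a roughly 50/50 train/test split of the 263 complete cases; a minimal sketch under that assumption (the seed, split and object names are illustrative):

library(caret)
set.seed(123)                                             # illustrative seed
train_rows <- createDataPartition(Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[train_rows, ]                            # assumed training half
test  <- Hitters[-train_rows, ]                           # assumed held-out half

ridge <- train(Salary ~ ., data = train, method = "ridge",
               preProcess = c("center", "scale"),
               tuneGrid = data.frame(lambda = c(0, 1e-4, 0.1)))  # default bootstrap resampling
ridge
ridge.pred <- predict(ridge, test)                        # predictions on the held-out set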
## Ridge Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
##
## Resampling results across tuning parameters:
##
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       400   0.4       40       0.09
##   1e-04   400   0.4       40       0.09
##   0.1     300   0.5       40       0.09
##
## RMSE was used to select the best model using the smallest value.
## The final value used for the model was lambda = 0.1.
mean(ridge.pred - test$Salary)^2

## [1] 30.1
k-fold cross-validation
Use k-fold cross-validation to select the best lambda.
For cross-validation, we again divide the data into training and test sets.
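A sketch of the 10-fold cross-validation setup, reusing the same training data (names are illustrative):

ctrl <- trainControl(method = "cv", number = 10)          # 10-fold CV
ridge <- train(Salary ~ ., data = train, method = "ridge",
               preProcess = c("center", "scale"),
               tuneGrid = data.frame(lambda = c(0, 1e-4, 0.1)),
               trControl = ctrl)
ridge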
## Ridge Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       300   0.6       70       0.1
##   1e-04   300   0.6       70       0.1
##   0.1     300   0.6       70       0.1
##
## RMSE was used to select the best model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# Extract the regression coefficients from the final model
predict(ridge$finalModel, type = 'coef', mode = 'norm')$coefficients[19, ]
##      AtBat       Hits      HmRun       Runs        RBI      Walks
##    157.221    313.860     18.996      0.000     70.392    171.242
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI
##     27.543      0.000      0.000     51.811    202.537    187.933
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors
##    224.951     12.839     38.595      9.128     13.288     18.620
## NewLeagueN
##     22.326
sqrt(mean(ridge.pred - test$Salary)^2)

## [1] 17.53
So the average salary prediction error is roughly 17,500 dollars (salary is recorded in thousands of dollars). The regression coefficients do not appear to be shrunk all the way to zero, but that is because we standardized the data first.
Now we should check whether this does better than an ordinary least squares fit.
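A sketch of the corresponding ordinary least squares fit with the same resampling setup (object names are illustrative):

lmfit <- train(Salary ~ ., data = train, method = "lm",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 10))
lmfit
lmfit.pred <- predict(lmfit, test)                        # predictions on the held-out set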
## Linear Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ...
##
## Resampling results
##
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   300   0.5       70       0.2
coef(lmfit$finalModel)

## (Intercept)       AtBat        Hits       HmRun        Runs         RBI
##     535.958     327.835     591.667      73.964     169.699     162.024
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns
##     234.093      60.557     125.017     529.709      45.888     680.654
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists
##     393.276     399.506      19.118      46.679       4.898      41.271
##      Errors  NewLeagueN
##      22.672      22.469

sqrt(mean(lmfit.pred - test$Salary)^2)

## [1] 17.62
As we can see, the ridge regression fit has a slightly lower RMSE and a higher R^2 than the ordinary linear model.
Lasso
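A sketch of the lasso fit; caret's "lasso" method tunes the fraction of the full L1 norm, and the fraction grid and object names below are taken from, or chosen to match, the output that follows:

lasso <- train(Salary ~ ., data = train, method = "lasso",
               preProcess = c("center", "scale"),
               tuneGrid = data.frame(fraction = c(0.1, 0.5, 0.9)),
               trControl = trainControl(method = "cv", number = 10))
lasso
# Coefficients at the selected fraction (s = 0.5)
predict(lasso$finalModel, type = "coef", mode = "fraction", s = 0.5)
lasso.pred <- predict(lasso, test)                        # predictions on the held-out set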
## The lasso
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
##   fraction  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0.1       300   0.6       70       0.2
##   0.5       300   0.6       60       0.2
##   0.9       300   0.6       70       0.2
##
## RMSE was used to select the best model using the smallest value.
## The final value used for the model was fraction = 0.5.
## $s
## [1] 0.5
##
## $fraction
##   0
## 0.5
##
## $mode
## [1] "fraction"
##
## $coefficients
##      AtBat       Hits      HmRun       Runs        RBI      Walks
##    227.113    406.285      0.000     48.612     93.740    197.472
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI
##     47.952      0.000      0.000     82.291    274.745    166.617
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors
##    287.549     18.059     41.697      7.001     30.768     26.407
## NewLeagueN
##     19.190
sqrt(mean(lasso.pred - test$Salary)^2)

## [1] 14.35
With the lasso, we see that several coefficients have been forced to exactly zero. Its test error is comparable to that of ridge regression, and it has the advantage over the ordinary linear model of producing a sparser, more interpretable fit.
PCR and PLS
Principal component regression
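The PCR fit below appears to come from the pls package (fit method svdpc, 10-segment cross-validation); a minimal sketch under that assumption (the seed and the choice of 3 components are illustrative):

library(pls)
set.seed(123)                                             # illustrative seed
pcr.fit <- pcr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
validationplot(pcr.fit, val.type = "MSEP")                # cross-validated MSE by number of components
pcr.pred <- predict(pcr.fit, test, ncomp = 3)             # illustrative choice of 3 components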
## Data:    X dimension: 133 19
##          Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    336.9    323.9    328.5    328.4    329.9    337.1
## adjCV        451.5    336.3    323.6    327.8    327.5    328.8    335.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       335.2    333.7    338.5     334.3     337.8     340.4     346.7
## adjCV    332.5    331.7    336.4     332.0     335.5     337.6     343.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.1     345.7     329.4     337.3     343.5     338.7
## adjCV     341.2     341.6     325.7     332.7     338.4     333.9
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         36.55    60.81    71.75    80.59    85.72    89.76    92.74
## Salary    45.62    50.01    51.19    51.98    53.23    53.36    55.63
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.37    96.49     97.45     98.09     98.73     99.21     99.52
## Salary    56.48    56.73     58.57     58.92     59.34     59.44     62.01
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.77     99.90     99.97     99.99    100.00
## Salary     62.65     65.29     66.48     66.77     67.37
The summary reports the cross-validated error (RMSEP) and the percentage of variance explained on the training data. Plotting the cross-validated MSE shows that the lowest error is reached with only a few components, so we can use just two or three components instead of all 19 and still explain most of the variance, a considerable simplification compared with least squares on the full set of predictors.
Execute on the test data set.
sqrt(mean((pcr.pred - test$Salary)^2))

## [1] 374.8
This is the test-set RMSE for the PCR fit, and it is in line with the cross-validated error estimates shown above.
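The next output, labelled "Principal Component Analysis", looks like caret's pcr method with a small ncomp grid; a sketch under that assumption (object names are illustrative):

pcr.caret <- train(Salary ~ ., data = train, method = "pcr",
                   preProcess = c("center", "scale"),
                   tuneGrid = data.frame(ncomp = 1:3),
                   trControl = trainControl(method = "cv", number = 10))
pcr.caret
pcr.pred <- predict(pcr.caret, test)                      # predictions from the tuned PCR model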
## Principal Component Analysis
##
## 133 samples
##  19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ...
##
## Resampling results across tuning parameters:
##
##   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
##   1      300   0.5       100      0.2
##   2      300   0.5       100      0.2
##   3      300   0.6       100      0.2
##
## RMSE was used to select the best model using the smallest value.
## The final value used for the model was ncomp = 3.
Here the best model is chosen with 3 components.
sqrt(mean(pcr.pred - test$Salary)^2)

## [1] 21.86
However, the PCR results are not easy to interpret, because each component is a linear combination of all 19 original predictors.
Partial Least Squares
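A matching sketch for partial least squares using the pls package (kernelpls is its default algorithm; object names are illustrative):

pls.fit <- plsr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
pls.pred <- predict(pls.fit, test, ncomp = 2)             # M = 2, as chosen below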
## Data:    X dimension: 133 19
##          Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    328.9    328.4    332.6    329.2    325.4    323.4
## adjCV        451.5    328.2    327.4    330.6    326.9    323.0    320.9
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       318.7    318.7    316.3     317.6     316.5     317.0     319.2
## adjCV    316.2    315.5    313.5     314.9     313.6     313.9     315.9
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        323.0     323.8     325.4     324.5     323.6     321.4
## adjCV     319.3     320.1     321.4     320.5     319.9     317.8
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         35.94    55.11    67.37    74.29    79.65    85.17    89.17
## Salary    51.56    54.90    57.72    59.78    61.50    62.94    63.96
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.55    93.49     95.82     97.05     97.67     98.45     98.67
## Salary    65.34    65.75     66.03     66.44     66.69     66.77     66.94
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.02     99.26     99.42     99.98    100.00
## Salary     67.02     67.11     67.24     67.26     67.37
The best M is 2. Evaluate the corresponding test error.
sqrt(mean(pls.pred - test$Salary)^2)

## [1] 14.34
Compared with PCR, here we can see an improvement in RMSE.