Cross-Validation in R

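The main challenge in designing a machine learning model is making it perform accurately on unseen data. To find out whether a model works well, we have to test it on data points that were not present during training. These points act as unseen data for the model, which makes it straightforward to evaluate its accuracy. Cross-validation is one of the best techniques for checking the effectiveness of a machine learning model, and it is easy to implement in the R programming language. In this procedure, a portion of the dataset is held back and not used to train the model. Once the model is ready, the held-back data is used for testing. During the testing phase the values of the dependent variable are predicted, and the model's accuracy is computed from the prediction error, that is, the difference between the actual and predicted values of the dependent variable. Several statistical metrics are used to evaluate the accuracy of regression models: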

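Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the mean squared difference between the actual and predicted values of the target variable. It expresses the model's typical prediction error in the units of the target variable, so a lower RMSE indicates a more accurate model.
Mean Absolute Error (MAE): This metric is the mean of the absolute differences between the actual values of the target variable and the values predicted by the model. MAE is less sensitive to outliers than RMSE, so it is a reasonable choice when outliers should not dominate the assessment. A lower value indicates a better model.
R² (R-squared): This metric indicates what percentage of the variation in the dependent variable is explained jointly by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a scale of 0 to 100%. A better model therefore has a higher R-squared value.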
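The three metrics can be written in a few lines of base R. This is a minimal sketch: the helper names `rmse`, `mae`, and `r2` and the two small vectors are made up for illustration, and R-squared is computed here as the squared correlation between actual and predicted values, which is one common definition.

```r
# Minimal base-R versions of the three metrics (helper names are illustrative)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))
# R-squared taken as the squared correlation of actual and predicted values
r2   <- function(actual, predicted) cor(actual, predicted)^2

# Tiny made-up vectors, only to show the calls
actual    <- c(10, 12, 15, 20)
predicted <- c(11, 12, 14, 22)

rmse(actual, predicted)  # sqrt(1.5), about 1.22
mae(actual, predicted)   # 1
r2(actual, predicted)
```

The single error of size 2 pulls RMSE above MAE here, which is the outlier sensitivity mentioned in the list.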

Types of Cross-Validation

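When the complete dataset is split into a training set and a validation set, some important data points may end up outside the training set. Because those points are not used for training, the model has no chance to detect some of the patterns, which can lead to overfitting or underfitting. To avoid this, there are several cross-validation techniques that ensure random sampling of the training and validation sets and help maximize the accuracy of the model. Some of the most popular cross-validation techniques are: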

Validation set approach
Leave-one-out cross-validation (LOOCV)
K-fold cross-validation
Repeated K-fold cross-validation

Loading the Dataset

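To implement linear regression, we use the marketing dataset, which ships with the datarium package for R. The code below loads the dataset into the R environment.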

# loading required packages
 
# package to perform data manipulation
# and visualization
library(tidyverse)
 
# package to compute
# cross - validation methods
library(caret)
 
# installing package to
# import desired dataset
install.packages("datarium")
 
# loading the dataset
data("marketing", package = "datarium")
 
# inspecting the dataset
head(marketing)

Output

   youtube facebook newspaper sales
1  276.12    45.36     83.04 26.52
2   53.40    47.16     54.12 12.48
3   20.64    55.08     83.16 11.16
4  181.80    49.56     70.20 22.20
5  216.96    12.96     70.08 15.48
6   10.44    58.68     90.00  8.64

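Validation Set Approach (or Data Split)

In this approach, the dataset is randomly divided into a training set and a test set. The technique involves the following steps:

Randomly sample the dataset
Train the model on the training dataset
Apply the resulting model to the test dataset
Compute the prediction error using model performance metrics

Below is the implementation of this method.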

# R program to implement
# validation set approach
 
# setting seed to generate a
# reproducible random sampling
set.seed(123)
 
# creating training data as 80% of the dataset
random_sample <- createDataPartition(marketing$sales,
                                p = 0.8, list = FALSE)
 
# generating training dataset
# from the random_sample
training_dataset  <- marketing[random_sample, ]
 
# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- marketing[-random_sample, ]
 
# Building the model
 
# training the model by assigning sales column
# as target variable and rest other columns
# as independent variables
model <- lm(sales ~., data = training_dataset)
 
# predicting the target variable
predictions <- predict(model, testing_dataset)
 
# computing model performance metrics
data.frame( R2 = R2(predictions, testing_dataset$sales),
            RMSE = RMSE(predictions, testing_dataset$sales),
            MAE = MAE(predictions, testing_dataset$sales))

Output

       R2     RMSE      MAE
1 0.9049049 1.965508 1.433609

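Advantages

One of the most basic and simplest techniques for evaluating a model.
No complex implementation steps.

Disadvantages

The model's predictions depend heavily on which subset of observations is used for training and which for validation.
Training on only one subset of the data can bias the model.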
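The dependence on the particular split can be seen by repeating the validation set approach with different seeds. The sketch below is only an illustration under a few assumptions: it uses the built-in `mtcars` dataset so that it runs without extra packages, a plain `sample()` split instead of `createDataPartition()`, and a hypothetical helper named `split_rmse`.

```r
# Sketch: how much the validation-set RMSE moves with the random split.
# Uses the built-in mtcars data so it runs without extra packages.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical helper: one 80/20 split, one fit, one test-set RMSE
split_rmse <- function(seed, data, p = 0.8) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- sample(n, size = floor(p * n))  # simple random split
  fit  <- lm(mpg ~ wt + hp, data = data[train_idx, ])
  test <- data[-train_idx, ]
  rmse(test$mpg, predict(fit, newdata = test))
}

# Same model, same data, ten different random splits
rmse_by_seed <- sapply(1:10, split_rmse, data = mtcars)
round(rmse_by_seed, 2)
range(rmse_by_seed)
```

If the ten values differ noticeably, a single split gives only a rough picture of the model's error. To try this on the article's data, replace `mtcars` with `marketing` and the formula with `sales ~ .`.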

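Leave-One-Out Cross-Validation (LOOCV)

This method also splits the dataset into two parts, but it overcomes the drawbacks of the validation set approach. LOOCV performs cross-validation as follows:

Train the model on N-1 data points
Test the model on the one data point left out in the previous step
Compute the prediction error
Repeat the three steps above until every data point has been used once for testing
Generate the overall prediction error by averaging the prediction errors from all cases

Below is the implementation of this method.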

# R program to implement
# Leave one out cross validation
 
# defining training control
# as Leave One Out Cross Validation
train_control <- trainControl(method = "LOOCV")
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~., data = marketing,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)

Output

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.059984  0.8912074  1.539441

Tuning parameter 'intercept' was held constant at a value of TRUE
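The LOOCV steps listed earlier can also be written out by hand in base R, which makes the N model fits explicit. This is a sketch, not what `caret` does internally: it assumes the built-in `mtcars` dataset and a small linear model so that it runs without extra packages. The last lines use a known property of least-squares linear models, where the leave-one-out error of row i equals its ordinary residual divided by (1 - leverage), so a single fit is enough in that special case.

```r
# Manual leave-one-out cross-validation in base R (illustrative sketch)
data <- mtcars
n <- nrow(data)
loo_pred <- numeric(n)

for (i in seq_len(n)) {
  # train on the N-1 remaining rows
  fit <- lm(mpg ~ wt + hp, data = data[-i, ])
  # predict the single held-out row
  loo_pred[i] <- predict(fit, newdata = data[i, , drop = FALSE])
}

loo_err    <- data$mpg - loo_pred
loocv_rmse <- sqrt(mean(loo_err^2))
loocv_mae  <- mean(abs(loo_err))

# Shortcut for least-squares linear models: one fit on all rows,
# then residual / (1 - leverage) gives the same held-out errors
full_fit     <- lm(mpg ~ wt + hp, data = data)
shortcut_err <- residuals(full_fit) / (1 - hatvalues(full_fit))
all.equal(unname(shortcut_err), loo_err)
```

The loop works for any model type, while the shortcut applies only to linear models fitted by least squares.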

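Advantages

Nearly every data point is used for training, so the bias is low.
LOOCV involves no random splitting, because each data point is held out exactly once, so the performance metrics do not change from run to run.

Disadvantages

Training the model N times is computationally expensive when the dataset is large.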

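K-Fold Cross-Validation

This cross-validation technique divides the data into K subsets (folds) of nearly equal size. One of the K subsets is used as the validation set, and the remaining subsets are used to train the model. The complete procedure is as follows:

Randomly split the dataset into K subsets
Train the model on K-1 subsets
Test the model on the subset left out in the previous step
Repeat the steps above K times, that is, until each subset has served once as the test set
Generate the overall prediction error by averaging the prediction errors from all cases

Below is the implementation of this method.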

# R program to implement
# K-fold cross-validation
 
# setting seed to generate a
# reproducible random sampling
set.seed(125)
 
# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~., data = marketing,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)

Output

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.027409  0.9041909  1.539866

Tuning parameter 'intercept' was held constant at a value of TRUE
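To make the K-fold procedure concrete, the same steps can be sketched by hand in base R. The example assumes the built-in `mtcars` dataset and a small linear model so that it runs without extra packages. The variable names are illustrative, and the fold assignment is a simple random shuffle rather than whatever `caret` uses internally.

```r
# Manual 10-fold cross-validation in base R (illustrative sketch)
set.seed(125)
data <- mtcars
k <- 10
n <- nrow(data)

# Randomly assign each row to one of k folds of nearly equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (fold in seq_len(k)) {
  test_rows <- which(fold_id == fold)
  # train on the other K-1 folds
  fit  <- lm(mpg ~ wt + hp, data = data[-test_rows, ])
  # test on the held-out fold
  pred <- predict(fit, newdata = data[test_rows, ])
  fold_rmse[fold] <- sqrt(mean((data$mpg[test_rows] - pred)^2))
}

# Overall estimate: average of the per-fold errors
cv_rmse <- mean(fold_rmse)
cv_rmse
```

Each row is used for testing exactly once and for training K-1 times, which is the difference from a single train/test split.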

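Advantages

Fast to compute.
A very effective way to estimate a model's prediction error and accuracy.

Disadvantages

A low value of K gives a more biased estimate, while a high value of K increases the variance of the performance metrics. Choosing a suitable K is therefore important (K = 5 and K = 10 are the usual choices).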

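Repeated K-Fold Cross-Validation

As the name suggests, this method repeats the K-fold cross-validation algorithm a set number of times. Below is the implementation of this method.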

# R program to implement
# repeated K-fold cross-validation
 
# setting seed to generate a
# reproducible random sampling
set.seed(125)
 
# defining training control as
# repeated cross-validation and
# value of K is 10 and repetition is 3 times
train_control <- trainControl(method = "repeatedcv",
                            number = 10, repeats = 3)
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~., data = marketing,
               method = "lm",
               trControl = train_control)
 
# printing model performance metrics
# along with other details
print(model)

Output

Linear Regression 

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.020061  0.9038559  1.541517

Tuning parameter 'intercept' was held constant at a value of TRUE
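The idea of repeating K-fold cross-validation can be sketched in base R by rerunning the fold assignment with fresh random shuffles and averaging the results. This is an illustration under stated assumptions: it uses the built-in `mtcars` dataset and a hypothetical helper named `kfold_rmse`, and it is not a description of how `caret` is implemented.

```r
# Repeated K-fold in base R (illustrative sketch): rerun K-fold with
# fresh random fold assignments and average the results
kfold_rmse <- function(data, k) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  per_fold <- sapply(seq_len(k), function(fold) {
    test_rows <- fold_id == fold
    fit  <- lm(mpg ~ wt + hp, data = data[!test_rows, ])
    pred <- predict(fit, newdata = data[test_rows, ])
    sqrt(mean((data$mpg[test_rows] - pred)^2))
  })
  mean(per_fold)
}

set.seed(125)
# 3 repeats of 10-fold cross-validation
repeat_rmse <- replicate(3, kfold_rmse(mtcars, k = 10))
repeat_rmse        # one estimate per repeat
mean(repeat_rmse)  # repeated K-fold estimate
```

The spread among the per-repeat values shows how much a single K-fold run depends on the random fold assignment, and averaging over repeats smooths that out.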