K-Fold Cross-Validation in R

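The primary goal of any machine learning model is to predict outcomes on real-world data. To check whether a model predicts unseen data points well enough, its performance has to be evaluated. K-fold cross-validation is a resampling method for evaluating machine learning models. The parameter K is the number of subsets (folds) into which the data set is split. In each round, K-1 folds are used to train the model and the remaining fold serves as the validation set.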

The steps of K-fold cross-validation are:

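Randomly split the data set into K subsets (folds).
For each fold:
- Use that fold as the validation set.
- Use all remaining folds for training.
- Train the model and evaluate it on the validation fold.
- Compute the prediction error.
Repeat the steps above K times, that is, until every fold has served as the validation set exactly once.
Compute the overall prediction error as the average of the K per-fold prediction errors.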
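
The steps above can be written out by hand in base R. The following is a minimal sketch, not the procedure caret uses internally. It assumes the built-in mtcars data set, a linear model of mpg on wt and hp, and RMSE as the prediction error; none of these choices come from the article.

```r
# Manual K-fold cross-validation in base R (illustrative sketch).
# Assumptions: built-in mtcars data, lm() as the model, RMSE as the error.
set.seed(42)

k <- 5
n <- nrow(mtcars)

# Step 1: randomly assign every row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  # Step 2: fold i is the validation set, all other folds are for training
  train_set <- mtcars[fold_id != i, ]
  valid_set <- mtcars[fold_id == i, ]

  # Train on k-1 folds and predict the held-out fold
  fit  <- lm(mpg ~ wt + hp, data = train_set)
  pred <- predict(fit, newdata = valid_set)

  # Prediction error for this fold
  fold_rmse[i] <- sqrt(mean((valid_set$mpg - pred)^2))
}

# Final step: average the k per-fold errors
cv_rmse <- mean(fold_rmse)
print(fold_rmse)
print(cv_rmse)
```

Because the loop runs once per fold, each row is used for validation exactly once and for training k-1 times.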

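R has a rich set of libraries and built-in functions that make it easy to carry out all the steps of the K-fold method. Below is a step-by-step procedure for using K-fold cross-validation with classification and regression models.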

Implementing K-Fold Cross-Validation for Classification

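Classification models are used when the target variable takes categorical values, such as spam or not spam, true or false. Here a Naive Bayes classifier is used as a probabilistic classifier to predict the class label of the target variable.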

Step 1: Load the data set and the required packages

The first step is to set up the R environment by loading all required libraries and packages. The code for this step is shown below.

# loading required packages
 
# package to perform data manipulation
# and visualization
library(tidyverse)
 
# package to compute
# cross - validation methods
library(caret)
 
# loading package to
# import desired dataset
library(ISLR)

Step 2: Explore the data set

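Before working with a data set, it is important to inspect it. This gives a clear picture of its structure and the data types it contains. To do that, the data set is first assigned to a variable. The code is shown below.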

# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
 
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
 
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)

Output

Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1…
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0…
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -…
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, …
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, …
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, …
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0…
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up…

> table(dataset$Direction)

Down   Up
 602  648

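According to the output above, the data set contains 1,250 rows and 9 columns. The independent variables have type <dbl>, short for double, a double-precision floating-point number. The target variable has type <fct>, meaning factor, which is what a classification model needs. The target variable has two outcomes, Down and Up, in a ratio of roughly 1:1 (602 to 648), so the classes are balanced. Roughly equal class proportions help in building an unbiased model.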

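When the classes are imbalanced, several techniques can rebalance them, such as:

Down-sampling
Up-sampling
Hybrid sampling using SMOTE and ROSE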
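
To make the first two techniques concrete, here is a minimal base R sketch of down-sampling and up-sampling. The toy data frame and its 90/30 class split are hypothetical and exist only for illustration; the Smarket data is already balanced and does not need this step. The caret package also ships downSample() and upSample() helpers for the same purpose.

```r
# Down-sampling and up-sampling by hand (illustrative sketch).
set.seed(1)

# Hypothetical imbalanced toy data: 90 "Down" rows and 30 "Up" rows
toy <- data.frame(
  x = rnorm(120),
  Direction = factor(rep(c("Down", "Up"), times = c(90, 30)))
)

# Row indices grouped by class
class_rows <- split(seq_len(nrow(toy)), toy$Direction)
n_min <- min(lengths(class_rows))
n_max <- max(lengths(class_rows))

# Down-sampling: keep n_min rows of every class, drawn without replacement
down_idx <- unlist(lapply(class_rows, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))
down <- toy[down_idx, ]

# Up-sampling: draw n_max rows of every class, with replacement
up_idx <- unlist(lapply(class_rows, function(idx) {
  idx[sample.int(length(idx), n_max, replace = TRUE)]
}))
up <- toy[up_idx, ]

print(table(down$Direction))
print(table(up$Direction))
```

When rebalancing is combined with cross-validation, it is generally safer to resample only the training folds, so that the validation fold keeps the original class distribution.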

Step 3: Build the model with K-fold cross-validation

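In this step, the trainControl() function sets the value of the parameter K, and train() then builds the model following the steps of the K-fold technique. Note that caret's "nb" method relies on the klaR package, which must be installed for the code to run. The implementation is shown below.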

# setting seed to generate a 
# reproducible random sampling
set.seed(123)
 
# define training control which
# generates parameters that further
# control how models are created
train_control <- trainControl(method = "cv",
                              number = 10)
 
 
# building the model and
# predicting the target variable
# as per the Naive Bayes classifier
model <- train(Direction~., data = dataset,
               trControl = train_control,
               method = "nb")

Step 4: Evaluate the accuracy of the model

After the model has been trained and validated, the next step is to compute its overall accuracy. The code below prints the model summary.

# summarize results of the
# model after calculating
# prediction error in each case
print(model)

Output

Naive Bayes

1250 samples
   8 predictor
   2 classes: ‘Down’, ‘Up’

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1125, 1125, 1125, 1126, 1125, 1124, …
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.9543996  0.9083514
   TRUE      0.9711870  0.9422498

Tuning parameter ‘fL’ was held constant at a value of 0
Tuning parameter ‘adjust’ was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
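
The Accuracy and Kappa columns in the summary above can be understood from a confusion matrix. The sketch below computes both in base R using the standard definition of Cohen's kappa. The label vectors are hypothetical and are not predictions from the model above.

```r
# Accuracy and Cohen's kappa from a confusion matrix (illustrative sketch).
# The labels below are hypothetical, made up only to show the arithmetic.
lvls <- c("Down", "Up")
actual    <- factor(c(rep("Down", 50), rep("Up", 50)), levels = lvls)
predicted <- factor(c(rep("Down", 40), rep("Up", 10),   # actual Down
                      rep("Down", 5),  rep("Up", 45)),  # actual Up
                    levels = lvls)

conf_mat <- table(Predicted = predicted, Actual = actual)
n <- sum(conf_mat)

# Accuracy: share of predictions that match the actual label
accuracy <- sum(diag(conf_mat)) / n

# Agreement expected by chance, from the row and column totals
expected_agreement <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2

# Kappa: accuracy corrected for chance agreement
kappa_value <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(conf_mat)
print(accuracy)
print(kappa_value)
```

A kappa near 0 means the classifier does no better than chance agreement, while a value near 1 means almost perfect agreement.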

Implementing K-Fold Cross-Validation for Regression

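Regression models are used to predict a target variable that is continuous, such as the price of a product or a company's sales. Below are the complete steps for applying K-fold cross-validation to a regression model.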

Step 1: Import the required packages

Set up the R environment by importing all necessary packages and libraries. The code for this step is shown below.

# loading required packages
 
# package to perform data manipulation
# and visualization
library(tidyverse)
 
# package to compute
# cross - validation methods
library(caret)
 
# installing the package that provides
# the desired dataset, only if it is missing
if (!requireNamespace("datarium", quietly = TRUE)) {
  install.packages("datarium")
}

Step 2: Load and inspect the data set

In this step, the required data set is loaded into the R environment. A few rows are then printed to show its structure. The code is shown below.

# loading the dataset
data("marketing", package = "datarium")
 
# inspecting the dataset
head(marketing)

Output

  youtube facebook newspaper sales
1  276.12    45.36     83.04 26.52
2   53.40    47.16     54.12 12.48
3   20.64    55.08     83.16 11.16
4  181.80    49.56     70.20 22.20
5  216.96    12.96     70.08 15.48
6   10.44    58.68     90.00  8.64

Step 3: Build the model with K-fold cross-validation

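The value of K is set in the trainControl() function, and the model is built following the steps of the K-fold cross-validation algorithm. The implementation is shown below.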

# setting seed to generate a 
# reproducible random sampling
set.seed(125) 
 
# defining training control
# as cross-validation and 
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)
 
# training the model by assigning sales column
# as target variable and rest other column
# as independent variable
model <- train(sales ~., data = marketing, 
               method = "lm",
               trControl = train_control)
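
A fitted caret model also keeps the per-fold metrics, which can be useful for seeing how much the error varies between folds. The sketch below uses the built-in mtcars data and 5 folds instead of the marketing data so that it runs without datarium; the object name cv_model is introduced here for illustration. It skips itself if caret is not installed.

```r
# Inspecting per-fold results of a caret model (illustrative sketch).
# Assumptions: built-in mtcars data and 5 folds, chosen only for the demo.
if (requireNamespace("caret", quietly = TRUE)) {
  library(caret)
  set.seed(125)

  ctrl <- trainControl(method = "cv", number = 5)
  cv_model <- train(mpg ~ wt + hp, data = mtcars,
                    method = "lm",
                    trControl = ctrl)

  # One row of metrics per held-out fold
  print(cv_model$resample)

  # The reported RMSE is the mean of the per-fold RMSE values
  print(mean(cv_model$resample$RMSE))
  print(cv_model$results$RMSE)
} else {
  message("Package 'caret' is not installed; skipping the example.")
}
```

The same inspection should work for the model trained on the marketing data, where model$resample would hold 10 rows, one per fold.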

Step 4: Evaluate model performance

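As described in the K-fold algorithm, the model is tested on every fold (subset) of the data set, the prediction error is computed each time, and the average of all prediction errors is taken as the model's final performance score. The code below prints the final score and an overall summary of the model.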

# printing model performance metrics
# along with other details
print(model)

Output

Linear Regression

200 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, …
Resampling results:

  RMSE      Rsquared   MAE
  2.027409  0.9041909  1.539866

Tuning parameter ‘intercept’ was held constant at a value of TRUE
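
The three metrics in the summary above can be computed by hand for a single fold. The sketch below uses hypothetical observed and predicted values, not the marketing data. It computes R-squared as the squared correlation between predictions and observations, which is, to my understanding, caret's default; treat that detail as an assumption.

```r
# RMSE, R-squared and MAE for one set of predictions (illustrative sketch).
# The observed and predicted values are hypothetical.
obs  <- c(10, 12, 15, 20)
pred <- c(11, 12, 14, 22)

err <- pred - obs

# Root mean squared error: penalizes large errors more strongly
rmse <- sqrt(mean(err^2))

# Mean absolute error: average size of the errors
mae <- mean(abs(err))

# R-squared as the squared correlation of predictions and observations
# (assumed to match caret's default definition)
rsq <- cor(pred, obs)^2

print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))
```

In K-fold cross-validation these quantities are computed once per validation fold and then averaged over the K folds.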