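How to Include Interaction in Regression Using R Programming
In this article, we look at what an interaction is and whether we should include interactions in a model to get better results.
Including Interaction in Regression Using R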
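Suppose X1 and X2 are features of a dataset and Y is the class label or output we are trying to predict. What does it mean for the features to interact? X1 and X2 interact if the effect of X1 on Y depends on the value of X2, and vice versa. Knowing what an interaction is, we also need to know when to account for it in a model to improve precision or accuracy. We will use R to do this.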
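One common way to write this definition down, as a sketch, is a linear model with a product term. The coefficient symbols \beta_0 to \beta_3 and the error term \varepsilon are notation assumed here for illustration; they do not appear in the original text.

```latex
% Linear model with an interaction (product) term.
% \beta_0..\beta_3 and \varepsilon are assumed notation, not from the article.
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon

% Effect of a one-unit change in X_1 on the expected value of Y:
\frac{\partial\, \mathbb{E}[Y]}{\partial X_1} = \beta_1 + \beta_3 X_2
```

When \beta_3 = 0, the effect of X1 is the same at every value of X2, which is the no-interaction case.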
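Should We Include Interaction in Our Model?
Before including an interaction in a model, ask two questions:
- Does the interaction make sense conceptually?
- Is the interaction term statistically significant? In other words, do we believe the slopes of the regression lines differ significantly?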
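The second question can be checked by comparing a model with the interaction against one without it. The following is a minimal sketch on simulated data, not the lung capacity dataset; the variable names (age, smoke, y), the sample size, and the coefficients used to generate the data are all assumptions made for illustration.

```r
# Hypothetical simulated data (NOT the lung capacity dataset).
set.seed(1)
n     <- 200
age   <- runif(n, 3, 19)
smoke <- factor(sample(c("no", "yes"), n, replace = TRUE))

# The data are generated with parallel lines, i.e. no true interaction.
y   <- 1 + 0.55 * age - 0.6 * (smoke == "yes") + rnorm(n, sd = 1.5)
sim <- data.frame(y = y, age = age, smoke = smoke)

# Nested models: additive vs. additive plus interaction.
fit_additive <- lm(y ~ age + smoke, data = sim)
fit_interact <- lm(y ~ age * smoke, data = sim)

# Partial F-test for the extra interaction coefficient.
cmp <- anova(fit_additive, fit_interact)
print(cmp)

# With a single extra coefficient, the F statistic equals the
# squared t value of the interaction term in the larger model.
t_int <- summary(fit_interact)$coefficients["age:smokeyes", "t value"]
f_int <- cmp$F[2]
```

A large p-value in the anova table would suggest that the slopes do not differ detectably between the groups, so the simpler additive model may be enough.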
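Implementation in R
Let us look at interaction in a linear regression model through an example.
- Dataset: the lung capacity dataset (LungCapData)
- Variables:
  - Dependent variable (Y): lung capacity (LungCap)
  - Explanatory variable (X1): smoking status (Smoke, yes/no)
  - Explanatory variable (X2): age (Age)
Example
Step 1: Load the dataset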
# Read in the Lung Cap Data
LungCapData <- read.table(file.choose(),
                          header = TRUE,
                          sep = "\t")
# Attach LungCapData
attach(LungCapData)
Step 2: Plot the data, using different colours for smokers (red) and non-smokers (blue)
# Plot the data, using different
# colours for smoke(red)/non-smoke(blue)
# First, plot the data for
# the Non-Smokers, in Blue
plot(Age[Smoke == "no"],
     LungCap[Smoke == "no"],
     col = "blue",
     ylim = c(0, 15), xlim = c(0, 20),
     xlab = "Age", ylab = "LungCap",
     main = "LungCap vs. Age,Smoke")
Output: [figure: scatter plot of LungCap against Age for non-smokers, in blue]
# Now, add in the points for
# the Smokers, in Solid Red Circles
points(Age[Smoke == "yes"],
       LungCap[Smoke == "yes"],
       col = "red", pch = 16)
Output: [figure: the same plot with smokers added as solid red circles]
# And, add in a legend
legend(1, 15,
       legend = c("NonSmoker", "Smoker"),
       col = c("blue", "red"),
       pch = c(1, 16), bty = "n")
Output: [figure: the same plot with a legend for NonSmoker and Smoker]
Step 3: Fit a regression model using Age, Smoke, and their interaction, so that the regression lines can be added to the plot
# Fit a Reg Model, using Age,
# Smoke, and their INTERACTION
model1 <- lm(LungCap ~ Age*Smoke)
coef(model1)
Output:
 (Intercept)          Age     Smokeyes Age:Smokeyes
  1.05157244   0.55823350   0.22601390  -0.05970463
# Note, that the "*" fits a model with
# Age, Smoke and AgeXSmoke INT.
# Note, also that the same model
# can be fit using the ":"
model1 <- lm(LungCap ~ Age + Smoke + Age:Smoke)
# Ask for a summary of the model
summary(model1)
Output:
Call:
lm(formula = LungCap ~ Age + Smoke + Age:Smoke)
Residuals:
Min 1Q Median 3Q Max
-4.8586 -1.0174 -0.0251 1.0004 4.1996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05157 0.18706 5.622 2.7e-08 ***
Age 0.55823 0.01473 37.885 < 2e-16 ***
Smokeyes 0.22601 1.00755 0.224 0.823
Age:Smokeyes -0.05970 0.06759 -0.883 0.377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.515 on 721 degrees of freedom
Multiple R-squared: 0.6776, Adjusted R-squared: 0.6763
F-statistic: 505.1 on 3 and 721 DF, p-value: < 2.2e-16
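The coefficients in this summary imply one regression line per group, because Smokeyes is 1 for smokers and 0 for non-smokers. As a worked sketch of that arithmetic (the hat notation and the use of Smokeyes as an indicator symbol are assumptions for illustration):

```latex
% Fitted model from summary(model1); Smokeyes = 1 for smokers, 0 for non-smokers
\widehat{\text{LungCap}} = 1.05157 + 0.55823\,\text{Age} + 0.22601\,\text{Smokeyes} - 0.05970\,(\text{Age} \times \text{Smokeyes})

% Non-smokers (Smokeyes = 0)
\widehat{\text{LungCap}} = 1.05157 + 0.55823\,\text{Age}

% Smokers (Smokeyes = 1)
\widehat{\text{LungCap}} = (1.05157 + 0.22601) + (0.55823 - 0.05970)\,\text{Age} = 1.27758 + 0.49853\,\text{Age}
```

Rounded to three decimals, these are the intercepts and slopes that abline needs for each group. Rounding the coefficients before subtracting gives 0.558 - 0.060 = 0.498 for the smokers' slope, a negligible difference.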
Step 4: Add the regression lines from our model using the abline command
# Now, let's add in the regression
# lines from our model using the
# abline command for the Non-Smokers, in Blue
abline(a = 1.052, b = 0.558,
       col = "blue", lwd = 3)
Output: [figure: the scatter plot with the non-smoker regression line in blue]
# And now, add in the line for Smokers, in Red
abline(a = 1.278, b = 0.498,
       col = "red", lwd = 3)
Output: [figure: the scatter plot with both regression lines, non-smokers in blue and smokers in red]
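Rather than typing rounded intercepts and slopes by hand, the same lines can be derived from the coefficient vector. This sketch assumes the coefficient order printed by coef(model1) earlier (intercept, Age, Smokeyes, Age:Smokeyes) and copies those printed values into a plain vector so that it runs without the dataset; group_lines is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: turn the four coefficients of a model of the form
# y ~ x * g (g a two-level factor) into one intercept/slope pair per group.
# Assumes the order: intercept, x, g, x:g.
group_lines <- function(b) {
  list(
    reference = c(intercept = unname(b[1]),
                  slope     = unname(b[2])),
    other     = c(intercept = unname(b[1] + b[3]),
                  slope     = unname(b[2] + b[4]))
  )
}

# Values copied from the coef(model1) output shown earlier.
b <- c("(Intercept)"  = 1.05157244,
       Age            = 0.55823350,
       Smokeyes       = 0.22601390,
       "Age:Smokeyes" = -0.05970463)

lines_by_group <- group_lines(b)
print(lines_by_group)

# With model1 fitted and the scatter plot open, one might instead call
# group_lines(coef(model1)) and pass the results to abline, e.g.:
# abline(a = lines_by_group$other["intercept"],
#        b = lines_by_group$other["slope"],
#        col = "red", lwd = 3)
```

Deriving the values from the fitted object avoids transcription and rounding slips if the model is refitted on different data.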
# Ask for that model summary again
summary(model1)
# Fit the model that does
# NOT include INTERACTION
model2 <- lm(LungCap ~ Age + Smoke)
summary(model2)
Output:
> summary(model1)
Call:
lm(formula = LungCap ~ Age + Smoke + Age:Smoke)
Residuals:
Min 1Q Median 3Q Max
-4.8586 -1.0174 -0.0251 1.0004 4.1996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05157 0.18706 5.622 2.7e-08 ***
Age 0.55823 0.01473 37.885 < 2e-16 ***
Smokeyes 0.22601 1.00755 0.224 0.823
Age:Smokeyes -0.05970 0.06759 -0.883 0.377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.515 on 721 degrees of freedom
Multiple R-squared: 0.6776, Adjusted R-squared: 0.6763
F-statistic: 505.1 on 3 and 721 DF, p-value: < 2.2e-16
> summary(model2)
Call:
lm(formula = LungCap ~ Age + Smoke)
Residuals:
Min 1Q Median 3Q Max
-4.8559 -1.0289 -0.0363 1.0083 4.1995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.08572 0.18299 5.933 4.61e-09 ***
Age 0.55540 0.01438 38.628 < 2e-16 ***
Smokeyes -0.64859 0.18676 -3.473 0.000546 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.514 on 722 degrees of freedom
Multiple R-squared: 0.6773, Adjusted R-squared: 0.6764
F-statistic: 757.5 on 2 and 722 DF, p-value: < 2.2e-16
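Reading the two summaries side by side: the Age:Smokeyes term in model1 has a p-value of 0.377, and the adjusted R-squared is essentially unchanged (0.6763 with the interaction, 0.6764 without). On that evidence, the simpler model2 appears adequate for these data. As a sketch of what model2 implies (the hat notation and the use of Smokeyes as an indicator symbol are assumptions for illustration), the two groups share one slope and differ only in intercept:

```latex
% Fitted model from summary(model2); Smokeyes = 1 for smokers, 0 for non-smokers
\widehat{\text{LungCap}} = 1.08572 + 0.55540\,\text{Age} - 0.64859\,\text{Smokeyes}

% Non-smokers (Smokeyes = 0)
\widehat{\text{LungCap}} = 1.08572 + 0.55540\,\text{Age}

% Smokers (Smokeyes = 1)
\widehat{\text{LungCap}} = (1.08572 - 0.64859) + 0.55540\,\text{Age} = 0.43713 + 0.55540\,\text{Age}
```

Under this model, the estimated difference between smokers and non-smokers is about 0.649 units of lung capacity at every age.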