Regression Using k-Nearest Neighbors in R

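Machine learning is a subset of artificial intelligence that gives machines the ability to learn automatically without being explicitly programmed. The machine improves from experience without human intervention and adjusts its actions accordingly. It has three main types:

Supervised machine learning
Unsupervised machine learning
Reinforcement learning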

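K-Nearest Neighbors

The K-Nearest Neighbors (K-NN) algorithm classifies data by proximity, which implicitly draws a decision boundary between the classes. When a new data point arrives for prediction, the algorithm assigns it to the class that is most common among its nearest neighbors in the training data. It follows the principle "birds of a feather flock together". The algorithm is straightforward to implement in R.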

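K-NN Algorithm

Choose K, the number of neighbors.
Compute the Euclidean distance from the new data point to each point in the training data.
Take the K nearest neighbors according to the computed Euclidean distances.
Among these K neighbors, count the number of data points in each category.
Assign the new data point to the category with the most neighbors.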
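The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `knn_classify` is a hypothetical helper written for this example (it is not part of any package), the toy data is made up, and ties are broken by taking the first class in the vote table.

```r
# Minimal k-NN classifier for a single query point (illustrative sketch).
# knn_classify is a hypothetical helper, not a function from any package.
knn_classify <- function(train_x, train_y, query, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the query to every training point
  diffs <- sweep(train_x, 2, query)   # subtract the query from each row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Count the neighbours in each class
  votes <- table(train_y[nearest])
  # Return the class with the most votes (the first one on a tie)
  names(votes)[which.max(votes)]
}

# Toy data (made up for illustration): two features, two classes
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(6, 6), c(7, 6), c(6, 7))
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

pred_low  <- knn_classify(train_x, train_y, query = c(1.5, 1.5), k = 3)
pred_high <- knn_classify(train_x, train_y, query = c(6.5, 6.5), k = 3)
print(c(pred_low, pred_high))
```

The query near the first cluster is voted into class "0" and the query near the second cluster into class "1". In practice the `knn()` function from the `class` package does the same job for a whole test set at once.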

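Implementation in R

Dataset: a sample of 400 people who shared their age, gender, and salary with a product company, along with whether they bought the product (0 means no, 1 means yes). Download the dataset: Advertisement.csv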

# Importing the dataset
dataset = read.csv('Advertisement.csv')
head(dataset, 10)

Output

Index	User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
5	15728773	Male	27	58000	0
6	15598044	Female	27	84000	0
7	15694829	Female	32	150000	1
8	15600575	Male	25	33000	0
9	15727311	Female	35	65000	0

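# Keep only the columns used below: the two
# features and the target. The code that follows
# treats column 3 as the target, so the ID and
# gender columns must be dropped first.
# Column names are assumed to match the CSV header.
dataset = dataset[, c('Age', 'EstimatedSalary', 'Purchased')]

# Encoding the target
# feature as factor
dataset$Purchased = factor(dataset$Purchased,
                           levels = c(0, 1))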
  
# Splitting the dataset into 
# the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, 
                     SplitRatio = 0.75)
training_set = subset(dataset, 
                      split == TRUE)
test_set = subset(dataset, 
                  split == FALSE)
  
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
  
# Fitting K-NN to the Training set 
# and Predicting the Test set results
library(class)
y_pred = knn(train = training_set[, -3],
             test = test_set[, -3],
             cl = training_set[, 3],
             k = 5,
             prob = TRUE)
  
# Making the Confusion Matrix
cm = table(test_set[, 3], y_pred)

The training set contains 300 entries.
The test set contains 100 entries.
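The script above standardises the test set with its own mean and standard deviation. A common alternative, shown here only as a sketch and not as what the original script does, is to reuse the centre and scale computed on the training set so that both sets live on exactly the same scale. The data frames `train_features` and `test_features` are made-up stand-ins for the feature columns of `training_set` and `test_set`.

```r
# Toy feature tables (values made up for illustration)
train_features <- data.frame(Age = c(20, 30, 40, 50),
                             EstimatedSalary = c(20000, 40000, 60000, 80000))
test_features  <- data.frame(Age = c(25, 45),
                             EstimatedSalary = c(30000, 70000))

# Standardise the training features and keep the parameters scale() used
train_scaled <- scale(train_features)
train_center <- attr(train_scaled, "scaled:center")  # column means
train_sd     <- attr(train_scaled, "scaled:scale")   # column standard deviations

# Apply the training mean and standard deviation to the test features
test_scaled <- scale(test_features, center = train_center, scale = train_sd)
print(test_scaled)
```

With this variant, a test value equal to the training mean maps to 0 regardless of what else is in the test set. Results obtained this way can differ slightly from those of the original script.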

Confusion matrix result (rows: actual class, columns: predicted class):

      y_pred
        0   1
  0    64   4
  1     3  29
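The confusion matrix can be turned into summary metrics with a few lines of R. The sketch below re-enters the four reported counts by hand and assumes, as `table(test_set[, 3], y_pred)` implies, that rows are the actual classes and columns the predicted classes.

```r
# Counts copied from the reported confusion matrix
# (assumed layout: rows = actual class, columns = predicted class)
cm <- matrix(c(64, 4,
               3, 29),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("0", "1")))

# Share of all test points that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)
# Of the points predicted as buyers (1), the share that really bought
precision <- cm["1", "1"] / sum(cm[, "1"])
# Of the actual buyers (1), the share that was found
recall <- cm["1", "1"] / sum(cm["1", ])

print(c(accuracy = accuracy, precision = precision, recall = recall))
```

With these counts, 93 of the 100 test points are classified correctly, which gives an accuracy of 0.93.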

Visualizing the training data

# Visualising the Training set results
# Only base R graphics and knn() from the
# class package (loaded above) are used here,
# so the ElemStatLearn package is not required.
set = training_set
  
# Building a grid of Age Column(X1)
# and Estimated Salary(X2) Column
X1 = seq(min(set[, 1]) - 1,
         max(set[, 1]) + 1,
         by = 0.01)
X2 = seq(min(set[, 2]) - 1, 
         max(set[, 2]) + 1, 
         by = 0.01)
grid_set = expand.grid(X1, X2)
  
# Give name to the columns of matrix
colnames(grid_set) = c('Age',
                       'EstimatedSalary')
  
# Predicting the values and plotting
# them to grid and labelling the axes
y_grid = knn(train = training_set[, -3],
             test = grid_set,
             cl = training_set[, 3],
             k = 5)
plot(set[, -3],
     main = 'K-NN (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), 
                       length(X1), length(X2)),
                       add = TRUE)
points(grid_set, pch = '.',
       col = ifelse(y_grid == 1, 
                    'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 
                                  'green4', 'red3'))

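Output

[Figure: K-NN (Training set) plot of Age against Estimated Salary, showing the predicted class regions and the training points]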

Visualizing the test data

# Visualising the Test set results
# As for the training plot, only base R graphics
# and knn() from the class package are needed.
set = test_set
  
# Building a grid of Age Column(X1)
# and Estimated Salary(X2) Column
X1 = seq(min(set[, 1]) - 1,
         max(set[, 1]) + 1, 
         by = 0.01)
X2 = seq(min(set[, 2]) - 1,
         max(set[, 2]) + 1, 
         by = 0.01)
grid_set = expand.grid(X1, X2)
  
# Give name to the columns of matrix
colnames(grid_set) = c('Age', 
                       'EstimatedSalary')
  
# Predicting the values and plotting 
# them to grid and labelling the axes
y_grid = knn(train = training_set[, -3], 
             test = grid_set,
             cl = training_set[, 3], k = 5)
plot(set[, -3],
     main = 'K-NN (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), 
                       length(X1), length(X2)),
                       add = TRUE)
points(grid_set, pch = '.', col = 
       ifelse(y_grid == 1, 
              'springgreen3', 'tomato'))
points(set, pch = 21, bg =
       ifelse(set[, 3] == 1,
              'green4', 'red3'))

Output

[Figure: K-NN (Test set) plot of Age against Estimated Salary, showing the predicted class regions and the test points]

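Advantages

No training period. K-NN is an instance-based learning algorithm, also called a lazy learner. It does not derive any discriminative function from the training data; it stores the training dataset and uses it only when a prediction is requested.
New data can be added seamlessly. Because there is no training step, newly added data points need no retraining and are used by the very next prediction.
Easy to implement. K-NN needs only two parameters: the value of K and a distance function such as the Euclidean distance.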
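A small base R sketch can make these points concrete. Everything in it is an assumption made for illustration: `knn_predict`, `euclidean`, and `manhattan` are hypothetical helpers, the data is made up, and the Manhattan distance is included only as one possible alternative to the Euclidean distance.

```r
# Distance from every row of x to one query point q
euclidean <- function(x, q) sqrt(rowSums(sweep(x, 2, q)^2))
manhattan <- function(x, q) rowSums(abs(sweep(x, 2, q)))

# A lazy learner has no fit step: the stored data is the model,
# and all the work happens when a prediction is requested.
knn_predict <- function(x, y, query, k, dist_fn = euclidean) {
  nearest <- order(dist_fn(x, query))[seq_len(k)]
  votes <- table(y[nearest])
  names(votes)[which.max(votes)]
}

# Stored training data (made up for illustration)
x <- rbind(c(1, 1), c(2, 2), c(8, 8), c(9, 9))
y <- c("no", "no", "yes", "yes")
before <- knn_predict(x, y, query = c(4, 4), k = 1)

# Adding a new labelled point is just appending a row: no retraining
x2 <- rbind(x, c(5, 5))
y2 <- c(y, "yes")
after <- knn_predict(x2, y2, query = c(4, 4), k = 1, dist_fn = manhattan)

print(c(before = before, after = after))
```

The only inputs besides the data are `k` and the distance function. The appended point takes effect on the next call, so the prediction for the same query changes from "no" to "yes" without any retraining step.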