Loading and cleaning data with R and the tidyverse

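The tidyverse is a collection of packages that work well together because they share data representations and API design. The tidyverse package itself exists so that the core tidyverse packages can be installed and loaded with a single command.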

To install the tidyverse, run the following code in RStudio.

# Install from CRAN
install.packages("tidyverse")
  
# to check your installation 
library(tidyverse)

Output

── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

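The tidyr package will be used for data cleaning and the readr package for data loading.

Loading data with readr

In this tutorial we use the read_csv() function from the readr package to read and parse a CSV file. A CSV (comma-separated values) file holds data whose fields are separated by commas. The example below reads a file named Example.csv. Pass the path of the file to read_csv(); it returns a tibble, which you can assign to a variable.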

# load the tidyverse by running this code:
library(tidyverse)
  
# create a tibble named rand
rand <- read_csv("Example.csv")

Output

── Column specification ────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Code, Age_single_years
dbl (2): Census_night_population_count, Census_usually_resident_population_count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message
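
The message above lists the column types that readr guessed. As a minimal sketch, assuming a small made-up file (its contents and the column names code, age and count are hypothetical, not those of Example.csv), the types can instead be declared up front with col_types, which also quiets the message:

```r
library(readr)

# Write a small, made-up CSV to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("code,age,count", "A1,0,10", "A2,1,12"), path)

# Declare the column types instead of letting readr guess them.
people <- read_csv(
  path,
  col_types = cols(
    code  = col_character(),
    age   = col_integer(),
    count = col_double()
  )
)

# spec() retrieves the column specification that was used.
spec(people)
```

Declaring the types this way makes a script fail loudly if the file layout changes, rather than silently producing columns of a different type.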

Inline CSV input is handy for experimenting, and the options shown here work the same way when parsing a regular file.

# give inline csv input
read_csv("a,b,c
  1,2,3
  4,5,6")

Output

      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

By default, read_csv() treats the first line of the input as the column names. Other options handle the exceptions.

read_csv("first line of metadata
  second line of metadata
  a,b,c
  1,2,3", skip = 2)

Output

      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3

# when we need to ignore comments in csv file
read_csv("#ignore it is a comment
 #ignore this is another comment
 x,y,z
 1,2,3
 4,5,6", comment = "#")

Output

      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

If the first line does not contain column names, handle it like this:

# If you do not set column names then R does it for you.
# The false flag tells the computer that the 
# first line is not column names.
read_csv("1,2,3\n4,5,6", col_names=FALSE)

Output

     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

# You can set custom column names
read_csv("1,2,3\n4,5,6", col_names = c("COLUMN1","COLUMN2","COLUMN3"))

Output

  COLUMN1 COLUMN2 COLUMN3
    <dbl>   <dbl>   <dbl>
1       1       2       3
2       4       5       6

# you can use na to represent missing data
read_csv("a,b,c\n1,2,.", na = ".")

Output

      a     b c    
  <dbl> <dbl> <lgl>
1     1     2 NA 
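
The options above can be combined in a single call. The following is a sketch on made-up inline data (the metadata line and the id and score columns are hypothetical); it assumes readr 2.0 or later, where I() marks a string as literal data rather than a file path:

```r
library(readr)

# Made-up input: one metadata line, a header, and "." marking a missing score.
raw <- "exported by a hypothetical tool
id,score
1,3.5
2,.
3,4.0"

scores <- read_csv(
  I(raw),                  # I() marks the string as literal data, not a path
  skip = 1,                # drop the metadata line
  na = ".",                # treat "." as missing
  show_col_types = FALSE   # quiet the column specification message
)

scores
```

Because "." is read as NA, the score column can still be parsed as a number instead of falling back to character.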
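
## Cleaning data with the tidyverse (what is tidy data?)

Tidy data follows three rules.

  * Each variable is a column. 
  * Each observation is a row. 
  * Each value is a cell. 

![Loading and cleaning data with R and the tidyverse](https://static.deepinout.com/geekdocs/2023/02/20230215113620-2.png "Loading and cleaning data with R and the tidyverse")

First, look at examples of tidy and untidy data.
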
# tidy data
table1

Output

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
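
One reason the tidy layout is convenient: because each variable is its own column, a derived variable can be computed in one vectorised step. A minimal sketch, assuming dplyr is loaded alongside tidyr (the column name rate and the per-10,000 scaling are choices made here for illustration):

```r
library(dplyr)
library(tidyr)

# table1 ships with tidyr; cases and population are separate columns,
# so a rate per 10,000 people is a single mutate() call.
rates <- table1 %>%
  mutate(rate = cases / population * 10000)

rates
```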

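Examples of untidy data and how to handle them.

pivot_wider()

In table2, a single observation is spread across several rows. pivot_wider() fixes this. It needs two arguments:

  * names_from: the column to take variable names from. Here, it is type.
  * values_from: the column to take values from. Here, it is count.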

table2

Output

   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

table2 %>% pivot_wider(names_from=type,
                       values_from=count)

Output

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

separate()

In table3, a single column (rate) holds two values, so we need to split it.

table3

Output

  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

table3 %>% separate(rate,
                    into = c("cases", "population"),
                    sep = "/")

Output

  country      year cases  population
  <chr>       <int> <chr>  <chr>
1 Afghanistan  1999 745    19987071
2 Afghanistan  2000 2666   20595360
3 Brazil       1999 37737  172006362
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

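cases and population are character columns. This is the default behaviour of separate(): it leaves the column type unchanged. Use convert = TRUE to ask separate() to convert the new columns to better types.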

table3 %>% separate(rate,
                    into=c("cases", "population"),
                    convert=TRUE)

Output

  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

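You can also pass an integer to sep, which separate() interprets as the position to split at. Positive values count from 1 at the far left of the string; negative values count from -1 at the far right.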

table3 %>% separate(year,
                    into=c("century", "year"),
                    sep=2)

Output

  country     century year  rate             
  <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583
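
A negative position is useful when the part to split off has a fixed width at the end of the string but the front varies in length. A small sketch on made-up codes (the tibble and the column names prefix and number are hypothetical):

```r
library(tibble)
library(tidyr)

# Hypothetical codes: a prefix of varying length followed by a two-digit number.
codes <- tibble(code = c("AB12", "XYZ34"))

# sep = -2 splits two characters in from the right end of each string.
parts <- codes %>%
  separate(code, into = c("prefix", "number"), sep = -2)

parts
```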

pivot_longer()

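Use pivot_longer() when some column names are not names of variables but values of a variable. It needs three pieces of information:

  * The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
  * names_to: the name of the variable to move the column names to. Here, it is year.
  * values_to: the name of the variable to move the column values to. Here, it is cases.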

table4a

Output

  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4a %>% pivot_longer(c(`1999`, `2000`),
                         names_to="year",
                         values_to="cases")

Output

  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
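
Note that year comes out as a character column, because it was built from column names. As a sketch, assuming a tidyr version that supports the names_transform argument of pivot_longer(), the new column can be converted to integer in the same call:

```r
library(tidyr)

# names_transform applies a function to the new names column after pivoting.
long <- table4a %>%
  pivot_longer(
    c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases",
    names_transform = list(year = as.integer)
  )

long
```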

unite()

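Use unite() to rejoin the century and year columns created in the earlier separate() example. unite() takes a tibble, the name of the new variable to create, and the set of columns to combine.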

table5 %>% unite(new, century, year)

Output

   country     new   rate             
   <chr>       <chr> <chr>            
 1 Afghanistan 19_99 745/19987071     
 2 Afghanistan 20_00 2666/20595360    
 3 Brazil      19_99 37737/172006362  
 4 Brazil      20_00 80488/174504898  
 5 China       19_99 212258/1272915272
 6 China       20_00 213766/1280428583

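We also need the sep argument, because by default unite() places an underscore (_) between the values from different columns. Here we do not want any separator, so we use "".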

table5 %>% unite(new, century, year, sep = "")

Output

   country     new   rate             
   <chr>       <chr> <chr>            
 1 Afghanistan 1999  745/19987071     
 2 Afghanistan 2000  2666/20595360    
 3 Brazil      1999  37737/172006362  
 4 Brazil      2000  80488/174504898  
 5 China       1999  212258/1272915272
 6 China       2000  213766/1280428583
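
separate() and unite() are inverses of each other, which can be checked with a round trip. The sketch below starts from table3 rather than table5, and the column name full_year is chosen here purely for illustration:

```r
library(tidyr)

# Split year into century and year, then glue the two pieces back together.
roundtrip <- table3 %>%
  separate(year, into = c("century", "year"), sep = 2) %>%
  unite(full_year, century, year, sep = "")

roundtrip
```

The rejoined column is character, not numeric, since unite() pastes strings together.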

Missing values

A value can be missing in two ways.

  * Explicitly: flagged with NA.
  * Implicitly: simply not present in the data.

table98 <- tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "China", "China"),  
  year   = c(1999, 2000, 1999, 1999, 2000),
  cases    = c(   745,    2666,    37737,    80488,    212258),
  population = c(19987071, 20595360, 172006362,   NA, 1280428583)
)

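There are two missing values here.

  * The 1999 population for China is explicitly missing, because its cell contains NA.
  * The 2000 population for Brazil is implicitly missing, because that row does not appear in the data.

We can make implicit missing values explicit by putting the years in columns.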

table98 %>% pivot_wider(names_from=year,
                        values_from=population)

Output

  country      cases    `1999`     `2000`
  <chr>        <dbl>     <dbl>      <dbl>
1 Afghanistan    745  19987071         NA
2 Afghanistan   2666        NA   20595360
3 Brazil       37737 172006362         NA
4 China        80488        NA         NA
5 China       212258        NA 1280428583

You can set values_drop_na = TRUE in pivot_longer() to turn explicit missing values into implicit ones.

table98 %>% 
  pivot_wider(names_from = year, values_from = population) %>% 
  pivot_longer(
    cols = c(`1999`, `2000`), 
    names_to = "year", 
    values_to = "population", 
    values_drop_na = TRUE
  )

Output

  country      cases year  population
  <chr>        <dbl> <chr>      <dbl>
1 Afghanistan    745 1999    19987071
2 Afghanistan   2666 2000    20595360
3 Brazil       37737 1999   172006362
4 China       212258 2000  1280428583

complete()

complete() takes a set of columns and finds all their unique combinations, filling in explicit NAs where necessary.

table98 %>% complete(year, cases)

Output

    year  cases country     population
   <dbl>  <dbl> <chr>            <dbl>
 1  1999    745 Afghanistan   19987071
 2  1999   2666 NA                  NA
 3  1999  37737 Brazil       172006362
 4  1999  80488 China               NA
 5  1999 212258 NA                  NA
 6  2000    745 NA                  NA
 7  2000   2666 Afghanistan   20595360
 8  2000  37737 NA                  NA
 9  2000  80488 NA                  NA
10  2000 212258 China       1280428583
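
In practice complete() is more often applied to the columns that identify an observation, such as country and year. A minimal sketch on made-up counts (the tibble and its values are hypothetical), which also shows the fill argument for supplying a value in place of NA:

```r
library(tibble)
library(tidyr)

# Made-up counts; the row for China in 2000 is implicitly missing.
counts <- tibble(
  country = c("Brazil", "Brazil", "China"),
  year    = c(1999, 2000, 1999),
  cases   = c(10, 20, 30)
)

# Every country/year combination now appears; the new row gets cases = NA.
with_na <- counts %>% complete(country, year)

# fill supplies a value to use instead of NA for the missing combinations.
with_zero <- counts %>% complete(country, year, fill = list(cases = 0))

with_na
with_zero
```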

fill()

fill() fills in missing values: it replaces each one with the most recent non-missing value (sometimes called last observation carried forward).

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Gautam",           1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "heema",            1,           4
)
  
treatment %>% fill(person)

Output

  person treatment response
  <chr>      <dbl>    <dbl>
1 Gautam         1        7
2 Gautam         2       10
3 Gautam         3        9
4 heema          1        4
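
By default fill() carries values downward, so a missing value in the first row has nothing to copy from. The .direction argument changes this. A small sketch on made-up readings (the tibble and its column names are hypothetical):

```r
library(tibble)
library(tidyr)

# Made-up readings where the sensor label is recorded only on some rows.
readings <- tribble(
  ~day, ~sensor,
  1,    NA,
  2,    "A",
  3,    NA,
  4,    "B"
)

# Default direction is "down": the leading NA stays, day 3 becomes "A".
down <- readings %>% fill(sensor)

# "up" copies the next non-missing value backwards instead.
up <- readings %>% fill(sensor, .direction = "up")

down
up
```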