|>
data select(variable)
Introduction to tidyverse
Introduction to tidyverse
The tidyverse is a collection of R packages that transform and visualize data. They aim to be cohesive with each other and provide simple building block functions for you to use in complex applications. All packages of the tidyverse share very similar syntax and ideologies.
The tidyverse includes the following packages:
dplyr
: grammar of data manipulationtidyr
: reshape your datareadr
: read rectangular datatibble
: re-imagining of the data framestringr
: working with string dataggplot2
: grammar of graphics
and many more…
In this chapter, we focus on two packages: dplyr and tidyr.
dplyr makes manipulating data easy. tidyr helps you to create “tidy data” (data where each variable is in a column, each observation is a row and each value is a cell). These packages contain the functions that help with adding, renaming, removing columns, computing new variables, joining data sets together, and summarising your data. These can solve the most common data manipulation challenges.
First, install the core tidyverse packages with: install.packages("tidyverse")
Then, load the tidyverse package with: library(tidyverse)
Did you know you can also load specific library separately, for example, library(dplyr)
Introduction to dplyr
Introduction to dplyr
dplyr provides a selection of tools for the most common data wrangling tasks. There are a few simple yet handy functions.
select()
filter()
arrange()
mutate()
group_by()
along withungroup()
count()
case_when()
select()
This function allows you to pick particular variables from your data.
This is not only limited to one variable selection. You can select multiple variables by separating the column name with the comma.
|>
data select(variable1, variable2)
We will be using penguin data from the palmerpenguins package.
Let’s say that you want to look at whether the body mass changes over the year.
Hint: To look at the column names of the data, you can use the functions colnames()
Selection helpers
There are many helper functions that allow you to select variables by matching the patterns in the column’s name. Let’s look at a few examples.
starts_with()
: select all variables that start with a prefix.
|>
penguins select(starts_with("bill"))
# A tibble: 344 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 334 more rows
ends_with()
: select all variables that ends with a suffix.
|>
penguins select(ends_with("mm"))
# A tibble: 344 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <int>
1 39.1 18.7 181
2 39.5 17.4 186
3 40.3 18 195
4 NA NA NA
5 36.7 19.3 193
6 39.3 20.6 190
7 38.9 17.8 181
8 39.2 19.6 195
9 34.1 18.1 193
10 42 20.2 190
# ℹ 334 more rows
contains()
: select all variables that contains a literal string.
|>
penguins select(contains("length"))
# A tibble: 344 × 2
bill_length_mm flipper_length_mm
<dbl> <int>
1 39.1 181
2 39.5 186
3 40.3 195
4 NA NA
5 36.7 193
6 39.3 190
7 38.9 181
8 39.2 195
9 34.1 193
10 42 190
# ℹ 334 more rows
num_range()
: select all variables that matches a numerical range like x01, x02, x03.
|>
billboard select(num_range("wk", 10:15))
# A tibble: 317 × 6
wk10 wk11 wk12 wk13 wk14 wk15
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA
2 NA NA NA NA NA NA
3 51 51 51 47 44 38
4 61 61 59 61 66 72
5 57 64 70 75 76 78
6 6 7 22 29 36 47
7 NA NA NA NA NA NA
8 36 37 37 38 49 61
9 10 9 8 6 1 2
10 59 66 68 61 67 59
# ℹ 307 more rows
filter()
This allow you to extract observation from the data based on a given condition.
|>
data filter(expression)
For the expression field, there are many functions and operators that are useful for constructing the expressions.
==
,>
,>=
etc&
,|
,!
,xor()
is.na()
between()
,near()
|>
penguins filter(species == "Gentoo")
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
7 Gentoo Biscoe 45.4 14.6 211 4800
8 Gentoo Biscoe 46.7 15.3 219 5200
9 Gentoo Biscoe 43.3 13.4 209 4400
10 Gentoo Biscoe 46.8 15.4 215 5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>