Data wrangling with R

In this chapter, you will learn to use tidyr and dplyr from the tidyverse in R for tidying your data set and data maniplulation.

Published

13 December 2024

Introduction to tidyverse

Introduction to tidyverse

The tidyverse is a collection of R packages that transform and visualize data. They aim to be cohesive with each other and provide simple building block functions for you to use in complex applications. All packages of the tidyverse share very similar syntax and ideologies.

The tidyverse includes the following packages:

  • dplyr: grammar of data manipulation
  • tidyr: reshape your data
  • readr: read rectangular data
  • tibble: re-imagining of the data frame
  • stringr: working with string data
  • ggplot2: grammar of graphics

and many more…

In this chapter, we focus on two packages: dplyr and tidyr.

dplyr makes manipulating data easy. tidyr helps you to create “tidy data” (data where each variable is in a column, each observation is a row and each value is a cell). These packages contain the functions that help with adding, renaming, removing columns, computing new variables, joining data sets together, and summarising your data. These can solve the most common data manipulation challenges.

To get start on the journey

First, install the core tidyverse packages with: install.packages("tidyverse")

Then, load the tidyverse package with: library(tidyverse)

Tip

Did you know you can also load specific library separately, for example, library(dplyr)

Introduction to dplyr

Introduction to dplyr

dplyr provides a selection of tools for the most common data wrangling tasks. There are a few simple yet handy functions.

  • select()
  • filter()
  • arrange()
  • mutate()
  • group_by() along with ungroup()
  • count()
  • case_when()

select()

This function allows you to pick particular variables from your data.

data |>
    select(variable)

This is not only limited to one variable selection. You can select multiple variables by separating the column name with the comma.

data |>
    select(variable1, variable2)
Your turn!

We will be using penguin data from the palmerpenguins package.

Let’s say that you want to look at whether the body mass changes over the year.

Hint: To look at the column names of the data, you can use the functions colnames()


    

Selection helpers

There are many helper functions that allow you to select variables by matching the patterns in the column’s name. Let’s look at a few examples.

  • starts_with(): select all variables that start with a prefix.
penguins |>
    select(starts_with("bill"))
# A tibble: 344 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           NA            NA  
 5           36.7          19.3
 6           39.3          20.6
 7           38.9          17.8
 8           39.2          19.6
 9           34.1          18.1
10           42            20.2
# ℹ 334 more rows
  • ends_with(): select all variables that ends with a suffix.
penguins |>
    select(ends_with("mm"))
# A tibble: 344 × 3
   bill_length_mm bill_depth_mm flipper_length_mm
            <dbl>         <dbl>             <int>
 1           39.1          18.7               181
 2           39.5          17.4               186
 3           40.3          18                 195
 4           NA            NA                  NA
 5           36.7          19.3               193
 6           39.3          20.6               190
 7           38.9          17.8               181
 8           39.2          19.6               195
 9           34.1          18.1               193
10           42            20.2               190
# ℹ 334 more rows
  • contains(): select all variables that contains a literal string.
penguins |>
    select(contains("length"))
# A tibble: 344 × 2
   bill_length_mm flipper_length_mm
            <dbl>             <int>
 1           39.1               181
 2           39.5               186
 3           40.3               195
 4           NA                  NA
 5           36.7               193
 6           39.3               190
 7           38.9               181
 8           39.2               195
 9           34.1               193
10           42                 190
# ℹ 334 more rows
  • num_range(): select all variables that matches a numerical range like x01, x02, x03.
billboard |> 
    select(num_range("wk", 10:15))
# A tibble: 317 × 6
    wk10  wk11  wk12  wk13  wk14  wk15
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    NA    NA    NA    NA    NA    NA
 2    NA    NA    NA    NA    NA    NA
 3    51    51    51    47    44    38
 4    61    61    59    61    66    72
 5    57    64    70    75    76    78
 6     6     7    22    29    36    47
 7    NA    NA    NA    NA    NA    NA
 8    36    37    37    38    49    61
 9    10     9     8     6     1     2
10    59    66    68    61    67    59
# ℹ 307 more rows

filter()

This allow you to extract observation from the data based on a given condition.

data |>
    filter(expression)

For the expression field, there are many functions and operators that are useful for constructing the expressions.

  • ==, >, >= etc
  • &, |, !, xor()
  • is.na()
  • between(), near()
penguins |>
    filter(species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>