#| setup: true
#| exercise: ex_3
library(dplyr)
Quarto live test
Filter the starwars
dataset so that only the droid characters are included.
#| exercise: ex_3
starwars |> ______
Consider using the filter()
function from dplyr
.
|> filter(______) starwars
You should filter the dataset using the species
column.
|> filter(species == ______) starwars
File types and importing function
Data can come in many formats, and their file extension typically provides clues into how to import the data into the R environment.
Remember, we should always use a relative path instead of an absolute path. Refer to the Project Structure module to understand why it is important.
Comma Separated Values
CSV or Comma Separated Values is one of the most common data files you will be working with. The value inside the file is separated by ,
and rows are separated by new lines.
.csv can be read with read_csv()
from the readr
package.
<- readr::read_csv("data/penguins.csv") csv_data
Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
csv_data
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
Excel spreadsheets
The .xls and .xlsx are Excel spreadsheets. These could contain anything and require a careful inspection to import it correctly. The file will often contain a different sheet, so you need to specify the sheet you want to import.
The readr package does not allow us to import Excel spreadsheets, so we need to use a different package. The readxl
package is a common package used for reading data from the Excel spreadsheet file. To specify the sheet, use the sheet = sheet_number
argument.
<- readxl::read_excel("data/penguins.xlsx")
xlsx_data
xlsx_data
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>
RDS
The RDS or R Data Serialization files are a common format for saving R objects in RStudio to a file.
.RDS can be read with read_rds()
from the readr package.
<- readr::read_rds("data/penguins.rds")
rds_data
rds_data
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Online sources
So far, we only learned how to import data locally (from our computer). Now, we are going to look at how we can read data from online sources. A good starting place is from TidyTuesday project. We can download the data by providing the URL inside the read function. For example, if we want to download a CSV file from the TidyTuesday website, here is what the code would look like:
<- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-07/rainfall.csv"
URL
<- readr::read_csv(URL) data
Rows: 179273 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): station_code, city_name, month, day, quality, station_name
dbl (5): year, rainfall, period, lat, long
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 179,273 × 11
station_code city_name year month day rainfall period quality lat long
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 009151 Perth 1967 01 01 NA NA <NA> -32.0 116.
2 009151 Perth 1967 01 02 NA NA <NA> -32.0 116.
3 009151 Perth 1967 01 03 NA NA <NA> -32.0 116.
4 009151 Perth 1967 01 04 NA NA <NA> -32.0 116.
5 009151 Perth 1967 01 05 NA NA <NA> -32.0 116.
6 009151 Perth 1967 01 06 NA NA <NA> -32.0 116.
7 009151 Perth 1967 01 07 NA NA <NA> -32.0 116.
8 009151 Perth 1967 01 08 NA NA <NA> -32.0 116.
9 009151 Perth 1967 01 09 NA NA <NA> -32.0 116.
10 009151 Perth 1967 01 10 NA NA <NA> -32.0 116.
# ℹ 179,263 more rows
# ℹ 1 more variable: station_name <chr>
It is the same idea as how to import data locally, but instead of providing the computer paths to the function, we use URL instead.
Additional types
There are many more data file types, and the ways to import them can differ. Here are some of an example:
.tsv: Tab Separated Values. These files can be read with
read_tsv()
from thereadr
package..fwf: Fixed Width File. These files can be read with
read_fwf()
from thereadr
package..txt: Text files. These files can be read differently based on how it was structured. If the data is stored in a consistent table-like structure, you might be able to read it with
read_table()
from thereadr
package. If the structure is complex, you likely need to read the data line by line withread_lines()
and process the format yourself.
We encourage you to use the readr package since it is faster than their base equivalents and creates a tibble instead of a data.frame. The tibble is a preferred choice when storing our data, as it has a nicer formatted print output and features to prevent some surprising (and likely incorrect) results. For example, it will identify dates/datetimes and parse them appropriately.
Exporting data
To make a change to your original data file, you would need to export the data inside the R environment. To export it, you can use the readr
package, which comes with a function to write the data. We recommend you save the data using a standard data file format like CSV, even though you can export it into different formats
The function name for writing the data in the readr package is simply replacing the read_*()
with write_*()
. Here is an example of writing a file to CSV.
<- palmerpenguins::penguins
penguins
::write_csv(penguins, "data/penguins_clean.csv") readr
A good practice is to leave the original data untouched and export the data under a different name. In case you need to use the original data again later on.