File types and importing function

Data can come in many formats, and their file extension typically provides clues into how to import the data into the R environment.

Paths

Remember, we should always use a relative path instead of an absolute path. Refer to the Project Structure module to understand why it is important.

Comma Separated Values

CSV or Comma Separated Values is one of the most common data files you will be working with. The value inside the file is separated by , and rows are separated by new lines.

Import CSV:

.csv can be read with read_csv() from the readr package.

csv_data <- readr::read_csv(here::here("modules/project-structure/data/penguins.csv"))

Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

csv_data

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

Excel spreadsheets

The .xls and .xlsx are Excel spreadsheets. These could contain anything and require a careful inspection to import it correctly. The file will often contain a different sheet, so you need to specify the sheet you want to import.

Import Excel

The readr package does not allow us to import Excel spreadsheets, so we need to use a different package. The readxl package is a common package used for reading data from the Excel spreadsheet file. To specify the sheet, use the sheet = sheet_number argument.

xlsx_data <- readxl::read_excel(here::here("modules/project-structure/data/penguins.xlsx"))

xlsx_data

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <dbl>

RDS

The RDS or R Data Serialization files are a common format for saving R objects in RStudio to a file.

Import RDS

.RDS can be read with read_rds() from the readr package.

rds_data <- readr::read_rds(here::here("modules/project-structure/data/penguins.rds"))

rds_data

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Online sources

So far, we only learned how to import data locally (from our computer). Now, we are going to look at how we can read data from online sources. A good starting place is from TidyTuesday project. We can download the data by providing the URL inside the read function. For example, if we want to download a CSV file from the TidyTuesday website, here is what the code would look like:

URL <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-07/rainfall.csv"

data <- readr::read_csv(URL)

Rows: 179273 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): station_code, city_name, month, day, quality, station_name
dbl (5): year, rainfall, period, lat, long

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data

# A tibble: 179,273 × 11
   station_code city_name  year month day   rainfall period quality   lat  long
   <chr>        <chr>     <dbl> <chr> <chr>    <dbl>  <dbl> <chr>   <dbl> <dbl>
 1 009151       Perth      1967 01    01          NA     NA <NA>    -32.0  116.
 2 009151       Perth      1967 01    02          NA     NA <NA>    -32.0  116.
 3 009151       Perth      1967 01    03          NA     NA <NA>    -32.0  116.
 4 009151       Perth      1967 01    04          NA     NA <NA>    -32.0  116.
 5 009151       Perth      1967 01    05          NA     NA <NA>    -32.0  116.
 6 009151       Perth      1967 01    06          NA     NA <NA>    -32.0  116.
 7 009151       Perth      1967 01    07          NA     NA <NA>    -32.0  116.
 8 009151       Perth      1967 01    08          NA     NA <NA>    -32.0  116.
 9 009151       Perth      1967 01    09          NA     NA <NA>    -32.0  116.
10 009151       Perth      1967 01    10          NA     NA <NA>    -32.0  116.
# ℹ 179,263 more rows
# ℹ 1 more variable: station_name <chr>

It is the same idea as how to import data locally, but instead of providing the computer paths to the function, we use URL instead.

Additional types

There are many more data file types, and the ways to import them can differ. Here are some of an example:

.tsv: Tab Separated Values. These files can be read with read_tsv() from the readr package.
.fwf: Fixed Width File. These files can be read with read_fwf() from the readr package.
.txt: Text files. These files can be read differently based on how it was structured. If the data is stored in a consistent table-like structure, you might be able to read it with read_table() from the readr package. If the structure is complex, you likely need to read the data line by line with read_lines() and process the format yourself.

readr package

We encourage you to use the readr package since it is faster than their base equivalents and creates a tibble instead of a data.frame. The tibble is a preferred choice when storing our data, as it has a nicer formatted print output and features to prevent some surprising (and likely incorrect) results. For example, it will identify dates/datetimes and parse them appropriately.

Exporting data

To make a change to your original data file, you would need to export the data inside the R environment. To export it, you can use the readr package, which comes with a function to write the data. We recommend you save the data using a standard data file format like CSV, even though you can export it into different formats

The function name for writing the data in the readr package is simply replacing the read_*() with write_*(). Here is an example of writing a file to CSV.

penguins <- palmerpenguins::penguins

readr::write_csv(penguins, here::here("modules/project-structure/data/penguins_clean.csv"))

Save file

A good practice is to leave the original data untouched and export the data under a different name. In case you need to use the original data again later on.