Skip to contents

Here we walk through a basic workflow for working with OpenAQ data inside airpurifyr. The aim is to produce some plots about the 2019-20 Australian Bushfire Season, where air quality was extremely poor.

The first step when working with airpurifyr is to set your API key. Register your key at https://explore.openaq.org/register, and set this key using set_openaq_api_key(). Note you can access data without an API key, but it is heavily restricted, so the package will give you a warning for each query you run without an API key set.

With your API key set, our first task is to find all the locations on the east coast of Australia. This is non-trivial, as the “locations” provided by OpenAQ related to sensors rather than physical locations. The easiest (but most API intensive) method to get locations is to download for a large region, and then filter.

all_aus_locations <- get_measurements_for_location(
  country = "AU",
  max_observations = 1000,
  date_from = lubridate::ymd("2020-01-01"),
  date_to = lubridate::ymd("2020-01-14"),
  parameter = "pm25"
)

Here, we have asked for data from Australia, from 1 Jan 2020 to 14 Jan 2020, restricted to pm2.5. Restricting the date range and the parameter means that we need less API calls. Considering we are only using this data to get a list of stations, the smaller the set the better.

We recommend using about two weeks of time to get locations, as not all sensors report on all days.

Next, we filter down these locations using latitude and longitude, and pull down the data we actually want for a much longer period of time.

locations_of_interest <- all_aus_locations |>
  # East coast of Australia (roughly)
  dplyr::filter(long > 141, lat < -31) |>
  dplyr::distinct(location) |>
  dplyr::pull()

au_east_coast_2020 <- get_measurements_for_location(
  country = "AU",
  location = locations_of_interest,
  max_observations = 10000,
  date_from = lubridate::ymd("2019-12-01"),
  date_to = lubridate::ymd("2020-02-01"),
  parameter = "pm25"
)

If you run this example, you’ll notice this query takes a very long time. This is because it is quite a few locations for 90 days, with sensors generally reporting hourly. Here we have increased the max_observations parameter which makes each query larger but reduces the number of queries. If you are having troubles with Error 408 (HTTP timeout) then this can be a fix.

The resulting data looks like this:

head(au_east_coast_2020)
#> # A tibble: 6 × 9
#>   location_id location parameter value date_utc            unit    lat  long
#>         <int> <chr>    <chr>     <dbl> <dttm>              <chr> <dbl> <dbl>
#> 1        2487 Bathurst pm25        9.9 2020-02-01 00:00:00 µg/m³ -33.4  150.
#> 2        2487 Bathurst pm25       10.3 2020-01-31 23:00:00 µg/m³ -33.4  150.
#> 3        2487 Bathurst pm25       10.8 2020-01-31 22:00:00 µg/m³ -33.4  150.
#> 4        2487 Bathurst pm25       11.3 2020-01-31 21:00:00 µg/m³ -33.4  150.
#> 5        2487 Bathurst pm25       12.1 2020-01-31 20:00:00 µg/m³ -33.4  150.
#> 6        2487 Bathurst pm25       12.4 2020-01-31 19:00:00 µg/m³ -33.4  150.
#> # ℹ 1 more variable: country <chr>

We can visualise the stations by looking at their latitude and longitude from the dataset.

states <- ozmaps::ozmap_states |>
  filter(NAME %in% c("New South Wales", "Victoria"))

stations <- au_east_coast_2020 |>
  distinct(lat, long)

ggplot(states) +
  geom_sf() +
  geom_point(aes(x = long, y = lat), data = stations) +
  theme_bw() +
  labs(x="Longitude", y="Latitude") +
  coord_sf()

You can see that stations are most dense near major cities, with very little monitoring occurring in regional Australia.

We can look at frequency reporting using a dotplot.

ggplot(au_east_coast_2020, aes(x = date_utc, y = value, color = location)) +
  geom_point(size = 0.5) +
  guides(colour = "none") +
  labs(x = "Date", y = "pm2.5") +
  geom_hline(aes(yintercept = 500), linetype = "dashed")

Here, each set of dots represents one station. You can see that there are patches of missingness across all stations.

You can also see the effects of the fires, represented by the large spikes in pm2.5 measurements. According to the Victorian Environmental Protection Authority, pm2.5 levels above 500 (marked by the dashed line in the previous plot) are considered “extremely poor”.