Steps to process input to valid data • yowie

The wages contain some values that are unlikely to be true. To detect these unlikely values, we fitted a robust linear regression to each individual. More specifically, for each individual we fitted the model

\[y_i = \beta_0 + \beta_1 x_i + e_i\] where

\(y_i\) is the mean hourly wage in the \(i\)-th year,
\(x_i\) is the year and
\(e_i\) is the \(i\)-th error term,

using the iterated re-weighted least squares (IWLS) process.

Observations with weights (used in the IWLS process) less than 0.12 are modified as missing values. An alternative wage is predicted from the fitted robust linear regression model for these censored values and stored in another variable. The threshold of the weights was determined by visualising the effect of different thresholds using the shiny app found here.

library(tidyverse)
library(MASS)
select <- dplyr::select
wages_before <- readRDS(here::here("paper/results/wages_before.rds"))

# nest the data by id to build a robust linear model
by_id <- wages_before %>%
  select(id, year, mean_hourly_wage) %>%
  group_by(id) %>%
  nest()

# build a robust linear model
id_rlm <- by_id %>%
  mutate(model = map(data, ~rlm(mean_hourly_wage ~ year, data = .x)))

# extract the properties of regression model and weight of each observation

id_aug_w <- id_rlm %>%
  mutate(.fitted = map(model, fitted),
         .resid = map(model, resid),
         .hat = map(model, hatvalues),
         .sigma = map(model, sigma),
         w = map(model, ~.x$w)) %>%
  select(-model) %>%
  unnest(data:w)

# if the weight < 0.12, the mean_hourly_wage is replaced by the model's fitted/predicted value.
# and add the flag whether the observation is predicted value or not.
# since the fitted value is sometimes <0, and wages value could never be negative,
# we keep the mean hourly wage value even its weight < 0.12.

wages_rlm_dat <- id_aug_w %>%
  mutate(wages_rlm = ifelse(w < 0.12  & .fitted >= 0, .fitted,
                            mean_hourly_wage)) %>%
  mutate(is_pred = ifelse(w < 0.12 & .fitted >= 0, TRUE,
                          FALSE)) %>%
  select(id, year, wages_rlm, is_pred)

# join back the `wages_rlm_dat` to `wages_demog_hs`

wages_after <- left_join(wages_before, wages_rlm_dat, by = c("id", "year"))

wages_after
#> # A tibble: 103,994 × 18
#>       id  year mean_hourly_wage total_hours number_of_jobs is_wm dob_month
#>    <int> <dbl>            <dbl>       <int>          <dbl> <lgl>     <int>
#>  1     2  1979             3.85          35              1 FALSE         1
#>  2     2  1980             4.57          NA              1 FALSE         1
#>  3     2  1981             5.14          NA              1 FALSE         1
#>  4     2  1982             5.71          35              1 FALSE         1
#>  5     2  1983             5.71          NA              1 FALSE         1
#>  6     2  1984             5.14          NA              1 FALSE         1
#>  7     2  1985             7.71          NA              1 FALSE         1
#>  8     2  1986             7.69          NA              1 FALSE         1
#>  9     2  1987             8.79          NA              1 FALSE         1
#> 10     2  1988             6.67          NA              2 FALSE         1
#> # … with 103,984 more rows, and 11 more variables: dob_year <int>,
#> #   dob_conflict <lgl>, race <fct>, gender <fct>, yr_hgc <dbl>, hgc_i <int>,
#> #   hgc <chr>, age_1979 <dbl>, years_in_workforce <dbl>, wages_rlm <dbl>,
#> #   is_pred <lgl>