Transforming voting data to ternable-friendly format
Source:vignettes/transform_raw_data.Rmd
transform_raw_data.RmdIntroduction
Raw voting data comes in all shapes and forms, and not all voting data can be easily transformed into a format that is ready for ternary plots.
This vignette shows how to transform common voting data formats into
a ternable-friendly format that can be used for ternary
plots. These include PrefLib-formatted data, raw ballot data in long and
wide forms, and AEC distribution of preference data.
ternable-friendly data
as_ternable() creates a ternable object,
which is a S3 object that contains the data and metadata necessary for
ternary plots.
A ternable-friendly data frame must have the following
characteristics:
- At least 3 numeric columns representing the preferences for 3 items/candidates of the election. These columns will act as the vertices of the ternary plot.
- Values in these item/candidate columns must be non-negative, and sum to 1.
df <- tibble(
electorate = c("A", "B", "C"),
PartyA = c(0.5, 0.4, 0.6),
PartyB = c(0.3, 0.4, 0.2),
PartyC = c(0.2, 0.2, 0.2)
)
df
#> # A tibble: 3 × 4
#> electorate PartyA PartyB PartyC
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0.5 0.3 0.2
#> 2 B 0.4 0.4 0.2
#> 3 C 0.6 0.2 0.2The above data frame is a minimal example of a
ternable-friendly data frame. It contains 3 columns, each
representing the preferences for a party in a specific electorate. The
values in these columns are all non-negative and sum to 1.
PrefLib-formatted data
PrefLib
format is a common format for storing preferential data, and can be
easily downloaded using prefio::read_preflib()
function.
In this example, we will use the NSW Legislative Assembly Election dataset.
nswla <- read_preflib("00058 - nswla/00058-00000171.soi", from_preflib = TRUE)
nswla
#> # A tibble: 1,104 × 2
#> # ℹ 1,094 more rows
#> # ℹ 2 more variables: preferences <prefrncs>, frequency <int>The dataset contains preferential data, with each row representing a
unique set of preferences and their respective frequencies. What is
missing to transform this data into a ternable-friendly
format is the breakdown of preferences for each candidate in each round
of voting.
The function dop_irv() in this package helps us simulate
the Instant Runoff Voting (IRV) process to get the round-by-round
preferences for each candidate, and then transform the data into a
ternable-friendly format. A caveat for this function is
that it does not handle ties in preferences given a ballot, which works
fine under Australian IRV rules, but may not be appropriate for other
electoral systems.
dop_irv(
nswla, value_type = "percentage",
preferences_col = preferences,
frequency_col = frequency)
#> # A tibble: 4 × 8
#> round `HAYLEN Jo` `WEI Leo` `RAUE Tom` `MAKRIS Andrea` `ROMANOVSKY Teresa`
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.464 0.233 0.206 0.0572 0.0252
#> 2 2 0.469 0.236 0.209 0.0595 0.0270
#> 3 3 0.478 0.240 0.218 0.0646 0
#> 4 4 0.504 0.253 0.244 0 0
#> # ℹ 2 more variables: `SINDEN Dale` <dbl>, winner <chr>Raw ballot data in long and wide formats
Raw ballot data mostly comes in long or wide format. Both formats can
be transformed into Preflib format, which can then be converted to
ternable-friendly format using dop_irv().
Processing long format
Consider the following example of simulated ballot data in long format for the Melbourne electorate, with 3 parties: ALP, LNP, and Other, and a typical column for preference rank.
| ballot_id | elect_division | party | preference_rank |
|---|---|---|---|
| 1 | Melbourne | ALP | 1 |
| 1 | Melbourne | LNP | 2 |
| 1 | Melbourne | Other | 3 |
| 2 | Melbourne | ALP | 2 |
| 2 | Melbourne | LNP | 1 |
| 2 | Melbourne | Other | 3 |
| 3 | Melbourne | ALP | 3 |
| 3 | Melbourne | LNP | 2 |
| 3 | Melbourne | Other | 1 |
| 4 | Melbourne | ALP | 1 |
| 4 | Melbourne | LNP | 3 |
| 4 | Melbourne | Other | 2 |
| 5 | Melbourne | ALP | 2 |
| 5 | Melbourne | LNP | 3 |
| 5 | Melbourne | Other | 1 |
# Convert to PrefLib format
preflib_long <- prefio::long_preferences(
ballot_long,
vote,
id_cols = c(ballot_id, elect_division),
item_col = party,
rank_col = preference_rank
)
# Convert to ternable-friendly format
dop_irv(preflib_long$vote, value_type = "percentage")
#> # A tibble: 2 × 5
#> round ALP LNP Other winner
#> <int> <dbl> <dbl> <dbl> <chr>
#> 1 1 0.4 0.2 0.4 ALP
#> 2 2 0.6 0 0.4 ALPProcessing wide format
Consider the same example, but in wide format, with each column representing an item/candidate and its value representing the preference rank.
| ballot_id | elect_division | ALP | LNP | Other |
|---|---|---|---|---|
| 1 | Melbourne | 1 | 2 | 3 |
| 2 | Melbourne | 2 | 1 | 3 |
| 3 | Melbourne | 3 | 2 | 1 |
| 4 | Melbourne | 1 | 3 | 2 |
| 5 | Melbourne | 2 | 3 | 1 |
# Convert to PrefLib format
preflib_wide <- prefio::wide_preferences(ballot_wide, vote, ALP:Other)
# Convert to ternable-friendly format
dop_irv(preflib_wide$vote, value_type = "percentage")
#> # A tibble: 2 × 5
#> round ALP LNP Other winner
#> <int> <dbl> <dbl> <dbl> <chr>
#> 1 1 0.4 0.2 0.4 ALP
#> 2 2 0.6 0 0.4 ALPAEC distribution of preferences
This section applies specifically to the AEC distribution of preferences data. However, it can also be adapted to datasets of similar format, where
- Each row represents an aggregated preference count and/or percentage for a specific candidate in a specific group/electorate.
- There must be at least a primary key column that uniquely identifies each row, a value column representing the preference count/percentage, and an item column representing the items/parties.
- There may also be a column representing the electoral status of the item/party (elected or not elected). This is optional.
In this example, we will work with 2025 Distribution of Preference
data, which is included in this package as aecdop_2025. We
are interested in preference percentages of 4 main parties: Labor, The
Coalition, Greens, and Independents. The rest is grouped into Other.
Before transforming the data, let’s do some data wrangling to filter for relevant information.
# only include preference percentage and parties of interest
aecdop_2025 <- aecdop_2025 |>
filter(CalculationType == "Preference Percent") |>
mutate(Party = case_when(
# all parties not in the main parties are grouped into "Other"
!(PartyAb %in% c("LP", "ALP", "NP", "LNP", "LNQ", "GRN", "IND")) ~ "Other",
# group all parties in the Coalition into "LNP"
PartyAb %in% c("LP", "NP", "LNP", "LNQ") ~ "LNP",
TRUE ~ PartyAb
))The data is now ready for transformation, with the following characteristics:
-
DivisionNmandCountNumbercolumns together form the primary key that uniquely identifies each row. -
CalculationValuecolumn contains the preference percentage. -
Partycolumn contains the item/party. -
Electedcolumn contains the elected/electoral status of the party.
We will use the dop_transform() function to transform
the data into a ternable-friendly format.
dop_transform(
data = aecdop_2025,
key_cols = c(DivisionNm, CountNumber),
value_col = CalculationValue,
item_col = Party,
winner_col = Elected,
winner_identifier = "Y"
)
#> # A tibble: 976 × 8
#> DivisionNm CountNumber ALP GRN LNP Other IND Winner
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Adelaide 0 0.465 0.190 0.242 0.104 0 ALP
#> 2 Adelaide 1 0.467 0.194 0.242 0.0967 0 ALP
#> 3 Adelaide 2 0.476 0.205 0.244 0.0745 0 ALP
#> 4 Adelaide 3 0.483 0.210 0.249 0.0578 0 ALP
#> 5 Adelaide 4 0.493 0.222 0.285 0 0 ALP
#> 6 Adelaide 5 0.691 0 0.309 0 0 ALP
#> 7 Aston 0 0.373 0 0.377 0.209 0.0414 ALP
#> 8 Aston 1 0.373 0 0.378 0.206 0.0434 ALP
#> 9 Aston 2 0.376 0 0.380 0.210 0.0338 ALP
#> 10 Aston 3 0.378 0 0.384 0.202 0.0358 ALP
#> # ℹ 966 more rows