When processing data downloaded from popular survey engines, it’s not uncommon for multiple choice questions to be represented as one column per possible response coded as 0/1. So, a question with just two responses might be downloaded as part of a CSV with one column for q1_1 and another for q1_2. If the responses are mutually exclusive, then (q1_1 == 0 iff q1_2 == 1) and (q1_1 == 1 iff q1_2 == 0). If the responses are part of a “choose all that apply” question, then it’s possible to have multiple 1s.
How can these individual binary indicator variables be reassembled into a single response variable?
First, let’s simulate some response data for non-mutually exclusive questions—each row represents one respondent’s choices:
df <- data.frame(
q1_1 = round(runif(5), 0),
q1_2 = round(runif(5), 0),
q1_3 = round(runif(5), 0),
q2_1 = round(runif(5), 0),
q2_2 = round(runif(5), 0),
q3_1 = round(runif(5), 0),
q3_2 = round(runif(5), 0),
q3_3 = round(runif(5), 0),
q3_4 = round(runif(5), 0)
)
df
q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4
1 1 0 0 1 0 1 1 0 1
2 0 1 0 0 1 1 1 0 1
3 1 1 1 1 1 0 0 0 1
4 0 1 0 1 1 1 1 0 1
5 0 0 1 1 0 1 0 0 0
R’s dplyr package offers the coalesce() function, which doesn’t suit my needs when the data contains 0s for non-selected response choices. Notice below in row 2, for example, that q1 and q2 select the first non-NA values, which is 0:
library(dplyr)
df %>%
mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
mutate(q2 = coalesce(q2_1, q2_2)) %>%
mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))
q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1 1 0 0 1 0 1 1 0 1 1 1 1
2 0 1 0 0 1 1 1 0 1 0 0 1
3 1 1 1 1 1 0 0 0 1 1 1 1
4 0 1 0 1 1 1 1 0 1 0 1 1
5 0 0 1 1 0 1 0 0 0 0 1 1
If you replace all 0s with NA, you can get closer to what you need:
df %>%
mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
mutate(q1 = coalesce(q1_1, q1_2, q1_3)) %>%
mutate(q2 = coalesce(q2_1, q2_2)) %>%
mutate(q3 = coalesce(q3_1, q3_2, q3_3, q3_4))
q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1 1 NA NA 1 NA 1 1 NA 1 1 1 1
2 NA 1 NA NA 1 1 1 NA 1 1 1 1
3 1 1 1 1 1 NA NA NA 1 1 1 1
4 NA 1 NA 1 1 1 1 NA 1 1 1 1
5 NA NA 1 1 NA 1 NA NA NA 1 1 1
Unfortunately, the q1, q2, and q3 vectors here only tells us that there was some response by each respondent, not which response they gave for each of q1, q2, and q3.
It would be nice to have a version of coalesce() that gathered not the first non-NA value, but the column name of the column containing the first non-NA value. Here, I’ll use the structure of dplyr’s coalesce() as a model:
coalesce_colname <-
function(...) {
if (missing(..1)) {
abort("At least one argument must be supplied")
}
colnames <- as.character(as.list(match.call()))[-1]
values <- list(...)
x <- values[[1]]
x[!is.na(x)] <- colnames[1]
values <- values[-1]
colnames <- colnames[-1]
for (i in seq_along(values)) {
x <- ifelse(is.na(x) & !is.na(values[[i]]), colnames[i], x)
}
x
}
With this, you have a (mostly) drop-in replacement for coalesce() that captures the column name:
df %>%
mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4))
q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1 1 NA NA 1 NA 1 1 NA 1 q1_1 q2_1 q3_1
2 NA 1 NA NA 1 1 1 NA 1 q1_2 q2_2 q3_1
3 1 1 1 1 1 NA NA NA 1 q1_1 q2_1 q3_4
4 NA 1 NA 1 1 1 1 NA 1 q1_2 q2_1 q3_1
5 NA NA 1 1 NA 1 NA NA NA q1_3 q2_1 q3_1
and with a little effort, you can wrangle the column name to extract the response value:
df %>%
mutate_all(~ifelse(. == 0, NA_real_, .)) %>%
mutate(q1 = coalesce_colname(q1_1, q1_2, q1_3)) %>%
mutate(q2 = coalesce_colname(q2_1, q2_2)) %>%
mutate(q3 = coalesce_colname(q3_1, q3_2, q3_3, q3_4)) %>%
mutate_at(c("q1", "q2", "q3"), ~stringr::str_extract(., "\\d+$"))
q1_1 q1_2 q1_3 q2_1 q2_2 q3_1 q3_2 q3_3 q3_4 q1 q2 q3
1 1 NA NA 1 NA 1 1 NA 1 1 1 1
2 NA 1 NA NA 1 1 1 NA 1 2 2 1
3 1 1 1 1 1 NA NA NA 1 1 1 4
4 NA 1 NA 1 1 1 1 NA 1 2 1 1
5 NA NA 1 1 NA 1 NA NA NA 3 1 1
coalesce_colname() isn’t tidyeval aware, so coalesce_colnames(!!!vars) isn’t supported (yet…?)
Like coalesce(), coalesce_colnames() doesn’t address the issue of non-mutually exclusive (“select all that apply”) type questions. It simply reports the name of the first (in the order you list them in the function call) column containing a non-NA value.