I like to standardize the column names of data I’m reading into R so that I don’t have to match column names from one dataset that has an i.d.
column and another that has an id
column or maybe an ID
column. Keep it simple: lower case with a single underscore separator between words.
My current variation on this takes pains to accept a data.frame or a vector of character values and to return the same data type. I convert some of the more common punctuation marks to abbreviated words (% to pct, # to cnt) so that datasets with both population #
and population %
columns are distinctly named as population_cnt
and population_pct
. I also convert camelCase
to camel_case
for consistency.
While there are packages that also provide functions to clean names, I’ve stuck with base R to limit dependencies for those who operate in constrained computing environments and to serve as a teaching example.
clean_names <- function(.data, unique = FALSE) {
n <- if (is.data.frame(.data)) colnames(.data) else .data
n <- gsub("%+", "_pct_", n)
n <- gsub("\\$+", "_dollars_", n)
n <- gsub("\\++", "_plus_", n)
n <- gsub("-+", "_minus_", n)
n <- gsub("\\*+", "_star_", n)
n <- gsub("#+", "_cnt_", n)
n <- gsub("&+", "_and_", n)
n <- gsub("@+", "_at_", n)
n <- gsub("[^a-zA-Z0-9_]+", "_", n)
n <- gsub("([A-Z][a-z])", "_\\1", n)
n <- tolower(trimws(n))
n <- gsub("(^_+|_+$)", "", n)
n <- gsub("_+", "_", n)
if (unique) n <- make.unique(n, sep = "_")
if (is.data.frame(.data)) {
colnames(.data) <- n
.data
} else {
n
}
}
So I can then see the effects of this:
clean_names(
c(
" a", "a ",
"a %", "a", "$a", "$$$a", "GDP ($)", "GDP (us$)",
"a (#)", "a & b", "#", "$",
"a_cnt",
"Aa&Bb", "camelCasePhrases",
"AlphaBetaGamma", "Alpha Beta", "Beta !!! Gamma",
"a + b", "a - b", "a * b"
)
)
[1] "a" "a"
[3] "a_pct" "a"
[5] "dollars_a" "dollars_a"
[7] "gdp_dollars" "gdp_us_dollars"
[9] "a_cnt" "a_and_b"
[11] "cnt" "dollars"
[13] "a_cnt" "aa_and_bb"
[15] "camel_case_phrases" "alpha_beta_gamma"
[17] "alpha_beta" "beta_gamma"
[19] "a_plus_b" "a_minus_b"
[21] "a_star_b"
Typically, I pipe my data import directly into my clean_names() function so that all my data starts with the same naming convention:
library(dplyr)
df <- read.csv("data-raw/source_data") %>%
clean_names()