Clean, Consistent Column Names

I like to standardize the column names of data I’m reading into R so that I don’t have to match column names from one dataset that has an i.d. column and another that has an id column or maybe an ID column. Keep it simple: lower case with a single underscore separator between words.

My current variation on this takes pains to accept a data.frame or a vector of character values and to return the same data type. I convert some of the more common punctuation marks to abbreviated words (% to pct, # to cnt) so that datasets with both population # and population % columns are distinctly named as population_cnt and population_pct. I also convert camelCase to camel_case for consistency.

While there are packages that also provide functions to clean names, I’ve stuck with base R to limit dependencies for those who operate in constrained computing environments and to serve as a teaching example.

clean_names <- function(.data, unique = FALSE) {
  n <- if (is.data.frame(.data)) colnames(.data) else .data

  n <- gsub("%+", "_pct_", n)
  n <- gsub("\\$+", "_dollars_", n)
  n <- gsub("\\++", "_plus_", n)
  n <- gsub("-+", "_minus_", n)
  n <- gsub("\\*+", "_star_", n)
  n <- gsub("#+", "_cnt_", n)
  n <- gsub("&+", "_and_", n)
  n <- gsub("@+", "_at_", n)

  n <- gsub("[^a-zA-Z0-9_]+", "_", n)
  n <- gsub("([A-Z][a-z])", "_\\1", n)
  n <- tolower(trimws(n))
  
  n <- gsub("(^_+|_+$)", "", n)
  
  n <- gsub("_+", "_", n)
  
  if (unique) n <- make.unique(n, sep = "_")
  
  if (is.data.frame(.data)) {
    colnames(.data) <- n
    .data
  } else {
    n
  }
}

So I can then see the effects of this:

clean_names(
  c(
    "  a", "a  ",
    "a %", "a", "$a", "$$$a", "GDP ($)", "GDP (us$)",
    "a (#)", "a & b", "#", "$",
    "a_cnt",
    "Aa&Bb", "camelCasePhrases",
    "AlphaBetaGamma", "Alpha       Beta", "Beta  !!! Gamma",
    "a + b", "a - b", "a * b"
  )
)
 [1] "a"                  "a"                 
 [3] "a_pct"              "a"                 
 [5] "dollars_a"          "dollars_a"         
 [7] "gdp_dollars"        "gdp_us_dollars"    
 [9] "a_cnt"              "a_and_b"           
[11] "cnt"                "dollars"           
[13] "a_cnt"              "aa_and_bb"         
[15] "camel_case_phrases" "alpha_beta_gamma"  
[17] "alpha_beta"         "beta_gamma"        
[19] "a_plus_b"           "a_minus_b"         
[21] "a_star_b"  

Typically, I pipe my data import directly into my clean_names() function so that all my data starts with the same naming convention:

library(dplyr)

df <- read.csv("data-raw/source_data") %>%
  clean_names()