Pipes have been a fundamental part of computer programming for decades. In short, a pipe takes the output of its left-hand side and passes it as input to its right-hand side. For example, in a Linux shell, you might run

    cat example.txt | sort | uniq

to take the contents of a text file, then sort the rows, then keep one copy of each distinct value. The vertical bar, |, is a common but not universal pipe operator; on U.S. QWERTY keyboards it shares a key with the backslash (\), above the RETURN key.
Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator, and it is most often pronounced “then”. For example, “take the mtcars data.frame, THEN take the head of it, THEN…” and so on.
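To make the “then” reading concrete, here is a minimal sketch showing that a piped call is equivalent to the nested call it replaces. It uses only magrittr and base R functions; the variable names piped, nested, and chained are my own, chosen for illustration.

```r
library(magrittr)

# The pipe passes its left-hand side as the first argument of the call on its right.
piped  <- mtcars %>% head(3)
nested <- head(mtcars, 3)

# Both spellings produce the same object.
identical(piped, nested)

# Chained steps read left to right instead of inside out.
chained <- mtcars %>% subset(cyl == 4) %>% nrow()
chained == nrow(subset(mtcars, cyl == 4))
```

Both comparisons should evaluate to TRUE, since the pipe only changes how the call is written, not what it computes.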
For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type, possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.
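    library(magrittr)

    custom_function <- function(.data) {
      # Side effect: display the structure of the incoming object
      message(str(.data))
      # Return the input unaltered so the next step in the pipe can use it
      .data
    }

    mtcars %>%
      custom_function() %>%
      head(10) %>%
      custom_function()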
This first displays the structure of the 32 by 11 mtcars data.frame, then takes the head(10) of mtcars and displays the structure of that reduced 10 by 11 version. The pipeline ultimately returns the reduced version, which R prints to the console by default.
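One way to check that a function honors this contract is to confirm that the pipeline's result is the same with or without it. The sketch below uses a hypothetical helper, report_dim (my name, not part of any package), that messages the dimensions as a side effect and returns its input untouched.

```r
library(magrittr)

# Hypothetical helper: reports dimensions as a side effect,
# then returns its input unchanged.
report_dim <- function(.data) {
  message(nrow(.data), " rows by ", ncol(.data), " columns")
  .data
}

result <- mtcars %>%
  report_dim() %>%
  head(10) %>%
  report_dim()

# Dropping the helper from the pipeline gives the same result.
identical(result, head(mtcars, 10))
```

Because report_dim only observes the data, removing it from the pipeline should not change what comes out the other end.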
The dplyr package in R introduces the notion of a grouped data.frame. For example, the mtcars data has a cyl column that classifies each observation as a 4, 6, or 8 cylinder vehicle. You might want to process each of these groups of rows separately, i.e., process all the 4 cylinder vehicles together, then all the 6 cylinder, then all the 8 cylinder:
    library(dplyr)

    mtcars %>%
      group_by(cyl) %>%
      tally()
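If it helps to picture what processing each group separately means, here is a rough base R sketch of the same idea. It is not how dplyr implements grouping; it only splits the rows by cyl and counts each piece. The names groups and counts are mine.

```r
# Base R sketch: split the rows by cylinder count, then handle each piece.
groups <- split(mtcars, mtcars$cyl)

# Each element is an ordinary data.frame holding one group's rows.
counts <- sapply(groups, nrow)
counts
```

The result is a named vector with one count per cylinder class, and the counts add up to the number of rows in mtcars.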
Note that dplyr re-exports the magrittr pipe operator, so it’s not necessary to attach both dplyr and magrittr explicitly; attaching dplyr will usually suffice.
In order to make my custom function group-aware, I need to check the incoming .data object to see whether it’s a grouped data.frame. If it is, then I can use dplyr’s do() function to call my custom function on each subset of the data. Here, the dot (.) notation denotes the subset of .data being handed to custom_function at each invocation.
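    library(dplyr)

    custom_function <- function(.data) {
      # Grouped input: call this function again on each group's rows
      if (dplyr::is_grouped_df(.data)) {
        return(dplyr::do(.data, custom_function(.)))
      }
      message(str(.data))
      .data
    }

    # Ungrouped: the structure is displayed once
    mtcars %>% custom_function()

    # Grouped: the structure is displayed once per cyl group
    mtcars %>%
      group_by(cyl) %>%
      custom_function()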
In these examples, I’ve messaged some metadata to the console, but your custom functions can do any work they like: create, plot, and save ggplots; compute statistics; generate log files; and so on.
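As one illustration of “compute statistics”, here is a minimal sketch following the same pattern. The function name report_mean_mpg and the choice of the mpg column are my own assumptions for the example; the group check and the call to do() are the same as above.

```r
library(dplyr)

# Hypothetical group-aware helper: messages the mean mpg of the rows
# it receives, then returns those rows unchanged.
report_mean_mpg <- function(.data) {
  if (dplyr::is_grouped_df(.data)) {
    # Re-run this function on each group's subset of rows
    return(dplyr::do(.data, report_mean_mpg(.)))
  }
  message("rows: ", nrow(.data), ", mean mpg: ", round(mean(.data$mpg), 2))
  .data
}

out <- mtcars %>%
  group_by(cyl) %>%
  report_mean_mpg()
```

This should emit one message per cylinder group, and out still holds all 32 rows, so more steps can be piped after it.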
I usually include the R three-dots parameter, ..., to allow additional parameters to be passed into the function.
    custom_function <- function(.data, ...) {
      if (dplyr::is_grouped_df(.data)) {
        return(dplyr::do(.data, custom_function(., ...)))
      }
      message(str(.data))
      .data
    }
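To show what the three dots buy you, here is a sketch of a variant that actually forwards them. The name describe_data is hypothetical, and forwarding ... to str() is my own choice for illustration; vec.len is a standard str() argument that controls how many values are shown per column.

```r
library(dplyr)

# Hypothetical variant: extra arguments are forwarded to str(),
# both for plain input and for each group of grouped input.
describe_data <- function(.data, ...) {
  if (dplyr::is_grouped_df(.data)) {
    return(dplyr::do(.data, describe_data(., ...)))
  }
  str(.data, ...)
  .data
}

# vec.len = 1 shortens the preview of each column's values.
plain <- mtcars %>% describe_data(vec.len = 1)

grouped <- mtcars %>%
  group_by(cyl) %>%
  describe_data(vec.len = 1)
```

The caller decides how the output is displayed without the function having to name every option in its signature.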