Find & Replace with Superpowers
We’ve all grown up using the Find or Find & Replace features in web pages, word processors, and databases. Typically, Find & Replace is used to find a literal sequence of symbols in a document and replace it with another literal sequence of symbols:
- Find: fire
- Replace: water
What if you wanted to find all the words that begin with f and include the letter r? Such a description is referred to as a pattern. Patterns are common in textual data: phone numbers, grant numbers, abbreviations, URLs, email addresses, and so on.
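As a rough illustration, here is a minimal Python sketch of the difference between a literal search and a pattern search. The sample sentence and the particular expression \bf\w*r\w*\b are assumptions chosen for this example; other regular expressions could describe the same idea.

```python
import re

# Sample text invented for this illustration.
text = "fire water for from fun far"

# Literal find & replace: only the exact sequence "fire" is affected.
literal_result = text.replace("fire", "water")

# Pattern search: whole words that begin with "f" and contain an "r".
# \b is a word boundary and \w* is zero or more word characters.
f_words = re.findall(r"\bf\w*r\w*\b", text)

print(literal_result)
print(f_words)
```

The literal search touches one word, while the pattern picks out every word that fits the description, including words the author never listed explicitly.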
To describe such patterns, the mathematician Stephen C. Kleene introduced a notation in the 1950s that became known as regular expressions, or RegEx. A regular expression can be thought of as a notation to describe the (possibly recurring) patterns present in textual data.
What is a pattern? Patterns are formal descriptions of the sequence of symbols in textual data. “Every digit that follows an even number” is a pattern. “Any English word that does not contain a vowel” is also a pattern. Some patterns are well known and have common names: URLs, email addresses, phone numbers, and credit card numbers are all patterns.
What is textual data? Generally, any data that can be typed in from the keyboard can be considered textual data. “Two roads diverged in a yellow wood” is textual data. “123 456 789” is textual data that happens to include symbols we might call digits. “@#$%^&” is textual data containing punctuation marks. If you can type it, it’s probably textual data. Non-textual data includes sounds, images, videos, and so on. We sometimes refer to these somewhat erroneously as binary data. (Erroneously because all data stored by a modern digital computer is represented at some level as binary.)
The Challenge of Learning Regular Expressions
Regular expressions are notoriously difficult for humans to read and write. This is largely due to the necessity of using the symbols on the keyboard to express the regular expression. Those same keys and symbols were used to enter the textual data into a computer. They look so much alike—the textual data and the descriptions of patterns in the textual data—that they are easily confused.
Regular expressions also make heavy use of parentheses ( ), square brackets [ ], and curly braces { }. Keeping track of them can get confusing very quickly.
The Regular Expression Builder Tool
The tool below attempts to help you construct regular expressions by breaking the expression down into individual clauses. Each clause follows a schema consisting of 5 parts: something, the pattern this clause will match, a quantifier, an indicator whether this clause is optional, and an indicator whether this clause should be remembered (AKA captured) for use in replacement patterns.
You can add or remove clauses as necessary using the + and – buttons to the right of each clause to create more complicated expressions.
To the right of each clause, you see the portion of the overall expression contributed by that clause, so that you can learn from the patterns as they’re created.
Further down the page, the regular expression is displayed in full, so that you can copy & paste it into an editor or computer program.
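The idea of assembling an expression from clauses can be sketched in Python. This is not the tool's actual implementation: the Clause class, its field names, and the choice to wrap optional clauses in a non-capturing group (?: ) are all assumptions made for illustration, and the sketch covers only the pattern, quantifier, optional, and capture parts of a clause.

```python
import re
from dataclasses import dataclass


@dataclass
class Clause:
    """Hypothetical model of one clause; not the tool's real data structure."""
    pattern: str            # what the clause matches, e.g. "[0-9]"
    quantifier: str = ""    # e.g. "", "+", "*", "{3}"
    optional: bool = False  # may the whole clause be absent?
    capture: bool = False   # remember the match for the replacement?

    def to_regex(self) -> str:
        body = f"{self.pattern}{self.quantifier}"
        if self.capture:
            body = f"({body})"      # capturing group
        elif self.optional:
            body = f"(?:{body})"    # group without capturing
        if self.optional:
            body += "?"
        return body


def build(clauses):
    """Concatenate each clause's contribution into one expression."""
    return "".join(clause.to_regex() for clause in clauses)


clauses = [
    Clause("[0-9]", "{3}", capture=True),
    Clause("-", optional=True),
    Clause("[0-9]", "{4}", capture=True),
]
expression = build(clauses)
print(expression)
print(re.fullmatch(expression, "555-1234").groups())
```

Each clause contributes one fragment, and the full expression is simply the fragments joined in order, which is why it can help to read a long expression one clause at a time.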
You can display the regular expression either as it would typically be written in most programming languages or as it’s often written in the R statistical programming language. R has a quirk when entering regular expressions: because the backslash is also the escape character in R string literals, most backslashes need to be preceded by an additional backslash. \ becomes \\ and \n becomes \\n.
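The backslash doubling can be sketched as a small Python helper. The function name to_r_string is hypothetical, and the sketch assumes the simplest possible rule, doubling every backslash, whereas the text above says only that most backslashes need it.

```python
def to_r_string(regex: str) -> str:
    """Double every backslash so the regex can sit inside an R string literal.

    Assumption: every backslash is doubled, which is a simplification.
    """
    return regex.replace("\\", "\\\\")


standard = r"\d+\n"          # as written in most programming languages
r_style = to_r_string(standard)

print(standard)
print(r_style)
```

An expression with no backslashes passes through unchanged, so the two display modes differ only where escapes appear.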
There are often multiple ways of constructing a given regular expression. Every programming language has its own quirks. Here, I’ve attempted to privilege consistency. Even so, the generator may create some malformed regular expressions. In such cases, the result is probably very close to a correct version. Save it and study it to see whether you or your coding partners can find the bug.
Capturing: How to remember what was found to use in replacement
When you capture at least some of the matched clauses, you can use the dynamic result (whatever text was matched) in your replacement string. You capture the matched results of a clause by wrapping it in parentheses ( ). Most regular expression-capable editors then allow you to use either $1, $2, $3, … or \1, \2, \3, … in the replacement to denote each captured clause in order. $1 denotes the first clause wrapped in ( ) and $2 denotes the second one.
For example, given a list of names formatted as last_name, first_name, the following regular expression might be used to swap the list so that it appears as first_name last_name. The parentheses indicate each of the two captured clauses. Each of the clauses consists of one or more symbols. The two clauses are initially separated by a comma and a space. In the replacement, the first_name is placed first, followed by a space, followed by the last_name.
- Find: (.+), (.+)
- Replace: $2 $1
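The same swap can be tried in Python, shown here as a minimal sketch. The sample names are invented for the example. Note that Python's re module writes captured clauses in the replacement as \1 and \2 rather than $1 and $2.

```python
import re

# Sample names invented for this illustration, in "last_name, first_name" form.
names = ["Frost, Robert", "Dickinson, Emily"]

# (.+) captures one or more symbols; the two groups are separated by ", ".
# In the replacement, \2 is the second captured clause and \1 is the first.
swapped = [re.sub(r"(.+), (.+)", r"\2 \1", name) for name in names]

print(swapped)
```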
Limitations
Regular expressions are capable of expressing patterns beyond what this tool affords. Once you’re familiar with constructing regexes on your own, you may find it helpful to explore the full capabilities they offer.
Also, the regular expressions created here may not reflect the simplest or most common way of expressing some patterns. I’ve chosen to adopt a consistent notation rather than use some of the idiomatic shorthands available, such as [[:digit:]] or \d, which are common alternative notations for [0-9].
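One reason an explicit notation such as [0-9] can be the safer choice is that shorthands do not always mean exactly the same thing in every language. The sketch below shows the behavior of Python's re module only; the sample string is an assumption chosen to include a non-ASCII digit, and other languages may treat \d differently.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a digit outside the range 0-9.
sample = "a1\u0663"

bracket_digits = re.findall(r"[0-9]", sample)        # explicit range
shorthand_digits = re.findall(r"\d", sample)         # Unicode-aware by default
ascii_shorthand = re.findall(r"\d", sample, flags=re.ASCII)

print(bracket_digits)
print(shorthand_digits)
print(ascii_shorthand)
```

In Python, \d on ordinary strings also matches digits from other scripts unless the re.ASCII flag is given, while [0-9] matches only the ten ASCII digits.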
Finally, each editor and programming language uses a slightly different set of rules for writing regular expressions. You may find that an expression you’ve created here doesn’t work in whatever context you’re attempting to use it. Chances are that it’s off by just a few characters, so don’t throw your creation away. Instead, study how your editor or language expresses the ideas included in each of your clauses. For example, you may find that you need to add a backslash before a literal closing parenthesis: \) rather than simply ).
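The parenthesis case can be checked with a short Python sketch. The sample text is invented for the example; re.escape is a standard function in Python's re module that adds the backslashes needed to treat a string literally.

```python
import re

# Sample text invented for this illustration.
text = "Call (555) 0100"

# re.escape adds a backslash before characters that are special in a regex.
escaped = re.escape("(555)")
literal_match = re.search(escaped, text)

# Without escaping, the parentheses form a capture group around 555
# and the parentheses in the text are not part of the match.
group_match = re.search("(555)", text)

print(escaped)
print(literal_match.group(0))
print(group_match.group(0))
```

Both searches succeed, but they match different text, which is the kind of small difference worth checking when an expression almost works.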
Define Your Regular Expression One Clause at a Time
Pattern | Quantifier | Optional? | Capture? | Clause Contributes
This is the resulting Regular Expression
You can copy and paste this (or you may need a slightly modified version of it) into any program that understands regular expressions.
Sample Textual Data (editable) | Matched Expressions (if any; separated by tabs)