Getting Started
Introduction
This is a quick cheat sheet for getting started with regular expressions.
{.cols-2 .marker-round}
Character Classes
Pattern | Description |
---|
[abc] | A single character of: a, b or c |
[^abc] | A character except: a, b or c |
[a-z] | A character in the range: a-z |
[^a-z] | A character not in the range: a-z |
[0-9] | A digit in the range: 0-9 |
[a-zA-Z] | A character in the range: a-z or A-Z |
[a-zA-Z0-9] | A character in the range: a-z, A-Z or 0-9 |
{.style-list}
Quantifiers
Pattern | Description |
---|
a? | Zero or one of a |
a* | Zero or more of a |
a+ | One or more of a |
[0-9]+ | One or more of 0-9 |
a{3} | Exactly 3 of a |
a{3,} | 3 or more of a |
a{3,6} | Between 3 and 6 of a |
a* | Greedy quantifier |
a*? | Lazy quantifier |
a*+ | Possessive quantifier |
Metacharacters
Pattern | Description |
---|
^ | Matches the start of a string. |
{ | Starts a quantifier for the number of occurrences. |
+ | Matches one or more of the preceding element. |
< | Not a standard regex meta character (commonly used in HTML). |
[ | Starts a character class. |
* | Matches zero or more of the preceding element. |
) | Ends a capturing group. |
> | Not a standard regex meta character (commonly used in HTML). |
. | Matches any character except a newline. |
( | Starts a capturing group. |
`|` | Alternation; matches the expression before or after it. |
$ | Matches the end of a string. |
\ | Escapes a meta character, giving it literal meaning. |
? | Matches zero or one of the preceding element. |
{.cols-3 .marker-none}
Escape these special characters with \
Meta Sequences
Pattern | Description |
---|
. | Any single character |
\s | Any whitespace character |
\S | Any non-whitespace character |
\d | Any digit; same as [0-9] |
\D | Any non-digit; same as [^0-9] |
\w | Any word character |
\W | Any non-word character |
\X | Any Unicode sequences, linebreaks included |
\C | Match one data unit |
\R | Unicode newlines |
\v | Vertical whitespace character |
\V | Negation of \v - anything except newlines and vertical tabs |
\h | Horizontal whitespace character |
\H | Negation of \h |
\K | Reset match |
\n | Match nth subpattern |
\pX | Unicode property X |
\p{...} | Unicode property or script category |
\PX | Negation of \pX |
\P{...} | Negation of \p |
\Q...\E | Quote; treat as literals |
\k<name> | Match subpattern name |
\k'name' | Match subpattern name |
\k{name} | Match subpattern name |
\gn | Match nth subpattern |
\g{n} | Match nth subpattern |
\g<n> | Recurse nth capture group |
\g'n' | Recurse nth capture group |
\g{-n} | Match nth relative previous subpattern |
\g<+n> | Recurse nth relative upcoming subpattern |
\g'+n' | Recurse nth relative upcoming subpattern |
\g'letter' | Recurse named capture group letter |
\g{letter} | Match previously-named capture group letter |
\g<letter> | Recurse named capture group letter |
\xYY | Hex character YY |
\x{YYYY} | Hex character YYYY |
\ddd | Octal character ddd |
\cY | Control character Y |
[\b] | Backspace character |
\ | Makes any character literal |
Anchors
Pattern | Description |
---|
\G | Start of match |
^ | Start of string |
$ | End of string |
\A | Start of string |
\Z | End of string, or just before a final newline |
\z | Absolute end of string |
\b | A word boundary |
\B | Non-word boundary |
Substitution
Pattern | Description |
---|
\0 | Complete match contents |
\1 | Contents in capture group 1 |
$1 | Contents in capture group 1 |
${foo} | Contents in capture group foo |
\x20 | Hexadecimal replacement values |
\x{06fa} | Hexadecimal replacement values |
\t | Tab |
\r | Carriage return |
\n | Newline |
\f | Form-feed |
\U | Uppercase Transformation |
\L | Lowercase Transformation |
\E | Terminate any Transformation |
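As a rough illustration of group references in a replacement string, the sketch below uses Python's `re.sub`. Python writes a named reference as `\g<name>` rather than `${foo}`, and the sample date string is made up for the example.

```python
import re

# Hypothetical input: an ISO-style date to reorder as day/month/year.
iso_date = "2024-03-15"

# Numbered groups are referenced as \1, \2, \3 in the replacement.
numbered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso_date)

# Named groups are referenced as \g<name> in Python's replacement syntax.
named = re.sub(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
    r"\g<day>/\g<month>/\g<year>",
    iso_date,
)

print(numbered)  # 15/03/2024
print(named)     # 15/03/2024
```

Replacement syntax differs between engines, so check which of the forms in the table your language accepts.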
Group Constructs
Pattern | Description |
---|
(...) | Capture everything enclosed |
(a|b) | Match either a or b |
(?:...) | Match everything enclosed |
(?>...) | Atomic group (non-capturing) |
(?|…) | Duplicate subpattern group number |
(?#...) | Comment |
(?'name'...) | Named Capturing Group |
(?<name>...) | Named Capturing Group |
(?P<name>...) | Named Capturing Group |
(?imsxXU) | Inline modifiers |
(?(DEFINE)...) | Pre-define patterns before using them |
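A minimal sketch of capturing versus non-capturing groups, assuming Python's `re` module (which supports the `(?P<name>...)` form). The log line and the group names `user` and `id` are hypothetical.

```python
import re

# Hypothetical log line used only for illustration.
line = "user=alice id=42"

# (?P<name>...) captures under a name; (?:...) groups without capturing.
pattern = re.compile(r"(?:user|login)=(?P<user>\w+) id=(?P<id>\d+)")
match = pattern.search(line)

print(match.group("user"))  # alice
print(match.group("id"))    # 42
print(match.groups())       # ('alice', '42')
```

The non-capturing group does not appear in `groups()`, so only the two named groups are returned.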
Assertions
- | - |
---|
(?(1)yes|no) | Conditional statement |
(?(R)yes|no) | Conditional statement |
(?(R#)yes|no) | Recursive Conditional statement |
(?(R&name)yes|no) | Conditional statement |
(?(?=…)yes|no) | Lookahead conditional |
(?(?<=…)yes|no) | Lookbehind conditional |
Lookarounds
- | - |
---|
(?=...) | Positive Lookahead |
(?!...) | Negative Lookahead |
(?<=...) | Positive Lookbehind |
(?<!...) | Negative Lookbehind |
Lookaround lets you match a group before (lookbehind) or after (lookahead) your main pattern without including it in the
result.
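The idea can be sketched in Python as follows. The price list is invented for the example; the point is that the text inside the lookaround is checked but left out of the match.

```python
import re

# Hypothetical price list used only for illustration.
text = "apple $30, pear 45, plum $7"

# Lookbehind: digits that directly follow a dollar sign; "$" is not in the match.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Lookahead: words followed by " $"; the space and "$" are not in the match.
priced_items = re.findall(r"\b[a-z]+(?= \$)", text)

print(dollar_amounts)  # ['30', '7']
print(priced_items)    # ['apple', 'plum']
```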
Flags/Modifiers
Pattern | Description |
---|
g | Global |
m | Multiline |
i | Case insensitive |
x | Ignore whitespace |
s | Single line |
u | Unicode |
X | eXtended |
U | Ungreedy |
A | Anchor |
J | Duplicate group names |
Recurse
- | - |
---|
(?R) | Recurse entire pattern |
(?1) | Recurse first subpattern |
(?+1) | Recurse first relative subpattern |
(?&name) | Recurse subpattern name |
(?P=name) | Match subpattern name |
(?P>name) | Recurse subpattern name |
POSIX Character Classes {.col-span-2}
Character Class | Same as | Meaning |
---|
[[:alnum:]] | [0-9A-Za-z] | Letters and digits |
[[:alpha:]] | [A-Za-z] | Letters |
[[:ascii:]] | [\x00-\x7F] | ASCII codes 0-127 |
[[:blank:]] | [\t ] | Space or tab only |
[[:cntrl:]] | [\x00-\x1F\x7F] | Control characters |
[[:digit:]] | [0-9] | Decimal digits |
[[:graph:]] | [[:alnum:][:punct:]] | Visible characters (not space) |
[[:lower:]] | [a-z] | Lowercase letters |
[[:print:]] | [ -~] == [ [:graph:]] | Visible characters |
[[:punct:]] | [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~] | Visible punctuation characters |
[[:space:]] | [\t\n\v\f\r ] | Whitespace |
[[:upper:]] | [A-Z] | Uppercase letters |
[[:word:]] | [0-9A-Za-z_] | Word characters |
[[:xdigit:]] | [0-9A-Fa-f] | Hexadecimal digits |
[[:<:]] | \b(?=\w) | Start of word |
[[:>:]] | \b(?<=\w) | End of word |
{.show-header}
Control verb
- | - |
---|
(*ACCEPT) | Control verb |
(*FAIL) | Control verb |
(*MARK:NAME) | Control verb |
(*COMMIT) | Control verb |
(*PRUNE) | Control verb |
(*SKIP) | Control verb |
(*THEN) | Control verb |
(*UTF) | Pattern modifier |
(*UTF8) | Pattern modifier |
(*UTF16) | Pattern modifier |
(*UTF32) | Pattern modifier |
(*UCP) | Pattern modifier |
(*CR) | Line break modifier |
(*LF) | Line break modifier |
(*CRLF) | Line break modifier |
(*ANYCRLF) | Line break modifier |
(*ANY) | Line break modifier |
\R | Line break modifier |
(*BSR_ANYCRLF) | Line break modifier |
(*BSR_UNICODE) | Line break modifier |
(*LIMIT_MATCH=x) | Regex engine modifier |
(*LIMIT_RECURSION=d) | Regex engine modifier |
(*NO_AUTO_POSSESS) | Regex engine modifier |
(*NO_START_OPT) | Regex engine modifier |
Regex examples {.cols-3}
Characters
Pattern | Matches |
---|
ring | Match ring, springboard, etc. |
. | Match a, 9, + etc. |
h.o | Match hoo, h2o, h/o etc. |
ring\? | Match ring? |
\(quiet\) | Match (quiet) |
c:\\windows | Match c:\windows |
Use \ to search for these special characters: [ \ ^ $ . | ? * + ( ) { }
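When the literal text comes from elsewhere, escaping by hand is error-prone. A small sketch, assuming Python, where `re.escape` does the backslashing; the sample strings are hypothetical.

```python
import re

# Hypothetical literal text that contains several metacharacters.
literal = "c:\\windows (quiet)?"

# re.escape backslash-escapes the characters that would otherwise be special.
pattern = re.escape(literal)

# The escaped pattern matches the original text literally.
found = re.search(pattern, "path: c:\\windows (quiet)? done")

print(found.group(0) == literal)  # True
```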
Alternatives
Pattern | Matches |
---|
cat|dog | Match cat or dog |
id|identity | Match id only; identity is never reached |
identity|id | Match id or identity |
Order longer to shorter when alternatives overlap
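A quick way to see why the order matters, sketched in Python: a backtracking engine tries alternatives left to right and stops at the first one that succeeds.

```python
import re

text = "identity"

# "id" is tried first and succeeds, so "identity" is never attempted.
short_first = re.findall(r"id|identity", text)

# Listing the longer alternative first lets the full word match.
long_first = re.findall(r"identity|id", text)

print(short_first)  # ['id']
print(long_first)   # ['identity']
```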
Character classes
Pattern | Matches |
---|
[aeiou] | Match any vowel |
[^aeiou] | Match a NON vowel |
r[iau]ng | Match ring, wrangle, sprung, etc. |
gr[ae]y | Match gray or grey |
[a-zA-Z0-9] | Match any letter or digit |
[\u3a00-\ufa99] | Match any Unicode Hàn (中文) |
In [ ], always escape \ and ], and sometimes ^ and - (a dot is literal inside a class, so escaping it is optional).
Shorthand classes
Pattern | Meaning |
---|
\w | “Word” character (letter, digit, or underscore) |
\d | Digit |
\s | Whitespace (space, tab, vtab, newline) |
\W, \D, or \S | Not word, digit, or whitespace |
[\D\S] | Non-digit OR non-whitespace, so it matches any character |
[^\d\s] | Disallow digit and whitespace |
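The difference between the last two rows is easy to miss, so here is a small Python sketch on a made-up sample string. Combining negated shorthands inside a class forms a union, while negating the class excludes both.

```python
import re

sample = "a1 _"

# [\D\S] is a union: non-digit OR non-whitespace, which covers every character.
union = re.findall(r"[\D\S]", sample)

# [^\d\s] excludes both digits and whitespace.
neither = re.findall(r"[^\d\s]", sample)

print(union)    # ['a', '1', ' ', '_']
print(neither)  # ['a', '_']
```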
Occurrences
Pattern | Matches |
---|
colou?r | Match color or colour |
[BW]ill[ieamy's]* | Match Bill, Willy, William’s etc. |
[a-zA-Z]+ | Match 1 or more letters |
\d{3}-\d{2}-\d{4} | Match an SSN |
[a-z]\w{1,7} | Match a UW NetID |
Greedy versus lazy
Pattern | Meaning |
---|
* + {n,} greedy | Match as much as possible |
<.+> | Finds 1 big match in <b>bold</b> |
*? +? {n,}? lazy | Match as little as possible |
<.+?> | Finds 2 matches in <b>bold</b> |
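The two tag patterns from the table can be checked directly; this is a minimal Python sketch using the same sample markup.

```python
import re

html = "<b>bold</b>"

greedy = re.findall(r"<.+>", html)   # one match spanning the whole string
lazy = re.findall(r"<.+?>", html)    # stops at the first ">" each time

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```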
Scope {.col-span-2}
Pattern | Meaning |
---|
\b | “Word” edge (next to a non-“word” character) |
\bring | Word starts with “ring”, e.g. ringtone |
ring\b | Word ends with “ring”, e.g. spring |
\b9\b | Match single digit 9, not 19, 91, 99, etc. |
\b[a-zA-Z]{6}\b | Match 6-letter words |
\B | Not word edge |
\Bring\B | Match springs and wringer |
^\d*$ | Entire string must be digits |
^[a-zA-Z]{4,20}$ | String must have 4-20 letters |
^[A-Z] | String must begin with capital letter |
[\.!?"')]$ | String must end with terminal punctuation |
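A short sketch of anchors and word boundaries in Python. The helper name `is_valid_name` is hypothetical, and `re.fullmatch` is used here in place of wrapping the pattern in `^...$`, since it requires the whole string to match.

```python
import re

def is_valid_name(value: str) -> bool:
    # The whole string must consist of 4 to 20 ASCII letters.
    return re.fullmatch(r"[a-zA-Z]{4,20}", value) is not None

# \b9\b matches a standalone 9 but not the 9 inside 19, 91 or 99.
standalone_nines = re.findall(r"\b9\b", "9 19 91 99 9")

print(is_valid_name("Regex"))  # True
print(is_valid_name("abc"))    # False
print(standalone_nines)        # ['9', '9']
```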
Modifiers
Pattern | Meaning |
---|
(?i) [a-z]*(?-i) | Ignore case ON / OFF |
(?s) .*(?-s) | Match multiple lines (causes . to match newline) |
(?m) ^.*;$(?-m) | ^ & $ match lines not whole string |
(?x) | #free-spacing mode, this EOL comment ignored |
(?-x) | free-spacing mode OFF |
/regex/ismx | Modify mode for entire string |
Groups
Pattern | Meaning |
---|
(in|out)put | Match input or output |
\d{5}(-\d{4})? | US zip code (“+4” optional) |
Parser tries EACH alternative if match fails after group.
Can lead to catastrophic backtracking.
Back references
Pattern | Matches |
---|
(to) (be) or not \1 \2 | Match to be or not to be |
([^\s])\1{2} | Match a non-space character, then the same character twice more: aaa, … |
\b(\w+)\s+\1\b | Match doubled words |
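The doubled-word pattern from the table can be tried out as below. This is a Python sketch on an invented sentence; the same backreference also works in the replacement to keep a single copy.

```python
import re

text = "This is is a test of the the parser"

# \1 must repeat exactly what group 1 captured, so this finds doubled words.
doubled = re.findall(r"\b(\w+)\s+\1\b", text)

# The same backreference can drive a replacement that keeps one copy.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(doubled)  # ['is', 'the']
print(fixed)    # This is a test of the parser
```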
Non-capturing group
Pattern | Meaning |
---|
on(?:click|load) | Faster than: on(click|load) |
Use non-capturing or atomic groups when possible
Atomic groups
Pattern | Meaning |
---|
(?>red|green|blue) | Faster than non-capturing |
(?>id|identity)\b | Match id, but not identity |
“id” matches, but \b fails after the atomic group; the parser doesn’t backtrack into the group to retry ‘identity’.
If alternatives overlap, order longer to shorter.
Lookaround {.row-span-2 .col-span-2}
Pattern | Meaning |
---|
(?= ) | Lookahead, if you can find ahead |
(?! ) | Lookahead, if you can NOT find ahead |
(?<= ) | Lookbehind, if you can find behind |
(?<! ) | Lookbehind, if you can NOT find behind |
\b\w+?(?=ing\b) | Match the stem of warbling, string, fishing, … |
\b(?!\w+ing\b)\w+\b | Words NOT ending in ing |
(?<=\bpre).*?\b | Match what follows pre in pretend, present, prefix, … |
\b\w{3}(?<!pre)\w*?\b | Words NOT starting with pre |
\b\w+(?<!ing)\b | Match words NOT ending in ing |
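To see what two of these patterns actually return, here is a Python sketch on a made-up word list. Note that the lookahead version returns only the part before "ing", because the lookahead text is not consumed.

```python
import re

words = "warbling string fishing fish"

# Lookahead keeps "ing" out of the match: only the stem is returned.
stems = re.findall(r"\b\w+?(?=ing\b)", words)

# Negative lookbehind at the word end rejects words that end in "ing".
not_ing = re.findall(r"\b\w+(?<!ing)\b", words)

print(stems)    # ['warbl', 'str', 'fish']
print(not_ing)  # ['fish']
```

Here `'fish'` in `stems` comes from "fishing"; the standalone word "fish" is not followed by "ing" and so produces no match for the first pattern.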
If-then-else
Match “Mr.” or “Ms.” if word “her” is later in string
requires lookaround for IF condition
RegEx in Python
Getting started
Import the regular expressions module
>>> import re
Examples {.col-span-2 .row-span-3}
re.search()
>>> sentence = 'This is a sample string'
>>> bool(re.search(r'this', sentence, flags=re.I))
True
>>> bool(re.search(r'xyz', sentence))
False
re.findall()
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
['par', 'spar', 'spare', 'pare']
>>> re.findall(r'\b0*[1-9]\d{2,}\b', '0501 035 154 12 26 98234')
['0501', '154', '98234']
re.finditer()
>>> m_iter = re.finditer(r'[0-9]+', '45 349 651 593 4 204')
>>> [m[0] for m in m_iter if int(m[0]) < 350]
['45', '349', '4', '204']
re.split()
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
re.sub()
>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat
re.compile()
>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>
>>> bool(pet.search('They bought a dog'))
True
>>> bool(pet.search('A cat crossed their path'))
False
Functions
Function | Description |
---|
re.findall | Returns a list containing all matches |
re.finditer | Returns an iterator of match objects (one for each match) |
re.search | Returns a Match object if there is a match anywhere in the string |
re.split | Returns a list where the string has been split at each match |
re.sub | Replaces one or many matches with a string |
re.compile | Compiles a regular expression pattern for later use |
re.escape | Returns the string with regex special characters escaped |
Flags
- | - | - |
---|
re.I | re.IGNORECASE | Ignore case |
re.M | re.MULTILINE | Multiline |
re.L | re.LOCALE | Make \w, \b, \s locale dependent |
re.S | re.DOTALL | Dot matches all (including newline) |
re.U | re.UNICODE | Make \w, \b, \d, \s Unicode dependent |
re.X | re.VERBOSE | Readable style |
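Flags are combined with the bitwise OR operator. The sketch below pairs `re.VERBOSE` with `re.MULTILINE`; the zip-code pattern and the input lines are chosen only for illustration.

```python
import re

# re.VERBOSE (re.X) lets the pattern carry whitespace and comments.
zip_code = re.compile(
    r"""
    ^\d{5}        # five-digit base code
    (?:-\d{4})?   # optional four-digit extension
    $
    """,
    re.VERBOSE | re.MULTILINE,
)

# Hypothetical multi-line input; re.M makes ^ and $ match at each line.
lines = "98101\n98101-1234\n9810"
matches = [m.group(0) for m in zip_code.finditer(lines)]

print(matches)  # ['98101', '98101-1234']
```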
Regex in JavaScript
test()
let textA = "I like APPles very much";
let textB = "I like APPles";
let regex = /apples$/i; // case-insensitive, anchored at the end
// Output: false
console.log(regex.test(textA));
// Output: true
console.log(regex.test(textB));
search()
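let text = "I like APPles very much";
let regexA = /apples/;  // case-sensitive
let regexB = /apples/i; // case-insensitive
console.log(text.search(regexA)); // Output: -1
console.log(text.search(regexB)); // Output: 7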
exec()
let text = "Do you like apples?";
let regex = /apples/;
// Output: apples
console.log(regex.exec(text)[0]);
// Output: Do you like apples?
console.log(regex.exec(text).input);
match()
let text = "Here are apples and apPleS";
let regex = /apples/gi;
// Output: [ "apples", "apPleS" ]
console.log(text.match(regex));
split() {.col-span-2}
let text = "This 593 string will be brok294en at places where d1gits are.";
let regex = /\d+/g;
// Output: [ "This ", " string will be brok", "en at places where d", "gits are." ]
console.log(text.split(regex));
matchAll()
let regex = /t(e)(st(\d?))/g;
let text = "test1test2";
let array = [...text.matchAll(regex)];
// Output: ["test1", "e", "st1", "1"]
console.log(array[0]);
// Output: ["test2", "e", "st2", "2"]
console.log(array[1]);
replace()
let text = "Do you like aPPles?";
let regex = /apples/i;
let result = text.replace(regex, "mangoes");
// Output: Do you like mangoes?
console.log(result);
replaceAll()
let text = "Here are apples and apPleS";
let regex = /apples/gi;
let result = text.replaceAll(regex, "mangoes");
// Output: Here are mangoes and mangoes
console.log(result);
Regex in PHP
Functions {.col-span-2}
- | - |
---|
preg_match() | Performs a regex match |
preg_match_all() | Perform a global regular expression match |
preg_replace_callback() | Perform a regular expression search and replace using a callback |
preg_replace() | Perform a regular expression search and replace |
preg_split() | Splits a string by regex pattern |
preg_grep() | Returns array entries that match a pattern |
preg_replace
$str = "Visit Microsoft!";
$regex = "/microsoft/i";
// Output: Visit CheatSheets!
echo preg_replace($regex, "CheatSheets", $str);
preg_match
$str = "Visit CheatSheets";
$regex = "#cheatsheets#i";
// Output: 1
echo preg_match($regex, $str);
preg_match_all {.col-span-2 .row-span-2}
$regex = "/[a-zA-Z]+ (\d+)/";
$input_str = "June 24, August 13, and December 30";
if (preg_match_all($regex, $input_str, $matches_out)) {
    // Output: 2
    echo count($matches_out);
    // Output: 3
    echo count($matches_out[0]);
    // Output: Array("June 24", "August 13", "December 30")
    print_r($matches_out[0]);
    // Output: Array("24", "13", "30")
    print_r($matches_out[1]);
}
preg_grep
$arr = ["Jane", "jane", "Joan", "JANE"];
$regex = "/Jane/";
// Output: Array("Jane")
print_r(preg_grep($regex, $arr));
preg_split {.col-span-2}
$str = "Jane\tKate\nLucy Marion";
$regex = "/\s/";
// Output: Array("Jane", "Kate", "Lucy", "Marion")
print_r(preg_split($regex, $str));
Regex in Java
Styles {.col-span-2}
First way
Pattern p = Pattern.compile(".s", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("aS");
boolean s1 = m.matches();
System.out.println(s1); // Outputs: true
Second way
boolean s2 = Pattern.compile("[0-9]+").matcher("123").matches();
System.out.println(s2); // Outputs: true
Third way
boolean s3 = Pattern.matches(".s", "XXXX");
System.out.println(s3); // Outputs: false
Pattern Fields
- | - |
---|
CANON_EQ | Canonical equivalence |
CASE_INSENSITIVE | Case-insensitive matching |
COMMENTS | Permits whitespace and comments |
DOTALL | Dotall mode |
MULTILINE | Multiline mode |
UNICODE_CASE | Unicode-aware case folding |
UNIX_LINES | Unix lines mode |
Methods
Pattern
- Pattern compile(String regex [, int flags])
- boolean matches(String regex, CharSequence input)
- String[] split(CharSequence input [, int limit])
- String quote(String s)
Matcher
- int start([int group | String name])
- int end([int group | String name])
- boolean find([int start])
- String group([int group | String name])
- Matcher reset()
String
- boolean matches(String regex)
- String replaceAll(String regex, String replacement)
- String[] split(String regex[, int limit])
There are more methods …
Examples {.col-span-2}
Replace sentence:
String regex = "[A-Z\n]{5}$";
String str = "I like APP\nLE";
Pattern p = Pattern.compile(regex, Pattern.MULTILINE);
Matcher m = p.matcher(str);
// Outputs: I like Apple!
System.out.println(m.replaceAll("pple!"));
Array of all matches:
String str = "She sells seashells by the Seashore";
String regex = "\\w*se\\w*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
List<String> matches = new ArrayList<>();
// Collect every match found in the input
while (m.find()) {
    matches.add(m.group());
}
// Outputs: [sells, seashells, Seashore]
System.out.println(matches);
Regex in MySQL {.cols-2}
Functions
Name | Description |
---|
REGEXP | Whether string matches regex |
REGEXP_INSTR() | Starting index of substring matching regex (NOTE: Only MySQL 8.0+) |
REGEXP_LIKE() | Whether string matches regex (NOTE: Only MySQL 8.0+) |
REGEXP_REPLACE() | Replace substrings matching regex (NOTE: Only MySQL 8.0+) |
REGEXP_SUBSTR() | Return substring matching regex (NOTE: Only MySQL 8.0+) |
REGEXP
Examples
mysql> SELECT 'abc' REGEXP '^[a-d]';
mysql> SELECT name FROM cities WHERE name REGEXP '^A';
mysql> SELECT name FROM cities WHERE name NOT REGEXP '^A';
mysql> SELECT name FROM cities WHERE name REGEXP 'A|B|R';
mysql> SELECT 'a' REGEXP 'A', 'a' REGEXP BINARY 'A';
REGEXP_REPLACE
REGEXP_REPLACE(expr, pat, repl[, pos[, occurrence[, match_type]]])
Examples
mysql> SELECT REGEXP_REPLACE('a b c', 'b', 'X');
mysql> SELECT REGEXP_REPLACE('abc ghi', '[a-z]+', 'X', 1, 2);
REGEXP_SUBSTR
REGEXP_SUBSTR(expr, pat[, pos[, occurrence[, match_type]]])
Examples
mysql> SELECT REGEXP_SUBSTR('abc def ghi', '[a-z]+');
mysql> SELECT REGEXP_SUBSTR('abc def ghi', '[a-z]+', 1, 3);
REGEXP_LIKE
REGEXP_LIKE(expr, pat[, match_type])
Examples
mysql> SELECT regexp_like('aba', 'b+');
mysql> SELECT regexp_like('aba', 'b{2}');
mysql> # i: case-insensitive
mysql> SELECT regexp_like('Abba', 'ABBA', 'i');
mysql> SELECT regexp_like('a\nb\nc', '^b$', 'm');
REGEXP_INSTR
REGEXP_INSTR(expr, pat[, pos[, occurrence[, return_option[, match_type]]]])
Examples
mysql> SELECT regexp_instr('aa aaa aaaa', 'a{3}');
mysql> SELECT regexp_instr('abba', 'b{2}', 2);
mysql> SELECT regexp_instr('abbabba', 'b{2}', 1, 2);
mysql> SELECT regexp_instr('abbabba', 'b{2}', 1, 3, 1);