# Regular Expression Extensions
# RegExp Constructor
In ES5, the RegExp constructor accepts arguments in two forms.
In the first form, the first argument is a string containing the pattern, and the second argument specifies the regular expression flags.
var regex = new RegExp('xyz', 'i');
// equivalent to
var regex = /xyz/i;
In the second form, the argument is a regular expression, and the constructor returns a copy of it.
var regex = new RegExp(/xyz/i);
// equivalent to
var regex = /xyz/i;
In this case, however, ES5 does not allow a second argument that adds flags; supplying one throws an error.
var regex = new RegExp(/xyz/, 'i');
// Uncaught TypeError: Cannot supply flags when constructing one RegExp from another
ES6 changed this behavior. If the first argument to the RegExp constructor is a regular expression object, a second argument can be used to specify the flags. Moreover, the returned regular expression will ignore the flags of the original regular expression and use only the newly specified flags.
new RegExp(/abc/ig, 'i').flags
// "i"
In the code above, the flags of the original regular expression object are ig, which are overridden by the second argument i.
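As a small illustration of the ES6 behavior, the sketch below uses the two-argument form to derive a global copy of an existing regular expression. The helper name toGlobal is hypothetical and not part of any standard API.

```javascript
// Hypothetical helper: return a copy of `re` that also carries the g flag.
// Relies on the ES6 rule that the second argument replaces the original flags.
function toGlobal(re) {
  if (re.global) {
    return re;
  }
  return new RegExp(re, re.flags + 'g');
}

const original = /abc/i;
const globalCopy = toGlobal(original);

console.log(original.flags);   // "i"
console.log(globalCopy.flags); // "gi"
console.log(globalCopy.source === original.source); // true
```

The original object is left untouched, which matters when other code still relies on its flags.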
# String Methods for Regular Expressions
String objects have 4 methods that can use regular expressions: match(), replace(), search(), and split().
ES6 internally redirects all 4 methods to call RegExp instance methods, so that all regex-related methods are defined on the RegExp object.
- String.prototype.match calls RegExp.prototype[Symbol.match]
- String.prototype.replace calls RegExp.prototype[Symbol.replace]
- String.prototype.search calls RegExp.prototype[Symbol.search]
- String.prototype.split calls RegExp.prototype[Symbol.split]
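One consequence of this delegation is that any object implementing the corresponding well-known symbol can be passed to these string methods. A minimal sketch follows; the digitMasker object is a made-up example, not a built-in.

```javascript
// Hypothetical object that implements Symbol.replace.
// String.prototype.replace hands the string and the replacement over to it.
const digitMasker = {
  [Symbol.replace](str, replacement) {
    // Replace every decimal digit with the replacement text.
    return str.split(/\d/).join(replacement);
  }
};

const viaCustomObject = 'a1b2'.replace(digitMasker, '#');
console.log(viaCustomObject); // "a#b#"

// For a real regular expression, the two calls below are equivalent.
const viaString = 'a1b2'.replace(/\d/g, '#');
const viaSymbol = /\d/g[Symbol.replace]('a1b2', '#');
console.log(viaString === viaSymbol); // true
```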
# The u Flag
ES6 adds the u flag to regular expressions, meaning "Unicode mode." It is used to correctly handle Unicode characters greater than \uFFFF — that is, it properly handles four-byte UTF-16 encoding.
/^\uD83D/u.test('\uD83D\uDC2A') // false
/^\uD83D/.test('\uD83D\uDC2A') // true
In the code above, \uD83D\uDC2A is a four-byte UTF-16 encoding representing a single character. However, ES5 does not support four-byte UTF-16 encoding, treating it as two characters, which causes the second line to return true. With the u flag, ES6 recognizes it as a single character, so the first line returns false.
Once the u flag is added, the behavior of the following regular expression features is modified.
(1) The Dot Character
The dot (.) in regular expressions matches any single character except newlines. For Unicode characters with code points greater than 0xFFFF, the dot character cannot match them — the u flag must be added.
var s = '𠮷';
/^.$/.test(s) // false
/^.$/u.test(s) // true
The code above shows that without the u flag, the regular expression treats the string as two characters, resulting in a match failure.
(2) Unicode Character Notation
ES6 introduced curly brace notation for Unicode characters. This notation must be used with the u flag in regular expressions for the curly braces to be recognized; otherwise, they are interpreted as quantifiers.
/\u{61}/.test('a') // false
/\u{61}/u.test('a') // true
/\u{20BB7}/u.test('𠮷') // true
The code above shows that without the u flag, the regular expression does not recognize the \u{61} notation and instead interprets it as 61 consecutive u characters.
(3) Quantifiers
With the u flag, all quantifiers correctly recognize Unicode characters with code points greater than 0xFFFF.
/a{2}/.test('aa') // true
/a{2}/u.test('aa') // true
/𠮷{2}/.test('𠮷𠮷') // false
/𠮷{2}/u.test('𠮷𠮷') // true
(4) Predefined Patterns
The u flag also affects predefined patterns and their ability to correctly recognize Unicode characters with code points greater than 0xFFFF.
/^\S$/.test('𠮷') // false
/^\S$/u.test('𠮷') // true
In the code above, \S is a predefined pattern that matches all non-whitespace characters. Only with the u flag can it correctly match Unicode characters with code points greater than 0xFFFF.
Using this, you can write a function that correctly returns the length of a string.
function codePointLength(text) {
var result = text.match(/[\s\S]/gu);
return result ? result.length : 0;
}
var s = '𠮷𠮷';
s.length // 4
codePointLength(s) // 2
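For comparison, here is a sketch of the same count done without a regular expression. It assumes only that the string iterator walks a string by code point, which is also ES6 behavior. The function names are illustrative.

```javascript
// Illustrative alternative: spreading a string iterates by code point.
function countCodePoints(text) {
  return [...text].length;
}

// Same idea as the regex-based version, using the u flag.
function countWithRegex(text) {
  const result = text.match(/[\s\S]/gu);
  return result ? result.length : 0;
}

const sample = 'a\u{20BB7}b'; // U+20BB7 needs two UTF-16 code units

console.log(sample.length);           // 4
console.log(countCodePoints(sample)); // 3
console.log(countWithRegex(sample));  // 3
```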
(5) The i Flag
Some Unicode characters have different code points but very similar glyphs. For example, \u004B (Latin capital letter K) and \u212A (the Kelvin sign) both look like an uppercase K.
/[a-z]/i.test('\u212A') // false
/[a-z]/iu.test('\u212A') // true
In the code above, without the u flag, the Kelvin sign \u212A is not recognized as a case-insensitive match for the letter k.
(6) Escaping
Without the u flag, undefined escapes in regular expressions (such as escaping a comma \,) are harmless. In u mode, however, they throw an error.
/\,/ // /\,/
/\,/u // SyntaxError
In the code above, without the u flag, the backslash before the comma has no effect. With the u flag added, it throws an error.
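Because the error in u mode is raised when the pattern is compiled, it can be observed at run time by building the pattern with the RegExp constructor. A minimal sketch; the helper name isValidPattern is hypothetical.

```javascript
// Hypothetical helper: report whether a pattern compiles with the given flags.
function isValidPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    // Invalid patterns raise a SyntaxError; anything else is unexpected.
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err;
  }
}

console.log(isValidPattern('\\,', ''));  // true: the escape is tolerated
console.log(isValidPattern('\\,', 'u')); // false: invalid escape in u mode
console.log(isValidPattern('\\.', 'u')); // true: \. is a defined escape
```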
# RegExp.prototype.unicode Property
Regular expression instances now have a unicode property indicating whether the u flag has been set.
const r1 = /hello/;
const r2 = /hello/u;
r1.unicode // false
r2.unicode // true
In the code above, you can check whether the u flag is set on a regular expression from the unicode property.
# The y Flag
In addition to the u flag, ES6 also adds the y flag to regular expressions, called the "sticky" modifier.
The y flag works similarly to the g flag — both perform global matching, with each subsequent match starting from the position where the previous match ended. The difference is that the g flag only requires a match somewhere in the remaining string, while the y flag requires the match to start at the very first position of the remaining string. This is what "sticky" means.
var s = 'aaa_aa_a';
var r1 = /a+/g;
var r2 = /a+/y;
r1.exec(s) // ["aaa"]
r2.exec(s) // ["aaa"]
r1.exec(s) // ["aa"]
r2.exec(s) // null
The code above has two regular expressions: one using the g flag and the other using the y flag. Both are executed twice. On the first execution, they behave the same, with the remaining string being _aa_a. Since the g flag has no position requirement, the second execution returns a result. The y flag, however, requires the match to start at the beginning, so it returns null.
If we modify the regular expression to ensure a head match every time, the y flag will return results.
var s = 'aaa_aa_a';
var r = /a+_/y;
r.exec(s) // ["aaa_"]
r.exec(s) // ["aa_"]
In the code above, each match starts from the beginning of the remaining string.
The lastIndex property can better illustrate the y flag.
const REGEX = /a/g;
// Start matching from position 2 (y)
REGEX.lastIndex = 2;
// Match succeeds
const match = REGEX.exec('xaya');
// Match succeeds at position 3
match.index // 3
// Next match starts from position 4
REGEX.lastIndex // 4
// Match starting from position 4 fails
REGEX.exec('xaya') // null
In the code above, the lastIndex property specifies the starting position for each search. The g flag searches forward from this position until a match is found.
The y flag also respects the lastIndex property, but requires the match to be found exactly at the position specified by lastIndex.
const REGEX = /a/y;
// Start matching from position 2
REGEX.lastIndex = 2;
// No a sits exactly at position 2, so the match fails
REGEX.exec('xaya') // null
// Start matching from position 3
REGEX.lastIndex = 3;
// An a sits exactly at position 3, so the match succeeds
const match = REGEX.exec('xaya');
match.index // 3
REGEX.lastIndex // 4
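Because a sticky match must begin exactly at lastIndex, the y flag is a convenient way to ask whether a pattern matches at one specific position. A minimal sketch, assuming a hypothetical helper named matchAt:

```javascript
// Hypothetical helper: try to match `re` exactly at index `pos` of `str`.
function matchAt(re, str, pos) {
  // Build a sticky copy so the caller's regex and its lastIndex stay untouched.
  const flags = re.sticky ? re.flags : re.flags + 'y';
  const sticky = new RegExp(re.source, flags);
  sticky.lastIndex = pos;
  return sticky.exec(str);
}

const atTwo = matchAt(/a/, 'xaya', 2);
const atThree = matchAt(/a/, 'xaya', 3);

console.log(atTwo);         // null, position 2 holds "y"
console.log(atThree[0]);    // "a"
console.log(atThree.index); // 3
```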
In effect, the y flag implies the head-matching anchor ^.
/b/y.exec('aba')
// null
The code above returns null because a head match cannot be guaranteed. The design intent of the y flag is to make the head-matching anchor ^ effective in global matching.
Below is an example using the string object's replace method.
const REGEX = /a/gy;
'aaxa'.replace(REGEX, '-') // '--xa'
In the code above, the last a is not replaced because it is not located exactly where the next match has to start: the x comes first.
Using the y flag alone with the match method only returns the first match. It must be combined with the g flag to return all matches.
'a1a2a3'.match(/a\d/y) // ["a1"]
'a1a2a3'.match(/a\d/gy) // ["a1", "a2", "a3"]
One application of the y flag is extracting tokens from a string. The y flag ensures that no characters are skipped between matches.
const TOKEN_Y = /\s*(\+|[0-9]+)\s*/y;
const TOKEN_G = /\s*(\+|[0-9]+)\s*/g;
tokenize(TOKEN_Y, '3 + 4')
// [ '3', '+', '4' ]
tokenize(TOKEN_G, '3 + 4')
// [ '3', '+', '4' ]
function tokenize(TOKEN_REGEX, str) {
let result = [];
let match;
while (match = TOKEN_REGEX.exec(str)) {
result.push(match[1]);
}
return result;
}
In the code above, if the string contains no illegal characters, the y and g flags produce the same extraction results. However, once an illegal character appears, the two behave differently.
tokenize(TOKEN_Y, '3x + 4')
// [ '3' ]
tokenize(TOKEN_G, '3x + 4')
// [ '3', '+', '4' ]
In the code above, the g flag ignores illegal characters, while the y flag does not, making it easy to detect errors.
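Building on this, a tokenizer can report the offending position instead of silently stopping. The following is only a sketch under the same token grammar as above; the function name tokenizeStrict and the error message format are assumptions made for illustration.

```javascript
// Hypothetical strict tokenizer: throws as soon as a character cannot be
// consumed at the current position.
function tokenizeStrict(str) {
  const TOKEN = /\s*(\+|[0-9]+)\s*/y;
  const tokens = [];
  let pos = 0;
  while (pos < str.length) {
    TOKEN.lastIndex = pos;
    const match = TOKEN.exec(str);
    if (match === null) {
      // With the y flag, a failed match means the text at `pos` is illegal.
      throw new SyntaxError(`Unexpected character "${str[pos]}" at position ${pos}`);
    }
    tokens.push(match[1]);
    pos = TOKEN.lastIndex;
  }
  return tokens;
}

console.log(tokenizeStrict('3 + 4')); // [ '3', '+', '4' ]

try {
  tokenizeStrict('3x + 4');
} catch (err) {
  console.log(err.message); // Unexpected character "x" at position 1
}
```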
# RegExp.prototype.sticky Property
Corresponding to the y flag, ES6 regular expression instances have a sticky property indicating whether the y flag has been set.
var r = /hello\d/y;
r.sticky // true
# RegExp.prototype.flags Property
ES6 adds a flags property to regular expressions, which returns the flags of the regular expression.
// ES5 source property
// returns the pattern of the regular expression
/abc/ig.source
// "abc"
// ES6 flags property
// returns the flags of the regular expression
/abc/ig.flags
// 'gi'
# The s Flag: dotAll Mode
In regular expressions, the dot (.) is a special character that matches any single character, with two exceptions. One is four-byte UTF-16 characters, which can be handled with the u flag. The other is line terminator characters.
A line terminator is a character that signifies the end of a line. The following four characters are classified as "line terminators":
- U+000A line feed (\n)
- U+000D carriage return (\r)
- U+2028 line separator
- U+2029 paragraph separator
/foo.bar/.test('foo\nbar')
// false
In the code above, . does not match \n, so the regular expression returns false.
However, in many cases we want to match any single character. Here is a workaround.
/foo[^]bar/.test('foo\nbar')
// true
This workaround is not very intuitive. ES2018 introduced the s flag, which makes . match any single character.
/foo.bar/s.test('foo\nbar') // true
This is called dotAll mode, meaning the dot represents all characters. Accordingly, regular expressions also introduced a dotAll property that returns a boolean indicating whether the regular expression is in dotAll mode.
const re = /foo.bar/s;
// alternative syntax
// const re = new RegExp('foo.bar', 's');
re.test('foo\nbar') // true
re.dotAll // true
re.flags // 's'
The /s flag and the multiline flag /m do not conflict. When used together, . matches all characters, while ^ and $ match the beginning and end of each line.
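A short sketch of how the two flags interact. The sample text is made up, and the same pattern is tested with m only, with s only, and with both.

```javascript
const text = 'start\nmiddle\nend';

// m only: ^ and $ match at line boundaries, but . still stops at \n.
const multilineOnly = /^start.*^end$/m.test(text);

// s only: . crosses \n, but ^ matches only at the very start of the input.
const dotAllOnly = /^start.*^end$/s.test(text);

// Both: . crosses line breaks and ^ matches at the start of the last line.
const both = /^start.*^end$/ms.test(text);

console.log(multilineOnly); // false
console.log(dotAllOnly);    // false
console.log(both);          // true
```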
# Lookbehind Assertions
JavaScript's regular expressions previously supported only lookahead and negative lookahead assertions, not lookbehind or negative lookbehind assertions. ES2018 introduced lookbehind assertions. V8 first implemented them, behind a flag, in version 4.9, and Chrome has enabled them by default since version 62.
A "lookahead assertion" means that x is matched only if followed by y, written as /x(?=y)/. For example, to match only digits before a percent sign, write /\d+(?=%)/. A "negative lookahead assertion" means that x is matched only if not followed by y, written as /x(?!y)/. For example, to match only digits not before a percent sign, write /\d+(?!%)/.
/\d+(?=%)/.exec('100% of US presidents have been male') // ["100"]
/\d+(?!%)/.exec("that's all 44 of them") // ["44"]
For the two strings above, swapping the regular expressions would not produce the same results. Also, note that the part inside the "lookahead assertion" parentheses ((?=%)) is not included in the returned result.
A "lookbehind assertion" is the opposite of a "lookahead assertion": x is matched only if preceded by y, written as /(?<=y)x/. For example, to match only digits after a dollar sign, write /(?<=\$)\d+/. A "negative lookbehind assertion" is the opposite of a "negative lookahead assertion": x is matched only if not preceded by y, written as /(?<!y)x/. For example, to match only digits not after a dollar sign, write /(?<!\$)\d+/.
/(?<=\$)\d+/.exec('Benjamin Franklin is on the $100 bill') // ["100"]
/(?<!\$)\d+/.exec("it's worth about €90") // ["90"]
In the examples above, the part inside the "lookbehind assertion" parentheses ((?<=\$)) is also not included in the returned result.
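One detail worth noting is that a negative lookbehind only checks what comes immediately before the match, so a digit run can still match in the middle of an amount. The sketch below shows this and one possible way to tighten the pattern. The sample strings are made up for illustration.

```javascript
const text = 'Costs $10, €20 and $30';

// Positive lookbehind: digit runs directly preceded by a dollar sign.
const dollarAmounts = text.match(/(?<=\$)\d+/g);
console.log(dollarAmounts); // [ '10', '30' ]

// Negative lookbehind alone can start in the middle of a number:
// in "$100", the run "00" is preceded by "1", not by "$".
const partial = /(?<!\$)\d+/.exec('$100');
console.log(partial[0]); // "00"

// Also excluding a preceding digit forces a match to start at the first digit.
const nonDollarAmounts = text.match(/(?<![$\d])\d+/g);
console.log(nonDollarAmounts); // [ '20' ]
```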
Below is an example using lookbehind assertions for string replacement.
const RE_DOLLAR_PREFIX = /(?<=\$)foo/g;
'$foo %foo foo'.replace(RE_DOLLAR_PREFIX, 'bar');
// '$bar %foo foo'
In the code above, only foo that follows a dollar sign is replaced.
The implementation of "lookbehind assertions" requires first matching x in /(?<=y)x/, then going back to the left to match the y part. This "right-to-left" execution order is opposite to all other regular expression operations, leading to some unexpected behavior.
First, group matching in lookbehind assertions produces different results than in normal cases.
/(?<=(\d+)(\d+))$/.exec('1053') // ["", "1", "053"]
/^(\d+)(\d+)$/.exec('1053') // ["1053", "105", "3"]
In the code above, two capture groups need to be matched. Without a "lookbehind assertion," the first group is greedy and the second group can only capture one character, resulting in 105 and 3. With a "lookbehind assertion," since execution is right-to-left, the second group is greedy and the first group can only capture one character, resulting in 1 and 053.
Second, backreferences in "lookbehind assertions" are also reversed from the usual order — they must be placed before the corresponding capture group.
/(?<=(o)d\1)r/.exec('hodor') // null
/(?<=\1d(o))r/.exec('hodor') // ["r", "o"]
In the code above, if the backreference (\1) in the lookbehind assertion is placed after the capture group, no match is found. It must be placed before the group to work. This is because the body of a lookbehind assertion is evaluated right-to-left: the group (o), which sits further to the right, has to be captured first before \1, to its left, can refer to it.
# Unicode Property Classes
ES2018 introduced a new class syntax \p{...} and \P{...}, allowing regular expressions to match all characters that satisfy a certain Unicode property.
const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test('π') // true
In the code above, \p{Script=Greek} specifies matching a Greek letter, so matching π succeeds.
Unicode property classes require specifying a property name and property value.
\p{UnicodePropertyName=UnicodePropertyValue}
For certain properties, only the property name or only the property value can be written.
\p{UnicodePropertyName}
\p{UnicodePropertyValue}
\P{…} is the inverse match of \p{…}, matching characters that do not satisfy the condition.
Note that these two classes only work in Unicode mode, so the u flag must always be added when using them. Without the u flag, \p and \P are not recognized as property classes: the pattern does not throw, but \p is read as an escaped literal p, so the expression silently fails to match the intended characters.
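The three forms can be tried side by side. The following sketch uses made-up sample strings; Script=Greek, Decimal_Number and Lu are standard Unicode property names and values.

```javascript
const text = 'π ≈ 3.14, α and β';

// Name=value form: runs of characters whose Script property is Greek.
const greek = text.match(/\p{Script=Greek}+/gu);
console.log(greek); // [ 'π', 'α', 'β' ]

// \P{...} is the complement: strip everything that is not a decimal digit.
const digits = text.replace(/\P{Decimal_Number}/gu, '');
console.log(digits); // "314"

// Value-only form for General_Category: \p{Lu} is an uppercase letter.
const upper = 'Hello World'.match(/\p{Lu}/gu);
console.log(upper); // [ 'H', 'W' ]
```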
Due to the large number of Unicode properties, this new class syntax is extremely expressive.
const regex = /^\p{Decimal_Number}+$/u;
regex.test('𝟏𝟐𝟑𝟜𝟝𝟞𝟩𝟪𝟫𝟬𝟭𝟮𝟯𝟺𝟻𝟼') // true
In the code above, the property class matches all decimal digits. As you can see, decimal digits in various styles all match successfully.
\p{Number} can even match Roman numerals.
// Match all numbers
const regex = /^\p{Number}+$/u;
regex.test('²³¹¼½¾') // true
regex.test('㉛㉜㉝') // true
regex.test('ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫ') // true
Below are some other examples.
// Match all whitespace
\p{White_Space}
// Match all alphabetic characters of any script, equivalent to Unicode version of \w
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
// Match all non-alphabetic characters of any script, equivalent to Unicode version of \W
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
// Match Emoji
/\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu
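// Match all arrow characters
// (ECMAScript property classes do not support the Block property, so \p{Block=Arrows}
// is a SyntaxError; the Arrows block U+2190 to U+21FF is written as an explicit range instead)
const regexArrows = /^[\u2190-\u21FF]+$/u;
regexArrows.test('←↑→↓↔↕↖↗↘↙⇏⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙⇧⇩') // true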
# Named Capture Groups
# Introduction
Regular expressions use parentheses for group matching.
const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;
The code above has three groups of parentheses in the regular expression. Using the exec method, the three group matches can be extracted.
const RE_DATE = /(\d{4})-(\d{2})-(\d{2})/;
const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj[1]; // 1999
const month = matchObj[2]; // 12
const day = matchObj[3]; // 31
A problem with group matching is that the meaning of each group is not easy to discern, and groups can only be referenced by numerical index (e.g., matchObj[1]). If the order of groups changes, the index references must be updated.
ES2018 introduced Named Capture Groups, which allow each group match to be assigned a name, making the code easier to read and reference.
const RE_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj.groups.year; // 1999
const month = matchObj.groups.month; // 12
const day = matchObj.groups.day; // 31
In the code above, "named capture groups" add "question mark + angle brackets + group name" (?<year>) at the beginning of the pattern inside the parentheses. The group name can then be referenced on the groups property of the result returned by exec. Numerical indexes (e.g., matchObj[1]) remain valid.
Named capture groups essentially add an ID to each group match, making it easy to describe the purpose of the match. If the order of groups changes, the processing code does not need to be modified.
If a named group does not match, the corresponding groups object property will be undefined.
const RE_OPT_A = /^(?<as>a+)?$/;
const matchObj = RE_OPT_A.exec('');
matchObj.groups.as // undefined
'as' in matchObj.groups // true
In the code above, the named group as did not find a match, so matchObj.groups.as is undefined, but the key as always exists in groups.
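Because an unmatched named group is present with the value undefined, destructuring defaults apply to it directly. A minimal sketch with a hypothetical time parser; the pattern and the name parseTime are assumptions made for illustration.

```javascript
// The seconds part is optional, so its group may be undefined.
const RE_TIME = /^(?<hours>\d{2}):(?<minutes>\d{2})(?::(?<seconds>\d{2}))?$/;

// Hypothetical helper: parse "HH:MM" or "HH:MM:SS" into numbers.
function parseTime(text) {
  const match = RE_TIME.exec(text);
  if (match === null) {
    return null;
  }
  // The default kicks in when the seconds group did not participate.
  const { hours, minutes, seconds = '00' } = match.groups;
  return {
    hours: Number(hours),
    minutes: Number(minutes),
    seconds: Number(seconds)
  };
}

console.log(parseTime('12:30'));    // { hours: 12, minutes: 30, seconds: 0 }
console.log(parseTime('12:30:45')); // { hours: 12, minutes: 30, seconds: 45 }
console.log(parseTime('noon'));     // null
```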
# Destructuring Assignment and Replacement
With named capture groups, you can use destructuring assignment to directly assign variables from the match result.
let {groups: {one, two}} = /^(?<one>.*):(?<two>.*)$/u.exec('foo:bar');
one // foo
two // bar
For string replacement, use $<groupName> to reference named groups.
let re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
'2015-01-02'.replace(re, '$<day>/$<month>/$<year>')
// '02/01/2015'
In the code above, the second argument to replace is a string, not a regular expression.
The second argument to replace can also be a function, with the following parameter sequence.
'2015-01-02'.replace(re, (
matched, // entire match result 2015-01-02
capture1, // first group match 2015
capture2, // second group match 01
capture3, // third group match 02
position, // start position of the match 0
S, // original string 2015-01-02
groups // object composed of named groups {year, month, day}
) => {
let {day, month, year} = groups;
return `${day}/${month}/${year}`;
});
Named capture groups add a new final function parameter on top of the original parameters: an object composed of the named groups. This object can be destructured directly inside the function.
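Since the groups object always comes last, a callback can pick it up with rest parameters and stay independent of how many capture groups the pattern has. The following is a sketch; the helper name toDayFirst is hypothetical.

```javascript
const RE_ISO_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;

// Hypothetical helper: rewrite every ISO date in `text` as day/month/year.
function toDayFirst(text) {
  return text.replace(RE_ISO_DATE, (...args) => {
    // The groups object is the last argument, however many captures exist.
    const { day, month, year } = args[args.length - 1];
    return `${day}/${month}/${year}`;
  });
}

console.log(toDayFirst('From 2015-01-02 to 2016-11-30'));
// "From 02/01/2015 to 30/11/2016"
```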
# References
To reference a "named capture group" within a regular expression, use the \k<groupName> syntax.
const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test('abc!abc') // true
RE_TWICE.test('abc!ab') // false
Numerical references (\1) are still valid.
const RE_TWICE = /^(?<word>[a-z]+)!\1$/;
RE_TWICE.test('abc!abc') // true
RE_TWICE.test('abc!ab') // false
Both reference syntaxes can be used simultaneously.
const RE_TWICE = /^(?<word>[a-z]+)!\k<word>!\1$/;
RE_TWICE.test('abc!abc!abc') // true
RE_TWICE.test('abc!abc!ab') // false
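A common use of such references is finding accidentally doubled words. The sketch below combines \k<word> inside the pattern with $<word> in the replacement string. The pattern and sample sentence are illustrative only.

```javascript
// A word, whitespace, then the same word again (case-insensitive).
const RE_DOUBLED = /\b(?<word>\w+)\s+\k<word>\b/gi;

const sentence = 'This is is a test test';

const doubled = sentence.match(RE_DOUBLED);
console.log(doubled); // [ 'is is', 'test test' ]

// Keep a single copy of each doubled word.
const fixed = sentence.replace(RE_DOUBLED, '$<word>');
console.log(fixed); // "This is a test"
```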
# String.prototype.matchAll()
When a regular expression has multiple matches in a string, the common approach is to use the g or y flag and extract them one by one in a loop.
var regex = /t(e)(st(\d?))/g;
var string = 'test1test2test3';
var matches = [];
var match;
while (match = regex.exec(string)) {
matches.push(match);
}
matches
// [
// ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"],
// ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"],
// ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]
// ]
In the code above, the while loop extracts each round of regex matches — three rounds in total.
ES2020 added the String.prototype.matchAll() method, which retrieves all matches at once. However, it returns an iterator, not an array.
const string = 'test1test2test3';
// The g flag is required: matchAll throws a TypeError if the regex lacks it
const regex = /t(e)(st(\d?))/g;
for (const match of string.matchAll(regex)) {
console.log(match);
}
// ["test1", "e", "st1", "1", index: 0, input: "test1test2test3"]
// ["test2", "e", "st2", "2", index: 5, input: "test1test2test3"]
// ["test3", "e", "st3", "3", index: 10, input: "test1test2test3"]
In the code above, since string.matchAll(regex) returns an iterator, it can be consumed with a for...of loop. The advantage of returning an iterator over an array is that iterators are more resource-efficient when the match results form a very large array.
Converting the iterator to an array is simple: use the spread syntax (...) or the Array.from() method.
// Convert to array: method 1
[...string.matchAll(regex)]
// Convert to array: method 2
Array.from(string.matchAll(regex))
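matchAll() also works well together with named capture groups. A minimal sketch that collects key=value pairs into an object; the helper name parsePairs and the input format are assumptions made for illustration.

```javascript
const RE_PAIR = /(?<key>\w+)=(?<value>\w+)/g;

// Hypothetical helper: parse "key=value" pairs into a plain object.
function parsePairs(text) {
  const entries = [];
  for (const match of text.matchAll(RE_PAIR)) {
    entries.push([match.groups.key, match.groups.value]);
  }
  return Object.fromEntries(entries);
}

const config = parsePairs('host=localhost port=8080 mode=dev');
console.log(config); // { host: 'localhost', port: '8080', mode: 'dev' }

// matchAll works on an internal copy, so the shared regex is not advanced.
console.log(RE_PAIR.lastIndex); // 0
```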