Intro

A regular expression is a pattern: a sequence of ordinary characters and metacharacters. The pattern serves as a template for finding (and possibly replacing) a desired arrangement of characters in a body of text.

Here is a rough history of regular expressions:

The VBScript RegExp object implements regular expressions slightly differently from the JavaScript/JScript RegExp object. Both RegExp objects are modeled on regular expressions in Perl.

Pattern Syntax

Most characters in a regular expression match themselves. EG: /geo/ will find the "geo" in george and in gorgeous. However, some characters have special meaning in regular expression patterns. Here is the basic list of these metacharacters.

\ | () [] {} ^ $ * + ? .
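As a small illustration of the difference between an ordinary character and a metacharacter, the following JavaScript sketch runs the /geo/ example from above and then tests a pattern with and without a backslash escape. The variable names and the sample strings "abc" and "a.c" are made up for this example.

```javascript
// Ordinary characters match themselves anywhere in the text.
const inGeorge = /geo/.test("george");       // true
const inGorgeous = /geo/.test("gorgeous");   // true: "gor-geo-us"

// "." is a metacharacter: it matches any single character except a line break.
const dotIsWildcard = /a.c/.test("abc");     // true

// "\." is an escaped period: it matches only a literal ".".
const escapedOnAbc = /a\.c/.test("abc");     // false
const escapedOnDot = /a\.c/.test("a.c");     // true

console.log(inGeorge, inGorgeous, dotIsWildcard, escapedOnAbc, escapedOnDot);
```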

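Assertions are zero-width parts of a pattern: they match a position in the text, such as the start of the string or a word boundary, without consuming any characters. Atoms are the parts with non-zero width, i.e. the parts that match actual characters.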

Quantifiers say how many of the atom immediately preceding should match in a row. The quantifiers are *, +, ?, and {}. EG: /hi{2}/ matches hii while /(hi){2}/ matches hihi.
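The binding of a quantifier to the atom just before it can be checked with a short JavaScript sketch. The helper firstMatch is hypothetical, written only for this illustration, and the test strings are taken from the examples in this document.

```javascript
// Hypothetical helper: return the first matched substring, or null if none.
function firstMatch(pattern, text) {
  const result = pattern.exec(text);
  return result === null ? null : result[0];
}

// A quantifier binds only to the atom immediately before it.
const lastCharTwice = firstMatch(/hi{2}/, "hii");    // "hii": "h", then "i" twice
const groupTwice = firstMatch(/(hi){2}/, "hihi");    // "hihi": the group "hi" twice
const noRunOfI = firstMatch(/hi{2}/, "hihi");        // null: the text has no "hii"

// *, + and ? differ in how many repetitions they allow.
const zeroOrMore = firstMatch(/bo*/, "b");           // "b": zero "o" is allowed
const oneOrMore = firstMatch(/bo+/, "b");            // null: at least one "o" needed
const zeroOrOne = firstMatch(/bo?/, "boooo");        // "bo": at most one "o" is taken

console.log(lastCharTwice, groupTwice, noRunOfI, zeroOrMore, oneOrMore, zeroOrOne);
```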

Flags are not part of the pattern, but affect how the pattern is applied.
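A minimal JavaScript sketch of flags, assuming the standard i (ignore case), g (global), and m (multiline) flags of the JavaScript RegExp object. The sample text and variable names are invented for the illustration. The same pattern, ^a, gives different results depending only on the flags.

```javascript
const text = "Ana\nana";

// No flags: case-sensitive, and ^ anchors only at the start of the string.
const noFlags = text.match(/^a/);        // null: the string starts with "A"

// i: ignore case, so "A" satisfies "a".
const ignoreCase = text.match(/^a/i);    // first match is "A"

// m: ^ also matches after a line break, so the second line qualifies.
const multiline = text.match(/^a/m);     // first match is "a" at index 4

// g: return every match instead of only the first one.
const everyLine = text.match(/^a/gim);   // ["A", "a"]

console.log(noFlags, ignoreCase[0], multiline.index, everyLine);
```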

Character Description
\ Escapes, i.e. marks the next character as special, a literal, a back reference, or an octal escape.
EG: "n" matches "n", but "\n" matches a newline. An escape of particular note is "\\", which matches a literal backslash.
^ (1) Anchors at start, i.e. matches at the beginning of target string. If RegExp.Multiline is set, then also matches after "\n" or "\r".
EG: "^a" matches the first a in "ana" but not the second.

(2) In sets, negates the set.
EG: "[^x-z]" matches any character except for "x", "y", or "z".
$ Anchors at end, i.e. matches at the end of target string. If RegExp.Multiline is set, then also matches before "\n" or "\r".
EG: "a$" matches the second a in "ana" but not the first.
. Matches any single character except line terminators: [\n\r\u2028\u2029].
EG: "bo." matches "boo", and the leading "boo" of "booo" and "boooo", but matches nothing in "b" or "bo".
* Quantifier: Matches the preceding sub-expression 0 or more times. Same as {0,}.
EG: "bo*" matches "b", "bo", "boo", "booo", "boooo".
+ Quantifier: Matches the preceding sub-expression 1 or more times. Same as {1,}.
EG: "bo+" matches "bo", "boo", "booo", and "boooo", but not "b".
? (1) Quantifier: Matches the preceding sub-expression 0 or 1 times. Same as {0,1}.
EG: "bo?" matches "b" and "bo", and only the leading "bo" of "boo", "booo", and "boooo".

(2) If used immediately after one of the other quantifiers (*, +, ?, and {}), makes that quantifier non-greedy, so that it matches as few characters as possible.
EG: Against "XHello world.X Xfoo barX", "X.+X" matches the whole string, while "X.+?X" matches only "XHello world.X".

(3) Used in the lookahead assertions (?=) and (?!), and in the non-capturing group (?:).
{n} Matches the preceding sub-expression exactly n times.
EG: "bo{2}" matches "boo", and the leading "boo" of "booo" and "boooo", but matches nothing in "b" or "bo".
{n,} Matches the preceding sub-expression n or more times.
EG: "bo{2,}" matches "boo", "booo", and "boooo", but not "b" or "bo".
{n,m} Matches the preceding sub-expression n to m times.
EG: "bo{2,3}" matches "boo" and "booo", and the leading "booo" of "boooo", but not "b" or "bo".
(pattern)
\(pattern\)
(Latter in POSIX)
(1) Used in a mathematical fashion for grouping, scoping, and setting precedence.
EG: "dais(y|ies)" is the same as "daisy|daisies".

(2) Matches the pattern and captures (remembers) it. Captured matches can be retrieved from the Matches collection (VBScript), the $1 ... $9 properties of RegExp (JavaScript), the $1, $2, ... variables (Perl), or \n back references (POSIX).
EG: "/<(.*)>.*<\/\1>/" matches paired elements like "<p>hi</p>".
EG: "/^(.)(.).*\2\1$/" matches strings like "ABcdedBA".
(?:pattern) Non-capturing group: Matches the pattern. In spite of parentheses, this does not capture.
EG: "dais(?:y|ies)" matches "daisy" or "daisies", like "dais(y|ies)", but stores no captured match.
pattern1(?=pattern2) Positive lookahead assertion: Matches pattern1 only if it is followed by pattern2. pattern2 is not part of the match and, in spite of parentheses, is not captured.
EG: "Windows (?=95|98)" matches "Windows " of "Windows 98" but not "Windows " of "Windows 2000".
pattern1(?!pattern2) Negative lookahead assertion: Matches pattern1 only if it is not followed by pattern2. pattern2 is not part of the match and, in spite of parentheses, is not captured.
EG: "Windows (?!95|98)" matches "Windows " of "Windows 2000" but not "Windows " of "Windows 98".
x|y Separates alternatives. Matches x or y.
EG: "g|food" matches "g" or "food". "(g|f)ood" matches "good" or "food".
[xyz] Positive character set: matches any one of the enclosed characters.
EG: "[ab]" matches the "a" and the "b" of "abcd", one character per match.
[^xyz] Negative character set: matches any one character not enclosed.
EG: "[^ab]" matches the "c" and the "d" of "abcd", one character per match.
[x-z] Positive range of characters.
EG: "[x-z]" matches any of "x", "y", or "z".
[^x-z] Negative range of characters.
EG: "[^x-z]" matches any character except for "x", "y", or "z".
\b Matches a word boundary, i.e. the position between a word character (\w) and a non-word character (\W) or an end of the string.
EG: "er\b" matches "er" in "hover x" but not the "er" in "Ebert".
\B Matches a non-word boundary, i.e. a position between two word characters or between two non-word characters.
EG: "er\B" matches "er" in "Ebert" but not the "er" in "hover x".
\cx Matches a control character x, where x is A-Z or a-z.
EG: "\cM" matches ctrl+M (carriage return character).
\d
[:digit:]
(latter in POSIX)
Matches a digit character. Same as [0-9].
\D Matches a non-digit character. Same as [^0-9].
\f Matches a form-feed character. Same as [\x0c\cL].
\n Matches a newline character. Same as [\x0a\cJ]. FYI: line endings by system: Windows \r\n; Unix \n; classic Mac OS \r.
\r Matches a carriage return character. Same as [\x0d\cM]. FYI: line endings by system: Windows \r\n; Unix \n; classic Mac OS \r.
\s
[:space:]
(latter in POSIX)
Matches a whitespace character. Same as [ \t\n\v\f\r] or, where Unicode whitespace is also counted, [\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000].
\S Matches a non-whitespace character. Same as [^ \t\n\v\f\r] or, where Unicode whitespace is also counted, [^\t\n\v\f\r \u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u3000].
\t Matches a tab character. Same as [\x09\cI].
\v Matches a vertical tab character. Same as [\x0b\cK].
\w Matches a word character. Same as [A-Za-z0-9_].
\W Matches a non-word character. Same as [^A-Za-z0-9_].
[:alnum:] Matches alphanumeric characters in POSIX. Same as [A-Za-z0-9].
[:alpha:] Matches alphabet characters in POSIX. Same as [A-Za-z].
[:blank:] Matches space and tab in POSIX. Same as [ \t].
[:cntrl:] Matches control characters in POSIX. Same as [\x00-\x1F\x7F].
[:graph:] Matches graphical or visible characters in POSIX. Same as [\x21-\x7E].
[:lower:] Matches a lowercase character in POSIX. Same as [a-z].
[:print:] Matches graphical or visible characters and space in POSIX. Same as [\x20-\x7E].
[:punct:] Matches punctuation characters in POSIX. Same as [!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~].
[:upper:] Matches an uppercase character in POSIX. Same as [A-Z].
[:xdigit:] Matches any characters used in hexadecimal digits in POSIX. Same as [A-Fa-f0-9].
\n If integer n is preceded by at least n captured (parenthesized) matches, then back references the captured matches.
EG: "one(,)\stwo\1" matches "one, two," in "one, two, three".
EG: "/<(.*)>.*<\/\1>/" matches paired elements like "<p>hi</p>".
EG: "/^(.)(.).*\2\1$/" matches strings like "ABcdedBA".

Else if n is an octal number, then matches the character with that octal code. In VBScript, n must be 1 to 3 digits long (0-777).
EG: "\132" matches "Z".
\on Matches an ASCII octal character code n. JavaScript only.
EG: Under this syntax, "\o132" matches "Z" (octal 132 is decimal 90).
\xn Matches an ASCII hexadecimal character code n, where n has 2 digits.
EG: "\x5a" matches "Z".
\un Matches a Unicode hexadecimal character code n, where n has 4 digits.
EG: "\u00A2" matches "¢".
\0 Matches the NUL character. Same as [\u0000].
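
Several rows of the table are easier to follow when run. The JavaScript sketch below replays the greedy versus non-greedy, lookahead, non-capturing group, and back reference examples, using the full word "Windows" in the lookahead patterns. The variable names are invented for the illustration, and the results apply to the JavaScript RegExp object; other engines may differ in detail.

```javascript
// Greedy vs. non-greedy: "?" after a quantifier makes it match as little as possible.
const sample = "XHello world.X Xfoo barX";
const greedy = sample.match(/X.+X/)[0];     // the whole sample string
const lazy = sample.match(/X.+?X/)[0];      // "XHello world.X"

// Lookahead: the text after "Windows " is checked but not included in the match.
const positive = "Windows 98".match(/Windows (?=95|98)/)[0];    // "Windows "
const positiveMiss = /Windows (?=95|98)/.test("Windows 2000");  // false
const negative = /Windows (?!95|98)/.test("Windows 2000");      // true
const negativeMiss = /Windows (?!95|98)/.test("Windows 98");    // false

// Capturing vs. non-capturing groups: only (…) stores a captured match.
const capturing = /dais(y|ies)/.exec("daisies");       // ["daisies", "ies"]
const nonCapturing = /dais(?:y|ies)/.exec("daisies");  // ["daisies"]

// Back references: \1 and \2 repeat what groups 1 and 2 captured.
const paired = /<(.*)>.*<\/\1>/.exec("<p>hi</p>");     // group 1 is "p"
const mirrored = /^(.)(.).*\2\1$/.test("ABcdedBA");    // true

console.log(greedy, lazy, positive, capturing.length, nonCapturing.length, paired[1], mirrored);
```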

Matching Rules

There are six basic rules that regular expressions apply in order.

  1. Starting before the first character, try to match the pattern on everything to the right, then subtracting characters right-to-left. If no match, then repeat starting after the first character, and so on. EG: Match /ar/ in Cart.
    Cart // Does "Cart" match? ... NO
    Cart // Does "Car" match?  ... NO
    Cart // Does "Ca" match?   ... NO
    Cart // Does "C" match?    ... NO
    Cart // Does "" match?     ... NO
    Cart // Does "art" match?  ... NO
    Cart // Does "ar" match?   ... YES
  2. The whole pattern is regarded as a set of alternatives separated by vertical bars (|).
  3. Any specific alternative matches if every assertion or quantified atom in the alternatives matches sequentially according to Rules 4 and 5.
  4. If an assertion does not match according to the following table, backtrack to Rule 3 and try a higher-pecking-order item with different choices.
    Assertion Description
    ^ Matches at the beginning of the string.
    $ Matches at the end of the string.
    \b Matches a word boundary (between \w and \W), when not inside [].
    \B Matches a non-word boundary.
  5. A quantified atom matches only if the atom itself matches some number of times allowed by its following quantifier. Multiple matches must be adjacent within the string. If no match can be found at a current position for any allowed quantity of the atom, backtrack to Rule 3 and try higher-pecking-order items with different choices.
  6. Each atom matches according to its type. If the atom doesn't match, or doesn't allow a match of the rest of the regular expression, backtrack to Rule 5 and try the next choice for the atom's quantity.
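
The search order described in Rule 1 can be modelled with a short JavaScript sketch. This is only a model of the description above for a purely literal pattern, not how any real regular expression engine is implemented; the function names candidates and traceLiteralSearch are hypothetical.

```javascript
// Yield candidate substrings in the order Rule 1 describes: from each start
// position, the longest candidate first, shrinking down to the empty string.
function* candidates(text) {
  for (let start = 0; start <= text.length; start++) {
    for (let end = text.length; end >= start; end--) {
      yield text.slice(start, end);
    }
  }
}

// Return every candidate tried, up to and including the first one that
// equals the literal pattern. If nothing matches, all candidates are returned.
function traceLiteralSearch(text, literal) {
  const tried = [];
  for (const candidate of candidates(text)) {
    tried.push(candidate);
    if (candidate === literal) {
      return tried;
    }
  }
  return tried;
}

// Reproduces the /ar/ in "Cart" walk-through from Rule 1.
const trace = traceLiteralSearch("Cart", "ar");
console.log(trace); // ["Cart", "Car", "Ca", "C", "", "art", "ar"]
```

The trace lists the same seven candidates, in the same order, as the "Cart" listing under Rule 1.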

EGs

Numbers

Tag related as in HTML, XHTML, XML, etc.

Miscellany

Links

2008-03-29 16:10:18Z