Syntax of POSIX extended regular expressions as used in ARB
A regular expression specifies a set of character strings,
e.g. the expression '/pseu/i' specifies all strings containing
"pseu", "Pseu" or "pSeu" and so on. We say the expression "matches"
(a part of) these strings.
Several characters have special meanings in regular expressions.
All other characters just match against themselves.
Special characters:
-
'.' matches any character (e.g. '/h.s/' matches "has" and "his")
'[xyz]' matches 'x', 'y' or 'z'
'[a-z]' matches all lower case letters
'^' matches the beginning of the string
(e.g. '/^pseu/i' matches all strings starting with "pseu")
'$' matches the end of the string
(e.g. '/cens$/i' matches all strings ending in "cens")
-
'*' matches the preceding element zero or more times
(e.g. '/th*is/' matches "tis", "this", "thhhhhhiss", ..)
'?' matches the preceding element zero or one time
(e.g. '/th?is/' matches "tis" or "this", but not "thhis")
'+' matches the preceding element one or more times
(e.g. '/th+is/' matches "this" or "thhhis", but not "tis")
'{mi,ma}' matches the preceding element 3 to 5 times
(e.g. '/th{2,4}is/' matches "thhis", "thhhis" or "thhhhis")
-
'|' marks an alternative
(e.g. '/bacter|spiri/i' matches all strings containing "bacter" or "spiri")
-
'()' marks a subexpression. Subexpressions can be used to separate alternatives
or to mark parts for use in the replace expression (see below).
(e.g. '/bact|spiri.*cens/' match '/bact/' or '/spiri.*cens/',
whereas '/(bact|spiri).*cens/' match '/bact.*cens/' or '/spiri.*cens/')
-
To match against special characters themselves, escape them
using a '\' (e.g. '/\*/' matches the character "*", '/\\/' matches "\")
Character classes:
-
[...] is called a character class. It matches against any of the characters
listed in between the brackets.
[^...] If the character class starts with '^' it matches against any character
NOT listed (e.g. '[^78]' matches all but '7' or '8')
[5-9] If the character class contains a '-' it is interpreted as "range of characters".
Here '5-9' is equivalent to '56789'.
You may mix ranges and single characters, e.g. '14-79' is same as '145679',
'7-91-3' is same as '789123'.
-
To add special characters to a character class, escape them using '\'.
-
There are several special predefined character classes like '[:word:]' or
'[:punct:]'. See link below for details.
Links:
* A more in-depth explanation of POSIX extended regular expressions can be
found at http://en.wikipedia.org/wiki/Regular_expression#POSIX.
* Many examples are given in this guide: http://www.digitalamit.com/article/regular_expression.phtml
Notes:
* if a expression matches one string multiple times, the longest leftmost
match is used (e.g: '/a*e*/' matches 'aaeee' at position 3 of the
string 'bbaaeeeffaegg', not 'ae' at position 10).
|