cppannotations/annotations/yo/stl/regexlanguage.yo
Frank B. Brokken 75d04f6683 typos
2016-11-08 21:54:15 +01:00

125 lines
5.3 KiB
Text

Regular expressions are expressions consisting of elements resembling those of
numeric expressions. Regular expressions consist of basic elements and
operators, having various priorities and associations. Like numeric
expressions, parentheses can be used to group elements together to form a
units on which operators operate. For an extensive discussion the reader is
referred to, e.g., section 15.10 of the url(ecma-international.org)
(http://ecma-international.org/ecma-262/5.1/#sec-15.10) page, which
describes the characteristics of the regular expressions used by default by
bf(C++)'s tt(regex) classes.
bf(C++)'s default definition of regular expressions distinguishes the
following em(atoms):
itemization(
itt(x): the character `x';
itt(.): any character except for the newline character;
itt([xyz]): a character class; in this case, either an `x', a `y', or a `z'
matches the regular expression. See also the paragraph about
character classes below;
itt([abj-oZ]): a character class containing a range of characters; this
regular expression matches an `a', a `b', any letter from `j' through
`o', or a `Z'. See also the paragraph about character classes below;
itt([^A-Z]): a negated character class: this regular expression matches
any character but those in the class beyond tt(^). In this case, any
character em(except for) an uppercase letter. See also the paragraph
about character classes below;
itt([:predef:]): a em(predefined) set of characters. See below for an
overview. When used, it is interpreted as an element in a character
class. It is therefore always embedded in a set of square brackets
defining the character class (e.g., tt([[:alnum:]]));
itt(\X): if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
interpretation of `\x'. Otherwise, a literal `X' (used to escape
operators such as tt(*));
itt((r)): the regular expression tt(r). It is used to override precedence
(see below), but also to define tt(r) as a
emi(marked sub-expression) whose matching characters may directly be
retrieved from, e.g., an tt(std::smatch) object (cf. section
ref(SMATCH));
itt((?:r)): the regular expression tt(r). It is used to override
precedence (see below), but it is em(not) regarded as a em(marked
sub-expression);
)
In addition to these basic atoms, the following special atoms are available
(which can also be used in character classes):
itemization(
itt(\s): a whitespace character;
itt(\S): any character but a whitespace character;
itt(\d): a decimal digit character;
itt(\D): any character but a decimal digit character;
itt(\w): an alphanumeric character or an underscore (tt(_)) character;
itt(\W): any character but an alphanumeric character or an underscore
(tt(_)) character.
)
Atoms may be concatenated. If tt(r) and tt(s) are atoms then the regular
expression tt(rs) matches a target text if the target text matches tt(r)
em(and) tt(s), in that order (without any intermediate characters
inside the target text). E.g., the regular expression tt([ab][cd]) matches the
target text tt(ac), but not the target text tt(a:c).
Atoms may be combined using operators. Operators bind to the preceding
atom. If an operator should operate on multiple atoms the atoms must be
surrounded by parentheses (see the last element in the previous itemization).
To use an operator character as an atom it can be escaped. Eg., tt(*)
represent an operator, tt(\*) the atom character star. Note that
character classes do not recognize escape sequences: tt([\*]) represents a
character class consisting of two characters: a backslash and a star.
The following operators are supported (tt(r) and tt(s) represent regular
expression atoms):
itemization(
itt(r*): zero or more tt(r)s;
itt(r+): one or more tt(r)s;
itt(r?): zero or one tt(r)s (that is, an optional r);
itt(r{m, n}): where tt(1 <= m <= n): matches `r' at least m, but at most n
times;
itt(r{m,}): where tt(1 <= m): matches `r' at least m times;
itt(r{m}): where tt(1 <= m): matches `r' exactly m times;
itt(r|s): matches either an `r' or an `s'. This operator has a lower
priority than any of the multiplication operators;
itt(^r) : tt(^) is a pseudo operator. This expression matches `r', if
appearing at the beginning of the target text. If the tt(^)-character
is not the first character of a regular expression it is interpreted
as a literal tt(^)-character;
itt(r$): tt($) is a pseudo operator. This expression matches `r', if
appearing at the end of the target text. If the tt($)-character is not
the last character of a regular expression it is interpreted as a
literal tt($)-character;
)
When a regular expression contains marked sub-expressions and multipliers, and
the marked sub-expressions are multiply matched, then the target's final
sub-string matching the marked sub-expression is reported as the text matching
the marked sub-expression. E.g, when using tt(regex_search) (cf. section
ref(REGSRCH)), marked sub-expression (tt(((a|b)+\s?))), and target text tt(a a
b), then tt(a a b) is the fully matched text, while tt(b) is reported as the
sub-string matching the first and second marked sub-expressions.