Regular expressions are a programming language for describing patterns
in strings. At the syntax level, it's important to understand which characters
are metacharacters (have a special meaning), and which are literal
characters (stand for themselves). At the semantic level, several basic concepts
are important: character classes, quantifiers, boundaries,
grouping, and alternation. These fundamental regex elements apply
to all implementations, and will solve most of your regex needs.
Metacharacters
The characters that have special meaning are called metacharacters. A
preceding backslash ("\") turns a metacharacter into a literal character. The set
of metacharacters in character classes, i.e. between [ and ], is different.
| Char | Meaning |
| \ | Turns metacharacters into literal characters, and literal characters into metacharacters. Because this is also the Java escape character in strings, it must be doubled. |
| [ | Starts character class definition. |
| ( | Starts a group. |
| { | Encloses repetition count. {min,max} |
| ^ | Matches boundary at beginning. Class negation when immediately after [. |
| $ | Matches boundary at end. |
| . | Matches any single character. |
| ? | Preceding element must match zero or one time. |
| * | Preceding element must match zero or more times. |
| + | Preceding element must match one or more times. |
| | | Either preceding or following element must match. |
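The doubling of the backslash is easiest to see in a small example. The following is a minimal sketch, not taken from the original text: the class name EscapeDemo and its matches() helper are made up for illustration, while Pattern.matches() and Pattern.quote() are standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Illustrative helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // An unescaped '.' is a metacharacter: it matches any single character.
        System.out.println(matches("3.14", "3x14"));    // true
        // "\\." in Java source is the regex \. which matches only a literal dot.
        System.out.println(matches("3\\.14", "3x14"));  // false
        System.out.println(matches("3\\.14", "3.14"));  // true
        // Unescaped '+' is a quantifier, so this regex does not match the text "a+b".
        System.out.println(matches("a+b", "a+b"));      // false
        // Pattern.quote() makes every character of its argument literal.
        System.out.println(matches(Pattern.quote("a+b"), "a+b")); // true
    }
}
```

Pattern.quote() is often the simpler choice when an entire string, such as user input, should be matched literally.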
Boundaries
A boundary is the position between two characters or at the beginning
or end. The two most commonly used boundaries are ^ (matches at beginning) and $
(matches at end).
| Code | Meaning |
| ^ | Beginning of a line. |
| $ | End of a line. |
| \A | Beginning of the input. |
| \z | End of the input. |
| \Z | End of input, ignoring final terminator, if any. |
| \G | End of the previous match (to indicate where a new match should start). |
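A short sketch may help show how these boundaries behave in Java. It is an illustration only: BoundaryDemo and findAll() are hypothetical names, and the Pattern.MULTILINE flag is not mentioned in the text above. In Java's default mode ^ matches only at the beginning of the input; with MULTILINE it also matches at the start of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Illustrative helper: collects every substring of input that the pattern finds.
    static List<String> findAll(Pattern p, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\n";
        // Default mode: ^ matches only at the beginning of the input.
        System.out.println(findAll(Pattern.compile("^\\w+"), text)); // [one]
        // MULTILINE mode: ^ also matches just after each line terminator.
        System.out.println(findAll(Pattern.compile("^\\w+", Pattern.MULTILINE), text)); // [one, two]
        // \Z ignores the final line terminator; \z requires the true end of input.
        System.out.println(Pattern.compile("two\\Z").matcher(text).find()); // true
        System.out.println(Pattern.compile("two\\z").matcher(text).find()); // false
    }
}
```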
Character classes
A character class defines a set of characters. It matches exactly one
character unless it is followed by a quantifier specifying how many.
Predefined character classes
Notice the uppercase class is the negation of the lowercase class.
| Code | Matches |
| . | Any character. |
| \d | A digit. Same as [0-9] |
| \D | A non-digit. Same as [^0-9] or [^\d] |
| \s | A whitespace character. Same as [ \t\n\x0B\f\r] |
| \S | A non-whitespace character. Same as [^\s] |
| \w | A "word" character. Same as [a-zA-Z0-9_]. This includes the underscore, which not all regex libraries do. It does NOT include non-ASCII Unicode characters (see \p{L} below). |
| \p{L} | Unicode letters. |
| \W | A non-word character. Same as [^\w] |
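The predefined classes can be tried out with a few one-line checks. This is a small sketch under the default Java settings (no flags); CharClassDemo and matches() are illustrative names, and the letter é is written as \u00e9 to avoid source-encoding problems.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Illustrative helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d\\d", "42"));          // true: two digits
        System.out.println(matches("\\D", "4"));              // false: '4' is a digit
        System.out.println(matches("a\\sb", "a\tb"));         // true: tab is whitespace
        System.out.println(matches("\\w+", "snake_case1"));   // true: underscore is a word character
        // By default \w covers only [a-zA-Z0-9_], so the accented letter fails...
        System.out.println(matches("\\w+", "caf\u00e9"));     // false
        // ...while \p{L} accepts any Unicode letter.
        System.out.println(matches("\\p{L}+", "caf\u00e9"));  // true
    }
}
```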
Quantifiers
An element, X, which may be a literal character, a character class, or
a group, may be followed by a quantifier, which indicates how often it
should be matched.
Quantifiers are classified as greedy or lazy. Greedy
quantifiers try to match as much as possible, and reduce the amount they match
only if forced to by later failures. Lazy quantifiers match as little as
possible, and only expand if required by a later failure. Unlike most regex
libraries, Java supports possessive quantifiers, which are not only
greedy, but won't give back anything they've matched. They can provide a speed
advantage in some circumstances.
| Code | Meaning |
| X? | X must match zero or one time. Greedy. |
| X* | X must match zero or more times. Greedy. |
| X+ | X must match one or more times. Greedy. |
| X{n} | X must match exactly n times. |
| X{n,} | X must match at least n times. Greedy. |
| X{n,m} | X must match at least n times, but no more than m times. Greedy. |
| X?? | X must match zero or one time. Lazy. |
| X*? | X must match zero or more times. Lazy. |
| X+? | X must match one or more times. Lazy. |
| X{n,}? | X must match at least n times. Lazy. |
| X{n,m}? | X must match at least n times, but no more than m times. Lazy. |
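The difference between greedy, lazy, and possessive matching shows up clearly when the same input is searched with each form. The sketch below is illustrative only: QuantifierDemo and firstMatch() are made-up names, and the possessive form X++ (a quantifier followed by +) does not appear in the table above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Illustrative helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        // Greedy: .+ grabs as much as it can, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", html));    // <b>bold</b>
        // Lazy: .+? grabs as little as it can, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", html));   // <b>
        // Possessive: .++ consumes through the end and never gives back the
        // final '>', so the overall match fails.
        System.out.println(firstMatch("<.++>", html));   // null
        // Counted repetition follows the same greedy/lazy rule.
        System.out.println(firstMatch("a{2,3}", "aaaa"));   // aaa
        System.out.println(firstMatch("a{2,3}?", "aaaa"));  // aa
    }
}
```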
Grouping
| Code | Meaning |
| (X) | This matches X as usual, and it also records the beginning
and end of the substring that X matches. This forms a group
that can be used in one of three ways:
- Matcher methods can be called to get the number of groups, a
particular group by number, or the beginning and end character index
of any group.
- Back references can be made inside a pattern to match previous
groups that were matched. These references are of the form
\n, where n is the number of a previous group.
- The Matcher appendReplacement() method may reference groups
in the replacement string using $n.
Group 0 is the entire match. For other groups, the number of the
group corresponds to the number of the left parenthesis in the regex
when counting from the left, starting at one.
The group includes only the last repetition caused by quantifiers.
Enclose the quantified element in an outer group if you want all the
repetitions in one group. |
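The three uses of groups, and the last-repetition rule, can be sketched as follows. This is an illustration rather than a definitive recipe: GroupDemo, swapYearMonth(), and groupOne() are hypothetical names, and Matcher.replaceAll() is used in place of appendReplacement() because it accepts the same $n replacement syntax in less code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: rewrites "yyyy-mm" as "mm/yyyy" using $n references.
    static String swapYearMonth(String input) {
        return Pattern.compile("(\\d{4})-(\\d{2})").matcher(input).replaceAll("$2/$1");
    }

    // Hypothetical helper: returns group 1 if the whole input matches, else null.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // 1. Matcher methods: group count, group text, and group indexes.
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher("on 2024-05 ok");
        if (m.find()) {
            System.out.println(m.groupCount());                 // 2
            System.out.println(m.group(0));                     // 2024-05
            System.out.println(m.group(1));                     // 2024
            System.out.println(m.start(2) + ".." + m.end(2));   // 8..10
        }
        // 2. Back reference: \1 must repeat whatever group 1 matched.
        System.out.println(Pattern.matches("(\\w)\\1", "aa"));  // true
        System.out.println(Pattern.matches("(\\w)\\1", "ab"));  // false
        // 3. Group references in a replacement string.
        System.out.println(swapYearMonth("2024-05"));           // 05/2024
        // A quantified group keeps only its last repetition...
        System.out.println(groupOne("(\\d)+", "123"));          // 3
        // ...unless an outer group wraps the quantified element.
        System.out.println(groupOne("((\\d)+)", "123"));        // 123
    }
}
```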
Alternation
| Code | Meaning |
| X|Y | Tries to match X. If that fails, it tries to match Y. |
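Because X is tried before Y, the order of the alternatives matters, and because | binds more loosely than concatenation, a group is usually needed to limit its scope. The sketch below illustrates both points; AlternationDemo and firstMatch() are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Illustrative helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The first alternative that succeeds wins, even if a later one is longer.
        System.out.println(firstMatch("cat|category", "category")); // cat
        System.out.println(firstMatch("category|cat", "category")); // category
        // A group limits the alternation to the single letter a or e.
        System.out.println(Pattern.matches("gr(a|e)y", "grey"));    // true
        // Without the group, the alternatives are the whole of "gra" and "ey".
        System.out.println(Pattern.matches("gra|ey", "grey"));      // false
    }
}
```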