Java - Regular Expressions Package API

Introduction

The package java.util.regex contains three classes that together provide full support for regular expressions.

The classes and their uses are:

Pattern holds the compiled form of a regular expression.
Matcher associates the string to be matched with a Pattern and performs the actual match.
PatternSyntaxException represents an error in a malformed regular expression.
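
As a rough illustration of how the three classes fit together, the sketch below compiles a pattern, matches it against an input, and catches the exception thrown for a malformed expression. The class name RegexClassesDemo, the helper methods firstNumber() and isValidRegex(), and the sample inputs are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexClassesDemo {
    // Hypothetical helper: returns the first run of digits in the input,
    // or null if the input contains no digits.
    static String firstNumber(String input) {
        Pattern p = Pattern.compile("\\d+");  // compiled form of the regex
        Matcher m = p.matcher(input);         // associates the input with the pattern
        return m.find() ? m.group() : null;   // performs the actual match
    }

    // Hypothetical helper: returns true if the regex compiles,
    // false if it is malformed.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("order 42 shipped")); // 42
        System.out.println(isValidRegex("[a-z]"));           // true
        System.out.println(isValidRegex("[a-z"));            // false (unclosed character class)
    }
}
```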

Compiling Regular Expressions

The Pattern class holds the compiled form of a regular expression and is immutable.

It has no public constructor. The class contains a static compile() method, which returns a Pattern object.

The compile() method is overloaded.

static Pattern compile(String regex)
static Pattern compile(String regex, int flags)

The following snippet of code compiles a regular expression into a Pattern object:

String regex = "[a-z]@.";
// Compile the regular expression into a Pattern object
Pattern p = Pattern.compile(regex);
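
Because a Pattern is immutable, one compiled instance can be reused for many inputs. The following is a minimal sketch of that idea using the same regular expression. The class name CompiledPatternDemo, the helper containsMatch(), and the sample inputs are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class CompiledPatternDemo {
    // Compiled once and shared; safe because Pattern is immutable.
    static final Pattern P = Pattern.compile("[a-z]@.");

    // Hypothetical helper: true if any part of the input matches the pattern.
    static boolean containsMatch(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(P.pattern());                        // [a-z]@.
        System.out.println(containsMatch("user@example.com"));  // true  ("r@e" matches)
        System.out.println(containsMatch("USER@EXAMPLE.COM"));  // false (no lowercase letter before @)
    }
}
```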

The flags parameter is a bit mask that modifies the way the pattern is matched. Multiple flags can be combined with the bitwise OR operator (|).
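
To make the bit-mask idea concrete, here is a small sketch that combines two flags and then checks which flags a compiled pattern carries by reading them back with the flags() method of Pattern. The class name FlagMaskDemo and the helper hasFlag() are hypothetical names used only in this example.

```java
import java.util.regex.Pattern;

class FlagMaskDemo {
    // Hypothetical helper: true if the given flag bit is set on the pattern.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        // Each flag is a distinct bit, so OR-ing them sets both.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        Pattern p = Pattern.compile("^abc$", flags);

        System.out.println(hasFlag(p, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(p, Pattern.MULTILINE));        // true
        System.out.println(hasFlag(p, Pattern.DOTALL));           // false
    }
}
```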

The flags, defined as int constants in the Pattern class, are listed below. Each flag name is followed by its description.

Pattern.CANON_EQ
Enables canonical equivalence. When this flag is specified, two characters match only if their full canonical decompositions match. The expression "a\u030A", for example, will match the string "\u00E5" when this flag is specified. By default, matching does not take canonical equivalence into account.
Pattern.CASE_INSENSITIVE
Enables case-insensitive matching. On its own, this flag applies case-insensitive matching only to characters in the US-ASCII charset. For Unicode-aware case-insensitive matching, specify the UNICODE_CASE flag together with this flag.
Pattern.COMMENTS
Permits whitespace and comments in the pattern. When this flag is set, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.
Pattern.DOTALL
Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default, this expression does not match line terminators.
Pattern.LITERAL
Enables literal parsing of the pattern. When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning.
Pattern.MULTILINE
Enables multiline mode. In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
Pattern.UNICODE_CASE
Enables Unicode-aware case folding. When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.
Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of predefined character classes and POSIX character classes.
Pattern.UNIX_LINES
Enables Unix lines mode. When this flag is set, only the \n character is recognized as a line terminator.
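
The effect of several of these flags can be seen by matching the same expression with and without the flag. The sketch below does this for DOTALL, MULTILINE, LITERAL, COMMENTS, UNICODE_CASE, and UNIX_LINES. The class name PatternFlagsDemo, the helpers matches() and finds(), and all sample patterns and inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Hypothetical helper: true if any part of the input matches the regex under the given flags.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // DOTALL: the expression . also matches a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));               // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));  // true

        // MULTILINE: ^ and $ also match at line boundaries inside the input.
        System.out.println(finds("^b$", 0, "a\nb\nc"));                  // false
        System.out.println(finds("^b$", Pattern.MULTILINE, "a\nb\nc"));  // true

        // LITERAL: metacharacters lose their special meaning.
        System.out.println(matches("a.c", Pattern.LITERAL, "abc"));  // false
        System.out.println(matches("a.c", Pattern.LITERAL, "a.c"));  // true

        // COMMENTS: whitespace and the trailing # comment in the pattern are ignored.
        System.out.println(matches("\\d+   # one or more digits", Pattern.COMMENTS, "123")); // true

        // UNICODE_CASE: needed with CASE_INSENSITIVE for non-ASCII letters.
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matches("\u00E9", ci, "\u00C9"));                         // false
        System.out.println(matches("\u00E9", ci | Pattern.UNICODE_CASE, "\u00C9"));  // true

        // UNIX_LINES: a carriage return is no longer treated as a line terminator.
        System.out.println(matches("a.b", 0, "a\rb"));                   // false
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb"));  // true
    }
}
```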

The following code compiles a regular expression with the CASE_INSENSITIVE and DOTALL flags set.

Matching will be case-insensitive for the US-ASCII charset, and the expression . will also match a line terminator.

// Prepare a regular expression
String regex = "[a-z]@.";

// Compile the regular expression into a Pattern object setting the
// CASE_INSENSITIVE and DOTALL flags
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
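
To see what the two flags change, the sketch below runs the same regular expression against an input that needs both of them to match: an uppercase letter before the @ and a line terminator after it. The class name CaseInsensitiveDotallDemo, the helper finds(), and the sample input are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDotallDemo {
    // Hypothetical helper: compiles "[a-z]@." with the given flags and
    // reports whether any part of the input matches.
    static boolean finds(int flags, String input) {
        return Pattern.compile("[a-z]@.", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Uppercase letter before @, line terminator after it.
        String input = "A@\n";

        System.out.println(finds(0, input));                          // false
        System.out.println(finds(Pattern.CASE_INSENSITIVE, input));   // false (. rejects the line terminator)
        System.out.println(finds(Pattern.DOTALL, input));             // false (A is not in [a-z])
        System.out.println(finds(Pattern.CASE_INSENSITIVE | Pattern.DOTALL, input)); // true
    }
}
```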