Appendices

Java Regex

17.1 Regex Syntax Quick Reference

Syntax Element	Description	Example
Literals	Match exact characters	`a`, `Z`, `9`
Metacharacters	Special characters with regex meaning	. ^ $ * + ? { } [ ] \ | ( )
Escaping	Use backslash `\` to treat metacharacters as literals	`\.` matches a dot `.`

Quantifiers

Syntax	Description	Example
`*`	0 or more	`a*` matches the empty string, `a`, `aa`
`+`	1 or more	`a+` matches `a`, `aa`
`?`	0 or 1 (optional)	`a?` matches the empty string or `a`
`{n}`	Exactly n times	`a{3}` matches `aaa`
`{n,}`	At least n times	`a{2,}` matches `aa`, `aaa`
`{n,m}`	Between n and m times	`a{2,4}` matches `aa`, `aaa`, `aaaa`

Groups and Capturing

Syntax	Description	Example
`( ... )`	Capturing group	`(abc)` matches `abc`
`(?: ... )`	Non-capturing group	`(?:abc)` groups without capturing
`(?<name> ... )`	Named capturing group (Java 7+)	`(?<year>\d{4})`
`\n`	Backreference to the nth capturing group (n is a group number; a literal `\n` is a newline)	`(\w)\1` matches a doubled character such as `ll`
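
The group syntax above can be tried out with a short sketch. The class and method names (`GroupDemo`, `yearAndMonth`, `hasDoubledWord`) are made up for illustration; only `Pattern`, `Matcher`, `group(String)`, and `group(int)` come from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 is named "year"; group 2 is an ordinary numbered group.
    private static final Pattern YEAR_MONTH =
            Pattern.compile("(?<year>\\d{4})-(\\d{2})");

    // \1 must repeat exactly what group 1 captured, so this finds doubled words.
    private static final Pattern DOUBLED_WORD =
            Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns "year/month" for the first match, or null if there is none.
    static String yearAndMonth(String input) {
        Matcher m = YEAR_MONTH.matcher(input);
        return m.find() ? m.group("year") + "/" + m.group(2) : null;
    }

    static boolean hasDoubledWord(String input) {
        return DOUBLED_WORD.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(yearAndMonth("Released 2024-05"));      // 2024/05
        System.out.println(hasDoubledWord("it was the the best")); // true
    }
}
```

A named group still counts in the numbering, which is why the month here is group 2 rather than group 1.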

Assertions

Syntax	Description	Example
`^`	Start of input (or line in multiline mode)	`^abc` matches `abc` at start
`$`	End of input (or line in multiline mode)	`xyz$` matches `xyz` at end
`\b`	Word boundary	`\bword\b` matches `word` as whole word
`\B`	Non-word boundary	`\Bend\B` matches `end` within a word
`(?= ... )`	Positive lookahead	`a(?=b)` matches `a` if followed by `b`
`(?! ... )`	Negative lookahead	`a(?!b)` matches `a` if not followed by `b`
`(?<= ... )`	Positive lookbehind	`(?<=a)b` matches `b` if preceded by `a`
`(?<! ... )`	Negative lookbehind	`(?<!a)b` matches `b` if not preceded by `a`
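
As a rough illustration of how lookarounds check context without consuming it, the sketch below uses a lookbehind and a negative lookahead. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Digits that directly follow a "$"; the "$" itself is not part of the match.
    private static final Pattern AFTER_DOLLAR = Pattern.compile("(?<=\\$)\\d+");

    // An "a" that is not immediately followed by "b".
    private static final Pattern A_NOT_B = Pattern.compile("a(?!b)");

    static List<String> dollarAmounts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = AFTER_DOLLAR.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    static int countANotFollowedByB(String text) {
        int count = 0;
        Matcher m = A_NOT_B.matcher(text);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(dollarAmounts("Cost $15, was 20, now $7")); // [15, 7]
        System.out.println(countANotFollowedByB("ab ac a"));           // 2
    }
}
```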

Character Classes

Syntax	Description	Example
`[abc]`	Any character a, b, or c	`[aeiou]` vowels
`[a-z]`	Any character in the range a to z	`[0-9]` digits
`[^abc]`	Negated class, any char except a, b, or c	`[^0-9]` non-digit
`\d`	Digit (equivalent to `[0-9]`)	`\d{3}` matches three digits
`\D`	Non-digit	`\D+` matches non-digit chars
`\w`	Word character (letters, digits, underscore)	`\w+` matches words
`\W`	Non-word character	`\W+` matches punctuation, spaces
`\s`	Whitespace (spaces, tabs, line breaks)	`\s*` matches optional spaces
`\S`	Non-whitespace	`\S+` matches non-space chars
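`\p{Lower}`	Lowercase letter; ASCII `[a-z]` only unless `UNICODE_CHARACTER_CLASS` is set (use `\p{Ll}` for any Unicode lowercase letter)	`\p{Lower}` matches `a`; `\p{Ll}` also matches `β`
`\p{Upper}`	Uppercase letter; ASCII `[A-Z]` only unless `UNICODE_CHARACTER_CLASS` is set (use `\p{Lu}` for any Unicode uppercase letter)	`\p{Upper}` matches `A`; `\p{Lu}` also matches `Γ`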
`\p{IsGreek}`	Unicode Greek script characters	Matches Greek letters
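
In Java, the POSIX-style classes such as `\p{Lower}` cover only US-ASCII by default, while Unicode categories such as `\p{Ll}` and scripts such as `\p{IsGreek}` cover non-ASCII text. The following sketch shows the difference; the class and method names are made up for illustration, and the Greek letters are written as Unicode escapes to avoid source-encoding issues.

```java
class CharClassDemo {
    // POSIX class: [a-z] only, unless UNICODE_CHARACTER_CLASS is enabled.
    static boolean isAsciiLower(String s) {
        return s.matches("\\p{Lower}+");
    }

    // Unicode general category Ll: lowercase letters in any script.
    static boolean isUnicodeLower(String s) {
        return s.matches("\\p{Ll}+");
    }

    // Unicode script: characters belonging to the Greek script.
    static boolean isGreek(String s) {
        return s.matches("\\p{IsGreek}+");
    }

    public static void main(String[] args) {
        String beta = "\u03b2"; // Greek small letter beta
        System.out.println(isAsciiLower("abc"));  // true
        System.out.println(isAsciiLower(beta));   // false
        System.out.println(isUnicodeLower(beta)); // true
        System.out.println(isGreek(beta));        // true
    }
}
```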

Flags (Java Pattern Flags)

Flag	Meaning	Usage example
`(?i)`	Case-insensitive matching	`(?i)abc` matches `ABC`
`(?m)`	Multiline mode (`^` and `$` match line start/end)	`(?m)^abc`
`(?s)`	Dotall mode (dot `.` matches line breaks)	`(?s).+`
`(?x)`	Comments mode: whitespace in the pattern is ignored and `#` starts a comment	`(?x) a \s+ b`
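
The sketch below exercises the inline flags from the table and, for comparison, the equivalent `Pattern` constants. `FlagDemo` and `count` are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Counts how many times the pattern is found in the input.
    static int count(Pattern p, String input) {
        int n = 0;
        Matcher m = p.matcher(input);
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "abc\nABC\nabc";
        System.out.println(count(Pattern.compile("^abc"), text));      // 1
        System.out.println(count(Pattern.compile("(?m)^abc"), text));  // 2
        System.out.println(count(Pattern.compile("(?mi)^abc"), text)); // 3

        // The same two flags expressed as constants instead of inline syntax.
        Pattern viaConstants =
                Pattern.compile("^abc", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        System.out.println(count(viaConstants, text));                 // 3

        System.out.println(Pattern.matches("(?s)a.+c", "a\nb\nc"));    // true
        System.out.println(Pattern.matches("(?x) a \\s+ b  # spaces ignored", "a   b")); // true
    }
}
```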

This cheat sheet summarizes the core regex elements used in Java pattern matching. For complex patterns, build up from these elements in small, tested steps so the result stays readable and efficient.

17.2 Java Regex API Summary

Java’s regex functionality is primarily provided by two core classes in the java.util.regex package: Pattern and Matcher. Here’s a concise overview of these classes and their most important methods to help you work efficiently with regex in Java.

Pattern

Represents a compiled regular expression.

Created using the static factory method:

Pattern pattern = Pattern.compile(regex);   // regex is a String containing the expression

Supports optional flags, passed as a second argument to Pattern.compile(regex, flags) and combined with |, to modify matching behavior:
- Pattern.CASE_INSENSITIVE: Case-insensitive matching (US-ASCII only by default).
- Pattern.MULTILINE: Changes ^ and $ to match at the start/end of each line.
- Pattern.DOTALL: Makes . match line terminators.
- Pattern.UNICODE_CASE: Used together with CASE_INSENSITIVE, enables Unicode-aware case folding.
Common methods:
- matcher(CharSequence input): Creates a Matcher to apply the pattern to the input.
- split(CharSequence input): Splits the input around matches.
- pattern(): Returns the regex string.
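
A minimal sketch of the Pattern methods and flags listed above. The class, field, and method names are hypothetical, and the French word is only a convenient non-ASCII test case.

```java
import java.util.regex.Pattern;

class PatternApiDemo {
    // Compiled once: a comma with optional surrounding whitespace.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Two flags combined with |; UNICODE_CASE extends case-insensitivity beyond ASCII.
    private static final Pattern ECOLE =
            Pattern.compile("\u00e9cole", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static String[] fields(String line) {
        return COMMA.split(line);
    }

    static String commaRegex() {
        return COMMA.pattern();
    }

    static boolean isEcole(String word) {
        return ECOLE.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", fields("a , b,c"))); // a|b|c
        System.out.println(commaRegex());                        // \s*,\s*
        System.out.println(isEcole("\u00c9COLE"));               // true
    }
}
```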

Matcher

Applies a compiled Pattern to a specific input sequence.
Created via Pattern.matcher() method.
Core methods:
- find(): Searches for the next subsequence matching the pattern.
- matches(): Attempts to match the entire input against the pattern.
- lookingAt(): Attempts to match a prefix of the input, starting at the beginning; unlike matches(), the whole input need not match.
- group(): Returns the entire matched substring.
- group(int group): Returns the text captured by the given group (group 0 is the whole match).
- start() and end(): Return the index of the first character of the last match and the index just past its last character.
- replaceAll(String replacement): Replaces all matches with the replacement string, in which $ and \ have special meaning.
- replaceFirst(String replacement): Replaces the first match.
- reset(): Resets the matcher state for reuse with the same input; reset(CharSequence input) switches to a different input.
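
The Matcher methods above can be seen side by side in the following sketch. All class and method names are hypothetical helpers written for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherApiDemo {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // find() loop: reports each match as text@start-end (end is exclusive).
    static List<String> locate(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    // reset(CharSequence) lets one Matcher be reused across several inputs.
    static List<String> firstNumbers(String... inputs) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher("");
        for (String input : inputs) {
            m.reset(input);
            if (m.find()) {
                out.add(m.group());
            }
        }
        return out;
    }

    static boolean whole(String input)  { return NUMBER.matcher(input).matches(); }
    static boolean prefix(String input) { return NUMBER.matcher(input).lookingAt(); }
    static String mask(String input)    { return NUMBER.matcher(input).replaceAll("#"); }

    public static void main(String[] args) {
        System.out.println(locate("a12b345"));             // [12@1-3, 345@4-7]
        System.out.println(firstNumbers("a1", "b22 c3"));  // [1, 22]
        System.out.println(whole("123abc"));               // false
        System.out.println(prefix("123abc"));              // true
        System.out.println(mask("a12b345"));               // a#b#
    }
}
```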

Common Usage Patterns

Simple matching:

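import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\d+");          // one or more digits
Matcher m = p.matcher("Order 1234");
if (m.find()) {                               // true: "1234" is found
    System.out.println("Found number: " + m.group());   // prints "Found number: 1234"
}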

Replacing all occurrences:

String cleaned = input.replaceAll("\\s+", " ");   // input is any String; each run of whitespace becomes one space

Splitting with regex:
```
String[] parts = pattern.split(input);   // pattern is a compiled Pattern; input is the text to split
```

Compiling a pattern once and reusing it, rather than recompiling it on every call, avoids repeated compilation cost and keeps the regex in one place. Convenience methods such as String.replaceAll compile their regex on each call. Flags tailor matching to your needs, and the Matcher methods cover extraction, validation, and transformation tasks.
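
One way to apply the compile-once advice is to hold the pattern in a static final field, as in this sketch. `WhitespaceNormalizer` and `normalize` are hypothetical names.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Compiled a single time when the class is loaded, then shared by every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collapses each run of whitespace into one space.
    static String normalize(String input) {
        return WHITESPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Order   1234\tshipped")); // Order 1234 shipped
    }
}
```

A compiled Pattern is immutable and safe to share between threads; each call creates its own Matcher, which is not thread-safe.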

17.3 Common Regex Patterns Library

Here is a curated collection of frequently used regex patterns to help you quickly handle common validation and extraction tasks in Java. Each pattern includes a brief explanation and usage notes.

Email Address

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Matches most standard email formats.
Allows alphanumeric usernames with dots, underscores, and other common symbols.
Domain must include at least one dot and 2+ letter TLD.
Note: Not fully RFC compliant but practical for typical validation.
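
A minimal sketch of wrapping the email pattern in a validator. The class and method names are hypothetical; note the doubled backslash required inside a Java string literal.

```java
import java.util.regex.Pattern;

class EmailValidator {
    private static final Pattern EMAIL =
            Pattern.compile("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$");

    // Returns true when the whole string has the shape local@domain.tld.
    static boolean isValid(String candidate) {
        return candidate != null && EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("user.name+tag@example.co.uk")); // true
        System.out.println(isValid("user@localhost"));              // false
    }
}
```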

Phone Number (US Format)

^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$

Matches phone numbers with optional parentheses for area code.
Supports separators: dash -, dot ., or space.
Example matches: (123) 456-7890, 123-456-7890, 123.456.7890.
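
If the digits are needed afterwards, one option is a variation of the pattern with three capturing groups added, as sketched below. The added groups, the class name, and the output format `123-456-7890` are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneNormalizer {
    // Same shape as the pattern above, with each digit block captured.
    private static final Pattern US_PHONE =
            Pattern.compile("^\\(?(\\d{3})\\)?[-.\\s]?(\\d{3})[-.\\s]?(\\d{4})$");

    // Returns the number as 123-456-7890, or null when the input does not match.
    static String normalize(String input) {
        Matcher m = US_PHONE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "-" + m.group(2) + "-" + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(normalize("(123) 456-7890")); // 123-456-7890
        System.out.println(normalize("123.456.7890"));   // 123-456-7890
    }
}
```

Because each parenthesis is optional on its own, the pattern also accepts unbalanced input such as `(123 456-7890`; tighten it if that matters.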

IPv4 Address

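\b(?:(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)\b

Matches IPv4 addresses in dotted-decimal form: three octets each followed by a dot, then a final octet.
Ensures each octet is within the range 0-255.
Word boundaries keep the match from starting or ending in the middle of a longer run of digits or letters; use ^ and $ (or Matcher.matches()) to validate a whole string.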
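
The sketch below uses a form of the IPv4 pattern that repeats "octet plus dot" three times and then requires a final octet, built from a shared octet fragment. The class, constant, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ipv4Finder {
    // One octet in the range 0-255, as a non-capturing group.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)";

    // Three "octet." repetitions followed by a final octet.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(?:" + OCTET + "\\.){3}" + OCTET + "\\b");

    // Extracts every address found inside free text.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = IPV4.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Validates that the whole string is a single address.
    static boolean isIpv4(String candidate) {
        return IPV4.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(findAll("ping 192.168.0.1 then 10.0.0.255 ok")); // [192.168.0.1, 10.0.0.255]
        System.out.println(isIpv4("256.1.1.1"));                            // false
    }
}
```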

Date (YYYY-MM-DD)

^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

Matches ISO-style dates with four-digit year, month (01-12), and day (01-31).
Does not check that the day exists in the given month: it accepts 2023-02-31, and February 29 in non-leap years.
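
To also reject impossible calendar dates, one approach is to follow the regex shape check with `java.time.LocalDate.parse`, which resolves ISO dates strictly. This is a sketch under that assumption; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

class IsoDateValidator {
    private static final Pattern SHAPE =
            Pattern.compile("^\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$");

    // Regex only: checks the YYYY-MM-DD shape and the month/day ranges.
    static boolean hasDateShape(String s) {
        return SHAPE.matcher(s).matches();
    }

    // Regex plus calendar check: the date must actually exist.
    static boolean isRealDate(String s) {
        if (!hasDateShape(s)) {
            return false;
        }
        try {
            LocalDate.parse(s); // throws for dates such as February 30
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasDateShape("2023-02-30")); // true
        System.out.println(isRealDate("2023-02-30"));   // false
        System.out.println(isRealDate("2024-02-29"));   // true
    }
}
```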

URL (Basic)

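^(https?://)?([\w.-]+)\.([a-z]{2,6})([/\w .-]*)$

Matches URLs starting with optional http or https.
Captures domain name with subdomains and TLD.
Matches an optional path in a single group; the path class is not wrapped in a second quantifier, since a nested form such as ([/\w .-]*)* can backtrack catastrophically on input that does not match.
Simplified pattern: it does not cover all valid URLs (for example, ports, query strings, fragments, or uppercase TLDs unless matched case-insensitively).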
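
A short sketch of pulling the host and path out of a URL with capturing groups. It uses a variant of the URL pattern whose path class has a single quantifier, which avoids nested repetition. The class and method names, and the `host|path` return format, are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlParts {
    // Groups: 1 = scheme (optional), 2 = host without TLD, 3 = TLD, 4 = path.
    private static final Pattern URL =
            Pattern.compile("^(https?://)?([\\w.-]+)\\.([a-z]{2,6})([/\\w .-]*)$");

    // Returns "host|path", or null when the input does not match.
    static String hostAndPath(String url) {
        Matcher m = URL.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return m.group(2) + "." + m.group(3) + "|" + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(hostAndPath("https://www.example.com/docs/index.html"));
        // www.example.com|/docs/index.html
        System.out.println(hostAndPath("example.org")); // example.org|
    }
}
```

For real URL handling, `java.net.URI` is usually a more reliable parser than a regex.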

Password (At Least 8 chars, 1 Upper, 1 Lower, 1 Digit, 1 Special)

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

Uses positive lookaheads to enforce character class requirements.
Ensures minimum length of 8.
Accepts only ASCII letters, digits, and the listed special characters (@ $ ! % * ? &); any other character, including a space, causes the match to fail.
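
The lookahead-based password pattern can be wrapped as shown in this sketch. `PasswordPolicy` and `isAcceptable` are hypothetical names.

```java
import java.util.regex.Pattern;

class PasswordPolicy {
    // Four lookaheads each demand one kind of character; the final class
    // limits which characters may appear at all and enforces the length.
    private static final Pattern POLICY = Pattern.compile(
            "^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&])[A-Za-z\\d@$!%*?&]{8,}$");

    static boolean isAcceptable(String password) {
        return password != null && POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("Passw0rd!"));  // true
        System.out.println(isAcceptable("password1!")); // false, no uppercase letter
        System.out.println(isAcceptable("Passw0rd#"));  // false, # is not in the allowed set
    }
}
```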

Usage Notes

Customize patterns for specific needs, such as international phone numbers or URL components.
Test thoroughly with edge cases to avoid false positives or negatives.
Combine with Java regex flags for case-insensitivity or multiline matching if needed.

This library provides solid starting points for many common regex tasks in Java projects.

17.4 Glossary of Terms

Quantifiers Symbols that specify how many times a pattern should repeat. Examples: * (0 or more), + (1 or more), ? (0 or 1), {n,m} (between n and m times).

Capturing Groups Parentheses () that group part of a regex and save the matched text for reuse or extraction. For example, (abc) matches "abc" and stores it as group 1.

Named Capturing Groups Groups given names for clearer access, like (?<name>pattern), accessed by name instead of number.

Lookahead Assertions Zero-width checks that assert what follows the current position without consuming characters.

Positive lookahead (?=...) requires the pattern to follow.
Negative lookahead (?!...) requires the pattern not to follow.

Lookbehind Assertions Similar to lookahead but check the text before the current position.

Positive lookbehind (?<=...) asserts what precedes.
Negative lookbehind (?<!...) asserts what does not precede.

Backtracking The process where the regex engine revisits previous matches to try alternative paths when a match fails. Excessive backtracking can cause performance issues.

Greedy vs. Reluctant Matching

Greedy quantifiers (default) try to match as much text as possible.
Reluctant (lazy) quantifiers (*?, +?, ??) match as little as possible.
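
The difference shows up clearly when matching tags. In this sketch, `GreedyVsReluctant` and `firstMatch` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));  // <b>bold</b>  (greedy)
        System.out.println(firstMatch("<.+?>", html)); // <b>          (reluctant)
    }
}
```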

Possessive Quantifiers Quantifiers like *+ or ++ that match as much as possible without backtracking, improving performance but potentially missing some matches.

Atomic Groups Subpatterns marked (?>...) that prevent backtracking inside the group, optimizing complex regexes.
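
A small sketch of how possessive quantifiers and atomic groups refuse to give characters back, compared with an ordinary greedy quantifier. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NoBacktrackDemo {
    // True when the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: a* takes all three letters, then gives one back for the final a.
        System.out.println(full("a*a", "aaa"));       // true
        // Possessive: a*+ keeps all three letters, so the final a has nothing left.
        System.out.println(full("a*+a", "aaa"));      // false
        // Atomic group: same effect as the possessive form.
        System.out.println(full("(?>a*)a", "aaa"));   // false
        // Possessive matching still succeeds when no give-back is needed.
        System.out.println(full("\\d++[a-z]", "123x")); // true
    }
}
```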

Unicode Categories Predefined character classes in regex that match Unicode character types, e.g., \p{L} for any letter, \p{Nd} for decimal digits, supporting international text.

Word Boundaries Zero-width assertions \b that match positions where a word character (\w) is adjacent to a non-word character (\W) or to the start or end of the input, useful for matching whole words.
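
For instance, a whole-word replacement might look like the sketch below. `WholeWordReplace` and `catsToDogs` are hypothetical names.

```java
import java.util.regex.Pattern;

class WholeWordReplace {
    // \b on both sides means "cat" must stand alone as a word.
    private static final Pattern CAT = Pattern.compile("\\bcat\\b");

    static String catsToDogs(String text) {
        return CAT.matcher(text).replaceAll("dog");
    }

    public static void main(String[] args) {
        System.out.println(catsToDogs("cat concat cats cat.")); // dog concat cats dog.
    }
}
```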

Flags (Modifiers) Settings that change regex behavior, such as CASE_INSENSITIVE or MULTILINE, usually passed when compiling patterns.

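Escape Sequences A backslash \ followed by a character. It either gives a letter a special meaning, e.g., \d for digits and \s for whitespace, or makes a metacharacter literal, e.g., \. for a dot. In a Java string literal each backslash must itself be doubled, so \d is written "\\d".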
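
The effect of escaping can be checked with a short sketch. It also uses `Pattern.quote`, a standard-library method that escapes a whole string for literal matching; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        int n = 0;
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(".", "a.b"));                    // 3, dot matches any character
        System.out.println(count("\\.", "a.b"));                  // 1, escaped dot is literal
        System.out.println(count("1+1", "1+1=2"));                // 0, + acts as a quantifier
        System.out.println(count(Pattern.quote("1+1"), "1+1=2")); // 1, whole string is literal
    }
}
```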

This glossary covers foundational terms essential for understanding and writing effective Java regex patterns.