Index

Advanced Quantifiers and Lazy Matching

Java Regex

6.1 Greedy vs. reluctant quantifiers

In Java regular expressions, quantifiers define how many times a pattern element may repeat. By default, these quantifiers are greedy, meaning they match as much text as possible. However, when this behavior causes overmatching, reluctant quantifiers offer a solution by matching as little text as necessary.

Greedy Quantifiers (Default)

A greedy quantifier tries to consume as many characters as possible while still allowing the overall pattern to match. Common greedy quantifiers include *, +, ?, {n}, {n,}, and {n,m}.

For example:

String input = "<b>Hello</b><b>World</b>";
String regex = "<b>.*</b>";

This greedy pattern will match:

<b>Hello</b><b>World</b>

because .* consumes everything between the first <b> and the last </b>, leading to overmatching.

Reluctant Quantifiers (Lazy)

Reluctant quantifiers do the opposite of greedy ones: they match as little as possible, expanding only when needed to satisfy the rest of the pattern. You can make a quantifier reluctant by appending a ? to it, giving *?, +?, ??, and {n,m}?.

Using the same input:

String regex = "<b>.*?</b>";

Now successive calls to find() return two separate matches:

<b>Hello</b>
<b>World</b>

This happens because .*? matches the smallest possible substring between <b> and </b>, avoiding overmatching.
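The difference can be checked directly. The following is a minimal sketch, not part of the original text: the class name GreedyVsReluctantDemo and the helper findAll are hypothetical names chosen for illustration, and the helper simply collects whatever find() reports for a given pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {

    // Hypothetical helper: collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "<b>Hello</b><b>World</b>";
        // Greedy: a single match spanning both elements.
        System.out.println(findAll("<b>.*</b>", input));
        // Reluctant: one match per element.
        System.out.println(findAll("<b>.*?</b>", input));
    }
}
```

Returning a list rather than printing inside the loop makes the number of matches easy to compare between the two patterns.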

Visual Comparison

| Pattern | Match result |
|---|---|
| <b>.*</b> | <b>Hello</b><b>World</b> |
| <b>.*?</b> | <b>Hello</b> and <b>World</b> |

When to Use Reluctant Quantifiers

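Reluctant quantifiers are useful when a closing delimiter occurs more than once in the input and each delimited segment should be matched on its own, as with the two <b> elements above.

Summary

Greedy quantifiers match as much as possible and reluctant quantifiers match as little as possible, in both cases only as far as the overall pattern can still match. Understanding this distinction helps you write more precise regex patterns and avoid subtle bugs in text parsing.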

6.2 Possessive quantifiers

Possessive quantifiers are a more advanced type of quantifier in regular expressions that instruct the regex engine to match as much as possible without allowing any backtracking. This behavior makes them useful in performance-critical scenarios but can also lead to unexpected failed matches if not used carefully.

What Are Possessive Quantifiers?

Possessive quantifiers are created by appending a + to the end of a standard greedy quantifier, giving *+, ++, ?+, and {n,m}+.

Unlike greedy quantifiers (which backtrack if a later part of the pattern fails), possessive quantifiers never backtrack. Once they consume characters, they keep them—no matter what.

Why Use Them?

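Possessive quantifiers can improve performance by cutting off backtracking that could never lead to a match, and they let a failing match fail sooner.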

Example: Possessive Behavior

String input = "aaab";
String greedy = "a+ab";       // Matches
String possessive = "a++ab";  // Fails

In the greedy version, a+ first matches "aaa", which leaves the literal a facing b. The engine then backtracks: a+ releases one a, so a+ matches "aa" and ab matches the rest.

In the possessive version, a++ consumes all three a characters and refuses to give any back, so the literal a has nothing to match and the whole pattern fails. Possessiveness applies only to the quantifier that carries the extra +: a++.*b still matches "aaab", because the greedy .* can give back the b it consumed.
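A quick way to confirm how a greedy and a possessive quantifier differ is to run both against the same input. The sketch below uses the patterns a+ab and a++ab on "aaab"; the class name PossessiveBacktrackingDemo and the helper matchesAll are hypothetical names introduced here for illustration.

```java
import java.util.regex.Pattern;

class PossessiveBacktrackingDemo {

    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "aaab";
        // Greedy: a+ first takes "aaa", then gives one 'a' back so the literal "ab" can match.
        System.out.println(matchesAll("a+ab", input));   // true
        // Possessive: a++ keeps all three 'a' characters, so the literal 'a' has nothing left.
        System.out.println(matchesAll("a++ab", input));  // false
    }
}
```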

Practical Use Case

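Consider parsing large strings or logs with patterns that might otherwise cause performance issues:

String regex = "[^@\\s]++@example\\.com";

Because the class [^@\s] can never match the @ that follows it, giving characters back could not produce a match, so ++ removes useless backtracking without changing the result. Note that .*+@example\.com would never match at all: .*+ consumes the rest of the line, including the suffix, and keeps it.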
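As an illustration of a possessive quantifier that is safe to use, the sketch below applies ++ to a character class that cannot match the character following it. The pattern [^@\s]++@example\.com, the class name PossessiveSuffixDemo, the helper findAddresses, and the sample text are all assumptions made for this example; the definition of a "local part" is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveSuffixDemo {

    // Assumption: the local part is any run of characters other than '@' and whitespace.
    // That class can never match '@', so giving characters back could not help;
    // the possessive ++ only rules out that pointless backtracking.
    private static final Pattern ADDRESS = Pattern.compile("[^@\\s]++@example\\.com");

    // Hypothetical helper: returns every example.com address found in the text.
    static List<String> findAddresses(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = ADDRESS.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.com or bob@example.com, not carol@other.org.";
        System.out.println(findAddresses(text)); // [alice@example.com, bob@example.com]
    }
}
```

The design point is that possessiveness is only safe when the repeated element and the text that must follow it cannot overlap.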

When to Avoid

Possessive quantifiers are powerful, but can easily cause false negatives (no match found when one should be). Avoid using them when the pattern depends on backtracking to succeed.

Summary

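Possessive quantifiers match as much as possible, never give back what they have matched, can speed up matching by cutting off backtracking, and can make a pattern fail where a greedy quantifier would succeed.

Use them thoughtfully, especially when optimizing complex or repetitive patterns.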

6.3 Examples demonstrating differences

Understanding the behavior of greedy, reluctant, and possessive quantifiers is crucial for building correct and efficient regular expressions. This section demonstrates how each quantifier behaves differently—even when used with the same pattern and input.

Test Scenario

We’ll use the following input string:

String input = "<tag>first</tag><tag>second</tag>";

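Our goal is to match each <tag>...</tag> block. The snippets below assume that java.util.regex.Pattern and java.util.regex.Matcher are imported and that the code runs inside a method.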

Greedy Quantifier

Pattern pattern = Pattern.compile("<tag>.*</tag>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println("Greedy match: " + matcher.group());
}

Output:

Greedy match: <tag>first</tag><tag>second</tag>

Explanation: The greedy .* consumes as much as possible while still allowing the pattern to match. It starts at the first <tag> and captures everything until the last </tag>. This is overmatching.

Reluctant Quantifier

Pattern pattern = Pattern.compile("<tag>.*?</tag>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println("Reluctant match: " + matcher.group());
}

Output:

Reluctant match: <tag>first</tag>
Reluctant match: <tag>second</tag>

Explanation: The .*? matches as little as possible to satisfy the full pattern. It captures each <tag>...</tag> block individually. This is the desired behavior when extracting multiple elements.

Possessive Quantifier

Pattern pattern = Pattern.compile("<tag>.*+</tag>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println("Possessive match: " + matcher.group());
}

Output:

(No match: find() never returns true, so nothing is printed)

Explanation: The possessive .*+ consumes every character after the first <tag>, through the end of the input, and refuses to give any back. Nothing is left for </tag> to match, and the same happens when the engine retries from the second <tag>. This causes the pattern to fail completely.
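A possessive quantifier can still extract each block if the element it repeats cannot run past the closing tag. The sketch below assumes that the tag content never contains a < character, which lets [^<]*+ stop exactly where </tag> begins; the class name PossessiveTagDemo and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveTagDemo {

    // Hypothetical helper: collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "<tag>first</tag><tag>second</tag>";
        // .*+ swallows the closing tags and keeps them: no matches.
        System.out.println(findAll("<tag>.*+</tag>", input));     // []
        // [^<]*+ stops at the first '<', so the closing tag is still available.
        // Assumption: the content between the tags contains no '<'.
        System.out.println(findAll("<tag>[^<]*+</tag>", input));  // [<tag>first</tag>, <tag>second</tag>]
    }
}
```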

Performance Consideration

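In large inputs, possessive quantifiers can improve performance by preventing excessive backtracking, provided the repeated element cannot match the text that must follow it. For example:

Pattern pattern = Pattern.compile("[^@\\s]++@example\\.com");

When matching email addresses in large text bodies, [^@\s]++ never retries shorter runs once it has consumed a sequence of non-@ characters. By contrast, .*+@example\.com can never match, because .*+ swallows the suffix as well.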

Summary

| Quantifier | Behavior | Use when |
|---|---|---|
| .* (greedy) | Matches as much as possible | General use, but can overmatch |
| .*? (reluctant) | Matches as little as needed | Precise extraction of segments |
| .*+ (possessive) | Matches as much as possible, no backtracking | Preventing backtracking for performance |

Choosing the right quantifier depends on your intent: whether you want all data, minimal matches, or performance optimization without flexibility.

6.4 Example: Extracting HTML tags without overmatching

One common challenge in text processing is extracting repeated structures like HTML tags. If you use a greedy quantifier, your pattern may unintentionally match everything from the first opening tag to the last closing tag. Reluctant quantifiers can solve this problem by matching as little as possible—just enough to satisfy the pattern.

Problem Overview

Suppose you have the following HTML fragment:

<div>Hello</div><div>World</div>

You want to extract each <div>...</div> pair individually.

The Greedy Problem

Let’s see what happens if we use a greedy quantifier (.*):

Pattern pattern = Pattern.compile("<div>.*</div>");
Matcher matcher = pattern.matcher("<div>Hello</div><div>World</div>");
while (matcher.find()) {
    System.out.println("Match: " + matcher.group());
}

Output:

Match: <div>Hello</div><div>World</div>

Explanation: The .* greedily matches everything between the first <div> and the last </div>, resulting in a single match that swallows both elements. This is known as overmatching.

Solution with Reluctant Quantifier

We can fix this with a reluctant quantifier (.*?):

import java.util.regex.*;

public class ExtractDivTags {
    public static void main(String[] args) {
        String input = "<div>Hello</div><div>World</div>";
        Pattern pattern = Pattern.compile("<div>.*?</div>");
        Matcher matcher = pattern.matcher(input);

        while (matcher.find()) {
            System.out.println("Extracted: " + matcher.group());
        }
    }
}

Output:

Extracted: <div>Hello</div>
Extracted: <div>World</div>

Explanation: Here, .*? matches the smallest possible string that still fits the <div>...</div> pattern. It matches up to the nearest closing tag, giving us the correct, separate results.
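Real HTML often spans several lines, and by default . does not match line terminators. The sketch below is one way to handle that, assuming that Pattern.DOTALL is acceptable for the input and that only the text between the tags is wanted; the class name ExtractDivText, the helper innerTexts, and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractDivText {

    // DOTALL lets '.' match line terminators too; group 1 is the text between the tags.
    private static final Pattern DIV = Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL);

    // Hypothetical helper: returns the inner text of each <div>...</div> pair.
    static List<String> innerTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = DIV.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "<div>Hello</div>\n<div>multi\nline</div>";
        // Prints the two inner texts; the second one contains a line break.
        System.out.println(innerTexts(input));
    }
}
```

Using a capturing group avoids stripping the tags from each match afterwards.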

What About Possessive Quantifiers?

Now let’s try a possessive quantifier (.*+):

Pattern pattern = Pattern.compile("<div>.*+</div>");

This will fail completely, producing no matches. The possessive quantifier grabs everything after the first <div> and won’t backtrack, so the closing </div> cannot be matched. Possessive quantifiers are useful for performance but unsuitable when backtracking is required for correctness.

Summary

| Quantifier type | Result |
|---|---|
| Greedy (.*) | Overmatches across multiple tags |
| Reluctant (.*?) | Matches each tag pair precisely |
| Possessive (.*+) | Fails to match due to no backtracking |

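Use reluctant quantifiers when extracting repeated, non-nested structures such as the sibling <div> elements above. They help prevent overmatching and ensure your pattern behaves as intended. With nested tags, a reluctant pattern stops at the first closing tag, so nested or otherwise irregular HTML is better handled by an HTML parser.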
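For nested tags specifically, it is worth checking what the reluctant pattern actually returns. The following minimal sketch uses a hypothetical nested input; the class name NestedDivDemo and the helper findAll are illustrative names, not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedDivDemo {

    // Hypothetical helper: collects every match that find() reports, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String nested = "<div>outer <div>inner</div> tail</div>";
        // The reluctant pattern pairs the outer opening tag with the first closing tag,
        // so the single match is "<div>outer <div>inner</div>", not a balanced element.
        System.out.println(findAll("<div>.*?</div>", nested));
    }
}
```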
