Index

Boundary Matchers and Word Boundaries

Java Regex

7.1 Word boundaries \b and \B

In regular expressions, word boundaries are zero-width assertions—they do not consume characters, but rather match a position within the input string. They are especially useful when you want to match whole words without accidentally matching parts of longer words.

\b Word Boundary

The \b assertion matches a position where a word character (typically [a-zA-Z0-9_]) is adjacent to a non-word character (such as whitespace or punctuation) or the start/end of the string.

Example:

Pattern pattern = Pattern.compile("\\bcat\\b");
Matcher matcher = pattern.matcher("A cat sat on the cathedral.");
while (matcher.find()) {
    System.out.println("Match: " + matcher.group());
}

Output:

Match: cat

Explanation: Here, \\bcat\\b matches only the whole word "cat", not the "cat" in "cathedral".
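To see where \b actually matches, it can help to mark every boundary position in a string. The following is a small sketch, not part of the original example; the class name BoundaryMarker and the helper markBoundaries are made up for illustration. It replaces each zero-width \b match with a "|" character.

```java
import java.util.regex.Pattern;

public class BoundaryMarker {
    // Hypothetical helper: inserts "|" at every position where \b matches.
    // The text itself is unchanged because \b consumes no characters.
    static String markBoundaries(String text) {
        return Pattern.compile("\\b").matcher(text).replaceAll("|");
    }

    public static void main(String[] args) {
        // Boundaries surround "A" and "cat"; there is none after the final "."
        System.out.println(markBoundaries("A cat.")); // prints |A| |cat|.
    }
}
```

Note that no boundary is reported between the period and the end of the string, since neither side is a word character.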

\B Not a Word Boundary

The \B assertion is the inverse of \b. It matches a position not at a word boundary. This is useful when you want to ensure that a substring occurs within a word, rather than at the start or end.

Example:

Pattern pattern = Pattern.compile("\\Bcat\\B");
Matcher matcher = pattern.matcher("A cat scattered the cathedral.");
while (matcher.find()) {
    System.out.println("Match: " + matcher.group());
}

Output:

Match: cat

Explanation: This pattern matches only the "cat" inside "scattered", where a word character sits on both sides. It skips the standalone word "cat" and the "cat" that begins "cathedral", because each of those starts at a word boundary.
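Because \B must hold on both sides, \Bcat\B only succeeds when "cat" has a word character immediately before and after it. As a rough illustration (the class InsideWordDemo and the method wordsWithInnerCat are hypothetical names), the sketch below tests several words one at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class InsideWordDemo {
    private static final Pattern INSIDE = Pattern.compile("\\Bcat\\B");

    // Hypothetical helper: keeps the words in which "cat" is surrounded
    // by word characters on both sides.
    static List<String> wordsWithInnerCat(String... words) {
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (INSIDE.matcher(w).find()) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "cat" and "cathedral" start at a boundary; "bobcat" ends at one.
        System.out.println(
            wordsWithInnerCat("cat", "cathedral", "bobcat", "scatter", "concatenate"));
        // prints [scatter, concatenate]
    }
}
```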

Common Pitfalls

  1. Escaping \b in Java Strings: Because \b is also a backspace character in Java strings, you must escape it as \\b in your regex pattern.

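  2. Using \b with non-word characters: \b needs a word character on exactly one side, so it rarely behaves as expected next to symbols or punctuation. For example, \b\$100\b cannot match the "$100" in "costs $100", because the space and the $ are both non-word characters and there is no boundary between them. (The $ must also be escaped, since an unescaped $ is an anchor.) In such cases, consider using lookarounds instead.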
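Both pitfalls can be reproduced in a few lines. This is a minimal sketch under assumed names (BoundaryPitfalls and found are illustrative only), and the lookaround pattern is one possible alternative to \b rather than the only one: it requires that "$100" is neither preceded nor followed by a word character.

```java
import java.util.regex.Pattern;

public class BoundaryPitfalls {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Pitfall 1: "\b" in a Java literal is the backspace character (U+0008),
        // so this regex looks for backspace + "cat" + backspace.
        System.out.println(found("\bcat\b", "A cat sat."));   // false
        System.out.println(found("\\bcat\\b", "A cat sat.")); // true

        // Pitfall 2: no word boundary exists between a space and "$".
        String price = "It costs $100 today.";
        System.out.println(found("\\b\\$100\\b", price));          // false
        // Lookarounds: not preceded and not followed by a word character.
        System.out.println(found("(?<!\\w)\\$100(?!\\w)", price)); // true
    }
}
```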

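When to Use

Use \b when a match must be a complete word, for example in search or find-and-replace. Use \B when the text must appear inside a longer word, with a word character on each side.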

Summary

Assertion   Description                      Use Case
\b          Matches at word boundaries       Find whole words only
\B          Matches not at word boundaries   Match substrings within longer words

Word boundaries provide a powerful, efficient way to precisely target words in larger text without false positives from partial matches.

7.2 Start/end of input vs line boundaries ^, $, \A, \Z

In regular expressions, anchors are special assertions that match a position rather than a character. Java provides two categories of anchors for marking the start and end of input: line boundaries and input boundaries.

Line Boundaries: ^ and $

These anchors are affected by multiline mode (Pattern.MULTILINE). When enabled, ^ and $ will match the start and end of each line within a string, not just the entire string.

Example:

String input = "apple\nbanana\ncherry";
Pattern pattern = Pattern.compile("^banana$", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
    System.out.println("Found: " + matcher.group());
}

Output:

Found: banana

Explanation: With Pattern.MULTILINE, ^banana$ matches the exact line "banana", not the entire input.

Without multiline mode, ^ and $ match only the start and end of the whole input string, so the pattern wouldn't find a match in the above example.
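The effect of the flag is easy to check by counting matches with and without it. The sketch below is illustrative only; the class MultilineCount and the method countLines are assumed names, and the pattern ^\w+$ simply stands for "a line made entirely of word characters".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineCount {
    // Hypothetical helper: counts matches of ^\w+$ using the given flags.
    static int countLines(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+$", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "apple\nbanana\ncherry";
        // Default mode: ^ and $ refer to the whole input, which contains newlines.
        System.out.println(countLines(input, 0));                 // 0
        // Multiline mode: ^ and $ match at the edges of every line.
        System.out.println(countLines(input, Pattern.MULTILINE)); // 3
    }
}
```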

Input Boundaries: \A and \Z

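These are not affected by multiline mode. \A always matches at the very start of the input. \Z matches at the end of the input, or just before a final line terminator if the input ends with one. To match only at the absolute end, use lowercase \z.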

Example:

String input = "start\nmiddle\nend";
Pattern pattern = Pattern.compile("\\Astart");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
    System.out.println("Found: " + matcher.group());
}

Output:

Found: start

Now using \Z:

Pattern pattern = Pattern.compile("end\\Z");

This matches "end" only when it is the last text in the input, optionally followed by a single trailing line terminator.
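One detail worth checking for yourself: in java.util.regex, \Z tolerates a final line terminator, while lowercase \z matches only at the absolute end of the input. The following sketch compares the two; the class name EndAnchors and the helper found are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class EndAnchors {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String noNewline = "start\nmiddle\nend";
        String trailingNewline = "start\nmiddle\nend\n";

        // Both anchors agree when nothing follows "end".
        System.out.println(found("end\\Z", noNewline));       // true
        System.out.println(found("end\\z", noNewline));       // true

        // They differ when the input ends with a line terminator.
        System.out.println(found("end\\Z", trailingNewline)); // true
        System.out.println(found("end\\z", trailingNewline)); // false
    }
}
```

If a trailing newline should count as part of the input, \z is the stricter choice.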

Choosing the Right Anchor

Anchor   Meaning                                                     Affected by Multiline Mode
^        Start of a line                                             Yes
$        End of a line                                               Yes
\A       Start of the input                                          No
\Z       End of the input (before a final line terminator, if any)   No

Understanding these anchors and when to use them ensures your regex behaves predictably in both single-line and multi-line scenarios.

7.3 Example: Find whole words only

When searching for specific words in text, it's important to avoid partial matches. For example, if you need to find the word "cat", you should not match "catalog" or "scatter". This is where the word boundary anchor (\b) becomes useful. It ensures that the match occurs only when the word is not part of a larger word.

Java Example: Match Whole Word "cat"

import java.util.regex.*;

public class WordBoundaryExample {
    public static void main(String[] args) {
        String input = "The cat sat on the catalog beside the catfish.";
        String word = "cat";
        
        // Pattern to match the whole word "cat"
        Pattern pattern = Pattern.compile("\\b" + word + "\\b");
        Matcher matcher = pattern.matcher(input);

        while (matcher.find()) {
            System.out.println("Found whole word: \"" + matcher.group() +
                               "\" at position " + matcher.start());
        }
    }
}

Output:

Found whole word: "cat" at position 4
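Concatenating a variable into a pattern works for plain words like "cat", but a search term containing regex metacharacters would be interpreted as a pattern. A possible safeguard, sketched here with assumed names (QuotedWordSearch, wholeWord, count), is to wrap the term in Pattern.quote so it is treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedWordSearch {
    // Hypothetical helper: builds a whole-word pattern that treats the
    // search term as literal text, even if it contains metacharacters.
    static Pattern wholeWord(String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
    }

    // Hypothetical helper: counts how many times the pattern matches.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "Use c.t here, not cat or cut.";
        // Unquoted: "." matches any character, so c.t, cat and cut all match.
        System.out.println(count(Pattern.compile("\\b" + "c.t" + "\\b"), input)); // 3
        // Quoted: only the literal text "c.t" matches.
        System.out.println(count(wholeWord("c.t"), input)); // 1
    }
}
```

This still assumes the term begins and ends with a word character; otherwise the \b pitfall with non-word characters applies.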

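Explanation

The pattern \bcat\b requires a word boundary on both sides of "cat". The standalone word at index 4 qualifies. In "catalog" and "catfish", the character after "cat" is a word character, so the trailing \b fails and those words are skipped.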

Edge Case: Punctuation and Boundaries

Now let's add punctuation to the sentence:

String input = "Cat! A wild cat, not a catalog-catfish hybrid.";

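The same pattern still works, but it finds only one match:

Output:

Found whole word: "cat" at position 12

Punctuation marks like !, the comma, and - are non-word characters, so \b correctly identifies word boundaries near them. The capitalized "Cat" is skipped because the pattern is case-sensitive, and "catalog" and "catfish" are skipped because "cat" is followed by more letters.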
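The pattern \bcat\b is case-sensitive, so it does not match the capitalized "Cat" at the start of the sentence. If capitalized occurrences should count as well, one option is the Pattern.CASE_INSENSITIVE flag. The sketch below uses assumed names (CaseInsensitiveWord, starts) and collects the start index of each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseInsensitiveWord {
    // Hypothetical helper: start indexes of the whole word "cat", ignoring case.
    static List<Integer> starts(String input) {
        Pattern p = Pattern.compile("\\bcat\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Cat! A wild cat, not a catalog-catfish hybrid.";
        // "Cat" at index 0 and "cat" at index 12; "catalog" and "catfish" are skipped.
        System.out.println(starts(input)); // prints [0, 12]
    }
}
```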

Summary

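Using \b in Java regex allows you to match whole words only, skip partial matches inside longer words such as "catalog", and handle adjacent punctuation correctly.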

For best results, always escape \b as \\b in Java string literals, and test your patterns with various sentence structures.
