Text Searching and Extraction

Java Regex

12.1 Searching multiple occurrences

When working with text in Java, it's common to need to find all occurrences of a particular pattern, not just the first one. The Matcher.find() method from the java.util.regex package is designed precisely for this purpose. Unlike matches(), which tries to match the entire input string, find() searches through the input to locate successive subsequences that match the pattern.

Using Matcher.find() to Iterate Over Matches

To find multiple occurrences, you typically create a Matcher object from a compiled Pattern and the input text, then repeatedly call find() in a loop:

Pattern pattern = Pattern.compile("\\bJava\\b");
Matcher matcher = pattern.matcher("Java is fun. I love Java programming.");

while (matcher.find()) {
    System.out.println("Found at index: " + matcher.start() + " - " + matcher.group());
}

This code searches for the whole word "Java" in the input string and prints each match’s start index and matched text.
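
On Java 9 or later, the same iteration can also be expressed as a stream through Matcher.results(). The following is a minimal sketch of that alternative; the class name FindAllDemo and the helper findAll are illustrative names, not part of the original example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FindAllDemo {

    // Hypothetical helper: returns "startIndex:matchedText" for every match.
    static List<String> findAll(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .results()                              // Stream<MatchResult>, Java 9+
                .map(r -> r.start() + ":" + r.group())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [0:Java, 20:Java]
        System.out.println(findAll("\\bJava\\b", "Java is fun. I love Java programming."));
    }
}
```

The explicit while loop remains the better fit when you need to stop early or call other Matcher methods between matches.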

Extracting Capturing Groups

If your regex contains capturing groups (parentheses), you can extract these groups from each match. For example:

Pattern pattern = Pattern.compile("(\\d{3})-(\\d{4})");
Matcher matcher = pattern.matcher("Call 555-1234 or 666-5678.");

while (matcher.find()) {
    System.out.println("Prefix: " + matcher.group(1) + ", Line number: " + matcher.group(2));
}

Here, group(1) holds the three-digit prefix and group(2) the four-digit line number of each phone number, while group() with no argument returns the entire match.

Handling Overlapping or Adjacent Matches

By default, find() continues searching immediately after the last match’s end. This means it doesn’t detect overlapping matches. For example, searching for "ana" in "banana" will find the first "ana" starting at index 1 but will miss the overlapping "ana" starting at index 3.

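Finding overlapping matches takes extra logic. One option is to restart the search one character past the previous match's start by calling matcher.find(matcher.start() + 1), which resets the matcher and searches from the given index. Another is to wrap the pattern in a lookahead that contains a capturing group, such as (?=(ana)), so that each match consumes no input and the matched text is read from group(1).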
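
Both strategies can be sketched for the "banana" example as follows. The class and method names (OverlapDemo, viaLookahead, viaRestart) are made up for illustration, and the pattern "ana" is hard-coded to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {

    // Strategy 1: a lookahead consumes no input, so overlapping hits are all visited.
    static List<Integer> viaLookahead(String input) {
        Matcher m = Pattern.compile("(?=(ana))").matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start(1));   // start of the text captured inside the lookahead
        }
        return starts;
    }

    // Strategy 2: restart the search one character after the previous match's start.
    static List<Integer> viaRestart(String input) {
        Matcher m = Pattern.compile("ana").matcher(input);
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        // find(int) throws if the index exceeds the input length, hence the guard.
        while (from <= input.length() && m.find(from)) {
            starts.add(m.start());
            from = m.start() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(viaLookahead("banana")); // [1, 3]
        System.out.println(viaRestart("banana"));   // [1, 3]
    }
}
```

The lookahead version needs only one pass with the standard find() loop; the restart version leaves the original pattern unchanged, which can be convenient when the pattern is supplied by the caller.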

Practical Use Cases

Keyword Search: Finding every occurrence of certain words or phrases in documents.
Data Extraction: Collecting all dates, numbers, or email addresses from a text.
Syntax Highlighting: Locating all tokens of a language to apply formatting.

Summary

The Matcher.find() method is a powerful way to locate multiple occurrences of regex patterns in Java strings. By iterating over matches and extracting groups, developers can implement robust search and extraction functionality. While adjacent matches are straightforward to handle, overlapping matches need extra attention, often requiring more complex regex or iteration strategies. Understanding these concepts enables efficient and flexible text processing in Java applications.

12.2 Extracting structured data from logs and reports

Logs and reports often contain valuable structured information embedded in semi-structured text. Extracting this data efficiently is a common task in many applications such as monitoring, debugging, and analytics. Regex offers a flexible way to isolate key fields like timestamps, error codes, user IDs, or messages, even when the input format varies slightly.

Designing Regex Patterns for Extraction

The first step in extracting data is understanding the typical structure of your log or report lines. For example, a log entry might look like this:

2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678

Here, you may want to extract the timestamp, error level, error code, and user ID. A regex pattern designed to capture these could be:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(\d+)\s+User login failed for userID=(\d+)

Timestamp: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})
Error Level: (\w+)
Error Code: (\d+)
User ID: (\d+)

Each part is wrapped in parentheses to capture it as a group for later extraction.

Handling Variability and Optional Fields

Logs often contain optional or variable parts. For instance, sometimes the user ID may be missing, or the error message might change. You can use optional groups ((...)?) and non-capturing groups (?:...) to handle such cases gracefully:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(\d+)(?: User login failed for userID=(\d+))?

The (?: ... )? wrapper makes the user ID part optional without adding a group number of its own. When that part is absent, the match still succeeds and matcher.group(4) returns null, which your code should check before using the value.
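
A minimal sketch of that null check is shown below. The class OptionalFieldDemo, the helper userIdOf, the "unknown" fallback, and the second sample line are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalFieldDemo {

    private static final Pattern LOG_LINE = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\d+)"
            + "(?: User login failed for userID=(\\d+))?");

    // Hypothetical helper: returns the user ID, "unknown" when the optional
    // part is absent, or null when the line is not a log entry at all.
    static String userIdOf(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String userId = m.group(4);   // null if the optional group did not participate
        return userId != null ? userId : "unknown";
    }

    public static void main(String[] args) {
        // Prints 5678
        System.out.println(userIdOf(
                "2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678"));
        // Made-up line without a user ID; prints unknown
        System.out.println(userIdOf("2025-06-22 15:46:02 WARN 2001 Disk usage high"));
    }
}
```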

Emphasizing Readability and Maintainability

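Complex extraction patterns can become hard to read. Java's comment mode (the (?x) flag, or Pattern.COMMENTS) lets you spread a pattern across several lines and annotate it with # comments. In this mode unescaped whitespace inside the pattern is ignored, so every literal space has to be written explicitly, for example as \s:

String pattern = "(?x)                                          # Enable comments and whitespace\n" +
                 "(\\d{4}-\\d{2}-\\d{2} \\s \\d{2}:\\d{2}:\\d{2}) \\s+  # Timestamp\n" +
                 "(\\w+) \\s+                                    # Error level\n" +
                 "(\\d+)                                        # Error code\n" +
                 "(?: \\s User \\s login \\s failed \\s for \\s userID=(\\d+) )?  # Optional userID\n";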

This approach makes it easier to update patterns as log formats evolve.
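
The effect of comment mode on whitespace is easy to check in isolation. The sketch below compares the same tiny pattern with and without Pattern.COMMENTS; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class CommentsModeDemo {

    // Normal mode: the space in "a b" is a literal space.
    static boolean plain(String s) {
        return Pattern.compile("a b").matcher(s).matches();
    }

    // Comment mode: the unescaped space is ignored, so the pattern is effectively "ab".
    static boolean commented(String s) {
        return Pattern.compile("a b", Pattern.COMMENTS).matcher(s).matches();
    }

    // Comment mode with the whitespace spelled out as \s and a trailing comment.
    static boolean explicit(String s) {
        return Pattern.compile("a \\s b  # a, one whitespace character, b", Pattern.COMMENTS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(plain("a b"));      // true
        System.out.println(commented("a b"));  // false
        System.out.println(commented("ab"));   // true
        System.out.println(explicit("a b"));   // true
    }
}
```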

Practical Tips

Test your regex extensively with varied real log samples.
Use capturing groups meaningfully to extract exactly what you need.
Combine regex extraction with string or date parsing for richer data processing.
Consider performance by compiling patterns once when processing large log files.
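
The last two tips can be combined in a sketch like the one below: the pattern and the date formatter are created once as constants, and the captured strings are converted to typed values. The class LogFieldParser and its methods are hypothetical names, and the pattern covers only the first three fields of the sample entry shown earlier.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogFieldParser {

    // Compiled once and reused for every line instead of recompiling per call.
    private static final Pattern ENTRY = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\d+)");

    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    private static Matcher match(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized log line: " + line);
        }
        return m;
    }

    // Regex isolates the text; LocalDateTime.parse validates and converts it.
    static LocalDateTime timestampOf(String line) {
        return LocalDateTime.parse(match(line).group(1), TIMESTAMP);
    }

    // Note: Integer.parseInt throws NumberFormatException for very long digit runs.
    static int errorCodeOf(String line) {
        return Integer.parseInt(match(line).group(3));
    }

    public static void main(String[] args) {
        String line = "2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678";
        System.out.println(timestampOf(line)); // 2025-06-22T15:45:30
        System.out.println(errorCodeOf(line)); // 1234
    }
}
```

Letting the date parser reject impossible values such as month 13 keeps the regex itself simple.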

Summary

Extracting structured data from logs and reports with regex requires careful pattern design that balances flexibility and precision. By capturing key fields, handling optional parts, and maintaining readable patterns, you can build robust extraction solutions that adapt well to semi-structured inputs. This approach helps automate monitoring, error tracking, and analytics in many Java applications.

12.3 Example: Extract IP addresses from log files

Extracting IP addresses from log files is a common task in network monitoring, security auditing, and data analysis. In this section, we’ll provide a complete Java example that uses regex to find and extract both IPv4 and IPv6 addresses from log entries.

Designing the Regex Patterns

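An IPv4 address consists of four decimal numbers (0–255) separated by dots, e.g., 192.168.1.1.
An IPv6 address consists of eight groups of up to four hexadecimal digits separated by colons, e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334. A run of all-zero groups may be abbreviated with ::, giving 2001:0db8:85a3::8a2e:0370:7334.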

We’ll create regex patterns for both formats:

IPv4 pattern (simplified for readability):

\b(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}\b

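This matches four octets separated by dots. Within each octet, the alternatives cover 250–255, 200–249, 100–199, and 0–99 in turn, so only values from 0 to 255 without leading zeros are accepted.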
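
One way to convince yourself of the pattern's behavior is to test it against whole strings with matches(). The sketch below assembles the same pattern from a reusable octet fragment; the class Ipv4PatternCheck and the helper isIpv4 are illustrative names.

```java
import java.util.regex.Pattern;

class Ipv4PatternCheck {

    // One octet: 250-255, 200-249, 100-199, or 0-99.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)";

    // Same pattern as above, built from the OCTET fragment.
    private static final Pattern IPV4 =
            Pattern.compile("\\b" + OCTET + "(?:\\." + OCTET + "){3}\\b");

    // True only when the entire string is accepted by the pattern.
    static boolean isIpv4(String s) {
        return IPV4.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"192.168.1.1", "0.0.0.0", "10.0.0.256", "01.2.3.4"}) {
            System.out.println(s + " -> " + isIpv4(s));
        }
    }
}
```

The first two strings are accepted and the last two are rejected: 256 is out of range, and 01 has a leading zero.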

IPv6 pattern (basic version):

\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b

This matches only the full, uncompressed form in which all eight groups are present. Addresses that use :: compression or an embedded IPv4 suffix are not matched; supporting every IPv6 variation requires a considerably more complex regex.

Complete Java Example

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.ArrayList;

public class IPAddressExtractor {

    public static void main(String[] args) {
        // Sample log entries containing IPv4 and IPv6 addresses (text blocks require Java 15+)
        String logData = """
            User connected from 192.168.1.100 at 10:15
            Failed login from 10.0.0.256 (invalid IP)
            Access granted to 2001:0db8:85a3:0000:0000:8a2e:0370:7334
            Ping from 172.16.254.1 succeeded
            Unknown host 1234:5678:9abc:def0:1234:5678:9abc:defg
            """;

        // Regex pattern to match IPv4 and IPv6 addresses
        String ipv4Pattern = "\\b(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)){3}\\b";
        String ipv6Pattern = "\\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\\b";

        // Combine patterns with alternation
        String combinedPattern = ipv4Pattern + "|" + ipv6Pattern;

        Pattern pattern = Pattern.compile(combinedPattern);
        Matcher matcher = pattern.matcher(logData);

        ArrayList<String> foundIPs = new ArrayList<>();

        // Iterate over all matches
        while (matcher.find()) {
            String ip = matcher.group();
            foundIPs.add(ip);
        }

        // Print extracted IP addresses
        System.out.println("Extracted IP addresses:");
        for (String ip : foundIPs) {
            System.out.println(ip);
        }
    }
}

Explanation

Pattern compilation: We compile a regex that matches either IPv4 or IPv6 addresses.
Matcher iteration: Using matcher.find(), we locate all occurrences in the input string.
Group extraction: matcher.group() returns the exact matched IP address.
Result collection: We store matches in a list for further use or display.

Sample Output

Extracted IP addresses:
192.168.1.100
2001:0db8:85a3:0000:0000:8a2e:0370:7334
172.16.254.1

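Note how 10.0.0.256 is skipped: 256 is not a valid octet, and the trailing \b prevents the pattern from settling for the shorter prefix 10.0.0.25. Likewise, the malformed IPv6 address 1234:5678:9abc:def0:1234:5678:9abc:defg is skipped because its last group ends in g, which is not a hexadecimal digit.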

Summary

This example demonstrates a practical approach to extracting both IPv4 and IPv6 addresses from log files using Java regex. While the IPv6 regex here covers standard full addresses, extending it for compressed forms and validating IP correctness may require more sophisticated patterns or external libraries. Nonetheless, regex combined with Java’s Matcher provides a powerful and flexible tool for parsing complex text data efficiently.
