Index

Text Searching and Extraction

Java Regex

12.1 Searching multiple occurrences

When working with text in Java, it's common to need to find all occurrences of a particular pattern, not just the first one. The Matcher.find() method from the java.util.regex package is designed precisely for this purpose. Unlike matches(), which tries to match the entire input string, find() searches through the input to locate successive subsequences that match the pattern.

Using Matcher.find() to Iterate Over Matches

To find multiple occurrences, you typically create a Matcher object from a compiled Pattern and the input text, then repeatedly call find() in a loop:

Pattern pattern = Pattern.compile("\\bJava\\b");
Matcher matcher = pattern.matcher("Java is fun. I love Java programming.");

while (matcher.find()) {
    System.out.println("Found at index: " + matcher.start() + " - " + matcher.group());
}

This code searches for the whole word "Java" in the input string and prints each match’s start index and matched text.
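The difference between matches() and find() described at the start of this section can be checked with a small, self-contained sketch. The class name FindVsMatches and the helper countMatches are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {

    // Counts how many times find() succeeds on the given text.
    static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        String text = "Order 66 shipped 3 items";

        // matches() succeeds only if the whole input is a run of digits.
        System.out.println(digits.matcher(text).matches());  // false
        System.out.println(digits.matcher("66").matches());  // true

        // find() locates each run of digits inside the larger string.
        System.out.println(countMatches(digits, text));      // 2
    }
}
```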

Extracting Capturing Groups

If your regex contains capturing groups (parentheses), you can extract these groups from each match. For example:

Pattern pattern = Pattern.compile("(\\d{3})-(\\d{4})");
Matcher matcher = pattern.matcher("Call 555-1234 or 666-5678.");

while (matcher.find()) {
    System.out.println("Prefix: " + matcher.group(1) + ", Line number: " + matcher.group(2));
}

Here, group 1 holds the three-digit prefix and group 2 the four-digit line number of each phone number. Group 0, or group() with no argument, is the full match.

Handling Overlapping or Adjacent Matches

By default, find() continues searching immediately after the last match’s end. This means it doesn’t detect overlapping matches. For example, searching for "ana" in "banana" will find the first "ana" starting at index 1 but will miss the overlapping "ana" starting at index 3.

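To handle overlapping matches, you need custom logic. Two common options are to restart the search one character past the previous match's start with matcher.find(int), for example matcher.find(matcher.start() + 1), or to wrap the pattern in a lookahead with a capturing group, such as (?=(ana)), so that each match consumes no input and every position is tested.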
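Both strategies can be sketched as follows for the "banana" example. The class and method names are hypothetical, and the search text is treated as a literal via Pattern.quote; adapt that part if you need a real regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingMatches {

    // Approach 1: restart the search one character after each match start.
    static List<Integer> startsViaFindFrom(String literal, String text) {
        Matcher matcher = Pattern.compile(Pattern.quote(literal)).matcher(text);
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        // find(int) resets the matcher and searches from the given index.
        while (from <= text.length() && matcher.find(from)) {
            starts.add(matcher.start());
            from = matcher.start() + 1;
        }
        return starts;
    }

    // Approach 2: a zero-width lookahead consumes no input, so every
    // position is tested; the capturing group holds the matched text.
    static List<Integer> startsViaLookahead(String literal, String text) {
        Pattern pattern = Pattern.compile("(?=(" + Pattern.quote(literal) + "))");
        Matcher matcher = pattern.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start(1));
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(startsViaFindFrom("ana", "banana"));   // [1, 3]
        System.out.println(startsViaLookahead("ana", "banana"));  // [1, 3]
    }
}
```

The lookahead version is shorter, but the matched text must be read from group 1 because the overall match is empty.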

Practical Use Cases

Summary

The Matcher.find() method is a powerful way to locate multiple occurrences of regex patterns in Java strings. By iterating over matches and extracting groups, developers can implement robust search and extraction functionality. While adjacent matches are straightforward to handle, overlapping matches need extra attention, often requiring more complex regex or iteration strategies. Understanding these concepts enables efficient and flexible text processing in Java applications.

12.2 Extracting structured data from logs and reports

Logs and reports often contain valuable structured information embedded in semi-structured text. Extracting this data efficiently is a common task in many applications such as monitoring, debugging, and analytics. Regex offers a flexible way to isolate key fields like timestamps, error codes, user IDs, or messages, even when the input format varies slightly.

Designing Regex Patterns for Extraction

The first step in extracting data is understanding the typical structure of your log or report lines. For example, a log entry might look like this:

2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678

Here, you may want to extract the timestamp, error level, error code, and user ID. A regex pattern designed to capture these could be:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(\d+)\s+User login failed for userID=(\d+)

Each part is wrapped in parentheses to capture it as a group for later extraction.
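As a minimal sketch, the pattern above could be applied to the sample line like this. The class name LogLineParser and the parse helper are illustrative assumptions, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {

    // Same pattern as above, with backslashes doubled for a Java string.
    private static final Pattern LOGIN_FAILURE = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\d+)\\s+"
        + "User login failed for userID=(\\d+)");

    // Returns {timestamp, level, code, userId}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOGIN_FAILURE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] fields = parse(
            "2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678");
        System.out.println(String.join(" | ", fields));
        // prints: 2025-06-22 15:45:30 | ERROR | 1234 | 5678
    }
}
```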

Handling Variability and Optional Fields

Logs often contain optional or variable parts. For instance, sometimes the user ID may be missing, or the error message might change. You can use optional groups ((...)?) and non-capturing groups (?:...) to handle such cases gracefully:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\w+)\s+(\d+)(?: User login failed for userID=(\d+))?

The (?: ... )? wrapper makes the user ID part optional without adding a group of its own. When that part is absent, the match still succeeds but matcher.group(4) returns null, so your code should check for null before using the value.
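One way to handle the missing group is to wrap it in an Optional. This is only a sketch: the class name OptionalUserId, the userId helper, and the "Disk full" line are assumptions made for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalUserId {

    private static final Pattern ENTRY = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\s+(\\w+)\\s+(\\d+)"
        + "(?: User login failed for userID=(\\d+))?");

    // Returns the user ID if present; empty if the line matched without one
    // or did not match at all.
    static Optional<String> userId(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        // group(4) is null when the optional part did not participate.
        return Optional.ofNullable(m.group(4));
    }

    public static void main(String[] args) {
        System.out.println(userId(
            "2025-06-22 15:45:30 ERROR 1234 User login failed for userID=5678"));
        // prints: Optional[5678]
        System.out.println(userId("2025-06-22 15:45:30 ERROR 1234 Disk full"));
        // prints: Optional.empty
    }
}
```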

Emphasizing Readability and Maintainability

Complex extraction patterns can become hard to read. Use comments in your regex (via (?x) mode in Java) and break down the pattern logically. In this mode unescaped whitespace in the pattern is ignored, so every literal space must be written explicitly, for example as \s:

String pattern = "(?x)                                          # Enable comments and whitespace\n" +
                 "(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}) \\s+      # Timestamp\n" +
                 "(\\w+) \\s+                                     # Error level\n" +
                 "(\\d+)                                         # Error code\n" +
                 "(?:\\sUser\\slogin\\sfailed\\sfor\\suserID=(\\d+))? # Optional userID\n";

This approach makes it easier to update patterns as log formats evolve.
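One detail of (?x) mode is easy to miss: whitespace inside the pattern is ignored, so a literal space no longer matches a space in the input. The following sketch (class and field names are illustrative) shows the effect and one way around it using \s.

```java
import java.util.regex.Pattern;

public class CommentsModeWhitespace {

    // In (?x) mode the space between the two \d{2} tokens is ignored,
    // so this pattern is equivalent to \d{2}\d{2}.
    static final Pattern UNESCAPED =
        Pattern.compile("(?x) \\d{2} \\d{2}  # two pairs of digits");

    // \s is not affected by (?x) mode and matches the separating whitespace.
    static final Pattern ESCAPED =
        Pattern.compile("(?x) \\d{2} \\s \\d{2}  # pairs split by whitespace");

    public static void main(String[] args) {
        System.out.println(UNESCAPED.matcher("12 34").matches()); // false
        System.out.println(UNESCAPED.matcher("1234").matches());  // true
        System.out.println(ESCAPED.matcher("12 34").matches());   // true
    }
}
```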

Practical Tips

Summary

Extracting structured data from logs and reports with regex requires careful pattern design that balances flexibility and precision. By capturing key fields, handling optional parts, and maintaining readable patterns, you can build robust extraction solutions that adapt well to semi-structured inputs. This approach helps automate monitoring, error tracking, and analytics in many Java applications.

12.3 Example: Extract IP addresses from log files

Extracting IP addresses from log files is a common task in network monitoring, security auditing, and data analysis. In this section, we’ll provide a complete Java example that uses regex to find and extract both IPv4 and IPv6 addresses from log entries.

Designing the Regex Patterns

We’ll create regex patterns for both formats:

IPv4 pattern (simplified for readability):

\b(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}\b

This matches numbers from 0 to 255 in four octets separated by dots.
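The range check can be verified in isolation with a short sketch. The OCTET constant and the containsIpv4 helper are names introduced here for readability; the assembled pattern is the same as the one above.

```java
import java.util.regex.Pattern;

public class Ipv4PatternCheck {

    // One octet in the range 0-255 (hypothetical helper constant).
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)";

    private static final Pattern IPV4 =
        Pattern.compile("\\b" + OCTET + "(?:\\." + OCTET + "){3}\\b");

    // True if the text contains at least one substring accepted by the pattern.
    static boolean containsIpv4(String text) {
        return IPV4.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsIpv4("Ping from 172.16.254.1 succeeded")); // true
        System.out.println(containsIpv4("255.255.255.255"));                  // true
        System.out.println(containsIpv4("Failed login from 10.0.0.256"));     // false
    }
}
```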

IPv6 pattern (basic version):

\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b

This matches only fully expanded IPv6 addresses: eight groups of one to four hex digits, with no :: compression. Compressed forms such as 2001:db8::1 are not matched, so handling all IPv6 variations requires a more complex regex.

Complete Java Example

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.ArrayList;

public class IPAddressExtractor {

    public static void main(String[] args) {
        // Sample log entries containing IPv4 and IPv6 addresses (text block, requires Java 15+)
        String logData = """
            User connected from 192.168.1.100 at 10:15
            Failed login from 10.0.0.256 (invalid IP)
            Access granted to 2001:0db8:85a3:0000:0000:8a2e:0370:7334
            Ping from 172.16.254.1 succeeded
            Unknown host 1234:5678:9abc:def0:1234:5678:9abc:defg
            """;

        // Regex pattern to match IPv4 and IPv6 addresses
        String ipv4Pattern = "\\b(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|1\\d{2}|[1-9]?\\d)){3}\\b";
        String ipv6Pattern = "\\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\\b";

        // Combine patterns with alternation
        String combinedPattern = ipv4Pattern + "|" + ipv6Pattern;

        Pattern pattern = Pattern.compile(combinedPattern);
        Matcher matcher = pattern.matcher(logData);

        ArrayList<String> foundIPs = new ArrayList<>();

        // Iterate over all matches
        while (matcher.find()) {
            String ip = matcher.group();
            foundIPs.add(ip);
        }

        // Print extracted IP addresses
        System.out.println("Extracted IP addresses:");
        for (String ip : foundIPs) {
            System.out.println(ip);
        }
    }
}

Explanation

Sample Output

Extracted IP addresses:
192.168.1.100
2001:0db8:85a3:0000:0000:8a2e:0370:7334
172.16.254.1

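Note that 10.0.0.256 is skipped because 256 is outside the 0-255 range and the trailing \b prevents a partial match on 10.0.0.25. The malformed IPv6 address 1234:5678:9abc:def0:1234:5678:9abc:defg is skipped for a similar reason: g is not a hex digit, and the trailing \b rejects a match that stops at def.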

Summary

This example demonstrates a practical approach to extracting both IPv4 and IPv6 addresses from log files using Java regex. While the IPv6 regex here covers standard full addresses, extending it for compressed forms and validating IP correctness may require more sophisticated patterns or external libraries. Nonetheless, regex combined with Java’s Matcher provides a powerful and flexible tool for parsing complex text data efficiently.
