Parsing and Tokenizing with Regex

Java Regex

14.1 Splitting strings by patterns

Java’s String.split() method is a powerful tool that lets you divide a string into parts based on a regex pattern rather than just a fixed character. This flexibility is essential when dealing with complex delimiters or varying separators in your input data.

Using Regex with split()

Instead of splitting by a single character like a comma, you can provide a regex pattern to match one or more delimiters. For example, splitting a sentence on commas, semicolons, or spaces:

String sentence = "Java,Python; C++  Ruby";
String[] parts = sentence.split("[,;\\s]+"); // Split on comma, semicolon, or whitespace
for (String part : parts) {
    System.out.println(part);
}

This outputs:

Java
Python
C++
Ruby

The regex [,;\\s]+ matches one or more commas, semicolons, or whitespace characters in a row, so any run of these delimiters counts as a single separator.

Handling Optional Spaces and Multiple Delimiters

Sometimes delimiters may be surrounded by optional spaces. For example, a CSV line might have spaces around commas:

String csv = "apple , banana,   cherry ,date";
String[] fruits = csv.split("\\s*,\\s*"); // Split on commas with optional spaces
for (String fruit : fruits) {
    System.out.println(fruit);
}

Output:

apple
banana
cherry
date

Here, the regex \\s*,\\s* matches a comma possibly surrounded by any amount of whitespace, so spaces don’t end up in the tokens.

Escaping Special Characters

If your delimiter includes regex metacharacters (like ., |, *, ?), remember to escape them properly:

String data = "one.two.three";
String[] parts = data.split("\\."); // Dot is escaped as "\\."

Without escaping, the dot matches any character, leading to unexpected splits.
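
To see the difference, the sketch below splits the same string with an unescaped dot, with a hand-escaped dot, and with Pattern.quote(), which escapes an arbitrary literal for you. The class name EscapeDemo and the helper splitLiteral are illustrative only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: split on a literal delimiter, whatever characters it contains
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String data = "one.two.three";

        // Unescaped: every character is a delimiter, so only empty strings
        // remain, and split() drops trailing empty strings
        System.out.println(data.split(".").length); // 0

        // Escaped by hand
        System.out.println(Arrays.toString(data.split("\\."))); // [one, two, three]

        // Escaped by Pattern.quote()
        System.out.println(Arrays.toString(splitLiteral(data, ".")));    // [one, two, three]
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|"))); // [a, b, c]
    }
}
```

Pattern.quote() is convenient when the delimiter comes from configuration or user input and you cannot know in advance which metacharacters it contains.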

Splitting Logs or Custom Formats

For log lines that use complex delimiters, such as timestamps or specific markers, regex can precisely target these patterns:

String log = "INFO|2025-06-22|User login|Success";
String[] fields = log.split("\\|"); // Split on pipe character
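
When the last field is free text that may itself contain the delimiter, the two-argument split(regex, limit) overload caps the number of pieces. A minimal sketch, assuming a hypothetical log layout of level, date, and message (the class and method names are also illustrative):

```java
import java.util.Arrays;

class LogSplitDemo {
    // Assumed layout: LEVEL|DATE|MESSAGE, where MESSAGE may contain pipes
    static String[] parseLogLine(String line) {
        // limit = 3: at most three pieces; the third keeps any remaining pipes
        return line.split("\\|", 3);
    }

    public static void main(String[] args) {
        String[] fields = parseLogLine("INFO|2025-06-22|User login|Success");
        System.out.println(Arrays.toString(fields));
        // prints [INFO, 2025-06-22, User login|Success]
    }
}
```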

Limitations and Pitfalls
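
Two behaviors of split() are easy to overlook: by default it discards trailing empty strings, and a call with a non-trivial regex may compile the pattern again every time. The sketch below illustrates both. A negative limit keeps trailing empty fields, and a precompiled Pattern can be reused across many lines. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitPitfalls {
    // Compiled once and reused, instead of being recompiled for every line
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Negative limit: keep trailing empty fields
    static String[] splitKeepingEmpties(String line) {
        return COMMA.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "a,b,,";
        System.out.println(line.split(",").length);           // 2: trailing empties dropped
        System.out.println(splitKeepingEmpties(line).length); // 4: trailing empties kept
        System.out.println(Arrays.toString(splitKeepingEmpties("x , y,"))); // [x, y, ]
    }
}
```

Whether trailing empty fields matter depends on the format: in a fixed-column record they usually do, so the negative limit is the safer default there.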

Summary

Using regex with String.split() allows flexible, robust string division beyond fixed characters. Handling multiple delimiters, optional spaces, and escaping special characters helps process real-world data formats like CSV, logs, or free text efficiently. Understanding regex syntax and method nuances ensures accurate and performant splitting for your parsing tasks.

14.2 Tokenizing input for simple parsing

Tokenization is the process of breaking down a piece of text into smaller, meaningful units called tokens. These tokens can be words, numbers, symbols, or other logical chunks that a program can analyze or process individually. Unlike simple splitting, which divides input solely by delimiters, tokenization often involves identifying valid elements while discarding irrelevant separators.

Splitting vs. Tokenizing

For example, given the input:

x = 42 + 15
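
To make the contrast concrete, the sketch below processes this input twice, once with and once without spaces. Splitting on whitespace only works when the separators are present, whereas matching tokens yields the same result either way. The token pattern and the names used here are assumptions chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVsTokenize {
    // Assumed token set: integers, identifiers, and a few operators
    private static final Pattern TOKEN = Pattern.compile("\\d+|[a-zA-Z]+|[=+\\-*/]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Splitting depends on the separators being there
        System.out.println(Arrays.toString("x = 42 + 15".split("\\s+"))); // [x, =, 42, +, 15]
        System.out.println(Arrays.toString("x=42+15".split("\\s+")));     // [x=42+15]

        // Tokenizing recognizes the elements themselves
        System.out.println(tokenize("x = 42 + 15")); // [x, =, 42, +, 15]
        System.out.println(tokenize("x=42+15"));     // [x, =, 42, +, 15]
    }
}
```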

Using Java’s Matcher for Tokenizing

In Java, tokenization is often done by applying a regex pattern with the Matcher.find() method to sequentially extract tokens matching certain criteria.

Here’s an example that tokenizes simple arithmetic expressions into numbers, operators, and identifiers:

import java.util.regex.*;

public class TokenizerExample {
    public static void main(String[] args) {
        String input = "x = 42 + y - 3 * 7";
        String tokenPattern = "\\d+|[a-zA-Z]+|[=+\\-*/]";

        Pattern pattern = Pattern.compile(tokenPattern);
        Matcher matcher = pattern.matcher(input);

        while (matcher.find()) {
            System.out.println("Token: " + matcher.group());
        }
    }
}

Output:

Token: x
Token: =
Token: 42
Token: +
Token: y
Token: -
Token: 3
Token: *
Token: 7

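Here, the regex \\d+|[a-zA-Z]+|[=+\\-*/] matches one of three token types: \\d+ matches a run of digits (a number), [a-zA-Z]+ matches a run of letters (an identifier), and [=+\\-*/] matches a single operator character.

Spaces and any other characters the pattern does not cover are skipped, because find() simply advances to the next position where a token matches.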
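
Because find() skips whatever the pattern does not cover, invalid input can go unnoticed. One way to surface it, sketched here under the assumption that only whitespace may separate tokens, is to compare the start of each match with the end of the previous one. The class StrictTokenizer and its helper requireBlank are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictTokenizer {
    private static final Pattern TOKEN = Pattern.compile("\\d+|[a-zA-Z]+|[=+\\-*/]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        int lastEnd = 0;
        while (matcher.find()) {
            // Check the text skipped between the previous token and this one
            requireBlank(input, lastEnd, matcher.start());
            tokens.add(matcher.group());
            lastEnd = matcher.end();
        }
        // Check whatever follows the last token
        requireBlank(input, lastEnd, input.length());
        return tokens;
    }

    // Rejects any non-whitespace text in input[from, to)
    private static void requireBlank(String input, int from, int to) {
        String gap = input.substring(from, to).trim();
        if (!gap.isEmpty()) {
            throw new IllegalArgumentException("Unexpected input: '" + gap + "'");
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y")); // [x, =, 42, +, y]
        try {
            tokenize("x = 42 # y");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Unexpected input: '#'
        }
    }
}
```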

Practical Applications

Tokenization is essential for parsing simple command languages, formulas, or configuration inputs where input is more complex than straightforward comma-separated values.

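For instance, a configuration line such as timeout = 30 breaks down into an identifier, an operator, and a number.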

By defining regex patterns for each token type and sequentially extracting tokens, you can build parsers that understand and manipulate complex inputs cleanly.
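
One common way to define a pattern per token type is to give each alternative a named capturing group and then check which group matched. A minimal sketch follows; the group names NUMBER, IDENT, and OP and the "TYPE:text" output format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypedTokenizer {
    // One named group per token type (names are illustrative)
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NUMBER>\\d+)|(?<IDENT>[a-zA-Z]+)|(?<OP>[=+\\-*/])");

    private static final String[] TYPES = {"NUMBER", "IDENT", "OP"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            for (String type : TYPES) {
                // group(name) is null for alternatives that did not match
                if (matcher.group(type) != null) {
                    tokens.add(type + ":" + matcher.group(type));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y"));
        // prints [IDENT:x, OP:=, NUMBER:42, OP:+, IDENT:y]
    }
}
```

Keeping the token type next to the text lets a later parsing stage decide what to do with each token without re-examining its characters.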

Summary

Tokenizing with regex in Java involves matching meaningful elements of text rather than just splitting by delimiters. Using Matcher.find() to extract tokens like words, numbers, and symbols allows for flexible and precise parsing of formulas, commands, or structured inputs, enabling more powerful text processing beyond basic splitting.

14.3 Example: Parsing CSV or custom delimited data

Parsing CSV (Comma-Separated Values) or similarly structured data with custom delimiters is a common task where regex can help, especially when fields contain quoted values, escaped delimiters, or optional spaces. While dedicated CSV libraries exist, understanding how to handle these challenges with regex deepens your grasp of text parsing.

Challenges in Parsing CSV with Regex

The main difficulties are quoted fields that contain the delimiter itself, quotes escaped by doubling them (""), optional whitespace around fields, and empty fields.

Regex Pattern for CSV Parsing

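A regex for CSV fields has to handle both quoted and unquoted values. The following pattern matches one field per match, together with the comma or end of input that terminates it:

\s*(?:"([^"]*(?:""[^"]*)*)"|([^,]*?))\s*(,|$)

Written as a Java string literal, with quotes and backslashes escaped, it becomes:

"\\s*(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*?))\\s*(,|$)"

Java Example: Parsing CSV Lines

import java.util.regex.*;
import java.util.*;

public class CsvParserExample {
    public static void main(String[] args) {
        String input = "John, \"Doe, Jane\", \"1234 \"\"Main\"\" St.\", , 42";

        // Each match is one field plus its terminator:
        //   group 1 = content of a quoted field
        //   group 2 = unquoted field (possibly empty)
        //   group 3 = terminator: a comma, or empty at the end of the input
        String regex = "\\s*(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*?))\\s*(,|$)";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);

        List<String> fields = new ArrayList<>();

        while (matcher.find()) {
            String quotedField = matcher.group(1);
            String unquotedField = matcher.group(2);

            if (quotedField != null) {
                // Unescape doubled quotes ("") into a single quote (")
                fields.add(quotedField.replace("\"\"", "\""));
            } else {
                // Surrounding spaces were already consumed by \s*
                fields.add(unquotedField);
            }

            // An empty terminator means the last field has been read
            if (matcher.group(3).isEmpty()) {
                break;
            }
        }

        // Output extracted fields
        System.out.println("Parsed fields:");
        for (int i = 0; i < fields.size(); i++) {
            System.out.printf("Field %d: '%s'%n", i + 1, fields.get(i));
        }
    }
}

Explanation

Group 1 captures the content of a quoted field; inside the quotes, [^"]*(?:""[^"]*)* allows commas and doubled quotes. Group 2 captures an unquoted field: the lazy [^,]*? between the two \s* parts leaves out leading and trailing spaces and also matches an empty field. Group 3 captures the terminator. Because every match consumes its own terminator, a comma is never mistaken for a field, and an empty terminator signals that the end of the input has been reached.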

Sample Input

John, "Doe, Jane", "1234 ""Main"" St.", , 42

Sample Output

Parsed fields:
Field 1: 'John'
Field 2: 'Doe, Jane'
Field 3: '1234 "Main" St.'
Field 4: ''
Field 5: '42'

Summary

Parsing CSV or custom delimited data using regex requires careful pattern design to handle quoted fields, escaped characters, and empty values. By combining capturing groups and conditional logic, you can robustly extract fields from complex inputs. This example demonstrates practical handling of typical CSV quirks, enabling you to adapt the approach for other delimiter-based formats and edge cases.
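
As one possible adaptation to other delimiters, the sketch below builds the same kind of field-plus-terminator pattern for any single-character delimiter. It is a sketch under stated assumptions, not a full CSV implementation: the helper parseLine is hypothetical, the delimiter must not be a space or a double quote, only spaces (not tabs) are trimmed so that a tab can serve as delimiter, and the input is a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedParser {
    // Hypothetical helper: parse one line using a single-character delimiter.
    // Assumes the delimiter is neither a space nor a double quote.
    static List<String> parseLine(String line, char delimiter) {
        String d = Pattern.quote(String.valueOf(delimiter));
        // group 1 = quoted field, group 2 = unquoted field,
        // group 3 = terminator (the delimiter, or empty at end of input)
        Pattern pattern = Pattern.compile(
            " *(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|((?:(?!" + d + ").)*?)) *(" + d + "|$)");
        Matcher matcher = pattern.matcher(line);

        List<String> fields = new ArrayList<>();
        while (matcher.find()) {
            String quoted = matcher.group(1);
            // Unescape doubled quotes in quoted fields; take unquoted fields as-is
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : matcher.group(2));
            if (matcher.group(3).isEmpty()) {
                break; // end of input reached
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("id; \"Smith; John\" ;; 7", ';'));
        // prints [id, Smith; John, , 7]
        System.out.println(parseLine("a|b|", '|'));
        // prints [a, b, ]
    }
}
```

For production data, a dedicated CSV library remains the safer choice, since it also covers cases such as quoted fields that span several lines.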
