Parsing and Tokenizing with Regex

Java Regex

14.1 Splitting strings by patterns

Java’s String.split() method is a powerful tool that lets you divide a string into parts based on a regex pattern rather than just a fixed character. This flexibility is essential when dealing with complex delimiters or varying separators in your input data.

Using Regex with `split()`

Instead of splitting by a single character like a comma, you can provide a regex pattern to match one or more delimiters. For example, splitting a sentence on commas, semicolons, or spaces:

String sentence = "Java,Python; C++  Ruby";
String[] parts = sentence.split("[,;\\s]+"); // Split on comma, semicolon, or whitespace
for (String part : parts) {
    System.out.println(part);
}

This outputs:

Java
Python
C++
Ruby

The regex [,;\\s]+ matches one or more commas, semicolons, or whitespace characters, so any run of these delimiters counts as a single separator.

Handling Optional Spaces and Multiple Delimiters

Sometimes delimiters may be surrounded by optional spaces. For example, a CSV line might have spaces around commas:

String csv = "apple , banana,   cherry ,date";
String[] fruits = csv.split("\\s*,\\s*"); // Split on commas with optional spaces
for (String fruit : fruits) {
    System.out.println(fruit);
}

Output:

apple
banana
cherry
date

Here, the regex \\s*,\\s* matches a comma possibly surrounded by any amount of whitespace, so spaces don’t end up in the tokens.

Escaping Special Characters

If your delimiter includes regex metacharacters (like ., |, *, ?), remember to escape them properly:

String data = "one.two.three";
String[] parts = data.split("\\."); // Dot is escaped as "\\."

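Without escaping, the dot matches any character, so every character in the input is treated as a delimiter. For "one.two.three", split(".") returns an empty array, because all resulting tokens are empty and trailing empty strings are discarded.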
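
When the delimiter comes from configuration or user input, escaping it by hand is easy to get wrong. One option is Pattern.quote(), which wraps a string so the regex engine treats it literally. The sketch below assumes a helper named splitLiteral, which is a hypothetical name used only for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplitExample {
    // Hypothetical helper: splits on the delimiter taken literally, not as a regex.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("one.two.three", ".")));
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|")));
        // Unescaped dot: every character is a delimiter, so no tokens remain.
        System.out.println("one.two.three".split(".").length);
    }
}
```

This prints [one, two, three], then [a, b, c], then 0.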

Splitting Logs or Custom Formats

Log lines often separate fields with a marker character rather than a comma. In the example below the fields are pipe-delimited, and because | is the regex alternation operator, it must be escaped:

String log = "INFO|2025-06-22|User login|Success";
String[] fields = log.split("\\|"); // Split on pipe character

Limitations and Pitfalls

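Empty tokens: Adjacent delimiters produce empty strings in the result array, and a delimiter at the start of the input produces a leading empty string. Trailing empty strings are removed by default. Filter the results or adjust the pattern as needed.
Performance: split() with a regex argument generally compiles the pattern on every call. For heavy use on large inputs, compile the pattern once with Pattern.compile() and call its split() method instead.
Limit parameter: The optional second argument to split() controls the result size. A positive limit caps the number of elements, with the last element holding the unsplit remainder. A negative limit applies no cap and keeps trailing empty strings. Zero, the default, applies no cap and discards them.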
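
To make the effect of empty tokens and the limit argument concrete, here is a small sketch. The class name, the helper splitKeepingEmpties, and the sample line are assumptions chosen for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitExample {
    // Compiled once so repeated calls do not recompile the regex.
    private static final Pattern COMMA = Pattern.compile(",");

    // Hypothetical helper: a negative limit keeps trailing empty strings.
    static String[] splitKeepingEmpties(String line) {
        return COMMA.split(line, -1);
    }

    public static void main(String[] args) {
        String line = "a,,b,,";
        System.out.println(Arrays.toString(line.split(",")));     // trailing empties dropped
        System.out.println(Arrays.toString(line.split(",", -1))); // trailing empties kept
        System.out.println(Arrays.toString(line.split(",", 2)));  // at most two elements
        System.out.println(splitKeepingEmpties(line).length);
    }
}
```

The three arrays have 3, 5, and 2 elements respectively: the default call keeps the empty string between the first two commas but drops the two at the end, and with a limit of 2 the second element is the unsplit remainder ",b,,".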

Summary

Using regex with String.split() allows flexible, robust string division beyond fixed characters. Handling multiple delimiters, optional spaces, and escaping special characters helps process real-world data formats like CSV, logs, or free text efficiently. Understanding regex syntax and method nuances ensures accurate and performant splitting for your parsing tasks.

14.2 Tokenizing input for simple parsing

Tokenization is the process of breaking down a piece of text into smaller, meaningful units called tokens. These tokens can be words, numbers, symbols, or other logical chunks that a program can analyze or process individually. Unlike simple splitting, which divides input solely by delimiters, tokenization often involves identifying valid elements while discarding irrelevant separators.

Splitting vs. Tokenizing

Splitting breaks a string wherever a delimiter appears, resulting in chunks that may include empty or irrelevant parts.
Tokenizing uses regex to find valid tokens in a string by matching patterns, focusing only on meaningful pieces and ignoring separators.

For example, given the input:

x = 42 + 15

Splitting by spaces yields: ["x", "=", "42", "+", "15"] (straightforward but sensitive to whitespace).
Tokenizing by patterns extracts tokens such as identifiers (x), numbers (42, 15), and operators (=, +), ignoring spaces completely.
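
The difference shows up as soon as the spacing changes. The following sketch compares both approaches on the same expression written with and without spaces. It borrows the token pattern used in the example later in this section; the class and method names are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVsTokenize {
    // Numbers, identifiers, or single operator characters.
    private static final Pattern TOKEN = Pattern.compile("\\d+|[a-zA-Z]+|[=+\\-*/]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("x = 42 + 15".split(" "))); // five chunks
        System.out.println(Arrays.toString("x=42+15".split(" ")));     // one chunk
        System.out.println(tokenize("x = 42 + 15"));                   // five tokens
        System.out.println(tokenize("x=42+15"));                       // same five tokens
    }
}
```

Splitting on spaces returns the whole string "x=42+15" as a single chunk, while the tokenizer produces [x, =, 42, +, 15] for both inputs.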

Using Java's `Matcher` for Tokenizing

In Java, tokenization is often done by applying a regex pattern with the Matcher.find() method to sequentially extract tokens matching certain criteria.

Here’s an example that tokenizes simple arithmetic expressions into numbers, operators, and identifiers:

import java.util.regex.*;

public class TokenizerExample {
    public static void main(String[] args) {
        String input = "x = 42 + y - 3 * 7";
        String tokenPattern = "\\d+|[a-zA-Z]+|[=+\\-*/]";

        Pattern pattern = Pattern.compile(tokenPattern);
        Matcher matcher = pattern.matcher(input);

        while (matcher.find()) {
            System.out.println("Token: " + matcher.group());
        }
    }
}

Output:

Token: x
Token: =
Token: 42
Token: +
Token: y
Token: -
Token: 3
Token: *
Token: 7

Here, the regex \\d+|[a-zA-Z]+|[=+\\-*/] matches:

One or more digits (\\d+) — numbers,
One or more letters ([a-zA-Z]+) — variable names or keywords,
Single operator characters ([=+\\-*/]).

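Spaces are ignored because the matcher only reports text that matches one of the token patterns. Note that this applies to every unmatched character, not just whitespace: an unexpected symbol such as $ is skipped silently, so a tokenizer that needs to reject invalid input must check the gaps between matches itself.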
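
Because find() skips anything the pattern does not match, invalid characters pass unnoticed. One way to catch them, sketched here under the assumption that only whitespace may appear between tokens, is to track where each match starts and ends and verify the text in between. The names StrictTokenizer and requireBlank are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictTokenizer {
    private static final Pattern TOKEN = Pattern.compile("\\d+|[a-zA-Z]+|[=+\\-*/]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0; // end of the previous token
        while (m.find()) {
            requireBlank(input, pos, m.start());
            tokens.add(m.group());
            pos = m.end();
        }
        requireBlank(input, pos, input.length()); // check the tail as well
        return tokens;
    }

    // Throws if the text between two tokens contains anything but whitespace.
    private static void requireBlank(String input, int from, int to) {
        String gap = input.substring(from, to).trim();
        if (!gap.isEmpty()) {
            throw new IllegalArgumentException(
                "Unexpected input '" + gap + "' after index " + from);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y"));
        try {
            tokenize("x = 42 $ 7");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The first call returns [x, =, 42, +, y]; the second throws an IllegalArgumentException because of the stray $.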

Practical Applications

Tokenization is essential for parsing simple command languages, formulas, or configuration inputs where input is more complex than straightforward comma-separated values.

For instance:

Parsing command line inputs or shell-like commands,
Processing mathematical expressions in calculators,
Interpreting simple scripting or domain-specific languages.

By defining regex patterns for each token type and sequentially extracting tokens, you can build parsers that understand and manipulate complex inputs cleanly.
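
A parser usually needs to know what kind of token it has, not just its text. A minimal sketch of one approach is to give each alternative a named capturing group and check which group took part in the match. The group names NUMBER, IDENT, and OP and the "TYPE:text" output format are assumptions made for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypedTokenizer {
    // One named group per token type.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<NUMBER>\\d+)|(?<IDENT>[a-zA-Z]+)|(?<OP>[=+\\-*/])");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one named group is non-null for each match.
            String type = m.group("NUMBER") != null ? "NUMBER"
                        : m.group("IDENT") != null ? "IDENT"
                        : "OP";
            tokens.add(type + ":" + m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y"));
    }
}
```

This prints [IDENT:x, OP:=, NUMBER:42, OP:+, IDENT:y]. A fuller parser would likely use a small token class or enum instead of strings.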

Summary

Tokenizing with regex in Java involves matching meaningful elements of text rather than just splitting by delimiters. Using Matcher.find() to extract tokens like words, numbers, and symbols allows for flexible and precise parsing of formulas, commands, or structured inputs, enabling more powerful text processing beyond basic splitting.

14.3 Example: Parsing CSV or custom delimited data

Parsing CSV (Comma-Separated Values) or similarly structured data with custom delimiters is a common task where regex can help, especially when fields contain quoted values, escaped delimiters, or optional spaces. While dedicated CSV libraries exist, understanding how to handle these challenges with regex deepens your grasp of text parsing.

Challenges in Parsing CSV with Regex

Quoted fields: Fields can be enclosed in quotes to include commas or newlines inside a value.
Escaped quotes: Quotes inside quoted fields are often escaped by doubling ("").
Optional whitespace: Spaces may appear around delimiters.
Empty fields: Fields may be empty (e.g., consecutive commas).
Custom delimiters: Sometimes semicolons, tabs, or pipes separate values instead of commas.

Regex Pattern for CSV Parsing

A pattern that handles both quoted and unquoted values can match one field at a time, together with the delimiter that follows it:

\s*(?:"([^"]*(?:""[^"]*)*)"|([^,]*?))\s*(,|$)

\s* skips optional whitespace before and after the field.
"([^"]*(?:""[^"]*)*)" matches a quoted field, allowing for escaped quotes (""). Group 1 captures the content between the outer quotes.
([^,]*?) matches an unquoted field without commas. It may be empty, which covers empty fields. Group 2 captures it.
(,|$) matches the delimiter that ends the field: a comma or the end of the input. Group 3 captures it.

As a Java string literal, with quotes and backslashes escaped, the pattern becomes:

"\\s*(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*?))\\s*(,|$)"

Java Example: Parsing CSV Lines

import java.util.regex.*;
import java.util.*;

public class CsvParserExample {
    public static void main(String[] args) {
        String input = "John, \"Doe, Jane\", \"1234 \"\"Main\"\" St.\", , 42";

        // Each match is one field: optional whitespace, a quoted or unquoted value,
        // optional whitespace, then the delimiter (a comma or the end of the input)
        String regex = "\\s*(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*?))\\s*(,|$)";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);

        List<String> fields = new ArrayList<>();

        while (matcher.find()) {
            String quotedField = matcher.group(1);
            String unquotedField = matcher.group(2);

            if (quotedField != null) {
                // Unescape doubled quotes: "" becomes "
                fields.add(quotedField.replace("\"\"", "\""));
            } else {
                // Unquoted field; a zero-length match here is an empty field
                fields.add(unquotedField.trim());
            }

            // Group 3 is empty when the field ended at the end of the input
            if (matcher.group(3).isEmpty()) {
                break;
            }
        }

        // Output extracted fields
        System.out.println("Parsed fields:");
        for (int i = 0; i < fields.size(); i++) {
            System.out.printf("Field %d: '%s'%n", i + 1, fields.get(i));
        }
    }
}

Explanation

Each call to find() consumes one field together with the delimiter that follows it:
- Group 1 captures the content of a quoted field, including any escaped quotes.
- Group 2 captures an unquoted field, which may be empty.
- Group 3 captures the delimiter: a comma, or an empty string at the end of the input.
Quoted fields have their doubled quotes replaced with single quotes to normalize content.
Unquoted fields are trimmed of whitespace.
Empty fields need no special case: nothing between two commas is an unquoted field of length zero, so an empty string is added.
The loop stops once group 3 is empty. Without that check, find() would match one more time at the end of the input and add a spurious empty field.

Sample Input

John, "Doe, Jane", "1234 ""Main"" St.", , 42

Sample Output

Parsed fields:
Field 1: 'John'
Field 2: 'Doe, Jane'
Field 3: '1234 "Main" St.'
Field 4: ''
Field 5: '42'

Summary

Parsing CSV or custom delimited data using regex requires careful pattern design to handle quoted fields, escaped characters, and empty values. By combining capturing groups and conditional logic, you can robustly extract fields from complex inputs. This example demonstrates practical handling of typical CSV quirks, enabling you to adapt the approach for other delimiter-based formats and edge cases.
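
As one way of adapting the approach to other delimiters, the sketch below builds a field pattern around a caller-supplied delimiter character. It is an illustration under stated assumptions, not a complete CSV implementation: the class DelimitedLineParser and its parse method are hypothetical names, the delimiter must be a single character that is not a letter, digit, space, or double quote, and only spaces (not all whitespace) are stripped around fields so that a tab can serve as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedLineParser {
    // Hypothetical helper: parses one line using a single-character delimiter.
    static List<String> parse(String line, char delimiter) {
        if (Character.isLetterOrDigit(delimiter) || delimiter == '"' || delimiter == ' ') {
            throw new IllegalArgumentException("Unsupported delimiter: " + delimiter);
        }
        // A backslash makes any non-alphanumeric character literal in a Java regex.
        String d = "\\" + delimiter;
        // Field: optional spaces, quoted or unquoted content, optional spaces,
        // then the delimiter or the end of the line.
        Pattern field = Pattern.compile(
            " *(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^" + d + "]*?)) *(" + d + "|$)");

        List<String> fields = new ArrayList<>();
        Matcher m = field.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2).trim());
            if (m.group(3).isEmpty()) {
                break; // the field ended at the end of the line
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("INFO|2025-06-22|User login|Success", '|'));
        System.out.println(parse("a; \"b;c\" ;;d", ';'));
        System.out.println(parse("a\t\tb", '\t'));
    }
}
```

The first call returns the four log fields, the second keeps the semicolon inside the quoted value and reports the empty field between the two adjacent semicolons, and the third yields three fields with an empty one in the middle. For production data, a dedicated CSV library is usually the safer choice.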
