Java’s String.split() method is a powerful tool that lets you divide a string into parts based on a regex pattern rather than just a fixed character. This flexibility is essential when dealing with complex delimiters or varying separators in your input data.
Instead of splitting by a single character like a comma, you can provide a regex pattern to match one or more delimiters. For example, splitting a sentence on commas, semicolons, or spaces:
String sentence = "Java,Python; C++ Ruby";
String[] parts = sentence.split("[,;\\s]+"); // Split on comma, semicolon, or whitespace
for (String part : parts) {
    System.out.println(part);
}
This outputs:
Java
Python
C++
Ruby
The regex [,;\\s]+ matches one or more commas, semicolons, or whitespace characters in a row, so any run of these delimiters counts as a single separator.
Sometimes delimiters may be surrounded by optional spaces. For example, a CSV line might have spaces around commas:
String csv = "apple , banana, cherry ,date";
String[] fruits = csv.split("\\s*,\\s*"); // Split on commas with optional spaces
for (String fruit : fruits) {
    System.out.println(fruit);
}
Output:
apple
banana
cherry
date
Here, the regex \\s*,\\s* matches a comma possibly surrounded by any amount of whitespace, so spaces around the commas don’t end up in the tokens.
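One detail worth checking is whitespace at the very start or end of the line: the pattern only consumes spaces that touch a comma, so outer padding stays in the first and last tokens. The sketch below trims the line first; the class name, the helper splitCsvLine, and the sample input are made up for illustration.

```java
import java.util.Arrays;

public class TrimBeforeSplit {
    // Hypothetical helper: trim the whole line, then split on commas with optional spaces.
    static String[] splitCsvLine(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String line = "  apple , banana ,cherry  ";

        // Without trim(): padding at the ends of the line survives in the tokens.
        System.out.println(Arrays.toString(line.split("\\s*,\\s*"))); // [  apple, banana, cherry  ]

        // With trim(): every token is clean.
        System.out.println(Arrays.toString(splitCsvLine(line)));      // [apple, banana, cherry]
    }
}
```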
If your delimiter includes regex metacharacters (like ., |, *, ?), remember to escape them properly:
String data = "one.two.three";
String[] parts = data.split("\\."); // Dot is escaped as "\\."
Without escaping, the dot matches any character, so every character in the input is treated as a delimiter and split() returns an empty array.
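When the delimiter comes from configuration or user input, escaping by hand is easy to get wrong. One option is Pattern.quote(), which turns any string into a regex that matches it literally. The following is a minimal sketch; the class name and the helper splitLiteral are illustrative, not part of any library.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralDelimiterSplit {
    // Hypothetical helper: split on a delimiter taken literally, whatever characters it contains.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String data = "one.two.three";

        // Unescaped: "." matches every character, only empty strings remain, and split() drops them.
        System.out.println(data.split(".").length);                      // 0

        // Quoted: the dot and the pipe are treated as plain characters.
        System.out.println(Arrays.toString(splitLiteral(data, ".")));    // [one, two, three]
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|"))); // [a, b, c]
    }
}
```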
For log lines that use complex delimiters, such as timestamps or specific markers, regex can precisely target these patterns:
String log = "INFO|2025-06-22|User login|Success";
String[] fields = log.split("\\|"); // Split on pipe character
The optional limit argument of split(regex, limit) controls the maximum number of splits and whether trailing empty strings are kept, which is useful for fine-tuning output.
Using regex with String.split() allows flexible, robust string division beyond fixed characters. Handling multiple delimiters, optional spaces, and escaped special characters helps you process real-world data formats like CSV, logs, or free text. Understanding regex syntax and the method's nuances, such as the limit argument, helps keep splitting accurate for your parsing tasks.
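The effect of the second argument of split(regex, limit) is easiest to see side by side. In this sketch the sample row and the helper name splitKeepingEmpties are made up for illustration: with no limit (or 0), trailing empty strings are dropped; a negative limit keeps them; a positive limit caps the number of pieces.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Hypothetical helper: a negative limit keeps trailing empty fields.
    static String[] splitKeepingEmpties(String row) {
        return row.split(",", -1);
    }

    public static void main(String[] args) {
        String row = "a,b,,c,,";

        // Default: trailing empty strings are removed.
        System.out.println(Arrays.toString(row.split(",")));             // [a, b, , c]

        // Negative limit: all fields are kept, including trailing empties.
        System.out.println(Arrays.toString(splitKeepingEmpties(row)));   // [a, b, , c, , ]

        // Positive limit: at most 2 pieces, the remainder stays unsplit.
        System.out.println(Arrays.toString(row.split(",", 2)));          // [a, b,,c,,]
    }
}
```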
Tokenization is the process of breaking down a piece of text into smaller, meaningful units called tokens. These tokens can be words, numbers, symbols, or other logical chunks that a program can analyze or process individually. Unlike simple splitting, which divides input solely by delimiters, tokenization often involves identifying valid elements while discarding irrelevant separators.
For example, given the input:
x = 42 + 15
Splitting on spaces yields ["x", "=", "42", "+", "15"] (straightforward but sensitive to whitespace).
Tokenization identifies identifiers (x), numbers (42, 15), and operators (=, +), ignoring spaces completely.
Using Matcher for Tokenizing
In Java, tokenization is often done by applying a regex pattern with the Matcher.find() method to sequentially extract tokens matching certain criteria.
Here’s an example that tokenizes simple arithmetic expressions into numbers, operators, and identifiers:
import java.util.regex.*;
public class TokenizerExample {
    public static void main(String[] args) {
        String input = "x = 42 + y - 3 * 7";
        // Alternatives: a number, an identifier, or a single operator character
        String tokenPattern = "\\d+|[a-zA-Z]+|[=+\\-*/]";
        Pattern pattern = Pattern.compile(tokenPattern);
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            System.out.println("Token: " + matcher.group());
        }
    }
}
Output:
Token: x
Token: =
Token: 42
Token: +
Token: y
Token: -
Token: 3
Token: *
Token: 7
Here, the regex \\d+|[a-zA-Z]+|[=+\\-*/] matches:
One or more digits (\\d+): numbers,
One or more letters ([a-zA-Z]+): variable names or keywords,
A single operator character ([=+\\-*/]).
Spaces and other irrelevant characters are ignored, as the matcher only finds valid tokens.
Tokenization is essential for parsing simple command languages, formulas, or configuration inputs where input is more complex than straightforward comma-separated values.
By defining regex patterns for each token type and sequentially extracting tokens, you can build parsers that understand and manipulate complex inputs cleanly.
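One way to give each token type its own pattern is to use named capturing groups (available since Java 7) and check which group matched. The sketch below is an illustration under assumptions, not a full lexer: the group names, the TYPE(text) labels, and the extra catch-all group bad, which reports characters that would otherwise be skipped silently, are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TypedTokenizer {
    // One named group per token type; "bad" catches any other non-space character.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<number>\\d+)|(?<ident>[a-zA-Z]+)|(?<op>[=+\\-*/])|(?<bad>\\S)");

    // Returns tokens labeled with their type, e.g. "NUMBER(42)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group("number") != null) {
                tokens.add("NUMBER(" + m.group() + ")");
            } else if (m.group("ident") != null) {
                tokens.add("IDENT(" + m.group() + ")");
            } else if (m.group("op") != null) {
                tokens.add("OP(" + m.group() + ")");
            } else {
                // The catch-all group matched: fail loudly instead of dropping the character.
                throw new IllegalArgumentException(
                        "Unexpected character '" + m.group() + "' at index " + m.start());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x = 42 + y")); // [IDENT(x), OP(=), NUMBER(42), OP(+), IDENT(y)]
    }
}
```

Because whitespace matches none of the groups, it is still skipped, while an input such as "a $ b" raises an exception that names the offending character and its position.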
Tokenizing with regex in Java involves matching meaningful elements of text rather than just splitting by delimiters. Using Matcher.find() to extract tokens like words, numbers, and symbols allows for flexible and precise parsing of formulas, commands, or structured inputs, enabling more powerful text processing beyond basic splitting.
Parsing CSV (Comma-Separated Values) or similarly structured data with custom delimiters is a common task where regex can help, especially when fields contain quoted values, escaped delimiters, or optional spaces. While dedicated CSV libraries exist, understanding how to handle these challenges with regex deepens your grasp of text parsing.
Inside a quoted field, a literal quote is escaped by doubling it ("").
A regex pattern to parse CSV fields can handle both quoted and unquoted values:
\s*"([^"]*(?:""[^"]*)*)"\s*|([^,]+)|,
\s*"([^"]*(?:""[^"]*)*)"\s* matches quoted fields, allowing for escaped quotes ("") and optional spaces around the quotes.
([^,]+) matches unquoted fields without commas.
, matches the separator itself; a comma that does not directly follow a field marks an empty field.
In Java, the pattern is written as a string literal with the quotes and backslashes escaped. In the version below the inner group is capturing, so the quoted content is group 1 and the unquoted field is group 3:
"\\s*\"([^\"]*(\"\"[^\"]*)*)\"\\s*|([^,]+)|,"
import java.util.regex.*;
import java.util.*;
public class CsvParserExample {
    public static void main(String[] args) {
        String input = "John, \"Doe, Jane\", \"1234 \"\"Main\"\" St.\", , 42";
        // Regex to match a quoted field (with optional surrounding spaces),
        // an unquoted field, or a separating comma
        String regex = "\\s*\"([^\"]*(\"\"[^\"]*)*)\"\\s*|([^,]+)|,";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        List<String> fields = new ArrayList<>();
        // True when the previous match was a field rather than a comma
        boolean lastWasField = false;
        while (matcher.find()) {
            String quotedField = matcher.group(1);
            String unquotedField = matcher.group(3);
            if (quotedField != null) {
                // Unescape doubled quotes ("") into a single quote (")
                fields.add(quotedField.replace("\"\"", "\""));
                lastWasField = true;
            } else if (unquotedField != null) {
                fields.add(unquotedField.trim());
                lastWasField = true;
            } else {
                // A comma right after a field is only a separator;
                // a comma after another comma (or at the start) marks an empty field
                if (!lastWasField) {
                    fields.add("");
                }
                lastWasField = false;
            }
        }
        // A trailing comma (or empty input) leaves one last empty field
        if (!lastWasField) {
            fields.add("");
        }
        // Output extracted fields
        System.out.println("Parsed fields:");
        for (int i = 0; i < fields.size(); i++) {
            System.out.printf("Field %d: '%s'%n", i + 1, fields.get(i));
        }
    }
}
The pattern matches each CSV field sequentially:
Quoted fields have their doubled quotes replaced with single quotes to normalize content.
Unquoted fields are trimmed of whitespace.
Empty fields are handled by adding an empty string.
Input:
John, "Doe, Jane", "1234 ""Main"" St.", , 42
Output:
Parsed fields:
Field 1: 'John'
Field 2: 'Doe, Jane'
Field 3: '1234 "Main" St.'
Field 4: ''
Field 5: '42'
Parsing CSV or custom delimited data using regex requires careful pattern design to handle quoted fields, escaped characters, and empty values. By combining capturing groups and conditional logic, you can robustly extract fields from complex inputs. This example demonstrates practical handling of typical CSV quirks, enabling you to adapt the approach for other delimiter-based formats and edge cases.
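To adapt the idea to another delimiter, one possible sketch anchors each match with \G (the end of the previous match) and reads one field plus its following delimiter per step. The class name, the parseLine helper, and its signature are hypothetical; the sketch assumes single-line input and the same doubled-quote escaping convention as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedLineParser {
    // Hypothetical helper: parses one line whose fields are separated by `delimiter`
    // and may be wrapped in double quotes, with "" standing for a literal quote.
    static List<String> parseLine(String line, char delimiter) {
        String d = Pattern.quote(String.valueOf(delimiter));
        // Group 1: quoted content, group 2: unquoted content,
        // group 3: the delimiter, or empty at the end of the line.
        Pattern field = Pattern.compile(
                "\\G\\s*(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|(.*?))\\s*(" + d + "|$)");
        List<String> fields = new ArrayList<>();
        Matcher m = field.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // no delimiter followed, so this was the last field
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a; \"b;c\" ;; d", ';')); // [a, b;c, , d]
    }
}
```

Pattern.quote() keeps delimiters such as | or . from being read as regex syntax, and the lazy (.*?) followed by \s* leaves unquoted fields trimmed without a separate trim() call.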