Data Cleaning and Transformation

Java Regex

13.1 Removing unwanted characters

Data cleaning is a crucial step in preparing text for further processing, and one common task is removing unwanted or extraneous characters from input strings. Regular expressions provide a flexible and efficient way to identify and eliminate such characters in Java.

Common Scenarios for Removing Characters

Stripping whitespace: Removing leading, trailing, or all whitespace characters (spaces, tabs, newlines) to normalize input.
Eliminating punctuation: Getting rid of commas, periods, or other symbols when they are unnecessary or interfere with processing.
Removing control characters: Cleaning up non-printable or special characters that may cause issues.
Discarding invalid symbols: Excluding characters that are not allowed in a given context, for example fields that must contain letters only, digits only, or a standardized format.

Using Regex Character Classes to Define Unwanted Characters

A powerful approach to removing unwanted characters is to create a regex pattern that matches them and then replace these matches with an empty string. For example:

Whitespace: \s matches any whitespace character (space, tab, newline).
Punctuation: A character class like [.,;:!?] matches common punctuation.
Control characters: \p{Cntrl} matches ASCII control characters (0x00-0x1F and 0x7F); \p{Cc} matches the full Unicode control category.
Custom sets: You can combine ranges and characters, e.g., [^a-zA-Z0-9] matches anything not a letter or digit.

Practical Java Examples

Removing all whitespace:

String input = "  Example  string with \t whitespace\n ";
String cleaned = input.replaceAll("\\s+", "");
System.out.println(cleaned); // Outputs: Examplestringwithwhitespace

Here, \\s+ matches one or more whitespace characters, so every space, tab, and line break is removed.
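Removing every whitespace character is often too aggressive. The sketch below covers the other two cases named earlier, stripping only leading and trailing whitespace and collapsing interior runs to a single space. The class and method names (WhitespaceCleaner, trimEdges, collapse) are illustrative assumptions, not part of any library.

```java
class WhitespaceCleaner {
    // Removes leading and trailing whitespace only; interior whitespace is untouched.
    static String trimEdges(String s) {
        return s.replaceAll("^\\s+|\\s+$", "");
    }

    // Trims the edges, then collapses each interior run of whitespace to one space.
    static String collapse(String s) {
        return trimEdges(s).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "  Example  string with \t whitespace\n ";
        System.out.println("[" + trimEdges(input) + "]");
        System.out.println("[" + collapse(input) + "]"); // [Example string with whitespace]
    }
}
```

For edge trimming alone, String.trim() is usually simpler; the regex form is shown because it composes with the other patterns in this section.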

Stripping punctuation:

String sentence = "Hello, world! Let's clean this sentence.";
String noPunct = sentence.replaceAll("[.,!']", "");
System.out.println(noPunct); // Outputs: Hello world Lets clean this sentence

The character class [.,!'] targets commas, periods, exclamation marks, and apostrophes.

Removing non-alphanumeric characters:

String messy = "User@#123$%^&*()!";
String alphanumericOnly = messy.replaceAll("[^a-zA-Z0-9]", "");
System.out.println(alphanumericOnly); // Outputs: User123

The [^a-zA-Z0-9] negated class matches everything except letters and digits, effectively stripping unwanted symbols.

Removing control characters:

String withControls = "Data\u0007 with \u0009 control chars";
String cleaned = withControls.replaceAll("\\p{Cntrl}", "");
System.out.println(cleaned); // Outputs: Data with  control chars

\p{Cntrl} matches ASCII control characters like bell (\u0007) or tab (\u0009). It does not match control characters outside the ASCII range; use \p{Cc} for those.
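The difference between the POSIX class \p{Cntrl} and the Unicode category \p{Cc} can be seen with a C1 control character such as NEXT LINE (code point 0x85). The following is a minimal sketch; ControlCharDemo and its method names are hypothetical, and the third method shows one possible way to keep tabs and line breaks while dropping other control characters.

```java
class ControlCharDemo {
    // POSIX class: by default covers only ASCII controls (0x00-0x1F and 0x7F).
    static String stripAsciiControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    // Unicode general category Cc: also covers the C1 controls 0x80-0x9F.
    static String stripUnicodeControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Removes control characters but keeps tab, line feed, and carriage return,
    // using character class intersection (&&).
    static String stripControlsKeepLayout(String s) {
        return s.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        // A, bell, B, NEXT LINE (a C1 control), C, tab, D
        String sample = "A" + (char) 0x07 + "B" + (char) 0x85 + "C\tD";
        System.out.println(stripAsciiControls(sample).length());      // 5: NEXT LINE survives
        System.out.println(stripUnicodeControls(sample));             // ABCD
        System.out.println(stripControlsKeepLayout(sample).length()); // 5: tab survives
    }
}
```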

Tips for Effective Cleaning

Combine character classes: For more complex cleaning, merge classes like [\\s\\p{Punct}] to remove whitespace and punctuation simultaneously.
Keep patterns narrow: Overly broad patterns, such as a bare negated class like [^a-zA-Z0-9], can remove characters you need, for example accented letters or hyphens inside words.
Test incrementally: Always test regex on sample inputs to ensure you are only removing intended characters.
Consider normalization: Sometimes removing diacritics or special marks requires additional Unicode normalization beyond regex.
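Two of the tips above can be sketched together: a combined character class that removes whitespace and ASCII punctuation in one pass, and diacritic removal via java.text.Normalizer followed by a regex over combining marks. TextNormalizer and its method names are assumptions made for this illustration.

```java
import java.text.Normalizer;

class TextNormalizer {
    // Removes whitespace and ASCII punctuation in a single pass.
    static String stripSpaceAndPunct(String s) {
        return s.replaceAll("[\\s\\p{Punct}]", "");
    }

    // Decomposes accented letters into base letter plus combining mark (NFD),
    // then removes the combining marks (Unicode category M).
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSpaceAndPunct("Hello, world! Let's go.")); // HelloworldLetsgo
        // Unicode escapes keep this source file ASCII-only.
        System.out.println(stripDiacritics("Caf\u00e9 d\u00e9j\u00e0 vu")); // Cafe deja vu
    }
}
```

Regex alone cannot turn an accented letter into its base letter; the normalization step is what separates the two so the mark can be matched and removed.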

Summary

Removing unwanted characters with regex in Java is a straightforward yet powerful technique to sanitize input data. By defining precise character classes and applying methods like replaceAll, you can tailor data cleaning to your specific needs, preparing text for reliable downstream processing and analysis.

13.2 Replacing patterns using `replaceAll` and `replaceFirst`

Java’s String class provides powerful methods for replacing text using regular expressions: replaceAll() and replaceFirst(). Both methods allow you to specify a regex pattern to identify parts of a string to be replaced, but they differ in scope and typical use cases.

Differences Between `replaceAll()` and `replaceFirst()`

replaceAll(String regex, String replacement): Replaces all occurrences of the regex pattern in the string with the given replacement. It's useful when you want to transform every matching substring, such as removing unwanted characters or formatting all dates in a document.
replaceFirst(String regex, String replacement): Replaces only the first occurrence of the regex pattern. Use it when you need to modify just the initial match, such as anonymizing the first email in a log or replacing the first delimiter in a string.

Using Regex Groups in Replacement Strings

Regex groups, created by parentheses (…) in the pattern, allow you to capture parts of the matched substring. You can reference these captured groups in the replacement string using $1, $2, etc., corresponding to the group numbers.

This capability enables complex transformations where you rearrange, format, or selectively modify portions of the matched text.
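As a small illustration of group references, the sketch below rewrites "Last, First" as "First Last" twice: once with numbered groups and once with named groups, which Java references as ${name} in the replacement string. GroupReferenceDemo and its method names are hypothetical.

```java
class GroupReferenceDemo {
    // Numbered groups: $1 is the last name, $2 the first name.
    static String swapNumbered(String name) {
        return name.replaceAll("(\\w+),\\s*(\\w+)", "$2 $1");
    }

    // Named groups: (?<last>...) is referenced as ${last} in the replacement.
    static String swapNamed(String name) {
        return name.replaceAll("(?<last>\\w+),\\s*(?<first>\\w+)", "${first} ${last}");
    }

    public static void main(String[] args) {
        System.out.println(swapNumbered("Doe, John")); // John Doe
        System.out.println(swapNamed("Doe, John"));    // John Doe
    }
}
```

Named groups tend to be easier to maintain once a pattern has more than two or three groups, since adding a group does not shift the numbering of the others.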

Examples

Simple global replacement: Removing all digits

String input = "User123 logged in at 10:45";
String cleaned = input.replaceAll("\\d", "");
System.out.println(cleaned); // Outputs: User logged in at :

Replace only the first whitespace with a dash

String input = "apple banana cherry";
String replaced = input.replaceFirst("\\s", "-");
System.out.println(replaced); // Outputs: apple-banana cherry

Using groups to reformat dates

Suppose you have dates like "2025-06-22" and want to change them to "22/06/2025":

String date = "2025-06-22";
String reformatted = date.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
System.out.println(reformatted); // Outputs: 22/06/2025

Here, (\\d{4}) captures the year as group 1, the first (\\d{2}) captures the month as group 2, and the second (\\d{2}) captures the day as group 3. The replacement rearranges these groups in a new format.

Anonymizing email usernames

Replace the username part before the @ with "***" but keep the domain intact:

String email = "john.doe@example.com";
String anonymized = email.replaceAll("^[^@]+", "***");
System.out.println(anonymized); // Outputs: ***@example.com

Practical Tips

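Escape special characters in both the pattern and the replacement: regex metacharacters such as . ( ) [ ] must be escaped in the pattern (Pattern.quote does this for literal text), and $ and \ are special in the replacement string (Matcher.quoteReplacement makes a replacement literal).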
Use groups to retain or rearrange parts of the original match during replacement.
For complex replacements involving conditional logic, consider using the Matcher class with the appendReplacement and appendTail methods for finer control.
Test your replacement patterns thoroughly, especially with edge cases, to avoid unexpected results.
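The tips on escaping and on Matcher-based replacement can be sketched as follows. ReplacementTips and its method names are illustrative assumptions; the masking rule (numbers of four or more digits) is an arbitrary example of conditional logic, not a recommendation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementTips {
    // Treats both arguments as literal text: Pattern.quote neutralizes regex
    // metacharacters, Matcher.quoteReplacement neutralizes '$' and '\'.
    static String replaceLiteral(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    // Conditional replacement: numbers with four or more digits are masked,
    // shorter numbers are copied through unchanged.
    static String maskLongNumbers(String text) {
        Matcher m = Pattern.compile("\\d+").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String digits = m.group();
            String replacement = digits.length() >= 4 ? "****" : digits;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("Price: (tbd)", "(tbd)", "$5.00")); // Price: $5.00
        System.out.println(maskLongNumbers("Order 42 charged to card 123456"));
        // Order 42 charged to card ****
    }
}
```

Without the quoting in replaceLiteral, "(tbd)" would be read as a capturing group and "$5" as a group reference. For purely literal substitutions, String.replace(CharSequence, CharSequence) avoids regex altogether.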

Summary

replaceAll() and replaceFirst() provide flexible regex-based replacement capabilities in Java. Understanding when to replace all matches versus just the first, and leveraging capturing groups for precise transformations, allows you to perform simple to advanced text modifications efficiently in your data cleaning and transformation workflows.

13.3 Example: Normalize phone numbers or dates

In real-world applications, input data often comes in various formats. Normalizing these formats into a consistent standard is a common data cleaning task. Regex is ideal for matching diverse patterns and transforming them into a uniform format.

This section provides practical Java examples to normalize phone numbers and dates using regex replacements.

Normalizing Phone Numbers

Suppose your system receives phone numbers in multiple formats such as:

(123) 456-7890
123.456.7890
123-456-7890
+1 123 456 7890

The goal is to normalize all of them into the format: 123-456-7890 (U.S. style without country code or special characters).

public class PhoneNormalizer {
    public static void main(String[] args) {
        String[] inputs = {
            "(123) 456-7890",
            "123.456.7890",
            "123-456-7890",
            "+1 123 456 7890"
        };

        // Lazily skip any characters (spaces, parentheses, dots, plus signs, dashes)
        // and capture three runs of digits: area code, prefix, line number
        String phonePattern = ".*?(\\d{3}).*?(\\d{3}).*?(\\d{4}).*";

        for (String input : inputs) {
            String normalized = input.replaceAll(phonePattern, "$1-$2-$3");
            System.out.println("Original: " + input + " -> Normalized: " + normalized);
        }
    }
}

Explanation:

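The pattern .*?(\\d{3}).*?(\\d{3}).*?(\\d{4}).* uses the reluctant quantifier .*? to skip as few characters as possible before each run of digits. Because it takes the first three consecutive digits it finds as the area code, it relies on the country code being separated from the number: +11234567890 would be misread as 112-345-6789.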
Three capturing groups extract area code, prefix, and line number.
The replacement string $1-$2-$3 reconstructs the phone number in the desired format.

Expected output:

Original: (123) 456-7890 -> Normalized: 123-456-7890
Original: 123.456.7890 -> Normalized: 123-456-7890
Original: 123-456-7890 -> Normalized: 123-456-7890
Original: +1 123 456 7890 -> Normalized: 123-456-7890
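An alternative that does not depend on where separators fall is to strip every non-digit first and then check the digit count. The following is a minimal sketch under the assumption that only 10-digit U.S. numbers, optionally prefixed with the country code 1, are valid; DigitPhoneNormalizer is a hypothetical name.

```java
import java.util.Optional;

class DigitPhoneNormalizer {
    // Returns the number as NNN-NNN-NNNN, or empty if the input does not
    // contain exactly ten digits (optionally preceded by the country code 1).
    static Optional<String> normalize(String raw) {
        String digits = raw.replaceAll("\\D", ""); // keep digits only
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1); // drop the country code
        }
        if (digits.length() != 10) {
            return Optional.empty();
        }
        return Optional.of(digits.replaceFirst("(\\d{3})(\\d{3})(\\d{4})", "$1-$2-$3"));
    }

    public static void main(String[] args) {
        System.out.println(normalize("(123) 456-7890")); // Optional[123-456-7890]
        System.out.println(normalize("+11234567890"));   // Optional[123-456-7890]
        System.out.println(normalize("456-7890"));       // Optional.empty
    }
}
```

Returning Optional makes the "could not normalize" case explicit instead of silently passing the original string through.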

Normalizing Dates

Dates come in many formats such as:

2025-06-22
06/22/2025
22.06.2025

We want to standardize them into the ISO format YYYY-MM-DD.

public class DateNormalizer {
    public static void main(String[] args) {
        String[] inputs = {
            "2025-06-22",
            "06/22/2025",
            "22.06.2025"
        };

        for (String input : inputs) {
            String normalized = normalizeDate(input);
            System.out.println("Original: " + input + " -> Normalized: " + normalized);
        }
    }

    public static String normalizeDate(String date) {
        // Match YYYY-MM-DD directly
        if (date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            return date; // already normalized
        }

        // Match MM/DD/YYYY and transform to YYYY-MM-DD
        if (date.matches("\\d{2}/\\d{2}/\\d{4}")) {
            return date.replaceAll("(\\d{2})/(\\d{2})/(\\d{4})", "$3-$1-$2");
        }

        // Match DD.MM.YYYY and transform to YYYY-MM-DD
        if (date.matches("\\d{2}\\.\\d{2}\\.\\d{4}")) {
            return date.replaceAll("(\\d{2})\\.(\\d{2})\\.(\\d{4})", "$3-$2-$1");
        }

        // Return original if no pattern matched
        return date;
    }
}

Explanation:

The method normalizeDate tests for known date formats using matches() and applies replaceAll() with capturing groups.
The groups reorder date components to the ISO YYYY-MM-DD format.
The code gracefully handles inputs that don’t match any known pattern by returning them unchanged.

Expected output:

Original: 2025-06-22 -> Normalized: 2025-06-22
Original: 06/22/2025 -> Normalized: 2025-06-22
Original: 22.06.2025 -> Normalized: 2025-06-22

Edge Cases and Testing

Input strings might contain invalid or partial data. The replacements shown here leave non-matching input unchanged rather than failing, so validate the result before relying on it.
For phone numbers, extensions or country codes may need separate handling.
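Date components should be validated for valid ranges (e.g., months 1–12, days 1–31) for full reliability: \\d{2} accepts any two digits, so an input like 99/99/2025 passes the format check and becomes 2025-99-99.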
Comprehensive unit tests covering all expected input formats improve robustness.
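Range validation for dates is easier to delegate to java.time than to encode in a regex. The sketch below assumes the string has already been normalized to YYYY-MM-DD and uses LocalDate.parse, which rejects impossible dates; DateValidator and parseIso are hypothetical names.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class DateValidator {
    // Parses an ISO YYYY-MM-DD string; returns empty for malformed input
    // or for dates that do not exist on the calendar.
    static Optional<LocalDate> parseIso(String normalized) {
        try {
            return Optional.of(LocalDate.parse(normalized));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIso("2025-06-22").isPresent()); // true
        System.out.println(parseIso("2025-99-99").isPresent()); // false
        System.out.println(parseIso("2025-02-30").isPresent()); // false
    }
}
```

A regex step followed by a parse step keeps each part simple: the regex handles layout differences, and the date library handles calendar rules such as month lengths and leap years.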

Summary

By combining carefully designed regex patterns with Java’s replacement methods, you can normalize diverse phone number and date formats into consistent, standardized forms. This facilitates easier storage, searching, and processing in your applications, improving data quality and user experience.
