Index

Data Cleaning and Transformation

Java Regex

13.1 Removing unwanted characters

Data cleaning is a crucial step in preparing text for further processing, and one common task is removing unwanted or extraneous characters from input strings. Regular expressions provide a flexible and efficient way to identify and eliminate such characters in Java.

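Common Scenarios for Removing Characters

The examples in this section cover four typical cases: stripping whitespace, removing punctuation, dropping non-alphanumeric symbols, and removing control characters.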

Using Regex Character Classes to Define Unwanted Characters

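A powerful approach to removing unwanted characters is to create a regex pattern that matches them and then replace these matches with an empty string. For example, \s matches any whitespace character, [.,!'] matches a fixed set of punctuation marks, [^a-zA-Z0-9] matches anything that is not an ASCII letter or digit, and \p{Cntrl} matches control characters.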

Practical Java Examples

  1. Removing all whitespace:
String input = "  Example  string with \t whitespace\n ";
String cleaned = input.replaceAll("\\s+", "");
System.out.println(cleaned); // Outputs: Examplestringwithwhitespace

Here, \\s+ matches one or more whitespace characters, so every space, tab, and line break is removed.
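
Removing whitespace entirely also glues the words together, which is often not the intended result. As a hedged sketch, the hypothetical helper collapse below (not part of the original example) replaces each run of whitespace with a single space and then trims the ends:

```java
class WhitespaceCollapser {
    // Hypothetical helper: collapses each whitespace run to one space, then trims.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "  Example  string with \t whitespace\n ";
        System.out.println(collapse(input)); // Outputs: Example string with whitespace
    }
}
```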

  2. Stripping punctuation:
String sentence = "Hello, world! Let's clean this sentence.";
String noPunct = sentence.replaceAll("[.,!']", "");
System.out.println(noPunct); // Outputs: Hello world Lets clean this sentence

The character class [.,!'] targets commas, periods, exclamation marks, and apostrophes.
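
Listing punctuation marks by hand misses anything you forget to include. One alternative, sketched here with a hypothetical stripPunctuation helper, is the predefined POSIX class \p{Punct}, which by default covers the ASCII punctuation characters. Non-ASCII punctuation such as curly quotes is not matched by this class without extra flags, so treat this as a starting point rather than a complete solution:

```java
class PunctuationStripper {
    // Hypothetical helper: removes every ASCII punctuation character.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Wait... (really?) #yes"));
        // Outputs: Wait really yes
    }
}
```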

  3. Removing non-alphanumeric characters:
String messy = "User@#123$%^&*()!";
String alphanumericOnly = messy.replaceAll("[^a-zA-Z0-9]", "");
System.out.println(alphanumericOnly); // Outputs: User123

The [^a-zA-Z0-9] negated class matches everything except ASCII letters and digits, effectively stripping unwanted symbols. Accented and other non-ASCII letters are removed as well.
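
If the input may contain names such as José, the ASCII-only class drops the accented letter. A possible Unicode-aware variant, shown as a sketch with hypothetical method names, negates the Unicode categories \p{L} (letters) and \p{N} (numbers) instead:

```java
class UnicodeAlnumFilter {
    // Keeps only ASCII letters and digits (the pattern from the example above).
    static String asciiOnly(String s) {
        return s.replaceAll("[^a-zA-Z0-9]", "");
    }

    // Keeps any Unicode letter (\p{L}) or number (\p{N}).
    static String unicodeAware(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    public static void main(String[] args) {
        String name = "Jos\u00e9_#42!"; // the escape is e with an acute accent
        System.out.println(asciiOnly(name));    // accented letter is dropped: Jos42
        System.out.println(unicodeAware(name)); // accented letter is kept
    }
}
```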

  4. Removing control characters:
String withControls = "Data\u0007 with \u0009 control chars";
String cleaned = withControls.replaceAll("\\p{Cntrl}", "");
System.out.println(cleaned); // Outputs: Data with  control chars

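\p{Cntrl} matches the ASCII control characters (\x00-\x1F and \x7F), such as bell (\u0007) or tab (\u0009). Because the tab is removed too, two spaces remain between "with" and "control" in the output.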
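
Tab, newline, and carriage return are themselves control characters, so removing \p{Cntrl} wholesale also flattens any line or column structure. A minimal sketch, assuming you want to keep those three characters, uses Java's character-class intersection (&&) to exclude them from the match. The class and method names are hypothetical:

```java
class ControlCharCleaner {
    // Removes ASCII control characters but keeps tab, newline, and carriage return.
    static String stripControlsKeepLayout(String s) {
        return s.replaceAll("[\\p{Cntrl}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        // Contains a bell (\u0007) and an escape character (\u001B).
        String raw = "col1\tcol2\u0007\nrow\u001Bend";
        String cleaned = stripControlsKeepLayout(raw);
        System.out.println(cleaned.equals("col1\tcol2\nrowend")); // Outputs: true
    }
}
```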

Tips for Effective Cleaning
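
One tip that applies to all of the examples above: String.replaceAll compiles its regex on every call. When the same cleanup runs over many strings, compiling the pattern once and reusing it avoids that repeated work. The following is a sketch with a hypothetical ReusableCleaner class, not a benchmarked recommendation:

```java
import java.util.regex.Pattern;

class ReusableCleaner {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    static String clean(String s) {
        return NON_ALNUM.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String[] inputs = { "User@#123$%^&*()!", "a-b_c", "42!" };
        for (String input : inputs) {
            System.out.println(clean(input)); // Outputs: User123, abc, 42 (one per line)
        }
    }
}
```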

Summary

Removing unwanted characters with regex in Java is a straightforward yet powerful technique to sanitize input data. By defining precise character classes and applying methods like replaceAll, you can tailor data cleaning to your specific needs, preparing text for reliable downstream processing and analysis.

13.2 Replacing patterns using replaceAll and replaceFirst

Java’s String class provides powerful methods for replacing text using regular expressions: replaceAll() and replaceFirst(). Both methods allow you to specify a regex pattern to identify parts of a string to be replaced, but they differ in scope and typical use cases.

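Differences Between replaceAll() and replaceFirst()

replaceAll() replaces every substring that matches the pattern, while replaceFirst() replaces only the first match and leaves the rest of the string unchanged. Both take a regex as the first argument and a replacement string as the second.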

Using Regex Groups in Replacement Strings

Regex groups, created by parentheses (…) in the pattern, allow you to capture parts of the matched substring. You can reference these captured groups in the replacement string using $1, $2, etc., corresponding to the group numbers.

This capability enables complex transformations where you rearrange, format, or selectively modify portions of the matched text.
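
Numbered references become hard to read once a pattern has several groups. Java also supports named groups, written (?<name>...) in the pattern and ${name} in the replacement. The sketch below uses a hypothetical toDayFirst helper to turn an ISO date into day/month/year order:

```java
class NamedGroupDate {
    // Hypothetical helper: rewrites YYYY-MM-DD as DD/MM/YYYY using named groups.
    static String toDayFirst(String isoDate) {
        return isoDate.replaceAll(
            "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})",
            "${day}/${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("2025-06-22")); // Outputs: 22/06/2025
    }
}
```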

Examples

  1. Simple global replacement: Removing all digits
String input = "User123 logged in at 10:45";
String cleaned = input.replaceAll("\\d", "");
System.out.println(cleaned); // Outputs: User logged in at :

  2. Replace only the first whitespace with a dash
String input = "apple banana cherry";
String replaced = input.replaceFirst("\\s", "-");
System.out.println(replaced); // Outputs: apple-banana cherry

  3. Using groups to reformat dates

Suppose you have dates like "2025-06-22" and want to change them to "22/06/2025":

String date = "2025-06-22";
String reformatted = date.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
System.out.println(reformatted); // Outputs: 22/06/2025

Here, (\\d{4}) captures the year, (\\d{2}) the month, and (\\d{2}) the day. The replacement rearranges these groups in a new format.

  4. Anonymizing email usernames

Replace the username part before the @ with "***" but keep the domain intact:

String email = "john.doe@example.com";
String anonymized = email.replaceAll("^[^@]+", "***");
System.out.println(anonymized); // Outputs: ***@example.com

Practical Tips
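
Because $ and \ have special meaning in the replacement string, replacement text that comes from data (a price, a file path) should be escaped first. Matcher.quoteReplacement does this. The example below is a sketch; the fill helper and the {price} placeholder are made up for illustration:

```java
import java.util.regex.Matcher;

class LiteralReplacement {
    // Replaces the {price} placeholder with text that is inserted literally,
    // even if it contains '$' or '\'.
    static String fill(String template, String value) {
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(value));
    }

    public static void main(String[] args) {
        System.out.println(fill("Total: {price}", "$50")); // Outputs: Total: $50
    }
}
```

Without the quoteReplacement call, "$50" would be read as a group reference rather than literal text.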

Summary

replaceAll() and replaceFirst() provide flexible regex-based replacement capabilities in Java. Understanding when to replace all matches versus just the first, and leveraging capturing groups for precise transformations, allows you to perform simple to advanced text modifications efficiently in your data cleaning and transformation workflows.

13.3 Example: Normalize phone numbers or dates

In real-world applications, input data often comes in various formats. Normalizing these formats into a consistent standard is a common data cleaning task. Regex is ideal for matching diverse patterns and transforming them into a uniform format.

This section provides practical Java examples to normalize phone numbers and dates using regex replacements.

Normalizing Phone Numbers

Suppose your system receives phone numbers in multiple formats such as (123) 456-7890, 123.456.7890, 123-456-7890, and +1 123 456 7890.

The goal is to normalize all of them into the format: 123-456-7890 (U.S. style without country code or special characters).

public class PhoneNormalizer {
    public static void main(String[] args) {
        String[] inputs = {
            "(123) 456-7890",
            "123.456.7890",
            "123-456-7890",
            "+1 123 456 7890"
        };

        // Each .*? lazily skips separators (spaces, parentheses, dots, dashes, a leading +1).
        // The three groups capture the area code, the prefix, and the line number.
        String phonePattern = ".*?(\\d{3}).*?(\\d{3}).*?(\\d{4}).*";

        for (String input : inputs) {
            String normalized = input.replaceAll(phonePattern, "$1-$2-$3");
            System.out.println("Original: " + input + " -> Normalized: " + normalized);
        }
    }
}

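Explanation: each lazy .*? skips whatever precedes the next run of digits, (\d{3}), (\d{3}), and (\d{4}) capture the area code, prefix, and line number, and the trailing .* consumes the rest of the input so that the whole string is replaced by $1-$2-$3. In "+1 123 456 7890" the lone 1 is skipped because \d{3} needs three consecutive digits.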

Expected output:

Original: (123) 456-7890 -> Normalized: 123-456-7890
Original: 123.456.7890 -> Normalized: 123-456-7890
Original: 123-456-7890 -> Normalized: 123-456-7890
Original: +1 123 456 7890 -> Normalized: 123-456-7890
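
The capture-based pattern above depends on the digits arriving in 3-3-4 groups. An input without separators, such as +11234567890, would come out as 112-345-6789. A more defensive variant, sketched below under the assumption that only 10-digit US numbers with an optional leading country code 1 are valid, strips every non-digit first and then checks the length. The class and method names are hypothetical:

```java
import java.util.Optional;

class PhoneDigits {
    // Returns NNN-NNN-NNNN, or an empty Optional when the input does not contain
    // exactly 10 digits (after dropping an optional leading country code 1).
    static Optional<String> normalize(String raw) {
        String digits = raw.replaceAll("\\D", "");
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1);
        }
        if (digits.length() != 10) {
            return Optional.empty();
        }
        return Optional.of(digits.replaceFirst("(\\d{3})(\\d{3})(\\d{4})", "$1-$2-$3"));
    }

    public static void main(String[] args) {
        System.out.println(normalize("+11234567890").orElse("invalid")); // Outputs: 123-456-7890
        System.out.println(normalize("12345").orElse("invalid"));        // Outputs: invalid
    }
}
```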

Normalizing Dates

Dates come in many formats such as YYYY-MM-DD (2025-06-22), MM/DD/YYYY (06/22/2025), and DD.MM.YYYY (22.06.2025).

We want to standardize them into the ISO format YYYY-MM-DD.

public class DateNormalizer {
    public static void main(String[] args) {
        String[] inputs = {
            "2025-06-22",
            "06/22/2025",
            "22.06.2025"
        };

        for (String input : inputs) {
            String normalized = normalizeDate(input);
            System.out.println("Original: " + input + " -> Normalized: " + normalized);
        }
    }

    public static String normalizeDate(String date) {
        // Match YYYY-MM-DD directly
        if (date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            return date; // already normalized
        }

        // Match MM/DD/YYYY and transform to YYYY-MM-DD
        if (date.matches("\\d{2}/\\d{2}/\\d{4}")) {
            return date.replaceAll("(\\d{2})/(\\d{2})/(\\d{4})", "$3-$1-$2");
        }

        // Match DD.MM.YYYY and transform to YYYY-MM-DD
        if (date.matches("\\d{2}\\.\\d{2}\\.\\d{4}")) {
            return date.replaceAll("(\\d{2})\\.(\\d{2})\\.(\\d{4})", "$3-$2-$1");
        }

        // Return original if no pattern matched
        return date;
    }
}

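Explanation: String.matches() tests the entire input against each supported format in turn. The branch that matches uses capturing groups to reorder the parts into YYYY-MM-DD, and input that matches none of the formats is returned unchanged.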

Expected output:

Original: 2025-06-22 -> Normalized: 2025-06-22
Original: 06/22/2025 -> Normalized: 2025-06-22
Original: 22.06.2025 -> Normalized: 2025-06-22

Edge Cases and Testing
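
The regex checks in normalizeDate only verify the shape of the input, so a string such as 2025-02-30 or 2025-13-45 passes through unchanged. One possible safeguard, sketched here with a hypothetical isRealIsoDate helper that is not part of the original listing, is to hand the normalized result to java.time.LocalDate.parse, which rejects dates that do not exist on the calendar:

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

class DateCheck {
    // True when the string is a real calendar date in ISO YYYY-MM-DD form.
    static boolean isRealIsoDate(String s) {
        try {
            LocalDate.parse(s); // default ISO format with strict date resolution
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealIsoDate("2025-06-22")); // Outputs: true
        System.out.println(isRealIsoDate("2025-02-30")); // Outputs: false
        System.out.println(isRealIsoDate("2025-13-45")); // Outputs: false
    }
}
```

All three inputs match \d{4}-\d{2}-\d{2}, which is why a regex alone is not enough when the date has to be valid and not merely well-formed.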

Summary

By combining carefully designed regex patterns with Java’s replacement methods, you can normalize diverse phone number and date formats into consistent, standardized forms. This facilitates easier storage, searching, and processing in your applications, improving data quality and user experience.
