Character Classes and POSIX Character Classes

Java Regex

5.1 Custom character classes `[abc]` and ranges `[a-z]`

Custom character classes are a fundamental feature in regular expressions that let you match any one character from a specific set or range. These classes are enclosed in square brackets [ ] and give you the flexibility to specify exactly which characters you want to allow at a particular position in your pattern.

How Custom Character Classes Work

When you write a regex like [abc], it means match exactly one character that is either a, b, or c. The regex engine checks the input string at that position and accepts the match if it finds any one of those characters.

You can also specify ranges inside the brackets. For example, [a-z] matches any lowercase letter from a through z. Similarly, [0-9] matches any digit from 0 to 9.

You can combine multiple ranges and individual characters inside one set. For example:

[a-f0-3xZ]

matches any character that is:

A lowercase letter between a and f,
A digit between 0 and 3,
The letter x, or
The uppercase letter Z.

Practical Examples

Matching vowels: To match any lowercase vowel, you can use:
```
[aeiou]
```
This matches any one of the vowels a, e, i, o, or u.
Matching hexadecimal digits: Hex digits include digits and letters from a to f or A to F. You can specify:
```
[0-9a-fA-F]
```
This matches any single hex digit.
Matching specific letters: Suppose you want to match either x, y, or z:
```
[xyz]
```

Benefits and Use Cases

Custom character classes are especially useful when you want to restrict input to a specific set of characters or allow multiple options in a concise way. For example:

Validating user input like postal codes or serial numbers.
Parsing strings where only certain letters or digits are allowed.
Creating flexible patterns that adapt to different alphabets or symbol sets.

By mastering custom character classes and ranges, you gain powerful control over what your regex matches, making your patterns both precise and adaptable.

5.2 Negated classes `[^...]`

Negated character classes are a useful extension of custom character classes that allow you to match any character except those specified. Instead of listing characters you want to match, you specify which characters to exclude.

Syntax of Negated Character Classes

A negated character class starts with a caret (^) immediately following the opening square bracket:

[^abc]

This pattern matches any single character except a, b, or c.

How They Work

When the regex engine encounters a negated class, it checks if the character at the current position is not in the specified set. If it isn’t, the match succeeds.

For example, [^\d] matches any character that is not a digit, because \d represents digits, and the caret negates the class.

Examples

Exclude digits: To match any character except digits, use:
```
[^\d]
```
This matches letters, symbols, whitespace, and other non-digit characters.
Exclude whitespace: To match any character that is not whitespace, use:
```
[^\s]
```
Exclude specific letters: If you need to match any character except x, y, or z:
```
[^xyz]
```

Practical Uses of Negated Classes

Negated classes simplify patterns when you want to filter out unwanted characters rather than explicitly listing what to accept. This is especially helpful when:

You need to validate input that excludes certain characters (e.g., no digits or special symbols).
You want to match everything except a small set of forbidden characters.
You want to create flexible patterns that allow a wide range of characters except a few.

Negated character classes provide a straightforward way to express exclusions in regex, making your patterns more concise and easier to maintain.

5.3 POSIX character classes like `\p{Lower}`, `\p{Upper}`, etc.

Java’s regex engine supports a powerful set of POSIX character classes that allow you to match characters based on their Unicode categories. These classes provide an excellent way to create patterns that work across different languages and scripts, making your regexes locale-independent and more flexible.

What Are POSIX Character Classes?

POSIX character classes use the syntax:

\p{Category}

where Category specifies a Unicode character class or property. These categories represent broad sets of characters, such as lowercase letters, uppercase letters, digits, punctuation marks, and whitespace.

Some common POSIX classes include:

\p{Lower} — Matches any lowercase letter (e.g., a, b, c, including accented letters like á).
\p{Upper} — Matches any uppercase letter (e.g., A, B, C, including accented uppercase letters).
\p{Digit} — Matches any Unicode digit (similar to \d but broader).
\p{Alpha} — Matches any alphabetic character.
\p{Punct} — Matches any punctuation character.
\p{Space} — Matches any whitespace character (spaces, tabs, newlines, etc.).

Examples of POSIX Classes in Use

To match a lowercase letter:

\p{Lower}

This matches any lowercase letter, including those beyond the ASCII range, such as ñ, ü, or ç.

To match any uppercase letter:

\p{Upper}

For digits, you can use:

\p{Digit}

which covers more than just 0-9 by including digits from other scripts.

POSIX Classes vs. Predefined Shorthand Classes

Java also provides predefined shorthand character classes like:

\w — word characters (letters, digits, underscore)
\d — digits (0-9)
\s — whitespace

However, these shorthands are limited mostly to ASCII ranges and do not fully support the diversity of Unicode characters.

POSIX classes, on the other hand, are based on Unicode properties and thus better support internationalization. For example, \p{Lower} matches lowercase letters in all languages, not just a-z.

Benefits for Internationalization

Using POSIX classes ensures your regex patterns work consistently with multilingual text, supporting characters from different alphabets and scripts. This is crucial when dealing with global applications that handle names, addresses, or other inputs containing non-ASCII characters.

By mastering POSIX character classes, you can create regex patterns that are robust, clear, and ready for the diverse text your Java applications may encounter.

5.4 Example: Validating password complexity

Validating password strength is a common and practical use case for regular expressions. A good password policy typically enforces multiple rules, such as requiring uppercase letters, lowercase letters, digits, and special characters. In this example, we'll use both basic and POSIX character classes to construct a regex that checks for strong passwords.

Password Rules

Let’s define a password as valid if it satisfies all of the following:

At least 8 characters long
Contains at least one lowercase letter (\p{Lower})
Contains at least one uppercase letter (\p{Upper})
Contains at least one digit (\p{Digit})
Contains at least one special character (not a letter or digit)

Java Regex Pattern

We’ll use positive lookahead assertions in combination with POSIX character classes to enforce the rules:

^(?=.*\p{Lower})(?=.*\p{Upper})(?=.*\p{Digit})(?=.*[^\\p{Alnum}]).{8,}$

Explanation:

^ and $ anchor the pattern to the start and end of the string.
(?=.*\p{Lower}) ensures at least one lowercase letter.
(?=.*\p{Upper}) ensures at least one uppercase letter.
(?=.*\p{Digit}) ensures at least one digit.
(?=.*[^\\p{Alnum}]) ensures at least one non-alphanumeric (i.e., special) character.
.{8,} ensures the total length is at least 8 characters.

Runnable Java Example

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class PasswordValidator {
    public static void main(String[] args) {
        String[] passwords = {
            "Password1!",      // valid
            "weakpass",        // no digit or uppercase or special char
            "StrongPass123",   // missing special character
            "Short1!",         // too short
            "Valid#2023"       // valid
        };

        String regex = "^(?=.*\\p{Lower})(?=.*\\p{Upper})(?=.*\\p{Digit})(?=.*[^\\p{Alnum}]).{8,}$";
        Pattern pattern = Pattern.compile(regex);

        for (String password : passwords) {
            Matcher matcher = pattern.matcher(password);
            if (matcher.matches()) {
                System.out.println("Valid password: " + password);
            } else {
                System.out.println("Invalid password: " + password);
            }
        }
    }
}

Output

Valid password: Password1!
Invalid password: weakpass
Invalid password: StrongPass123
Invalid password: Short1!
Valid password: Valid#2023

Summary

This example demonstrates how powerful regex can be for enforcing complex string constraints. By combining POSIX character classes and lookaheads, we keep the pattern readable, internationalized, and robust. This technique can be adapted for login systems, form validations, and any scenario requiring secure password handling.