
Processing Text Files and Data Sources

Java Streams

15.1 Reading and Processing CSV Files with Streams

Java Streams offer a powerful and efficient way to read and process CSV files, especially when combined with Files.lines(). By leveraging the stream API, you can read large datasets line by line, transform them into structured objects, filter or group them, and collect results with minimal memory usage and high readability.

Key Steps for CSV Processing

  1. Open the file using Files.lines(Path) in a try-with-resources block.
  2. Skip headers if present.
  3. Split each line using String.split() or a CSV parser.
  4. Map lines to structured objects.
  5. Filter or transform data as needed.
  6. Collect results into a list or other structure.

Example: Parse CSV into a Person Class

CSV file: people.csv

name,age,email
Alice,30,alice@example.com
Bob,25,bob@example.com
Charlie,invalid,charlie@example.com

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class CsvPersonExample {
    public static void main(String[] args) {
        Path file = Path.of("people.csv");

        try (Stream<String> lines = Files.lines(file)) {
            List<Person> people = lines
                .skip(1) // Skip header
                .map(CsvPersonExample::parsePerson)
                .flatMap(Optional::stream) // Filter out failed parses
                .filter(p -> p.age >= 18) // Filter adults
                .collect(Collectors.toList());

            people.forEach(System.out::println);
        } catch (IOException e) {
            System.err.println("Failed to read CSV file: " + e.getMessage());
        }
    }

    static Optional<Person> parsePerson(String line) {
        try {
            String[] parts = line.split(",", -1);
            if (parts.length < 3) return Optional.empty();
            String name = parts[0].trim();
            int age = Integer.parseInt(parts[1].trim());
            String email = parts[2].trim();
            return Optional.of(new Person(name, age, email));
        } catch (Exception e) {
            return Optional.empty(); // Skip malformed line
        }
    }

    static class Person {
        String name;
        int age;
        String email;

        Person(String name, int age, String email) {
            this.name = name;
            this.age = age;
            this.email = email;
        }

        public String toString() {
            return name + " (" + age + ") - " + email;
        }
    }
}
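
The pipeline above only filters before collecting; as the introduction notes, grouping slots into the same pipeline. A minimal sketch, assuming the Person class, the parsePerson helper, and people.csv from the example above are in scope, that buckets people by age decade (the decade bucketing itself is just an illustration):

try (Stream<String> lines = Files.lines(Path.of("people.csv"))) {
    Map<Integer, List<Person>> byDecade = lines
        .skip(1)                                   // Skip header
        .map(CsvPersonExample::parsePerson)
        .flatMap(Optional::stream)                 // Drop malformed rows
        .collect(Collectors.groupingBy(p -> (p.age / 10) * 10)); // 25 -> 20, 30 -> 30, ...

    byDecade.forEach((decade, group) ->
        System.out.println(decade + "s: " + group));
} catch (IOException e) {
    System.err.println("Failed to read CSV file: " + e.getMessage());
}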

Error Handling & Best Practices

  1. Always open the stream in a try-with-resources block; Files.lines() keeps the underlying file handle open until the stream is closed.
  2. Files.lines(Path) decodes UTF-8 by default; pass an explicit Charset for files in other encodings.
  3. IOException is thrown when the file is opened, but read failures during streaming surface as UncheckedIOException; catch both if partial results matter.
  4. Parse each row defensively (as parsePerson does with Optional) so one malformed line does not abort the whole pipeline.
  5. String.split() is fine for simple, unquoted CSV; for quoted fields, embedded commas, or escape characters, use a dedicated CSV parser library.
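
A minimal sketch of the charset and mid-stream failure points, reusing file and the parsePerson helper from the example above. The ISO-8859-1 charset is only an assumption for illustration; StandardCharsets lives in java.nio.charset and UncheckedIOException in java.io:

try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
    long adults = lines
        .skip(1)
        .map(CsvPersonExample::parsePerson)
        .flatMap(Optional::stream)
        .filter(p -> p.age >= 18)
        .count();
    System.out.println("Adults: " + adults);
} catch (IOException e) {
    // Thrown when the file cannot be opened
    System.err.println("Could not open file: " + e.getMessage());
} catch (UncheckedIOException e) {
    // Thrown if a read fails while the stream is being consumed
    System.err.println("Read failed mid-stream: " + e.getCause().getMessage());
}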

Summary

Processing CSV files with streams enables concise, readable, and efficient data transformation pipelines. With careful error handling and attention to performance, you can parse large datasets into structured objects with minimal overhead.


15.2 Processing Large Text Files Efficiently

Java Streams are especially powerful for processing large text files thanks to their lazy evaluation and line-by-line streaming capabilities. Unlike traditional approaches that read the entire file into memory, Files.lines(Path) returns a lazily populated Stream<String>, allowing efficient processing of massive files without exhausting system resources.

Why Streams Are Efficient for Large Files

Files.lines() is lazy: lines are pulled from disk only as the terminal operation consumes them, so memory usage stays roughly constant no matter how large the file is. Intermediate operations such as filter and map are applied in a single pass over the data, and short-circuiting operations (findFirst, anyMatch, limit) can stop reading before the end of the file.

This makes streams ideal for logs, large CSVs, or text analytics.
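
A small sketch of that laziness: a short-circuiting terminal operation such as findFirst() stops pulling lines as soon as a match is found, so only the beginning of a huge file may ever be read. The file name and search term below are placeholders:

import java.io.IOException;
import java.nio.file.*;
import java.util.Optional;
import java.util.stream.Stream;

public class FirstMatch {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log"); // Placeholder path

        try (Stream<String> lines = Files.lines(logPath)) {
            Optional<String> firstFatal = lines
                .filter(line -> line.contains("FATAL")) // Placeholder search term
                .findFirst();                           // Short-circuits: stops reading after the first hit

            System.out.println(firstFatal.orElse("No FATAL lines found"));
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}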

Example 1: Count Lines Matching a Pattern

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.*;

public class ErrorCounter {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log");

        try (Stream<String> lines = Files.lines(logPath)) {
            long errorCount = lines
                .filter(line -> line.contains("ERROR"))
                .count();

            System.out.println("Total ERROR lines: " + errorCount);
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}

🔍 Explanation: every line is streamed through the filter, but only lines containing "ERROR" reach the count, and the file is never loaded into memory as a whole.
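
The same idea extends from counting one level to counting them all. A sketch that assumes each line begins with a level token such as INFO, WARN, or ERROR; that format is an assumption for illustration, not something the log file above is guaranteed to follow:

import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.stream.*;

public class LogLevelCounter {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log");

        try (Stream<String> lines = Files.lines(logPath)) {
            Map<String, Long> countsByLevel = lines
                .map(line -> line.split("\\s+", 2)[0])    // Take the first token as the level
                .filter(level -> level.matches("[A-Z]+")) // Keep only plausible level tokens
                .collect(Collectors.groupingBy(
                    level -> level,
                    Collectors.counting()));

            countsByLevel.forEach((level, count) ->
                System.out.println(level + ": " + count));
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}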

Example 2: Extract Statistics from Lines

Suppose each line contains a numeric value. You want to compute summary statistics efficiently.

values.txt:
23
42
17
invalid
58

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class FileStats {
    public static void main(String[] args) {
        Path path = Path.of("values.txt");

        try (Stream<String> lines = Files.lines(path)) {
            IntSummaryStatistics stats = lines
                .map(String::trim)
                .filter(s -> s.matches("\\d+")) // Filter valid numbers
                .mapToInt(Integer::parseInt)
                .summaryStatistics();

            System.out.println("Count: " + stats.getCount());
            System.out.println("Min: " + stats.getMin());
            System.out.println("Max: " + stats.getMax());
            System.out.println("Average: " + stats.getAverage());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}

🧠 Efficient Design:

  1. The whole computation is a single lazy pass over the file: trimming, validation, and parsing happen one line at a time.
  2. Lines that are not plain integers (such as "invalid") are filtered out before parsing, so no exception handling is needed inside the pipeline.
  3. summaryStatistics() accumulates count, min, max, sum, and average during that same pass, with no intermediate collection.
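
If the values can contain decimals rather than plain integers, the same single-pass design carries over to a double-based pipeline; a sketch under that assumption:

import java.io.IOException;
import java.nio.file.*;
import java.util.DoubleSummaryStatistics;
import java.util.stream.*;

public class DecimalFileStats {
    public static void main(String[] args) {
        Path path = Path.of("values.txt"); // Assumes the file may contain values like 3.14

        try (Stream<String> lines = Files.lines(path)) {
            DoubleSummaryStatistics stats = lines
                .map(String::trim)
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?")) // Integers or decimals, optionally negative
                .mapToDouble(Double::parseDouble)
                .summaryStatistics();

            System.out.println("Count: " + stats.getCount());
            System.out.println("Average: " + stats.getAverage());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}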

Best Practices

  1. Always wrap Files.lines() in try-with-resources so the underlying file handle is released promptly.
  2. Files.lines() decodes UTF-8 by default; pass an explicit Charset for files in other encodings.
  3. Read failures that occur mid-stream are thrown as UncheckedIOException; catch it if partial results matter.
  4. Keep the pipeline streaming: avoid collecting the whole file into a list unless you actually need random access to every line.

Summary

Java Streams combined with Files.lines() provide a scalable, elegant solution for processing large text files. By processing data lazily and efficiently, you can analyze logs, parse files, and compute summaries without memory overhead, even on gigabyte-scale datasets.


15.3 Example: Word Count with Streams

Word counting is a classic problem that demonstrates the power of Java Streams for text processing. This example walks through reading a file, tokenizing text into words, normalizing and cleaning input, and then computing word frequencies using collectors.

We'll use Files.lines() to read a file lazily, process each line to extract words, and count occurrences with Collectors.groupingBy().

Key Steps

  1. Read lines from a file using Files.lines().
  2. Convert all text to lowercase (normalization).
  3. Split lines into words using a regex.
  4. Filter out blanks or invalid entries.
  5. Use Collectors.groupingBy() and Collectors.counting() to tally words.

Example File: sample.txt

Hello, world!
This is a test. This is only a test.
hello HELLO? test!

Runnable Code: Word Frequency Counter

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

public class WordCountExample {
    public static void main(String[] args) {
        Path file = Path.of("sample.txt");

        try (Stream<String> lines = Files.lines(file)) {
            Map<String, Long> wordCounts = lines
                .flatMap(line -> Arrays.stream(line
                    .toLowerCase()                // Normalize case
                    .replaceAll("[^a-z\\s]", "") // Remove punctuation
                    .split("\\s+")))             // Split by whitespace
                .filter(word -> !word.isBlank()) // Skip empty strings
                .collect(Collectors.groupingBy(
                    Function.identity(),
                    Collectors.counting()));

            // Print sorted result
            wordCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .forEach(entry ->
                    System.out.printf("%-10s -> %d%n", entry.getKey(), entry.getValue()));
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}

Breakdown of the Pipeline

  1. flatMap(...) turns each line into a stream of words: the line is lowercased, punctuation is stripped with replaceAll, and the result is split on whitespace.
  2. filter(word -> !word.isBlank()) drops empty tokens left behind by leading whitespace or punctuation-only input.
  3. Collectors.groupingBy(Function.identity(), Collectors.counting()) builds a Map<String, Long> from each distinct word to its number of occurrences.
  4. The final stream over entrySet() sorts the entries by count in descending order and prints them.
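
A variant of the tokenizing step, shown only as an alternative sketch: Pattern.splitAsStream() produces the word stream directly instead of building an intermediate array for every line. It drops into the try block of WordCountExample and additionally needs java.util.regex.Pattern imported:

Pattern nonLetters = Pattern.compile("[^a-z]+");

Map<String, Long> wordCounts = lines
    .map(String::toLowerCase)                 // Normalize case
    .flatMap(nonLetters::splitAsStream)       // Split on runs of non-letters, no intermediate array
    .filter(word -> !word.isBlank())          // Drop empty leading tokens
    .collect(Collectors.groupingBy(
        Function.identity(),
        Collectors.counting()));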

Output for sample.txt

hello      -> 3
test       -> 3
this       -> 2
is         -> 2
a          -> 2
only       -> 1
world      -> 1

Summary

This example highlights how to build a complete and efficient word count pipeline using Java Streams. With just a few transformations and collectors, you can process complex text input, handle edge cases like punctuation and blank lines, and produce clean, sorted output. This pattern is easily extendable to more advanced natural language processing tasks.
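
One small extension of the pattern: cap the sorted output at the most frequent words, for example the top three (the cutoff is arbitrary):

wordCounts.entrySet().stream()
    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
    .limit(3) // Keep only the three most frequent words
    .forEach(entry ->
        System.out.printf("%-10s -> %d%n", entry.getKey(), entry.getValue()));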
