
Processing Text Files and Data Sources

Java Streams

15.1 Reading and Processing CSV Files with Streams

Java Streams offer a powerful and efficient way to read and process CSV files, especially when combined with Files.lines(). By leveraging the stream API, you can read large datasets line by line, transform them into structured objects, filter or group them, and collect results with minimal memory usage and high readability.

Key Steps for CSV Processing

  1. Open the file using Files.lines(Path) in a try-with-resources block.
  2. Skip headers if present.
  3. Split each line using String.split() or a CSV parser.
  4. Map lines to structured objects.
  5. Filter or transform data as needed.
  6. Collect results into a list or other structure.

Example: Parse CSV into a Person Class

CSV file: people.csv

name,age,email
Alice,30,alice@example.com
Bob,25,bob@example.com
Charlie,invalid,charlie@example.com

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class CsvPersonExample {
    public static void main(String[] args) {
        Path file = Path.of("people.csv");

        try (Stream<String> lines = Files.lines(file)) {
            List<Person> people = lines
                .skip(1) // Skip header
                .map(CsvPersonExample::parsePerson)
                .flatMap(Optional::stream) // Filter out failed parses
                .filter(p -> p.age >= 18) // Filter adults
                .collect(Collectors.toList());

            people.forEach(System.out::println);
        } catch (IOException e) {
            System.err.println("Failed to read CSV file: " + e.getMessage());
        }
    }

    static Optional<Person> parsePerson(String line) {
        try {
            String[] parts = line.split(",", -1);
            if (parts.length < 3) return Optional.empty();
            String name = parts[0].trim();
            int age = Integer.parseInt(parts[1].trim());
            String email = parts[2].trim();
            return Optional.of(new Person(name, age, email));
        } catch (Exception e) {
            return Optional.empty(); // Skip malformed line
        }
    }

    static class Person {
        String name;
        int age;
        String email;

        Person(String name, int age, String email) {
            this.name = name;
            this.age = age;
            this.email = email;
        }

        public String toString() {
            return name + " (" + age + ") - " + email;
        }
    }
}
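
The pipeline above only filters before collecting; as the introduction notes, grouping slots into the same pipeline. A minimal sketch, assuming the Person class, the parsePerson helper, and people.csv from the example above are in scope, that buckets people by age decade (the decade bucketing itself is just an illustration):

try (Stream<String> lines = Files.lines(Path.of("people.csv"))) {
    Map<Integer, List<Person>> byDecade = lines
        .skip(1)                                   // Skip header
        .map(CsvPersonExample::parsePerson)
        .flatMap(Optional::stream)                 // Drop malformed rows
        .collect(Collectors.groupingBy(p -> (p.age / 10) * 10)); // 25 -> 20, 30 -> 30, ...

    byDecade.forEach((decade, group) ->
        System.out.println(decade + "s: " + group));
} catch (IOException e) {
    System.err.println("Failed to read CSV file: " + e.getMessage());
}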

Error Handling & Best Practices

  1. Always open the stream in a try-with-resources block; Files.lines() keeps the underlying file handle open until the stream is closed.
  2. Files.lines(Path) decodes UTF-8 by default; pass an explicit Charset for files in other encodings.
  3. IOException is thrown when the file is opened, but read failures during streaming surface as UncheckedIOException; catch both if partial results matter.
  4. Parse each row defensively (as parsePerson does with Optional) so one malformed line does not abort the whole pipeline.
  5. String.split() is fine for simple, unquoted CSV; for quoted fields, embedded commas, or escape characters, use a dedicated CSV parser library.
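
A minimal sketch of the charset and mid-stream failure points, reusing file and the parsePerson helper from the example above. The ISO-8859-1 charset is only an assumption for illustration; StandardCharsets lives in java.nio.charset and UncheckedIOException in java.io:

try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
    long adults = lines
        .skip(1)
        .map(CsvPersonExample::parsePerson)
        .flatMap(Optional::stream)
        .filter(p -> p.age >= 18)
        .count();
    System.out.println("Adults: " + adults);
} catch (IOException e) {
    // Thrown when the file cannot be opened
    System.err.println("Could not open file: " + e.getMessage());
} catch (UncheckedIOException e) {
    // Thrown if a read fails while the stream is being consumed
    System.err.println("Read failed mid-stream: " + e.getCause().getMessage());
}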

Summary

Processing CSV files with streams enables concise, readable, and efficient data transformation pipelines. With careful error handling and attention to performance, you can parse large datasets into structured objects with minimal overhead.


15.2 Processing Large Text Files Efficiently

Java Streams are especially powerful for processing large text files thanks to their lazy evaluation and line-by-line streaming capabilities. Unlike traditional approaches that read the entire file into memory, Files.lines(Path) returns a lazily populated Stream<String>, allowing efficient processing of massive files without exhausting system resources.

Why Streams Are Efficient for Large Files

Files.lines() is lazy: lines are pulled from disk only as the terminal operation consumes them, so memory usage stays roughly constant no matter how large the file is. Intermediate operations such as filter and map are applied in a single pass over the data, and short-circuiting operations (findFirst, anyMatch, limit) can stop reading before the end of the file.

This makes streams ideal for logs, large CSVs, or text analytics.
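
A small sketch of that laziness: a short-circuiting terminal operation such as findFirst() stops pulling lines as soon as a match is found, so only the beginning of a huge file may ever be read. The file name and search term below are placeholders:

import java.io.IOException;
import java.nio.file.*;
import java.util.Optional;
import java.util.stream.Stream;

public class FirstMatch {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log"); // Placeholder path

        try (Stream<String> lines = Files.lines(logPath)) {
            Optional<String> firstFatal = lines
                .filter(line -> line.contains("FATAL")) // Placeholder search term
                .findFirst();                           // Short-circuits: stops reading after the first hit

            System.out.println(firstFatal.orElse("No FATAL lines found"));
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}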

Example 1: Count Lines Matching a Pattern

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.*;

public class ErrorCounter {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log");

        try (Stream<String> lines = Files.lines(logPath)) {
            long errorCount = lines
                .filter(line -> line.contains("ERROR"))
                .count();

            System.out.println("Total ERROR lines: " + errorCount);
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}

🔍 Explanation: every line is streamed through the filter, but only lines containing "ERROR" reach the count, and the file is never loaded into memory as a whole.
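
The same idea extends from counting one level to counting them all. A sketch that assumes each line begins with a level token such as INFO, WARN, or ERROR; that format is an assumption for illustration, not something the log file above is guaranteed to follow:

import java.io.IOException;
import java.nio.file.*;
import java.util.Map;
import java.util.stream.*;

public class LogLevelCounter {
    public static void main(String[] args) {
        Path logPath = Path.of("server.log");

        try (Stream<String> lines = Files.lines(logPath)) {
            Map<String, Long> countsByLevel = lines
                .map(line -> line.split("\\s+", 2)[0])    // Take the first token as the level
                .filter(level -> level.matches("[A-Z]+")) // Keep only plausible level tokens
                .collect(Collectors.groupingBy(
                    level -> level,
                    Collectors.counting()));

            countsByLevel.forEach((level, count) ->
                System.out.println(level + ": " + count));
        } catch (IOException e) {
            System.err.println("Failed to read file: " + e.getMessage());
        }
    }
}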

Example 2: Extract Statistics from Lines

Suppose each line contains a numeric value. You want to compute summary statistics efficiently.

values.txt:
23
42
17
invalid
58

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class FileStats {
    public static void main(String[] args) {
        Path path = Path.of("values.txt");

        try (Stream<String> lines = Files.lines(path)) {
            IntSummaryStatistics stats = lines
                .map(String::trim)
                .filter(s -> s.matches("\\d+")) // Filter valid numbers
                .mapToInt(Integer::parseInt)
                .summaryStatistics();

            System.out.println("Count: " + stats.getCount());
            System.out.println("Min: " + stats.getMin());
            System.out.println("Max: " + stats.getMax());
            System.out.println("Average: " + stats.getAverage());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}

🧠 Efficient Design:

  1. The whole computation is a single lazy pass over the file: trimming, validation, and parsing happen one line at a time.
  2. Lines that are not plain integers (such as "invalid") are filtered out before parsing, so no exception handling is needed inside the pipeline.
  3. summaryStatistics() accumulates count, min, max, sum, and average during that same pass, with no intermediate collection.
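
If the values can contain decimals rather than plain integers, the same single-pass design carries over to a double-based pipeline; a sketch under that assumption:

import java.io.IOException;
import java.nio.file.*;
import java.util.DoubleSummaryStatistics;
import java.util.stream.*;

public class DecimalFileStats {
    public static void main(String[] args) {
        Path path = Path.of("values.txt"); // Assumes the file may contain values like 3.14

        try (Stream<String> lines = Files.lines(path)) {
            DoubleSummaryStatistics stats = lines
                .map(String::trim)
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?")) // Integers or decimals, optionally negative
                .mapToDouble(Double::parseDouble)
                .summaryStatistics();

            System.out.println("Count: " + stats.getCount());
            System.out.println("Average: " + stats.getAverage());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}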

Best Practices

  1. Always wrap Files.lines() in try-with-resources so the underlying file handle is released promptly.
  2. Files.lines() decodes UTF-8 by default; pass an explicit Charset for files in other encodings.
  3. Read failures that occur mid-stream are thrown as UncheckedIOException; catch it if partial results matter.
  4. Keep the pipeline streaming: avoid collecting the whole file into a list unless you actually need random access to every line.

Summary

Java Streams combined with Files.lines() provide a scalable, elegant solution for processing large text files. By processing data lazily and efficiently, you can analyze logs, parse files, and compute summaries without memory overhead, even on gigabyte-scale datasets.


15.3 Example: Word Count with Streams

Word counting is a classic problem that demonstrates the power of Java Streams for text processing. This example walks through reading a file, tokenizing text into words, normalizing and cleaning input, and then computing word frequencies using collectors.

We'll use Files.lines() to read a file lazily, process each line to extract words, and count occurrences with Collectors.groupingBy().

Key Steps

  1. Read lines from a file using Files.lines().
  2. Convert all text to lowercase (normalization).
  3. Split lines into words using a regex.
  4. Filter out blanks or invalid entries.
  5. Use Collectors.groupingBy() and Collectors.counting() to tally words.

Example File: sample.txt

Hello, world!
This is a test. This is only a test.
hello HELLO? test!

Runnable Code: Word Frequency Counter

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

public class WordCountExample {
    public static void main(String[] args) {
        Path file = Path.of("sample.txt");

        try (Stream<String> lines = Files.lines(file)) {
            Map<String, Long> wordCounts = lines
                .flatMap(line -> Arrays.stream(line
                    .toLowerCase()                // Normalize case
                    .replaceAll("[^a-z\\s]", "") // Remove punctuation
                    .split("\\s+")))             // Split by whitespace
                .filter(word -> !word.isBlank()) // Skip empty strings
                .collect(Collectors.groupingBy(
                    Function.identity(),
                    Collectors.counting()));

            // Print sorted result
            wordCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .forEach(entry ->
                    System.out.printf("%-10s -> %d%n", entry.getKey(), entry.getValue()));
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}

Breakdown of the Pipeline

  1. flatMap(...) turns each line into a stream of words: the line is lowercased, punctuation is stripped with replaceAll, and the result is split on whitespace.
  2. filter(word -> !word.isBlank()) drops empty tokens left behind by leading whitespace or punctuation-only input.
  3. Collectors.groupingBy(Function.identity(), Collectors.counting()) builds a Map<String, Long> from each distinct word to its number of occurrences.
  4. The final stream over entrySet() sorts the entries by count in descending order and prints them.
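
A variant of the tokenizing step, shown only as an alternative sketch: Pattern.splitAsStream() produces the word stream directly instead of building an intermediate array for every line. It drops into the try block of WordCountExample and additionally needs java.util.regex.Pattern imported:

Pattern nonLetters = Pattern.compile("[^a-z]+");

Map<String, Long> wordCounts = lines
    .map(String::toLowerCase)                 // Normalize case
    .flatMap(nonLetters::splitAsStream)       // Split on runs of non-letters, no intermediate array
    .filter(word -> !word.isBlank())          // Drop empty leading tokens
    .collect(Collectors.groupingBy(
        Function.identity(),
        Collectors.counting()));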

Output for sample.txt

hello      -> 3
test       -> 3
this       -> 2
is         -> 2
a          -> 2
only       -> 1
world      -> 1

Summary

This example highlights how to build a complete and efficient word count pipeline using Java Streams. With just a few transformations and collectors, you can process complex text input, handle edge cases like punctuation and blank lines, and produce clean, sorted output. This pattern is easily extendable to more advanced natural language processing tasks.
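
One small extension of the pattern: cap the sorted output at the most frequent words, for example the top three (the cutoff is arbitrary):

wordCounts.entrySet().stream()
    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
    .limit(3) // Keep only the three most frequent words
    .forEach(entry ->
        System.out.printf("%-10s -> %d%n", entry.getKey(), entry.getValue()));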
