
Working with Character Sets and Encodings

Java IO and NIO

9.1 Charset and CharsetDecoder/CharsetEncoder

When working with text in Java, understanding how characters are represented and converted to bytes (and vice versa) is crucial. This is especially important for applications dealing with file IO, network communication, or interoperability with systems that may use different encodings. Java’s Charset, CharsetEncoder, and CharsetDecoder classes provide a robust framework to handle these conversions reliably and efficiently.

What is a Character Set?

A character set (or charset) defines a mapping between a collection of characters (letters, digits, symbols) and their corresponding numeric values (code points). These numeric values are then encoded into sequences of bytes to store or transmit text.

Common character sets include:

- US-ASCII: a 7-bit encoding covering unaccented Latin letters, digits, and basic punctuation.
- ISO-8859-1 (Latin-1): a single-byte encoding covering most Western European languages.
- UTF-8: a variable-width Unicode encoding, backward compatible with ASCII.
- UTF-16: a Unicode encoding that uses one or two 16-bit code units per character.
- Windows-1252: a Windows superset of ISO-8859-1 that is common in legacy files.

Because different systems and protocols may use different charsets, converting between bytes and characters requires specifying which charset to use. Incorrect charset assumptions often lead to mojibake (garbled text).

The Charset Class in Java

The Charset class (in java.nio.charset) represents a named mapping between sequences of 16-bit Unicode characters (char) and sequences of bytes. Java’s core platform includes support for many standard charsets, and you can obtain a Charset instance for any supported charset.

How to Obtain a Charset

You can get a Charset instance via:

Charset charset1 = Charset.forName("UTF-8");         // Standard charset by name
Charset charset2 = StandardCharsets.UTF_8;           // Preferred constant from Java 7+

Java also provides constants for other popular charsets like US_ASCII, ISO_8859_1, and UTF_16.
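
Beyond obtaining an instance, Charset exposes several inspection methods that are useful when you need to validate or report on encodings at runtime. A quick sketch (all standard java.nio.charset API; the printed alias set varies by JDK):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetInspection {
    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;

        System.out.println(utf8.name());        // Canonical name: UTF-8
        System.out.println(utf8.aliases());     // Alias set, e.g. [UTF8, unicode-1-1-utf-8, ...]
        System.out.println(utf8.canEncode());   // true: this charset supports encoding

        System.out.println(Charset.isSupported("windows-1252"));  // true on standard JDKs
        System.out.println(Charset.defaultCharset());             // The JVM's default charset
    }
}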

Role of Charset in Java Text Encoding and Decoding

A Charset defines both directions of the transformation: encoding (converting char sequences into bytes) and decoding (converting bytes back into char sequences). This two-way transformation is essential because:

- Text must be converted to bytes before it can be written to a file or sent over the network.
- Bytes received from files, sockets, or other systems must be decoded back into characters before they can be processed as text.
- The two ends of an exchange may use different charsets, so the mapping must be explicit to avoid corruption.

CharsetEncoder and CharsetDecoder Classes

CharsetEncoder

CharsetEncoder converts sequences of characters into sequences of bytes according to a specific charset's encoding scheme. You obtain one with charset.newEncoder().

CharsetDecoder

CharsetDecoder performs the reverse operation, converting sequences of bytes into characters. You obtain one with charset.newDecoder().

Why Use CharsetEncoder/Decoder Instead of Convenience Methods?

While Java offers convenience methods like String.getBytes(Charset) and new String(byte[], Charset), the encoder/decoder classes give:

- Fine-grained control over how malformed input and unmappable characters are handled.
- Support for incremental (streaming) conversion of large or chunked data.
- Better efficiency when converting many pieces of text, since an encoder or decoder instance can be reused.

Example: Using CharsetEncoder and CharsetDecoder

import java.nio.*;
import java.nio.charset.*;

public class CharsetExample {
    public static void main(String[] args) throws CharacterCodingException {
        // Obtain a Charset instance for UTF-8
        Charset charset = StandardCharsets.UTF_8;

        // Create encoder and decoder
        CharsetEncoder encoder = charset.newEncoder();
        CharsetDecoder decoder = charset.newDecoder();

        // The original string
        String original = "Hello, 世界";  // Includes Unicode characters

        // Encode: Convert characters to bytes
        CharBuffer charBuffer = CharBuffer.wrap(original);
        ByteBuffer byteBuffer = encoder.encode(charBuffer);

        System.out.println("Encoded bytes:");
        while (byteBuffer.hasRemaining()) {
            System.out.printf("%02X ", byteBuffer.get());
        }
        System.out.println();

        // flip() sets the limit to the current position and the position to zero;
        // since we just read to the end, this makes the whole buffer readable again
        byteBuffer.flip();

        // Decode: Convert bytes back to characters
        CharBuffer decodedCharBuffer = decoder.decode(byteBuffer);
        String decodedString = decodedCharBuffer.toString();

        System.out.println("Decoded string:");
        System.out.println(decodedString);
    }
}

Output:

Encoded bytes:
48 65 6C 6C 6F 2C 20 E4 B8 96 E7 95 8C 
Decoded string:
Hello, 世界

Handling Malformed and Unmappable Characters

CharsetEncoder and CharsetDecoder allow configuring actions on errors:

encoder.onMalformedInput(CodingErrorAction.REPLACE);
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
decoder.onMalformedInput(CodingErrorAction.REPORT);

Options include:

- CodingErrorAction.IGNORE: silently drops the malformed or unmappable input.
- CodingErrorAction.REPLACE: substitutes a replacement value ('?' by default when encoding, the Unicode replacement character U+FFFD when decoding).
- CodingErrorAction.REPORT: throws a CharacterCodingException; this is the default for coders obtained via newEncoder() and newDecoder().
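
As a small illustration, the following sketch (assuming a string with a character outside US-ASCII) encodes with REPLACE, so the unmappable 'é' becomes the encoder's default replacement byte '?':

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

public class ErrorActionDemo {
    public static void main(String[] args) throws CharacterCodingException {
        // 'é' has no representation in US-ASCII
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("héllo"));

        // The unmappable 'é' was replaced with '?': prints "h?llo"
        System.out.println(new String(bytes.array(), 0, bytes.limit(),
                StandardCharsets.US_ASCII));
    }
}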

Summary

By mastering Charset, CharsetEncoder, and CharsetDecoder, Java developers can handle text data across diverse platforms and protocols with confidence and precision.


9.2 Reading and Writing Text with Charset Support

Handling text files correctly in Java requires careful attention to character encoding. A common source of bugs and data corruption arises when the character encoding used to read a file does not match the encoding in which the file was written. This mismatch can cause unreadable characters (mojibake), lost data, or exceptions. To avoid such problems, explicitly specifying the character set (charset) during text file IO is crucial.

Why Specifying Charset Matters

Every text file is essentially a sequence of bytes interpreted as characters according to an encoding scheme or charset. Examples of common charsets are UTF-8, ISO-8859-1, UTF-16, etc. Different charsets encode characters differently:

- UTF-8 uses one byte for ASCII characters and two to four bytes for everything else.
- ISO-8859-1 uses exactly one byte per character and covers only Western European characters.
- UTF-16 uses two bytes for most characters and four bytes (a surrogate pair) for supplementary characters.

If you rely implicitly on Java's platform default charset (e.g., via FileReader or FileWriter), your code becomes non-portable and fragile, because the default charset varies by platform, locale, and JVM settings.

Explicit charset specification ensures that your program reads and writes text consistently, regardless of platform or environment.

Java Classes for Charset-Aware Text IO

The key classes for charset-aware reading and writing of text files are:

- InputStreamReader: decodes bytes from an InputStream into characters using a specified charset.
- OutputStreamWriter: encodes characters into bytes and writes them to an OutputStream using a specified charset.

Both classes bridge between byte streams (InputStream/OutputStream) and character streams (Reader/Writer).

Reading Text Files with InputStreamReader

The typical pattern to read a text file with a specific charset is:

import java.io.*;
import java.nio.charset.Charset;

public class CharsetAwareFileRead {
    public static void main(String[] args) {
        File file = new File("example.txt");
        Charset charset = Charset.forName("UTF-8");

        try (InputStream inputStream = new FileInputStream(file);
             Reader reader = new InputStreamReader(inputStream, charset);
             BufferedReader bufferedReader = new BufferedReader(reader)) {

            String line;
            while ((line = bufferedReader.readLine()) != null) {
                System.out.println(line);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation:

- FileInputStream supplies the raw bytes of the file.
- InputStreamReader decodes those bytes into characters using the specified charset.
- BufferedReader adds buffering and convenient line-oriented reading via readLine().

This method guarantees that bytes are interpreted according to the specified charset, avoiding platform default pitfalls.
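
As an alternative, the java.nio.file API provides the same charset-aware behavior more concisely. A minimal sketch, assuming the same example.txt file and Java 11+ (use Paths.get instead of Path.of on older versions):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioFileRead {
    public static void main(String[] args) {
        // Files.newBufferedReader decodes with the given charset and buffers in one call
        try (BufferedReader reader = Files.newBufferedReader(
                Path.of("example.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}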

Writing Text Files with OutputStreamWriter

Similarly, to write text safely with explicit charset:

import java.io.*;
import java.nio.charset.Charset;

public class CharsetAwareFileWrite {
    public static void main(String[] args) {
        File file = new File("output.txt");
        Charset charset = Charset.forName("UTF-8");

        try (OutputStream outputStream = new FileOutputStream(file);
             Writer writer = new OutputStreamWriter(outputStream, charset);
             BufferedWriter bufferedWriter = new BufferedWriter(writer)) {

            bufferedWriter.write("Hello, 世界!");
            bufferedWriter.newLine();
            bufferedWriter.write("This file is written with UTF-8 charset.");

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation:

- FileOutputStream writes raw bytes to the file.
- OutputStreamWriter encodes characters into bytes using the specified charset.
- BufferedWriter adds buffering and the platform-appropriate newLine() method.

This ensures the written bytes accurately represent the characters in UTF-8 encoding.
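
Again, the java.nio.file API offers a compact equivalent. A minimal sketch, assuming Java 11+ for Files.writeString:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioFileWrite {
    public static void main(String[] args) {
        try {
            // Files.writeString encodes the text with the given charset (Java 11+)
            Files.writeString(Path.of("output.txt"),
                    "Hello, 世界!\nThis file is written with UTF-8 charset.",
                    StandardCharsets.UTF_8);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}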

Example: Round-Trip Reading and Writing

To illustrate safe and portable IO, consider reading a file in UTF-8 and writing its contents to another file in UTF-8 explicitly:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class RoundTripTextIO {
    public static void main(String[] args) {
        File inputFile = new File("input.txt");
        File outputFile = new File("output.txt");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8))) {

            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }

            System.out.println("File copied successfully with UTF-8 charset.");

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This example:

- Opens input.txt and decodes its bytes as UTF-8.
- Writes each line to output.txt, encoding the characters as UTF-8.
- Preserves the text exactly, though line separators are normalized by newLine().

Common Pitfalls and Best Practices

  1. Never rely on default charset for file IO. Always specify charset explicitly unless you have very specific reasons and control over the environment.

  2. Match the charset on both reading and writing. When reading files you wrote yourself, use the same charset consistently to avoid surprises.

  3. Use StandardCharsets constants when possible. E.g., StandardCharsets.UTF_8 is preferred over Charset.forName("UTF-8") to avoid typos and UnsupportedCharsetException.

  4. Wrap streams in buffered readers/writers. For performance and convenient line-based operations.

  5. Be cautious with legacy APIs like FileReader and FileWriter. Their no-charset constructors use the platform default charset internally and are discouraged for portable code (see the sketch below).
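
Note that since Java 11, FileReader and FileWriter also have constructors that accept an explicit Charset, which removes their main drawback. A minimal sketch:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class FileReaderWithCharset {
    public static void main(String[] args) {
        // Java 11+ constructor taking an explicit charset
        try (BufferedReader reader = new BufferedReader(
                new FileReader("example.txt", StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}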

Summary

By adhering to these principles and using the proper classes with explicit charset parameters, Java developers can avoid many common pitfalls and ensure their applications handle text files robustly across environments and locales.


9.3 Handling Unicode and UTF Variants

In today’s globalized digital world, software often needs to handle text from many languages and scripts. This requirement has made Unicode the fundamental standard for representing text characters consistently across platforms and systems. Understanding Unicode and its encoding schemes—UTF-8, UTF-16, and UTF-32—is critical for developers, including Java programmers, to ensure correct, efficient, and interoperable text processing.

What Is Unicode?

Unicode is a universal character set designed to encode all the characters used in writing systems worldwide—letters, digits, symbols, emoji, and control characters—into a single standard. Unlike older character encodings that supported only limited alphabets (like ASCII or ISO-8859-1), Unicode can represent over one million code points, though only around 150,000 of them have been assigned so far.

Unicode itself is an abstract mapping of characters to numbers. To store or transmit text, these numbers need to be encoded as bytes—this is where UTF encodings come in.

Why Different UTF Encodings Exist

Unicode code points are abstract; they do not specify how to represent characters as bytes. Different encoding schemes—UTF-8, UTF-16, and UTF-32—define how to convert these code points to byte sequences.

Each UTF encoding offers trade-offs in terms of:

- Space efficiency, especially for ASCII-heavy versus CJK-heavy text.
- Compatibility with ASCII and legacy byte-oriented software.
- Fixed versus variable width, which affects how easily you can index into the encoded bytes.
- Byte-order (endianness) handling, which affects UTF-16 and UTF-32 but not UTF-8.

No single encoding fits all use cases perfectly, so the Unicode standard provides multiple UTF variants to meet diverse needs.

The UTF Encodings Explained

UTF-8

UTF-8 is a variable-width encoding that uses one to four bytes per code point. Characters in the ASCII range (U+0000 to U+007F) occupy a single byte identical to their ASCII value, so valid ASCII text is also valid UTF-8. It has no byte-order ambiguity and is the dominant encoding on the web.

Example:

- 'A' (U+0041) encodes as the single byte 41.
- 'é' (U+00E9) encodes as two bytes: C3 A9.
- '世' (U+4E16) encodes as three bytes: E4 B8 96.
- Emoji such as '😀' (U+1F600) encode as four bytes: F0 9F 98 80.

UTF-16

UTF-16 encodes each code point as one or two 16-bit code units: characters in the Basic Multilingual Plane (BMP) take two bytes, while supplementary characters take four bytes as a surrogate pair. Because code units span multiple bytes, byte order matters; streams often begin with a byte order mark (BOM) or use the explicit UTF-16BE/UTF-16LE variants.

UTF-32

UTF-32 encodes every code point as a fixed four bytes. This makes character indexing trivial but is space-inefficient, typically quadrupling the size of ASCII text. Like UTF-16, it is byte-order sensitive.
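
To make these trade-offs concrete, the following sketch compares the encoded size of the same string in each UTF variant. Note that UTF-32 has no StandardCharsets constant, so it is looked up by name (supported on common JDKs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    public static void main(String[] args) {
        String s = "Hello, 世界";  // 7 ASCII characters + 2 CJK characters

        // UTF-8: 7 * 1 byte + 2 * 3 bytes = 13 bytes
        System.out.println("UTF-8:    " + s.getBytes(StandardCharsets.UTF_8).length);

        // UTF-16BE: 9 * 2 bytes = 18 bytes (the BE variant writes no BOM)
        System.out.println("UTF-16BE: " + s.getBytes(StandardCharsets.UTF_16BE).length);

        // UTF-32: 9 * 4 bytes = 36 bytes
        System.out.println("UTF-32:   " + s.getBytes(Charset.forName("UTF-32")).length);
    }
}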

How Java Supports Unicode and UTF Encodings

Java’s native char type is a 16-bit UTF-16 code unit, meaning Java strings are internally encoded as UTF-16 sequences. This design enables Java to handle all BMP characters in a single char, but supplementary characters (outside BMP) are represented using two char units called surrogate pairs.
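
A brief sketch of what surrogate pairs mean in practice, using U+1D11E (MUSICAL SYMBOL G CLEF), a supplementary character outside the BMP:

public class SurrogatePairDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E";  // U+1D11E as a high/low surrogate pair

        System.out.println(s.length());                       // 2: two char units
        System.out.println(s.codePointCount(0, s.length()));  // 1: one code point
        System.out.printf("U+%X%n", s.codePointAt(0));        // U+1D11E
    }
}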

Java provides rich support for encoding and decoding Unicode text:

- StandardCharsets supplies constants for UTF_8, UTF_16, UTF_16BE, and UTF_16LE (UTF-32 is available via Charset.forName("UTF-32")).
- String.getBytes(Charset) and new String(byte[], Charset) convert between strings and encoded bytes.
- Charset, CharsetEncoder, and CharsetDecoder provide fine-grained, configurable conversion.
- String methods such as codePointAt(), codePointCount(), and codePoints() operate on full code points rather than char units.

Practical Advice on Choosing the Right Encoding

  1. Default to UTF-8 when possible. UTF-8 is the de facto standard on the internet and in most modern software because it is compact for ASCII-heavy text, compatible with ASCII, and byte-order safe. For new projects and cross-platform interoperability, UTF-8 is generally the best choice.

  2. Use UTF-16 when interacting with systems or protocols that expect it. For example, Windows and Java internally use UTF-16, and some APIs or file formats require UTF-16 encoded data. But be aware of byte order and surrogate pairs.

  3. UTF-32 is rarely needed except for specialized processing. Fixed-width UTF-32 simplifies character indexing but uses much more space. Use it only if the application demands fixed-width encoding for performance reasons.

  4. Always specify the encoding explicitly. Never rely on the platform default charset when reading or writing text, as this can cause data corruption or incompatibility.

Common Pitfalls When Handling Unicode Data

- Assuming one char equals one character: supplementary characters occupy two char units, so length() and charAt() can split surrogate pairs.
- Relying on the platform default charset instead of specifying one explicitly.
- Mishandling the byte order mark (BOM): some tools write one for UTF-8 or UTF-16 and others do not expect it.
- Truncating a byte array in the middle of a multi-byte sequence, which produces malformed input when decoding.
- Confusing code points with user-perceived characters: accented characters and emoji sequences may span several code points.

Summary

Unicode is the universal standard for representing text from all languages, and UTF encodings specify how to serialize those characters into bytes:

- UTF-8: variable width (one to four bytes), ASCII compatible, byte-order free, and the best general-purpose default.
- UTF-16: two bytes for BMP characters, four for supplementary characters via surrogate pairs; byte-order sensitive.
- UTF-32: a fixed four bytes per code point; simple indexing at a high space cost.

Java’s internal string representation uses UTF-16, and it offers comprehensive support for all UTF encodings through its Charset APIs. Choosing the right encoding involves balancing compatibility, efficiency, and application requirements, with UTF-8 being the safest default for most cases.

By understanding these fundamentals and pitfalls, developers can correctly handle Unicode text, ensuring applications are robust, internationalized, and interoperable in a multilingual world.


9.4 Charset Conversion Examples

In many real-world Java applications, you often need to convert text data from one character encoding to another—for example, reading a file encoded in ISO-8859-1 and saving it as UTF-8, or processing data streams with mixed encodings. Java’s CharsetDecoder and CharsetEncoder classes, along with utility classes in java.nio.charset, provide robust tools to perform such conversions reliably.

This section explains how to convert text between different charsets in Java and demonstrates practical code examples.

Why Charset Conversion is Important

Different systems and files may use different encodings, and misinterpreting bytes as the wrong charset can lead to corrupted text (mojibake) or exceptions. Charset conversion ensures that:

- Text read from one system can be handed to another system in the encoding it expects.
- Characters survive the round trip without silent loss or substitution.
- Downstream consumers (databases, web services, file formats) receive bytes in a single, known encoding.

Core Classes: CharsetDecoder and CharsetEncoder

To convert text from one charset to another, you:

  1. Decode the original bytes using the source charset into characters.
  2. Encode these characters using the target charset back into bytes.

Example 1: Basic Charset Conversion Using CharsetDecoder and CharsetEncoder

This example reads a byte array encoded in ISO-8859-1 and converts it to a UTF-8 encoded byte array.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

public class CharsetConversionExample {
    public static void main(String[] args) throws CharacterCodingException {
        // Original text encoded in ISO-8859-1 bytes (for demo)
        byte[] iso8859Bytes = {(byte)0xE9, (byte)0x20, (byte)0x6C, (byte)0xE0, (byte)0x20, (byte)0x63, (byte)0xE9, (byte)0x20, (byte)0x63, (byte)0x61, (byte)0x72};

        // Step 1: Decode ISO-8859-1 bytes to characters
        Charset sourceCharset = Charset.forName("ISO-8859-1");
        CharsetDecoder decoder = sourceCharset.newDecoder();
        ByteBuffer sourceBytes = ByteBuffer.wrap(iso8859Bytes);
        CharBuffer chars = decoder.decode(sourceBytes);

        System.out.println("Decoded characters:");
        System.out.println(chars.toString());  // prints: é là cé car

        // Step 2: Encode characters into UTF-8 bytes
        Charset targetCharset = StandardCharsets.UTF_8;
        CharsetEncoder encoder = targetCharset.newEncoder();
        ByteBuffer utf8Bytes = encoder.encode(chars);

        System.out.println("Re-encoded UTF-8 bytes:");
        while (utf8Bytes.hasRemaining()) {
            System.out.printf("%02X ", utf8Bytes.get());
        }
        // Output: C3 A9 20 6C C3 A0 20 63 C3 A9 20 63 61 72
    }
}

Explanation

- The decoder wraps the raw ISO-8859-1 bytes in a ByteBuffer and decodes them into a CharBuffer of characters.
- At that point the characters are charset-independent; they are simply Unicode text.
- The encoder then converts those characters into their UTF-8 byte representation, where each accented character becomes two bytes (for example, é becomes C3 A9).

Example 2: Converting Text File Encoding

Suppose you have a file encoded in Windows-1252 and want to convert it to UTF-8.

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

public class FileEncodingConverter {
    public static void main(String[] args) {
        File inputFile = new File("input-win1252.txt");
        File outputFile = new File("output-utf8.txt");

        Charset sourceCharset = Charset.forName("windows-1252");
        Charset targetCharset = StandardCharsets.UTF_8;

        try (InputStream inStream = new FileInputStream(inputFile);
             OutputStream outStream = new FileOutputStream(outputFile)) {

            // Decoder and encoder
            CharsetDecoder decoder = sourceCharset.newDecoder();
            CharsetEncoder encoder = targetCharset.newEncoder();

            // Read all bytes from source file
            byte[] inputBytes = inStream.readAllBytes();
            ByteBuffer sourceBuffer = ByteBuffer.wrap(inputBytes);

            // Decode bytes to chars
            CharBuffer charBuffer = decoder.decode(sourceBuffer);

            // Encode chars to target charset bytes
            ByteBuffer targetBuffer = encoder.encode(charBuffer);

            // Write the encoded bytes; arrayOffset()/remaining() keep this correct
            // even if the buffer's backing array has a nonzero offset
            outStream.write(targetBuffer.array(),
                    targetBuffer.arrayOffset() + targetBuffer.position(),
                    targetBuffer.remaining());

            System.out.println("File encoding converted successfully.");

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Explanation

- The entire input file is read into memory and decoded from windows-1252 into characters.
- The characters are then encoded as UTF-8 and written to the output file.
- Because this approach buffers the whole file, it is best suited to files that comfortably fit in memory.
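
For larger files, a streaming variant avoids loading everything into memory by chaining an InputStreamReader (decoding with the source charset) to an OutputStreamWriter (encoding with the target charset). A minimal sketch, assuming the same file names as above:

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamingEncodingConverter {
    public static void main(String[] args) {
        Charset sourceCharset = Charset.forName("windows-1252");
        Charset targetCharset = StandardCharsets.UTF_8;

        try (Reader reader = new InputStreamReader(
                 new FileInputStream("input-win1252.txt"), sourceCharset);
             Writer writer = new OutputStreamWriter(
                 new FileOutputStream("output-utf8.txt"), targetCharset)) {

            // Copy characters in chunks; the reader decodes, the writer encodes
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}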

Utility Approach: Using new String() and getBytes() for Quick Conversions

For simpler scenarios, Java’s String constructors and getBytes() methods can be used for conversion:

import java.io.UnsupportedEncodingException;

public class SimpleConversion {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] windows1252Bytes = { (byte) 0xE9, (byte) 0x20, (byte) 0x6C, (byte) 0xE0 }; // é là

        // Decode the bytes into a String using Windows-1252
        String text = new String(windows1252Bytes, "windows-1252");
        System.out.println("Decoded text: " + text);

        // Encode the String into UTF-8 bytes
        byte[] utf8Bytes = text.getBytes("UTF-8");

        System.out.println("UTF-8 bytes:");
        for (byte b : utf8Bytes) {
            System.out.printf("%02X ", b);
        }
    }
}

Note

While convenient, this approach doesn't provide fine-grained control over error handling or streaming conversion and may be less efficient for large data. When a Charset instance is available, prefer the overloads new String(byte[], Charset) and String.getBytes(Charset), which avoid both charset-name typos and the checked UnsupportedEncodingException.

Handling Malformed and Unmappable Characters

When converting between charsets, you may encounter characters that cannot be mapped from source to target charset. Both CharsetDecoder and CharsetEncoder allow configuring error handling strategies:

decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);

encoder.onMalformedInput(CodingErrorAction.REPORT);
encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

Options include:

- CodingErrorAction.IGNORE: drops the offending input silently (risking data loss).
- CodingErrorAction.REPLACE: substitutes a replacement value ('?' when encoding, U+FFFD when decoding).
- CodingErrorAction.REPORT: throws a CharacterCodingException so the caller can handle the failure explicitly.

Use these settings depending on your tolerance for data loss or corruption.
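
A short sketch of REPLACE on the decoding side, using a deliberately malformed UTF-8 sequence (0xC3 opens a two-byte sequence, but 0x28 is not a valid continuation byte):

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class MalformedInputDemo {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE);

        byte[] bad = {0x48, (byte) 0xC3, 0x28};  // 'H', a broken sequence, then '('

        // The malformed byte becomes U+FFFD: prints "H\uFFFD("
        System.out.println(decoder.decode(ByteBuffer.wrap(bad)).toString());
    }
}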

Summary

Mastering charset conversion techniques ensures your Java applications remain compatible with diverse data sources and produce reliably encoded output across environments.
