
Chapter 4. Unicode Text Versus Bytes


What’s New in This Chapter #

Character Issues #

  • “string as a sequence of characters” needs the term “character” to be defined well
  • in python 3, it’s “unicode”
  • a Unicode char separates:
    • identity of the char => refers to its code point
    • the byte representation of the char => dependent on the encoding used (a codec converts between code points and byte sequences)
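In code, the identity/representation split looks like this (a minimal sketch using only the stdlib): the code point is fixed, but the bytes differ per codec.

```python
char = 'é'
# Identity: the code point (U+00E9), independent of any encoding
print(f'U+{ord(char):04X}')      # U+00E9
# Representation: the bytes depend on the codec
print(char.encode('utf_8'))      # b'\xc3\xa9' — two bytes
print(char.encode('latin_1'))    # b'\xe9'     — one byte
```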

Byte Essentials #

  1. binary sequences, there are 2 builtin types:
    • mutable: bytearray
    • immutable: bytes
  2. Each item in bytes or bytearray is an integer from 0 to 255
  3. literal notation depends (just a visual representation thing):
    • bytes in the printable ASCII range display as the characters themselves
    • tab, newline, carriage return, and backslash use escape sequences (\t, \n, \r, \\)
    • quote characters are escaped depending on the delimiter used
    • every other byte value uses hex escape notation, e.g. \x10
  4. most str methods work the same, except those that do formatting and those that depend on Unicode data, which won’t work:
    • e.g., casefold(), isdecimal(), isnumeric()
  5. regexes work the same only if the regex is compiled from a binary sequence instead of a str
  6. how to build bytes or bytearray:
    1. use bytes.fromhex()
    2. use "mystr".encode("utf-8") (str.encode — there is no bytes.encode) or bytes("mystr", encoding="utf-8")
    3. use something that implements the buffer protocol to create a new binary sequence from a source object (e.g. memoryview)
      • This needs us to be explicit about the target type
      • generally, this will also always copy the bytes from the source – except for memoryview, which is a direct proxy to the memory
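The three construction routes above, sketched in a few lines (note that encoding text is `str.encode`, not `bytes.encode`):

```python
# 1. from a hex string (spaces between pairs are ignored)
b1 = bytes.fromhex('31 4B CE A9')
print(b1)                        # b'1K\xce\xa9'

# 2. from a str plus an encoding
b2 = 'mystr'.encode('utf_8')     # equivalent: bytes('mystr', encoding='utf_8')

# 3. from an object implementing the buffer protocol
arr = bytearray(b2)              # copies the bytes
mv = memoryview(b2)              # no copy: a direct proxy to the memory
print(bytes(mv) == b2)           # True
```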

Indexing vs slicing & why str is special #

Strings are really the special kind here.

For bytes, bytearray, list, etc., seq[0] returns a single element (e.g., an int), while seq[:1] returns a new sequence of the same type containing just that element. This is different from str, where both return a string. This distinction is important when working with binary data or non-string sequences in Python.

More descriptive:

Let's break down the meaning of this passage from "Fluent Python" and the behavior it describes.

## The Key Point

> For all other sequence types in Python, 1 item is not the same as a slice of length 1.

This means:
- **Accessing a single element** (e.g., `seq[i]`) returns a single item (e.g., an int, a character, etc.).
- **Taking a slice of length 1** (e.g., `seq[:1]` or `seq[0:1]`) returns a new sequence of the same type, containing just that one item.

For most sequences in Python, **these two operations do not return the same type or value**.

### The Exception: `str`

For strings, `s[0]` and `s[:1]` both return a string (the first character), so they look the same:

```python
s = "hello"
print(s[0])   # 'h'
print(s[:1])  # 'h'
print(type(s[0]), type(s[:1]))  # <class 'str'> <class 'str'>
```

### The Rule: Other Sequences

For other sequence types (e.g., `bytes`, `bytearray`, `list`, `tuple`):

- **Indexing** returns a single element (not a sequence).
- **Slicing** returns a sequence of the same type.

#### Example: `bytes`

```python
cafe = bytes('café', encoding='utf_8')
print(cafe)      # b'caf\xc3\xa9'
print(cafe[0])   # 99  (int: ASCII code for 'c')
print(cafe[:1])  # b'c'  (bytes of length 1)
print(type(cafe[0]), type(cafe[:1]))  # <class 'int'> <class 'bytes'>
```
- `cafe[0]` gives you the integer value of the first byte (`99`).
- `cafe[:1]` gives you a `bytes` object of length 1: `b'c'`.

#### Example: `bytearray`

```python
cafe_arr = bytearray(cafe)
print(cafe_arr[-1])    # 169 (int: last byte)
print(cafe_arr[-1:])   # bytearray(b'\xa9') (bytearray of length 1)
```
- `cafe_arr[-1]` is an integer (the last byte).
- `cafe_arr[-1:]` is a `bytearray` object of length 1.

#### Example: `list`

```python
lst = [10, 20, 30]
print(lst[0])   # 10
print(lst[:1])  # [10]
```
- `lst[0]` is an int.
- `lst[:1]` is a list of length 1.

## Why Is This Surprising?

- In Python's `str` type, `s[0]` and `s[:1]` both return a string (the first character), which is a bit special.
- For all other sequences, **indexing returns a single element (not a sequence), slicing returns a sequence**.

## Mental Model

- **Indexing** (`seq[i]`): Returns the element at position `i` (type depends on the sequence).
- **Slicing** (`seq[i:j]`): Returns a new sequence of the same type, containing elements from `i` to `j-1`.

## References

- [Fluent Python, 2nd Edition, Chapter 4: Text versus Bytes](https://www.oreilly.com/library/view/fluent-python-2nd/9781492056348/)
- [Python Data Model: Sequence Types](https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range)

**In summary:**
For `bytes`, `bytearray`, `list`, etc., `seq[0]` returns a single element (e.g., an int), while `seq[:1]` returns a new sequence of the same type containing just that element. This is different from `str`, where both return a string. This distinction is important when working with binary data or non-string sequences in Python.

Basic Encoders/Decoders #

Highlight on page 153 #

Contents #

Each codec has a name, like ‘utf_8’

Comment #

so utf-8 is a codec here

Highlight on page 154 #

Contents #

like ASCII and even the multibyte GB2312, cannot represent every Unicode character. The UTF encodings, however, are designed to handle every Unicode code point.

Understanding Encode/Decode Problems #

Errors that we can expect:

  1. (generic) UnicodeError
    1. UnicodeDecodeError
    2. UnicodeEncodeError
  2. When loading modules, you might also end up facing a SyntaxError because of encoding issues in the source file
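A minimal sketch showing both subclasses being raised:

```python
# Encoding fails when the target codec can't represent a character
try:
    'São Paulo'.encode('ascii')
except UnicodeEncodeError as exc:
    print(type(exc).__name__)    # UnicodeEncodeError

# Decoding fails when the bytes are invalid for the codec
try:
    b'Montr\xe9al'.decode('utf_8')   # \xe9 is not valid UTF-8 here
except UnicodeDecodeError as exc:
    print(type(exc).__name__)    # UnicodeDecodeError

# Both are subclasses of the generic UnicodeError
print(issubclass(UnicodeEncodeError, UnicodeError))  # True
```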

Coping with UnicodeEncodeError #

  • the error handlers for encoding errors include 'xmlcharrefreplace'. What this does is substitute an XML character reference, &#<code point>;, and in so doing there is no loss of that information. Here’s more context on it:
      The statement from "Fluent Python"—
    
      > 'xmlcharrefreplace' replaces unencodable characters with an XML entity. If you can’t use UTF, and you can’t afford to lose data, this is the only option.
    
      —means that when you encode a string using a limited encoding (like ASCII) and specify `errors='xmlcharrefreplace'`, **any character that cannot be represented in the target encoding is replaced with an XML numeric character reference** (e.g., `&#233;` for "é"). This ensures that **no information is lost**: all original characters are either encoded directly (if possible) or represented as XML entities, which are reversible.
    
      ### How does it work?
    
      - When encoding, Python checks each character:
        - If the character can be encoded in the target encoding (e.g., ASCII), it is kept as-is.
        - If it cannot, it is replaced with its XML character reference: `&#<code point>;`
      - When decoding, you can later convert these references back to the original characters, so the process is *lossless* in terms of information content.
    
      #### Example
    
      ```python
      txt = "Café"
      encoded = txt.encode("ascii", errors="xmlcharrefreplace")
      print(encoded)  # b'Caf&#233;'
      ```
      Here, "é" (which is not in ASCII) is replaced with `&#233;`, preserving the character information[2][7].
    
      ### Why is there no data loss?
    
      - **All original characters are represented:** Characters that can't be encoded are replaced with their numeric reference, which uniquely identifies the character.
      - **Reversible:** You can later parse the XML entities back into the original Unicode characters, restoring the original string[1][6].
    
      ### Contrast with other error handlers
    
      - `'replace'` swaps unencodable characters for `?` (data loss).
      - `'ignore'` simply omits them (data loss).
      - `'backslashreplace'` uses Python escape sequences (reversible, but not standard in XML/HTML).
      - `'xmlcharrefreplace'` uses XML/HTML-compatible numeric references (reversible, and standard for text interchange).
    
      ### Practical implication
    
      If you must encode text in a limited character set (like ASCII or Latin-1) but need to ensure that all characters are preserved in some form (for later recovery or interoperability), `'xmlcharrefreplace'` is the safest choice[4][6][7].
    
      **In summary:**
      Using `'xmlcharrefreplace'` means that **no original character data is lost**—all characters are either encoded directly or replaced with a reversible XML entity. This is why the book says it is the only option if you can't use UTF and can't afford to lose data.
    
      [1] https://stackoverflow.com/questions/44293891/python-string-encoding-xmlcharrefreplace-decode
      [2] https://www.w3schools.com/python/ref_string_encode.asp
      [3] https://docs.python.org/3/howto/unicode.html
      [4] https://docs.python.org/3/library/codecs.html
      [5] https://www.codecademy.com/resources/docs/python/strings/encode
      [6] https://code.activestate.com/recipes/303668-encoding-unicode-data-for-xml-and-html/
      [7] https://www.geeksforgeeks.org/python/python-strings-encode-method/
      [8] https://www.digitalocean.com/community/tutorials/python-string-encode-decode
      [9] https://labex.io/tutorials/python-what-is-the-role-of-the-encoding-and-errors-parameters-in-the-str-function-in-python-395133
      [10] https://docs.vultr.com/python/standard-library/str/encode
    

Coping with UnicodeDecodeError #

Highlight on page 156 #

  • Contents

    On the other hand, many legacy 8-bit encodings like ‘cp1252’, ‘iso8859_1’, and ‘koi8_r’ are able to decode any stream of bytes, including random noise, without reporting errors. Therefore, if your program assumes the wrong 8-bit encoding, it will silently decode garbage.

  • Comment

    UTF-8/UTF-16 will raise an error because they perform strict validity checks

    the older 8-bit codecs will decode garbage silently

Highlight on page 157 #

  • Contents

    “�” (code point U+FFFD), the official Unicode REPLACEMENT CHARACTER intended to represent unknown characters.

  • Comment

    there’s an official REPLACEMENT CHARACTER

SyntaxError When Loading Modules with Unexpected Encoding #

  • utf8 default for python source code
  • fix this by defining explicitly what encoding type to use at the top of the file when writing that file out.
      # coding: cp1252
    
    OR just fix it by converting to UTF-8

How to Discover the Encoding of a Byte Sequence #

  • you can’t know for sure, but you can make a good guess
  • chardet exists for this reason: it performs statistical detection of the likely encoding

Highlight on page 159 #

  • Contents

    human languages also have their rules and restrictions, once you assume that a stream of bytes is human plain text, it may be possible to sniff out its encoding using heuristics and statistics. For example, if b'\x00' bytes are common, it is probably a 16- or 32-bit encoding, and not an 8-bit scheme, because null characters in plain text are bugs. When the byte sequence b'\x20\x00' appears often, it is more likely to be the space character (U+0020) in a UTF-16LE encoding, rather than the obscure U+2000 EN QUAD character—whatever that is. That is how the package “Chardet—The Universal Character Encoding Detector” works to guess one of more than 30 supported encodings. Chardet is a Python library that you can use in your programs, but also includes a command-line utility, chardetect.

  • Comment

    typically an encoding is declared – so you have to be told what encoding it is

    however, it’s possible to guess probabilistically what the encoding could be.

    there are packages for that (Chardet)

BOM: A Useful Gremlin #

  • Byte-Order Mark: tells us whether the bytes were written on a little-endian or big-endian machine.
  • endianness is only an issue for encodings whose code units are wider than one byte (so for UTF-16 and UTF-32) ==> so a BOM only matters for them
  • so a BOM is not needed for UTF-8
  • but it can still be added in (discouraged though)
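A small stdlib sketch: utf_8_sig prepends the (discouraged) UTF-8 BOM, while the explicit-endian UTF-16 codecs encode byte order in their names and need no BOM at all.

```python
import codecs

# utf_8_sig adds the UTF-8 BOM b'\xef\xbb\xbf'
print('café'.encode('utf_8_sig')[:3])        # b'\xef\xbb\xbf'

# explicit-endian codecs emit no BOM — the byte order is in the name
print('A'.encode('utf_16_le'))               # b'A\x00'
print('A'.encode('utf_16_be'))               # b'\x00A'

# the little-endian UTF-16 BOM constant itself
print(codecs.BOM_UTF16_LE)                   # b'\xff\xfe'
```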

Highlight on page 160 #

  • Contents

    UTF-16 encoding prepends the text to be encoded with the special invisible character ZERO WIDTH NO-BREAK SPACE (U+FEFF).

Highlight on page 160 #

  • Contents

    This whole issue of endianness only affects encodings that use words of more than one byte, like UTF-16 and UTF-32

Highlight on page 161 #

  • Contents

    using UTF-8 for general interoperability. For example, Python scripts can be made executable in Unix systems if they start with the comment: #!/usr/bin/env python3. The first two bytes of the file must be b’#!’ for that to work, but the BOM breaks that convention. If you have a specific requirement to export data to apps that need the BOM, use UTF-8-SIG but be aware that Python’s codecs documentation says: “In UTF-8, the use of the BOM is discouraged and should generally be avoided.”

  • Comment

use UTF-8-SIG if a BOM is required; for reading, it is harmless either way

    also note that the python codecs documentation says that in utf8, using a BOM (byte order mark) is discouraged.

Handling Text Files & the “Unicode Sandwich” #

Here’s the gist of why it’s “unicode sandwich”

  1. decode bytes on input
  2. process text only (the meat of the sandwich is the business logic that should use strings)
  3. encode text on output

The best practice for handling text I/O is the “Unicode sandwich” (Figure 4-2).5 This means that bytes should be decoded to str as early as possible on input (e.g., when opening a file for reading). The “filling” of the sandwich is the business logic of your program, where text handling is done exclusively on str objects. You should never be encoding or decoding in the middle of other processing. On output, the str are encoded to bytes as late as possible.
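The sandwich in file I/O might look like this (a sketch using a throwaway temp file; the filename is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')

# output slice: encode as late as possible, with an explicit encoding
with open(path, 'w', encoding='utf_8') as f:
    f.write('café')

# input slice: decode as early as possible — all later logic sees only str
with open(path, encoding='utf_8') as f:
    text = f.read()

print(text == 'café')   # True
```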


Highlight on page 162 #

Contents #

Code that has to run on multiple machines or on multiple occasions should never depend on encoding defaults. Always pass an explicit encoding= argument

Comment #

cross-platform code should always explicitly define the encoding value!

so Unix machines will use UTF-8, but a Windows machine may default to a different encoding (often cp1252), which can cause encoding issues

Highlight on page 163 #

Contents #

TextIOWrapper with the encoding set to a default from the locale

Beware of Encoding Defaults #

even within, say, Windows itself, not every application uses the same encoding.

for Unix it’s more standardised, so the default is most likely UTF-8

Defaults #

Main thing to remember is that the most important encoding setting is the one that is returned by locale.getpreferredencoding()

The changes can be effected by changing the environment variables.

Normalizing Unicode for Reliable Comparisons #

  • canonical equivalents exist, but they have different code points under the hood.
  • there’s a bunch of different normalisation forms; for extra safety, when saving strings, normalise them first (using NFC normalisation, for example)
  • gotcha: NFC normalises some single characters into other single characters that look identical but compare unequal before normalisation (e.g., U+2126 OHM SIGN becomes U+03A9 GREEK CAPITAL LETTER OMEGA)
  • string normalisation can be lossy, so repeated rounds of decoding, normalising, and re-encoding can cause actual data loss
    • NFKC and NFKD are examples of such lossy normalisation forms - these forms should only be used as intermediate representations for search & indexing
  • NFC is not sufficient for search and indexing because it preserves compatibility distinctions that are irrelevant (and even counterproductive) for matching. NFKC/NFKD are used as intermediate representations for search and indexing because they erase these distinctions, enabling robust, user-friendly search behavior—at the cost of losing some original form information, which is why they are not used for storage or display. See more info here:
      To understand why **NFC normalization is not always suitable for search and indexing**, and why compatibility forms like **NFKC/NFKD** are often used as intermediate representations for these purposes, let's clarify the properties and goals of each normalization form and their implications for search/index use cases.
    
      ### **NFC vs. NFKC: What’s the Difference?**
    
      - **NFC (Normalization Form C, Canonical Composition):**
        - Collapses canonically equivalent sequences into a single, composed form.
        - Preserves distinctions between characters that are *compatible* but not *canonically equivalent* (e.g., ligatures, superscripts, full-width vs. half-width characters).
        - Designed to be *lossless* for textual content, so that round-tripping (normalize, then denormalize) does not lose data[4][7].

      - **NFKC (Normalization Form KC, Compatibility Composition):**
        - Collapses both canonically equivalent and *compatibility equivalent* sequences.
        - This means it will, for example, convert the ligature 'ﬁ' (U+FB01) to 'fi', or full-width Latin letters to their standard forms.
        - This process is **lossy**: information about the original form (e.g., that a ligature or superscript was used) is lost[4][7].
    
      ### **Why Not Use NFC for Search and Indexing?**
    
      **NFC is designed to preserve distinctions that are meaningful in text rendering or data storage, but are often *not* meaningful for search and indexing.**
      For example:
      - The string "field" could be encoded as:
        - `U+0066 U+0069 U+0065 U+006C U+0064` (plain "field")
        - `U+FB01 U+0065 U+006C U+0064` ("ﬁeld", with the ligature 'ﬁ')
      - **NFC will *not* turn the ligature into "fi"; it will keep the distinction.**
        This means a search for "field" will not match a document containing "ﬁeld" (with the ligature) if both are normalized to NFC[4][7].

      **For search and indexing, you often want:**
      - "field" and "ﬁeld" to be treated as equivalent.
      - Full-width and half-width forms, superscripts, circled numbers, etc., to be treated as their base forms for matching.
    
      **NFKC/NFKD normalization does this:**
      - It "folds" away these compatibility distinctions, making visually or semantically similar forms compare equal[1][4][7].
    
      ### **Special Role of Compatibility Normalization for Search/Index**
    
      - **Intermediate representation**:
        You normalize both the indexed data and the search queries to NFKC/NFKD, so that *compatibility* distinctions are erased and matches are more robust.
      - **Not for storage or display**:
        Because NFKC/NFKD is lossy, you shouldn't use it for storing or displaying user data—only for the internal process of searching and indexing[1][4].
    
      ### **Summary Table**
    
      | Use Case        | Recommended Normalization | Rationale                                                    |
      |-----------------|--------------------------|--------------------------------------------------------------|
      | Storage/Display | NFC                      | Preserves all distinctions, lossless for most text           |
      | Search/Index    | NFKC/NFKD                | Folds away compatibility differences for robust matching      |
    
      ### **References from Results**
      - "NFC removes the distinction between equivalent characters, while preserving the distinction between compatible characters or sequences; NFKC removes the distinction between both equivalent and compatible sequences. NFC conversion is not considered lossy, but NFKC conversion is." [4]
      - "It's more explicitly lossy; the compatibility forms are recommended for performing searches since they fold additional characters such as plain latin and 'full-width' latin letters." [1]
      - "The compatibility normalizations are super useful, because ... you often want a search query for `ffi` to match on `ﬃ` (and vice versa). But they are intended to lose symbolic representation (ﬃ and ffi are now the same thing with no way to distinguish), where the canonical normalizations are not." [7]
    
      ### **In summary**
    
      **NFC** is not sufficient for search and indexing because it preserves compatibility distinctions that are irrelevant (and even counterproductive) for matching.
      **NFKC/NFKD** are used as intermediate representations for search and indexing because they erase these distinctions, enabling robust, user-friendly search behavior—at the cost of losing some original form information, which is why they are not used for storage or display.
    
      [1] https://www.mediawiki.org/wiki/Unicode_normalization_considerations
      [2] https://stackoverflow.com/questions/15985888/when-to-use-unicode-normalization-forms-nfc-and-nfd
      [3] https://unicode.org/reports/tr15/
      [4] https://jazz.net/wiki/bin/view/LinkedData/UseOfUnicodeNormalForms
      [5] https://www.w3.org/wiki/I18N/CanonicalNormalizationIssues
      [6] https://blog.reeset.net/archives/2532
      [7] https://news.ycombinator.com/item?id=19379965
      [8] https://go.dev/blog/normalization
      [9] https://www.reddit.com/r/programming/comments/b09c0j/when_zo%C3%AB_zo%C3%AB_or_why_you_need_to_normalize_unicode/
      [10] https://unicode-org.github.io/icu/design/normalization/custom.html
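NFKC's lossy compatibility folding can be seen directly with the stdlib:

```python
from unicodedata import normalize

lig = '\ufb01'                         # the 'fi' ligature, U+FB01
print(normalize('NFC', lig) == lig)    # True — NFC preserves it
print(normalize('NFKC', lig))          # 'fi' — NFKC folds it (lossy)
print(normalize('NFKC', '4\u00b2'))    # '42' — superscript 2 folded too
```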
    

Notes for page 140 V: 39% H: 25% #

sequences like ‘é’ and ’e\u0301’ are called “canonical equivalents,” and applications are supposed to treat them as the same. But Python sees two different sequences of code points, and considers them not equal.

Notes for page 140 V: 82% H: 50% #

it may be good to normalize strings with normalize(‘NFC’, user_text) before saving.
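A minimal check of canonical equivalence and the NFC fix:

```python
from unicodedata import normalize

s1 = 'café'            # composed: é is one code point (U+00E9)
s2 = 'cafe\u0301'      # decomposed: 'e' + COMBINING ACUTE ACCENT
print(s1 == s2)                                      # False
print(len(s1), len(s2))                              # 4 5
print(normalize('NFC', s1) == normalize('NFC', s2))  # True
```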

Case Folding (normalisation transformation) #

  • folding everything into lowercase
  • NOTE: casefold() and str.lower() have ~ 300 code points that return different results
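Two well-known cases where the results differ:

```python
micro = '\u00b5'       # MICRO SIGN 'µ'
print(micro.lower() == micro)    # True — lower() leaves it alone
print(micro.casefold())          # 'μ' — GREEK SMALL LETTER MU (U+03BC)

eszett = '\u00df'      # LATIN SMALL LETTER SHARP S 'ß'
print(eszett.lower())            # 'ß' — unchanged
print(eszett.casefold())         # 'ss'
```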

Utility Functions for Normalized Text Matching #

util functions that might help:

  • nfc_equal
  • fold_equal
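They can be defined along these lines (a sketch of what the names suggest: NFC-normalized comparison, with and without case folding):

```python
from unicodedata import normalize

def nfc_equal(str1, str2):
    """Compare after canonical (NFC) normalization."""
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    """Compare after NFC normalization plus case folding."""
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())

print(nfc_equal('café', 'cafe\u0301'))   # True
print(fold_equal('A', 'a'))              # True
print(nfc_equal('A', 'a'))               # False
```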

Extreme “Normalization”: Taking Out Diacritics #

  • google search uses this aggressive normalisation based on real world attention that people give to diacritics
  • also helps for readable URLs (e.g for latin-based languages)
  • one way to call this transformation is “shaving”. We “shave” the diacritics
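One possible sketch of such a "shaving" function, using NFD decomposition plus unicodedata.combining to drop the marks:

```python
import unicodedata

def shave_marks(txt):
    """Remove all diacritic marks (lossy — use with care)."""
    # decompose into base characters + combining marks
    norm_txt = unicodedata.normalize('NFD', txt)
    # drop every combining character
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c))
    # recompose what remains
    return unicodedata.normalize('NFC', shaved)

print(shave_marks('café açaí'))   # 'cafe acai'
```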

Sorting Unicode Text #

  • python sorts sequences by comparing their items one by one
  • for strings, it compares code points
  • so to sort non-ASCII text in python, use locale.strxfrm as the sort key for locale-aware comparisons
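Plain sorted() shows the problem (same fruit list as the book's example):

```python
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sorted(fruits))
# ['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
# wrong for Portuguese: 'ç' (U+00E7) has a higher code point than any
# unaccented letter, so 'açaí' lands after 'atemoia' instead of first
```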

Sorting with the Unicode Collation Algorithm #

stdlib solution: there’s a locale.strxfrm to do locale-specific comparisons #

The standard way in Python is to use the locale.strxfrm function which, according to the locale module docs, “transforms a string to one that can be used in locale-aware comparisons.”

```python
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)
```

use the Unicode Collation Algorithm via pyuca lib #

The Unicode Database #

Db is in the form of multiple text files.

Contains:

  • code point to char name mappings
  • metadata about the individual characters and how they are related.

That’s how the str methods isalpha, isprintable, isdecimal, and isnumeric work.

Finding Characters by Name #

use name() function from the unicodedata library
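For example, with lookup() as the inverse mapping:

```python
from unicodedata import lookup, name

print(name('A'))          # 'LATIN CAPITAL LETTER A'
print(name('é'))          # 'LATIN SMALL LETTER E WITH ACUTE'
print(name('\u2126'))     # 'OHM SIGN'

# lookup() is the inverse: name → character
print(lookup('OHM SIGN') == '\u2126')   # True
```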

Numeric Meaning of Characters #

Some useful string functions here:

  1. .isnumeric()
  2. .isdecimal()

These compare based on the human meaning of the character rather than just its code point.
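For example, the vulgar fraction one half is numeric but not decimal:

```python
from unicodedata import numeric

half = '\u00bd'   # VULGAR FRACTION ONE HALF: '½'
print(half.isnumeric())   # True  — it has a numeric meaning
print(half.isdecimal())   # False — it can't form base-10 numbers
print(numeric(half))      # 0.5

print('7'.isdecimal())    # True
print('7'.isnumeric())    # True
```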

common string functions may lookup this unicode database #

This is responsible for string functions like isdecimal and isnumeric

the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the str methods isalpha, isprintable, isdecimal, and isnumeric work. str.casefold also uses information from a Unicode table.

Dual-Mode str and bytes APIs #

str Versus bytes in Regular Expressions #

  • with bytes patterns, \d and \w will only match ASCII characters
  • with str patterns, \d and \w will also match Unicode digits and word characters beyond ASCII

to make one point: you can use regular expressions on str and bytes, but in the second case, bytes outside the ASCII range are treated as nondigits and nonword characters.
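A minimal sketch of the difference, using Devanagari digits:

```python
import re

text_str = 'Ramanujan saw १७२९ as 1729'
text_bytes = text_str.encode('utf_8')

print(re.findall(r'\d+', text_str))     # ['१७२९', '1729'] — str \d matches Unicode digits
print(re.findall(rb'\d+', text_bytes))  # [b'1729']        — bytes \d is ASCII-only
```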

str Versus bytes in os Functions #

  • os functions accept both str and bytes arguments and abide by the Unicode Sandwich: str filenames are converted using the codec named by sys.getfilesystemencoding() as soon as possible
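A small sketch (the listdir results depend on the current directory, so only the result types are shown):

```python
import os

# str argument → str results; bytes argument → bytes results
names_str = os.listdir('.')
names_bytes = os.listdir(b'.')
print(all(isinstance(n, str) for n in names_str))      # True
print(all(isinstance(n, bytes) for n in names_bytes))  # True

# os.fsencode/os.fsdecode convert using the filesystem encoding
print(os.fsencode('café'))   # bytes; exact value depends on the platform
```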

Chapter Summary #

  1. remember that 1 char == 1 byte only holds for ASCII and other single-byte encodings; in general, one character may take several bytes
  2. always be explicit about encodings when reading or writing text. Follow the Unicode sandwich and ensure the encoding is explicit always.
  3. Unicode provides multiple ways of representing some characters, so normalizing is a prerequisite for text matching.

Further Reading #