Python Bytes vs Strings

In Python, bytes and str are two data types that represent sequences of character data. A bytes instance contains raw, unsigned 8-bit values, often displayed using ASCII, a standard character encoding. A str instance is an immutable sequence of Unicode code points that represent textual characters from human languages. The example below demonstrates the underlying difference in the representation of bytes and strings.
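As a minimal sketch of that difference, the snippet below holds the same word in both types. The word "café" and the variable names are illustrative choices, not taken from the original article.

```python
# The same word held as text (str) and as raw UTF-8 values (bytes).
text = "caf\u00e9"        # str: four Unicode code points, i.e. "café"
raw = b"caf\xc3\xa9"      # bytes: five 8-bit values

print(type(text), len(text))  # <class 'str'> 4
print(type(raw), len(raw))    # <class 'bytes'> 5

print(list(text))  # ['c', 'a', 'f', 'é']
print(list(raw))   # [99, 97, 102, 195, 169]

# Printing bytes shows ASCII where possible and \x escapes elsewhere.
print(raw)         # b'caf\xc3\xa9'
```

The accented letter is one character in the str but two values in the bytes object, so the two lengths differ.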

A computer can only store bytes (binary data). However, a bytes literal is not always human readable, as seen from the example above. The mapping between bytes and strings is achieved through encoding and decoding. To convert bytes into a str instance, call the decode method, either specifying the encoding or relying on the default, which in Python 3 is UTF-8, a Unicode character encoding. Conversely, a string can be converted into bytes with the encode method.
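A short sketch of the round trip, using only the built-in bytes.decode and str.encode methods. The sample value is an assumption made for illustration.

```python
raw = b"caf\xc3\xa9"  # UTF-8 encoded bytes for "café"

# bytes -> str: name the encoding explicitly, or omit it to get UTF-8.
decoded_explicit = raw.decode("utf-8")
decoded_default = raw.decode()
print(decoded_explicit)  # café

# str -> bytes: the reverse direction.
encoded = decoded_explicit.encode("utf-8")
print(encoded == raw)    # True

# Decoding with the wrong encoding does not always fail. It can
# silently produce the wrong characters instead.
mojibake = raw.decode("latin-1")
print(mojibake)          # cafÃ©

# Some mismatches do fail: ASCII cannot represent values above 127.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
    print("decode failed:", exc.reason)
```

Because a wrong encoding can decode without raising an error, it is usually safer to state the encoding explicitly when it is known.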

A useful convention to follow is the Unicode sandwich. Use str instances containing Unicode text within the meat of a program, and only perform encoding and decoding between bytes and strings on the outer layers or boundaries of the program. In effect, a program can accept inputs in any number of text encodings, such as Latin-1 or UTF-16, while standardizing on a single output encoding like UTF-8.
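The sandwich can be sketched as three small functions. The names to_text, shout, and to_wire are hypothetical helpers invented for this illustration, and the processing step is a stand-in for real program logic.

```python
def to_text(data: bytes, encoding: str) -> str:
    """Input boundary: decode incoming bytes exactly once."""
    return data.decode(encoding)


def shout(text: str) -> str:
    """The meat of the program: works with str only."""
    return text.upper()


def to_wire(text: str) -> bytes:
    """Output boundary: always encode as UTF-8."""
    return text.encode("utf-8")


# Two inputs carrying the same text in different encodings.
latin1_input = "café".encode("latin-1")  # b'caf\xe9'
utf16_input = "café".encode("utf-16")    # starts with a byte order mark

out_from_latin1 = to_wire(shout(to_text(latin1_input, "latin-1")))
out_from_utf16 = to_wire(shout(to_text(utf16_input, "utf-16")))

print(out_from_latin1)                    # b'CAF\xc3\x89'
print(out_from_latin1 == out_from_utf16)  # True
```

Because decoding happens once on the way in and encoding once on the way out, the middle function never needs to know which encoding the input used.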

Several gotchas can occur during a program's execution because bytes and strings are incompatible with each other. Understanding the differences between the two data types, and how encoding works, is key to avoiding unforeseen errors.
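The article does not list the gotchas, so the three below are chosen purely for illustration. They are commonly encountered incompatibilities, and the variable names are hypothetical.

```python
some_bytes = b"foo"
some_text = "foo"

# 1. bytes and str cannot be concatenated with +.
concat_error = None
try:
    some_bytes + some_text
except TypeError as exc:
    concat_error = exc
    print("cannot mix types:", exc)

# 2. bytes and str never compare equal, even when they look alike.
never_equal = some_bytes == some_text
print(never_equal)  # False

# 3. Indexing bytes gives an int, while indexing str gives a str.
first_byte = some_bytes[0]
first_char = some_text[0]
print(first_byte)   # 102
print(first_char)   # f
```

The second case is easy to miss because it raises no error. The comparison simply evaluates to False.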

Published January 31, 2022