Understanding UTF-8 Character Encoding


Many programmers have long been confused by the differences among UTF-8, Unicode, ASCII, CP936, GB2312, and so on. Why are we so often told to use UTF-8 in our code? Character encoding is not a hard problem, but it is often explained unclearly. To understand it, we should first separate the encoding method from the character set.

As Joel Spolsky says: please do not write another line of code until you finish reading his article. I recommend you take 15 minutes to read mine; it's easier than Joel's.

In the early computer era (the 1970s?), the world was simple, and the computer was born in the US. The computer scientists (or software engineers?) treated English as the only language computers needed. They designed all information to be represented by the 26 letters of the alphabet and a handful of other symbols; thus, they invented ASCII, with 128 code points covering the English letters, digits, punctuation, and control characters.

As we know, computers store everything in binary bits, each a 0 or 1. Seven bits (2^7 = 128) are enough to store all ASCII code points, but today's computers use 8 bits (2^8 = 256) for a byte, because a power-of-two width allows easy alignment and arithmetic. That's why a byte is not 7 or 9 bits. If a byte were 16 or 32 bits, it would waste too much space storing 128 ASCII code points. Therefore, to this day, a byte holding an ASCII code always starts with 0; the high bit is simply unused.

People soon found it hard for computers to represent the characters of other languages. We can draw the words of any language on paper by hand, but a computer cannot; we must first convert our language to numbers, just like the ASCII table does. In 1991, Unicode was introduced; it assigns every character of every language a number (a code point) in one table. For ASCII compatibility, the first 128 code points are the same as ASCII. As of March 2020, Unicode defined a total of 143,859 characters. A 32-bit integer can represent every character, but that doesn't mean every character needs 32 bits of storage.

For example, the character 'a' has code point 0x61, the same as in ASCII, so we can store it in just one byte. The character '文' has Unicode code point U+6587, which UTF-8 stores in three bytes (e6 96 87).
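This is easy to verify in Python, whose strings are Unicode and whose `encode` method produces the stored bytes:

```python
# 'a' (code point 0x61) fits in one byte; '文' (U+6587) needs three.
assert hex(ord('文')) == '0x6587'               # the Unicode code point
assert 'a'.encode('utf-8') == b'\x61'           # one byte, same as ASCII
assert '文'.encode('utf-8') == b'\xe6\x96\x87'  # three bytes: e6 96 87
```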

To save storage, character lengths should be variable, and the bytes themselves should tell us how many bytes each character occupies.

So, let’s check how UTF-8 is implemented.

UTF-8 uses one byte to represent ASCII (0~127).

If a code point is larger than 127, things change. The encoding splits into two kinds of bytes: a lead byte starting with 11, whose run of 1 bits tells how many bytes the character uses, and bytes starting with 10, which we call follow (continuation) bytes.

Here is UTF-8 encoding, where x represents storage data:

0000 0000-0000 007F | 0xxxxxxx                               //ASCII
0000 0080-0000 07FF | 110xxxxx 10xxxxxx                      //Two bytes
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx             //Three bytes
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    //Four bytes

If a character's lead byte starts with 110, the character has two bytes; 1110 means three bytes, and 11110 means four bytes. UTF-8 uses at most 4 bytes per character; setting aside the marker bits, the four-byte form leaves 3 + 6 + 6 + 6 = 21 payload bits (the x positions above), i.e. 2^21 = 2,097,152 possible code points, though Unicode itself stops at U+10FFFF.
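A minimal encoder in Python makes the table concrete. This is a sketch: `utf8_encode` is my own helper name, and a real encoder would also reject surrogates and out-of-range inputs.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the UTF-8 table (a sketch)."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18,
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

# Agrees with Python's built-in encoder:
assert utf8_encode(ord('a')) == 'a'.encode('utf-8')
assert utf8_encode(ord('文')) == '文'.encode('utf-8')
```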

There is a question here: why do follow bytes start with 10?

In network transfer, information arrives byte by byte. If a byte starts with 0, we know it is an ASCII byte; if it starts with 10, it is a follow byte. If a byte is lost, we can discard the follow bytes up to the next lead byte and resynchronize, which prevents the half-character problem. Likewise, when an editor deletes a character, it can scan backward to the nearest byte that does not start with 10, which is easy and simple.
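The trick is just a bit test: a byte is a follow byte exactly when its top two bits are 10. A sketch in Python (the helper name is my own):

```python
def char_starts(data: bytes) -> list:
    # A byte starts a character unless its top two bits are 10.
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

data = 'a文b'.encode('utf-8')          # bytes: 61 e6 96 87 62
assert char_starts(data) == [0, 1, 4]  # 'a' at 0, '文' at 1, 'b' at 4

# Deleting the last character = truncating at the last start index.
assert data[:char_starts(data)[-1]].decode('utf-8') == 'a文'
```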

As you see above, UTF-8 is byte-oriented. That brings two benefits: we never need to worry about big-endian versus little-endian formats, and some old C library functions remain compatible with UTF-8. strcmp works because comparing UTF-8 strings byte by byte gives the same order as comparing code points; strlen, however, returns the byte length rather than the character count, because many Unicode characters occupy more than one byte.
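Both points are easy to demonstrate; sketched here in Python, where `len` on the encoded bytes plays the role of C's strlen:

```python
s = 'a文'
utf8 = s.encode('utf-8')
assert len(utf8) == 4   # what strlen would see: 1 byte for 'a' + 3 for '文'
assert len(s) == 2      # the actual character (code point) count

# Byte-wise comparison agrees with code-point order, so strcmp-style logic works:
assert ('b'.encode('utf-8') < '文'.encode('utf-8')) == (ord('b') < ord('文'))
```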

That’s why UTF-8 is the most popular encoding. Its benefits:

  1. Fully compatible with ASCII.
  2. Variable length encoding.
  3. Error-tolerant, easy to encode and decode.
  4. Byte-oriented, no byte order problems.

You can treat every character encoding as two parts: a character table (the set of code points) and a scheme for storing those code points as bytes. With this separation, terms like CP936, GB2312, and other encoding methods will no longer confuse you.
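For instance, GB2312 pairs its own character table with its own storage scheme, so the same character ends up as different bytes; a quick check in Python (assuming its bundled gb2312 codec):

```python
ch = '文'
assert ch.encode('utf-8') == b'\xe6\x96\x87'      # three bytes under UTF-8
assert len(ch.encode('gb2312')) == 2              # two bytes under GB2312
assert ch.encode('utf-8') != ch.encode('gb2312')  # same character, different bytes
```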


Note: This post was originally published on liyafu.com (One of our makers' personal blog)