By convention, Unicode code points are written as U+ followed by at least four upper-case hexadecimal digits (more only if needed).
Code point | Character | Name |
---|---|---|
U+005D | ] | RIGHT SQUARE BRACKET |
U+00F1 | ñ | LATIN SMALL LETTER N WITH TILDE |
U+042F | Я | CYRILLIC CAPITAL LETTER YA |
U+2622 | ☢ | RADIOACTIVE SIGN |
U+1F3A9 | 🎩 | TOP HAT |
RIGHT SQUARE BRACKET is written U+005D, not U+5D. You could also call it Unicode character #93, but don’t.
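
The same conventions are easy to play with in code; here is a quick Python sketch (Python is not part of the original material, and the `:04X` format is just one way to get the "at least four digits" padding):

```python
print(ord("]"))              # 93        -- the integer value
print(hex(ord("]")))         # 0x5d
print(f"U+{ord(']'):04X}")   # U+005D    -- padded to at least four hex digits
print(f"U+{0x1F3A9:04X}")    # U+1F3A9   -- more digits only when needed
print(chr(0x5D))             # ]
```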
A small sample of what Unicode covers:

- ASCII: A-Z
- Other Latin: ä ñ « »
- Cyrillic: Я
- Hebrew: א
- Chinese: ⿂
- Japanese: ア
- Dingbats: ✈ ☞ ✌ ✔ ✰ ☺ ♥ ♦ ♣ ♠ •
- Emoji: 🐱
- Egyptian hieroglyphics: 𓁥
- Mathematics: ∃ 𝒋 : 𝒋 ∉ ℝ
- Musical notation: 𝄞 𝄵 𝆖 𝅘𝅥𝅮
- no Klingon ☹
All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt
To start with, Unicode is all about mapping integers (code points) to characters:
Decimal | Code point | Name | Character |
---|---|---|---|
97 | U+0061 | LATIN SMALL LETTER A | a |
9786 | U+263A | WHITE SMILING FACE | ☺ |
66506 | U+103CA | OLD PERSIAN SIGN AURAMAZDAAHA | 𐏊 |
Now, do that for 128,000+ more characters.
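
Python's `unicodedata` module exposes exactly this mapping, if you want to poke at it yourself (a sketch, not part of the original table):

```python
import unicodedata

# The same three rows as the table above: decimal, code point, name, character.
for ch in ("a", "\u263a", "\U000103ca"):
    print(ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch), ch)
```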
Fine, so we’ve defined this mapping. How do we actually represent those code points in a computer? That’s the job of an encoding: a rule for turning each code point’s integer value into a sequence of bytes.

UTF-16:

    ····J···· ····a···· ····c···· ····k····
    ┌────┬────┬────┬────┬────┬────┬────┬────┐
    │ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
    └────┴────┴────┴────┴────┴────┴────┴────┘
      0    1    2    3    4    5    6    7

UTF-32:

    ········J········ ········a········ ········c········ ········k········
    ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
    │ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
    └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
      0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
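
You can reproduce those byte layouts in any language that exposes encoders. Here’s a minimal sketch in Python (not part of the original; `bytes.hex()` with a separator needs Python 3.8+):

```python
text = "Jack"

# Two bytes per character in UTF-16 (big-endian, to match the diagram).
print(text.encode("utf-16-be").hex(" "))  # 00 4a 00 61 00 63 00 6b

# Four bytes per character in UTF-32.
print(text.encode("utf-32-be").hex(" "))  # 00 00 00 4a 00 00 00 61 ...
```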
Hey, there’s a slash in this string! No, wait, there isn’t.
When using UTF-16 or UTF-32, a naïve byte-by-byte scan will falsely detect a slash (oh, excuse me, a solidus) in the encoding of U+262F YIN YANG: its UTF-16BE bytes are 26 2F, and 0x2F is the code for “/”.
Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
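
Both pitfalls are easy to demonstrate; here’s a Python sketch that scans the encoded text as raw bytes, the way a naïve byte-oriented program would:

```python
# U+262F YIN YANG in UTF-16BE is the two bytes 26 2F -- and 0x2F is "/".
yin_yang = "\u262f".encode("utf-16-be")
print(yin_yang.hex(" "))                      # 26 2f
print(b"/" in yin_yang)                       # True: a byte-level scan "finds" a slash

# Plain ASCII text in UTF-16 or UTF-32 is full of zero bytes,
# which a C-style string would treat as terminators.
print(b"\x00" in "Jack".encode("utf-16-be"))  # True
```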
UTF-8 avoids both problems with a variable-length encoding: ASCII characters stay one byte each, and every byte of a multi-byte sequence is ≥ 0x80, so it can never be mistaken for a slash, a NUL, or any other ASCII byte:

Bits | Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
7 | U+0000–U+007F | 0xxxxxxx | | | |
11 | U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
16 | U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000–U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

UTF-8:

      J    a    c    k
    ┌────┬────┬────┬────┐
    │ 4A │ 61 │ 63 │ 6B │
    └────┴────┴────┴────┘
      0    1    2    3
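
The bit patterns in the table are simple enough to implement by hand. This is only an illustrative sketch (Python, with nothing but a range check for error handling), not a production encoder:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the bit-pattern table above."""
    if cp <= 0x7F:        # 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:      # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    if cp <= 0x1FFFFF:    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("code point out of range")

# Spot-check against Python's built-in encoder, using characters from earlier:
for ch in ("J", "ñ", "☢", "🎩"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ').upper()}")
```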
U+ | Char | Description | UTF-32BE | UTF-16BE | UTF-8 |
---|---|---|---|---|---|
U+0041 | A | A | 00000041 | 0041 | 41 |
U+03A9 | Ω | Omega | 000003A9 | 03A9 | CE A9 |
U+4DCA | ䷊ | Hexagram for peace | 00004DCA | 4DCA | E4 B7 8A |
U+1F42E | 🐮 | Mooooooooo! | 0001F42E | D83D DC2E | F0 9F 90 AE |
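
The UTF-16BE entry for U+1F42E is a surrogate pair, which the bit tables above don’t cover: a code point beyond U+FFFF is split into two 10-bit halves. A sketch of that arithmetic (Python again, purely illustrative):

```python
cp = 0x1F42E
v = cp - 0x10000                   # leaves a 20-bit value
high = 0xD800 | (v >> 10)          # high surrogate: top 10 bits
low = 0xDC00 | (v & 0x3FF)         # low surrogate: bottom 10 bits
print(f"{high:04X} {low:04X}")     # D83D DC2E

# The built-in encoder agrees:
print("\U0001F42E".encode("utf-16-be").hex(" "))  # d8 3d dc 2e
```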
Often, files contain a “magic number”—initial bytes that indicate what sort of file it is.
Encoding | Bytes |
---|---|
UTF-32BE | 00 00 FE FF |
UTF-32LE | FF FE 00 00 |
UTF-16BE | FE FF |
UTF-16LE | FF FE |
UTF-8 | EF BB BF |
The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).
If it isn’t processed as a BOM, a ZERO WIDTH NO-BREAK SPACE is mostly harmless.
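
The BOM table above is just U+FEFF pushed through each encoding, which is easy to confirm (a Python sketch; the explicit `-be`/`-le` codecs don’t prepend a BOM of their own, so what you see is the character itself):

```python
# Each line should match the table above:
#   utf-32-be  00 00 FE FF
#   utf-32-le  FF FE 00 00
#   utf-16-be  FE FF
#   utf-16-le  FF FE
#   utf-8      EF BB BF
bom = "\ufeff"
for enc in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"):
    print(f"{enc:9}  {bom.encode(enc).hex(' ').upper()}")
```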
It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use `char` instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or raw bytes of data, as you would be if you were parsing a JPEG file.
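
Python is one language that does keep the two apart, which makes the distinction easy to see (a sketch; the shell examples below make the same point with `wc`):

```python
s = "\U0001F435"            # one character: U+1F435 MONKEY FACE
data = s.encode("utf-8")    # its UTF-8 bytes

print(len(s))               # 1 -- characters (code points)
print(len(data))            # 4 -- bytes
print(data.hex(" "))        # f0 9f 90 b5
```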
`echo`: `\u` takes up to four hex digits; `\U` takes up to eight hex digits.

    % echo -e '\uf1'
    ñ
    % echo -e '\U1f435'
    🐵
`wc -c` counts bytes; `wc -m` counts characters.

    % echo -e '\U1f435' | wc -c
    5
    % echo -e '\U1f435' | wc -m
    2

    % echo -e 'ABC' | xxd
    00000000: 4142 430a                                ABC.
    % echo -e '\U1f435' | xxd
    00000000: f09f 90b5 0a                             .....
    % echo -e 'ABC' | od -t x1
    0000000 41 42 43 0a
    0000004
    % echo -e '\U1f435' | od -t x1
    0000000 f0 9f 90 b5 0a
    0000005