CS253 Unicode
Unicode (ISO-10646)
- First published 1991
- Version 13.0, published March 2020, has 143,859 characters.
- Incorporates ASCII as code points 0–127, no change.
- Incorporates ISO-8859-1 (Latin-1) as code points 0–255, no change.
- Code points, not encoding (patience)
- U+0041 A LATIN CAPITAL LETTER A
- U+2FC2 ⿂ KANGXI RADICAL FISH
- U+1F355 🍕 SLICE OF PIZZA
- Meaning, not pictures
U+ notation
By convention, Unicode code points are represented as U+
followed by at least four upper-case hexadecimal digits (five or six if needed).
U+005D | ] | RIGHT SQUARE BRACKET |
U+00F1 | ñ | LATIN SMALL LETTER N WITH TILDE |
U+042F | Я | CYRILLIC CAPITAL LETTER YA |
U+2622 | ☢ | RADIOACTIVE SIGN |
U+1F3A9 | 🎩 | TOP HAT |
RIGHT SQUARE BRACKET is written U+005D, not U+5D.
You could also call it Unicode character #93, but don’t.
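The convention above can be sketched in a few lines of C++. The helper name u_plus is made up for this illustration; it is not part of any standard library.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format a code point in U+ notation: "U+" followed by at least
// four upper-case hexadecimal digits (more if the value needs them).
// u_plus is a hypothetical helper name.
std::string u_plus(std::uint32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(cp));
    return buf;
}
```

With this sketch, u_plus(93) yields U+005D and u_plus(0x1F3A9) yields U+1F3A9.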
What’s in Unicode
ASCII | ABCxyz!@#$% | Dingbats | ✈☞✌✔✰☺♥♦♣♠• |
Other Latin | áçñöªš¿ | Emoji | 🐱🎃🍎🇺🇳 |
Cyrillic | ЖЙЩЯ | Hieroglyphics | 𓁥𓂀 |
Hebrew | אבגדה | Mathematics | ∃𝒋:𝒋∉ℝ |
Chinese | ⿂ | Music | 𝄞𝄵𝆖𝅘𝅥𝅮 ♯♭ |
Japanese | ア | No Klingon | ☹ |
All Unicode “blocks”: https://unicode.org/Public/UNIDATA/Blocks.txt
Code Points
Initially, Unicode is all about mapping integers to characters:
Decimal | U+hex | Meaning | Example |
97 | U+0061 | LATIN SMALL LETTER A | a |
8804 | U+2264 | LESS-THAN OR EQUAL TO | ≤ |
66506 | U+103CA | OLD PERSIAN SIGN AURAMAZDAAHA | 𐏊 |
Now, do that for 143,000+ more characters.
These integers (usually represented as U+hexadecimal ) are
code points.
History
- Of course, the code points aren’t arbitrary.
- Code points 0–127 are ASCII.
- Code points 128–255 match the popular Latin-1 encoding, and mostly
match Microsoft’s Windows-1252 (foolishly called “ANSI”), which
differs from Latin-1 only in 128–159.
- Huge swaths of code points are taken from other character sets,
e.g., emojis from phone manufacturers.
- Other character sets, such as IBM’s EBCDIC, and HP’s Roman-8,
didn’t make the cut.
- I can only imagine the negotiations regarding
Shift-JIS, Big5, or EUC-KR.
Encoding
Fine, we’ve defined this mapping. How do we actually represent those
in a computer, in memory, or in a file on a disk? That’s the job of an
encoding. An encoding is a mapping of the bits in an integer
code point to bytes.
code point:
mapping integers to characters, without caring about how big the
integers get. This is not a computer problem. Bits & bytes have
nothing to do with it. This is a bureaucratic bookkeeping task—just a
big list.
encoding:
how to represent integer code points as bytes in a computer. For ASCII
(0–127), the mapping is simple, since an 8-bit byte can contain 2⁸=256
different values. It gets harder with more than 256 characters.
16-bit Encodings
- UCS-2:
- Fixed-length 16-bit.
- Each character is two 8-bit bytes, whether in memory, or on a disk.
- Certainly is straightforward.
- Inadequate for modern Unicode, which has many more than 2¹⁶
characters. Can’t even represent U+1F369 🍩 DOUGHNUT.
- Unicode originally had a much more modest scope, only living languages,
so that might have worked.
····J···· ····a···· ····c···· ····k····
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┘
0 1 2 3 4 5 6 7
Endian
int n = 0x01020304;
auto p = reinterpret_cast<char *>(&n);
switch (p[0]) {
case 1: cout << "Big-endian"; break;
case 4: cout << "Little-endian"; break;
default: cout << "Weird-endian"; break;
}
for (int i=0; i<4; i++)
cout << ' ' << int(p[i]);
Little-endian 4 3 2 1
- Big-endian: the first byte has the
most significant bits of the value.
- Little-endian: the first byte has the
least significant bits of the value.
- big: 2024‒11‒21 🇨🇦 🇨🇳, little: 21‒11‒2024 🇲🇽 🇬🇧, middle: 11/21/2024 🇺🇸
16-bit Encodings
UTF-16:
- Slightly variable-length: values ≤ U+FFFF take two bytes,
other values take four bytes.
- Consider U+203D ‽ INTERROBANG:
- UTF-16BE (big-endian): bytes are 20 3D
- UTF-16LE (little-endian): bytes are 3D 20
- For values ≥ U+10000 and ≤ U+10FFFF:
- Subtract out 0x10000, yielding a 20-bit number
- Emit U+D800 plus the top ten bits.
- Emit U+DC00 plus the lower ten bits.
- There are no valid code points U+D800…U+DFFF.
- 100% overhead for ASCII text.
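The surrogate arithmetic above might look like this in C++. This is a minimal sketch: to_utf16 is a hypothetical helper name, it assumes the caller passes a valid code point (at most U+10FFFF and not in U+D800…U+DFFF), and it returns 16-bit code units without choosing a byte order.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as one or two UTF-16 code units.
// Assumption: cp is valid (<= 0x10FFFF, not a surrogate); no checking is done.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (cp <= 0xFFFF)
        return { static_cast<std::uint16_t>(cp) };    // fits in one unit
    cp -= 0x10000;                                    // now a 20-bit number
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),    // top ten bits
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))   // lower ten bits
    };
}
```

For U+203D this gives the single unit 203D; for U+1F369 🍩 it gives the pair D83C DF69. Writing each unit as BE or LE bytes is a separate step.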
32-bit Encodings
UTF-32:
- straightforward rendering of the code point in binary, with the
same problems about byte order:
- UTF-32BE: big-endian version
- UTF-32LE: little-endian version
- 300% overhead for ASCII text.
- Sure, disk space is cheap, but, c’mon.
········J········ ········a········ ········c········ ········k········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
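As a rough illustration of the two byte orders, here is one way to serialize a code point as UTF-32 using shifts, so the result does not depend on the endianness of the machine running the code. The names to_utf32be and to_utf32le are invented for this sketch.

```cpp
#include <array>
#include <cstdint>

// UTF-32BE: most significant byte first.
std::array<std::uint8_t, 4> to_utf32be(std::uint32_t cp) {
    return { static_cast<std::uint8_t>(cp >> 24),
             static_cast<std::uint8_t>(cp >> 16),
             static_cast<std::uint8_t>(cp >> 8),
             static_cast<std::uint8_t>(cp) };
}

// UTF-32LE: least significant byte first.
std::array<std::uint8_t, 4> to_utf32le(std::uint32_t cp) {
    return { static_cast<std::uint8_t>(cp),
             static_cast<std::uint8_t>(cp >> 8),
             static_cast<std::uint8_t>(cp >> 16),
             static_cast<std::uint8_t>(cp >> 24) };
}
```

For 'J' (U+004A), to_utf32be gives 00 00 00 4A, matching the diagram, and to_utf32le gives 4A 00 00 00.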
False Positives
Hey, there’s a slash in this string! No, wait, there isn’t.
- U+002F / SOLIDUS
- U+262F ☯ YIN YANG
When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely
detect a slash (oh, excuse me, a solidus) in one of the bytes of
U+262F.
Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because
of the embedded zero bytes.
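To make the false positive concrete, here is a small sketch of the naïve byte-wise search. The function and variable names are made up for the example; the byte values are the UTF-16BE and UTF-32BE forms of U+262F.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Naive search: is any single byte equal to ASCII '/' (0x2F)?
// This ignores the encoding entirely, which is exactly the bug.
bool naive_has_slash(const std::vector<std::uint8_t>& bytes) {
    return std::find(bytes.begin(), bytes.end(), 0x2F) != bytes.end();
}

// U+262F YIN YANG, which contains no slash at all.
const std::vector<std::uint8_t> yin_yang_utf16be = {0x26, 0x2F};
const std::vector<std::uint8_t> yin_yang_utf32be = {0x00, 0x00, 0x26, 0x2F};
```

Both naive_has_slash(yin_yang_utf16be) and naive_has_slash(yin_yang_utf32be) return true, even though neither string contains U+002F.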
One encoding to rule them all
- Then, a miracle occurred—UTF-8 variable-length encoding was born.
- Naïve programs can still think that they’re only dealing in ASCII.
- This is where the world is going, and it can’t be stopped.
- This slide alone contains:
- U+2014 — EM DASH
- U+00EF ï LATIN SMALL LETTER I WITH DIAERESIS
- U+2019 ’ RIGHT SINGLE QUOTATION MARK
- U+1F9DF 🧟 ZOMBIE
- Linux and macOS use UTF-8 natively; Microsoft Windows supports it, though its native APIs use UTF-16.
- Jooooooooooiiiiiiiiiin uuuuuuuusssssssss! 🧟🧟🧟
UTF-8 Variable-Length Encoding
Bits | Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
7 | U+0000–U+007F | 0xxxxxxx | | | |
11 | U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
16 | U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000–U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
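One consequence of the table is that the first byte alone tells a decoder how long the sequence is. A minimal sketch, with utf8_length as a hypothetical helper name:

```cpp
#include <cstdint>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 if `lead` is a continuation byte (10xxxxxx) or matches
// none of the four patterns in the table.
int utf8_length(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;   // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}
```

This sketch only classifies the lead byte; it does not check that the following bytes really are continuation bytes.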
Illustration of Various Encodings
Code point | Char | Description | UTF-32BE | UTF-16BE | UTF-8 |
U+0041 | A | A | 00000041 | 0041 | 41 |
U+03A9 | Ω | Omega | 000003A9 | 03A9 | CE A9 |
U+4DCA | ䷊ | Hexagram for peace | 00004DCA | 4DCA | E4 B7 8A |
U+1F42E | 🐮 | Mooooooooo! | 0001F42E | D83D DC2E | F0 9F 90 AE |
Example
Bits | Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
7 | U+0000–U+007F | 0xxxxxxx | | | |
11 | U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
16 | U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000–U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
- Consider U+1F42E 🐮
- 1F42E₁₆ = 1 1111 0100 0010 1110₂ (17 bits)
- Need 21 bits, add leading zeroes: 0 0001 1111 0100 0010 1110
- Grouped properly: 000 011111 010000 101110
- Byte #1: 11110xxx, use first three bits, 11110 000
- Byte #2: 10xxxxxx, use the next six bits, 10 011111
- Byte #3: 10xxxxxx, use the next six bits, 10 010000
- Byte #4: 10xxxxxx, use the next six bits, 10 101110
- All the bits:
- 11110 000 10 011111 10 010000 10 101110
- 11110000 10011111 10010000 10101110
- 1111 0000 1001 1111 1001 0000 1010 1110
- F0 9F 90 AE
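The same bit shuffling, written as code. This is a sketch rather than a production encoder: to_utf8 is a hypothetical name, and it assumes the caller passes a valid code point (no range or surrogate checks).

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8, following the table row by row.
// Assumption: cp is a valid code point; nothing is validated here.
std::string to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                        // 7 bits: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {               // 16 bits: three bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                 // 21 bits: four bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling to_utf8(0x1F42E) produces the four bytes F0 9F 90 AE worked out by hand above.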
Byte Order Mark
Often, files contain a “magic number”—initial bytes that indicate
file type.
Encoding | Bytes |
UTF-32BE | 00 00 FE FF |
UTF-32LE | FF FE 00 00 |
UTF-16BE | FE FF |
UTF-16LE | FF FE |
UTF-8 | EF BB BF |
The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a
Byte Order Mark, or BOM. When it appears as the first bytes of
a data file, it indicates the encoding (assuming that you’re limited
to Unicode).
If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly
harmless. It has no width and does not cause a line break.
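A BOM check can be sketched as a comparison against the byte patterns in the table. The name detect_bom is invented for this example, and returning an empty string for "no BOM" is just one possible design.

```cpp
#include <string>

// Guess the encoding from a leading BOM; returns "" if none is found.
// detect_bom is a hypothetical helper name.
std::string detect_bom(const std::string& data) {
    auto starts_with = [&](const std::string& bom) {
        return data.compare(0, bom.size(), bom) == 0;
    };
    // Test UTF-32 first: the UTF-32LE BOM (FF FE 00 00) begins with
    // the UTF-16LE BOM (FF FE).
    if (starts_with(std::string("\x00\x00\xFE\xFF", 4))) return "UTF-32BE";
    if (starts_with(std::string("\xFF\xFE\x00\x00", 4))) return "UTF-32LE";
    if (starts_with("\xEF\xBB\xBF")) return "UTF-8";
    if (starts_with("\xFE\xFF"))     return "UTF-16BE";
    if (starts_with("\xFF\xFE"))     return "UTF-16LE";
    return "";
}
```

The ordering matters, and it is still only a guess: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.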
Programming
- It’s all about bytes vs. characters.
- In the bad old days, we assumed that everything was ASCII,
or, at the very least, an eight-bit encoding such as Latin-1.
- Therefore, each character fit into an eight-bit byte.
- Nowadays, a UTF-8 character can require several bytes.
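The bytes-versus-characters distinction shows up as soon as you ask how long a string is. The sketch below counts code points in a UTF-8 string by skipping continuation bytes (10xxxxxx). It assumes the input is valid UTF-8, and count_code_points is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Count code points: every byte that is NOT a continuation byte
// (10xxxxxx) starts a new code point. Assumes valid UTF-8 input.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the UTF-8 bytes of "abcñ" (61 62 63 C3 B1), size() reports 5 bytes while count_code_points reports 4 code points.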
Programming
Too many languages have no byte type, so programmers use char instead.
Trouble! The language has no idea whether you’re processing text, which
should be treated as Unicode, or bytes of data, which would happen if a
program were parsing a JPEG file.
cout << hex << 'Jack';
c.cc:1: warning: multi-character character constant
4a61636b
cout << hex << "ñ" << 'ñ';
c.cc:1: warning: multi-character character constant
ñc3b1
for (unsigned char c : "abcñ")
cout << hex << int(c) << ' ';
61 62 63 c3 b1 0
Linux Commands
echo \u: up to four digits; \U: up to eight digits
% echo -e '\uf1'
ñ
% echo -e '\U1f435'
🐵
wc -c counts bytes; wc -m counts characters
% echo -e '\U1f435' | wc -c
5
% echo -e '\U1f435' | wc -m
2
Viewing Files
View with xxd or od:
% echo -e 'ABC' | xxd
00000000: 4142 430a ABC.
% echo -e '\U1f435' | xxd
00000000: f09f 90b5 0a .....
% echo -e 'ABC' | od -t x1
0000000 41 42 43 0a
0000004
% echo -e '\U1f435' | od -t x1
0000000 f0 9f 90 b5 0a
0000005