CS253 Unicode
Unicode (ISO-10646)
- First published 1991.
- Version 15.0, published September 2022, has 149,186 characters.
- Incorporates ASCII as code points 0–127, no change.
- Incorporates ISO-8859-1 (Latin-1) as code points 0–255, no change.
- Code points and encoding are separate issues.
- Meaning, not pictures.
U+ notation
By convention, Unicode code points are represented as U+
followed by at least four (up to six, as needed) upper-case hexadecimal digits.
U+005D | ] | RIGHT SQUARE BRACKET |
U+00F1 | ñ | LATIN SMALL LETTER N WITH TILDE |
U+042F | Я | CYRILLIC CAPITAL LETTER YA |
U+2622 | ☢ | RADIOACTIVE SIGN |
U+1F3A9 | 🎩 | TOP HAT |
RIGHT SQUARE BRACKET is written U+005D, not U+5D.
You could also call it Unicode character number 93, but don’t.
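As a small illustration of the convention, here is a sketch that formats an integer in U+ notation. The function name `u_plus` is made up for this example and is not part of any library.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Format a code point in U+ notation: upper-case hex digits,
// zero-padded to at least four digits. (u_plus is a hypothetical helper.)
std::string u_plus(std::uint32_t cp) {
    std::ostringstream out;
    out << "U+" << std::uppercase << std::hex
        << std::setw(4) << std::setfill('0') << cp;
    return out.str();
}
```

With this, `u_plus(93)` yields `U+005D` and `u_plus(0x1F3A9)` yields `U+1F3A9`.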
What’s in Unicode
All Unicode “blocks”: https://unicode.org/Public/UNIDATA/Blocks.txt
Code Points
Initially, Unicode is all about mapping integers to characters:
Decimal | U+hex | Meaning | Example |
97 | U+0061 | LATIN SMALL LETTER A | a |
8804 | U+2264 | LESS-THAN OR EQUAL TO | ≤ |
66506 | U+103CA | OLD PERSIAN SIGN AURAMAZDAAHA | 𐏊 |
Now, do that for 149,000+ more characters.
These integers (written as U+hexadecimal) are code points.
History
- Of course, the code points aren’t arbitrary.
- Code points 0–127 are ASCII.
- Code points 128–255 are compatible with the popular Latin-1
encoding and, from 160 up, with Microsoft’s Windows-1252 (foolishly called “ANSI”).
- Huge swaths of code points are taken from other character sets,
e.g., emojis from phone manufacturers.
- Other character sets, such as IBM’s EBCDIC, and HP’s Roman-8,
didn’t make the cut.
- I can only imagine the negotiations regarding
Shift-JIS, Big5, or EUC-KR.
Encoding
Fine, we’ve defined this mapping. How do we actually represent those
code points in a computer, in memory, or in a file on a disk? That’s
the job of an encoding. An encoding is a mapping of the bits in an
integer code point to bytes.
code point:
Mapping integers to characters, no matter how big the integers get.
Bits & bytes have nothing to do with it. This is not a computer
problem. This is a bookkeeping task—just a big list.
encoding:
How to represent integer code points as bytes in a computer. For ASCII
(0–127), the mapping is simple, since an 8-bit byte can contain 2⁸=256
different values. It gets harder with more than 256 characters,
like, y’know, Unicode.
16-bit Encodings
- UCS-2 (obsolete):
- Fixed-length 16-bit.
- Each character is two 8-bit bytes, whether in memory, or on a disk.
- Certainly is straightforward.
- Inadequate for modern Unicode, which has many more than 2¹⁶
characters. Can’t even represent U+1F369 🍩 DOUGHNUT.
- Unicode originally had a much more modest scope, only living languages,
so that might have worked.
····J···· ····a···· ····c···· ····k····
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┘
0 1 2 3 4 5 6 7
Endian
- Consider a 32-bit integer containing the value 1.
A single bit is set, all the rest are zeroes.
- It occupies four 8-bit bytes, on a typical computer.
- The four bytes are: 00 00 00 01.
- If it’s stored at locations 1000…1003, which location
holds the 1 bit? Is it location 1000? Location 1003?
One of the others‽
Implementation-Defined Answer
union {
int n = 0x44332211;
char bytes[4];
};
switch (bytes[0]) {
case 0x44: cout << "Big-endian"; break;
case 0x11: cout << "Little-endian"; break;
default: cout << "Weird-endian"; break;
}
for (int b : bytes)
cout << ' ' << hex << b;
Little-endian 11 22 33 44
- Big-endian: the first byte has the
most significant bits of the value.
- Little-endian: the first byte has the
least significant bits of the value.
- big: 2024‒11‒21 🇨🇦 🇨🇳, little: 21‒11‒2024 🇲🇽 🇬🇧, middle: 11/21/2024 🇺🇸
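The union trick above reads a member other than the one last written, which C++ formally leaves undefined even though most compilers do what you expect. Here is a sketch of the same test using `std::memcpy`, which is well-defined. The name `byte_order` is invented for this example.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Report this machine's byte order by looking at the first byte, in memory,
// of a known 32-bit value. (byte_order is a hypothetical helper.)
std::string byte_order() {
    const std::uint32_t n = 0x44332211;
    unsigned char bytes[sizeof n];
    std::memcpy(bytes, &n, sizeof n);   // copy the object representation
    switch (bytes[0]) {
        case 0x44: return "Big-endian";     // most significant byte first
        case 0x11: return "Little-endian";  // least significant byte first
        default:   return "Weird-endian";
    }
}
```

The answer still depends on the machine that runs it. That is the whole point.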
16-bit Encodings
UTF-16 (Unicode Transformation Format):
- Slightly variable-length: values ≤ U+FFFF take two bytes,
other values take four bytes.
- Consider U+203D ‽ INTERROBANG: its two bytes are 20 3D in UTF-16BE, but 3D 20 in UTF-16LE.
- For values ≥ U+10000 and ≤ U+10FFFF:
- Subtract out 0x10000, yielding a 20-bit number
- Emit U+D800 plus the top ten bits.
- Emit U+DC00 plus the lower ten bits.
- There are no valid code points U+D800…U+DFFF.
- 100% overhead for ASCII text.
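The surrogate arithmetic above can be sketched as follows. The function name `to_utf16` is an assumption of this example. It produces 16-bit code units, leaving the byte-order question for later, and it assumes the caller passes a valid code point.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and cp is not in 0xD800..0xDFFF; no error checking.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (cp <= 0xFFFF)
        return { static_cast<std::uint16_t>(cp) };       // one unit, two bytes
    const std::uint32_t v = cp - 0x10000;                // a 20-bit number
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),    // top ten bits
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)),  // lower ten bits
    };
}
```

For example, `to_utf16(0x203D)` is the single unit 203D, while `to_utf16(0x1F42E)` is the pair D83D DC2E.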
32-bit Encodings
UTF-32:
- straightforward rendering of the code point in binary, with the
same problems about byte order:
- UTF-32BE: big-endian version
- UTF-32LE: little-endian version
- 300% overhead for ASCII text.
- Sure, disk space is cheap, but, c’mon.
·········J········· ·········a········· ·········c········· ·········k·········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
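A sketch of how one code point becomes four UTF-32 bytes in either byte order. The function `to_utf32` and its `big_endian` flag are names invented for this illustration.

```cpp
#include <cstdint>
#include <string>

// Serialize one code point as four UTF-32 bytes.
// big_endian == true gives UTF-32BE, false gives UTF-32LE.
std::string to_utf32(std::uint32_t cp, bool big_endian) {
    std::string out(4, '\0');
    for (int i = 0; i < 4; ++i) {
        // BE: most significant byte first; LE: least significant byte first.
        const int shift = big_endian ? 24 - 8 * i : 8 * i;
        out[i] = static_cast<char>((cp >> shift) & 0xFF);
    }
    return out;
}
```

`to_utf32(0x4A, true)` gives the bytes 00 00 00 4A, matching the J in the diagram; `to_utf32(0x4A, false)` gives 4A 00 00 00.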
False Positives
Hey, there’s a slash in this string! No, wait, there isn’t.
U+002F | / | SOLIDUS |
U+262F | ☯ | YIN YANG |
When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely
detect a slash (oh, excuse me, a solidus) in one of the bytes of
U+262F.
Similarly, a null-terminated C-string can’t hold a UTF-16 or UTF-32
string, because of the embedded zero bytes. C++ string objects have
no problem holding any data, including null bytes.
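Both problems can be demonstrated with a few bytes. In this sketch, the byte strings are written out by hand, and the helper names `naive_has_slash` and `c_length` are made up for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// U+262F YIN YANG in two encodings, spelled out as bytes.
const std::string yin_yang_utf16be("\x26\x2F", 2);  // 0x2F is also ASCII '/'
const std::string yin_yang_utf8("\xE2\x98\xAF");    // every byte is >= 0x80

// "Jack" in UTF-16BE: every other byte is zero.
const char jack_utf16be[] = "\x00\x4A\x00\x61\x00\x63\x00\x6B";

// A naive byte-wise search that knows nothing about encodings.
bool naive_has_slash(const std::string& bytes) {
    return bytes.find('/') != std::string::npos;
}

// Length as a C-string function sees it: stops at the first zero byte.
std::size_t c_length(const char* s) {
    return std::strlen(s);
}
```

Here `naive_has_slash(yin_yang_utf16be)` is true although the text contains no solidus, while the UTF-8 form is not fooled. `c_length(jack_utf16be)` is 0, whereas `std::string(jack_utf16be, 8)` keeps all eight bytes.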
One encoding to rule them all
- Then, a miracle occurred—UTF-8 variable-length encoding was born.
- Naïve programs can still think that they’re only dealing in ASCII.
- Linux and macOS use UTF-8 by default; Microsoft Windows supports it, though its native APIs are UTF-16.
- This is where the world is going, and it can’t be stopped.
- Jooooooooooiiiiiiiiiin uuuuuuuusssssssss! 🧟🧟🧟
U+2014 | — | EM DASH |
U+00EF | ï | LATIN SMALL LETTER I WITH DIAERESIS |
U+2019 | ’ | RIGHT SINGLE QUOTATION MARK |
U+1F9DF | 🧟 | ZOMBIE |
UTF-8 Variable-Length Encoding
Bits | Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
7 | U+0000–U+007F | 0xxxxxxx | | | |
11 | U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
16 | U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Illustration of Various Encodings
Code point | Char | Description | UTF-32BE | UTF-16BE | UTF-8 |
U+0041 | A | A | 00000041 | 0041 | 41 |
U+03A9 | Ω | Omega | 000003A9 | 03A9 | CE A9 |
U+4DCA | ䷊ | Hexagram for peace | 00004DCA | 4DCA | E4 B7 8A |
U+1F42E | 🐮 | Mooooooooo! | 0001F42E | D83D DC2E | F0 9F 90 AE |
Example
Bits | Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
7 | U+0000–U+007F | 0xxxxxxx | | | |
11 | U+0080–U+07FF | 110xxxxx | 10xxxxxx | | |
16 | U+0800–U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000–U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
- Consider U+1F42E 🐮
- 1F42E₁₆ = 1 1111 0100 0010 1110₂ (17 bits)
- Need 21 bits, add leading zeroes: 0 0001 1111 0100 0010 1110
- Grouped properly: 000 011111 010000 101110
- Byte #1: 11110xxx, use first three bits, 11110 000
- Byte #2: 10xxxxxx, use the next six bits, 10 011111
- Byte #3: 10xxxxxx, use the next six bits, 10 010000
- Byte #4: 10xxxxxx, use the next six bits, 10 101110
- 11110 000 10 011111 10 010000 10 101110 as flag & data bits
- 11110000 10011111 10010000 10101110 as 8-bit bytes
- 1111 0000 1001 1111 1001 0000 1010 1110 as 4-bit nybbles
- F0 9F 90 AE as hex digits
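The bit-shuffling in this example generalizes to all four rows of the table. Here is a minimal sketch, assuming the caller passes a valid code point (at most 0x10FFFF, not a surrogate). The name `to_utf8` is chosen for this example.

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8, following the table row by row.
// Assumes a valid code point; does no error checking.
std::string to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {                          // 1110xxxx + 2 more
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                            // 11110xxx + 3 more
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

`to_utf8(0x1F42E)` produces the four bytes F0 9F 90 AE worked out by hand above.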
Byte Order Mark
Often, files contain a “magic number”—initial bytes that indicate
file type.
Encoding | Bytes |
UTF-32BE | 00 00 FE FF |
UTF-32LE | FF FE 00 00 |
UTF-16BE | FE FF |
UTF-16LE | FF FE |
UTF-8 | EF BB BF |
The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a
Byte Order Mark, or BOM. When used as the first bytes of
a data file, it indicates the encoding (assuming that you’re limited
to Unicode).
If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly
harmless. It has no width and does not cause a line break.
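A sketch of BOM sniffing based on the table above. The function name `detect_bom` and the `"unknown"` result are choices made for this example. Note that the four-byte marks must be tested before the two-byte ones, because the UTF-32LE mark begins with the UTF-16LE mark.

```cpp
#include <string>

// Guess the encoding of some data from its first bytes.
// Returns "unknown" if no byte order mark is present.
std::string detect_bom(const std::string& data) {
    struct Bom { std::string bytes; const char* name; };
    const Bom table[] = {
        { std::string("\x00\x00\xFE\xFF", 4), "UTF-32BE" },
        { std::string("\xFF\xFE\x00\x00", 4), "UTF-32LE" },  // before UTF-16LE
        { std::string("\xFE\xFF", 2),         "UTF-16BE" },
        { std::string("\xFF\xFE", 2),         "UTF-16LE" },
        { std::string("\xEF\xBB\xBF", 3),     "UTF-8"    },
    };
    for (const Bom& bom : table)
        if (data.compare(0, bom.bytes.size(), bom.bytes) == 0)
            return bom.name;
    return "unknown";
}
```

This is only a guess: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE.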
Combining characters
ñ
- U+00F1 LATIN SMALL LETTER N WITH TILDE
is used in Spanish.
- This character is available in two ways: precomposed and build-your-own:
% echo -e '\u00f1 or n\u0303'
ñ or ñ
- The latter uses U+0303 COMBINING TILDE, which merges with the
previous character as if one were printed over the other.
- This means that ñ can be either represented by the code points
U+00F1 or U+006E U+0303. Of course, these become different byte
sequences.
- Good programs perform normalization
before comparing strings.
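To see why normalization matters, here is a toy sketch. It is not real Unicode normalization, which needs the full composition tables (typically from a library); it rewrites only the single pair discussed here, in UTF-8. The name `compose_n_tilde` is invented for the example.

```cpp
#include <cstddef>
#include <string>

// Toy stand-in for normalization: replaces "n" + COMBINING TILDE with the
// precomposed ñ, and nothing else. Real code should use a proper library.
std::string compose_n_tilde(std::string s) {
    const std::string decomposed  = "n\xCC\x83";  // U+006E U+0303 in UTF-8
    const std::string precomposed = "\xC3\xB1";   // U+00F1 in UTF-8
    std::size_t pos = 0;
    while ((pos = s.find(decomposed, pos)) != std::string::npos) {
        s.replace(pos, decomposed.size(), precomposed);
        pos += precomposed.size();
    }
    return s;
}
```

Compared byte for byte, the two spellings of ñ are different strings (two bytes versus three); after passing both through `compose_n_tilde`, they compare equal.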
Combining Sequences
👷 🏽 = 👷🏽
U+1F477 | 👷 | CONSTRUCTION WORKER |
U+1F3FD | | EMOJI MODIFIER FITZPATRICK TYPE-4 |
🏴 ‍ ☠ ️ = 🏴☠️
U+1F3F4 | 🏴 | WAVING BLACK FLAG |
U+200D | | ZERO WIDTH JOINER |
U+2620 | ☠ | SKULL AND CROSSBONES |
U+FE0F | | VARIATION SELECTOR-16 |
More Combining Sequences
🧑 ‍ ⚕ ️ = 🧑⚕️
U+1F9D1 | 🧑 | ADULT |
U+200D | | ZERO WIDTH JOINER |
U+2695 | ⚕ | STAFF OF AESCULAPIUS |
U+FE0F | | VARIATION SELECTOR-16 |
🇨 🇦 = 🇨🇦
U+1F1E8 | | REGIONAL INDICATOR SYMBOL LETTER C |
U+1F1E6 | | REGIONAL INDICATOR SYMBOL LETTER A |
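To see what a combining sequence is made of, it helps to decode the bytes back into code points. Here is a minimal sketch, assuming well-formed UTF-8 input; `code_points` is a name made up for this example, and it does no validation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode well-formed UTF-8 into a list of code points.
// Assumes valid input; malformed bytes give meaningless results.
std::vector<std::uint32_t> code_points(const std::string& utf8) {
    std::vector<std::uint32_t> out;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80 && !out.empty())      // 10xxxxxx: six more bits
            out.back() = (out.back() << 6) | (byte & 0x3F);
        else if (byte < 0x80)                           // 0xxxxxxx
            out.push_back(byte);
        else if ((byte & 0xE0) == 0xC0)                 // 110xxxxx
            out.push_back(byte & 0x1F);
        else if ((byte & 0xF0) == 0xE0)                 // 1110xxxx
            out.push_back(byte & 0x0F);
        else                                            // 11110xxx
            out.push_back(byte & 0x07);
    }
    return out;
}
```

Fed the eight bytes F0 9F 87 A8 F0 9F 87 A6, it returns the two code points U+1F1E8 and U+1F1E6: one flag on the screen, two code points, eight bytes.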
Programming
- It’s all about bytes vs. characters.
- In the bad old days, we assumed that everything was ASCII,
or, at the very best, an eight-bit encoding such as Latin-1.
- Therefore, each character fit into an eight-bit byte.
- Nowadays, a UTF-8 character can require several bytes.
Programming
Too many languages have no byte type, so programmers use char instead.
Trouble! The language has no idea whether you’re processing text, which
should be treated as Unicode, or bytes of data, which would happen if a
program were parsing a JPEG file.
cout << hex << 'Logan'; // 🦡
c.cc:1: warning: character constant too long for its type
6f67616e
cout << hex << "ñ";
cout << hex << 'ñ'; // 🦡
c.cc:2: warning: multi-character character constant
ñc3b1
for (unsigned char c : "abcñ")
cout << hex << setw(2) << setfill('0') << int(c) << ' ';
61 62 63 c3 b1 00
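The `unsigned char` in that last loop matters: where plain `char` is signed, a byte such as 0xC3 is negative, and widening it to `int` prints ffffffc3 instead of c3. A sketch of the same dump as a reusable function follows; `hex_bytes` is a name invented here.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Render each byte of s as two lower-case hex digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    const char* sep = "";
    for (unsigned char c : s) {         // unsigned: bytes >= 0x80 stay positive
        out << sep << std::setw(2) << int(c);
        sep = " ";
    }
    return out.str();
}
```

`hex_bytes("abc\xC3\xB1")` gives `61 62 63 c3 b1`. There is no trailing 00 here because a `std::string` built from a literal drops the terminating null, unlike iterating over the character array itself.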
Bash Commands
echo -e \u: up to four hexadecimal digits; echo -e \U: up to eight hex digits
% echo -e "\uf1"
ñ
% echo -e "\U1f435"
🐵
wc -c counts bytes; wc -m counts characters
% echo "🐵" | wc -c
5
% echo "🐵" | wc -m
2
Two chars? Why two? echo adds a newline unless the -n option is given.
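The same bytes-versus-characters distinction can be sketched in C++. In UTF-8, every code point contributes exactly one byte that is not a continuation byte (10xxxxxx), so counting those counts code points. The function name `count_code_points` is an assumption of this example, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping continuation bytes.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)      // not of the form 10xxxxxx
            ++n;
    return n;
}
```

For the monkey plus newline, `std::string("\xF0\x9F\x90\xB5\n")` has `size()` 5 but `count_code_points` returns 2, matching the two `wc` results.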
Viewing Files with xxd or od:
% cat ~cs253/pub/unicode
0034 4 DIGIT FOUR
05D0 א HEBREW LETTER ALEF
203D ‽ INTERROBANG
1F34C 🍌 BANANA
% xxd ~cs253/pub/unicode
00000000: 3030 3334 2020 3409 2044 4947 4954 2046 0034 4. DIGIT F
00000010: 4f55 520a 3035 4430 2020 d790 0920 4845 OUR.05D0 ... HE
00000020: 4252 4557 204c 4554 5445 5220 414c 4546 BREW LETTER ALEF
00000030: 0a32 3033 4420 20e2 80bd 0920 494e 5445 .203D .... INTE
00000040: 5252 4f42 414e 470a 3146 3334 4320 f09f RROBANG.1F34C ..
00000050: 8d8c 2042 414e 414e 410a .. BANANA.
% od -t x1 ~cs253/pub/unicode
0000000 30 30 33 34 20 20 34 09 20 44 49 47 49 54 20 46
0000020 4f 55 52 0a 30 35 44 30 20 20 d7 90 09 20 48 45
0000040 42 52 45 57 20 4c 45 54 54 45 52 20 41 4c 45 46
0000060 0a 32 30 33 44 20 20 e2 80 bd 09 20 49 4e 54 45
0000100 52 52 4f 42 41 4e 47 0a 31 46 33 34 43 20 f0 9f
0000120 8d 8c 20 42 41 4e 41 4e 41 0a
0000132