CS253: Software Development with C++

Fall 2020

Unicode


Unicode (ISO/IEC 10646)

U+ notation

By convention, Unicode code points are written as U+ followed by at least four upper-case hexadecimal digits (more only if needed).

U+005D    ]    RIGHT SQUARE BRACKET
U+00F1    ñ    LATIN SMALL LETTER N WITH TILDE
U+042F    Я    CYRILLIC CAPITAL LETTER YA
U+2622    ☢    RADIOACTIVE SIGN
U+1F3A9   🎩   TOP HAT

RIGHT SQUARE BRACKET is written U+005D, not U+5D. You could also call it Unicode character #93, but don’t.
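For what it’s worth, printing that notation from C++ takes only a little <iomanip>. A quick sketch (print_u is just a made-up name):

#include <cstdint>
#include <iomanip>
#include <iostream>
using namespace std;

// Print a code point in U+ notation: at least four upper-case hex
// digits, more only if needed.
void print_u(char32_t cp) {
    cout << "U+" << hex << uppercase << setfill('0') << setw(4)
         << uint32_t(cp) << '\n';
}

int main() {
    print_u(U']');            // U+005D
    print_u(U'\u00F1');       // U+00F1
    print_u(U'\U0001F3A9');   // U+1F3A9
}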

What’s in Unicode

ASCII         ABCxyz!@#$%    Dingbats        ✈☞✌✔✰☺♥♦♣♠•
Other Latin   áçñöªš¿        Emoji           🐱🎃🍎🇺🇳
Cyrillic      ЖЙЩЯ           Hieroglyphics   𓁥𓂀
Hebrew        אבגדה          Mathematics     ∃𝒋:𝒋∉ℝ
Chinese                      Music           𝄞𝄵𝆖𝅘𝅥𝅮 ♯♭
Japanese                     No Klingon

All Unicode “blocks”: https://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal   U+hex     Meaning                         Example
97        U+0061    LATIN SMALL LETTER A            a
9786      U+263A    WHITE SMILING FACE              ☺
66506     U+103CA   OLD PERSIAN SIGN AURAMAZDAAHA   𐏊

Now, do that for 143,000+ more characters.

Encoding

Fine, so we’ve defined this mapping. How do we actually represent those code points in a computer, in memory or in a file on disk? That’s the job of an encoding. An encoding is a mapping of the bits in an integer code point to bytes.

The difference is essential:

code point
mapping integers to characters, without caring about how big the integers get. This is not a computer problem. Bits & bytes have nothing to do with it. This is strictly a bookkeeping task, to be handled by a bureaucracy. It’s just a big list.
encoding
how to represent those integer code points as bytes in a computer. For ASCII (0–127), the mapping is simple, since an 8-bit byte can contain 2⁸=256 different values.
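A quick sketch of the difference in C++ (assuming the source file and the compiler both use UTF-8): the code point of Ω is just the integer 3A9, while its UTF-8 encoding is the two bytes CE A9.

#include <cstdint>
#include <iostream>
using namespace std;

int main() {
    // The code point is just an integer: U+03A9 GREEK CAPITAL LETTER OMEGA.
    char32_t omega = U'\u03A9';
    cout << hex << uint32_t(omega) << '\n';    // 3a9

    // The encoding is how that integer is stored as bytes; here, UTF-8.
    for (unsigned char b : u8"\u03A9")
        cout << hex << int(b) << ' ';          // ce a9 0  (0 is the terminating NUL)
    cout << '\n';
}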

16-bit Encodings

     ····J···· ····a···· ····c···· ····k····
    ┌────┬────┬────┬────┬────┬────┬────┬────┐
    │ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
    └────┴────┴────┴────┴────┴────┴────┴────┘
      0    1    2    3    4    5    6    7

Endian

// Reinterpret the four bytes of the string literal as an int.
// (Type punning like this isn't strictly portable, but it shows how
// the machine lays out bytes in memory.)
int n = * (int *) "\x11\x22\x33\x44";
switch (n) {
  case 0x11223344: cout << "Big";     break;
  case 0x44332211: cout << "Little";  break;
  case 0x22114433: cout << "Middle";  break;
  default:         cout << "Unknown"; break;
}
cout << "-endian\n"
     << "11 22 33 44 makes " << hex << n;
Little-endian
11 22 33 44 makes 44332211
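If your compiler supports C++20, std::endian in <bit> answers the same question without any pointer casting; a minimal sketch:

#include <bit>
#include <iostream>

int main() {
    // std::endian (C++20) reports the machine's byte order directly.
    if constexpr (std::endian::native == std::endian::little)
        std::cout << "Little-endian\n";
    else if constexpr (std::endian::native == std::endian::big)
        std::cout << "Big-endian\n";
    else
        std::cout << "Mixed-endian\n";
}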

16-bit Encodings

UTF-16:

32-bit Encodings

UTF-32:

  ········J········   ········a········   ········c········   ········k········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15

False Positives

Hey, there’s a slash in this string! No, wait, there isn’t.

When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262F YIN YANG (☯).

Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
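A small demonstration of both problems, using the byte values of U+262F from above (26 2F in UTF-16BE, 00 00 26 2F in UTF-32BE):

#include <cstring>
#include <iostream>
using namespace std;

int main() {
    // U+262F in UTF-16BE is the two bytes 26 2F; a byte-wise scan
    // "finds" 2F, which is also the ASCII code for '/'.
    const unsigned char utf16be[] = { 0x26, 0x2F };
    for (unsigned char b : utf16be)
        if (b == '/')
            cout << "False slash!\n";

    // The same character in UTF-32BE contains zero bytes, so C-string
    // functions such as strlen() stop far too early.
    const char utf32be[] = { 0x00, 0x00, 0x26, 0x2F };
    cout << strlen(utf32be) << '\n';   // prints 0, not 4
}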

UTF-8 Variable-Length Encoding

Bits   Range              Byte 1     Byte 2     Byte 3     Byte 4
7      U+0000–U+007F      0xxxxxxx
11     U+0080–U+07FF      110xxxxx   10xxxxxx
16     U+0800–U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
21     U+10000–U+1FFFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Illustration of Various Encodings

U+        Char   Description          UTF-32BE   UTF-16BE    UTF-8
U+0041    A      A                    00000041   0041        41
U+03A9    Ω      Omega                000003A9   03A9        CE A9
U+4DCA    ䷊     Hexagram for peace   00004DCA   4DCA        E4 B7 8A
U+1F42E   🐮     Mooooooooo!          0001F42E   D83D DC2E   F0 9F 90 AE
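One way to reproduce that table yourself, sketched in C++: dump is a made-up helper that prints each code unit of a string literal as big-endian bytes, so the output matches the BE columns above no matter what the machine’s own byte order is.

#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <iostream>
using namespace std;

// Print the code units of a string literal as big-endian bytes,
// stopping before the terminating NUL.
template <typename CharT, size_t N>
void dump(const CharT (&s)[N]) {
    for (size_t i = 0; i + 1 < N; ++i)
        for (int shift = (sizeof(CharT) - 1) * 8; shift >= 0; shift -= 8)
            cout << setw(2) << setfill('0') << hex << uppercase
                 << ((uint32_t(s[i]) >> shift) & 0xFF) << ' ';
    cout << '\n';
}

int main() {
    dump(U"\U0001F42E");    // char32_t code units: 00 01 F4 2E
    dump(u"\U0001F42E");    // char16_t code units: D8 3D DC 2E
    dump(u8"\U0001F42E");   // UTF-8 bytes:         F0 9F 90 AE
}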

Example

Bits   Range              Byte 1     Byte 2     Byte 3     Byte 4
7      U+0000–U+007F      0xxxxxxx
11     U+0080–U+07FF      110xxxxx   10xxxxxx
16     U+0800–U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
21     U+10000–U+1FFFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
  • Consider U+1F42E 🐮
    • 1F42E₁₆ = 1 1111 0100 0010 1110₂ (17 bits)
    • Need 21 bits, add leading zeroes: 0 0001 1111 0100 0010 1110
    • Grouped properly: 000 011111 010000 101110
    • Byte #1: 11110xxx, use first three bits, 11110 000
    • Byte #2: 10xxxxxx, use the next six bits, 10 011111
    • Byte #3: 10xxxxxx, use the next six bits, 10 010000
    • Byte #4: 10xxxxxx, use the next six bits, 10 101110
    • All the bits:
      • 11110 000  10 011111  10 010000  10 101110
      • 11110000   10011111   10010000   10101110
      • 1111 0000  1001 1111  1001 0000  1010 1110
      • F0 9F 90 AE
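The same procedure, written as code. This is only a sketch (to_utf8 is a made-up name, and the checks a real encoder needs for surrogates and out-of-range values are omitted), but it follows the table directly:

#include <cstdint>
#include <iostream>
#include <string>

// Encode one code point as UTF-8, following the table above.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 7 bits:  0xxxxxxx
        out += char(cp);
    } else if (cp <= 0x7FF) {               // 11 bits: 110xxxxx 10xxxxxx
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {                                // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    for (unsigned char b : to_utf8(U'\U0001F42E'))                   // the cow
        std::cout << std::hex << std::uppercase << int(b) << ' ';    // F0 9F 90 AE
    std::cout << '\n';
}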

Byte Order Mark

Often, files contain a “magic number”—initial bytes that indicate file type.

Encoding   Bytes
UTF-32BE   00 00 FE FF
UTF-32LE   FF FE 00 00
UTF-16BE   FE FF
UTF-16LE   FF FE
UTF-8      EF BB BF

The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When it appears as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, a ZERO WIDTH NO-BREAK SPACE is mostly harmless.
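A sketch of how a program might sniff the BOM, using the byte values from the table above (guess_encoding is a made-up name; the four-byte marks are tested first, since UTF-16LE’s FF FE is a prefix of UTF-32LE’s FF FE 00 00):

#include <cstddef>
#include <iostream>
#include <string>

// Guess the encoding of a file from its first few bytes.
std::string guess_encoding(const unsigned char *p, std::size_t n) {
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return "UTF-32BE";
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return "UTF-32LE";
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)                 return "UTF-8";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)                                 return "UTF-16BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)                                 return "UTF-16LE";
    return "no BOM";
}

int main() {
    const unsigned char file_start[] = { 0xFF, 0xFE, 0x00, 0x00 };
    std::cout << guess_encoding(file_start, sizeof file_start) << '\n';   // UTF-32LE
}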

Programming

It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, which would happen if a program were parsing a JPEG file.

cout << hex << 'jack';
c.cc:1: warning: multi-character character constant
6a61636b
cout << hex << "ñ" << 'ñ';
c.cc:1: warning: multi-character character constant
ñc3b1
for (unsigned char c : "abcñ")
    cout << hex << int(c) << ' ';
61 62 63 c3 b1 0 
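One more sketch, building on the UTF-8 table from earlier: every continuation byte looks like 10xxxxxx, so counting only the bytes that are not continuation bytes counts code points rather than bytes (count_code_points is a made-up name).

#include <cstddef>
#include <iostream>
#include <string>
using namespace std;

// Count code points in a UTF-8 string by skipping continuation bytes.
size_t count_code_points(const string &s) {
    size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)    // not of the form 10xxxxxx
            ++n;
    return n;
}

int main() {
    string s = "abcñ";                                   // ñ is the two bytes C3 B1
    cout << s.size() << " bytes, "                       // 5 bytes
         << count_code_points(s) << " code points\n";    // 4 code points
}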

Linux Commands

echo -e: \u takes up to four hex digits; \U takes up to eight

    % echo -e '\uf1'
    ñ
    % echo -e '\U1f435'
    🐵

wc -c counts bytes; wc -m counts characters

    % echo -e '\U1f435' | wc -c
    5
    % echo -e '\U1f435' | wc -m
    2

Viewing Files

View with xxd or od:

    % echo -e 'ABC' | xxd
    00000000: 4142 430a                                ABC.
    % echo -e '\U1f435' | xxd
    00000000: f09f 90b5 0a                             .....

    % echo -e 'ABC' | od -t x1
    0000000 41 42 43 0a
    0000004
    % echo -e '\U1f435' | od -t x1
    0000000 f0 9f 90 b5 0a
    0000005