CS253: Software Development with C++

Spring 2022

Unicode



Unicode (ISO-10646)

U+ notation

By convention, Unicode code points are represented as U+ followed by at least four upper-case hexadecimal digits (five or six if needed).

U+005D   ]   RIGHT SQUARE BRACKET
U+00F1   ñ   LATIN SMALL LETTER N WITH TILDE
U+042F   Я   CYRILLIC CAPITAL LETTER YA
U+2622   ☢   RADIOACTIVE SIGN
U+1F3A9  🎩  TOP HAT

RIGHT SQUARE BRACKET is written U+005D, not U+5D. You could also call it Unicode character number 93, but don’t.
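
A minimal sketch of the convention, in case you want to print code points this way. The function name u_plus is made up for this example; "%04lX" pads to four digits and simply grows wider for larger values.

```cpp
#include <cstdio>
#include <string>

// Format a code point in U+ notation: at least four upper-case hex digits.
// (u_plus is a hypothetical helper name, not a standard function.)
std::string u_plus(char32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

With this, u_plus(93) gives "U+005D" and u_plus(0x1F3A9) gives "U+1F3A9".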

What’s in Unicode

ASCII        ABCxyz!@#$%    Dingbats     ✈☞✌✔✰☺♥♦♣♠•
Other Latin  áçñöªš¿        Emoji        🐱🎃🍎🇺🇳
Cyrillic     ЖЙЩЯ           Hieroglyphs  𓁥𓂀
Hebrew       אבגדה          Mathematics  ∃𝒋:𝒋∉ℝ
Chinese                     Music        𝄞𝄵𝆖𝅘𝅥𝅮 ♯♭
Japanese                    No Klingon

All Unicode “blocks”: https://unicode.org/Public/UNIDATA/Blocks.txt

Code Points

Initially, Unicode is all about mapping integers to characters:

Decimal  U+hex    Meaning                        Example
97       U+0061   LATIN SMALL LETTER A           a
8804     U+2264   LESS-THAN OR EQUAL TO          ≤
66506    U+103CA  OLD PERSIAN SIGN AURAMAZDAAHA  𐏊

Now, do that for 144,000+ more characters. These integers (written as U+hexadecimal) are code points.

History

Encoding

Fine, we’ve defined this mapping. How do we actually represent those code points in a computer, in memory, or in a file on a disk? That’s the job of an encoding. An encoding is a mapping of the bits in an integer code point to bytes.

code point:

Mapping integers to characters, no matter how big the integers get. Bits & bytes have nothing to do with it. This is not a computer problem. This is a bookkeeping task—just a big list.

encoding:

How to represent integer code points as bytes in a computer. For ASCII (0–127), the mapping is simple, since an 8-bit byte can contain 2⁸=256 different values. It gets harder with more than 256 characters, like, y’know, Unicode.

16-bit Encodings

     ····J···· ····a···· ····c···· ····k····
    ┌────┬────┬────┬────┬────┬────┬────┬────┐
    │ 00 │ 4A │ 00 │ 61 │ 00 │ 63 │ 00 │ 6B │
    └────┴────┴────┴────┴────┴────┴────┴────┘
      0    1    2    3    4    5    6    7
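
The picture above can be reproduced with a few lines. This is only a sketch: it assumes every character fits in one 16-bit unit, and it writes the most significant byte first, as the diagram does. The name to_16bit_be is invented for the example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Store each 16-bit unit as two bytes, most significant byte first.
// (to_16bit_be is a hypothetical helper name.)
std::vector<std::uint8_t> to_16bit_be(const std::u16string &text) {
    std::vector<std::uint8_t> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low byte
    }
    return bytes;
}
```

Calling to_16bit_be(u"Jack") yields the eight bytes 00 4A 00 61 00 63 00 6B.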

Endian

Implementation-Defined Answer

[Image: an egg in an egg-cup]
union {
    int n = 0x44332211;
    char bytes[4];
};
switch (bytes[0]) {
  case 0x44: cout << "Big-endian";    break;
  case 0x11: cout << "Little-endian"; break;
  default:   cout << "Weird-endian";  break;
}
for (int b : bytes)
    cout << ' ' << hex << b;
Little-endian 11 22 33 44
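
A possible variation on the same test: reading a union member other than the one last written is formally undefined behavior in standard C++, even though common compilers do what you expect. Copying the bytes out with memcpy sidesteps that. The names bytes_of and byte_order are made up for this sketch.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <string>

// Copy the object representation of a 32-bit value into a byte array.
// (bytes_of and byte_order are hypothetical helper names.)
std::array<unsigned char, 4> bytes_of(std::uint32_t n) {
    std::array<unsigned char, 4> bytes{};
    std::memcpy(bytes.data(), &n, sizeof n);
    return bytes;
}

// Classify the byte order by looking at the byte stored first in memory.
std::string byte_order() {
    switch (bytes_of(0x44332211)[0]) {
        case 0x44: return "Big-endian";
        case 0x11: return "Little-endian";
        default:   return "Weird-endian";
    }
}
```

The answer is still implementation-defined: it depends on the machine that runs the code.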

16-bit Encodings

UTF-16 (Unicode Transformation Format):

32-bit Encodings

UTF-32:

 ·········J········· ·········a········· ·········c········· ·········k·········
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 00 │ 00 │ 00 │ 4A │ 00 │ 00 │ 00 │ 61 │ 00 │ 00 │ 00 │ 63 │ 00 │ 00 │ 00 │ 6B │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15

False Positives

[Image: a pregnancy test, showing positive]

Hey, there’s a slash in this string! No, wait, there isn’t.

U+002F  /  SOLIDUS
U+262F  ☯  YIN YANG

When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262F.

Similarly, a null-terminated C-string can’t hold a UTF-16 or UTF-32 string, because of the embedded zero bytes. C++ string objects have no problem holding any data, including null bytes.
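
Both problems are easy to reproduce. The sketch below spells out the encoded bytes by hand (the variable and function names are invented for the example) and shows the naïve byte search and the C-string length going wrong.

```cpp
#include <cstring>
#include <string>

// Naive byte search: true if any byte of data equals target.
// (contains_byte is a hypothetical helper name.)
bool contains_byte(const std::string &data, char target) {
    return data.find(target) != std::string::npos;
}

// U+262F YIN YANG in two encodings, spelled out byte by byte.
const std::string yin_yang_utf16be("\x26\x2F", 2);
const std::string yin_yang_utf8("\xE2\x98\xAF", 3);

// "J" in UTF-16BE: a zero byte followed by 0x4A.
// The explicit length keeps the std::string from stopping at the zero.
const std::string j_utf16be("\x00\x4A", 2);
```

contains_byte(yin_yang_utf16be, '/') is true even though there is no slash in the text, while the UTF-8 bytes give no such false positive. j_utf16be.size() is 2, but std::strlen(j_utf16be.c_str()) is 0, because the C-string view ends at the first zero byte.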

One encoding to rule them all

U+2014   —   EM DASH
U+00EF   ï   LATIN SMALL LETTER I WITH DIAERESIS
U+2019   ’   RIGHT SINGLE QUOTATION MARK
U+1F9DF  🧟  ZOMBIE

UTF-8 Variable-Length Encoding

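Bits  Code Point Range   Byte 1    Byte 2    Byte 3    Byte 4
7     U+0000–U+007F      0xxxxxxx
11    U+0080–U+07FF      110xxxxx  10xxxxxx
16    U+0800–U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
21    U+10000–U+1FFFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The four-byte pattern has room for 21 bits, but Unicode itself stops at U+10FFFF.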
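
The table translates almost line for line into code. This is a sketch, not a hardened encoder: the name encode_utf8 is made up, and returning an empty string for invalid input (surrogates, or anything above U+10FFFF) is just one possible choice.

```cpp
#include <string>

// Encode one code point as UTF-8, following the bit layout in the table.
// Returns an empty string for values that are not valid code points
// (surrogates U+D800 to U+DFFF, or anything above U+10FFFF).
// (encode_utf8 is a hypothetical helper name.)
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {                          // 1110xxxx + 2 more
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;   // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                        // 11110xxx + 3 more
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x00F1) produces the two bytes C3 B1.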

Illustration of Various Encodings

Code point  Char  Description         UTF-32BE  UTF-16BE   UTF-8
U+0041      A     A                   00000041  0041       41
U+03A9      Ω     Omega               000003A9  03A9       CE A9
U+4DCA      ䷊     Hexagram for peace  00004DCA  4DCA       E4 B7 8A
U+1F42E     🐮    Mooooooooo!         0001F42E  D83D DC2E  F0 9F 90 AE
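
The D83D DC2E in the UTF-16BE column is a surrogate pair, which is how UTF-16 handles code points that don’t fit in 16 bits: subtract 0x10000, which leaves 20 bits, then add the top 10 bits to 0xD800 and the bottom 10 bits to 0xDC00. A sketch, assuming the input is a valid code point (encode_utf16 is a made-up name):

```cpp
#include <string>

// Encode one code point as UTF-16 code units.
// Assumes cp is a valid code point (not a surrogate, at most U+10FFFF).
// (encode_utf16 is a hypothetical helper name.)
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // fits in one unit
    } else {
        char32_t v = cp - 0x10000;                           // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

encode_utf16(0x1F42E) returns the two units D83D and DC2E, matching the table.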

Example

Bits  Range              Byte 1    Byte 2    Byte 3    Byte 4
7     U+0000–U+007F      0xxxxxxx
11    U+0080–U+07FF      110xxxxx  10xxxxxx
16    U+0800–U+FFFF      1110xxxx  10xxxxxx  10xxxxxx
21    U+10000–U+1FFFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
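
One way to work an example is to read the table backwards. The bytes CE A9 are 110 01110 and 10 101001; dropping the marker bits leaves 01110 101001, which is 0x3A9, so the code point is U+03A9 (Ω). The sketch below does the same for a whole string. It assumes well-formed UTF-8 and does no error checking; decode_utf8 is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Decode a well-formed UTF-8 string into code points.
// Assumes valid input: no checks for stray, truncated, or overlong sequences.
// (decode_utf8 is a hypothetical helper name.)
std::u32string decode_utf8(const std::string &bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of 10xxxxxx bytes that follow
        if      (lead < 0x80) { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out += cp;
    }
    return out;
}
```

Decoding the seven bytes 41 CE A9 F0 9F 90 AE gives three code points: U+0041, U+03A9, and U+1F42E.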

Byte Order Mark

Often, files contain a “magic number”—initial bytes that indicate file type.

Encoding  Bytes
UTF-32BE  00 00 FE FF
UTF-32LE  FF FE 00 00
UTF-16BE  FE FF
UTF-16LE  FF FE
UTF-8     EF BB BF

The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When its encoded bytes are the first bytes of a data file, they indicate the encoding (assuming that you’re limited to Unicode).

If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly harmless. It has no width and does not cause a line break.
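
A sketch of how a program might sniff the BOM from the table above. The function names are invented, and returning an empty string when no BOM is found is just a convention for this example. Note the order of the tests: the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark, so the four-byte marks have to be tried first.

```cpp
#include <cstddef>
#include <string>

// True if data begins with the n bytes at prefix.
// (has_prefix and detect_bom are hypothetical helper names.)
static bool has_prefix(const std::string &data, const char *prefix, std::size_t n) {
    return data.compare(0, n, prefix, n) == 0;
}

// Name the encoding indicated by a leading BOM, or "" if there is none.
std::string detect_bom(const std::string &data) {
    // Four-byte marks first: FF FE 00 00 also starts with FF FE.
    if (has_prefix(data, "\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (has_prefix(data, "\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (has_prefix(data, "\xEF\xBB\xBF", 3))     return "UTF-8";
    if (has_prefix(data, "\xFE\xFF", 2))         return "UTF-16BE";
    if (has_prefix(data, "\xFF\xFE", 2))         return "UTF-16LE";
    return "";
}
```

One ambiguity remains: a UTF-16LE file whose first real character is U+0000 also starts with FF FE 00 00, and this sketch would report it as UTF-32LE.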

Combining characters

ñ

    
% echo -e '\u00f1 or n\u0303'
ñ or ñ

Combining Sequences

👷   🏽 = 👷🏽

U+1F477  👷  CONSTRUCTION WORKER
U+1F3FD  🏽  EMOJI MODIFIER FITZPATRICK TYPE-4

🏴   U+200D   ☠   U+FE0F = 🏴‍☠️

U+1F3F4  🏴  WAVING BLACK FLAG
U+200D       ZERO WIDTH JOINER
U+2620   ☠   SKULL AND CROSSBONES
U+FE0F       VARIATION SELECTOR-16

More Combining Sequences

🧑   U+200D   ⚕   U+FE0F = 🧑‍⚕️

U+1F9D1  🧑  ADULT
U+200D       ZERO WIDTH JOINER
U+2695   ⚕   STAFF OF AESCULAPIUS
U+FE0F       VARIATION SELECTOR-16

🇨   🇦 = 🇨🇦

U+1F1E8  🇨  REGIONAL INDICATOR SYMBOL LETTER C
U+1F1E6  🇦  REGIONAL INDICATOR SYMBOL LETTER A
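
The regional indicator letters sit in alphabetical order starting at U+1F1E6 for A, so a flag sequence can be computed from a two-letter country code. A sketch, with a made-up function name, which assumes upper-case ASCII input; whether the pair actually displays as a flag is up to the font.

```cpp
#include <string>

// Map a two-letter upper-case country code such as "CA" to its pair of
// REGIONAL INDICATOR SYMBOL LETTER code points (U+1F1E6 is letter A).
// Returns an empty string if the input is not exactly two letters A to Z.
// (flag_code_points is a hypothetical helper name.)
std::u32string flag_code_points(const std::string &country) {
    std::u32string out;
    if (country.size() != 2) return out;
    for (char c : country) {
        if (c < 'A' || c > 'Z') return std::u32string();
        out += static_cast<char32_t>(0x1F1E6 + (c - 'A'));
    }
    return out;
}
```

flag_code_points("CA") returns U+1F1E8 followed by U+1F1E6, the sequence shown above.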

Programming


Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, as when a program parses a JPEG file.

cout << hex << 'Jack';  // 🦡
c.cc:1: warning: multi-character character constant
4a61636b
cout << hex << "ñ";
cout << hex << 'ñ';  // 🦡
c.cc:2: warning: multi-character character constant
ñc3b1
for (unsigned char c : "abcñ")
    cout << hex << setw(2) << setfill('0') << int(c) << ' ';
61 62 63 c3 b1 00 
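
The loop above goes through unsigned char for a reason: where plain char is signed, a byte such as 0xC3 would turn into a negative int and print as ffffffc3. A small sketch that wraps the same idea in a function (hex_bytes is a made-up name); unlike the loop above, it is handed a std::string, so it does not see the literal’s trailing zero byte.

```cpp
#include <cstdio>
#include <string>

// Render each byte of data as two lower-case hex digits, space-separated.
// Going through unsigned char keeps bytes >= 0x80 from sign-extending.
// (hex_bytes is a hypothetical helper name.)
std::string hex_bytes(const std::string &data) {
    std::string out;
    char buf[4];
    for (unsigned char c : data) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Given the UTF-8 bytes of "abcñ", hex_bytes returns "61 62 63 c3 b1".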

Bash Commands

echo -e \u: up to four digits; echo -e \U: up to eight digits

% echo -e "\uf1"
ñ
% echo -e "\U1f435"
🐵

wc -c counts bytes; wc -m counts characters

% echo "🐵" | wc -c
5
% echo "🐵" | wc -m
2
Two chars? Why two?

echo adds a newline unless the -n option is given.
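
The same two counts can be had in C++. For UTF-8, counting code points only requires skipping the continuation bytes (the ones that look like 10xxxxxx). This sketch assumes well-formed UTF-8, and it counts code points, which is not necessarily what wc -m reports in every locale; count_code_points is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-8 text by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is well-formed UTF-8.
// (count_code_points is a hypothetical helper name.)
std::size_t count_code_points(const std::string &utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the monkey plus newline, the bytes F0 9F 90 B5 0A, size() is 5 and count_code_points is 2.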

Viewing Files with xxd or od:

% cat ~cs253/pub/unicode
0034  4	 DIGIT FOUR
05D0  א	 HEBREW LETTER ALEF
203D  ‽	 INTERROBANG
1F34C 🍌 BANANA
% xxd ~cs253/pub/unicode
00000000: 3030 3334 2020 3409 2044 4947 4954 2046  0034  4. DIGIT F
00000010: 4f55 520a 3035 4430 2020 d790 0920 4845  OUR.05D0  ... HE
00000020: 4252 4557 204c 4554 5445 5220 414c 4546  BREW LETTER ALEF
00000030: 0a32 3033 4420 20e2 80bd 0920 494e 5445  .203D  .... INTE
00000040: 5252 4f42 414e 470a 3146 3334 4320 f09f  RROBANG.1F34C ..
00000050: 8d8c 2042 414e 414e 410a                 .. BANANA.
% od -t x1 ~cs253/pub/unicode
0000000 30 30 33 34 20 20 34 09 20 44 49 47 49 54 20 46
0000020 4f 55 52 0a 30 35 44 30 20 20 d7 90 09 20 48 45
0000040 42 52 45 57 20 4c 45 54 54 45 52 20 41 4c 45 46
0000060 0a 32 30 33 44 20 20 e2 80 bd 09 20 49 4e 54 45
0000100 52 52 4f 42 41 4e 47 0a 31 46 33 34 43 20 f0 9f
0000120 8d 8c 20 42 41 4e 41 4e 41 0a
0000132