Jack Applin
Unicode

Overview
Introduction
Chaos
The time before Unicode.

Pre-ASCII
It’s all about the mapping of bits to symbols. But what bits should we use to represent a given symbol? Is ‘A’ represented by 1, 1003, or 65? There were many opinions:
ASCII (ISO-646)
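ASCII’s answer: ‘A’ is 65. You can check in Python, which (like almost everything else) inherits the ASCII values:

# ASCII settled the question: 'A' is 65
print(ord('A'))   # 65
print(chr(65))    # 'A'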
National Use
Great, everything was standard! No, wait—the French still wanted their accents (à), the British wanted their pound sterling (£), etc. A number of characters were designated as “National Use”, to be replaced by local characters. For example, @ was replaced by ‘à’ for French use, and ‘§’ for the Germans. Similarly, ‘\’ was replaced by ‘Ö’ for the Swedes, and ‘Ñ’ for Spain. Swedish C programs looked like this:
printf("Hello, world!Ön");
I’m told that one got used to it.

I’m Still Not Satisfied
Still, this was not good enough. Greeks, Russians, and Israelis needed entire alphabets of non-Latin characters, and the few characters reserved for national use were insufficient. The character positions 128–255 were there for the taking, and so they got took. Many incompatible eight-bit extensions to ASCII were created.

Using the Eighth Bit
Convergence was impeded by the usual bickering between organizations reluctant to abandon their proprietary solutions for a common standard.

ISO-8859-X
Non-European Languages
All of these encodings are variable-length: one byte for ASCII, two bytes for Japanese/Chinese.
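For a taste, here’s Japan’s Shift JIS (one of those incompatible pre-Unicode encodings) in Python:

# Shift JIS is variable-length: ASCII stays one byte, kanji take two
print("A".encode("shift_jis"))    # b'A'          (1 byte)
print("日".encode("shift_jis"))   # b'\x93\xfa'   (2 bytes)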
Which Encoding?

Order

Incorporation
One way to change something is through incorporation. You don’t try to change the existing thing—you just incorporate it into a bigger framework.
Unicode (ISO-10646)
What’s in Unicode
All Unicode “blocks”: http://unicode.org/Public/UNIDATA/Blocks.txt

Code Points
Initially, Unicode is all about mapping integers to characters:
Now, do that for 110,000+ more characters.

Encoding
Fine, so we’ve defined this mapping. How do we actually represent those integers in a computer? That’s the job of an encoding. An encoding is a mapping of integers to bytes.

16-bit Encodings
UCS-2:
16-bit Encodings
UTF-16:
32-bit Encodings
UTF-32:
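Here’s a quick Python sketch of how one character comes out under the encodings above (using ñ, which we’ll meet again):

# one code point, three byte representations
s = "\u00f1"                                  # ñ, U+00F1
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, s.encode(enc).hex(" "))
# utf-8     c3 b1
# utf-16-be 00 f1
# utf-32-be 00 00 00 f1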
False Positives
Hey, there’s a slash in this string! No, wait, there isn’t.
When using UTF-16 or UTF-32 encoding, a naïve algorithm will falsely detect a slash (oh, excuse me, a solidus) in one of the bytes of U+262f. Similarly, a C-string cannot hold a UTF-16 or UTF-32 string, because of the embedded zero bytes.
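A minimal Python sketch of the trap:

# byte-wise search finds a "slash" that isn't in the text
s = "\u262f"                      # ☯ U+262F -- no slash here
utf16 = s.encode("utf-16-be")     # bytes 26 2f
print(b"/" in utf16)              # True: 0x2f is ASCII '/'
print("/" in s)                   # False: the character is absent
print(s.encode("utf-32-be"))      # b'\x00\x00&/' -- embedded zero bytes, too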
Morse Code
Consider the phrase “I ate lunch”, in Morse Code:
Nine characters encoded in 23 bits, not counting spaces between letters and words. That’s less than 2⅔ bits/character. How can this be?
etaoin shrdlu

Etaoin Shrdlu
Morse code is designed so that the most common English letters are represented by short sequences. E is a single •, T is a single −, whereas Q is − − • −. Q takes a long time to transmit, but the letter Q doesn’t occur that often, so that’s ok. Similarly, the UTF-8 encoding is designed so that Unicode code points 0–127 (which ones are those, again?) take only a single byte, whereas code points represented by large numbers can take up to four bytes. American imperialism or good engineering? You decide!

UTF-8 Variable-Length Encoding
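A quick Python check of those lengths:

# UTF-8 spends 1 to 4 bytes per code point
for ch in "A", "é", "☯", "🐵":
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
# A 0x41 1, é 0xe9 2, ☯ 0x262f 3, 🐵 0x1f435 4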
Illustration of Various Encodings
Byte Order Mark
Often, files contain a “magic number”—initial bytes that indicate what sort of file it is.
The character U+FEFF ZERO WIDTH NO-BREAK SPACE is also used as a Byte Order Mark, or BOM. When used as the first bytes of a data file, it indicates the encoding (assuming that you’re limited to Unicode). If not processed as a BOM, then ZERO WIDTH NO-BREAK SPACE is mostly harmless.
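A Python sketch of what those first bytes look like (the explicit-endian codecs keep Python from adding its own BOM):

# the same text, BOM first, under three encodings
text = "\ufeffhi"                       # ZERO WIDTH NO-BREAK SPACE, then "hi"
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    print(enc, text.encode(enc).hex(" "))
# utf-8     ef bb bf 68 69
# utf-16-be fe ff 00 68 00 69
# utf-16-le ff fe 68 00 69 00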
Email used to be a real mess. MIME extensions came along to help:

From: Greg Redder <Greg.Redder@ColoState.EDU>
To: Jack Applin <Jack.Applin@colostate.edu>
Subject: Re: SNMP read only string
Date: Tue, 11 Oct 2016 22:25:56 +0000
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

SmFjaywNCg0KV2UgY2FuIHNldCB1cCByZWFkLW9ubHkgYWNjZXNzIHRvIHN3aXRjaGVzIGluIHRo
ZSBDUyBidWlsZGluZy4gICAgV2UnZCBuZWVkIHRvIG1ha2Ugc3VyZSB0aGF0IHRoZSB3aG9sZSBz
bm1wIHRyZWUgaXNuJ3QgcmV0cmlldmVkIGZyZXF1ZW50bHkgb3IgeW91IGNhbiBidXJ5IHRoZSBP
Uy4gICAgU28sIGRlcGVuZGluZyB1cG9uIGhvdyBzdXJlIHdlIGFyZSB0aGF0IHdvbid0IGhhcHBl
biBtaWdodCBkZXRlcm1pbmUgaG93IG1hbnkgc3dpdGNoZXMgd2UgcHJvdmlkZSBhY2Nlc3MgdG8g
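Base64 is just an ASCII-safe wrapper around the UTF-8 bytes; decoding the first line of that message (in Python) recovers the text:

import base64
line1 = ("SmFjaywNCg0KV2UgY2FuIHNldCB1cCByZWFkLW9ubHkgYWNjZXNz"
         "IHRvIHN3aXRjaGVzIGluIHRo")
print(base64.b64decode(line1).decode("utf-8"))
# Jack,
#
# We can set up read-only access to switches in th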
Output
%echo $LANG
en_US.UTF-8
%rm foo
rm: cannot remove ‘foo’: No such file or directory
%unset LANG
%rm foo
rm: cannot remove 'foo': No such file or directory
Note the quotation marks: with a UTF-8 locale, rm quotes with ‘ and ’; without, plain ASCII '.

Input is hard!
Here’s the problem: there are now more characters available than there are keys on the keyboard. It’s not practical to have a keyboard with 137,000 keys: one with 1cm² keys would be 3½m × 3½m. Of course, people who write in Chinese and other languages with many characters have had to deal with this problem for quite some time.

Input: Copy & paste
Low-tech is sometimes the best. Create a file of your most-used Unicode chars:
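It might contain, for example:
° ± ½ µ é ñ π ☺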
and copy & paste them as needed.

Input: Linux Scripts
%onehalf
½
%micro
µ
%u active
2622 ☢ RADIOACTIVE SIGN
%u tri.*fire
2632 ☲ TRIGRAM FOR FIRE
%u u+2600 u+2603
2600 ☀ BLACK SUN WITH RAYS
2601 ☁ CLOUD
2602 ☂ UMBRELLA
2603 ☃ SNOWMAN
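u isn’t a standard command; here’s a rough Python equivalent, a sketch assuming the behavior shown above (a name regex, or a u+XXXX u+XXXX range):

#! /usr/bin/python3
# u -- look up Unicode characters by name regex or code-point range
import re, sys, unicodedata

def show(cp):
    name = unicodedata.name(chr(cp), "")
    if name:
        print(f"{cp:04x} {chr(cp)} {name}")

args = sys.argv[1:]
if args and all(re.fullmatch(r"u\+[0-9a-fA-F]+", a) for a in args):
    first, last = int(args[0][2:], 16), int(args[-1][2:], 16)
    for cp in range(first, last + 1):              # e.g. u+2600 u+2603
        show(cp)
else:
    pattern = re.compile(args[0], re.IGNORECASE)   # e.g. tri.*fire
    for cp in range(0x20, 0x10000):
        if pattern.search(unicodedata.name(chr(cp), "")):
            show(cp)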
Input Methods

Problems
Life is still not quite perfect.

Fonts
Emojis
Unicode has, arguably, too many emojis (1791 at last count). These are not established, long-used characters. They are recent inventions. This goes against the usual rules for Unicode.

Historical Mess
Unicode is one big collection of historical compromises. In a hundred years, will anybody care that the first 256 code points maintained compatibility with Latin-1?
Politics: the art of the possible.

More History
The double-struck alphabet, in code point order:
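You can reproduce the listing with a few lines of Python, scanning the character names:

# the double-struck capitals, in code point order
import re, unicodedata
for cp in range(0x2100, 0x1D5A0):
    name = unicodedata.name(chr(cp), "")
    if re.search(r"DOUBLE-STRUCK CAPITAL [A-Z]$", name):
        print(f"U+{cp:04X} {chr(cp)} {name}")
# ℂ ℍ ℕ ℙ ℚ ℝ ℤ come first (Letterlike Symbols, encoded early);
# the other nineteen letters are far away at U+1D538 ff.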
What’s wrong with this picture?

Spoofing
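A classic example, sketched in Python (any look-alike pair will do; here it’s Cyrillic а standing in for Latin a):

# Latin "a" and Cyrillic "а" render alike but compare unequal
latin = "paypal.com"
fake  = "p\u0430yp\u0430l.com"   # U+0430 CYRILLIC SMALL LETTER A
print(latin, fake)               # indistinguishable in many fonts
print(latin == fake)             # False -- a spoofer's opportunity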
Canonical Form
Accented characters can be pre-baked, or created in two steps:
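In Python, for instance:

# pre-baked vs. two-step ñ, and normalization to the rescue
import unicodedata
baked = "\u00f1"              # ñ as one code point
steps = "n\u0303"             # n + U+0303 COMBINING TILDE
print(baked, steps)           # both display as ñ
print(baked == steps)         # False -- same text, different code points
print(unicodedata.normalize("NFC", steps) == baked)   # True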
Unicode decrees that U+00f1 and the sequence U+006e U+0303 be treated as the same. Comparing strings just got a lot harder. Unicode calls this process normalization.

Programming
It’s all about bytes vs. characters. Too many languages have no byte type, so programmers use char instead. Trouble! The language has no idea whether you’re processing text, which should be treated as Unicode, or bytes of data, which would happen if a program were parsing a JPEG file.

Linux Commands
echo: \u takes up to four hex digits; \U takes up to eight hex digits.
%echo -e '\uf1'
ñ
%echo -e '\U1f435'
🐵
wc: -c counts bytes; -m counts characters.
%echo -e '\U1f435' | wc -c
5
%echo -e '\U1f435' | wc -m
2

C
C Example
// Show the decimal value of each character read.
#include <locale.h>
#include <wchar.h>
#include <stdio.h>

int main() {
    setlocale(LC_ALL, "");                     // Set locale per environment
    wchar_t buf[80];                           // 80 wide characters
    printf("sizeof(buf)=%zd\n", sizeof(buf));  // byte size of 80 wchar_t?
    while (fgetws(buf, 80, stdin) != NULL) {   // duplication of information
        printf("%ls", buf);                    // print the wide string
        for (size_t i=0; i<wcslen(buf); i++)   // O(N²) algorithm
            printf("%d ", (int) buf[i]);
        puts("");
    }
}

C++ Example
// Show the decimal value of each character read.
#include <clocale>
#include <iostream>
using namespace std;

int main() {
    setlocale(LC_ALL, "");        // Set locale per environment
    wstring s;                    // wstring, not string
    while (getline(wcin, s)) {    // wcin
        wcout << s << '\n';       // wcout
        for (size_t i=0; i<s.length(); i++)
            wcout << int(s[i]) << ' ';
        wcout << '\n';
    }
}

Java Example
// Show the decimal value of each character read.
import java.util.*;

class prog {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        while (scan.hasNextLine()) {
            String line = scan.nextLine();
            System.out.println(line);
            for (int i=0; i<line.length(); i++)
                System.out.print(line.codePointAt(i)+" ");
            System.out.println("");
        }
    }
}

Perl Example
#! /usr/bin/perl
# Show the decimal value of each character read.
use 5.14.2;
use warnings;
use utf8;
use open qw(:std :encoding(utf8));

while (<>) {
    print;
    for my $β (/./g) {
        print ord($β), " ";
    }
    print "\n";
}

Python Example
#! /usr/bin/python3
# Show the decimal value of each character read.
import sys

for lïné in sys.stdin:
    print(lïné)
    for Φ in lïné:
        print(ord(Φ), end=' ')
    print()

Resources