.\" rsc, 76193d7, 2003-09-30 17:47:42 +0000
.TH UTF 7
.SH NAME
UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Plan 9 character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
The Unicode Standard represents its characters in 16
bits;
.SM UTF-8
represents such
values in an 8-bit byte stream.
Throughout this manual,
.SM UTF-8
is shortened to
.SM UTF.
.PP
In Plan 9, a
.I rune
is a 16-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
machine-independent, byte-stream encoding called
.SM UTF.
.PP
.SM UTF
is designed so that characters in the 7-bit
.SM ASCII
set (values hexadecimal 00 to 7F)
appear only as themselves
in the encoding.
Runes with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.
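.PP
One consequence of this design is that a byte-wise search for an
ASCII character cannot match in the middle of a multibyte sequence.
The following is a minimal sketch of that property;
the names countascii and isasciionly are hypothetical helpers
introduced for illustration and are not Plan 9 library routines.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a Plan 9 routine): count occurrences of
 * the ASCII character c in the n bytes of UTF text at s.
 * A plain byte comparison is enough, because every byte of a
 * multibyte sequence lies in the range 80 to FF.
 */
size_t
countascii(const unsigned char *s, size_t n, unsigned char c)
{
	size_t i, count = 0;

	if (c > 0x7F)
		return 0;	/* not ASCII: a lone byte above 7F is not a character */
	for (i = 0; i < n; i++)
		if (s[i] == c)
			count++;
	return count;
}

/* Hypothetical helper: nonzero if all n bytes at s are ASCII. */
int
isasciionly(const unsigned char *s, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (s[i] > 0x7F)
			return 0;
	return 1;
}
```

For example, the six bytes 61 2F C3 A9 2F 62 hold two slashes
(2F) around a two-byte sequence (C3 A9), and a byte-wise count of
slashes finds exactly two.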
.PP
The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
:
programs presented only with
.SM ASCII
work on Plan 9
even if not written to deal with
.SM UTF,
as do
programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
.SM ASCII
graphic
characters must convert from
.SM UTF
to runes
in order to work properly with non-\c
.SM ASCII
input.
See
.IR rune (3).
.PP
Letting numbers be binary,
a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
.PP
01. x in [00000000.0bbbbbbb] → 0bbbbbbb
.br
10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
Conversions 10 and 11 represent higher-valued characters
as sequences of two or three bytes with the high bit set.
Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X/Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
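.PP
The three conversions can be sketched in C as follows.
This is an illustration under stated assumptions, not the library code:
the function name rune16toutf is hypothetical, and the rune is
assumed to be the 16-bit quantity described above
(see
.IR rune (3)
for the actual conversion routines).

```c
/*
 * Hypothetical sketch: encode the 16-bit rune r into buf, which
 * must have room for 3 bytes.  Returns the number of bytes written.
 * Testing the ranges in increasing order always selects the
 * shortest encoding.
 */
int
rune16toutf(unsigned char *buf, unsigned int r)
{
	r &= 0xFFFF;	/* assumption: runes are 16 bits wide */

	/* conversion 01: 00000000.0bbbbbbb -> 0bbbbbbb */
	if (r <= 0x7F) {
		buf[0] = (unsigned char)r;
		return 1;
	}

	/* conversion 10: 00000bbb.bbbbbbbb -> 110bbbbb, 10bbbbbb */
	if (r <= 0x7FF) {
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}

	/* conversion 11: bbbbbbbb.bbbbbbbb -> 1110bbbb, 10bbbbbb, 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (r >> 12));
	buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (r & 0x3F));
	return 3;
}
```

With this sketch, rune hexadecimal 00E9 becomes the two bytes C3 A9,
and rune hexadecimal 20AC becomes the three bytes E2 82 AC.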
.PP
In the inverse mapping,
any sequence except those described above
is incorrect and is converted to rune hexadecimal 0080.
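.PP
A decoder following this rule might look like the sketch below.
The names utftorune16 and Runeerror16 are hypothetical, and two
details are assumptions not stated above: the input length is passed
explicitly, and an incorrect sequence consumes exactly one byte.
Sequences that are longer than necessary are rejected, since only the
shortest encoding is described as correct.

```c
#include <stddef.h>

enum { Runeerror16 = 0x80 };	/* rune reported for incorrect input */

/*
 * Hypothetical sketch: decode one rune from the n bytes at s into *r.
 * Returns the number of bytes consumed: 0 if n is 0, and 1 for an
 * incorrect sequence (an assumption), with *r set to Runeerror16.
 */
int
utftorune16(unsigned int *r, const unsigned char *s, size_t n)
{
	unsigned int c, v;

	if (n == 0) {
		*r = Runeerror16;
		return 0;
	}
	c = s[0];

	/* one byte: 0bbbbbbb */
	if (c < 0x80) {
		*r = c;
		return 1;
	}

	/* every longer sequence needs a continuation byte 10bbbbbb */
	if (n >= 2 && (s[1] & 0xC0) == 0x80) {
		if ((c & 0xE0) == 0xC0) {
			/* two bytes: 110bbbbb, 10bbbbbb */
			v = ((c & 0x1F) << 6) | (s[1] & 0x3F);
			if (v > 0x7F) {		/* reject non-shortest form */
				*r = v;
				return 2;
			}
		} else if ((c & 0xF0) == 0xE0 && n >= 3 && (s[2] & 0xC0) == 0x80) {
			/* three bytes: 1110bbbb, 10bbbbbb, 10bbbbbb */
			v = ((c & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
			if (v > 0x7FF) {	/* reject non-shortest form */
				*r = v;
				return 3;
			}
		}
	}

	/* anything else is incorrect */
	*r = Runeerror16;
	return 1;
}
```

In this sketch a correctly encoded rune 0080 (bytes C2 80) and an
incorrect sequence both yield the value 0080; the caller can tell them
apart by the return value, which is 2 in the first case and 1 in the
second.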
.SH "SEE ALSO"
.IR ascii (1),
.IR tcs (1),
.IR rune (3),
.IR "The Unicode Standard" .