ISO 2022

"Information processing -- ISO 7-bit and 8-bit coded character sets -- Code extension techniques"

A set of mechanisms whereby 7-bit and 8-bit encodings may be extended to incorporate characters not normally representable in a single 7-bit or 8-bit encoding.

7-bit extensions are achieved through the use of Escape (ESC), Shift-Out (SO), and Shift-In (SI); 8-bit extensions are achieved through the use of these 7-bit extension mechanisms along with Locking-Shift Two (LS2), Locking-Shift Three (LS3), Single-Shift Two (SS2), and Single-Shift Three (SS3). The semantics of these mechanisms and 37 different escape sequences are defined by this standard in order to achieve a generalized extension mechanism.

The extensions supported by this standard revolve around three types of mechanisms: announcements, designations, and invocations. Announcers are used to make explicit default assumptions and to provide information about what follows; designations are used to associate character repertoires with particular control mechanisms (invocation sequences); and invocations are the use of particular control or escape sequences to enable a mapping between a designated character repertoire and a set of codepoints (character code element values).
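The split between designation and invocation can be sketched in a few lines of Python. This is a minimal illustration, not a conformant parser: it assumes that G0 starts out designated as ASCII and invoked, it handles only the designators ESC 2/8 F (to G0) and ESC 2/9 F (to G1), and the function name `trace_shifts` and the small `SETS` table are hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final bytes of a few registered 94-character sets (illustrative subset only).
SETS = {0x42: "ASCII", 0x4A: "JIS X 0201 Roman", 0x49: "JIS X 0201 Katakana"}

def trace_shifts(data: bytes):
    """Return (set name, byte) pairs for the graphic bytes of a 7-bit stream."""
    designated = {"G0": "ASCII", "G1": None}  # assumed initial designation
    invoked = "G0"                            # assumed initial invocation
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if (b == ESC and i + 2 < len(data)
                and data[i + 1] in (0x28, 0x29) and data[i + 2] in SETS):
            # Designation: associate a character set with G0 or G1.
            target = "G0" if data[i + 1] == 0x28 else "G1"
            designated[target] = SETS[data[i + 2]]
            i += 3
            continue
        if b == SO:
            invoked = "G1"    # invocation: G1 now occupies 2/1 through 7/14
        elif b == SI:
            invoked = "G0"    # invocation: back to G0
        elif 0x21 <= b <= 0x7E:
            out.append((designated[invoked], b))
        # Other bytes (controls, unrecognized escapes) are ignored in this sketch.
        i += 1
    return out
```

For example, `trace_shifts(b"\x1b)IA\x0e1\x0fB")` designates JIS X 0201 Katakana to G1 once, then uses SO and SI to switch the meaning of the bytes between the two designated sets.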

The designation and invocation mechanism of ISO2022 is general enough to support multibyte encodings and arbitrary "other coding systems". The "other coding system" mechanism allows any kind of encoding system, e.g., ISO10646 UCS2 (Unicode) or UCS4, to be incorporated in an ISO2022 conformant encoding string. Both returnable and non-returnable other coding systems are supported.

[Returnable means that one can escape out of the other coding system without having to understand the other coding system itself. ISO10646 UCS2 and UCS4 require the non-returnable mode, since a return is implemented by recognizing the escape sequence ESC 2/5 4/0 (the three octets 0x1B 0x25 0x40), and these octets can occur by accident inside UCS2 data. For example, on a big endian machine the sequence occurs whenever a UCS2 character whose low-order octet is 0x1B is followed by FORMS UP HEAVY AND DOWN HORIZONTAL LIGHT (0x2540); on a little endian machine it occurs whenever a character whose high-order octet is 0x1B is followed by HANGUL SYLLABLE LIEUL-YE-DIGEUD (0x4025). In contrast, all of the proposed UTF encoding forms of ISO10646 can use a returnable mode, since they preserve escape sequences. It might be fair to say that an ISO2022 parser *could* be taught to recognize the invocation of certain other coding systems, like ISO10646 UCS2 and UCS4; this is possible since the other coding system invocation sequences are known a priori. In this case, the parser could switch its escape sequence parser to use 2-byte or 4-byte encodings, by means of which it could recognize the 2-byte or 4-byte form of the return escape sequence: for UCS2, 0x001B 0x0025 0x0040; for UCS4, 0x0000001B 0x00000025 0x00000040. The registry for ISO2022, ECMA, has already assigned invocation and designation sequences for ISO10646 UCS2, UCS4, and UTF-1 encodings.]
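The accidental match described above can be checked mechanically. The following Python sketch takes ESC to be code position 1/11 (0x1B), so that ESC 2/5 4/0 is the octet sequence 0x1B 0x25 0x40. The helper name `has_false_return` is hypothetical, and the values 0x411B and 0x1B00 are arbitrary 16-bit values chosen only because one of their octets is 0x1B.

```python
import struct

RETURN_SEQ = bytes([0x1B, 0x25, 0x40])  # ESC 2/5 4/0

def has_false_return(code_units, byte_order):
    """Serialize 16-bit values and report whether ESC 2/5 4/0 appears as octets."""
    prefix = ">" if byte_order == "big" else "<"
    octets = struct.pack(prefix + "H" * len(code_units), *code_units)
    return RETURN_SEQ in octets

# Big endian: 41 1B 25 40 contains the return sequence.
print(has_false_return([0x411B, 0x2540], "big"))      # True
# Little endian: 00 1B 25 40 contains it as well.
print(has_false_return([0x1B00, 0x4025], "little"))   # True

# The 2-byte form of the return sequence does not contain the 1-byte form,
# so an octet-oriented parser would not notice it.
wide_return = struct.pack(">3H", 0x001B, 0x0025, 0x0040)
print(RETURN_SEQ in wide_return)                      # False
```

The last check illustrates why a parser would have to switch to 2-byte units before it could recognize a return from UCS2.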

ISO 6937 - a series of standards developed by ISO/TC97 which, among others, represents the needs of Library and Bibliographic communities. Both 7-bit and 8-bit encodings are provided according to ISO2022 encoding techniques and ISO6429 control function definitions.

ISO6937-1 - Information processing, Coded character sets for text communication, Part 1: General Introduction.

ISO6937-2 - Information processing, Coded character sets for text communication, Part 2: Latin alphabetic and non-alphabetic graphic characters.

ISO6937-2 AD 1 - Information processing, Coded character sets for text communication, Part 2: Latin alphabetic and non-alphabetic graphic characters, Addendum 1.

ISO6937-3 - Information processing, Coded character sets for text communication, Part 3: Control Functions for Page-image Format.

ISO6937-7 - Information processing, Coded character sets for text communication, Part 7: Greek graphic Characters.

ISO6937-8 - Information processing, Coded character sets for text communication, Part 8: Cyrillic graphic Characters.

JIS - One of any number of Japanese Industrial Standards produced by the Japanese Standards Association, such as JIS X 0201, JIS X 0208, JIS X 0212.

JIS X 0201 (1976) - defines two 7-bit character sets and an 8-bit character set. The first 7-bit character set is equivalent to ASCII except for code position 5/12, which has the YEN SIGN instead of BACKSLASH, and code position 7/14, which has an OVERSCORE instead of TILDE; the second 7-bit set contains Katakana symbols in what is called the "hankaku" or half-width form (as compared with the normal width of a Kanji glyph, which is called "zenkaku," or full-width form). The 8-bit set is the union of the two 7-bit sets, with the hankaku katakana located in the G1 space (10/0 through 15/15), of which only 10/1 through 13/15 are used.
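The layout of the 8-bit set can be sketched as a small decoder. The Unicode targets used here are assumptions that are not part of the entry above: YEN SIGN as U+00A5, OVERSCORE as U+203E, and the hankaku katakana at 10/1 through 13/15 as the halfwidth block U+FF61 through U+FF9F. The function name `decode_jisx0201_8bit` is hypothetical.

```python
def decode_jisx0201_8bit(data: bytes) -> str:
    """Decode 8-bit JIS X 0201 octets to Unicode (assumed mapping, see lead-in)."""
    chars = []
    for b in data:
        if b == 0x5C:                 # 5/12: YEN SIGN instead of BACKSLASH
            chars.append("\u00a5")
        elif b == 0x7E:               # 7/14: OVERSCORE instead of TILDE
            chars.append("\u203e")
        elif b < 0x80:                # remaining positions match ASCII
            chars.append(chr(b))
        elif 0xA1 <= b <= 0xDF:       # 10/1 through 13/15: hankaku katakana
            chars.append(chr(0xFF61 + (b - 0xA1)))
        else:                         # unused part of the G1 space
            raise ValueError(f"unused code position 0x{b:02X}")
    return "".join(chars)
```

Under these assumptions, `decode_jisx0201_8bit(b"\x5c100")` yields a yen sign followed by "100", and octets outside 10/1 through 13/15 in the upper half are rejected.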

JIS X 0208 (1983) - defines a set of 6353 Kanji characters, 169 Kana characters, 166 Latin, Greek, and Cyrillic characters, 10 numeral characters, 147 special characters, and 32 ruled line element characters, a total of 6877 characters. The characters are organized into "wards", which are blocks of 94 characters that are easily mapped into the ISO2022 G0 code set (2/1 through 7/14 = 94 codepoints). Kanji characters in this standard are divided into two sets: 2965 level 1 Kanji, and 3388 level 2 Kanji. The two levels are ordered according to different principles: level 1 is ordered phonetically; level 2 is ordered by radical and stroke count, with ties in radical and stroke count broken by onyomi phonetic order. The phonetic order applied to level 1 Kanji operates by choosing a representative "On-Kun" reading, either an onyomi (Japanized Chinese pronunciation) or a kunyomi (indigenous Japanese pronunciation); ties in "On-Kun" are broken by ordering by onyomi alone followed by kunyomi alone; further ties are broken by radical order, then stroke count. All JIS X 0208 characters require two bytes for their encoding. JIS X 0208 is designated and invoked in a 7-bit ISO2022 conformant string by the escape sequence ESC 2/4 4/2; in an 8-bit ISO2022 conformant string, it is designated by the same escape sequence, and invoked by Shift-In (SI).
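The mapping from wards into the 2/1 through 7/14 range can be sketched as follows. The sketch assumes that wards and the positions within a ward are both numbered from 1 to 94, so that adding 0x20 to each number gives the two bytes of the character. The function names `kuten_to_bytes` and `encode_7bit` are hypothetical.

```python
DESIGNATE_JISX0208 = bytes([0x1B, 0x24, 0x42])  # ESC 2/4 4/2

def kuten_to_bytes(ward: int, cell: int) -> bytes:
    """Map a (ward, cell) pair, each 1..94, to two bytes in 2/1 through 7/14."""
    if not (1 <= ward <= 94 and 1 <= cell <= 94):
        raise ValueError("ward and cell must both be in 1..94")
    return bytes([0x20 + ward, 0x20 + cell])

def encode_7bit(pairs) -> bytes:
    """Prefix the two-byte characters with the designating escape sequence."""
    return DESIGNATE_JISX0208 + b"".join(kuten_to_bytes(w, c) for w, c in pairs)

print(kuten_to_bytes(1, 1).hex())     # 2121, the first codepoint (2/1 2/1)
print(kuten_to_bytes(94, 94).hex())   # 7e7e, the last codepoint (7/14 7/14)
```

A 7-bit string produced this way starts with ESC 2/4 4/2 and then carries two bytes per character, as described above.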

JIS X 0212 (1990) - defines a set of nearly 6,000 additional kanji which were not included in JIS X 0208. These kanji are ordered like those of level 2 in JIS X 0208, i.e., by radical, stroke count, then onyomi.

Runic - a cute name given by AT&T Plan 9 designers to ISO10646 UCS2 (Unicode) fixed width 16-bit encodings. These 16-bit encodings are there called "runes". This term should not be used, particularly since ISO10646 UCS2 (Unicode) will eventually have the actual Rune characters included.

UTF - UCS Transformation Format. The first such format was defined in Annex F of DIS 10646-1.2:1992(E).

UTF-1 - same as UTF, a term used to distinguish it from other versions of UTF.

UTF-2 - same as UTF-FSS.

FSS - same as File System Safe.

UTF-FSS - a version of UTF defined by Ken Thompson, and used in the Plan 9 operating system.

UTF 7-bit, 4 byte form -

UTF 7-bit, 3/4 byte form -

7bit clean - a system is 7bit clean if it correctly handles all 7bit codesets.

UTF-8FSS - 8-bit File System Safe (also known as UTF-2), defined by Thompson.

UTF-7FSS - 7-bit File System Safe (also known as UTF-3), defined by Davis/Jenkins.

UTF-MU - 7-bit Mail System Safe (also known as MU), defined by van der Poel. It uses the Base64 encoding, which has already been implemented in many MIME programs.

Base64 characters - the set of characters A to Z, a to z, 0 to 9, +, /, and = (pad in Base64, escape in UTF-MU)

UTF-8FSS is:

Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
1      7     00000000  0000007F  0vvvvvvv
2      11    00000080  000007FF  110vvvvv 10vvvvvv
3      16    00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
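The rows of this table translate directly into bit operations. The following Python sketch covers only the three rows shown here (values up to 0000FFFF); the function name `utf8fss_encode` is hypothetical, and larger values are rejected rather than guessed at.

```python
def utf8fss_encode(value: int) -> bytes:
    """Encode one value using the 1-, 2-, and 3-byte rows of the table."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value <= 0x7F:
        # 0vvvvvvv
        return bytes([value])
    if value <= 0x7FF:
        # 110vvvvv 10vvvvvv
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value <= 0xFFFF:
        # 1110vvvv 10vvvvvv 10vvvvvv
        return bytes([
            0xE0 | (value >> 12),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ])
    raise ValueError("value is beyond the rows covered by this sketch")

print(utf8fss_encode(0x41).hex())    # 41
print(utf8fss_encode(0x80).hex())    # c280
print(utf8fss_encode(0x800).hex())   # e0a080
```

Each lead byte announces the length of its sequence, and every continuation byte begins with the bits 10, so no byte of a multi-byte sequence falls in the 7-bit range.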

