ISO-10646 Concept Dictionary

ISO-10646

ISO/IEC 10646-1:1993 (to be published in the first or second quarter this year) defines one repertoire of characters and two encoding forms, UCS-4 and UCS-2.

The UCS-2 encoding form is related to the UCS-4 encoding form by zero extension; that is, zero extending the 16-bit form to 32 bits yields the equivalent UCS-4 encoding form.
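
The zero-extension relationship can be sketched as follows. This is only an illustration, assuming big-endian octet order; the function name ucs2_to_ucs4 is hypothetical and is not defined by 10646.

```python
def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Zero-extend one two-octet character to its four-octet form."""
    if len(ucs2) != 2:
        raise ValueError("a UCS-2 character is exactly two octets")
    # Prepend two zero octets; the original two octets are unchanged.
    return b"\x00\x00" + ucs2


# LATIN CAPITAL LETTER A is 0x0041 in the two-octet form.
print(ucs2_to_ucs4(b"\x00\x41").hex())  # 00000041
```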

ISO DIS 10646 - same as ISO 10646.

DIS - same as Draft International Standard.

ISO 10646 was accepted in June 1992, and is to be published in 1993.

It would be very difficult to use 10646 as a file code because of ISO C requirements combined with the de facto standard size of a byte.

UTF is permissible as an ISO C-compliant file code.

UTF-2 - a version of UTF being used in Plan 9.

The boundary between process codes and file codes has always been blurred. Attempts have been made to separate the two, but most people still confuse the two and insist on using them interchangeably.

The goal in creating ISO 10646 was to include all characters from all significant languages; that is, to be a UCS.

UCS - same as Universal Coded Character Set.

Universal Coded Character Set - a codeset containing all characters commonly used in computer applications anywhere in the world.

The initial version of 10646 contains approximately 33,000 characters covering a long list of languages including European, Asian ideographic, Middle Eastern, Indian, and others. It also reserves 6,000 code spaces for private use.

10646 is based heavily on a code set called Unicode.

Unicode was developed primarily by Xerox and Apple, although other companies contributed to its design.

All encodings that OSF currently supports are superset-ASCII (except for some U.S. EBCDIC support in DCE).

The code space in 10646 is divided into cells, rows, planes, and groups.

Group-octet - byte 3 in a UCS-4 encoded character which designates a group of characters within 10646.

Plane-octet - byte 2 in a UCS-4 encoded character which designates a plane of characters within a group.

Row-octet - byte 1 in a UCS-2 or UCS-4 encoded character which designates a row of characters within a plane.

Cell-octet - byte 0 in a UCS-2 or UCS-4 encoded character which designates a character within a row.
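
The four octet definitions above can be illustrated by splitting a 32-bit code value into its parts. This is a minimal sketch; the names Octets and split_ucs4 are hypothetical, and byte 3 is taken to be the most significant octet, as the numbering above implies.

```python
from typing import NamedTuple


class Octets(NamedTuple):
    group: int  # byte 3
    plane: int  # byte 2
    row: int    # byte 1
    cell: int   # byte 0


def split_ucs4(code: int) -> Octets:
    """Split a four-octet code value into group, plane, row, and cell."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code value does not fit in four octets")
    return Octets(
        group=(code >> 24) & 0xFF,
        plane=(code >> 16) & 0xFF,
        row=(code >> 8) & 0xFF,
        cell=code & 0xFF,
    )


# A two-octet (UCS-2) character only ever uses the row and cell octets.
print(split_ucs4(0x00000041))
print(split_ucs4(0x12345678))
```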

cell - the position of a single character within a 10646 plane corresponding to a given row and column.

plane - a set of (no more than 256x256) characters of 10646 arranged in rows and columns.

octet - same as 8bit byte.

The 10646 nomenclature refers to coded characters as multiples of octets and assumes that octets are serialized, while the Unicode nomenclature refers to coded characters as indivisible 16-bit entities.

10646 defines the "big-endian" order as canonical when text is serialized as a stream of bytes.
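
As an illustration of big-endian serialization, the sketch below writes code values to a byte stream with the most significant octet first. The function names are hypothetical; the struct format characters ">H" and ">I" are the standard Python codes for big-endian 16-bit and 32-bit unsigned integers.

```python
import struct


def serialize_ucs2(codes):
    """Serialize 16-bit code values, most significant octet (row) first."""
    return b"".join(struct.pack(">H", code) for code in codes)


def serialize_ucs4(codes):
    """Serialize 32-bit code values, most significant octet (group) first."""
    return b"".join(struct.pack(">I", code) for code in codes)


print(serialize_ucs2([0x0041, 0x00E9]).hex())  # 004100e9
print(serialize_ucs4([0x0041]).hex())          # 00000041
```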

BMP - same as Basic Multilingual Plane.

Basic Multilingual Plane - same as plane zero.

plane zero - the first 65,536 code positions (256 rows of 256 cells) defined in 10646, which correspond to the Unicode set.

10646 defines two charset encodings, UCS-2 and UCS-4.

UCS-2 - same as Universal Coded Character Set-2.

Universal Coded Character Set-2 - Characters are encoded in the lower two octets (row and cell). Predictions are that this will be the most commonly used form of 10646.

UCS-4 - same as Universal Coded Character Set-4.

Universal Coded Character Set-4 - Characters are encoded in the full four octets.

UCS-2 and UCS-4 presently encode exactly the same set of characters, but that is expected to change over time.

UCS-2 and UCS-4 are not ASCII-transparent.
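
A short sketch of what "not ASCII-transparent" means in practice: the octets of ASCII text change when the text is re-encoded. The example assumes Python's "utf-16-be" and "utf-32-be" codecs as stand-ins for UCS-2 and UCS-4; for the ASCII characters used here they produce the same octets.

```python
text = "A/b"

ascii_octets = text.encode("ascii")
ucs2_octets = text.encode("utf-16-be")  # stand-in for UCS-2 (assumption)
ucs4_octets = text.encode("utf-32-be")  # stand-in for UCS-4 (assumption)

print(ascii_octets.hex())  # 412f62
print(ucs2_octets.hex())   # 0041002f0062
print(ucs4_octets.hex())   # 000000410000002f00000062

# An ASCII-transparent encoding would leave the ASCII octets unchanged.
print(ucs2_octets == ascii_octets)  # False
print(ucs4_octets == ascii_octets)  # False
```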

If one tries to imagine that either UCS encoding may be used as a multibyte encoding, several problems occur.

UCS-2 and UCS-4 do not provide unique code range compatibility with other multibyte encodings.

combining character - a character formed by the combination of two or more codepoints, where usually one is the base character and the others are diacritical marks.

Some languages are only fully supportable in 10646 through the use of combining characters. Examples include Korean, Arabic, and Thai.

UCS-2 Level 1 - Two-octet form, no combining chars.

UCS-2 Level 2 - Two-octet form, combining chars allowed with restrictions.

UCS-2 Level 3 - Two-octet form, combining chars allowed, no restrictions.

UCS-4 Level 1 - Four-octet form, no combining chars.

UCS-4 Level 2 - Four-octet form, combining chars allowed with restrictions.

UCS-4 Level 3 - Four-octet form, combining chars allowed, no restrictions.

Unicode R1.1 is equivalent to the UCS-2, Level 3 version of 10646.

UCS Transformation Format - same as UTF.

UTF - multibyte encoding of 10646 defined by a 10646 informative annex.

Informative annexes are not part of official ISO standards, so there is no requirement to support features defined therein.

The existing UTF definition does not obey the unique code range convention although no octets of any UTF characters can be in the range 0x00-0x20 or 0x7f-0x9f.

It is unclear whether UTF will be revised before the official version of the 10646 standard is published.

FSS-UTF - multibyte encoding similar to 10646 UTF developed by Ken Thompson for AT&T's Plan 9. A "file system safe Unicode transfer format", where the 16-bit characters are encoded as 8-bit strings in a 7-bit ASCII/C/UNIX compatible way.
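
The ASCII-compatible property of FSS-UTF can be seen in a short sketch. It assumes Python's "utf-8" codec as a stand-in: FSS-UTF was later standardized as UTF-8, and for 16-bit characters the octets are the same. The function name fss_utf_octets is hypothetical.

```python
def fss_utf_octets(ch: str) -> bytes:
    """Encode one 16-bit character (utf-8 codec used as a stand-in)."""
    if ord(ch) > 0xFFFF:
        raise ValueError("only 16-bit characters are considered here")
    return ch.encode("utf-8")


for ch in ["A", "/", "\u00e9", "\u012f", "\u4e2d"]:
    octets = fss_utf_octets(ch)
    # ASCII characters stay single ASCII octets; every octet of a
    # multi-octet character has its high bit set, so it can never be
    # mistaken for slash, dot, or a zero octet.
    safe = len(octets) == 1 or all(b >= 0x80 for b in octets)
    print(f"U+{ord(ch):04X} -> {octets.hex()} safe={safe}")
```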

subsetting - 10646 also allows subsetting in which an implementation can choose to support a subset of the code positions within 10646, and be ISO 10646 conformant. Such an implementation must identify the characters in its repertoire.

File code requirements are determined by ISO C.

ISO C specifies that source files contain either single-byte or multibyte characters, and that a null byte (all bits set to zero) terminates a character string. It further specifies that the second or subsequent bytes of a multibyte character may not be a null byte.

Given the definition of file code from ISO C, it is nearly impossible to use UCS-2 or UCS-4 as a file code.

While ISO C does not define the size of a byte, it says that an implementation does. In implementations that define a byte to be eight bits -- all common implementations -- neither UCS-2 nor UCS-4 is permissible as a multibyte encoding. That is because on an eight-bit byte system, UCS-* data contains octets that are interpreted as null bytes.
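
The problem can be sketched by mimicking how a C string function finds the end of a string. The helper c_strlen below is a hypothetical stand-in for strlen(), and Python's "utf-16-be" codec is assumed as a stand-in for UCS-2.

```python
def c_strlen(buf: bytes) -> int:
    """Count octets before the first zero octet, as C's strlen() would.

    If there is no zero octet the whole buffer is counted; real C code
    would instead read past the end of the buffer.
    """
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


text = "AB"
as_ascii = text.encode("ascii") + b"\x00"          # 41 42 00
as_ucs2 = text.encode("utf-16-be") + b"\x00\x00"   # 00 41 00 42 00 00

print(c_strlen(as_ascii))  # 2
print(c_strlen(as_ucs2))   # 0: the first octet of 'A' already looks like a terminator
```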

The compiler vendor determines the definition of a byte.

UTF is acceptable as a file code.

Kernel modules do not (and are not supposed to) call setlocale() to determine the user's current locale. Instead, the kernel looks for slash and dot in only one way. If UTF were a supported file code, however, the kernel would need locale information to distinguish ASCII slash and dot from UTF octets that have the same encoding as slash and dot.

AT&T's Plan 9 requires that all file code data be encoded in UTF. NT requires that all file code data be Unicode-encoded strings. There are two main uses of a file code: As the contents of a file, and in system resource names (for example, file and directory names).

UTF-in-files support - support of UTF for the contents of a file but not for resource names.

Providing only UTF-in-files support has two disadvantages. One is that users may not understand why system resource names cannot be supported. The other disadvantage to separating file code into two categories is that there are times when the kernel interprets data within files. The file system may look at character data within binary files when resolving dynamic links. The init process reads several files -- files that contain resource names.

Given the current definition of the wchar_t interfaces, it is possible to use either UCS-2 or UCS-4 as a wide character process code AS LONG AS combining characters are not allowed. The interfaces implicitly assume that one wchar_t equals one complete character.
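
The "one wchar_t equals one complete character" assumption can be illustrated with a base letter followed by a combining mark. This is a rough sketch only: it uses Python's unicodedata module, which reflects a current Unicode database rather than the 1993 repertoire, and it treats any code with a non-zero canonical combining class as a combining character, which is an approximation. The function name is hypothetical.

```python
import unicodedata


def count_complete_characters(text: str) -> int:
    """Count codes that are not combining marks (approximation, see above)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Two codes would occupy two wide-character slots...
print(len(decomposed))                        # 2
# ...but the user sees a single complete character.
print(count_complete_characters(decomposed))  # 1
```

An interface that steps through an array one element at a time would treat the accent as a character in its own right, which is why the combining levels do not fit the current wide character model.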

interchange code - a character encoding used for data when it is travelling on the network, when going between processes (as in a cut-and-paste operation between two Motif windows), or when it is part of a Remote Procedure Call (RPC).

There are three major interchange code strategies: single interchange code; multiple codes with identifying tags; and multiple codes, no tags.

single interchange code - an interchange code strategy in which a system defines a single interchange code and either requires that all data be converted to the format, or blindly treats all data as being in that format.

multiple codes with identifying tags - an interchange code strategy in which multiple interchange codes are defined and a tag is added to each data packet that identifies which code the packet contains.

multiple codes, no tags - an interchange code strategy in which multiple interchange codes are defined, but nothing is done to identify them. Most modern networks work this way.
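
The difference between the tagged and untagged strategies can be sketched as follows. The packet layout and function names are hypothetical, and Python codec names are used as the identifying tags purely for illustration.

```python
from typing import NamedTuple


class TaggedPacket(NamedTuple):
    codeset: str    # identifying tag (a Python codec name in this sketch)
    payload: bytes  # data encoded in that codeset


def send_tagged(text: str, codeset: str) -> TaggedPacket:
    return TaggedPacket(codeset, text.encode(codeset))


def receive_tagged(packet: TaggedPacket) -> str:
    # The tag tells the receiver exactly how to interpret the payload.
    return packet.payload.decode(packet.codeset)


original = "caf\u00e9"

# With a tag, the receiver recovers the text.
packet = send_tagged(original, "iso8859-1")
print(receive_tagged(packet) == original)  # True

# Without a tag, the receiver must guess; a wrong guess silently
# produces different characters.
untagged = original.encode("iso8859-1")
guessed = untagged.decode("iso8859-7")
print(guessed == original)  # False
```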

Choosing a single form of 10646 may be the only logical choice for a single interchange code implementation. After all, a system that only supports one interchange form should be capable of representing all characters. 10646 is the only set that comes close to being able to do that.

A single interchange code would have a performance disadvantage. This model requires two conversions for all interchange tasks: into the interchange format at the beginning of a trip, and back out when the data reaches its destination. No existing standard is completely code set independent. For example, ISO C specifies characteristics that multibyte encodings must meet.
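
The two conversions required by the single interchange code model can be sketched as follows. The function names are hypothetical, and Python's "utf-16-be" codec is assumed as a stand-in for a UCS-2 interchange code.

```python
INTERCHANGE = "utf-16-be"  # stand-in for UCS-2 (assumption)


def to_interchange(data: bytes, local_codeset: str) -> bytes:
    """First conversion: local codeset to interchange code, at the sender."""
    return data.decode(local_codeset).encode(INTERCHANGE)


def from_interchange(data: bytes, local_codeset: str) -> bytes:
    """Second conversion: interchange code to local codeset, at the receiver."""
    return data.decode(INTERCHANGE).encode(local_codeset)


sender_bytes = "caf\u00e9".encode("iso8859-1")
on_wire = to_interchange(sender_bytes, "iso8859-1")
receiver_bytes = from_interchange(on_wire, "iso8859-1")

print(sender_bytes.hex())    # 636166e9
print(on_wire.hex())         # 00630061006600e9
print(receiver_bytes.hex())  # 636166e9
```

Both conversions happen even though the sender and the receiver in this sketch use the same local codeset, which is the source of the performance cost.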

"The New ASCII" - description applied by some to 10646.

A reason for adding 10646-only data types/interfaces is that Microsoft's NT and a future Apple Computer OS include such types and interfaces.

10646 cannot really be regarded as a single codeset, so software that supports it must still be codeset independent.

Efforts to create code-set-specific interfaces must be tied to efforts by compiler vendors to provide code-set-specific types.

The contents of wchar_t differ depending on locale. Because file codes have different characteristics, there are separate algorithms for converting the file codes to wchar_t representations.

The wchar_t as defined is not necessarily exchangeable between processes.

Some have suggested defining UCS-2 level 1 or UCS-2 level 3 as the single, "well-known" process code.

"well-known" process code - codeset that may be used as a process code and that may be exchanged between processes.

As noted above, wchar_t doesn't support combining characters, so it isn't feasible to use UCS-2, Level 3 as the well-known process code. UCS-2, Level 1 is a less-than-ideal choice for other reasons. With combining characters, 10646 has a nearly infinite repertoire of complete characters, but UCS-2, Level 1 necessarily has a smaller repertoire. If we designate UCS-2, Level 1 as the well-known process code, OSF technologies will not be able to process languages that are only fully supportable in 10646 through the use of combining characters, and won't be able to support code sets (like the new Taiwanese set) that don't fit in UCS-2.

NT provides Unicode support by using a single source, dual object model, where one object supports Unicode only (currently without combining characters) and the other handles ASCII-based encodings. Support for combining characters is planned for a future release.

NT uses wchar_t, but only allows it to be 16 bits and to contain only Unicode. NT hard-codes in dependencies on the size and contents of wchar_t. This is contrary to how ISO C and XPG4 define the semantics of wchar_t.

NT uses Unicode as a file code. As noted above, Unicode and UCS-* are not permissible as multibyte file codes on an eight-bit byte system.

NT uses wide character functions that seem similar to the XPG4 functions but are different because they can only process Unicode data. In addition, NT does not include the entire XPG4 set of functions. In essence, Microsoft has created its own wide character functions. In some cases, the Microsoft names match the standard ones; in others, they don't.

OSF/1 R1.2 may augment its iconv converter modules with a new group that converts between existing standard code sets and various forms of ISO 10646. At this time, OSF does not contemplate support for UCS-* Level 2 or Level 3.

Without UTF or some other multibyte version of 10646, conversion between existing multibyte codesets would still be possible. It is not clear what this functionality would buy the software, however.

Allowing UCS-* or UTF as interchange codes would enhance interoperability with OSes that support these encodings as file codes. Examples include NT and Plan 9.

OSF/1 currently assumes that any octet with the value 0x00-0x3f or 0x7f is an ASCII character. Several encodings, however, including the current version of UTF, allow octets to have values in the subrange 0x21-0x3f.