Multibyte Character Sets Concept Dictionary

Multibyte Character Sets

character - an abstract element of text.

graphic character - a character, other than a control function, that has a visual representation normally handwritten, printed or displayed, and that has a coded representation consisting of one or more bytes.

graphic symbol - a visual representation of a graphic character.

character set - a set of characters.

encoding - any numeric representation of a character in a character set. Sometimes also means "set of encodings".

codeset - same as coded character set.

coded character set - a set of unambiguous rules that establishes a character set and the one-to-one relationship between each character of the set and its coded representation.

Typically, a character set standard lists the graphic and non-graphic characters in the character set, and displays the graphic symbol and the numeric encoding for each character. In addition, each character is assigned a name.

codeset independent - a program is said to be codeset independent iff 1) the program supports user specification of the codeset, 2) all codesets offered by the operating system are supported by the program, and 3) details of the codeset are extracted dynamically from the locale database.

8bit codeset independent - a program is said to be 8bit codeset independent iff, for the collection of 8bit codesets supported by the operating system, it is codeset independent.

code point - an encoding for a character.

character encoding - same as code point.

charset - an encoding in which all code points have the same number of bits.

multibyte codeset - a codeset whose encodings contain a variable number of bytes. By convention, a charset having 8bit encodings is also known as a multibyte encoding.

multibyte character - given a character, the encoding of that character from a multibyte codeset.

Most peripherals, network hardware, and existing programs support ASCII. So I/O must continue to support one byte characters. But one byte cannot represent many characters in other languages. So multibyte codesets are required, at least initially.

wide character codeset - a charset whose encodings contain two bytes or more.

wide character - a character from a wide character codeset.

8bit codeset - a charset whose encodings contain only one byte each.

single byte codeset - same as 8bit codeset.

process code - character encoding used internally by a program. Usually refers to a wide character encoding.

file code - character encoding used for I/O and file storage. By convention, the set of file codes for a character set is the same as the multibyte encoding for the character set.

char - a byte data type used to represent a single byte of a multibyte character.

wchar_t - data type used to represent a wide character.

The wchar_t data type resulted from an ANSI C committee compromise between the proposal to use char and short char (for character and byte) and the Japanese proposal to introduce a new type along with a large number of new functions to handle it.

NLchar - same as wchar_t. Obsolete. Defined in AIX only.

Most existing programs, written to support only 8bit codesets, already work correctly with arbitrary multibyte codesets.

System functions such as getenv(), open(), unlink(), etc., expect and return multibyte strings.

Most string manipulation algorithms can be implemented using multibyte strings exclusively, though possibly with some clumsiness.

Use of wide character encodings should be minimized because: 1) For 8bit locales, performance will be optimized if multibyte encodings are used exclusively, while non-8bit locales will notice no performance degradation. 2) Not all platforms support or require large character sets, so on those platforms the use of wide character functions would have to be replaced. 3) Debugging almost always occurs in the C locale, which is easier with multibyte algorithms.

Wide characters are required 1) when performing character classification, 2) when performing single character input.
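As a rough illustration of the classification case, the sketch below keeps the string in multibyte form and converts one character at a time to a wide character only to classify it. The function name count_alpha is hypothetical, and the result depends on the LC_CTYPE category of whatever locale the caller has selected.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters in the multibyte
 * string s, as classified by the current locale.  Returns -1 if s
 * contains an invalid multibyte sequence. */
long count_alpha(const char *s)
{
    long count = 0;

    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    while (*s != '\0') {
        wchar_t wc;
        int n = mbtowc(&wc, s, MB_CUR_MAX);

        if (n < 0)
            return -1;                  /* invalid sequence */
        if (iswalpha((wint_t)wc))
            count++;
        s += n;                         /* step over the whole character */
    }
    return count;
}
```

Only the single character being classified is ever held as a wide character; the string itself is never converted as a whole.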

code set width - the maximum number of bytes required to represent a character as a file code.

code set display width - the maximum number of columns required to display a character on a terminal.

American National Standard Code for Information Interchange - same as ASCII.

ASCII - same as 7bit ASCII.

7bit ASCII - a code set defined by ANSI containing control characters, punctuations, digits, and the upper and lower case characters of the English alphabet.

8bit ASCII - the extension of 7bit ASCII adopted by ANSI to include characters to support western languages other than English.

iso8859 - a collection of 8bit charsets defined by iso.

iso8859-1 - a charset commonly used for English and western European languages defined by iso. For use with the languages: Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, and Swedish.

iso8859-2 - a charset commonly used for eastern European languages defined by iso. For use in the following languages: Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbocroatian, Slovak, and Slovene.

iso8859-3 - a charset commonly used for southern European and southern African languages defined by iso. For use in the following languages: Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish.

iso8859-4 - a charset containing a majority of Scandinavian characters defined by iso. For use in the following languages: Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, Lithuanian, Swedish, and Norwegian.

iso8859-5 - a charset containing ASCII and Cyrillic characters defined by iso. For use in the following languages: Bulgarian, Byelorussian, English, Macedonian, Russian, Serbocroatian, and Ukrainian.

iso8859-6 - a charset containing ASCII and Arabic characters defined by iso.

iso8859-7 - a charset containing ASCII and Greek characters defined by iso.

iso8859-8 - a charset containing ASCII and Hebrew characters defined by iso.

iso8859-9 - a charset defined by iso for Turkish. It is iso8859-1 with the Icelandic letters replaced by Turkish ones, and it supersedes iso8859-3 for Turkish use.

latin-1 - same as iso8859-1.

latin-2 - same as iso8859-2.

latin-3 - same as iso8859-3.

latin-4 - same as iso8859-4.

latin-Cyrillic alphabet - same as iso8859-5.

latin-Arabic alphabet - same as iso8859-6.

latin-Greek alphabet - same as iso8859-7.

latin-Hebrew alphabet - same as iso8859-8.

latin-5 - same as iso8859-9.

cyrillic - the alphabet used to write Russian and a number of other languages, such as Bulgarian and Ukrainian.

pc850 - an 8bit charset defined by IBM, used on the IBM PC and supported on AIX. This charset has a 7bit ASCII subset and extended characters for use in European locales.

ISO 646 - 1983, ISO 7bit coded character set for information interchange. Same as 7bit ASCII.

ISO 2022 - 1986, ISO 7bit and 8bit coded character sets; Code extension techniques. Using the graphic characters and encoding later defined in iso8859-1.

ISO 4873 - 1986, ISO 8bit code for information interchange; Structure and rules for implementation. Using the graphic characters and encoding later defined in iso8859-1.

ISO 6429 - 1988, ISO 7bit and 8bit coded characters sets; Additional control functions for character-imaging devices.

ISO 6937/2 - coded character sets for text communication; contains part 2 on latin alphabetic and non-alphabetic graphic characters.

ISO 9036 - Arabic 7bit coded character set for information interchange.

ASMO 449 - 7bit coded Arabic character set for information interchange.

ISO/IEC 10367 - Repertoire of standardized coded graphic character sets for use in 8bit codes.

ISO/IEC 10646 - Universal Coded Character Set, same as Unicode.

UCS - same as Universal Coded Character Set.

JIS x0208 - Japanese Industrial Standard codeset.

KS C5601 - Korean Standard codeset.

CNS 11643 - Chinese (ROC) codeset.

GB 2312 - Chinese (PRC) codeset.

extended character - given a codeset containing ASCII as a proper subset, an extended character is a character in the codeset which is not in ASCII. Also applies to portable subsets of the codeset other than ASCII.

We may assume that 7bit ASCII is a proper subset of all supported code sets.

ASCII transparent - A code set is said to be ASCII transparent iff none of the encodings requiring multiple bytes contain bytes which are the same as 7bit ASCII encodings.

ASCII non-transparent - Any codeset which is not ASCII transparent is said to be ASCII non-transparent.

It is possible to perform a bytewise scan for 7bit ASCII code points if the codeset is ASCII transparent, not otherwise.

shift byte - in a multibyte character, the first byte is called a shift byte if it changes the interpretation of following bytes. Often, a shift byte is used to select a range of code values while the following byte (or bytes) represents the offset of the desired character in that range.

stateful encodings - encodings that use locking shifts.

Encodings that use single shifts (such as EUC) are not stateful encodings.
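A program can ask the C library whether the codeset of the current locale is stateful: mblen() called with a null pointer returns nonzero when the encoding is state-dependent. The wrapper name codeset_is_stateful below is hypothetical; this is only a minimal sketch.

```c
#include <stdlib.h>

/* Hypothetical helper: returns nonzero if the codeset of the current
 * locale uses state-dependent (locking shift) encodings. */
int codeset_is_stateful(void)
{
    /* With a null pointer, mblen() reports on the encoding itself
     * rather than measuring a character. */
    return mblen(NULL, 0) != 0;
}
```

In the C locale, and in locales using a single-shift encoding such as EUC, this is expected to return zero.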

code point range - a set of contiguous code points.

unique code point range - the code point range (0x00 through 0x3F) such that none of the supported code sets have bytes from the unique code point range in any byte of a character that requires multiple bytes to encode. Furthermore, these codes always refer to the same character as specified for 7bit ASCII. the graphic characters in the POSIX.2 portable character set. The characters in the unique code point range are: control characters 0x00-0x1F, !"#$%&'()*+,-./0123456789:;<=>?

Characters in the unique code point range may be scanned for, bytewise, irrespective of whether the codeset is ASCII transparent or not.
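A minimal sketch of such a bytewise scan follows. The function name scan_unique is hypothetical, and refusing characters outside 0x00 through 0x3F is a choice made for this example, not something the definition requires.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the first byte in s equal to
 * c, or NULL if there is none.  The scan is bytewise, so it is only
 * attempted when c lies in the unique code point range (0x00-0x3F);
 * for any other c the function refuses and returns NULL. */
const char *scan_unique(const char *s, char c)
{
    if ((unsigned char)c > 0x3F)
        return NULL;                    /* could match part of a character */
    return strchr(s, c);
}
```

The slash (0x2F) falls inside the range, so a pathname can be split on slashes this way without decoding any multibyte characters.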

unicode - a single charset developed by Unicode, Inc. that aims to support every character in use in the world's written languages.

8bit dirty - a program is said to be 8bit dirty iff it interprets the 8th bit in specialized ways.

8bit clean - a program is said to be 8bit clean iff it is not 8bit dirty. Equivalently, a program is said to be 8bit clean iff it is 8bit codeset independent. Less formally, a program is 8bit clean iff correct behavior of the program does not require that the only supported codeset be 7bit ASCII.

portable filename character set - The set of characters from which portable filenames are constructed, including only ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._- defined by POSIX.1.

"Portable pathname character set" - same as portable filename character set with the addition of the slash character.

According to POSIX.1, the encoding of the portable filename character set is not specified, but the encoding must be unique.

The X Window System assumes that the portable filename character set encoding is identical on all machines in a networked computing environment.

supplementary code set - the portion of the code set not included in the portable filename character set.

supplementary code set character - character from the supplementary code set.

font - a collection of glyphs used to represent the characters corresponding to each value in a charset.

glyph - same as font glyph.

font glyph - the actual image of the corresponding character that gets displayed.

X portable character set - a superset of the POSIX portable filename character set which contains the characters a..zA..Z0..9!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ and space, tab, and newline.

host portable character encoding - the encoding for the X portable character set, which must be the same for all locales on a given host machine, but is otherwise unspecified.

latin portable character encoding - the encoding of the X portable character set in which the encodings agree with the latin-1 encoding.

STRING encoding - latin-1 plus newline and tab.

portable character set - The set of characters from which portable names are constructed, including the graphics characters from ASCII and some control characters, defined by POSIX.2.

Currently, the POSIX.2 portable character set is identical to the ISO/IEC 646 character set, though this need not remain true in the future.

Each supported coded character set shall include the portable character set.

POSIX.2 places only the following requirements on the encoded values of the characters in the portable character set: 1) If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified. 2) The encoded values associated with the digits '0' to '9' shall be such that the value of each character after '0' shall be one greater than the value of the previous character. 3) A null character, NUL, which has all bits set to zero, shall be in the set of characters.
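Requirement 2 is what makes the usual digit arithmetic safe in every supported codeset. A small sketch, with the hypothetical name digit_value:

```c
/* Hypothetical helper: return the numeric value of the digit c, or -1
 * if c is not one of '0' to '9'.  This relies only on the digits being
 * encoded as consecutive, increasing values. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}
```

No comparable guarantee is listed for letters, so the same arithmetic on 'a' to 'z' would be an assumption about the codeset.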

POSIX.2 does not require that the characters of the portable character set have the same encoding in all codesets.

Implementations shall provide a charmap file for at least one coded character set supported by the implementation.

It is implementation-defined whether or not users or applications can provide additional charmap files. If such a capability is supported, the system documentation shall describe the rules for the creation of such files.

charmap file - character set description file.

Charmap files may (but are not required to) contain the name of the codeset, the value of mb_cur_max, the value of mb_cur_min, the escape character and the comment character which may be used in the charmap file.

A charmap file contains lines composed of the symbolic name, encoding, and comments. A line may also specify a range of symbolic names if those names have numeric suffixes. The escape character is used to introduce an encoding. An encoding shall be expressed as one (for single-byte character values) or more concatenated decimal, octal, or hexadecimal constants.

conversion to portable codeset - the process by which a character in one codeset is converted to a corresponding character or sequence of characters which are members of a portable subset of the given codeset. This is done either by flattening or by escaping.

flattening - process by which a character in one codeset is converted into another character in the same codeset in such a way that the resulting character is a member of a portable codeset and has the same basic appearance as the original character.

flattened character - character resulting from the process of flattening.

appearance-preserving conversion - same as flattening.

escaping - process by which a character in one codeset is converted to an escape sequence of characters in the same codeset in such a way that the characters making up the escape sequence are all members of a portable codeset.

information-preserving conversion - same as escaping.
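As one possible form of escaping, the sketch below rewrites each byte that has the 8th bit set as a backslash followed by three octal digits. The function name escape_to_portable, the choice of octal, and the doubling of backslashes are all assumptions made for this example; it also assumes an 8bit codeset whose lower half is ASCII.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: escape src into dst, which holds cap bytes.
 * Bytes with the 8th bit set become a backslash followed by three
 * octal digits; a backslash is doubled so the result can be decoded
 * unambiguously.  Returns 0 on success, -1 if dst is too small. */
int escape_to_portable(const char *src, char *dst, size_t cap)
{
    size_t used = 0;

    if (cap == 0)
        return -1;
    for (; *src != '\0'; src++) {
        unsigned char b = (unsigned char)*src;

        if (b >= 0x80) {
            if (cap - used < 5)         /* 4 bytes plus terminator */
                return -1;
            snprintf(dst + used, 5, "\\%03o", (unsigned int)b);
            used += 4;
        } else if (b == '\\') {
            if (cap - used < 3)         /* 2 bytes plus terminator */
                return -1;
            dst[used++] = '\\';
            dst[used++] = '\\';
        } else {
            if (cap - used < 2)         /* 1 byte plus terminator */
                return -1;
            dst[used++] = (char)b;
        }
    }
    dst[used] = '\0';
    return 0;
}
```

Because every escaped byte can be recovered from its octal digits, no information is lost, which is what distinguishes escaping from flattening.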

The functions wctomb(), wcstombs(), mbtowc(), and mbstowcs() may be used to convert between the wide character and multibyte representations of a given character string.
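A minimal sketch of a round trip between file code and process code using mbstowcs() and wcstombs(). The helper names to_process_code and to_file_code are hypothetical, and both conversions follow the current locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (file code) to a newly
 * allocated wide character string (process code).  Returns NULL on an
 * invalid sequence or allocation failure.  The caller frees the result. */
wchar_t *to_process_code(const char *mbs)
{
    /* A multibyte character occupies at least one byte, so the wide
     * string never needs more elements than the byte string. */
    size_t cap = strlen(mbs) + 1;
    wchar_t *ws = malloc(cap * sizeof *ws);

    if (ws == NULL)
        return NULL;
    if (mbstowcs(ws, mbs, cap) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}

/* Hypothetical helper: the reverse conversion.  Each wide character
 * needs at most MB_CUR_MAX bytes in the current locale. */
char *to_file_code(const wchar_t *ws)
{
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *mbs = malloc(cap);
    size_t n;

    if (mbs == NULL)
        return NULL;
    n = wcstombs(mbs, ws, cap);
    if (n == (size_t)-1 || n >= cap) {  /* invalid, or not terminated */
        free(mbs);
        return NULL;
    }
    return mbs;
}
```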

The C language features, code literal, string literal, character constant, and wide character constant, may be used to specify a single character or a string in a program. However, their use should be limited to those characters from the portable code set unless the use of the resulting program in the same locale as the compile-time environment can be guaranteed.

code literal - C language feature which allows the numeric value of a code point to be specified.

string literal - C language feature of the form "abc" which allows a null-terminated multibyte string of characters to be specified. The string is encoded in the executable in the locale in effect at compile time.

character constant - C language feature of the form 'a' which allows a single character to be specified. The character is encoded in the executable in the locale in effect at compile time.

wide character constant - C language feature of the form L'a' which allows a wide character to be specified. The character is encoded in the locale in effect at compile time.

backward scanning - technique used in string manipulation algorithms in which the end of the string is first located and the string is then searched for a character or a pattern from the end to the beginning.

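Backward scanning is not possible in general with multibyte strings, because the boundaries between characters cannot always be determined when reading from the end of the string.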
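The usual substitute is a forward scan that remembers the most recent match. The sketch below finds the last occurrence of a single-byte character while stepping over whole multibyte characters with mblen(). The function name mb_last is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: return a pointer to the last character in the
 * multibyte string s that is encoded as the single byte c, or NULL if
 * there is none or s contains an invalid sequence. */
const char *mb_last(const char *s, char c)
{
    const char *last = NULL;

    mblen(NULL, 0);                     /* reset any shift state */
    while (*s != '\0') {
        int n = mblen(s, MB_CUR_MAX);

        if (n < 0)
            return NULL;                /* invalid sequence */
        if (n == 1 && *s == c)
            last = s;                   /* remember the latest match */
        s += n;                         /* skip the whole character */
    }
    return last;
}
```

Unlike a bytewise strrchr(), this never reports a match inside the trailing bytes of a longer character.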

MNLS - a library of I18N functions for wide character support. Provided by USL.

Multi-National Language Supplement Binary - same as MNLS.

scharfe-S - German character representing ss, which looks like a Greek beta.

MB_LEN_MAX - let x be a character from the union of all codesets supported on the system. Then, MB_LEN_MAX = max(length of x in bytes). MB_LEN_MAX is #defined and cannot change at runtime. ANSI requires that the minimum value of MB_LEN_MAX is one.

MB_CUR_MAX - given a codeset, let x be a character from the codeset. Then, MB_CUR_MAX = max(length of x in bytes). MB_CUR_MAX is a parameterless function implemented as a macro, evaluated at runtime.
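The difference matters when sizing buffers: MB_LEN_MAX is a compile-time constant and can dimension an array, while MB_CUR_MAX cannot. A small sketch, with the hypothetical name encode_one:

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: store the multibyte encoding of wc in buf and
 * return its length in bytes, or -1 if wc has no encoding in the
 * current locale.  A buffer of MB_LEN_MAX bytes is large enough for
 * any locale, since MB_CUR_MAX never exceeds MB_LEN_MAX. */
int encode_one(wchar_t wc, char buf[MB_LEN_MAX])
{
    return wctomb(buf, wc);
}
```

A caller would declare char buf[MB_LEN_MAX]; and pass it in, whatever locale is in effect at run time.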

mblen() - returns the number of bytes in a multibyte character.

mbtowc() - converts a multibyte character to a wide character.

mbstowcs() - converts a multibyte string to a wide character string.

wctomb() - converts a wide character to a multibyte character.

wcstombs() - converts a wide character string to a multibyte string.

When using mbtowc(), there is no need to use mblen() since mbtowc() returns the number of bytes in the source character.
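For example, a string can be walked one character at a time with mbtowc() alone, using its return value as the step. The function name next_char is hypothetical; the sketch assumes a stateless codeset or a caller that has already reset the shift state.

```c
#include <stdlib.h>

/* Hypothetical helper: decode the character at *sp into *wc and advance
 * *sp past it.  Returns the number of bytes consumed, 0 at the end of
 * the string, or -1 on an invalid sequence (in which case *sp is left
 * unchanged). */
int next_char(const char **sp, wchar_t *wc)
{
    int n = mbtowc(wc, *sp, MB_CUR_MAX);

    if (n > 0)
        *sp += n;                       /* the return value is the length */
    return n;
}
```

A typical loop calls next_char() until it returns 0 or -1, with no separate mblen() call.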

printable character - one of the characters included in the print character classification of the LC_CTYPE category in the current locale.

If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined.

IBM Code Page 210 - Greek (obsolete - replaced by Code Page 869).

IBM Code Page 220 - Spanish (international - obsolete?).

IBM Code Page 437 - US (default for most video adapters).

IBM Code Page 850 - Multinational (Latin 1).

IBM Code Page 852 - Slavic (Latin 2).

IBM Code Page 860 - Portugal.

IBM Code Page 863 - Canada (French).

IBM Code Page 865 - Norway.

IBM Code Page 866 - Cyrillic (?).

IBM Code Page 869 - Greek (?).

Gn - one of G0, G1, ..., the nth half code page in a codeset. (?)