EUC Concept Dictionary

EUC

EUC - same as Extended Unix Code.

Extended Unix Code - a multibyte encoding standard developed by AT&T and supported on all System V implementations, used to represent large Asian character sets.

EUC is an ISO2022 8-bit compliant portmanteau encoding.

EUC defines a variable length multibyte encoding intended primarily for interchange, and a fixed length encoding primarily intended for processing.

If codeset 0 is ASCII, then the EUC codeset is ASCII transparent.

EUC encoding scheme - the rules for describing a legal EUC codeset. These rules are the following:

1) Each character of an EUC multibyte string is chosen from among four distinct multibyte codesets (0, 1, 2, and 3).

2) Codeset 0 must be a 7-bit codeset.

3) No multibyte character of codeset 1 will use either SS2 or SS3 as its first byte.

4) Characters from codeset 2 will be preceded by the byte SS2.

5) Characters from codeset 3 will be preceded by the byte SS3.

6) For codesets 1, 2, and 3, every byte of every character must have the eighth bit set.
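The rules above are enough to split an EUC byte string into characters once the number of bytes per character in each supplementary codeset is known. The following is a minimal sketch, not a reference implementation. The function name split_euc and the widths table are hypothetical, and codeset 0 is assumed to use one byte per character.

```python
SS2 = 0x8E  # shift byte that precedes a codeset 2 character
SS3 = 0x8F  # shift byte that precedes a codeset 3 character

def split_euc(data, widths):
    """Split an EUC byte string into (codeset, character_bytes) pairs.

    widths maps a supplementary codeset number (1, 2, or 3) to its
    bytes per character, not counting the SS2/SS3 prefix. Codeset 0
    is assumed to be one byte per character.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # eighth bit clear: codeset 0
            codeset, start, n = 0, i, 1
        elif b == SS2:          # rule 4: codeset 2 follows SS2
            codeset, start, n = 2, i + 1, widths.get(2)
        elif b == SS3:          # rule 5: codeset 3 follows SS3
            codeset, start, n = 3, i + 1, widths.get(3)
        else:                   # any other high byte starts codeset 1
            codeset, start, n = 1, i, widths.get(1)
        if n is None:
            raise ValueError(f"codeset {codeset} is undefined (offset {i})")
        char = data[start:start + n]
        if len(char) != n:
            raise ValueError(f"truncated character at offset {i}")
        if codeset != 0 and any(c < 0x80 for c in char):
            # rule 6: every byte must have the eighth bit set
            raise ValueError(f"eighth bit not set at offset {i}")
        out.append((codeset, char))
        i = start + n
    return out

# Assumed widths: codeset 1 = 2 bytes, codeset 2 = 1 byte, codeset 3 = 2 bytes.
print(split_euc(b"A\xc6\xfc\x8e\xb1", {1: 2, 2: 1, 3: 2}))
```

The shift byte is dropped from the returned character bytes, so each pair carries only the codeset number and the bytes that belong to that codeset.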

There is no requirement that codeset 0 be ASCII.

primary codeset - codeset 0 in an EUC encoding scheme.

supplementary codeset - codeset 1 or codeset 2 or codeset 3 in an EUC encoding scheme.

SS2 - shift byte which represents the hexadecimal value 0x8e.

SS3 - shift byte which represents the hexadecimal value 0x8f.

Existing EUC codesets typically have no more than three bytes per character.

The EUC encoding scheme imposes no limit on the number of bytes per character of the various supported codesets. However, the typedef of wchar_t does impose such a limit.

EUC may be used to encode an 8-bit codeset which has 7-bit ASCII as a subset. In this case, only two codesets are used in the EUC encoding scheme: codeset 0 is ASCII, and codeset 1 is simply the upper half of the 8-bit codeset.
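The 8-bit case can be sketched as follows. ISO 8859-1 (Python's "latin-1" codec) is used purely as an assumed example of such a codeset, and classify_8bit is a hypothetical helper. Note that the byte values 0x8e and 0x8f are reserved for SS2 and SS3 by the encoding scheme, so the sketch rejects them.

```python
SS2, SS3 = 0x8E, 0x8F

def classify_8bit(data):
    """Return (codeset, byte) pairs for a string in an 8-bit codeset."""
    pairs = []
    for b in data:
        if b in (SS2, SS3):
            # Rule 3 of the encoding scheme: a codeset 1 character
            # may not begin with a shift byte.
            raise ValueError(f"byte 0x{b:02x} collides with SS2/SS3")
        # Eighth bit clear: codeset 0. Eighth bit set: codeset 1.
        pairs.append((0 if b < 0x80 else 1, b))
    return pairs

sample = "caf\u00e9".encode("latin-1")  # assumed example codeset
print(classify_8bit(sample))  # [(0, 99), (0, 97), (0, 102), (1, 233)]
```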

Wide character encodings of EUC codesets are implementation defined.

For USL, the wide character encoding of EUC is described in "UNIX System V Programmer's Guide: Internationalization". A simple algorithm exists for converting between the wide character encoding and the multibyte encoding of an EUC codeset.
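The USL algorithm itself is not reproduced here. Purely as an illustration of what such a conversion can look like, the sketch below uses a hypothetical 32-bit layout: the codeset number is stored in bits 28-29 and the low seven bits of each byte are packed below it. The layout, the names mb_to_wide and wide_to_mb, and the widths table are all assumptions, not the USL definition. With this layout a character may have at most four bytes (4 x 7 = 28 bits), which shows how the size of wchar_t limits the bytes per character.

```python
SS2, SS3 = 0x8E, 0x8F
TAG_SHIFT = 28  # hypothetical: codeset number lives in bits 28-29

def mb_to_wide(mb):
    """Pack one multibyte EUC character (with any SS2/SS3 prefix) into an int."""
    first = mb[0]
    if first < 0x80:                 # codeset 0: the value is the byte itself
        return first
    if first == SS2:
        codeset, body = 2, mb[1:]
    elif first == SS3:
        codeset, body = 3, mb[1:]
    else:
        codeset, body = 1, mb
    if not 1 <= len(body) <= 4:
        raise ValueError("character does not fit the assumed 28-bit payload")
    value = 0
    for b in body:
        value = (value << 7) | (b & 0x7F)   # drop the eighth bit
    return (codeset << TAG_SHIFT) | value

def wide_to_mb(wc, widths):
    """Unpack an int produced by mb_to_wide back into multibyte form."""
    codeset = wc >> TAG_SHIFT
    if codeset == 0:
        return bytes([wc])
    n = widths[codeset]              # bytes per character, prefix excluded
    body = bytes(((wc >> (7 * k)) & 0x7F) | 0x80 for k in reversed(range(n)))
    prefix = {1: b"", 2: bytes([SS2]), 3: bytes([SS3])}[codeset]
    return prefix + body

widths = {1: 2, 2: 1, 3: 2}          # assumed widths for the example
wc = mb_to_wide(b"\xc6\xfc")
print(hex(wc), wide_to_mb(wc, widths))
```

The widths table is needed on the way back because leading bytes whose low seven bits are zero cannot be distinguished from padding in the packed value.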

Technically speaking, EUC is not quite compliant with ISO2022. In particular, it violates section 8.2.2 "Use of single-shift functions".

The five most common EUC encodings are EUC-C, EUC-H, EUC-J, EUC-JX, and EUC-K.

EUC-C - Chinese (PRC) EUC in which codeset 0 is ASCII, codeset 1 is GB 2312 1980, codeset 2 is undefined, codeset 3 is undefined.

EUC-H - Han (Taiwanese) EUC in which codeset 0 is ASCII, codeset 1 is CNS 11643 1986 (plane 1), codeset 2 is CNS 11643 1986 (planes 2-255), codeset 3 is undefined.

EUC-J - Japanese EUC in which codeset 0 is ASCII, codeset 1 is JIS X 0208 1983, codeset 2 is JIS X 0201 1976, codeset 3 is undefined.

EUC-JX - Japanese Extended EUC in which codeset 0 is ASCII, codeset 1 is JIS X 0208 1983, codeset 2 is JIS X 0201 1976, codeset 3 is JIS X 0212 1990.
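The four codesets of the Japanese encodings can be observed with an existing codec. The sketch below assumes that Python's "euc_jp" codec corresponds to the encoding described in this entry, which is an assumption about that codec rather than something stated here. The helper codeset_of is hypothetical and looks only at the first byte of an encoded character.

```python
SS2, SS3 = 0x8E, 0x8F

def codeset_of(encoded_char):
    """Return the EUC codeset number implied by the first byte."""
    first = encoded_char[0]
    if first < 0x80:
        return 0
    if first == SS2:
        return 2
    if first == SS3:
        return 3
    return 1

samples = [
    "A",        # ASCII letter
    "\u65e5",   # kanji expected in JIS X 0208
    "\uff71",   # halfwidth katakana expected in JIS X 0201
    "\u4e02",   # kanji expected in JIS X 0212
]
for ch in samples:
    raw = ch.encode("euc_jp")
    print(f"U+{ord(ch):04X}", raw.hex(), "codeset", codeset_of(raw))
```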

EUC-K - Korean EUC in which codeset 0 is ASCII, codeset 1 is KS C 5601 1987, codeset 2 is undefined, codeset 3 is undefined.