Unicode

The Unicode standard - a standard charset meant to embody a sufficient number of unique codes for the world's scripts and technical symbols in common use.

Even ignoring the East Asian scripts, there are many more commonly used characters than can be represented in 8 bits. The next logical step is 16 bits.

The Unicode Consortium - a non-profit organization whose charter is to maintain and promote the Unicode standard, incorporated as Unicode, Inc., in January 1991.

Member companies include Go, IBM, Metaphor, Claris, Microsoft, NeXT, Sun Microsystems, The Research Libraries Group, Apple, and Xerox, among others.

unicode value - a unique code point of the form U+nnnn assigned to a character, where nnnn is a four-digit number in hexadecimal notation.
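
The U+nnnn notation can be illustrated with a minimal Python sketch. The helper names unicode_value and from_unicode_value are hypothetical, chosen only for this example.

```python
def unicode_value(ch):
    """Return the U+nnnn notation for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Pad to at least four hexadecimal digits, in upper case.
    return "U+{:04X}".format(ord(ch))


def from_unicode_value(notation):
    """Return the character named by a string such as 'U+0041'."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected the form U+nnnn")
    return chr(int(notation[2:], 16))


print(unicode_value("A"))  # U+0041
```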

The Unicode standard does not attempt to specify the appearance of the characters, i.e., it is not a font standard, but is rather an encoding standard.

Somewhat more than half of the 65,536 code elements are now assigned to characters from nearly 25 different scripts. A small number of scripts are not yet represented by Unicode, but have draft encodings which are now being reviewed by concerned parties: Burmese, Ethiopic, Khmer, Sinhalese, and Tibetan. A number of other scripts used by relatively small communities or by historical writing systems, such as the vertical Mongolian script, Cree, Cherokee, Egyptian, etc., are being studied for further inclusion in a future version of Unicode.

The Unicode consortium makes programs available on the ftp.Unicode.ORG machine via anonymous ftp in the directory /PROGRAMS (which can also be written as URL:ftp://ftp.unicode.org/PROGRAMS/).

They also have a web catalog of Unicode products at URL:http://www.unicode.org/unicode/products.html.

dashed circle - a character that is shown in the standard with a dashed circle must be rendered in relation to the previous characters in the data stream.

dashed box - a character that is shown as text surrounded by a dashed box has no visible manifestation on its own.

Character names in the standard use only English capital letters and the dash character. Character names (for non-Han characters) are unique and determined by the appropriate standards organization or an authoritative source.

UTC - same as Unicode technical committee.

Unicode technical committee - a working group of people responsible for determining the content of the Unicode standard, its interpretation, and its quality. The voting members of the UTC are representatives of The Unicode Consortium member companies.

An initial attempt to approve ISO 10646 failed because of a desire by ISO members to merge ISO 10646 and the Unicode standard. This merging effort is expected to be complete by mid-1992. In this merger, the Unicode standard would be a two-byte subset of the ISO standard.

Unicode is intended to imply "unique", "universal", and "uniform".

Text compression is regarded as a distinct problem from character encoding.

conformance - An application may be considered to conform to the Unicode standard if it makes use of independent fixed-width 16-bit characters and uses Unicode code points to represent Unicode-defined characters. Code conversion from other standards to the Unicode standard will be considered conformant if the matching table produces accurate conversion in both directions.

Braille can be considered a font variant.

ISO 8879 - SGML.

ANSI Z39.47-1985 - bibliographic standard used in libraries containing Roman characters.

ANSI Z39.64-1990 - bibliographic standard used in libraries containing East Asian characters.

ISCII 1988 - Indian Standard Code for Information Interchange.

GB 2312-1980 - China national standard for characters.

Less common and archaic scripts are under consideration for inclusion in future versions of the Unicode standard.

syllabary -

text element - a minimal unit of text relative to some specific language and some specific process performed on the text (often a single letter).

The classification of text elements is locale-dependent and application-dependent. (For example in Spanish, "ll" is a text element for the process of sorting but not for rendering.)
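
The Spanish example can be sketched in Python as follows. This is an illustration under stated assumptions, not a real collation algorithm: it assumes the traditional ordering in which "ll" is filed as one letter after "l", it handles only "ll", and the names ORDER, sorting_elements, sort_key, and rendering_elements are hypothetical.

```python
# Assumed traditional ordering: "ll" is one letter filed after "l".
# Other digraphs and accented letters are deliberately left out.
ORDER = list("abcdefghijkl") + ["ll"] + list("mnopqrstuvwxyz")
RANK = {element: index for index, element in enumerate(ORDER)}


def sorting_elements(word):
    """Split a word into text elements for sorting, keeping 'll' together."""
    word = word.lower()
    elements = []
    i = 0
    while i < len(word):
        if word.startswith("ll", i):
            elements.append("ll")
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def sort_key(word):
    """Rank each text element; unknown characters sort after the alphabet."""
    return [RANK.get(e, len(ORDER) + ord(e[0])) for e in sorting_elements(word)]


def rendering_elements(word):
    """For rendering, each character is its own text element."""
    return list(word)


print(sorted(["llama", "luz", "lobo"], key=sort_key))
```

The same string thus yields four text elements for sorting and five for rendering, which is the sense in which the classification depends on the process.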

basic text processes - low-level text processes out of which higher-level text processing is built, including rendering, line breaking, computing direction, modifying appearance, word and sentence recognition, spell-checking, comparing and sorting strings, keyboard input, insertion, and deletion.

The Unicode standard is independent of the design of basic text processing algorithms, with the exception of directionality.

plain text - a pure sequence of character codes. Plain text must contain enough information to permit the text to be rendered legibly, and nothing more.

legible - able to be read by literate members of some linguistic community. It does not mean that the text should look good, or even look nice.

fancy text - any text representation including plain text plus added information such as font size and color.

rich text - same as fancy text.

non-spacing mark - a mark (accent, grave, circle, etc.) that is used to compose a new character.

The 64 control code positions of ISO 646 and 8859 are retained for compatibility.

character unification - the process by which characters that are equivalent in form, usage, and essential properties across languages are given a single code.

zone - Codespace in the Unicode standard is divided into six zones: general scripts, symbols, CJK auxiliary, CJK ideographs, private use, and compatibility.

general script zone - alphabetic and other scripts that have relatively small character sets including (but not limited to) Latin, Cyrillic, Greek, Hebrew, Arabic, Devanagari, and Thai.

CJK - same as Chinese, Japanese, and Korean.

symbol zone - contains characters for punctuation, mathematics, chemistry, dingbats.

CJK auxiliary zone - punctuation, symbols, Kana, Bopomofo, and single and composite Hangul.

The first 256 codes of Unicode follow precisely the arrangement of ISO 646 and ISO 8859-1.

Null - U+0000 which may be used as a string terminator.
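
As a small sketch of the terminator convention, the following Python reads 16-bit codes up to the first U+0000. The function name read_null_terminated and the choice of big-endian byte order are assumptions made for the example.

```python
import struct


def read_null_terminated(data):
    """Decode big-endian 16-bit codes from data up to the first U+0000."""
    chars = []
    # Step through the buffer two bytes at a time; a trailing odd byte is ignored.
    for offset in range(0, len(data) - 1, 2):
        (code,) = struct.unpack_from(">H", data, offset)
        if code == 0x0000:
            break
        chars.append(chr(code))
    return "".join(chars)


buffer = "Unicode\u0000leftover".encode("utf-16-be")
print(read_null_terminated(buffer))  # Unicode
```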

character property table - a table provided for use in parsing, sorting, and other algorithms requiring semantic knowledge about the codepoints.

Unicode provides characters for use as line separator and paragraph separator.

backing store order - same as logical order.

logical order - the order of character codes in memory which corresponds to the order in which the characters would be entered at a keyboard (after corrections have taken place).

In Unicode, all characters are stored in logical order.

Unicode provides characters for specifying the directionality of text.

base character - the character to which a non-spacing mark is applied.

Non-spacing marks may be displayed in (apparent) isolation by using blank as the base character.
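
The relationship between base characters and non-spacing marks can be sketched in Python. The sketch leans on the standard unicodedata module, whose data reflects a later version of the standard than the one described here, and the function name split_base_and_marks is hypothetical.

```python
import unicodedata


def split_base_and_marks(text):
    """Group each base character with the non-spacing marks that follow it."""
    clusters = []
    for ch in text:
        # General category "Mn" identifies non-spacing marks.
        if unicodedata.category(ch) == "Mn" and clusters:
            base, marks = clusters[-1]
            clusters[-1] = (base, marks + ch)
        else:
            # A mark with nothing before it is kept on its own.
            clusters.append((ch, ""))
    return clusters


# "e" plus acute accent, then a grave accent shown on a blank base.
print(split_base_and_marks("e\u0301 \u0300"))
```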

diacritics - the principal class of non-spacing marks used in European alphabets.

diacritic - same as diacritical mark.

diacritical mark - a mark added to a letter or symbol to show its pronunciation, accent, etc., typically indicating that a phonetic value is different from the unmarked state.

breathing mark - a mark, typically associated with Greek, usually rendered side by side with any other diacritical marks.

interpretation - refers to processes which take 16-bit values as input and produce results based on the assumption that these values represent codes for specific text characters with known identities (semantics).

If a conformant process is able to interpret a given character code, then the interpretation must be consistent with the Unicode character semantics. If a conformant rendering process is able to interpret and draw a given character code, then the graphic depiction must be consistent with the Unicode character semantics. A conformant (public) interchange process which in any way receives and retransmits encoded text must not change the 16-bit value of any character code that it cannot interpret. It must transmit such a code unchanged from the value that was received.

composite character mapping - Unicode-supplied mapping between a precomposed character and a sequence consisting of a base character followed by non-spacing marks.

precomposed character - a character, from existing standards, which can be unambiguously decomposed into a sequence consisting of a base character followed by one or more non-spacing marks.
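
A lookup in such a mapping might be sketched as below. The sketch uses Python's unicodedata.decomposition, which draws on a later version of the Unicode character database than this text describes; the wrapper name decompose is hypothetical, and only one level of decomposition is applied.

```python
import unicodedata


def decompose(ch):
    """Return the base character plus non-spacing marks for a precomposed character.

    Characters with no mapping, or with a tagged compatibility mapping,
    are returned unchanged. Only a single level of mapping is applied.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch
    # The mapping is a string of hexadecimal code points separated by spaces.
    return "".join(chr(int(code, 16)) for code in mapping.split())


decomposed = decompose("\u00f1")  # LATIN SMALL LETTER N WITH TILDE
print([hex(ord(ch)) for ch in decomposed])
```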

compatibility mapping - refers to the compatibility zone, which contains duplicates of existing characters but with additional attributes, such as direction, size, position, and width.

case table - encodings with corresponding upper or lower case equivalents. Two case tables are provided by Unicode, one for uppercasing and one for lowercasing. Caseless code points are not included in the tables.

The vast majority of case mappings are uniform across languages.

uppercasing - the process of converting to upper case.

lowercasing - the process of converting to lower case.

caseless character - a character which has no upper or lower case equivalent.

uncased letter - a letter which is a caseless character.
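
The idea of two case tables that leave out caseless code points can be sketched in Python. The tables here are built from Python's own str.upper and str.lower rather than from the tables the standard supplies, and the names build_case_tables and is_caseless are hypothetical.

```python
def build_case_tables(characters):
    """Build an uppercasing table and a lowercasing table for the given characters.

    A character is listed only if the conversion changes it, so caseless
    characters appear in neither table.
    """
    uppercasing = {}
    lowercasing = {}
    for ch in characters:
        if ch.upper() != ch:
            uppercasing[ch] = ch.upper()
        if ch.lower() != ch:
            lowercasing[ch] = ch.lower()
    return uppercasing, lowercasing


def is_caseless(ch, uppercasing, lowercasing):
    """A character covered by neither table has no case equivalent."""
    return ch not in uppercasing and ch not in lowercasing


# Latin a/A, Greek delta in both cases, a digit, and Hebrew alef.
sample = "aA\u03b4\u03941\u05d0"
upper_table, lower_table = build_case_tables(sample)
print(sorted(upper_table.items()))
print(is_caseless("\u05d0", upper_table, lower_table))
```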

directional formatting code - one of a set of Unicode codepoints used to control directionality of text.

implicit bidirectional control - a set of conventions for the control of directionality which does not use any directional formatting codes.

Directional formatting codes are used only for display of text and are ignored by all other textual processes. Moreover, they control only the horizontal display of text.
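
One way a non-display process such as comparison might ignore these codes is sketched below in Python. The set of code points used (U+200E, U+200F, and U+202A through U+202E) is an assumption for this example and should be checked against the version of the standard in use; later versions define additional codes. The function names are hypothetical.

```python
# Assumed set: left-to-right and right-to-left marks, plus the embedding,
# override, and pop codes at U+202A..U+202E.
DIRECTIONAL_FORMATTING_CODES = frozenset(
    [0x200E, 0x200F] + list(range(0x202A, 0x202F))
)


def strip_directional_codes(text):
    """Remove directional formatting codes, leaving all other characters alone."""
    return "".join(
        ch for ch in text if ord(ch) not in DIRECTIONAL_FORMATTING_CODES
    )


def same_for_comparison(a, b):
    """Compare two strings while ignoring directional formatting codes."""
    return strip_directional_codes(a) == strip_directional_codes(b)


print(same_for_comparison("abc\u202bdef\u202c", "abcdef"))  # True
```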

Unicode describes an explicit bidirectional control algorithm for numeric and non-numeric text.

byte order mark - the codepoint U+FEFF, which functions to assure recipients that they are looking at a correctly byte-ordered file. Its byte-swapped counterpart, U+FFFE, is not a character code.
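
A recipient's check of the byte order mark could look like the following Python sketch. The function names detect_byte_order and read_codes, and the fallback to big-endian when no mark is present, are assumptions made for the example.

```python
BYTE_ORDER_MARK = 0xFEFF
SWAPPED_MARK = 0xFFFE  # what U+FEFF looks like when read in the wrong order


def detect_byte_order(data):
    """Return 'big', 'little', or None based on the first 16-bit value."""
    if len(data) < 2:
        return None
    first = int.from_bytes(data[:2], "big")
    if first == BYTE_ORDER_MARK:
        return "big"
    if first == SWAPPED_MARK:
        return "little"
    return None


def read_codes(data, default="big"):
    """Return the 16-bit codes in data, skipping a leading byte order mark."""
    order = detect_byte_order(data)
    start = 2 if order else 0
    order = order or default
    return [
        int.from_bytes(data[i:i + 2], order)
        for i in range(start, len(data) - 1, 2)
    ]


print(read_codes(b"\xff\xfeA\x00"))  # [65]
```

Because U+FFFE is not a character code, seeing it first can only mean the bytes arrived in the opposite order.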

EOF - The codepoints U+FFFF and U+FFFE may be used as an error condition or EOF indication.

An escape sequence should be converted to and from Unicode character by character so the application doesn't have to recognize the escape as such.

The use of line separator and paragraph separator is considered canonical form of Unicode plain text.
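
A short Python sketch of working with the two separators, U+2028 (line separator) and U+2029 (paragraph separator), follows. The conversion from newline conventions assumes that a blank line ends a paragraph and a single newline ends a line; that convention and the function names are assumptions for this example.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def paragraphs_and_lines(text):
    """Split plain text into paragraphs, each a list of lines."""
    return [
        paragraph.split(LINE_SEPARATOR)
        for paragraph in text.split(PARAGRAPH_SEPARATOR)
    ]


def to_canonical_separators(text):
    """Replace newline conventions with the Unicode separators.

    Assumed convention: a blank line ends a paragraph, a newline ends a line.
    """
    text = text.replace("\r\n", "\n")
    text = text.replace("\n\n", PARAGRAPH_SEPARATOR)
    return text.replace("\n", LINE_SEPARATOR)


canonical = to_canonical_separators("first line\nsecond line\n\nnew paragraph")
print(paragraphs_and_lines(canonical))
```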

ANSI C requires that the characters from the C execution set correspond to their wide character equivalents by zero extension. The Unicode codepoints U+0020 to U+007F satisfy this condition for 16-bit implementations of wchar_t.
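
Zero extension can be demonstrated with a small Python sketch that widens 8-bit values to 16 bits and compares the result with a 16-bit encoding of the same text. The function name zero_extend is hypothetical, and big-endian order is chosen only to make the bytes easy to read.

```python
import struct


def zero_extend(data):
    """Widen each 8-bit value to a big-endian 16-bit value with a zero high byte."""
    return b"".join(struct.pack(">H", byte) for byte in data)


text = "int main(void)"
widened = zero_extend(text.encode("ascii"))
# For characters in this range the widened bytes match the 16-bit encoding.
print(widened == text.encode("utf-16-be"))  # True
```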

character spelling - the decomposition of a character.

canonical spelling - character spelling requiring the minimum number of bytes for its representation.

letter - basic element of a script as understood by the end user. A higher level of abstraction than "character".

logographic - said of a script in which each character represents a word, not just a sound.

orthography - spelling as a subject of study.

script - handwriting.

neutral character - a character which can be written either right-to-left or left-to-right, depending on context.

diaeresis - same as umlaut.

umlaut - two horizontal dots over a letter.

radical unification - unification which might fold together upper and lower case forms of a character into one, etc.

conservative unification - unification which maintains boundaries between scripts and preserves many distinctions made in other computer character encodings.

Unicode is thought to embody conservative unification.

vowel sign - in many scripts, a mark used to indicate a vowel or vowel quality.

bopomofo - the set of letters used to annotate or teach the phonetics of Chinese, primarily Mandarin.

The semantic value of Zapf Dingbats is their shape.

zapf dingbats - a well established set of symbols comprising the Zapf Dingbat font, currently available on most laser printers.

Georgian - the language of Georgia, a republic of the former USSR.

Devanagari - script used for writing classical Sanskrit and its modern derivative, Hindi.

IPA - same as International Phonetic Alphabet.

International Phonetic Alphabet - a standard system for indicating specific sounds, first introduced in 1886.

extended Latin - a set of characters used to extend Latin scripts to represent non-European languages.

European Latin - a set of characters used to extend Latin scripts to represent most European languages.

Arabic and Hebrew are the only scripts currently encoded in the Unicode standard that are written right to left.

The glyph chosen to represent opening parenthesis will depend upon the direction of the rendered text.
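
Both points, the direction of a script's letters and the direction-dependent glyph for a parenthesis, can be inspected with Python's unicodedata module. Its data comes from a later version of the standard, which encodes more right-to-left scripts than the two named here, and the wrapper names below are hypothetical.

```python
import unicodedata


def direction_class(ch):
    """Return the bidirectional class of a character, e.g. 'L', 'R', or 'AL'."""
    return unicodedata.bidirectional(ch)


def glyph_depends_on_direction(ch):
    """True if the character is flagged as mirrored in right-to-left text."""
    return unicodedata.mirrored(ch) == 1


for ch in ["a", "\u05d0", "\u0627", "("]:
    print(hex(ord(ch)), direction_class(ch), glyph_depends_on_direction(ch))
```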

non-joiner - a zero-width formatting code. If placed between two characters, the non-joiner prevents them from attaching to each other when rendered.

joiner -

The 21,000 characters in the Unicode Han set represent at least 121,000 code points in the source character sets from which the repertoire was drawn.

CCCII - Chinese Character Code for Information Interchange developed in Taiwan before 1980.

The Unicode Han repertoire has been under development since 1986.

The Unicode Standard Vol 2 describes the history and algorithm for Han unification.

Unicode employs a three-dimensional conceptual model for Han unification in which each character occupies a position along the X-axis (semantic value), the Y-axis (abstract shape), and Z-axis (typeface).

Only characters that have the same abstract shape are potential candidates for unification.

Some Han character forms which would ordinarily be considered equivalent have developed different meanings or usages among the different writing systems which employ the Han script. For example, the Unicode character U+6E6F (Chinese tang1, Japanese yu/tou, Korean t'ang) originally meant 'hot water' in Chinese, but has now come to mean 'soup'; however, the original meaning of 'hot water' continued in Japanese and Korean. These differences in meaning (the semantic or X-axis in CJK-JRG terminology) do not constitute significant differences for encoding purposes, and are therefore not given different character codes.

In general, characters which are unrelated in historical derivation are not unified.

Perhaps the highest priority given to principles in Unicode was the preservation of round-trip translation between important existing character sets and Unicode.

compatibility characters - characters required for round-trip convertibility which, otherwise, might be deemed unnecessary.
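
Why round-trip convertibility calls for compatibility characters can be seen in a small Python sketch. The two-entry legacy tables below are hypothetical and exist only for illustration: one keeps a separate compatibility code point (U+FF21, the fullwidth capital A) and the other folds both source codes onto U+0041.

```python
def is_round_trip_safe(to_unicode):
    """True if no two source codes map to the same Unicode code point."""
    seen = set()
    for code_point in to_unicode.values():
        if code_point in seen:
            return False
        seen.add(code_point)
    return True


def round_trip(source_codes, to_unicode):
    """Convert source codes to Unicode and back using the inverted table."""
    from_unicode = {cp: src for src, cp in to_unicode.items()}
    return [from_unicode[to_unicode[code]] for code in source_codes]


# Hypothetical legacy set with an ordinary and a fullwidth capital A.
with_compatibility = {0x41: 0x0041, 0xA3C1: 0xFF21}
fully_unified = {0x41: 0x0041, 0xA3C1: 0x0041}

print(is_round_trip_safe(with_compatibility))  # True
print(is_round_trip_safe(fully_unified))       # False
```

With the fully unified table, the distinction between the two source codes is lost on the way back, which is the case the compatibility characters are there to prevent.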

The Unicode Technical Committee has a group which is examining implementation issues in the larger domain of I18N and globalization of software in the context of Unicode. Interested parties can join Unicode for a very reasonable rate with the privilege of actively participating in these activities. Indeed, all Unicode Technical Committee meetings are open to the public (except for a short closed-door session).

Unicode was designed to solve one essential problem: the proliferation of incompatible, partial character sets which cover only limited subsets of the set of human languages. While Unicode does not quite yet cover the writing systems of *all* languages, it has the ability to do so, and is quite near that goal now, at least with respect to modern written languages. The Unicode Consortium is now in the process of defining Cleanicode.

cleanicode - a subset of Unicode containing characters guaranteed not to be deprecated by the Unicode Consortium. Also, characters that were duplicated to retain round-trip compatibility are finally unified in cleanicode.

atomic Unicode - same as cleanicode.

allograph - two distinct graphical forms of a character which are not distinct in their interpretation are called allographs. For example, a three-stroke grass radical and a four-stroke grass radical are allographs of one another.

grapheme - together, two or more allographs can be said to represent a single underlying abstract form which is called a grapheme.

Most textual operations operate on graphemes, e.g., searching, sorting, equivalence testing, parsing, etc.

glyphic or visual encoding - same as allographic encoding.

Variable-length encodings of graphemes are much less difficult to handle than allographic encodings.

The ability of a character set to represent the core content of a written language is predicated on it being able to distinguish between the graphemes of that written language. If it cannot distinguish among graphemes, then it is *inadequate* as a primary representation. Many character sets allow for each grapheme, for example, N WITH TILDE in Spanish, to be represented with a single coded character element; however, there is no substantive reason why multiple coded character elements cannot be used, e.g., N + NON-SPACING TILDE. In such cases, we can say that the character set employs variable length grapheme encodings.
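
Equivalence testing at the grapheme level, as opposed to the coded-element level, might be sketched as follows. The sketch relies on the normalization forms in Python's unicodedata module, which were defined in later versions of the standard than the one described here; the function name same_graphemes is hypothetical.

```python
import unicodedata


def same_graphemes(a, b):
    """Compare two strings after decomposing precomposed characters."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


single = "a\u00f1o"      # N WITH TILDE as one coded element
sequence = "an\u0303o"   # N followed by NON-SPACING TILDE

print(len(single), len(sequence))        # 3 4
print(single == sequence)                # False
print(same_graphemes(single, sequence))  # True
```

The two spellings differ in length and in code-by-code comparison, yet represent the same three graphemes.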

Textual information is meant for more than simply display. Optimizing the representation of text for display alone produces suboptimal behavior for many other operations.

Unicode is a plain text standard.

Just as one cannot do much good composition and layout with VI (or EMACS) using ASCII alone, one would do no better if these applications used Unicode.

font attribution - the process by which font information is added to plain text to form rich text containing font choice information.

language attribution - the process by which language information is added to plain text to form rich text containing language or locale information.

out-of-band information - usually non-character data embedded in a text stream.

unihan - the collection of Han characters defined by Unicode.

Alternatives to Han unification have been proposed, including a decomposition scheme based on radicals. However, any scheme that would unambiguously represent all known Han characters would require more than 1200 radicals or some method for disambiguating decompositions.

latin script - can be thought of as the union of symbols contained in all alphabets which are predominantly derived from the symbols used in the Roman alphabet.

writing system - a set of conventions and rules (an orthography) applied to a set of symbols (from one or more scripts) in order to represent some language (sound, meaning, syntax, style, etc.).

For each writing system there is one and only one alphabet, and that alphabet comprises the collection of symbols employed by the writing system.

