These notes were copied whole or in part from postings to various news groups during 1990-1993, and have not been incorporated into the main body of the concept dictionary because they either did not seem authoritative or did not fit cleanly into the outline. Some of these notes might be considered somewhat politically incorrect, and I apologize for that.

Extra Notes:

Here is an ISO 8859 summary:

# 8859-1 Latin-1 (Western and Northern Europe, including German and Italian)
# 8859-2 Latin-2 (Eastern Europe except Turkey and the Baltic countries)
# 8859-3 Latin-3 (The Mediterranean Area and South Africa; obsolete)
# 8859-4 Latin-4 (Scandinavia and the Baltic countries; obsolete)
# 8859-5 Cyrillic (in Europe, except Ukrainian)
# 8859-6 Arabic
# 8859-7 Greek
# 8859-8 Hebrew
# 8859-9 Latin-5 (Turkey, Western Europe including Scandinavia)
# 8859-10 Latin-6 (Northern Europe and the Baltic countries; forthcoming)

"Central Europe" is a very nebulous term. Most often Central Europe is considered to include Poland, the Czech Republic, Slovakia, and Hungary, but none of their national languages is supported by 8859-1.

The new languages supported by 8859-3 are Catalan, Maltese, Turkish, Afrikaans, and Esperanto. Galician, Dutch, French, German, Italian, and Spanish are also supported.

8859-4 was the first ISO attempt to provide an 8-bit character set covering the languages of the Baltic peoples as well as all languages in Scandinavia, including the minority language Sami (Lappish). It is not used in practice.

The standardization bodies of the Nordic countries have now got a better understanding of the character set problems of Sami and they have direct contact with the new standardization bodies of the Baltic countries. This has exposed a lot of shortcomings in 8859-4. The new ISO 8859-10 (Latin-6), which will be published soon, is the second attempt to cater to these languages. There is a proposal to JTC1/SC2 to withdraw 8859-4.

I have no expertise on the Cyrillic script, but I have read that 8859-5 does not fully cover Ukrainian, despite the standard's claim that it does.

8859-9 is the 8859 part preferred in Turkey. It is almost the same as 8859-1, but four letters used in Icelandic and Faroese are replaced by Turkish letters. (The standard claims to cover Faroese, but this is incorrect.)

A short description of the parts of ISO 8859 should also cover the forthcoming part 10. This has been designed to cover all languages of the Nordic countries (Scandinavia, Iceland and Finland) plus Estonian, Latvian and Lithuanian.

The needs of most users can be fulfilled by using the proper 8-bit character set instead of an internationalized wider character set such as ISO 10646.

There is disagreement over whether most Unicode/10646 implementors tend to use the fixed-width 16-bit UCS-2 (Unicode) encoding for both processing and interchange, or for processing only.

No single part of 8859 is sufficient for most companies and organizations with activity and contacts in different parts of Europe.

Not even 8859-1 fully covers the needs of all countries in the EC using the Latin script, which causes problems for the handling of personal names (all citizens of any EC country are to be treated equally everywhere in the EC). This will get much worse when countries such as Poland and Hungary become associated with the EC or become new members.

I wanted to add that Sun's messaging scheme uses the gettext() library routine instead of the X/Open catgets() library routine. Whereas catgets() uses numeric keys to find translated strings, gettext() uses text strings. This not only simplifies maintenance, but also provides greater accuracy in distributed environments.
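The difference in key style can be sketched schematically (this is not the real catgets()/gettext() C API, just an illustration of why string keys degrade more gracefully than numeric ones):

```python
# Schematic contrast between catgets-style numeric keys and
# gettext-style string keys (not the real C APIs).

# catgets-style: messages are looked up by set/message number.
catalog_by_number = {(1, 1): "Datei nicht gefunden"}

def catgets_style(set_id, msg_id, default=""):
    # A missing or misnumbered entry yields nothing useful.
    return catalog_by_number.get((set_id, msg_id), default)

# gettext-style: the untranslated string itself is the key.
catalog_by_string = {"File not found": "Datei nicht gefunden"}

def gettext_style(msgid):
    # A missing entry falls back to the original text, so the
    # program stays readable even with an incomplete catalog.
    return catalog_by_string.get(msgid, msgid)

print(gettext_style("File not found"))
print(gettext_style("Disk full"))
```

The fallback in the second lookup is the maintenance win the note describes: a stale catalog produces the untranslated message, not a wrong or empty one.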

This note is just to let people know that there is now a multilingual version of the GNU Emacs text editor, and one of the features that people have built into it is an interactive way of looking up a Han character by radical first, then strokes, or whatever. I haven't tried it yet, but the way people talked about it on the mailing list, it seems to be working. Since students of the Japanese language might not know how to pronounce a particular character, they wouldn't then be able to use the usual input methods (romaji or kana input), so this tool might be helpful, or at least a starting point for more work. To join their mailing list or get more info, send mail to: mule-request@etl.go.jp

If you don't want to bother the people at this email address, you could choose to quietly ftp the sources themselves, from:

sh.wide.ad.jp [133.4.11.11]:/JAPAN/mule or ftp.funet.fi [128.214.6.100]:/pub/gnu/emacs/mule

Therefore, I'm discarding all claims that Unicode/ISO10646 is intended to provide means for localization ONLY. It is designed for internationalization, i.e., for supporting sufficiently multilingual environments; and this function is the only justification for its introduction.

The acceptability of sorting a character with a diacritical mark near the unmarked character depends on the language. Germans and the French don't mind them being near; Swedes expect the diacritical characters at the end of the alphabet.
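The Swedish case can be shown with a hand-built sort key over a simplified Swedish alphabet (illustrative only; real applications would use locale.strxfrm or an ICU collator):

```python
# Sort with a hand-built key for a simplified Swedish alphabet,
# in which å, ä, ö collate as separate letters after z.
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {c: i for i, c in enumerate(SWEDISH_ORDER)}

def swedish_key(word):
    # Unknown characters sort after the whole alphabet.
    return [RANK.get(c, len(SWEDISH_ORDER)) for c in word.lower()]

words = ["ärlig", "zon", "bok", "åka"]
print(sorted(words, key=swedish_key))  # Swedish order: å, ä after z
print(sorted(words))                   # naive code-point order differs
```

Naive code-point sorting happens to put ä before å (their Latin-1 code points are in that order), which is wrong for Swedish in both position and relative order.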

In Unicode, Simplified Chinese forms are encoded separately from their traditional counterparts.

CJK-JRG (CJK Joint Research Group)

URO - same as Unified Repertoire and Ordering

The initial collection of Han characters in Unicode comprises the Unified Repertoire and Ordering 2.0 produced by the CJK-JRG.

Is the simplified Japanese form you refer to contained in JISX0208 or JISX0212? If it is, then it is in Unicode 1.0. If it isn't, a second level of Han characters is now being formulated by the CJK-JRG for inclusion in a future version of Unicode.

The first collection of Han characters currently included in Unicode covers the most important existing CJK character sets and the vast majority of their characters. These include all characters from GB 2312-80, GB 12345-90, GB 8565-89, CNS 11643 (planes 1 & 2), JIS X 0208-1980, JIS X 0212-1990, KS C 5601-1989, and KS C 5657-1991. In addition, some characters of the unsimplified form of GB 7589-87, the unsimplified form of GB 7590-87, CNS 11643 (plane 14), the old Chinese telegraph code, and unique characters from ANSI Z39.64-1989 were included. If two characters were distinct in any character set of the first list of character sets above, then they are distinct in Unicode -- this is called the Source Set Separation rule.
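The Source Set Separation rule can be sketched as a simple predicate. The "source sets" below are hypothetical stand-ins, not data from any real standard:

```python
# Illustrative sketch of the Source Set Separation rule: two
# candidate forms stay distinct in the unified set if any single
# source character set already distinguishes them (i.e., contains
# both). The sets here are made-up stand-ins.
source_sets = [
    {"formA", "formB"},   # one source separates the two forms
    {"formA"},
    {"formC"},
]

def must_stay_distinct(x, y, sources):
    return any(x in s and y in s for s in sources)

print(must_stay_distinct("formA", "formB", source_sets))
print(must_stay_distinct("formA", "formC", source_sets))
```

The point of the rule is round-trip safety: if a source standard tells two forms apart, mapping through Unicode and back must not merge them.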

The second collection of Han characters now being considered by CJK-JRG include those which were not encoded by any of the above mentioned character sets and those which are not currently encoded by any character set. The analysis and inclusion of these new Han characters into Unicode will be a major project over the next few years.

If you would like to obtain (unofficial) mappings between a number of these character sets and Unicode, you can find them by anonymous FTP on METIS.COM [140.186.33.40] in /pub/csets.

I see localized monolingual as the majority case, and multilingual as the minority case. Because name space information is not locale-specific, you can't have a "/home" for an English user on the system and a "/casa" for the Spanish user. This means that an actual multinational system which is not partitioned by language (chroot with hard links, or a translucent "namespace mount"?) must be able to display the character sets of all the languages in use on the system at any time, if only for consistent operation of "ls" in shared or publicly examinable directories. This is a heavy burden for minority use to carry.

I further believe it to be a long way between the acceptance of a greater-than-16-bit font and the ability of, for instance, the X system to support 2^20 character glyphs downloaded to the terminal. (20 bits has been consistently used by Ohta-san as a working approximation of the size requirements of a non-unified set containing the non-intersecting "unification" of all existing sets. Correctly, the honorific "san" should be on the surname, and I am not sure whether he has chosen a westernized ordering for his name; I may be using the equivalent of "Glenn-san"... but I digress 8-))

Runic -

Actually, isn't ASCII the *US* form of ISO 646? I thought ISO 646 was mostly ASCII, but with some positions listed as being "national characters", with each national standard possibly putting different characters there.

Yes, the original ISO 646 (1983) was based on ASCII but designated 10 character positions as "national characters" which could be replaced by different national uses. The reference version, with no national replacements, was the IRV.

IRV - the International Reference Version

However, in 1991, ISO646 was reissued, this time being identical to ASCII (ANSI X3.4-1986), which was considered the US national version of the original ISO646 IRV.

In the current ISO 646 standard there are still 10 national-use positions, and two positions that may have two different characters. There may therefore be different versions of ISO 646 adopted by different national ISO member bodies. An International Reference Version (IRV) is defined in the standard, which is now equivalent to US ASCII. Thus it is only the IRV version of ISO 646 that is equivalent to US ASCII.
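The effect of the national-use positions can be shown by reading the same 7-bit byte values under a national variant. The table below is the German variant (DIN 66003) as commonly described:

```python
# The same 7-bit byte values read under a national variant of
# ISO 646. Mapping: the German variant (DIN 66003) as commonly
# described; all other bytes keep their ASCII meaning.
GERMAN_646 = {0x40: "§", 0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
              0x7B: "ä", 0x7C: "ö", 0x7D: "ü", 0x7E: "ß"}

def decode_german_646(data: bytes) -> str:
    return "".join(GERMAN_646.get(b, chr(b)) for b in data)

# The ASCII bytes for "{a}" read as "äaü" under the German
# variant, which is why C source code looked bizarre on German
# terminals of the era.
print(decode_german_646(b"{a}"))
print(decode_german_646(b"[]"))
```

This is the practical cost of national variants: a byte stream has no fixed meaning until you know which national version the sender used.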

kanjidic - a primary Public Domain reference file for kanji. The kanjidic file (available from pub/nihongo on monu6.cc.monash.edu.au) has a pretty complete Nelson and Halpern coding for JIS levels 1 & 2. It also has the SKIP codes (n-n-n) and all the Unicode mappings. Someone is preparing a list of Heisig index numbers to add to this.

The messaging scheme Sun is publishing here originally came from POSIX.1b and uniforum proposals and has been exclusively used in Solaris 2.x and Sun Microsystem's unbundled products.

Glenn Adams maintains mappings from JIS X 0212 (1990) and other Asian standards to Unicode in the directory /pub/csmaps on METIS.COM [140.186.33.40], obtainable by anonymous FTP.

MU - Mail Unicode.

IANA -

MIME -

ISO/IEC CD 11581 Information Technology - Text and Office Systems - Graphical Symbols Used on Screens: Interactive Icons. This document is in the development phase and may have been renamed and/or renumbered last November.

rich-text -

plain text -

content fidelity - coined by Ed Smura, Xerox

appearance fidelity - coined by Ed Smura, Xerox

ISO DIS -

language tagging -

phonetic annotations of Han characters -

After all, specially designed systems from Japanese hardware vendors are sold to many Japanese city offices to handle very special characters used in Japanese people's names and names of places. Either these characters are outside JIS (JIS Level 2 now incorporates many of them), or the glyph is very different from the commonly used glyph.

One approach is to decompose any character that is perceived by some language community as composed. So, because English speakers perceive Swedish "A-ring-above" as composed, the representation would be composed. Likewise, because Turkish speakers perceive "i" as composed (dotless i and dot above), the representation of "i" would be composed. Is there any language that perceives W as "V+joiner+V"? (I am aware that it is the historical derivation.)
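In today's Unicode terms, this composed/decomposed distinction is exactly what the normalization forms expose: NFD splits Swedish "Å" into a base letter plus a combining ring above, and NFC recomposes it. A small illustration:

```python
import unicodedata

# NFD decomposes "Å" (one code point) into a base "A" followed by
# COMBINING RING ABOVE; NFC recomposes the pair back to "Å".
composed = "\u00C5"
decomposed = unicodedata.normalize("NFD", composed)

print([unicodedata.name(c) for c in decomposed])
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Whether the composed or decomposed form is "the" representation is thus a normalization policy, not a property of the character itself, which is the crux of the perception argument above.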

ISO 3461 - General principles for the creation of graphical symbols

ISO 3864 - Safety colors and safety signs

ISO 4196 - Graphical symbols -- use of arrows

ISO 7000 - Graphical symbols for use on equipment

ISO 7001 - Public information symbols

Although past practices tend to equate "character set" and "glyph set", this was not true for all character sets, and certainly is not true for Unicode.

In general, a font which can display all Unicode encoded text in a minimal fashion will have many more glyphs in it than Unicode has characters. In particular, the display of Indic scripts will require glyphs which are not directly encoded by Unicode.

AFII - Association for Font Information Interchange provides a standard way to refer to glyphs; AFII, by the way, *does not encode character sets* - at most it may include glyphs which can be used to display character data.

There is considerable disagreement over which characters should be included in the Unicode standard.

MARC - MAchine Readable Cataloging

floating diacritics -

Libraries seldom have surplus funds, but they have been able to handle separately encoded diacritics for over twenty years. Since 1967, library automation systems in the U.S., and those elsewhere that use U.S. cataloging data, have used character sets with separate codes for diacritics, and then used software to superimpose the diacritic above or below the letter being modified.
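One wrinkle: library character sets in the ANSEL/MARC tradition transmit the diacritic code before the letter it modifies, while Unicode combining marks follow their base. A simplified reordering sketch (assuming at most one diacritic per letter, with the acute accent as a stand-in code):

```python
import unicodedata

# Reorder a prefix-diacritic stream (library convention) into
# Unicode order, where combining marks follow the base letter.
# Simplified sketch: at most one diacritic per letter.
PREFIX_DIACRITICS = {"\u0301"}   # acute accent as a stand-in

def prefix_to_unicode(chars):
    out, pending = [], None
    for c in chars:
        if c in PREFIX_DIACRITICS:
            pending = c              # hold the mark until its base arrives
        else:
            out.append(c)
            if pending:
                out.append(pending)  # combining mark follows the base
                pending = None
    return unicodedata.normalize("NFC", "".join(out))

print(prefix_to_unicode(["\u0301", "e"]))   # acute + e
```

Real MARC conversion also has to handle multiple diacritics per letter and a different code repertoire; this only shows the order flip.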

There ARE now Kanji dictionaries with JIS code information for each Kanji. The dictionaries provide all other functions people have become accustomed to: index based on stroke counts, index based on radicals, index based on on-yomi, and kun-yomi.

on-yomi - the Sino-Japanese (Chinese-derived) reading of a kanji character.

kun-yomi - the native Japanese reading of a kanji character.

yomi - the Japanese word for reading.

grapheme -

Two different kinds of unifications were done in the course of developing the unified Han set:

#1. Unifications between national standards (i.e., instances where people in Taiwan and people in Japan write the "same" character with slight differences).

#2. Unifications within national standards. This basically did not happen, because of the source separation rule used by the CJK-JRG.

URO -

In Japan, the invention of kanji characters by private individuals for their names happened at least a century ago. Today, there is a government decree to restrict such arbitrary "invention". You can't register a name using characters outside the table that government offices use. This decree at the same time recognizes many previously "invented" characters so that peoples' names can be registered (or that the existing registration documents can be usable for identification purposes in land transaction/birth certificate, etc.). One purpose of JIS Level 2 kanjis (addition to original JIS) was an effort to accommodate this table, although there were other reasons as well.

There is quite a bit of confusion between font sets and character sets. Many users are concerned that unification will result in the forced use of incorrect fonts. But the unification is intended only to affect the character sets.

Using incorrect Kanji in one's writing is a very bad thing in terms of social status. If you hand in something with incorrect kanji, it will surely raise eyebrows here and there. This is just like misspellings in English.

There is not supposed to be any overlap between JIS X 0208-1990 and JIS X 0212-1990. There is one character, however, which is treated as a non-kanji in JIS X 0208-1990 and a kanji in JIS X 0212-1990. It is the symbol for "shime," which means "deadline."

IBM code pages are analogous to the ISO 8859-[1-9] character sets; they include the following:
IBM Code Page 210: Greek (obsolete - replaced by Code Page 869)
IBM Code Page 220: Spanish (international - obsolete?)
IBM Code Page 437: US (default for most video adapters)
IBM Code Page 850: Multinational (Latin 1)
IBM Code Page 852: Slavic (Latin 2)
IBM Code Page 860: Portugal
IBM Code Page 863: Canada (French)
IBM Code Page 865: Norway
IBM Code Page 866: Cyrillic (?)
IBM Code Page 869: Greek (?)

Bill Smith says that the Japanese language has a greater number of homonyms which are sometimes resolved in speech by using a finger of one hand to draw the correct kanji character on the palm of the other hand.

Localization assumes that a system is supposed to work with only one (user's) language. In practice, bilinguality is required because most programming languages and command-line interfaces heavily use English mnemonics.

Multilingual applications present special problems in the areas of spell-checking, case conversion (partly because of language-dependent rules), lexicographic sorting (not only the collation tables, but also the sorting algorithm may be language-dependent), hyphenation, locale-dependent glyphs for the same character, and others.
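Case conversion is a concrete example of those language-dependent rules: in Turkish, uppercase "I" lowers to dotless "ı" and "İ" lowers to dotted "i". A default (locale-independent) lowercase gets this wrong, so a sketch of a Turkish-aware lowercase needs its own mapping:

```python
# Turkish-aware lowercasing sketch. Python's str.lower() applies
# only the default Unicode rules, so the Turkish I/ı and İ/i
# pairs must be handled before the generic conversion.
def lower_tr(text):
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

print(lower_tr("ISPARTA"))    # Turkish rules: dotless ı
print("ISPARTA".lower())      # default rules: dotted i
```

The same input thus lowercases two different ways depending on language, which is why casing cannot live in a locale-blind string library.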

A pure character set encoding does not provide a means for determining locale-dependent glyphs.

A universal codeset such as Unicode requires a set of very large tables for the specification of collation and character type information. Each locale requires its own set. This is regarded by some as memory-inefficient, especially in the context of multilingual applications.

The idea of a dictionary with sorted entries is of European origin.

Most Cyrillic-based languages had no written form before the nations were occupied by chauvinistic Russian imperialists, who obviously did not care about the culture of ethnic minorities, and oppressed them by inventing and introducing writing so as to be able to give orders in the natives' languages.

To ease hyphenation problems it may make sense to add a zero-width "hyphenate hint" symbol (aka \% in troff). This symbol can be added manually or by a language-specific hyphenation algorithm at the input (it may as well be a part of spell-checker, in this role it is essentially "free"). The idea is to leave ALL language specifics at the point of input where the language is supposedly known. If a "generic" hyphenation algorithm is included in the standard, it may be possible to omit hyphenation hints in words which were split correctly by the standard algorithm.
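The "hyphenate hint" symbol corresponds to what Unicode later standardized as SOFT HYPHEN (U+00AD): invisible in normal rendering, it marks permissible break points inserted at input time. A minimal line breaker that honors only those hints (a sketch, not a full hyphenation engine):

```python
# Break a word at the last SOFT HYPHEN hint that fits within
# `width` visible columns, emitting a visible '-' at the break.
SHY = "\u00ad"

def break_word(word, width):
    parts = word.split(SHY)
    line, rest = "", []
    for i, p in enumerate(parts):
        if len(line) + len(p) + 1 <= width:   # +1 for the visible '-'
            line += p
        else:
            rest = parts[i:]
            break
    if not rest:
        return word.replace(SHY, ""), ""      # whole word fits
    return line + "-", "".join(rest)

print(break_word("in\u00adter\u00adna\u00adtion\u00adal", 8))
```

Note how all language knowledge lives in the placement of the hints; the breaker itself is language-independent, which is exactly the division of labor the note argues for.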

There seems to be agreement that "A" and "T" should not be unified between Cyrillic, Greek, and Latin. Unicode does not unify these characters.

There is disagreement over how much of the I18N support (especially for multilingual applications) should be in the system versus the application.

KOI-8 became a standard in the early 70s and is codified, apparently not by ISO but by CCITT, because it is based on a "phonetic" system for sending Cyrillic messages over international telegraphs, which predates computers.

ASCII defeated EBCDIC in the market before it became an international standard.

Learn a foreign language and try to use it in real-life communication. You'll find that mixed-language texts are far more common than you think. Russian, for example, does not have equivalents for "political correctness" or "mentally disadvantaged".

You don't need changes in file systems to support multilingual texts.

Renaming /etc or /etc/passwd in order to achieve "multinationalization" is unnecessary because these aren't human words but rather kinds of hieroglyphs. Somehow French, Russian, and German programmers never had trouble understanding that IF THEN ELSE means conditional execution. It may not be so obvious to English speakers who grew up on COBOL, but computerish semantic units are easier to understand if they are not mixed with natural languages.

I bet that school children in Russia have much less trouble understanding the difference between usage of conditional clauses in natural and algorithmic languages (simply *because* they're different).

I always thought that the hyphenation routine's place is in the SYSTEM library. It's a reusable piece of code, and there are many places where strings need to be hyphenated (long diagnostics in windows of variable size) even by trivial programs.

KOI-8 is nothing more than an 8-bit ASCII extension with Cyrillic letters in codes 0300 to 0377 (octal). It has completely separate alphabets for Russian and English even though there are a number of similar letters.
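The layout of those Cyrillic codes is the clever part of the KOI-8 design: each letter is placed so that stripping the eighth bit yields a phonetically similar Latin letter (with case swapped), so text mangled by a 7-bit mail relay stays readable. A demonstration using Python's koi8_r codec:

```python
# Strip the eighth bit of KOI8-R bytes: each Cyrillic letter
# falls onto a phonetically similar Latin letter, case-swapped.
def strip_high_bit(text):
    return bytes(b & 0x7F for b in text.encode("koi8_r")).decode("ascii")

print(strip_high_bit("мир"))   # lowercase Cyrillic -> uppercase Latin
print(strip_high_bit("МИР"))   # uppercase Cyrillic -> lowercase Latin
```

This phonetic ordering is also why KOI-8 collation does not match the Russian alphabet: the codes follow Latin keyboard positions, not Cyrillic alphabetical order.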

News groups containing I18N information include comp.protocols.iso, comp.std.misc, and comp.std.internat.

For FAQs, look with anonymous ftp at pit-manager.mit.edu [18.172.1.27] in directory /pub/usenet/news.answers.

format - the structure of a textual string including the ordering of its components, the punctuation used to separate the components, the representation of the components, etc. Includes date/time formats, monetary or numeric formats, etc.
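Such formats are best treated as per-locale data rather than code. A minimal sketch with a hypothetical date-pattern table (real systems would take these patterns from the locale database rather than hard-coding them):

```python
# Render the same date with per-locale patterns. The table is a
# hypothetical minimal stand-in for real locale data.
from datetime import date

DATE_FORMATS = {
    "en_US": "%m/%d/%Y",
    "de_DE": "%d.%m.%Y",
    "ja_JP": "%Y/%m/%d",
}

def format_date(d, loc):
    return d.strftime(DATE_FORMATS[loc])

d = date(1993, 7, 4)
for loc in DATE_FORMATS:
    print(loc, format_date(d, loc))
```

Ordering of components, separators, and digit grouping all vary the same way, which is why the definition above lumps date/time, monetary, and numeric formats together.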

If a widget allows for text entry, then it may be modified to use XIM. The Motif 1.2 Text Widget currently supports the Xlib XIM API.

SSDP probably adds maintenance problems that are not warranted by any speedup in the code. SSSP should be used unless there are compelling reasons to the contrary and SSSP should be the default approach.

SUN projects larger growth in markets in western Europe and Japan than in the US through 1995. In Japan, the growth rate for software revenue is projected at 19%, while for the US, it is 13%. Hardware sales in Japan for UNIX systems have been growing at 30-40% annually.

levels of internationalization - defined by SUN, includes level 1,2,3,4.

level 1 - a level of internationalization support in which text and codeset support is 8-bit clean.

level 2 - a level of internationalization support with date/time, monetary and numeric, and collation support.

level 3 - a level of internationalization support in which messages and other user-visible text are externalized.

level 4 - a level of internationalization support with multibyte support.

branding - process by which a brand (from a vendor) is placed on a product.

Association of Language Export Centres, P.O. Box 1574, London NW1 4NJ, United Kingdom, Tel: 44 71 224 3748.

Association of Translation Companies, 7 Buckingham Gate, London SW1E 6JS, United Kingdom, Tel: 44 630 5454.

Institution of Translation and Interpreting, 318a Finchley Road, London NW3 5HT, United Kingdom, Tel: 44 71 794 9931.

ALPNET U.S., 4444 South 700 East, Suite 204, Salt Lake City, UT 84107-3075, Tel: 801 265-3300, Fax: 801 265-3310. A translation company.

Institute for Advanced Professional Studies, 955 Massachusetts Avenue, Cambridge, MA 02139-3107, Tel: 617 497-2075, Fax: 617 497-4829. A consulting firm for internationalization.

SDL Limited, 48 Sheephouse Road, Maidenhead, Berkshire SL6 8HH, United Kingdom, Tel: 44 628 38198, Fax: 44 628 770796. A localization consultancy.

International Documentation, 10474 Santa Monica Blvd., Suite 404, Los Angeles, CA 90025, Tel: 213 446-4666, Fax: 213 446-4661.

World-Wise International Solutions, 1 Tara Blvd., Suite 406, Nashua, NH 03062 Tel: 800 445-WISE, Fax: 603 888-9354.

Uniforum Association, 2901 Tasman Drive, Suite 201, Santa Clara, CA 95054, Tel: 408 986-8840, Fax: 408 986-1645.

USENIX Association, 2560 Ninth Street, Suite 215, Berkeley, CA 94710, Tel: 510 528-8649.

EurOpen, Owles Hall, Buntingford, Hertfordshire SG9 9PL, United Kingdom, Tel: 44 763 73039, Fax: 44 763 73255.

International Organization for Standardization (ISO), General Secretariat, Case Postale 56, 3 Rue de Varembe, CH-1211 Geneva 20, Switzerland, Tel: 412 234 1240, Fax: 412 233 3430.

European Committee for Standardization (CEN), Central Secretariat, Rue de Stassart, 36, B-1050 Brussels, Belgium, Tel: 32 2 5196811 Fax: 32 2 5196819

The X/Open Company Limited, Apex Plaza, Forbury Road, Reading, Berkshire RG1 1AX, United Kingdom, Tel: 44 734 508311, Fax: 44 734 500110.

The Unicode Consortium, 1965 Charleston Road, Mountain View, CA 94043, Tel: 415 961-4189, Fax: 415 966-1637.

American Electronics Association (AEA), Publications Department, 5201 Great America Parkway, Santa Clara, CA 95051, Tel: 800 873-1177, 408 987-4200. Publishes "Soft Landing in Japan: A Market Entry Handbook for U.S. Software Companies v2.0J" for $95.

