Japanese Locale Concept Dictionary

The Japanese Locale

A very thorough discussion of Japanese fonts and codesets is located in the files japan1.inf, japan2.inf, and japan3.inf in the directory /pub/JIS available via anonymous FTP at ucdavis.edu (128.120.2.1).

An expert in Japanese software is lunde@adobe.com.

japanization - localization for the Japanese locale.

Many Japanese take pride in the belief that written Japanese is the most complicated language on Earth.

The pronunciation of many technical and computer terms in Japanese is identical to the English with a slight Japanese accent including those sounds in English that the Japanese have trouble making. These words are written phonetically in katakana. (Bill Smith)

JST - Japanese Standard Time, 9 hours ahead of GMT.

DST is not used in Japan.
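Because JST is a fixed offset with no daylight saving adjustment, converting from GMT is a constant nine-hour shift. A minimal Python sketch, assuming the standard datetime module; the names JST and to_jst are illustrative only:

```python
from datetime import datetime, timedelta, timezone

# JST is a fixed offset: 9 hours ahead of GMT, with no DST transitions.
JST = timezone(timedelta(hours=9), name="JST")


def to_jst(utc_time: datetime) -> datetime:
    """Convert a timezone-aware GMT/UTC datetime to JST."""
    return utc_time.astimezone(JST)


if __name__ == "__main__":
    noon_gmt = datetime(1992, 7, 1, 12, 0, tzinfo=timezone.utc)
    print(to_jst(noon_gmt).isoformat())
```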

The Japanese usually follow the Gregorian calendar.

nengo system - the system in which years are numbered according to the reigns of the emperors. The format is nengo n, where nengo is the name of the era and n is the number of the year within that era.

nengo - name of the current era.
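The "nengo n" format can be sketched as a simple lookup. The era start years below are not taken from this dictionary; they are the commonly cited Gregorian years in which each modern era began. The sketch works at year granularity only, so a year in which the era changed is reported under the newer era.

```python
# Era name and the Gregorian year in which it began (newest first).
# Assumption: these values come from general knowledge, not from this text.
ERAS = [
    ("Reiwa", 2019),
    ("Heisei", 1989),
    ("Showa", 1926),
    ("Taisho", 1912),
    ("Meiji", 1868),
]


def to_nengo(year: int) -> str:
    """Return 'nengo n' for a Gregorian year (year-level granularity only)."""
    for name, start in ERAS:
        if year >= start:
            # The first year of an era is year 1, not year 0.
            return f"{name} {year - start + 1}"
    raise ValueError("year precedes the Meiji era")


if __name__ == "__main__":
    print(to_nengo(1992))
```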

romanization - the process of converting written Japanese to Romaji.

Hepburn - The most common romanization system.

kunrei-shiki - The least popular romanization system.
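The two systems agree on most kana and differ on a handful. The following sketch illustrates the difference with a deliberately tiny table; the table contents and the function name romanize are illustrative assumptions, not a complete romanizer.

```python
# Kana whose romanization differs: kana -> (Hepburn, kunrei-shiki).
DIFFERS = {
    "し": ("shi", "si"),
    "ち": ("chi", "ti"),
    "つ": ("tsu", "tu"),
    "ふ": ("fu", "hu"),
    "じ": ("ji", "zi"),
}

# A few kana that both systems romanize identically.
SAME = {"す": "su", "か": "ka", "な": "na", "ま": "ma", "え": "e"}


def romanize(kana: str, system: str = "hepburn") -> str:
    """Romanize a hiragana string using the tiny tables above."""
    index = {"hepburn": 0, "kunrei": 1}[system]
    out = []
    for ch in kana:
        if ch in DIFFERS:
            out.append(DIFFERS[ch][index])
        elif ch in SAME:
            out.append(SAME[ch])
        else:
            raise KeyError(f"kana not in this tiny table: {ch!r}")
    return "".join(out)


if __name__ == "__main__":
    print(romanize("すし"), romanize("すし", "kunrei"))
```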

furigana - small hiragana or katakana characters that accompany a kanji character to indicate which of several pronunciations to use.

Both kanji characters that represent numbers and Arabic numbers are used in Japanese writing. Arabic numbers are most common in horizontal writing while the corresponding kanji are common in vertical writing.

Traditional Japanese is written in the Chinese vertical, right-to-left style. Modern Japanese is written in the Western horizontal, left-to-right style. Written Japanese does not use a space character although commas, dashes, parentheses and other punctuation are used.

kuri fugo - same as kutoten.

kutoten - Japanese punctuation characters (which look like English punctuation, but usually have a different meaning).

JIS - same as Japanese Industrial Standard.

JIS X0201 - a single byte codeset consisting of 7bit characters corresponding to ISO 646, 7bit characters for katakana, and 8bit characters for both Roman and katakana characters.

JIS X0202 - a single byte codeset that is an extension of JIS X0201.

JIS X0208 - a multibyte codeset consisting of 6355 Kanji, one and two byte numeric characters, hiragana, katakana, Greek characters, Russian characters, line elements, and ASCII characters, including some undefined codepoints. The most popular codeset in Japan.

JIS Kanji - same as JIS X0208.

JIS X0212 - a supplement to JIS X0208 that defines additional graphic characters including 21 special characters, 245 alphabetic characters, and 5801 kanji characters. Not yet widely accepted.

UJIS - same as EUC-JIS.

JIS - codeset which supports JIS X0201 and JIS X0208 and contains shift-in and shift-out characters.

shift JIS - a multibyte encoding of JIS X0208 which is non-ASCII transparent. Shift-in and Shift-out characters are not used.

level 1 Kanji - the first 2965 kanji characters defined in JIS X0208.

level 2 Kanji - the final 3390 (less common) kanji characters defined in JIS X0208.

katakana - used for words of foreign origin in Japanese.

hiragana - used for native words in Japanese.

hankaku - single display width.

zenkaku - double display width.

hankaku katakana - katakana that are single width on display.

zenkaku katakana - katakana that are double width on display.

All ASCII are, by definition, hankaku.

kana - either katakana or hiragana.

pc932 - same as shift-JIS.

SJIS - same as shift-JIS.

Shift-JIS - a multibyte codeset defined by JIS in which the first byte of a multibyte character is a shift byte if its 8th bit is set.

Japanese can be a dense language. No spaces are used. Single katakana and hiragana characters represent an entire syllable. A single kanji character represents an entire word or thought. On the other hand, if the text includes foreign words, their transliterations may require more characters. All in all, a Japanese translation will often require fewer bytes to store than the equivalent English.

Since spaces are not used in written Japanese, word processors may have difficulty locating word boundaries and wrapping lines. Japanese has its own rules for these processes.
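As a rough illustration of why wrapping needs language-specific rules, the sketch below breaks a line after a fixed number of characters but never lets a closing punctuation mark begin a line. The set of prohibited characters and the choice to let them hang on the previous line are assumptions made for this example; real Japanese line-breaking rules are considerably broader.

```python
from typing import List

# Characters that should not begin a line (an illustrative subset only).
NO_LINE_START = set("、。，．）」』！？")


def wrap_japanese(text: str, width: int) -> List[str]:
    """Break text every `width` characters, keeping closing punctuation
    attached to the end of the preceding line."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    i = 0
    while i < len(text):
        end = min(i + width, len(text))
        # Let trailing punctuation hang on the current line instead of
        # starting the next one.
        while end < len(text) and text[end] in NO_LINE_START:
            end += 1
        lines.append(text[i:end])
        i = end
    return lines


if __name__ == "__main__":
    for line in wrap_japanese("日本語の文章です。次の行", 8):
        print(line)
```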

The multibyte encoding of shift-JIS obeys the following rules:

1) One byte characters are the following:

   00-7F       ASCII
   80          Reserved
   A1-DF       Single width katakana
   A0, FD-FE   Unused
   FF          Reserved

2) Two byte characters are the following:

   81/SBR      special symbols
   82/SBR      double width digits, latin letters, and hiragana
   83/SBR      katakana and Greek characters
   84/SBR      Cyrillic and line-drawing characters
   85-87/SBR   reserved for expansion
   88-98/SBR   level 1 kanji
   98-9F/SBR   level 2 kanji (initial set)
   E0-EA/SBR   level 2 kanji (continuation set)
   EB-EF/SBR   reserved for expansion
   F0-F9/SBR   gaiji
   FA-FC/SBR   implementation-defined kanji
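The rules above amount to a lookup on the first byte. The sketch below is one way to express that lookup; the function name and the label strings are illustrative. It assumes single width katakana occupy A1-DF (decimal 161 through 223) and treats 80, A0, and FD-FF as reserved or unused.

```python
# (first, last, description) for the first byte of a two byte character.
TWO_BYTE_BLOCKS = [
    (0x81, 0x81, "special symbols"),
    (0x82, 0x82, "double width digits, latin letters, hiragana"),
    (0x83, 0x83, "katakana and Greek"),
    (0x84, 0x84, "Cyrillic and line-drawing"),
    (0x85, 0x87, "reserved for expansion"),
    (0x88, 0x97, "level 1 kanji"),
    (0x98, 0x98, "level 1 / level 2 kanji boundary"),
    (0x99, 0x9F, "level 2 kanji (initial set)"),
    (0xE0, 0xEA, "level 2 kanji (continuation set)"),
    (0xEB, 0xEF, "reserved for expansion"),
    (0xF0, 0xF9, "gaiji"),
    (0xFA, 0xFC, "implementation-defined kanji"),
]


def classify_first_byte(b: int) -> str:
    """Describe what a shift-JIS first byte introduces."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ASCII"
    if 0xA1 <= b <= 0xDF:
        return "single width katakana"
    for first, last, name in TWO_BYTE_BLOCKS:
        if first <= b <= last:
            return name
    return "reserved or unused"  # 80, A0, FD-FF


if __name__ == "__main__":
    for value in (0x41, 0xB1, 0x8A, 0xF0, 0x80):
        print(f"{value:02X}: {classify_first_byte(value)}")
```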

gaiji - user defined kanji.

SBR - Second Byte Ranges of hexadecimal values in a shift-JIS two byte character. These ranges are 40-7E and 80-FC.
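The second byte ranges make it possible to walk a shift-JIS byte string one character at a time. A minimal sketch, assuming first bytes of two byte characters lie in 81-9F or E0-FC; the helper names are made up for this example:

```python
from typing import List


def is_first_byte(b: int) -> bool:
    """True if b starts a two byte shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_second_byte(b: int) -> bool:
    """True if b falls inside the SBR (40-7E or 80-FC)."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def split_sjis(data: bytes) -> List[bytes]:
    """Split shift-JIS bytes into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        if is_first_byte(data[i]):
            if i + 1 >= len(data) or not is_second_byte(data[i + 1]):
                raise ValueError(f"invalid second byte at offset {i + 1}")
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars


if __name__ == "__main__":
    sample = "漢字A".encode("shift_jis")
    print([c.hex() for c in split_sjis(sample)])
```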

Shift-JIS is ASCII non-transparent.
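One practical consequence: because the second byte of a two byte character can fall in 40-7E, it can collide with an ASCII code such as the backslash (5C). The sketch below contrasts a naive byte search with one that steps over two byte characters; it relies on Python's built-in shift_jis codec, and the function names are illustrative.

```python
def contains_naive(data: bytes, ch: bytes) -> bool:
    """Byte-level search that ignores character boundaries."""
    return ch in data


def contains_sjis_aware(data: bytes, ch: bytes) -> bool:
    """Search for a one-byte character, skipping two byte characters."""
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            i += 2  # skip the first and second byte together
            continue
        if data[i:i + 1] == ch:
            return True
        i += 1
    return False


if __name__ == "__main__":
    sample = "表".encode("shift_jis")  # second byte is 5C
    print(sample.hex())
    print(contains_naive(sample, b"\\"))
    print(contains_sjis_aware(sample, b"\\"))
```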

JIS X 0208-1990 - The Japanese character set as described in the document JIS X 0208-1990 specifies 6,879 standard characters; 6,355 kanji in 2 levels (Level 1: 2,965 kanji arranged by pronunciation; Level 2: 3,390 kanji arranged by radical), 86 katakana, 83 hiragana, 10 numerals, 52 Roman characters, 147 symbols, 66 Russian characters, 48 Greek characters, and 32 line elements (for making charts). This standard was first established in 1978, modified for the first time in 1983 (character position swapping, glyph changes, and four kanji appended to JIS Level 2), and modified again in 1990 (two kanji were appended to JIS Level 2). This character set is widely implemented on a variety of platforms. Encoding methods for JIS X 0208-1990 include Shift-JIS, EUC, and JIS.
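The category counts quoted above can be checked against the stated total of 6,879 characters with a few lines of arithmetic. The dictionary keys are just labels chosen for this check:

```python
# Character counts for JIS X 0208-1990 as listed in the entry above.
COUNTS = {
    "kanji level 1": 2965,
    "kanji level 2": 3390,
    "katakana": 86,
    "hiragana": 83,
    "numerals": 10,
    "Roman": 52,
    "symbols": 147,
    "Russian": 66,
    "Greek": 48,
    "line elements": 32,
}

kanji_total = COUNTS["kanji level 1"] + COUNTS["kanji level 2"]
grand_total = sum(COUNTS.values())

if __name__ == "__main__":
    print(kanji_total, grand_total)
```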

JIS X 0212-1990 - Late in 1990 a supplemental Japanese character standard called JIS X 0212-1990 was published by JIS which specified an additional 5,801 kanji, 21 symbols/diacritical marks, and 245 Latin-based characters with diacritical marks. This means that there are now 12,156 standard kanji in Japanese. However, no machine has yet implemented these new characters. The current plan does not call for extending the Shift-JIS encoding (no space left in its code space), but to use EUC instead. There were three plans to extend EUC to include these additional kanji, and the one which was the most popular was to simply put these characters into code set 3.

NON-ELECTRONIC CHARACTER SETS - There are other character sets, all of which played an important role in establishing JIS X 0208-1990. They include the 1850 Toyo Kanji (now the 1945 Joyo Kanji), the 881 Kyoiku Kanji (now the 1006 Gakushu Kanji), and the 284 Jinmei-yo Kanji.

Toyo Kanji is obsolete, and has been replaced by Joyo Kanji. Kyoiku Kanji is obsolete, and has been replaced by Gakushu Kanji. Gakushu Kanji is a subset of Joyo Kanji.

Here is how these character sets relate to JIS X 0208-1990 (and earlier versions): All of Joyo Kanji (and Kyoiku Kanji) are included in JIS Level 1 (2965 kanji total) of JIS C 6226-1978. When Joyo Kanji was introduced in 1981, the additional 95 kanji and subsequent glyph changes forced the creation of JIS X 0208-1983 (actually, this was first called JIS C 6226-1983, then changed to the new designation in 1987) -- those extra 95 characters had to be made part of JIS Level 1 (some were in JIS level 2 already, so some code positions were swapped). Jinmei-yo Kanji, on the other hand, had to only appear in JIS Levels 1 or 2, so that is why 4 kanji were appended in 1983, and 2 more in 1990.

JOUYOU KANJI - The 1945 kanji in the Joyo Kanji Table are those officially sanctioned for use in education. These kanji are a subset of JIS X 0208-1990 (actually, a subset of JIS Level 1).

7bit Japanese codes - all of these 7-bit codes share a common character encoding system, but their KI and KO escape sequences differ.

kanji-in - escape sequence which tells Japanese terminals to treat what follows as two-bytes per character.

KI - same as kanji-in.

kanji-out - escape sequence which tells Japanese terminals to treat what follows as one-byte per character (back to JIS-Roman or ASCII).

KO - same as kanji-out.

JIS codeset - same as New-JIS.

New-JIS (1983) - an encoding of Joyo Kanji. More common than old-JIS.

New-JIS (1990) - same as New-JIS (1983) with two kanji characters added to level 2.

Old-JIS - an encoding of Toyo Kanji.

NEC-JIS - an encoding developed by NEC containing characters closest to that of Old-JIS (same number of kanji, but a few glyph differences).

NEC-JIS (1978) - same as NEC-JIS.

NEC Kanji codeset - same as NEC-JIS.

JIS7 - an encoding of JIS that contains only 7-bit characters.

JIS8 - an encoding of JIS that contains both 7-bit and 8-bit characters (the 8-bit characters are used only for encoding half-width katakana).

A two-byte per character encoding system using 7-bit bytes (ASCII) can encode up to 16,384 characters (128 by 128); however, the Japanese use only the 94 printable ASCII codes in their matrix, so a maximum of 8,836 characters (94 by 94) can be encoded.
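The arithmetic, and the way a position in the 94 by 94 matrix becomes two 7-bit bytes, can be sketched as follows. The row and cell numbering (1 through 94, with each byte equal to hexadecimal 20 plus the index) is an assumption based on the usual description of the JIS matrix, and the function name is illustrative.

```python
# The 94 printable ASCII codes run from hex 21 to hex 7E.
PRINTABLE = range(0x21, 0x7F)

FULL_MATRIX = 128 * 128                       # every 7-bit byte pair
JIS_MATRIX = len(PRINTABLE) * len(PRINTABLE)  # printable codes only


def matrix_to_bytes(row: int, cell: int) -> bytes:
    """Map a 1-based (row, cell) position to its two 7-bit bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


if __name__ == "__main__":
    print(FULL_MATRIX, JIS_MATRIX)
    print(matrix_to_bytes(1, 1), matrix_to_bytes(94, 94))
```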

New-JIS and old-JIS have the same encoding for KO. They each have a JIS-Roman KO and an ASCII KO.

New-JIS, old-JIS, and NEC-JIS have a different encoding for KI. NEC-JIS does not have an ASCII KO.

New-JIS (1983) and Old-JIS (1978) differ in the following ways: 1) For four characters, the shape in the original code position was simplified, and the unsimplified shape was given a new code position at the end of JIS Level 2 (four characters added). 2) Simplified/unsimplified character pairs exchanged code positions between JIS Level 1 and JIS Level 2 (22 pairs total). In the case of New-JIS, the simplified form is in the JIS Level 1 column, and the unsimplified form is in the JIS Level 2 column. In the case of Old-JIS, it is simply the reverse. 3) The shapes of several characters were changed (246 characters total).

Joyo Kanji - introduced in 1981 for education and contains an additional 95 kanji relative to Toyo Kanji.

Toyo Kanji - standard kanji character set used prior to 1981 in education containing 1850 characters.

There are two types of New-JIS; one based on the 1983 standard, and one based on the 1990 standard. The only difference between them is that the 1990 standard includes two additional kanji appended to JIS Level 2. For all practical purposes they can be treated as the same.

JIS7 - This code is identical to the JIS encoding described above, except that it allows one to use half-width katakana. It is called JIS7 since all the characters in such an encoded file have their 8th bits masked.

ASCII SI - ASCII shift-in control character.

ASCII SO - ASCII shift-out control character.
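The following sketch shows how 8-bit half-width katakana in the single-byte portion of a stream could be rewritten in 7-bit form using these two control characters. It assumes the usual convention that SO (hex 0E) switches to katakana and SI (hex 0F) switches back, and that half-width katakana occupy hex A1 through DF. The function name is hypothetical.

```python
SO = 0x0E  # shift-out: following bytes are half-width katakana
SI = 0x0F  # shift-in: back to the one-byte Roman set


def to_seven_bit(data: bytes) -> bytes:
    """Mask the 8th bit of half-width katakana, bracketing runs with SO/SI."""
    out = bytearray()
    shifted = False
    for b in data:
        is_kana = 0xA1 <= b <= 0xDF
        if is_kana and not shifted:
            out.append(SO)
            shifted = True
        elif not is_kana and shifted:
            out.append(SI)
            shifted = False
        out.append(b & 0x7F)  # only kana bytes actually lose a bit here
    if shifted:
        out.append(SI)
    return bytes(out)


if __name__ == "__main__":
    print(to_seven_bit(b"A\xb1\xb2B"))
```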

In order to support JIS8, a terminal must be able to handle 8-bit bytes. To get half-width katakana into the output, you simply insert byte values decimal 161 through 223 (hexadecimal A1 through DF) into the kanji-out (i.e., single-byte) portion of the stream. In short, a JIS8 terminal will behave like a Shift-JIS terminal for half-width katakana.
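Since these byte values mean the same thing in Shift-JIS, Python's built-in shift_jis codec can be used to see which katakana each one represents. This is only an illustration of the byte range, not of any particular terminal; the function name is made up.

```python
def half_width_katakana(value: int) -> str:
    """Decode one byte in the range 161..223 as half-width katakana."""
    if not 161 <= value <= 223:
        raise ValueError("half-width katakana occupy decimal 161 through 223")
    return bytes([value]).decode("shift_jis")


if __name__ == "__main__":
    # Print the whole single-byte katakana range on one line.
    print("".join(half_width_katakana(v) for v in range(161, 224)))
```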

In new-JIS and old-JIS, the KI and KO are three byte sequences. In NEC-JIS, KI and KO are two byte sequences.
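Because the three 7-bit codes differ only in their escape sequences, a file's code can usually be guessed from the KI sequence it contains. The byte values below are not given in this dictionary; they are the commonly documented sequences (ESC $ B, ESC $ @, ESC K) and should be read as an assumption of this sketch, as should the function name.

```python
from typing import Optional

ESC = b"\x1b"

# kanji-in sequences: three bytes for New-JIS and Old-JIS, two for NEC-JIS.
KI = {
    "New-JIS": ESC + b"$B",
    "Old-JIS": ESC + b"$@",
    "NEC-JIS": ESC + b"K",
}

# kanji-out sequences.
KO = {
    "JIS-Roman": ESC + b"(J",
    "ASCII": ESC + b"(B",
    "NEC-JIS": ESC + b"H",
}


def guess_seven_bit_code(data: bytes) -> Optional[str]:
    """Return the name of the first KI sequence found, or None."""
    for name, sequence in KI.items():
        if sequence in data:
            return name
    return None


if __name__ == "__main__":
    sample = "漢字".encode("iso2022_jp")
    print(guess_seven_bit_code(sample))
```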

8bit Japanese codes cannot be used reliably through electronic mail networks since 7-bit paths will strip off the 8th bit. These codes are used primarily for internal processing of Japanese on various systems.
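The effect of a 7-bit path can be simulated by masking the 8th bit of every byte. In the sketch below, EUC-encoded text loses its high bits and is left as what looks like ordinary ASCII, with no kanji-in or kanji-out sequences to tell a terminal how to interpret it. Python's built-in euc_jp codec is used for the sample; the function name is illustrative.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a 7-bit mail path by clearing the 8th bit of each byte."""
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    original = "漢字".encode("euc_jp")
    damaged = strip_eighth_bit(original)
    print(original.hex(), damaged)
```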

The Japanese system software for the Macintosh uses Shift-JIS code internally.

MS KANJI - same as shift-JIS.

AT&T JIS - same as EUC.

The Shift-JIS implementation is quite unlike that of JIS7. CompuServe and the UNIX mail system allow escape characters to function properly, so these users can read Japanese text on-line just like normal English text as long as their terminal allows Japanese to be displayed.

There are a variety of Japanese code conversion programs available. Some are portable, and some are specific to a particular platform.

jis.c - allows a user to change the Japanese code within a textfile. It can handle Shift-JIS and EUC (both 8-bit Japanese codes), and all three 7-bit Japanese codes (New-JIS, Old-JIS, and NEC-JIS). Distributed by lunde@adobe.com.
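For comparison, the same kind of code conversion can be sketched with Python's built-in codecs. This is not how jis.c works; it is only an illustration of converting between Shift-JIS, EUC, and a 7-bit JIS code. The codec names shift_jis, euc_jp, and iso2022_jp are Python's; the function name is made up, and Python's codecs do not distinguish Old-JIS or NEC-JIS on output.

```python
def convert_code(data: bytes, source: str, target: str) -> bytes:
    """Re-encode Japanese text from one code to another."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    sjis = "日本語".encode("shift_jis")
    print(convert_code(sjis, "shift_jis", "euc_jp").hex())
    print(convert_code(sjis, "shift_jis", "iso2022_jp"))
```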

NKF - Network Kanji Filter, a program similar to jis.c, distributed by ichikawa@flab.fujitsu.co.jp. It was written to run under UNIX, but an MS-DOS port of the program also exists. This program and its patch (nkf.src and nkf.patch, respectively) are available at the FTP site ucdavis.edu (128.120.2.1) in the pub/JIS/C directory.

There are two primary electronic mail networks which are used in Japan: JUNET (Japan UNIX Network) and BITNET (also called CREN). There are other computer networks in Japan, such as JAIN (Japanese Academic Network), TISN (Todai International Science Network), WIDE, etc.

JUNET - the most active electronic mail network in Japan. There are 5 domains within JUNET: AC, GO, OR, AD, and CO.

AC - Academic Institution.

GO - Government.

OR - Organization.

AD - Administration.

CO - Corporation.

jp - the JUNET domain, itself.

BITNET - There are currently 93 BITNET nodes in Japan. Despite the growth of JUNET and the development of the Japanese Internet (WIDE), demand for membership in BITNET is still increasing.

WIDE - the Japanese Internet (WIDE), currently under development. JUNET is more reliable than BITNET.

NiftyServe - a very popular Japanese BBS.

Although there is no direct way to send email from the Internet to NiftyServe, there exists a gateway between NiftyServe and CompuServe. Most NiftyServe users have CompuServe accounts.

TWICS - an electronic mail and computer conferencing system based in Tokyo, Japan. It is operated as a commercial service. TWICS first went online in 1984, became globally accessible through public packet switched data networks in 1986, and joined the world of inter-system inter-network email in 1987.

JUNET News - the Japanese equivalent to Usenet NEWS. Each newsgroup name is prefixed with fj, which means "From Japan." To subscribe to the JUNET News mailing list, simply send a request to Hisao Nojima at nojima@nttlab.ntt.jp or to Stanford University at junet-news-request@russell.stanford.edu.

Wnn - a Japanese front-end processor which runs under UNIX (a Macintosh version already exists). The name Wnn comes from the project objective to make a good Kana-to-Kanji conversion program which can convert "Watashino Namae ha Nakano desu" into correct Japanese on the first attempt. Wnn version 4.0.3 is available at the FTP site utsun.s.u-tokyo.ac.jp (133.11.11.11).

JSTEVIE - a public domain vi-based editor which supports most vi and ex commands, Japanese input, tag stack, etc. It runs mainly on UNIX and MS-DOS systems, and should be easy to port to other platforms. It is based on the STEVIE 3.69 sources by Tony Andrews. The latest version of JSTEVIE (J1.2) is available at the FTP site utsun.s.u-tokyo.ac.jp (133.11.11.11). It is also available at the FTP site mindseye.berkeley.edu (128.32.232.19).

NEmacs - (Nihongo Emacs) a Japanese language editor based on GNU Emacs. It runs on any system on which GNU Emacs runs, and is distributed in the form of patches to GNU Emacs or the previous version of NEmacs. NEmacs can handle kanji and kana characters in a buffer, and displays them on the screen. File I/O, interprocess communication, screen display, and keyboard input are all specially designed for handling Japanese character codes: JIS, Shift-JIS, and EUC.

kterm - kanji terminal emulation program. If you know xterm, kterm is the same, only it uses JIS fonts. It automatically follows shifts between ASCII and JIS formats, and is fine by itself for reading mail. Many sizes of fonts are available. Since X terminals usually have big screens and high resolution, a 16-point font is fine for general use. Kterm also displays Korean and Chinese, if you have the appropriate Hangul/Hanzi fonts. Kinput captures input, uses Wnn to convert to kanji, then sends the JIS codes.

MOKE - (Mark's Own Kanji Editor for DOS) version 2.1 is a Japanese text editor written by Mark Edwards (101015.206@compuserve.com). It allows one to create Japanese text for sending by electronic mail. It can also be used for displaying Japanese text. A ShareWare version of MOKE (version 1.1) is still available at the FTP site utsun.s.u-tokyo.ac.jp (133.11.11.11).

KD - (Kanji Driver for DOS), a program written by Izumi Ohzawa (izumi@violet.berkeley.edu) of University of California-Berkeley. KD is available at the FTP site mindseye.berkeley.edu (128.32.232.19). KD supports JIS Levels 1 and 2, and Japanese files may be displayed on-line (one does not have to download a file for viewing).

hterm - a Japanese terminal program for DOS (version 2.6.0.0), available free for non-military use at the FTP site azabu.tkl.iis.u-tokyo.ac.jp (130.69.16.7). It is a full-featured terminal program that allows one to view Japanese on-line using American-made IBM/PC's with EGA or VGA. It emulates a VT220 terminal. It also contains a program called hemacs, with which one can read Shift-JIS encoded files up to 800 lines long.

IBM DOS J5.0/V - a Japanese operating system for MS-DOS computers, and IBM's answer to KanjiTalk for the Macintosh. It is similar to KanjiTalk in that its Japanese fonts are stored in RAM rather than in ROM.

NinjaTerm 0.962 - a very popular Japanese communications program written by Michiharu Ariza of Adobe Systems Japan. It is FreeWare, and available at the FTP site ucdavis.edu (128.120.2.1). NinjaTerm can send and receive Japanese text in New-JIS and Old-JIS codes; it also supports Shift-JIS and EUC.