JIS encoding
In computing, JIS encoding refers to several Japanese Industrial Standards for encoding the Japanese language.[1] Strictly speaking, the term means either:
- A set of standard coded character sets for Japanese, notably:
  - JIS X 0201, the Japanese version of ISO 646 (ASCII), containing the base 7-bit ASCII characters (with some modifications) and 64 half-width katakana characters.
  - JIS X 0208, the most common kanji character set, containing 6,879 characters: 6,355 kanji and 524 other characters (one 94 by 94 plane).
  - JIS X 0212, a supplement to JIS X 0208 that adds 5,801 kanji, for a total of 12,156 kanji (a second 94 by 94 plane).
  - JIS X 0213, which extends JIS X 0208 (two planes).
- JIS X 0202 (also known as ISO-2022-JP), a set of encoding mechanisms for sending JIS character data over transmission media that only support 7-bit data.
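As an illustration of how a 94 by 94 plane is addressed, the following Python sketch maps a row and cell position (each from 1 to 94) to the two 7-bit bytes used for JIS X 0208, then cross-checks the result against Python's built-in `euc_jp` codec. The helper name `kuten_to_jis` is hypothetical and introduced only for this example.

```python
# Illustrative sketch; kuten_to_jis is a hypothetical helper, not a library function.

PLANE_SIZE = 94 * 94  # code points available in one 94 by 94 plane


def kuten_to_jis(row, cell):
    """Map a 1-based (row, cell) position to two bytes in the range 0x21-0x7E."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


# Row 16, cell 1 holds the first kanji of JIS X 0208 (亜).
jis_bytes = kuten_to_jis(16, 1)

# EUC-JP carries the same two bytes with the high bit set.
euc_bytes = bytes(b | 0x80 for b in jis_bytes)

print(PLANE_SIZE)                   # 8836
print(6355 + 524)                   # 6879
print(jis_bytes.hex())              # 3021
print(euc_bytes.decode("euc_jp"))   # 亜
```

Only 6,879 of the 8,836 positions in the plane are assigned by JIS X 0208, which is why unassigned rows are available to vendor extensions.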
In practice, "JIS encoding" usually refers to JIS X 0208 character data encoded with JIS X 0202. For instance, the IANA uses the JIS_Encoding label to refer to JIS X 0202, and the ISO-2022-JP label to refer to the profile thereof defined by RFC 1468.[2]
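The stateful, 7-bit nature of this encoding can be seen with Python's built-in `iso2022_jp` codec, which is assumed here to follow the RFC 1468 profile. The sketch is illustrative only, and the constant names are made up for the example.

```python
# Escape sequences used by the ISO-2022-JP profile (constant names are illustrative).
TO_JISX0208 = b"\x1b$B"  # switch to JIS X 0208
TO_ASCII = b"\x1b(B"     # switch back to ASCII

encoded = "JIS: 日本語".encode("iso2022_jp")
print(encoded)  # b'JIS: \x1b$BF|K\\8l\x1b(B'

# Every byte fits in 7 bits, so the data survives a 7-bit channel.
seven_bit = all(byte < 0x80 for byte in encoded)

# The kanji sit between the two escape sequences, two bytes per character.
start = encoded.index(TO_JISX0208) + len(TO_JISX0208)
end = encoded.index(TO_ASCII)
kanji_bytes = encoded[start:end]

print(seven_bit, len(kanji_bytes))  # True 6
```

The kanji bytes (`F|K\8l`) are indistinguishable from ASCII text unless the decoder has seen the preceding escape sequence, which is what makes the encoding stateful.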
Other encoding mechanisms for JIS characters include the Shift JIS encoding and EUC-JP. Shift JIS adds the kanji, full-width hiragana and full-width katakana from JIS X 0208 to JIS X 0201 in a backward-compatible way.[3] Shift JIS is perhaps the most widely used encoding in Japan. Its compatibility with the single-byte JIS X 0201 character set allowed electronic equipment manufacturers (such as cash register manufacturers) to offer an upgrade from older, cheaper equipment that could not display kanji to newer equipment while retaining character-set compatibility.
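The backward compatibility with JIS X 0201 can be sketched in Python using the built-in `shift_jis` codec. The function `split_shift_jis` is a hypothetical helper written for this example. It assumes the lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xEF that cover JIS X 0208, ignores vendor extensions, and does not validate trail bytes.

```python
def split_shift_jis(data):
    """Split Shift JIS bytes into per-character chunks (illustrative only)."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Lead bytes of double-byte characters; everything else is one byte.
        is_lead = 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF
        width = 2 if is_lead else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


halfwidth = "ｱｲｳ"  # half-width katakana from JIS X 0201
sjis = ("ABC" + halfwidth + "漢字").encode("shift_jis")

# Half-width katakana keep their single-byte JIS X 0201 values.
print(halfwidth.encode("shift_jis").hex())  # b1b2b3

# ASCII and katakana stay single-byte; only the kanji take two bytes.
print([chunk.hex() for chunk in split_shift_jis(sjis)])
# ['41', '42', '43', 'b1', 'b2', 'b3', '8abf', '8e9a']
```

Because the single-byte values are unchanged, data produced by JIS X 0201-only equipment is already valid Shift JIS.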
EUC-JP is used on UNIX systems, where the JIS encodings are incompatible with POSIX standards.
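A brief sketch of the EUC-JP byte layout, using Python's built-in `euc_jp` codec, shows that every byte of a non-ASCII character has its high bit set. The sample characters are arbitrary, and the JIS X 0212 example assumes that 丂 (U+4E02) belongs to that set and that the codec implements JIS X 0212, which not every EUC-JP implementation does.

```python
# Sample characters are arbitrary; variable names are illustrative.
kanji = "亜".encode("euc_jp")   # JIS X 0208: two bytes, each with the high bit set
kana = "ｱ".encode("euc_jp")     # JIS X 0201 katakana: single shift 0x8E + one byte
supp = "丂".encode("euc_jp")    # JIS X 0212 (assumed): single shift 0x8F + two bytes

print(kanji.hex())  # b0a1
print(kana.hex())   # 8eb1
print(len(supp), hex(supp[0]))

# No byte of these characters falls in the ASCII range 0x00-0x7F.
high_only = all(byte >= 0x80 for byte in kanji + kana + supp)
print(high_only)  # True
```

Compare the katakana result with Shift JIS, where the same character is the single byte 0xB1. This difference is why plain 8-bit JIS X 0201 input needs conversion before it is valid EUC-JP.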
A more recent alternative to JIS coded characters is Unicode (UCS coded characters), particularly in the UTF-8 encoding mechanism.
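For a rough comparison, the following Python sketch encodes the same short string with each scheme and reports the byte counts, then checks whether a character outside the JIS repertoire can be encoded. It relies only on Python's built-in codecs. The helper `can_encode` is hypothetical, and the sample text is arbitrary.

```python
text = "日本語"
codecs_to_try = ("iso2022_jp", "shift_jis", "euc_jp", "utf-8", "utf-16-le")

# Encoded size in bytes for each scheme.
sizes = {name: len(text.encode(name)) for name in codecs_to_try}
print(sizes)
# {'iso2022_jp': 12, 'shift_jis': 6, 'euc_jp': 6, 'utf-8': 9, 'utf-16-le': 6}


def can_encode(s, codec):
    """Return True if every character of s exists in the codec's repertoire."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


hangul = "한"  # a Korean syllable, not part of JIS X 0208
print(can_encode(hangul, "shift_jis"), can_encode(hangul, "utf-8"))  # False True
```

The ISO-2022-JP figure includes the two three-byte escape sequences, so its overhead shrinks as runs of kanji get longer.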
Encoding comparison
The following table compares the features of the three main encoding schemes for JIS X 0208.
Encoding | Alternate name | 7-bit?[a] | ISO 2022? | Stateless?[b] | Accepts ASCII? | 0x00–7F always ASCII? | Superset of 8-bit JIS X 0201? | Supports JIS X 0212? | Self-synchronising?
---|---|---|---|---|---|---|---|---|---
ISO-2022-JP | "JIS" (JIS X 0202) | Yes | Yes | No[c] | Yes | Sequences can be non-ASCII[c] | No (encoding possible)[d] | Possible[e] | No
Shift_JIS | "SJIS" | No | No | Yes | Almost[f] | Isolated bytes can be non-ASCII[g] | Yes | No | No
EUC-JP | "UJIS" (Unixized JIS) | No | Yes[h] | Yes[h] | Yes[i] | Always ASCII | No (encoded)[j] | Available[k] | No
Unicode formats for comparison[l] | | | | | | | | |
UTF-8 | | No | No | Yes | Yes | Yes | No (encoded) | Available | Yes
UTF-16 | | No | No | Yes | No | No | No (encoded) | Available | Over 16-bit words only
GB 18030 | | No | No[m] | Yes | Yes | Isolated bytes can be non-ASCII | No (encoded) | Available | No
- [a] That is, does not require 8-bit clean transmission.
- [b] That is, the sequence used to encode a given character is always the same, no matter what the previous characters were. See state (computer science).
- [c] ISO-2022-JP is a stateful encoding: all charsets are encoded over 0x21–7E and are switched between using escape sequences. Hence, while it is ASCII in its initial state, entire sequences of non-ASCII characters can be encoded with ASCII bytes.
- [d] JIS X 0201 katakana are available in JIS X 0202 and ISO 2022, but are not included in the basic ISO-2022-JP profile, although they are a common extension.
- [e] JIS X 0212 is available in JIS X 0202 and ISO 2022, and is included in the ISO-2022-JP-1 and ISO-2022-JP-2 profiles, but not in the basic ISO-2022-JP profile.
- [f] Single-byte characters 0x21–7E in Shift_JIS are properly ISO-646-JP, in order to be a superset of 8-bit JIS X 0201, but are often decoded (not necessarily displayed) as ASCII, which differs in only two places.
- [g] Some (not all) ASCII bytes can appear as second bytes, but not first bytes, of double-byte characters in Shift_JIS. Hence, in a sequence of two or more ASCII bytes, every byte from the second onward is necessarily an ASCII (or ISO-646-JP) character.
- [h] Packed-format EUC is based on ISO 2022 mechanisms, with charset designations pre-arranged. Charset designation escapes and locking shifts are avoided, whereas use of single shifts can be implemented in a non-stateful manner. The constraints of ISO 2022 are nonetheless followed.
- [i] Single-byte characters 0x21–7E in EUC-JP are generally considered ASCII, but are sometimes treated as ISO-646-JP.
- [j] Unlike Shift_JIS, EUC-JP will not handle plain 8-bit JIS X 0201 input without prior conversion, due to the different representation of the JIS X 0201 katakana (with single shifts).
- [k] JIS X 0212 in EUC-JP is not always implemented.
- [l] Besides the properties of the encodings themselves, Unicode formats have further advantages stemming from the underlying character set: they are not limited to JIS coded characters but can represent the entirety of UCS (including the full repertoire of JIS coded characters), and are hence suited to international use. They are also less badly affected by colliding proprietary extensions, due to their greater base repertoire and designated private use areas.
- [m] While GB 18030 and GBK are extensions of the EUC-CN form of GB/T 2312, they do not follow the constraints of EUC or ISO 2022, unlike EUC-JP (or the original EUC-CN).
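Two of the table's columns, whether bytes 0x00–7F are always ASCII and whether the encoding is self-synchronising, can be checked with Python's built-in codecs. The sketch below is illustrative: `has_ascii_range_bytes` and `utf8_resync` are hypothetical helpers, and the sample strings are arbitrary.

```python
def has_ascii_range_bytes(text, codec):
    """True if encoding text yields any byte in 0x00-0x7F."""
    return any(byte < 0x80 for byte in text.encode(codec))


sample = "表"  # a kanji whose Shift JIS trail byte is 0x5C (backslash in ASCII)
ascii_bytes = {
    codec: has_ascii_range_bytes(sample, codec)
    for codec in ("iso2022_jp", "shift_jis", "euc_jp", "utf-8")
}
print(ascii_bytes)
# {'iso2022_jp': True, 'shift_jis': True, 'euc_jp': False, 'utf-8': False}


def utf8_resync(data, pos):
    """Advance pos to the next UTF-8 character boundary.

    Continuation bytes always match the bit pattern 10xxxxxx, so they
    can be skipped without any knowledge of the preceding bytes.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


utf8 = "日本語".encode("utf-8")
start = utf8_resync(utf8, 1)  # begin in the middle of the first character
print(start, utf8[start:].decode("utf-8"))  # 3 本語
```

No comparable test exists for the trail bytes of Shift_JIS or EUC-JP, because their values overlap with those of lead bytes or single-byte characters. A decoder that starts mid-stream may therefore misread the data.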
References
- [1] Haralambous, Yannis (2007). Fonts & Encodings. O'Reilly Media. pp. 42–44. ISBN 9780596102425.
- [2] "Character Sets". IANA.
- [3] Lunde, Ken (2009). CJKV Information Processing. O'Reilly Media. pp. 262–268. ISBN 9780596514471.