[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

9.2.1 Supported character encoding schemes

A CES is represented by its name as a string. Case is ignored. There may be several aliases defined for a single encoding.

You can check whether the specific conversion is supported on your system or not, by the following function.

Function: ces-conversion-supported? from-ces to-ces
Returns #t if conversion from the character encoding scheme (CES) from-ces to to-ces is supported in this system.

Conversion between common japanese CESes (EUC_JP, Shift JIS, UTF-8 and ISO2022-JP) of the character set JIS X 0201 and JIS X 0213 is handled by Gauche's built-in algorithm (see below for details). When other CES name is given, Gauche uses iconv(3) if it is linked.

When Gauche's conversion routine encounters a character that can't be mapped, it replaces the character for "geta mark" (U+3013) if it's a multibyte character in the input encoding, or for '?' if it's a singlebyte character in the input encoding. If that happens in iconv, handling of such character depends on iconv implementation (glibc implementation returns an error).

If the conversion routine encounters an input sequence that is illegal in the input CES, an error is signalled.

Details of Gauche's native conversion algorithm: Between EUC_JP, Shift JIS and ISO2022JP, Gauche uses arithmetic conversion whenever possible. This even maps the undefined codepoint properly. Between Unicode (UTF-8) and EUC_JP, Gauche uses lookup tables. Between Unicode and Shift JIS or ISO2022JP, Gauche converts the input CES to EUC_JP, then convert it to the output CES. If the same CES is specified for input and output, Gauche's conversion routine just copies input characters to output characters, without checking the validity of the encodings.

EUC_JP, EUCJP, EUCJ, EUC_JISX0213
Covers ASCII, JIS X 0201 kana, JIS X 0212 and JIS X 0213 character sets. JIS X 0212 character set is supported merely because it uses the code region JIS X 0213 doesn't use, and JIS X 0212 characters are not converted properly to Shift JIS and UTF-8. Use JIS X 0213.

SJIFT_JIS, SHIFTJIS, SJIS
Covers Shift_JISX0213, except that 0x5c and 0x7e is mapped to ASCII character set (REVERSE SOLIDUS and TILDE), instead of JIS X 0201 Roman (YEN SIGN and OVERLINE).

UTF-8, UTF8
Unicode 3.2. Note that some JIS X 0213 characters are mapped to Extension B (U+20000 and up). Some JIS X 0213 characters are mapped to two unicode characters (one base character plus a combining character).

ISO2022JP, CSISO2022JP, ISO2022JP-1, ISO2022JP-2, ISO2022JP-3
These encodings differ a bit (except ISO2022JP and CSISO2022JP, which are synonyms), but Gauche handles them same. If one of these CES is specified as input, Gauche recognizes escape sequences of any of CES. ISO2022JP-2 defines several non-Japanese escape sequences, and they are recognized by Gauche, but mapped to substitution character ('?' or geta mark).

For output, Gauche assumes ISO2022JP first, and uses ISO2022JP-1 escape sequence to put JIS X 0212 character, or uses ISO2022JP-3 escape sequence to put JIS X 0213 plane 2 character. Thus, if the string contains only JIS X 0208 characters, the output is compatible to ISO2022JP. Precisely speaking, JIS X 0213 specifies some characters in JIS X 0208 codepoint that shouldn't be mixed with JIS X 0208 characters; Gauche output those characters as JIS X 0208 for compatibility. (This is the same policy as Emacs-Mule's iso2022jp-3-compatible mode).


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

This document was generated by Ken Dickey on November, 28 2002 using texi2html