| author | Dave Love | 2002-06-16 19:57:54 +0000 |
|---|---|---|
| committer | Dave Love | 2002-06-16 19:57:54 +0000 |
| commit | 5a936b4698228cb5c8c86da284a7075a7a34d0c3 (patch) | |
| tree | 3e4cd74238e5c0c0b4a020471f1759952918a30e /src/coding.c | |
| parent | dc8533549ecc3ac1b08dd5fb8f052fcff961ef0e (diff) | |
Diffstat (limited to 'src/coding.c')
| -rw-r--r-- | src/coding.c | 106 |
1 files changed, 24 insertions, 82 deletions
diff --git a/src/coding.c b/src/coding.c
index abc11ea5eb7..78ab0e0db03 100644
--- a/src/coding.c
+++ b/src/coding.c
| @@ -94,7 +94,7 @@ CODING SYSTEM | |||
| 94 | o BIG5 | 94 | o BIG5 |
| 95 | 95 | ||
| 96 | A coding system to encode character sets: ASCII and Big5. Widely | 96 | A coding system to encode character sets: ASCII and Big5. Widely |
| 97 | used by Chinese (mainly in Taiwan and Hong Kong). Details are | 97 | used for Chinese (mainly in Taiwan and Hong Kong). Details are |
| 98 | described in section 8. In this file, when we write "big5" (all | 98 | described in section 8. In this file, when we write "big5" (all |
| 99 | lowercase), we mean the coding system, and when we write "Big5" | 99 | lowercase), we mean the coding system, and when we write "Big5" |
| 100 | (capitalized), we mean the character set. | 100 | (capitalized), we mean the character set. |
| @@ -108,7 +108,7 @@ CODING SYSTEM | |||
| 108 | 108 | ||
| 109 | o Raw-text | 109 | o Raw-text |
| 110 | 110 | ||
| 111 | A coding system for a text containing raw eight-bit data. Emacs | 111 | A coding system for text containing raw eight-bit data. Emacs |
| 112 | treats each byte of source text as a character (except for | 112 | treats each byte of source text as a character (except for |
| 113 | end-of-line conversion). | 113 | end-of-line conversion). |
| 114 | 114 | ||
| @@ -587,7 +587,7 @@ enum iso_code_class_type | |||
| 587 | (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ | 587 | (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ |
| 588 | ->data) | 588 | ->data) |
| 589 | 589 | ||
| 590 | /* Index for each coding category in `coding_category_table' */ | 590 | /* Index for each coding category in `coding_categories' */ |
| 591 | 591 | ||
| 592 | enum coding_category | 592 | enum coding_category |
| 593 | { | 593 | { |
| @@ -2049,21 +2049,23 @@ encode_coding_emacs_mule (coding) | |||
| 2049 | 2049 | ||
| 2050 | /* The following note describes the coding system ISO2022 briefly. | 2050 | /* The following note describes the coding system ISO2022 briefly. |
| 2051 | Since the intention of this note is to help understand the | 2051 | Since the intention of this note is to help understand the |
| 2052 | functions in this file, some parts are NOT ACCURATE or OVERLY | 2052 | functions in this file, some parts are NOT ACCURATE or are OVERLY |
| 2053 | SIMPLIFIED. For thorough understanding, please refer to the | 2053 | SIMPLIFIED. For thorough understanding, please refer to the |
| 2054 | original document of ISO2022. | 2054 | original document of ISO2022. This is equivalent to the standard |
| 2055 | ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*). | ||
| 2055 | 2056 | ||
| 2056 | ISO2022 provides many mechanisms to encode several character sets | 2057 | ISO2022 provides many mechanisms to encode several character sets |
| 2057 | in 7-bit and 8-bit environments. For 7-bite environments, all text | 2058 | in 7-bit and 8-bit environments. For 7-bit environments, all text |
| 2058 | is encoded using bytes less than 128. This may make the encoded | 2059 | is encoded using bytes less than 128. This may make the encoded |
| 2059 | text a little bit longer, but the text passes more easily through | 2060 | text a little bit longer, but the text passes more easily through |
| 2060 | several gateways, some of which strip off MSB (Most Signigant Bit). | 2061 | several types of gateway, some of which strip off the MSB (Most |
| 2062 | Significant Bit). | ||
| 2061 | 2063 | ||
| 2062 | There are two kinds of character sets: control character set and | 2064 | There are two kinds of character sets: control character sets and |
| 2063 | graphic character set. The former contains control characters such | 2065 | graphic character sets. The former contain control characters such |
| 2064 | as `newline' and `escape' to provide control functions (control | 2066 | as `newline' and `escape' to provide control functions (control |
| 2065 | functions are also provided by escape sequences). The latter | 2067 | functions are also provided by escape sequences). The latter |
| 2066 | contains graphic characters such as 'A' and '-'. Emacs recognizes | 2068 | contain graphic characters such as 'A' and '-'. Emacs recognizes |
| 2067 | two control character sets and many graphic character sets. | 2069 | two control character sets and many graphic character sets. |
| 2068 | 2070 | ||
| 2069 | Graphic character sets are classified into one of the following | 2071 | Graphic character sets are classified into one of the following |
| @@ -2075,14 +2077,14 @@ encode_coding_emacs_mule (coding) | |||
| 2075 | - DIMENSION2_CHARS96 | 2077 | - DIMENSION2_CHARS96 |
| 2076 | 2078 | ||
| 2077 | In addition, each character set is assigned an identification tag, | 2079 | In addition, each character set is assigned an identification tag, |
| 2078 | unique for each set, called "final character" (denoted as <F> | 2080 | unique for each set, called the "final character" (denoted as <F> |
| 2079 | hereafter). The <F> of each character set is decided by ECMA(*) | 2081 | hereafter). The <F> of each character set is decided by ECMA(*) |
| 2080 | when it is registered in ISO. The code range of <F> is 0x30..0x7F | 2082 | when it is registered in ISO. The code range of <F> is 0x30..0x7F |
| 2081 | (0x30..0x3F are for private use only). | 2083 | (0x30..0x3F are for private use only). |
| 2082 | 2084 | ||
| 2083 | Note (*): ECMA = European Computer Manufacturers Association | 2085 | Note (*): ECMA = European Computer Manufacturers Association |
| 2084 | 2086 | ||
| 2085 | Here are examples of graphic character set [NAME(<F>)]: | 2087 | Here are examples of graphic character sets [NAME(<F>)]: |
| 2086 | o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | 2088 | o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... |
| 2087 | o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | 2089 | o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... |
| 2088 | o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 2090 | o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
| @@ -2175,11 +2177,11 @@ encode_coding_emacs_mule (coding) | |||
| 2175 | Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 2177 | Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 2176 | '(' must be omitted. We refer to this as "short-form" hereafter. | 2178 | '(' must be omitted. We refer to this as "short-form" hereafter. |
| 2177 | 2179 | ||
| 2178 | Now you may notice that there are a lot of ways for encoding the | 2180 | Now you may notice that there are a lot of ways of encoding the |
| 2179 | same multilingual text in ISO2022. Actually, there exist many | 2181 | same multilingual text in ISO2022. Actually, there exist many |
| 2180 | coding systems such as Compound Text (used in X11's inter client | 2182 | coding systems such as Compound Text (used in X11's inter client |
| 2181 | communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR | 2183 | communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR |
| 2182 | (used in Korean internet), EUC (Extended UNIX Code, used in Asian | 2184 | (used in Korean Internet), EUC (Extended UNIX Code, used in Asian |
| 2183 | localized platforms), and all of these are variants of ISO2022. | 2185 | localized platforms), and all of these are variants of ISO2022. |
| 2184 | 2186 | ||
| 2185 | In addition to the above, Emacs handles two more kinds of escape | 2187 | In addition to the above, Emacs handles two more kinds of escape |
| @@ -2201,19 +2203,19 @@ encode_coding_emacs_mule (coding) | |||
| 2201 | o ESC '3' -- start relative composition with alternate chars (**) | 2203 | o ESC '3' -- start relative composition with alternate chars (**) |
| 2202 | o ESC '4' -- start rule-base composition with alternate chars (**) | 2204 | o ESC '4' -- start rule-base composition with alternate chars (**) |
| 2203 | Since these are not standard escape sequences of any ISO standard, | 2205 | Since these are not standard escape sequences of any ISO standard, |
| 2204 | the use of them for these meaning is restricted to Emacs only. | 2206 | the use of them with these meanings is restricted to Emacs only. |
| 2205 | 2207 | ||
| 2206 | (*) This form is used only in Emacs 20.5 and the older versions, | 2208 | (*) This form is used only in Emacs 20.7 and older versions, |
| 2207 | but the newer versions can safely decode it. | 2209 | but newer versions can safely decode it. |
| 2208 | (**) This form is used only in Emacs 21.1 and the newer versions, | 2210 | (**) This form is used only in Emacs 21.1 and newer versions, |
| 2209 | and the older versions can't decode it. | 2211 | and older versions can't decode it. |
| 2210 | 2212 | ||
| 2211 | Here's a list of examples usages of these composition escape | 2213 | Here's a list of example usages of these composition escape |
| 2212 | sequences (categorized by `enum composition_method'). | 2214 | sequences (categorized by `enum composition_method'). |
| 2213 | 2215 | ||
| 2214 | COMPOSITION_RELATIVE: | 2216 | COMPOSITION_RELATIVE: |
| 2215 | ESC 0 CHAR [ CHAR ] ESC 1 | 2217 | ESC 0 CHAR [ CHAR ] ESC 1 |
| 2216 | COMPOSITOIN_WITH_RULE: | 2218 | COMPOSITION_WITH_RULE: |
| 2217 | ESC 2 CHAR [ RULE CHAR ] ESC 1 | 2219 | ESC 2 CHAR [ RULE CHAR ] ESC 1 |
| 2218 | COMPOSITION_WITH_ALTCHARS: | 2220 | COMPOSITION_WITH_ALTCHARS: |
| 2219 | ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 | 2221 | ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 |
| @@ -4535,66 +4537,6 @@ encode_coding_charset (coding) | |||
| 4535 | 4537 | ||
| 4536 | /*** 7. C library functions ***/ | 4538 | /*** 7. C library functions ***/ |
| 4537 | 4539 | ||
| 4538 | /* In Emacs Lisp, coding system is represented by a Lisp symbol which | ||
| 4539 | has a property `coding-system'. The value of this property is a | ||
| 4540 | vector of length 5 (called as coding-vector). Among elements of | ||
| 4541 | this vector, the first (element[0]) and the fifth (element[4]) | ||
| 4542 | carry important information for decoding/encoding. Before | ||
| 4543 | decoding/encoding, this information should be set in fields of a | ||
| 4544 | structure of type `coding_system'. | ||
| 4545 | |||
| 4546 | A value of property `coding-system' can be a symbol of another | ||
| 4547 | subsidiary coding-system. In that case, Emacs gets coding-vector | ||
| 4548 | from that symbol. | ||
| 4549 | |||
| 4550 | `element[0]' contains information to be set in `coding->type'. The | ||
| 4551 | value and its meaning is as follows: | ||
| 4552 | |||
| 4553 | 0 -- coding_type_emacs_mule | ||
| 4554 | 1 -- coding_type_sjis | ||
| 4555 | 2 -- coding_type_iso_2022 | ||
| 4556 | 3 -- coding_type_big5 | ||
| 4557 | 4 -- coding_type_ccl encoder/decoder written in CCL | ||
| 4558 | nil -- coding_type_no_conversion | ||
| 4559 | t -- coding_type_undecided (automatic conversion on decoding, | ||
| 4560 | no-conversion on encoding) | ||
| 4561 | |||
| 4562 | `element[4]' contains information to be set in `coding->flags' and | ||
| 4563 | `coding->spec'. The meaning varies by `coding->type'. | ||
| 4564 | |||
| 4565 | If `coding->type' is `coding_type_iso_2022', element[4] is a vector | ||
| 4566 | of length 32 (of which the first 13 sub-elements are used now). | ||
| 4567 | Meanings of these sub-elements are: | ||
| 4568 | |||
| 4569 | sub-element[N] where N is 0 through 3: to be set in `coding->spec.iso_2022' | ||
| 4570 | If the value is an integer of valid charset, the charset is | ||
| 4571 | assumed to be designated to graphic register N initially. | ||
| 4572 | |||
| 4573 | If the value is minus, it is a minus value of charset which | ||
| 4574 | reserves graphic register N, which means that the charset is | ||
| 4575 | not designated initially but should be designated to graphic | ||
| 4576 | register N just before encoding a character in that charset. | ||
| 4577 | |||
| 4578 | If the value is nil, graphic register N is never used on | ||
| 4579 | encoding. | ||
| 4580 | |||
| 4581 | sub-element[N] where N is 4 through 11: to be set in `coding->flags' | ||
| 4582 | Each value takes t or nil. See the section ISO2022 of | ||
| 4583 | `coding.h' for more information. | ||
| 4584 | |||
| 4585 | If `coding->type' is `coding_type_big5', element[4] is t to denote | ||
| 4586 | BIG5-ETen or nil to denote BIG5-HKU. | ||
| 4587 | |||
| 4588 | If `coding->type' takes the other value, element[4] is ignored. | ||
| 4589 | |||
| 4590 | Emacs Lisp's coding system also carries information about format of | ||
| 4591 | end-of-line in a value of property `eol-type'. If the value is | ||
| 4592 | integer, 0 means eol_lf, 1 means eol_crlf, and 2 means eol_cr. If | ||
| 4593 | it is not integer, it should be a vector of subsidiary coding | ||
| 4594 | systems of which property `eol-type' has one of above values. | ||
| 4595 | |||
| 4596 | */ | ||
| 4597 | |||
| 4598 | /* Setup coding context CODING from information about CODING_SYSTEM. | 4540 | /* Setup coding context CODING from information about CODING_SYSTEM. |
| 4599 | If CODING_SYSTEM is nil, `no-conversion' is assumed. If | 4541 | If CODING_SYSTEM is nil, `no-conversion' is assumed. If |
| 4600 | CODING_SYSTEM is invalid, signal an error. */ | 4542 | CODING_SYSTEM is invalid, signal an error. */ |