| author | Kenichi Handa | 1999-03-01 11:52:54 +0000 |
|---|---|---|
| committer | Kenichi Handa | 1999-03-01 11:52:54 +0000 |
| commit | 39787efd638454ea6ebd29c113b19df3498f7dfa | |
| tree | 6f0b864853d25d53d73d2de8533ab7ee46032181 /src/coding.c | |
| parent | f805a125e14bc40a5f86ae3bbcf6eb6d72f4b917 | |
Comment for ISO 2022 encoding mechanism modified.
Diffstat (limited to 'src/coding.c')
| -rw-r--r-- | src/coding.c | 132 |
1 files changed, 71 insertions, 61 deletions
diff --git a/src/coding.c b/src/coding.c
index 863f0d89d7a..a31e3ea8bce 100644
--- a/src/coding.c
+++ b/src/coding.c
| @@ -525,33 +525,37 @@ detect_coding_emacs_mule (src, src_end) | |||
| 525 | /*** 3. ISO2022 handlers ***/ | 525 | /*** 3. ISO2022 handlers ***/ |
| 526 | 526 | ||
| 527 | /* The following note describes the coding system ISO2022 briefly. | 527 | /* The following note describes the coding system ISO2022 briefly. |
| 528 | Since the intention of this note is to help in understanding of | 528 | Since the intention of this note is to help understand the |
| 529 | the programs in this file, some parts are NOT ACCURATE or OVERLY | 529 | functions in this file, some parts are NOT ACCURATE or OVERLY |
| 530 | SIMPLIFIED. For the thorough understanding, please refer to the | 530 | SIMPLIFIED. For thorough understanding, please refer to the |
| 531 | original document of ISO2022. | 531 | original document of ISO2022. |
| 532 | 532 | ||
| 533 | ISO2022 provides many mechanisms to encode several character sets | 533 | ISO2022 provides many mechanisms to encode several character sets |
| 534 | in 7-bit and 8-bit environment. If one chooses 7-bite environment, | 534 | in 7-bit and 8-bit environments. For 7-bite environments, all text |
| 535 | all text is encoded by codes of less than 128. This may make the | 535 | is encoded using bytes less than 128. This may make the encoded |
| 536 | encoded text a little bit longer, but the text gets more stability | 536 | text a little bit longer, but the text passes more easily through |
| 537 | to pass through several gateways (some of them strip off the MSB). | 537 | several gateways, some of which strip off MSB (Most Signigant Bit). |
| 538 | 538 | ||
| 539 | There are two kinds of character set: control character set and | 539 | There are two kinds of character sets: control character set and |
| 540 | graphic character set. The former contains control characters such | 540 | graphic character set. The former contains control characters such |
| 541 | as `newline' and `escape' to provide control functions (control | 541 | as `newline' and `escape' to provide control functions (control |
| 542 | functions are provided also by escape sequences). The latter | 542 | functions are also provided by escape sequences). The latter |
| 543 | contains graphic characters such as ' A' and '-'. Emacs recognizes | 543 | contains graphic characters such as 'A' and '-'. Emacs recognizes |
| 544 | two control character sets and many graphic character sets. | 544 | two control character sets and many graphic character sets. |
| 545 | 545 | ||
| 546 | Graphic character sets are classified into one of the following | 546 | Graphic character sets are classified into one of the following |
| 547 | four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, | 547 | four classes, according to the number of bytes (DIMENSION) and |
| 548 | DIMENSION2_CHARS94, DIMENSION2_CHARS96 according to the number of | 548 | number of characters in one dimension (CHARS) of the set: |
| 549 | bytes (DIMENSION) and the number of characters in one dimension | 549 | - DIMENSION1_CHARS94 |
| 550 | (CHARS) of the set. In addition, each character set is assigned an | 550 | - DIMENSION1_CHARS96 |
| 551 | identification tag (called "final character" and denoted as <F> | 551 | - DIMENSION2_CHARS94 |
| 552 | here after) which is unique in each class. <F> of each character | 552 | - DIMENSION2_CHARS96 |
| 553 | set is decided by ECMA(*) when it is registered in ISO. Code range | 553 | |
| 554 | of <F> is 0x30..0x7F (0x30..0x3F are for private use only). | 554 | In addition, each character set is assigned an identification tag, |
| 555 | unique for each set, called "final character" (denoted as <F> | ||
| 556 | hereafter). The <F> of each character set is decided by ECMA(*) | ||
| 557 | when it is registered in ISO. The code range of <F> is 0x30..0x7F | ||
| 558 | (0x30..0x3F are for private use only). | ||
| 555 | 559 | ||
| 556 | Note (*): ECMA = European Computer Manufacturers Association | 560 | Note (*): ECMA = European Computer Manufacturers Association |
| 557 | 561 | ||
| @@ -561,55 +565,61 @@ detect_coding_emacs_mule (src, src_end) | |||
| 561 | o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 565 | o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
| 562 | o DIMENSION2_CHARS96 -- none for the moment | 566 | o DIMENSION2_CHARS96 -- none for the moment |
| 563 | 567 | ||
| 564 | A code area (1byte=8bits) is divided into 4 areas, C0, GL, C1, and GR. | 568 | A code area (1 byte=8 bits) is divided into 4 areas, C0, GL, C1, and GR. |
| 565 | C0 [0x00..0x1F] -- control character plane 0 | 569 | C0 [0x00..0x1F] -- control character plane 0 |
| 566 | GL [0x20..0x7F] -- graphic character plane 0 | 570 | GL [0x20..0x7F] -- graphic character plane 0 |
| 567 | C1 [0x80..0x9F] -- control character plane 1 | 571 | C1 [0x80..0x9F] -- control character plane 1 |
| 568 | GR [0xA0..0xFF] -- graphic character plane 1 | 572 | GR [0xA0..0xFF] -- graphic character plane 1 |
| 569 | 573 | ||
| 570 | A control character set is directly designated and invoked to C0 or | 574 | A control character set is directly designated and invoked to C0 or |
| 571 | C1 by an escape sequence. The most common case is that ISO646's | 575 | C1 by an escape sequence. The most common case is that: |
| 572 | control character set is designated/invoked to C0 and ISO6429's | 576 | - ISO646's control character set is designated/invoked to C0, and |
| 573 | control character set is designated/invoked to C1, and usually | 577 | - ISO6429's control character set is designated/invoked to C1, |
| 574 | these designations/invocations are omitted in a coded text. With | 578 | and usually these designations/invocations are omitted in encoded |
| 575 | 7-bit environment, only C0 can be used, and a control character for | 579 | text. In a 7-bit environment, only C0 can be used, and a control |
| 576 | C1 is encoded by an appropriate escape sequence to fit in the | 580 | character for C1 is encoded by an appropriate escape sequence to |
| 577 | environment. All control characters for C1 are defined the | 581 | fit into the environment. All control characters for C1 are |
| 578 | corresponding escape sequences. | 582 | defined to have corresponding escape sequences. |
| 579 | 583 | ||
| 580 | A graphic character set is at first designated to one of four | 584 | A graphic character set is at first designated to one of four |
| 581 | graphic registers (G0 through G3), then these graphic registers are | 585 | graphic registers (G0 through G3), then these graphic registers are |
| 582 | invoked to GL or GR. These designations and invocations can be | 586 | invoked to GL or GR. These designations and invocations can be |
| 583 | done independently. The most common case is that G0 is invoked to | 587 | done independently. The most common case is that G0 is invoked to |
| 584 | GL, G1 is invoked to GR, and ASCII is designated to G0, and usually | 588 | GL, G1 is invoked to GR, and ASCII is designated to G0. Usually |
| 585 | these invocations and designations are omitted in a coded text. | 589 | these invocations and designations are omitted in encoded text. |
| 586 | With 7-bit environment, only GL can be used. | 590 | In a 7-bit environment, only GL can be used. |
| 587 | 591 | ||
| 588 | When a graphic character set of CHARS94 is invoked to GL, code 0x20 | 592 | When a graphic character set of CHARS94 is invoked to GL, codes |
| 589 | and 0x7F of GL area work as control characters SPACE and DEL | 593 | 0x20 and 0x7F of the GL area work as control characters SPACE and |
| 590 | respectively, and code 0xA0 and 0xFF of GR area should not be used. | 594 | DEL respectively, and codes 0xA0 and 0xFF of the GR area should not |
| 595 | be used. | ||
| 591 | 596 | ||
| 592 | There are two ways of invocation: locking-shift and single-shift. | 597 | There are two ways of invocation: locking-shift and single-shift. |
| 593 | With locking-shift, the invocation lasts until the next different | 598 | With locking-shift, the invocation lasts until the next different |
| 594 | invocation, whereas with single-shift, the invocation works only | 599 | invocation, whereas with single-shift, the invocation affects the |
| 595 | for the following character and doesn't affect locking-shift. | 600 | following character only and doesn't affect the locking-shift |
| 596 | Invocations are done by the following control characters or escape | 601 | state. Invocations are done by the following control characters or |
| 597 | sequences. | 602 | escape sequences: |
| 598 | 603 | ||
| 599 | ---------------------------------------------------------------------- | 604 | ---------------------------------------------------------------------- |
| 600 | function control char escape sequence description | 605 | abbrev function cntrl escape seq description |
| 601 | ---------------------------------------------------------------------- | 606 | ---------------------------------------------------------------------- |
| 602 | SI (shift-in) 0x0F none invoke G0 to GL | 607 | SI/LS0 (shift-in) 0x0F none invoke G0 into GL |
| 603 | SO (shift-out) 0x0E none invoke G1 to GL | 608 | SO/LS1 (shift-out) 0x0E none invoke G1 into GL |
| 604 | LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL | 609 | LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL |
| 605 | LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL | 610 | LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL |
| 606 | SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 into GL | 611 | LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*) |
| 607 | SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 into GL | 612 | LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*) |
| 613 | LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*) | ||
| 614 | SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char | ||
| 615 | SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char | ||
| 608 | ---------------------------------------------------------------------- | 616 | ---------------------------------------------------------------------- |
| 609 | The first four are for locking-shift. Control characters for these | 617 | (*) These are not used by any known coding system. |
| 610 | functions are defined by macros ISO_CODE_XXX in `coding.h'. | 618 | |
| 619 | Control characters for these functions are defined by macros | ||
| 620 | ISO_CODE_XXX in `coding.h'. | ||
| 611 | 621 | ||
| 612 | Designations are done by the following escape sequences. | 622 | Designations are done by the following escape sequences: |
| 613 | ---------------------------------------------------------------------- | 623 | ---------------------------------------------------------------------- |
| 614 | escape sequence description | 624 | escape sequence description |
| 615 | ---------------------------------------------------------------------- | 625 | ---------------------------------------------------------------------- |
| @@ -632,40 +642,40 @@ detect_coding_emacs_mule (src, src_end) | |||
| 632 | ---------------------------------------------------------------------- | 642 | ---------------------------------------------------------------------- |
| 633 | 643 | ||
| 634 | In this list, "DIMENSION1_CHARS94<F>" means a graphic character set | 644 | In this list, "DIMENSION1_CHARS94<F>" means a graphic character set |
| 635 | of dimension 1, chars 94, and final character <F>, and etc. | 645 | of dimension 1, chars 94, and final character <F>, etc... |
| 636 | 646 | ||
| 637 | Note (*): Although these designations are not allowed in ISO2022, | 647 | Note (*): Although these designations are not allowed in ISO2022, |
| 638 | Emacs accepts them on decoding, and produces them on encoding | 648 | Emacs accepts them on decoding, and produces them on encoding |
| 639 | CHARS96 character set in a coding system which is characterized as | 649 | CHARS96 character sets in a coding system which is characterized as |
| 640 | 7-bit environment, non-locking-shift, and non-single-shift. | 650 | 7-bit environment, non-locking-shift, and non-single-shift. |
| 641 | 651 | ||
| 642 | Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 652 | Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 643 | '(' can be omitted. We call this as "short-form" here after. | 653 | '(' can be omitted. We refer to this as "short-form" hereafter. |
| 644 | 654 | ||
| 645 | Now you may notice that there are a lot of ways for encoding the | 655 | Now you may notice that there are a lot of ways for encoding the |
| 646 | same multilingual text in ISO2022. Actually, there exists many | 656 | same multilingual text in ISO2022. Actually, there exist many |
| 647 | coding systems such as Compound Text (used in X's inter client | 657 | coding systems such as Compound Text (used in X11's inter client |
| 648 | communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR | 658 | communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR |
| 649 | (used in Korean Internet), EUC (Extended UNIX Code, used in Asian | 659 | (used in Korean internet), EUC (Extended UNIX Code, used in Asian |
| 650 | localized platforms), and all of these are variants of ISO2022. | 660 | localized platforms), and all of these are variants of ISO2022. |
| 651 | 661 | ||
| 652 | In addition to the above, Emacs handles two more kinds of escape | 662 | In addition to the above, Emacs handles two more kinds of escape |
| 653 | sequences: ISO6429's direction specification and Emacs' private | 663 | sequences: ISO6429's direction specification and Emacs' private |
| 654 | sequence for specifying character composition. | 664 | sequence for specifying character composition. |
| 655 | 665 | ||
| 656 | ISO6429's direction specification takes the following format: | 666 | ISO6429's direction specification takes the following form: |
| 657 | o CSI ']' -- end of the current direction | 667 | o CSI ']' -- end of the current direction |
| 658 | o CSI '0' ']' -- end of the current direction | 668 | o CSI '0' ']' -- end of the current direction |
| 659 | o CSI '1' ']' -- start of left-to-right text | 669 | o CSI '1' ']' -- start of left-to-right text |
| 660 | o CSI '2' ']' -- start of right-to-left text | 670 | o CSI '2' ']' -- start of right-to-left text |
| 661 | The control character CSI (0x9B: control sequence introducer) is | 671 | The control character CSI (0x9B: control sequence introducer) is |
| 662 | abbreviated to the escape sequence ESC '[' in 7-bit environment. | 672 | abbreviated to the escape sequence ESC '[' in a 7-bit environment. |
| 663 | 673 | ||
| 664 | Character composition specification takes the following format: | 674 | Character composition specification takes the following form: |
| 665 | o ESC '0' -- start character composition | 675 | o ESC '0' -- start character composition |
| 666 | o ESC '1' -- end character composition | 676 | o ESC '1' -- end character composition |
| 667 | Since these are not standard escape sequences of any ISO, the use | 677 | Since these are not standard escape sequences of any ISO standard, |
| 668 | of them for these meaning is restricted to Emacs only. */ | 678 | the use of them for these meaning is restricted to Emacs only. */ |
| 669 | 679 | ||
| 670 | enum iso_code_class_type iso_code_class[256]; | 680 | enum iso_code_class_type iso_code_class[256]; |
| 671 | 681 | ||