Comment for ISO 2022 encoding mechanism modified.

author: Kenichi Handa 1999-03-01 11:52:54 +0000
committer: Kenichi Handa 1999-03-01 11:52:54 +0000
commit: 39787efd638454ea6ebd29c113b19df3498f7dfa (patch)
tree: 6f0b864853d25d53d73d2de8533ab7ee46032181 /src/coding.c
parent: f805a125e14bc40a5f86ae3bbcf6eb6d72f4b917 (diff)
1 files changed, 71 insertions, 61 deletions
diff --git a/src/coding.c b/src/coding.c
index 863f0d89d7a..a31e3ea8bce 100644
--- a/src/coding.c
+++ b/src/coding.c
@@ -525,33 +525,37 @@ detect_coding_emacs_mule (src, src_end)
 /*** 3. ISO2022 handlers ***/
 /* The following note describes the coding system ISO2022 briefly.
-   Since the intention of this note is to help in understanding of
+   Since the intention of this note is to help understand the
-   the programs in this file, some parts are NOT ACCURATE or OVERLY
+   functions in this file, some parts are NOT ACCURATE or OVERLY
-   SIMPLIFIED.  For the thorough understanding, please refer to the
+   SIMPLIFIED.  For thorough understanding, please refer to the
   original document of ISO2022.
   ISO2022 provides many mechanisms to encode several character sets
-   in 7-bit and 8-bit environment.  If one chooses 7-bite environment,
+   in 7-bit and 8-bit environments.  For 7-bit environments, all text
-   all text is encoded by codes of less than 128.  This may make the
+   is encoded using bytes less than 128.  This may make the encoded
-   encoded text a little bit longer, but the text gets more stability
+   text a little bit longer, but the text passes more easily through
-   to pass through several gateways (some of them strip off the MSB).
+   several gateways, some of which strip off the MSB (Most Significant Bit).
+ 
-   There are two kinds of character set: control character set and
+   There are two kinds of character sets: control character set and
   graphic character set.  The former contains control characters such
   as `newline' and `escape' to provide control functions (control
-   functions are provided also by escape sequences).  The latter
+   functions are also provided by escape sequences).  The latter
-   contains graphic characters such as ' A' and '-'.  Emacs recognizes
+   contains graphic characters such as 'A' and '-'.  Emacs recognizes
   two control character sets and many graphic character sets.
   Graphic character sets are classified into one of the following
-   four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96,
+   four classes, according to the number of bytes (DIMENSION) and
-   DIMENSION2_CHARS94, DIMENSION2_CHARS96 according to the number of
+   number of characters in one dimension (CHARS) of the set:
-   bytes (DIMENSION) and the number of characters in one dimension
+   - DIMENSION1_CHARS94
-   (CHARS) of the set.  In addition, each character set is assigned an
+   - DIMENSION1_CHARS96
-   identification tag (called "final character" and denoted as <F>
+   - DIMENSION2_CHARS94
-   here after) which is unique in each class.  <F> of each character
+   - DIMENSION2_CHARS96
-   set is decided by ECMA(*) when it is registered in ISO.  Code range
-   of <F> is 0x30..0x7F (0x30..0x3F are for private use only).
+   In addition, each character set is assigned an identification tag,
+   unique for each set, called "final character" (denoted as <F>
+   hereafter).  The <F> of each character set is decided by ECMA(*)
+   when it is registered in ISO.  The code range of <F> is 0x30..0x7F
+   (0x30..0x3F are for private use only).
   Note (*): ECMA = European Computer Manufacturers Association
@@ -561,55 +565,61 @@ detect_coding_emacs_mule (src, src_end)
        o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ...
        o DIMENSION2_CHARS96 -- none for the moment
-   A code area (1byte=8bits) is divided into 4 areas, C0, GL, C1, and GR.
+   A code area (1 byte=8 bits) is divided into 4 areas, C0, GL, C1, and GR.
        C0 [0x00..0x1F] -- control character plane 0
        GL [0x20..0x7F] -- graphic character plane 0
        C1 [0x80..0x9F] -- control character plane 1
        GR [0xA0..0xFF] -- graphic character plane 1
   A control character set is directly designated and invoked to C0 or
-   C1 by an escape sequence.  The most common case is that ISO646's
+   C1 by an escape sequence.  The most common case is that:
-   control character set is designated/invoked to C0 and ISO6429's
+   - ISO646's  control character set is designated/invoked to C0, and
-   control character set is designated/invoked to C1, and usually
+   - ISO6429's control character set is designated/invoked to C1,
-   these designations/invocations are omitted in a coded text.  With
+   and usually these designations/invocations are omitted in encoded
-   7-bit environment, only C0 can be used, and a control character for
+   text.  In a 7-bit environment, only C0 can be used, and a control
-   C1 is encoded by an appropriate escape sequence to fit in the
+   character for C1 is encoded by an appropriate escape sequence to
-   environment.  All control characters for C1 are defined the
+   fit into the environment.  All control characters for C1 are
-   corresponding escape sequences.
+   defined to have corresponding escape sequences.
   A graphic character set is at first designated to one of four
   graphic registers (G0 through G3), then these graphic registers are
   invoked to GL or GR.  These designations and invocations can be
   done independently.  The most common case is that G0 is invoked to
-   GL, G1 is invoked to GR, and ASCII is designated to G0, and usually
+   GL, G1 is invoked to GR, and ASCII is designated to G0.  Usually
-   these invocations and designations are omitted in a coded text.
+   these invocations and designations are omitted in encoded text.
-   With 7-bit environment, only GL can be used.
+   In a 7-bit environment, only GL can be used.
-   When a graphic character set of CHARS94 is invoked to GL, code 0x20
+   When a graphic character set of CHARS94 is invoked to GL, codes
-   and 0x7F of GL area work as control characters SPACE and DEL
+   0x20 and 0x7F of the GL area work as control characters SPACE and
-   respectively, and code 0xA0 and 0xFF of GR area should not be used.
+   DEL respectively, and codes 0xA0 and 0xFF of the GR area should not
+   be used.
   There are two ways of invocation: locking-shift and single-shift.
   With locking-shift, the invocation lasts until the next different
-   invocation, whereas with single-shift, the invocation works only
+   invocation, whereas with single-shift, the invocation affects the
-   for the following character and doesn't affect locking-shift.
+   following character only and doesn't affect the locking-shift
-   Invocations are done by the following control characters or escape
+   state.  Invocations are done by the following control characters or
-   sequences.
+   escape sequences:
   ----------------------------------------------------------------------
-   function             control char    escape sequence description
+   abbrev  function                  cntrl escape seq   description
   ----------------------------------------------------------------------
-   SI  (shift-in)               0x0F    none            invoke G0 to GL
+   SI/LS0  (shift-in)                0x0F  none         invoke G0 into GL
-   SO  (shift-out)              0x0E    none            invoke G1 to GL
+   SO/LS1  (shift-out)               0x0E  none         invoke G1 into GL
-   LS2 (locking-shift-2)        none    ESC 'n'         invoke G2 into GL
+   LS2     (locking-shift-2)         none  ESC 'n'      invoke G2 into GL
-   LS3 (locking-shift-3)        none    ESC 'o'         invoke G3 into GL
+   LS3     (locking-shift-3)         none  ESC 'o'      invoke G3 into GL
-   SS2 (single-shift-2)         0x8E    ESC 'N'         invoke G2 into GL
+   LS1R    (locking-shift-1 right)   none  ESC '~'      invoke G1 into GR (*)
-   SS3 (single-shift-3)         0x8F    ESC 'O'         invoke G3 into GL
+   LS2R    (locking-shift-2 right)   none  ESC '}'      invoke G2 into GR (*)
+   LS3R    (locking-shift-3 right)   none  ESC '|'      invoke G3 into GR (*)
+   SS2     (single-shift-2)          0x8E  ESC 'N'      invoke G2 for one char
+   SS3     (single-shift-3)          0x8F  ESC 'O'      invoke G3 for one char
   ----------------------------------------------------------------------
-   The first four are for locking-shift.  Control characters for these
+   (*) These are not used by any known coding system.
-   functions are defined by macros ISO_CODE_XXX in `coding.h'.
+   Control characters for these functions are defined by macros
+   ISO_CODE_XXX in `coding.h'.
-   Designations are done by the following escape sequences.
+   Designations are done by the following escape sequences:
   ----------------------------------------------------------------------
   escape sequence      description
   ----------------------------------------------------------------------
@@ -632,40 +642,40 @@ detect_coding_emacs_mule (src, src_end)
   ----------------------------------------------------------------------
   In this list, "DIMENSION1_CHARS94<F>" means a graphic character set
-   of dimension 1, chars 94, and final character <F>, and etc.
+   of dimension 1, chars 94, and final character <F>, etc...
   Note (*): Although these designations are not allowed in ISO2022,
   Emacs accepts them on decoding, and produces them on encoding
-   CHARS96 character set in a coding system which is characterized as
+   CHARS96 character sets in a coding system which is characterized as
   7-bit environment, non-locking-shift, and non-single-shift.
   Note (**): If <F> is '@', 'A', or 'B', the intermediate character
-   '(' can be omitted.  We call this as "short-form" here after.
+   '(' can be omitted.  We refer to this as "short-form" hereafter.
   Now you may notice that there are a lot of ways for encoding the
-   same multilingual text in ISO2022.  Actually, there exists many
+   same multilingual text in ISO2022.  Actually, there exist many
-   coding systems such as Compound Text (used in X's inter client
+   coding systems such as Compound Text (used in X11's inter client
-   communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR
+   communication), ISO-2022-JP (used in Japanese internet), ISO-2022-KR
-   (used in Korean Internet), EUC (Extended UNIX Code, used in Asian
+   (used in Korean internet), EUC (Extended UNIX Code, used in Asian
   localized platforms), and all of these are variants of ISO2022.
   In addition to the above, Emacs handles two more kinds of escape
   sequences: ISO6429's direction specification and Emacs' private
   sequence for specifying character composition.
-   ISO6429's direction specification takes the following format:
+   ISO6429's direction specification takes the following form:
        o CSI ']'      -- end of the current direction
        o CSI '0' ']'  -- end of the current direction
        o CSI '1' ']'  -- start of left-to-right text
        o CSI '2' ']'  -- start of right-to-left text
   The control character CSI (0x9B: control sequence introducer) is
-   abbreviated to the escape sequence ESC '[' in 7-bit environment.
+   abbreviated to the escape sequence ESC '[' in a 7-bit environment.
-   
-   Character composition specification takes the following format:
+   Character composition specification takes the following form:
        o ESC '0' -- start character composition
        o ESC '1' -- end character composition
-   Since these are not standard escape sequences of any ISO, the use
+   Since these are not standard escape sequences of any ISO standard,
-   of them for these meaning is restricted to Emacs only.  */
+   the use of them for these meanings is restricted to Emacs only.  */
 enum iso_code_class_type iso_code_class[256];
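
The code-area split described in the comment above (C0, GL, C1, GR) and the 7-bit rewriting of C1 controls can be sketched in a few lines of C. This is only an illustration, not code from coding.c: the names `code_area`, `classify_byte`, and `c1_to_escape_final` are hypothetical. The mapping "C1 byte minus 0x40" is an assumption generalized from the pairs the comment lists (0x8E and ESC 'N', 0x8F and ESC 'O', 0x9B and ESC '[').

```c
/* Hypothetical helper names for illustration; these are not the
   identifiers used in coding.c.  */
enum code_area { AREA_C0, AREA_GL, AREA_C1, AREA_GR };

/* Classify one byte into the four code areas listed in the comment:
   C0 [0x00..0x1F], GL [0x20..0x7F], C1 [0x80..0x9F], GR [0xA0..0xFF].  */
static enum code_area
classify_byte (unsigned char c)
{
  if (c < 0x20)
    return AREA_C0;
  if (c < 0x80)
    return AREA_GL;
  if (c < 0xA0)
    return AREA_C1;
  return AREA_GR;
}

/* For a C1 control character C, return the byte that follows ESC in
   its 7-bit escape-sequence form, assuming the "subtract 0x40" rule
   that matches SS2, SS3, and CSI in the comment.  Return 0 if C is
   not in the C1 area.  */
static unsigned char
c1_to_escape_final (unsigned char c)
{
  if (classify_byte (c) != AREA_C1)
    return 0;
  return (unsigned char) (c - 0x40);
}
```

Under these assumptions, `c1_to_escape_final (0x9B)` gives '[', which matches the comment's statement that CSI is written as ESC '[' in a 7-bit environment, and any byte outside C1 (for example 'A') gives 0.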