Clarify documentation about escape sequences in strings.

* objects.texi (General Escape Syntax): Clarify the explanation of
escape sequences.
(Non-ASCII in Strings): Clarify when a string is unibyte vs multibyte.
Hex escapes do not automatically make a string multibyte.
author: Chong Yidong 2012-11-03 19:02:43 +0800
committer: Chong Yidong 2012-11-03 19:02:43 +0800
commit: 2395ab64f6152af46b804cecc5743b8139031968 (patch)
tree: 660e27f4dc4739c9ba4c5d5252892b38b5f4eede
parent: 43bcfda6d863c6172eeba2d6aa22d22453849423 (diff)
download: emacs-2395ab64f6152af46b804cecc5743b8139031968.tar.gz
emacs-2395ab64f6152af46b804cecc5743b8139031968.zip
2 files changed, 86 insertions, 68 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index fa996191ac4..17bd43fc0d9 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,11 @@
+2012-11-03  Chong Yidong  <cyd@gnu.org>
+        * objects.texi (General Escape Syntax): Clarify the explanation of
+        escape sequences.
+        (Non-ASCII in Strings): Clarify when a string is unibyte vs
+        multibyte.  Hex escapes do not automatically make a string
+        multibyte.
 2012-11-03  Martin Rudalics  <rudalics@gmx.at>
        * windows.texi (Switching Buffers): Document option
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 7d40f0ff934..2507b0540eb 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -351,51 +351,48 @@ following text.)
 control characters, Emacs provides several types of escape syntax that
 you can use to specify non-@acronym{ASCII} text characters.
-@cindex unicode character escape
-  You can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character that maps to the Unicode
-code point @samp{U+@var{nnnn}} (by convention, Unicode code points are
-given in hexadecimal).  There is a slightly different syntax for
-specifying characters with code points higher than
-@code{U+@var{ffff}}: @code{\U00@var{nnnnnn}} represents the character
-whose code point is @samp{U+@var{nnnnnn}}.  The Unicode Standard only
-defines code points up to @samp{U+@var{10ffff}}, so if you specify a
-code point higher than that, Emacs signals an error.
-  This peculiar and inconvenient syntax was adopted for compatibility
-with other programming languages.  Unlike some other languages, Emacs
-Lisp supports this syntax only in character literals and strings.
 @cindex @samp{\} in character constant
 @cindex backslash in character constants
-@cindex octal character code
+@cindex unicode character escape
-  The most general read syntax for a character represents the
+  Firstly, you can specify characters by their Unicode values.
-character code in either octal or hex.  To use octal, write a question
+@code{?\u@var{nnnn}} represents a character with Unicode code point
-mark followed by a backslash and the octal character code (up to three
+@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
-octal digits); thus, @samp{?\101} for the character @kbd{A},
+number with exactly four digits.  The backslash indicates that the
-@samp{?\001} for the character @kbd{C-a}, and @code{?\002} for the
+subsequent characters form an escape sequence, and the @samp{u}
-character @kbd{C-b}.  Although this syntax can represent any
+specifies a Unicode escape sequence.
-@acronym{ASCII} character, it is preferred only when the precise octal
-value is more important than the @acronym{ASCII} representation.
+  There is a slightly different syntax for specifying Unicode
+characters with code points higher than @code{U+@var{ffff}}:
-@example
+@code{?\U00@var{nnnnnn}} represents the character with code point
-@group
+@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
-?\012 @result{} 10         ?\n @result{} 10         ?\C-j @result{} 10
+number.  The Unicode Standard only defines code points up to
-?\101 @result{} 65         ?A @result{} 65
+@samp{U+@var{10ffff}}, so if you specify a code point higher than
-@end group
+that, Emacs signals an error.
-@end example
+  Secondly, you can specify characters by their hexadecimal character
-  To use hex, write a question mark followed by a backslash, @samp{x},
+codes.  A hexadecimal escape sequence consists of a backslash,
-and the hexadecimal character code.  You can use any number of hex
+@samp{x}, and the hexadecimal character code.  Thus, @samp{?\x41} is
-digits, so you can represent any character code in this way.
+the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
-Thus, @samp{?\x41} for the character @kbd{A}, @samp{?\x1} for the
+@code{?\xe0} is the character
-character @kbd{C-a}, and @code{?\xe0} for the Latin-1 character
 @iftex
 @samp{@`a}.
 @end iftex
 @ifnottex
 @samp{a} with grave accent.
 @end ifnottex
+You can use any number of hex digits, so you can represent any
+character code in this way.
+@cindex octal character code
+  Thirdly, you can specify characters by their character code in
+octal.  An octal escape sequence consists of a backslash followed by
+up to three octal digits; thus, @samp{?\101} for the character
+@kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
+for the character @kbd{C-b}.  Only characters up to octal code 777 can
+be specified this way.
+  These escape sequences may also be used in strings.  @xref{Non-ASCII
+in Strings}.
 @node Ctl-Char Syntax
 @subsubsection Control-Character Syntax
@@ -1026,40 +1023,53 @@ but the newline is ignored if escaped."
 @node Non-ASCII in Strings
 @subsubsection Non-@acronym{ASCII} Characters in Strings
-  You can include a non-@acronym{ASCII} international character in a
+  There are two text representations for non-@acronym{ASCII}
-string constant by writing it literally.  There are two text
+characters in Emacs strings: multibyte and unibyte (@pxref{Text
-representations for non-@acronym{ASCII} characters in Emacs strings
+Representations}).  Roughly speaking, unibyte strings store raw bytes,
-(and in buffers): unibyte and multibyte (@pxref{Text
+while multibyte strings store human-readable text.  Each character in
-Representations}).  If the string constant is read from a multibyte
+a unibyte string is a byte, i.e.@: its value is between 0 and 255.  By
-source, such as a multibyte buffer or string, or a file that would be
+contrast, each character in a multibyte string may have a value
-visited as multibyte, then Emacs reads the non-@acronym{ASCII}
+between 0 and 4194303 (@pxref{Character Type}).  In both cases,
-character as a multibyte character and automatically makes the string
+characters above 127 are non-@acronym{ASCII}.
-a multibyte string.  If the string constant is read from a unibyte
-source, then Emacs reads the non-@acronym{ASCII} character as unibyte,
+  You can include a non-@acronym{ASCII} character in a string constant
-and makes the string unibyte.
+by writing it literally.  If the string constant is read from a
+multibyte source, such as a multibyte buffer or string, or a file that
-  Instead of writing a non-@acronym{ASCII} character literally into a
+would be visited as multibyte, then Emacs reads each
-multibyte string, you can write it as its character code using a hex
+non-@acronym{ASCII} character as a multibyte character and
-escape, @samp{\x@var{nnnnnnn}}, with as many digits as necessary.
+automatically makes the string a multibyte string.  If the string
-(Multibyte non-@acronym{ASCII} character codes are all greater than
+constant is read from a unibyte source, then Emacs reads the
-256.)  You can also specify a character in a multibyte string using
+non-@acronym{ASCII} character as unibyte, and makes the string
-the @samp{\u} or @samp{\U} Unicode escape syntax (@pxref{General
+unibyte.
-Escape Syntax}).  In either case, any character which is not a valid
-hex digit terminates the construct.  If the next character in the
+  Instead of writing a character literally into a multibyte string,
-string could be interpreted as a hex digit, write @w{@samp{\ }}
+you can write it as its character code using an escape sequence.
-(backslash and space) to terminate the hex escape---for example,
+@xref{General Escape Syntax}, for details about escape sequences.
+  If you use any Unicode-style escape sequence @samp{\uNNNN} or
+@samp{\U00NNNNNN} in a string constant (even for an @acronym{ASCII}
+character), Emacs automatically assumes that it is multibyte.
+  You can also use hexadecimal escape sequences (@samp{\x@var{n}}) and
+octal escape sequences (@samp{\@var{n}}) in string constants.
+@strong{But beware:} If a string constant contains hexadecimal or
+octal escape sequences, and these escape sequences all specify unibyte
+characters (i.e.@: less than 256), and there are no other literal
+non-@acronym{ASCII} characters or Unicode-style escape sequences in
+the string, then Emacs automatically assumes that it is a unibyte
+string.  That is to say, it assumes that all non-@acronym{ASCII}
+characters occurring in the string are 8-bit raw bytes.
+  In hexadecimal and octal escape sequences, the escaped character
+code may contain any number of digits, so the first subsequent
+character which is not a valid hexadecimal or octal digit terminates
+the escape sequence.  If the next character in a string could be
+interpreted as a hexadecimal or octal digit, write @w{@samp{\ }}
+(backslash and space) to terminate the escape sequence.  For example,
 @w{@samp{\xe0\ }} represents one character, @samp{a} with grave
 accent.  @w{@samp{\ }} in a string constant is just like
 backslash-newline; it does not contribute any character to the string,
-but it does terminate the preceding hex escape.  Using any hex escape
+but it does terminate any preceding hex escape.
-in a string (even for an @acronym{ASCII} character) automatically
-forces the string to be multibyte.
-  You can represent a unibyte non-@acronym{ASCII} character with its
-character code, which must be in the range from 128 (0200 octal) to
-255 (0377 octal).  If you write all such character codes in octal and
-the string contains no other characters forcing it to be multibyte,
-this produces a unibyte string.
 @node Nonprinting Characters
 @subsubsection Nonprinting Characters in Strings
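
The behavior described in the revised (Non-ASCII in Strings) text can be checked with a short Emacs Lisp sketch. It is illustrative only and not part of the patch. The function name escape-demo-report is hypothetical, and the sketch assumes an Emacs whose reader behaves as the revised text describes.

```elisp
(defun escape-demo-report ()
  "Return an alist describing how the reader treats a few escapes.
Hypothetical helper, written only for this illustration."
  (list
   ;; Character literals: Unicode, hex, and octal escapes for the
   ;; same character code (a with grave accent).
   (cons 'char-codes (list ?\u00e0 ?\xe0 ?\340))
   ;; A hex or octal escape below 256, with nothing else forcing
   ;; multibyte, is read as a raw byte in a unibyte string.
   (cons 'hex-is-multibyte (multibyte-string-p "\xe0"))
   (cons 'octal-is-multibyte (multibyte-string-p "\340"))
   ;; A Unicode-style escape makes the string multibyte.
   (cons 'unicode-is-multibyte (multibyte-string-p "\u00e0"))
   ;; Without a terminator, the following "b" is consumed as a third
   ;; hex digit, giving the single character #xe0b.
   (cons 'unterminated-length (length "\xe0b"))
   ;; Backslash-space ends the hex escape, so "b" stays separate.
   (cons 'terminated-length (length "\xe0\ b"))))
```

Evaluating (escape-demo-report) should give (224 224 224) for the character codes, nil for both the hex and octal strings, t for the Unicode-escaped string, and lengths 1 and 2 for the unterminated and terminated hex escapes respectively.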