author Chong Yidong 2012-11-03 19:02:43 +0800
committer Chong Yidong 2012-11-03 19:02:43 +0800
commit 2395ab64f6152af46b804cecc5743b8139031968
tree 660e27f4dc4739c9ba4c5d5252892b38b5f4eede
parent 43bcfda6d863c6172eeba2d6aa22d22453849423
Clarify documentation about escape sequences in strings.
* objects.texi (General Escape Syntax): Clarify the explanation of escape sequences. (Non-ASCII in Strings): Clarify when a string is unibyte vs multibyte. Hex escapes do not automatically make a string multibyte.
-rw-r--r-- doc/lispref/ChangeLog 8
-rw-r--r-- doc/lispref/objects.texi 146
2 files changed, 86 insertions, 68 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index fa996191ac4..17bd43fc0d9 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,11 @@
+2012-11-03 Chong Yidong <cyd@gnu.org>
+
+ * objects.texi (General Escape Syntax): Clarify the explanation of
+ escape sequences.
+ (Non-ASCII in Strings): Clarify when a string is unibyte vs
+ multibyte. Hex escapes do not automatically make a string
+ multibyte.
+
 2012-11-03 Martin Rudalics <rudalics@gmx.at>
 
  * windows.texi (Switching Buffers): Document option
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 7d40f0ff934..2507b0540eb 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -351,51 +351,48 @@ following text.)
 control characters, Emacs provides several types of escape syntax that
 you can use to specify non-@acronym{ASCII} text characters.
 
-@cindex unicode character escape
- You can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character that maps to the Unicode
-code point @samp{U+@var{nnnn}} (by convention, Unicode code points are
-given in hexadecimal). There is a slightly different syntax for
-specifying characters with code points higher than
-@code{U+@var{ffff}}: @code{\U00@var{nnnnnn}} represents the character
-whose code point is @samp{U+@var{nnnnnn}}. The Unicode Standard only
-defines code points up to @samp{U+@var{10ffff}}, so if you specify a
-code point higher than that, Emacs signals an error.
-
- This peculiar and inconvenient syntax was adopted for compatibility
-with other programming languages. Unlike some other languages, Emacs
-Lisp supports this syntax only in character literals and strings.
-
 @cindex @samp{\} in character constant
 @cindex backslash in character constants
-@cindex octal character code
- The most general read syntax for a character represents the
-character code in either octal or hex. To use octal, write a question
-mark followed by a backslash and the octal character code (up to three
-octal digits); thus, @samp{?\101} for the character @kbd{A},
-@samp{?\001} for the character @kbd{C-a}, and @code{?\002} for the
-character @kbd{C-b}. Although this syntax can represent any
-@acronym{ASCII} character, it is preferred only when the precise octal
-value is more important than the @acronym{ASCII} representation.
-
-@example
-@group
-?\012 @result{} 10 ?\n @result{} 10 ?\C-j @result{} 10
-?\101 @result{} 65 ?A @result{} 65
-@end group
-@end example
-
- To use hex, write a question mark followed by a backslash, @samp{x},
-and the hexadecimal character code. You can use any number of hex
-digits, so you can represent any character code in this way.
-Thus, @samp{?\x41} for the character @kbd{A}, @samp{?\x1} for the
-character @kbd{C-a}, and @code{?\xe0} for the Latin-1 character
+@cindex unicode character escape
+ Firstly, you can specify characters by their Unicode values.
+@code{?\u@var{nnnn}} represents a character with Unicode code point
+@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
+number with exactly four digits. The backslash indicates that the
+subsequent characters form an escape sequence, and the @samp{u}
+specifies a Unicode escape sequence.
+
+ There is a slightly different syntax for specifying Unicode
+characters with code points higher than @code{U+@var{ffff}}:
+@code{?\U00@var{nnnnnn}} represents the character with code point
+@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
+number. The Unicode Standard only defines code points up to
+@samp{U+@var{10ffff}}, so if you specify a code point higher than
+that, Emacs signals an error.
+
+ Secondly, you can specify characters by their hexadecimal character
+codes. A hexadecimal escape sequence consists of a backslash,
+@samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is
+the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
+@code{?\xe0} is the character
 @iftex
 @samp{@`a}.
 @end iftex
 @ifnottex
 @samp{a} with grave accent.
 @end ifnottex
+You can use any number of hex digits, so you can represent any
+character code in this way.
+
+@cindex octal character code
+ Thirdly, you can specify characters by their character code in
+octal. An octal escape sequence consists of a backslash followed by
+up to three octal digits; thus, @samp{?\101} for the character
+@kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
+for the character @kbd{C-b}. Only characters up to octal code 777 can
+be specified this way.
+
+ These escape sequences may also be used in strings. @xref{Non-ASCII
+in Strings}.
 
 @node Ctl-Char Syntax
 @subsubsection Control-Character Syntax
@@ -1026,40 +1023,53 @@ but the newline is ignored if escaped."
 @node Non-ASCII in Strings
 @subsubsection Non-@acronym{ASCII} Characters in Strings
 
- You can include a non-@acronym{ASCII} international character in a
-string constant by writing it literally. There are two text
-representations for non-@acronym{ASCII} characters in Emacs strings
-(and in buffers): unibyte and multibyte (@pxref{Text
-Representations}). If the string constant is read from a multibyte
-source, such as a multibyte buffer or string, or a file that would be
-visited as multibyte, then Emacs reads the non-@acronym{ASCII}
-character as a multibyte character and automatically makes the string
-a multibyte string. If the string constant is read from a unibyte
-source, then Emacs reads the non-@acronym{ASCII} character as unibyte,
-and makes the string unibyte.
-
- Instead of writing a non-@acronym{ASCII} character literally into a
-multibyte string, you can write it as its character code using a hex
-escape, @samp{\x@var{nnnnnnn}}, with as many digits as necessary.
-(Multibyte non-@acronym{ASCII} character codes are all greater than
-256.) You can also specify a character in a multibyte string using
-the @samp{\u} or @samp{\U} Unicode escape syntax (@pxref{General
-Escape Syntax}). In either case, any character which is not a valid
-hex digit terminates the construct. If the next character in the
-string could be interpreted as a hex digit, write @w{@samp{\ }}
-(backslash and space) to terminate the hex escape---for example,
+ There are two text representations for non-@acronym{ASCII}
+characters in Emacs strings: multibyte and unibyte (@pxref{Text
+Representations}). Roughly speaking, unibyte strings store raw bytes,
+while multibyte strings store human-readable text. Each character in
+a unibyte string is a byte, i.e.@: its value is between 0 and 255. By
+contrast, each character in a multibyte string may have a value
+between 0 to 4194303 (@pxref{Character Type}). In both cases,
+characters above 127 are non-@acronym{ASCII}.
+
+ You can include a non-@acronym{ASCII} character in a string constant
+by writing it literally. If the string constant is read from a
+multibyte source, such as a multibyte buffer or string, or a file that
+would be visited as multibyte, then Emacs reads each
+non-@acronym{ASCII} character as a multibyte character and
+automatically makes the string a multibyte string. If the string
+constant is read from a unibyte source, then Emacs reads the
+non-@acronym{ASCII} character as unibyte, and makes the string
+unibyte.
+
+ Instead of writing a character literally into a multibyte string,
+you can write it as its character code using an escape sequence.
+@xref{General Escape Syntax}, for details about escape sequences.
+
+ If you use any Unicode-style escape sequence @samp{\uNNNN} or
+@samp{\U00NNNNNN} in a string constant (even for an @acronym{ASCII}
+character), Emacs automatically assumes that it is multibyte.
+
+ You can also use hexadecimal escape sequences (@samp{\x@var{n}}) and
+octal escape sequences (@samp{\@var{n}}) in string constants.
+@strong{But beware:} If a string constant contains hexadecimal or
+octal escape sequences, and these escape sequences all specify unibyte
+characters (i.e.@: less than 256), and there are no other literal
+non-@acronym{ASCII} characters or Unicode-style escape sequences in
+the string, then Emacs automatically assumes that it is a unibyte
+string. That is to say, it assumes that all non-@acronym{ASCII}
+characters occurring in the string are 8-bit raw bytes.
+
+ In hexadecimal and octal escape sequences, the escaped character
+code may contain any number of digits, so the first subsequent
+character which is not a valid hexadecimal or octal digit terminates
+the escape sequence. If the next character in a string could be
+interpreted as a hexadecimal or octal digit, write @w{@samp{\ }}
+(backslash and space) to terminate the escape sequence. For example,
 @w{@samp{\xe0\ }} represents one character, @samp{a} with grave
 accent. @w{@samp{\ }} in a string constant is just like
 backslash-newline; it does not contribute any character to the string,
-but it does terminate the preceding hex escape. Using any hex escape
-in a string (even for an @acronym{ASCII} character) automatically
-forces the string to be multibyte.
-
- You can represent a unibyte non-@acronym{ASCII} character with its
-character code, which must be in the range from 128 (0200 octal) to
-255 (0377 octal). If you write all such character codes in octal and
-the string contains no other characters forcing it to be multibyte,
-this produces a unibyte string.
+but it does terminate any preceding hex escape.
 
 @node Nonprinting Characters
 @subsubsection Nonprinting Characters in Strings
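
Reviewer note (not part of the patch): the behaviour the revised text describes can be checked in a scratch buffer with a sketch like the one below. The helper name `my-escape-demo` and its plist keys are hypothetical, introduced only for this illustration; `multibyte-string-p`, `length`, and `string=` are standard Emacs Lisp functions. It assumes a reasonably recent Emacs in which `=` accepts more than two arguments.

```elisp
;; Sketch only, not part of the commit.  `my-escape-demo' is a
;; hypothetical helper introduced for illustration.
(defun my-escape-demo ()
  "Return a plist summarizing how a few escape sequences are read."
  (list
   ;; Unicode, hex and octal escapes can all name the same character.
   :same-char (= ?\u0041 ?\x41 ?\101 ?A)
   ;; A hex escape below 256, with nothing else forcing multibyte,
   ;; yields a unibyte string holding a raw byte.
   :hex-multibyte (multibyte-string-p "\xe0")
   ;; A Unicode-style escape for a non-ASCII character yields a
   ;; multibyte string.
   :unicode-multibyte (multibyte-string-p "\u00e0")
   ;; Backslash-space terminates the hex escape and contributes no
   ;; character of its own.
   :terminated-length (length "\xe0\ ")
   ;; Without the terminator, "bc" would be read as further hex
   ;; digits of the same escape.
   :terminated-text (string= "\x41\ bc" "Abc")))
```

Evaluating `(my-escape-demo)` should report `nil` for `:hex-multibyte` and `t` for `:unicode-multibyte`, which is the distinction the commit message summarizes as "Hex escapes do not automatically make a string multibyte."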