| author | Chong Yidong | 2012-11-03 19:02:43 +0800 |
|---|---|---|
| committer | Chong Yidong | 2012-11-03 19:02:43 +0800 |
| commit | 2395ab64f6152af46b804cecc5743b8139031968 (patch) | |
| tree | 660e27f4dc4739c9ba4c5d5252892b38b5f4eede | |
| parent | 43bcfda6d863c6172eeba2d6aa22d22453849423 (diff) | |
Clarify documentation about escape sequences in strings.
* objects.texi (General Escape Syntax): Clarify the explanation of
escape sequences.
(Non-ASCII in Strings): Clarify when a string is unibyte vs
multibyte. Hex escapes do not automatically make a string multibyte.
| -rw-r--r-- | doc/lispref/ChangeLog | 8 |
| -rw-r--r-- | doc/lispref/objects.texi | 146 |
2 files changed, 86 insertions, 68 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index fa996191ac4..17bd43fc0d9 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
| @@ -1,3 +1,11 @@ | |||
| 1 | 2012-11-03 Chong Yidong <cyd@gnu.org> | ||
| 2 | |||
| 3 | * objects.texi (General Escape Syntax): Clarify the explanation of | ||
| 4 | escape sequences. | ||
| 5 | (Non-ASCII in Strings): Clarify when a string is unibyte vs | ||
| 6 | multibyte. Hex escapes do not automatically make a string | ||
| 7 | multibyte. | ||
| 8 | |||
| 1 | 2012-11-03 Martin Rudalics <rudalics@gmx.at> | 9 | 2012-11-03 Martin Rudalics <rudalics@gmx.at> |
| 2 | 10 | ||
| 3 | * windows.texi (Switching Buffers): Document option | 11 | * windows.texi (Switching Buffers): Document option |
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 7d40f0ff934..2507b0540eb 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
| @@ -351,51 +351,48 @@ following text.) | |||
| 351 | control characters, Emacs provides several types of escape syntax that | 351 | control characters, Emacs provides several types of escape syntax that |
| 352 | you can use to specify non-@acronym{ASCII} text characters. | 352 | you can use to specify non-@acronym{ASCII} text characters. |
| 353 | 353 | ||
| 354 | @cindex unicode character escape | ||
| 355 | You can specify characters by their Unicode values. | ||
| 356 | @code{?\u@var{nnnn}} represents a character that maps to the Unicode | ||
| 357 | code point @samp{U+@var{nnnn}} (by convention, Unicode code points are | ||
| 358 | given in hexadecimal). There is a slightly different syntax for | ||
| 359 | specifying characters with code points higher than | ||
| 360 | @code{U+@var{ffff}}: @code{\U00@var{nnnnnn}} represents the character | ||
| 361 | whose code point is @samp{U+@var{nnnnnn}}. The Unicode Standard only | ||
| 362 | defines code points up to @samp{U+@var{10ffff}}, so if you specify a | ||
| 363 | code point higher than that, Emacs signals an error. | ||
| 364 | |||
| 365 | This peculiar and inconvenient syntax was adopted for compatibility | ||
| 366 | with other programming languages. Unlike some other languages, Emacs | ||
| 367 | Lisp supports this syntax only in character literals and strings. | ||
| 368 | |||
| 369 | @cindex @samp{\} in character constant | 354 | @cindex @samp{\} in character constant |
| 370 | @cindex backslash in character constants | 355 | @cindex backslash in character constants |
| 371 | @cindex octal character code | 356 | @cindex unicode character escape |
| 372 | The most general read syntax for a character represents the | 357 | Firstly, you can specify characters by their Unicode values. |
| 373 | character code in either octal or hex. To use octal, write a question | 358 | @code{?\u@var{nnnn}} represents a character with Unicode code point |
| 374 | mark followed by a backslash and the octal character code (up to three | 359 | @samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal |
| 375 | octal digits); thus, @samp{?\101} for the character @kbd{A}, | 360 | number with exactly four digits. The backslash indicates that the |
| 376 | @samp{?\001} for the character @kbd{C-a}, and @code{?\002} for the | 361 | subsequent characters form an escape sequence, and the @samp{u} |
| 377 | character @kbd{C-b}. Although this syntax can represent any | 362 | specifies a Unicode escape sequence. |
| 378 | @acronym{ASCII} character, it is preferred only when the precise octal | 363 | |
| 379 | value is more important than the @acronym{ASCII} representation. | 364 | There is a slightly different syntax for specifying Unicode |
| 380 | 365 | characters with code points higher than @code{U+@var{ffff}}: | |
| 381 | @example | 366 | @code{?\U00@var{nnnnnn}} represents the character with code point |
| 382 | @group | 367 | @samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal |
| 383 | ?\012 @result{} 10 ?\n @result{} 10 ?\C-j @result{} 10 | 368 | number. The Unicode Standard only defines code points up to |
| 384 | ?\101 @result{} 65 ?A @result{} 65 | 369 | @samp{U+@var{10ffff}}, so if you specify a code point higher than |
| 385 | @end group | 370 | that, Emacs signals an error. |
| 386 | @end example | 371 | |
| 387 | 372 | Secondly, you can specify characters by their hexadecimal character | |
| 388 | To use hex, write a question mark followed by a backslash, @samp{x}, | 373 | codes. A hexadecimal escape sequence consists of a backslash, |
| 389 | and the hexadecimal character code. You can use any number of hex | 374 | @samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is |
| 390 | digits, so you can represent any character code in this way. | 375 | the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and |
| 391 | Thus, @samp{?\x41} for the character @kbd{A}, @samp{?\x1} for the | 376 | @code{?\xe0} is the character |
| 392 | character @kbd{C-a}, and @code{?\xe0} for the Latin-1 character | ||
| 393 | @iftex | 377 | @iftex |
| 394 | @samp{@`a}. | 378 | @samp{@`a}. |
| 395 | @end iftex | 379 | @end iftex |
| 396 | @ifnottex | 380 | @ifnottex |
| 397 | @samp{a} with grave accent. | 381 | @samp{a} with grave accent. |
| 398 | @end ifnottex | 382 | @end ifnottex |
| 383 | You can use any number of hex digits, so you can represent any | ||
| 384 | character code in this way. | ||
| 385 | |||
| 386 | @cindex octal character code | ||
| 387 | Thirdly, you can specify characters by their character code in | ||
| 388 | octal. An octal escape sequence consists of a backslash followed by | ||
| 389 | up to three octal digits; thus, @samp{?\101} for the character | ||
| 390 | @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002} | ||
| 391 | for the character @kbd{C-b}. Only characters up to octal code 777 can | ||
| 392 | be specified this way. | ||
| 393 | |||
| 394 | These escape sequences may also be used in strings. @xref{Non-ASCII | ||
| 395 | in Strings}. | ||
| 399 | 396 | ||
| 400 | @node Ctl-Char Syntax | 397 | @node Ctl-Char Syntax |
| 401 | @subsubsection Control-Character Syntax | 398 | @subsubsection Control-Character Syntax |
| @@ -1026,40 +1023,53 @@ but the newline is ignored if escaped." | |||
| 1026 | @node Non-ASCII in Strings | 1023 | @node Non-ASCII in Strings |
| 1027 | @subsubsection Non-@acronym{ASCII} Characters in Strings | 1024 | @subsubsection Non-@acronym{ASCII} Characters in Strings |
| 1028 | 1025 | ||
| 1029 | You can include a non-@acronym{ASCII} international character in a | 1026 | There are two text representations for non-@acronym{ASCII} |
| 1030 | string constant by writing it literally. There are two text | 1027 | characters in Emacs strings: multibyte and unibyte (@pxref{Text |
| 1031 | representations for non-@acronym{ASCII} characters in Emacs strings | 1028 | Representations}). Roughly speaking, unibyte strings store raw bytes, |
| 1032 | (and in buffers): unibyte and multibyte (@pxref{Text | 1029 | while multibyte strings store human-readable text. Each character in |
| 1033 | Representations}). If the string constant is read from a multibyte | 1030 | a unibyte string is a byte, i.e.@: its value is between 0 and 255. By |
| 1034 | source, such as a multibyte buffer or string, or a file that would be | 1031 | contrast, each character in a multibyte string may have a value |
| 1035 | visited as multibyte, then Emacs reads the non-@acronym{ASCII} | 1032 | between 0 to 4194303 (@pxref{Character Type}). In both cases, |
| 1036 | character as a multibyte character and automatically makes the string | 1033 | characters above 127 are non-@acronym{ASCII}. |
| 1037 | a multibyte string. If the string constant is read from a unibyte | 1034 | |
| 1038 | source, then Emacs reads the non-@acronym{ASCII} character as unibyte, | 1035 | You can include a non-@acronym{ASCII} character in a string constant |
| 1039 | and makes the string unibyte. | 1036 | by writing it literally. If the string constant is read from a |
| 1040 | 1037 | multibyte source, such as a multibyte buffer or string, or a file that | |
| 1041 | Instead of writing a non-@acronym{ASCII} character literally into a | 1038 | would be visited as multibyte, then Emacs reads each |
| 1042 | multibyte string, you can write it as its character code using a hex | 1039 | non-@acronym{ASCII} character as a multibyte character and |
| 1043 | escape, @samp{\x@var{nnnnnnn}}, with as many digits as necessary. | 1040 | automatically makes the string a multibyte string. If the string |
| 1044 | (Multibyte non-@acronym{ASCII} character codes are all greater than | 1041 | constant is read from a unibyte source, then Emacs reads the |
| 1045 | 256.) You can also specify a character in a multibyte string using | 1042 | non-@acronym{ASCII} character as unibyte, and makes the string |
| 1046 | the @samp{\u} or @samp{\U} Unicode escape syntax (@pxref{General | 1043 | unibyte. |
| 1047 | Escape Syntax}). In either case, any character which is not a valid | 1044 | |
| 1048 | hex digit terminates the construct. If the next character in the | 1045 | Instead of writing a character literally into a multibyte string, |
| 1049 | string could be interpreted as a hex digit, write @w{@samp{\ }} | 1046 | you can write it as its character code using an escape sequence. |
| 1050 | (backslash and space) to terminate the hex escape---for example, | 1047 | @xref{General Escape Syntax}, for details about escape sequences. |
| 1048 | |||
| 1049 | If you use any Unicode-style escape sequence @samp{\uNNNN} or | ||
| 1050 | @samp{\U00NNNNNN} in a string constant (even for an @acronym{ASCII} | ||
| 1051 | character), Emacs automatically assumes that it is multibyte. | ||
| 1052 | |||
| 1053 | You can also use hexadecimal escape sequences (@samp{\x@var{n}}) and | ||
| 1054 | octal escape sequences (@samp{\@var{n}}) in string constants. | ||
| 1055 | @strong{But beware:} If a string constant contains hexadecimal or | ||
| 1056 | octal escape sequences, and these escape sequences all specify unibyte | ||
| 1057 | characters (i.e.@: less than 256), and there are no other literal | ||
| 1058 | non-@acronym{ASCII} characters or Unicode-style escape sequences in | ||
| 1059 | the string, then Emacs automatically assumes that it is a unibyte | ||
| 1060 | string. That is to say, it assumes that all non-@acronym{ASCII} | ||
| 1061 | characters occurring in the string are 8-bit raw bytes. | ||
| 1062 | |||
| 1063 | In hexadecimal and octal escape sequences, the escaped character | ||
| 1064 | code may contain any number of digits, so the first subsequent | ||
| 1065 | character which is not a valid hexadecimal or octal digit terminates | ||
| 1066 | the escape sequence. If the next character in a string could be | ||
| 1067 | interpreted as a hexadecimal or octal digit, write @w{@samp{\ }} | ||
| 1068 | (backslash and space) to terminate the escape sequence. For example, | ||
| 1051 | @w{@samp{\xe0\ }} represents one character, @samp{a} with grave | 1069 | @w{@samp{\xe0\ }} represents one character, @samp{a} with grave |
| 1052 | accent. @w{@samp{\ }} in a string constant is just like | 1070 | accent. @w{@samp{\ }} in a string constant is just like |
| 1053 | backslash-newline; it does not contribute any character to the string, | 1071 | backslash-newline; it does not contribute any character to the string, |
| 1054 | but it does terminate the preceding hex escape. Using any hex escape | 1072 | but it does terminate any preceding hex escape. |
| 1055 | in a string (even for an @acronym{ASCII} character) automatically | ||
| 1056 | forces the string to be multibyte. | ||
| 1057 | |||
| 1058 | You can represent a unibyte non-@acronym{ASCII} character with its | ||
| 1059 | character code, which must be in the range from 128 (0200 octal) to | ||
| 1060 | 255 (0377 octal). If you write all such character codes in octal and | ||
| 1061 | the string contains no other characters forcing it to be multibyte, | ||
| 1062 | this produces a unibyte string. | ||
| 1063 | 1073 | ||
| 1064 | @node Nonprinting Characters | 1074 | @node Nonprinting Characters |
| 1065 | @subsubsection Nonprinting Characters in Strings | 1075 | @subsubsection Nonprinting Characters in Strings |