| author | Paul Eggert | 2016-04-21 19:26:34 -0700 |
|---|---|---|
| committer | Paul Eggert | 2016-04-21 19:29:41 -0700 |
| commit | bd1c7ca67e7429e07f78d4ff49163fd7a67a6765 (patch) | |
| tree | 941d5cf573be2a4588468b3a315c0c6cb47e2c97 /doc/lispref | |
| parent | e7cb38edc946ff60c1c878b30b068376d6ef56d2 (diff) | |
Improve character name escapes
* doc/lispref/nonascii.texi (Character Properties):
Avoid duplication of Unicode names. Reformat examples to fit in
narrow pages.
* doc/lispref/objects.texi (General Escape Syntax):
Simplify and better-organize explanation of \N{...} escapes.
* src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused.
(char_surrogate_p): New inline function.
* src/lread.c: Do not include string.h; no longer needed.
(invalid_character_name, check_scalar_value): Remove; the ideas
behind these functions are now bundled into character_name_to_code.
(character_name_to_code): Remove undocumented support for "CJK
IDEOGRAPH-XXXX" names, as "U+XXXX" suffices. Reject monstrosities
like "\N{U+-0}" and null bytes in \N escapes. Reject floating
point in \N escapes instead of returning garbage. Use
AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector.
* test/src/lread-tests.el (lread-char-number, lread-char-name)
(lread-string-char-number, lread-string-char-name):
Test runtime behavior, not compile-time, as the test framework
is not set up to test compile-time.
(lread-char-surrogate-1, lread-char-surrogate-2)
(lread-char-surrogate-3, lread-char-surrogate-4)
(lread-string-char-number-2, lread-string-char-number-3):
New tests.
(lread-string-char-number-1): Rename from lread-string-char-number.
Diffstat (limited to 'doc/lispref')
| -rw-r--r-- | doc/lispref/nonascii.texi | 15 |
| -rw-r--r-- | doc/lispref/objects.texi | 52 |
2 files changed, 35 insertions, 32 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 66ad9aca71e..0e4aa86e48b 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
| @@ -622,18 +622,21 @@ This function returns the value of @var{char}'s @var{propname} property. | |||
| 622 | @result{} Nd | 622 | @result{} Nd |
| 623 | @end group | 623 | @end group |
| 624 | @group | 624 | @group |
| 625 | ;; U+2084 SUBSCRIPT FOUR | 625 | ;; U+2084 |
| 626 | (get-char-code-property ?\u2084 'digit-value) | 626 | (get-char-code-property ?\N@{SUBSCRIPT FOUR@} |
| 627 | 'digit-value) | ||
| 627 | @result{} 4 | 628 | @result{} 4 |
| 628 | @end group | 629 | @end group |
| 629 | @group | 630 | @group |
| 630 | ;; U+2155 VULGAR FRACTION ONE FIFTH | 631 | ;; U+2155 |
| 631 | (get-char-code-property ?\u2155 'numeric-value) | 632 | (get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@} |
| 633 | 'numeric-value) | ||
| 632 | @result{} 0.2 | 634 | @result{} 0.2 |
| 633 | @end group | 635 | @end group |
| 634 | @group | 636 | @group |
| 635 | ;; U+2163 ROMAN NUMERAL FOUR | 637 | ;; U+2163 |
| 636 | (get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value) | 638 | (get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} |
| 639 | 'numeric-value) | ||
| 637 | @result{} 4 | 640 | @result{} 4 |
| 638 | @end group | 641 | @end group |
| 639 | @group | 642 | @group |
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 96b334d2b81..54894b8e24e 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
| @@ -353,25 +353,32 @@ following text.) | |||
| 353 | control characters, Emacs provides several types of escape syntax that | 353 | control characters, Emacs provides several types of escape syntax that |
| 354 | you can use to specify non-@acronym{ASCII} text characters. | 354 | you can use to specify non-@acronym{ASCII} text characters. |
| 355 | 355 | ||
| 356 | @enumerate | ||
| 357 | @item | ||
| 356 | @cindex @samp{\} in character constant | 358 | @cindex @samp{\} in character constant |
| 357 | @cindex backslash in character constants | 359 | @cindex backslash in character constants |
| 358 | @cindex unicode character escape | 360 | @cindex unicode character escape |
| 359 | Firstly, you can specify characters by their Unicode values. | 361 | You can specify characters by their Unicode names, if any. |
| 360 | @code{?\u@var{nnnn}} represents a character with Unicode code point | 362 | @code{?\N@{@var{NAME}@}} represents the Unicode character named |
| 361 | @samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal | 363 | @var{NAME}. Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is |
| 362 | number with exactly four digits. The backslash indicates that the | 364 | equivalent to @code{?à} and denotes the Unicode character U+00E0. To |
| 363 | subsequent characters form an escape sequence, and the @samp{u} | 365 | simplify entering multi-line strings, you can replace spaces in the |
| 364 | specifies a Unicode escape sequence. | 366 | names by non-empty sequences of whitespace (e.g., newlines). |
| 365 | 367 | ||
| 366 | There is a slightly different syntax for specifying Unicode | 368 | @item |
| 367 | characters with code points higher than @code{U+@var{ffff}}: | 369 | You can specify characters by their Unicode values. |
| 368 | @code{?\U00@var{nnnnnn}} represents the character with code point | 370 | @code{?\N@{U+@var{X}@}} represents a character with Unicode code point |
| 369 | @samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal | 371 | @var{X}, where @var{X} is a hexadecimal number. Also, |
| 370 | number. The Unicode Standard only defines code points up to | 372 | @code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code |
| 371 | @samp{U+@var{10ffff}}, so if you specify a code point higher than | 373 | points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x} |
| 372 | that, Emacs signals an error. | 374 | is a single hexadecimal digit. For example, @code{?\N@{U+E0@}}, |
| 373 | 375 | @code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à} | |
| 374 | Secondly, you can specify characters by their hexadecimal character | 376 | and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}. The Unicode |
| 377 | Standard defines code points only up to @samp{U+@var{10ffff}}, so if | ||
| 378 | you specify a code point higher than that, Emacs signals an error. | ||
| 379 | |||
| 380 | @item | ||
| 381 | You can specify characters by their hexadecimal character | ||
| 375 | codes. A hexadecimal escape sequence consists of a backslash, | 382 | codes. A hexadecimal escape sequence consists of a backslash, |
| 376 | @samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is | 383 | @samp{x}, and the hexadecimal character code. Thus, @samp{?\x41} is |
| 377 | the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and | 384 | the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and |
| @@ -379,23 +386,16 @@ the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and | |||
| 379 | You can use any number of hex digits, so you can represent any | 386 | You can use any number of hex digits, so you can represent any |
| 380 | character code in this way. | 387 | character code in this way. |
| 381 | 388 | ||
| 389 | @item | ||
| 382 | @cindex octal character code | 390 | @cindex octal character code |
| 383 | Thirdly, you can specify characters by their character code in | 391 | You can specify characters by their character code in |
| 384 | octal. An octal escape sequence consists of a backslash followed by | 392 | octal. An octal escape sequence consists of a backslash followed by |
| 385 | up to three octal digits; thus, @samp{?\101} for the character | 393 | up to three octal digits; thus, @samp{?\101} for the character |
| 386 | @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002} | 394 | @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002} |
| 387 | for the character @kbd{C-b}. Only characters up to octal code 777 can | 395 | for the character @kbd{C-b}. Only characters up to octal code 777 can |
| 388 | be specified this way. | 396 | be specified this way. |
| 389 | 397 | ||
| 390 | Fourthly, you can specify characters by their name. A character | 398 | @end enumerate |
| 391 | name escape sequence consists of a backslash, @samp{N@{}, the Unicode | ||
| 392 | character name, and @samp{@}}. Alternatively, you can also put the | ||
| 393 | numeric code point value between the braces, using the syntax | ||
| 394 | @samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight | ||
| 395 | hexadecimal digits. Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and | ||
| 396 | @samp{?\N@{U+41@}} both denote the character @kbd{A}. To simplify | ||
| 397 | entering multi-line strings, you can replace spaces in the character | ||
| 398 | names by arbitrary non-empty sequence of whitespace (e.g., newlines). | ||
| 399 | 399 | ||
| 400 | These escape sequences may also be used in strings. @xref{Non-ASCII | 400 | These escape sequences may also be used in strings. @xref{Non-ASCII |
| 401 | in Strings}. | 401 | in Strings}. |
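
The commit message says that `character_name_to_code` now rejects inputs such as `\N{U+-0}`, null bytes, and floating point in `\N` escapes, and that `char_surrogate_p` is a new inline function. The src/ changes are not part of this doc-only diff, so the following is only a minimal sketch of how such a strict `U+X` check could look. `parse_u_plus_escape` and `MAX_UNICODE_CHAR` are hypothetical names chosen for illustration, the body of `char_surrogate_p` is an assumption based on the Unicode surrogate range, and rejecting surrogates here is likewise an assumption suggested by the new `lread-char-surrogate-*` tests. None of this is the actual code in src/lread.c or src/character.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Largest code point defined by the Unicode Standard.
   (Hypothetical macro name, used only in this sketch.)  */
#define MAX_UNICODE_CHAR 0x10FFFF

/* Return true if C lies in the surrogate range U+D800..U+DFFF.
   Assumed body; the real definition is not shown in this diff.  */
static bool
char_surrogate_p (int c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Hypothetical helper: interpret the LEN bytes at NAME as "U+X",
   where X is one or more hexadecimal digits.  Return the code point,
   or -1 if the text is malformed, out of range, or a surrogate.  */
static int
parse_u_plus_escape (const char *name, size_t len)
{
  if (len < 3 || name[0] != 'U' || name[1] != '+')
    return -1;

  long code = 0;
  for (size_t i = 2; i < len; i++)
    {
      char ch = name[i];
      int digit;
      if ('0' <= ch && ch <= '9')
        digit = ch - '0';
      else if ('a' <= ch && ch <= 'f')
        digit = ch - 'a' + 10;
      else if ('A' <= ch && ch <= 'F')
        digit = ch - 'A' + 10;
      else
        return -1;  /* Signs, '.', null bytes, and so on.  */

      code = code * 16 + digit;
      if (code > MAX_UNICODE_CHAR)
        return -1;  /* Beyond U+10FFFF; also prevents overflow.  */
    }

  if (char_surrogate_p ((int) code))
    return -1;
  return (int) code;
}
```

Under these assumptions, `parse_u_plus_escape ("U+E0", 4)` yields the code point of `?à`, while `"U+-0"` and `"U+1.5"` are rejected because every byte after `U+` must be a hexadecimal digit. Scanning digit by digit, rather than calling a general number parser, is what keeps signs and fractional parts from slipping through.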