Improve character name escapes

* doc/lispref/nonascii.texi (Character Properties):
Avoid duplication of Unicode names.  Reformat examples to fit in
narrow pages.
* doc/lispref/objects.texi (General Escape Syntax):
Simplify and better-organize explanation of \N{...} escapes.
* src/character.h (CHAR_SURROGATE_PAIR_P): Remove; unused.
(char_surrogate_p): New inline function.
* src/lread.c: Do not include string.h; no longer needed.
(invalid_character_name, check_scalar_value): Remove; the ideas
behind these functions are now bundled into character_name_to_code.
(character_name_to_code): Remove undocumented support for
"CJK IDEOGRAPH-XXXX" names, as "U+XXXX" suffices.
Reject monstrosities like "\N{U+-0}" and null bytes in \N escapes.
Reject floating point in \N escapes instead of returning garbage.
Use AUTO_STRING_WITH_LEN to lessen pressure on the garbage collector.
* test/src/lread-tests.el (lread-char-number, lread-char-name)
(lread-string-char-number, lread-string-char-name):
Test runtime behavior, not compile-time, as the test framework
is not set up to test compile-time.
(lread-char-surrogate-1, lread-char-surrogate-2)
(lread-char-surrogate-3, lread-char-surrogate-4)
(lread-string-char-number-2, lread-string-char-number-3):
New tests.
(lread-string-char-number-1): Rename from lread-string-char-number.
author: Paul Eggert 2016-04-21 19:26:34 -0700
committer: Paul Eggert 2016-04-21 19:29:41 -0700
commit: bd1c7ca67e7429e07f78d4ff49163fd7a67a6765 (patch)
tree: 941d5cf573be2a4588468b3a315c0c6cb47e2c97 /doc
parent: e7cb38edc946ff60c1c878b30b068376d6ef56d2 (diff)
2 files changed, 35 insertions, 32 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 66ad9aca71e..0e4aa86e48b 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -622,18 +622,21 @@ This function returns the value of @var{char}'s @var{propname} property.
     @result{} Nd
 @end group
 @group
-;; U+2084 SUBSCRIPT FOUR
+;; U+2084
-(get-char-code-property ?\u2084 'digit-value)
+(get-char-code-property ?\N@{SUBSCRIPT FOUR@}
+                        'digit-value)
     @result{} 4
 @end group
 @group
-;; U+2155 VULGAR FRACTION ONE FIFTH
+;; U+2155
-(get-char-code-property ?\u2155 'numeric-value)
+(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@}
+                        'numeric-value)
     @result{} 0.2
 @end group
 @group
-;; U+2163 ROMAN NUMERAL FOUR
+;; U+2163
-(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@}
+                        'numeric-value)
     @result{} 4
 @end group
 @group
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi
index 96b334d2b81..54894b8e24e 100644
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -353,25 +353,32 @@ following text.)
 control characters, Emacs provides several types of escape syntax that
 you can use to specify non-@acronym{ASCII} text characters.
+@enumerate
+@item
 @cindex @samp{\} in character constant
 @cindex backslash in character constants
 @cindex unicode character escape
-  Firstly, you can specify characters by their Unicode values.
+You can specify characters by their Unicode names, if any.
-@code{?\u@var{nnnn}} represents a character with Unicode code point
+@code{?\N@{@var{NAME}@}} represents the Unicode character named
-@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
+@var{NAME}.  Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is
-number with exactly four digits.  The backslash indicates that the
+equivalent to @code{?à} and denotes the Unicode character U+00E0.  To
-subsequent characters form an escape sequence, and the @samp{u}
+simplify entering multi-line strings, you can replace spaces in the
-specifies a Unicode escape sequence.
+names by non-empty sequences of whitespace (e.g., newlines).
-  There is a slightly different syntax for specifying Unicode
+@item
-characters with code points higher than @code{U+@var{ffff}}:
+You can specify characters by their Unicode values.
-@code{?\U00@var{nnnnnn}} represents the character with code point
+@code{?\N@{U+@var{X}@}} represents a character with Unicode code point
-@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
+@var{X}, where @var{X} is a hexadecimal number.  Also,
-number.  The Unicode Standard only defines code points up to
+@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code
-@samp{U+@var{10ffff}}, so if you specify a code point higher than
+points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x}
-that, Emacs signals an error.
+is a single hexadecimal digit.  For example, @code{?\N@{U+E0@}},
+@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à}
-  Secondly, you can specify characters by their hexadecimal character
+and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}.  The Unicode
+Standard defines code points only up to @samp{U+@var{10ffff}}, so if
+you specify a code point higher than that, Emacs signals an error.
+@item
+You can specify characters by their hexadecimal character
 codes.  A hexadecimal escape sequence consists of a backslash,
 @samp{x}, and the hexadecimal character code.  Thus, @samp{?\x41} is
 the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
@@ -379,23 +386,16 @@ the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
 You can use any number of hex digits, so you can represent any
 character code in this way.
+@item
 @cindex octal character code
-  Thirdly, you can specify characters by their character code in
+You can specify characters by their character code in
 octal.  An octal escape sequence consists of a backslash followed by
 up to three octal digits; thus, @samp{?\101} for the character
 @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
 for the character @kbd{C-b}.  Only characters up to octal code 777 can
 be specified this way.
-  Fourthly, you can specify characters by their name.  A character
+@end enumerate
-name escape sequence consists of a backslash, @samp{N@{}, the Unicode
-character name, and @samp{@}}.  Alternatively, you can also put the
-numeric code point value between the braces, using the syntax
-@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
-hexadecimal digits.  Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
-@samp{?\N@{U+41@}} both denote the character @kbd{A}.  To simplify
-entering multi-line strings, you can replace spaces in the character
-names by arbitrary non-empty sequence of whitespace (e.g., newlines).
  These escape sequences may also be used in strings.  @xref{Non-ASCII
 in Strings}.
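
The stricter handling of `\N{U+X}` described in the commit message can be illustrated with a small C sketch. This is not the code from src/lread.c: `parse_u_plus` and `is_surrogate` are hypothetical names chosen for illustration, and the rejection of surrogate code points is an assumption inferred from the new lread-char-surrogate tests. The sketch accepts only "U+" followed by hexadecimal digits, so a sign, a decimal point, or an embedded null byte is rejected, as is any value above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Largest code point defined by the Unicode Standard.  */
#define MAX_UNICODE_CODE_POINT 0x10FFFF

/* Return true if C lies in the UTF-16 surrogate range U+D800..U+DFFF.
   Hypothetical stand-in for the inline function the commit adds.  */
static bool
is_surrogate (long c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Hypothetical helper: parse the LEN bytes at S, taken from between
   the braces of a \N{...} escape.  Return the code point if the text
   is "U+" followed by one or more hex digits that name a
   non-surrogate code point no greater than U+10FFFF; otherwise
   return -1.  The length is explicit so that an embedded null byte
   is seen as an invalid digit instead of ending the string early.  */
static long
parse_u_plus (const char *s, size_t len)
{
  if (len < 3 || s[0] != 'U' || s[1] != '+')
    return -1;

  long code = 0;
  for (size_t i = 2; i < len; i++)
    {
      char ch = s[i];
      int digit;

      if ('0' <= ch && ch <= '9')
        digit = ch - '0';
      else if ('a' <= ch && ch <= 'f')
        digit = ch - 'a' + 10;
      else if ('A' <= ch && ch <= 'F')
        digit = ch - 'A' + 10;
      else
        return -1;              /* sign, '.', null byte, space, ...  */

      code = code * 16 + digit;
      if (code > MAX_UNICODE_CODE_POINT)
        return -1;              /* out of range; also avoids overflow  */
    }

  /* Assumption: surrogates are not valid scalar values here.  */
  return is_surrogate (code) ? -1 : code;
}
```

Under these assumptions, `parse_u_plus ("U+E0", 4)` yields 0xE0, matching the `?\N{U+E0}` example in the documentation, while `parse_u_plus ("U+-0", 4)` and `parse_u_plus ("U+110000", 8)` both yield -1.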