Merge from origin/emacs-26

0924b27bca Say which regexp ranges should be avoided

# Conflicts:
#	doc/lispref/searching.texi
author: Paul Eggert 2019-04-01 23:43:57 -0700
committer: Paul Eggert 2019-04-01 23:43:57 -0700
commit: f81ec28f4fc122658e59c0ec99ca4d92a1fe439f
tree: d5864857bbb8dcf48481673a74dc50c8ebddfa94
parent: f5d34496123ce6df53d50082159280da54f052c4
parent: 0924b27bca40d219e34529144ea04a581428f1f7
1 files changed, 34 insertions, 18 deletions
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index e3f31fdf836..748ab586af9 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -391,18 +391,11 @@ writing the starting and ending characters with a @samp{-} between them.
 Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter.
 Ranges may be intermixed freely with individual characters, as in
 @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter
-or @samp{$}, @samp{%} or period.
+or @samp{$}, @samp{%} or period.  However, the ending character of one
+range should not be the starting point of another one; for example,
+@samp{[a-m-z]} should be avoided.
 
-If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
-matches upper-case letters.  Note that a range like @samp{[a-z]} is
-not affected by the locale's collation sequence, it always represents
-a sequence in @acronym{ASCII} order.
-@c This wasn't obvious to me, since, e.g., the grep manual "Character
-@c Classes and Bracket Expressions" specifically notes the opposite
-@c behavior.  But by experiment Emacs seems unaffected by LC_COLLATE
-@c in this regard.
-
-Note also that the usual regexp special characters are not special inside a
+The usual regexp special characters are not special inside a
 character alternative.  A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
 
@@ -417,13 +410,34 @@ special there.)
 To include @samp{^} in a character alternative, put it anywhere but at
 the beginning.
 
-@c What if it starts with a multibyte and ends with a unibyte?
-@c That doesn't seem to match anything...?
-If a range starts with a unibyte character @var{c} and ends with a
-multibyte character @var{c2}, the range is divided into two parts: one
-spans the unibyte characters @samp{@var{c}..?\377}, the other the
-multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the
-first character of the charset to which @var{c2} belongs.
+The following aspects of ranges are specific to Emacs, in that POSIX
+allows but does not require this behavior and programs other than
+Emacs may behave differently:
+
+@enumerate
+@item
+If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
+matches upper-case letters.
+
+@item
+A range is not affected by the locale's collation sequence: it always
+represents the set of characters with codepoints ranging between those
+of its bounds, so that @samp{[a-z]} matches only ASCII letters, even
+outside the C or POSIX locale.
+
+@item
+As a special case, if either bound of a range is a raw 8-bit byte, the
+other bound should be a unibyte character, and the range matches only
+unibyte characters.
+
+@item
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters.  Thus, @samp{[b-a]}
+always fails to match, and @samp{[^b-a]} matches any character,
+including newline.  However, the lower bound should be at most one
+greater than the upper bound; for example, @samp{[c-a]} should be
+avoided.
+@end enumerate
 
 A character alternative can also specify named character classes
 (@pxref{Char Classes}).  This is a POSIX feature.  For example,
@@ -431,6 +445,8 @@ A character alternative can also specify named character classes
 Using a character class is equivalent to mentioning each of the
 characters in that class; but the latter is not feasible in practice,
 since some classes include thousands of different characters.
+A character class should not appear as the lower or upper bound
+of a range.
 
 @item @samp{[^ @dots{} ]}
 @cindex @samp{^} in regexp
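
Note: the range behavior documented by this patch can be spot-checked with a few Emacs Lisp assertions. This is a minimal sketch and not part of the commit; it assumes an Emacs whose regexp engine behaves as the new text describes, and it uses only the standard `string-match-p` and `cl-assert`. It deliberately leaves out the patterns the patch says to avoid (`[a-m-z]`, `[c-a]`, raw-byte bounds), since their behavior is not specified here.

```elisp
;; Illustrative checks only; evaluate in *scratch* or with `emacs --batch -l'.
(require 'cl-lib)

;; Case folding: with `case-fold-search' non-nil, [a-z] also matches "Q".
(let ((case-fold-search t))
  (cl-assert (string-match-p "[a-z]" "Q")))
(let ((case-fold-search nil))
  (cl-assert (not (string-match-p "[a-z]" "Q"))))

;; Codepoint order, not locale collation: [a-z] covers only ASCII letters,
;; so a non-ASCII letter such as U+00E9 is outside the range.
(let ((case-fold-search nil))
  (cl-assert (not (string-match-p "[a-z]" "é"))))

;; Empty range: the lower bound is one greater than the upper bound,
;; so [b-a] can never match.
(cl-assert (not (string-match-p "[b-a]" "ab")))

;; The complement of an empty range matches any character, newline included.
(cl-assert (eql 0 (string-match-p "[^b-a]" "\n")))
```

If every form evaluates without signaling an error, the running Emacs agrees with the documented behavior for these cases.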