author Paul Eggert 2019-03-20 14:43:30 -0700
committer Paul Eggert 2019-03-20 14:44:14 -0700
commit 0924b27bca40d219e34529144ea04a581428f1f7
tree 8cacf247cd9a1e3fa6f4f6d98ea2f1b9270b85e2 /doc
parent 297a141ca33f7fb25c17ba0b6ed7834dfe111c48
Say which regexp ranges should be avoided
* doc/lispref/searching.texi (Regexp Special): Say that regular expressions like "[a-m-z]" and "[[:alpha:]-~]" should be avoided, for the same reason that regular expressions like "+" and "*" should be avoided: POSIX says their behavior is undefined, and they are confusing anyway. Also, explain better what happens when the bound of a range is a raw 8-bit byte; the old explanation appears to have been obsolete anyway. Finally, say that ranges like "[\u00FF-\xFF]" that mix non-ASCII characters and raw 8-bit bytes should be avoided, since it’s not clear what they should mean.
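The forms below are not part of the commit; they are a minimal Emacs Lisp sketch of the behavior the patched manual text describes, intended for evaluation in a scratch buffer. The function name `my-regexp-range-demo` is hypothetical, and the results for the reversed range `[b-a]` are taken from the manual text in this patch rather than checked against any particular Emacs version.

```elisp
;; Illustrative sketch only; `my-regexp-range-demo' is a hypothetical name.
;; `string-match-p' returns the index of the first match, or nil.

(defun my-regexp-range-demo ()
  "Return a list of results illustrating character-alternative ranges."
  (list
   ;; `]' is literal when it is the first character of the alternative.
   (string-match-p "[]a]" "]")
   ;; `-' is literal when it is the first or last character.
   (string-match-p "[]-]" "-")
   ;; With case folding off, [a-z] matches only lower-case ASCII letters.
   (let ((case-fold-search nil))
     (string-match-p "[a-z]" "Q"))
   ;; With case folding on, [a-z] also matches upper-case letters.
   (let ((case-fold-search t))
     (string-match-p "[a-z]" "Q"))
   ;; Ranges go by codepoint, not locale collation: U+00E9 is outside a-z.
   (let ((case-fold-search nil))
     (string-match-p "[a-z]" "\u00e9"))
   ;; Per the manual text, a reversed range such as [b-a] is empty ...
   (string-match-p "[b-a]" "a")
   ;; ... so its complement matches any character, including newline.
   (string-match-p "[^b-a]" "\n")))

;; Forms the manual says to avoid, listed here only as data, never compiled:
;;   "[a-m-z]"  "[[:alpha:]-~]"  "[c-a]"  "[\u00FF-\xFF]"
```

Calling `(my-regexp-range-demo)` returns one entry per form, in order, so each documented rule can be checked on its own. The discouraged patterns are left in a comment on purpose, since the commit's point is that their meaning is unspecified.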
Diffstat (limited to 'doc')
-rw-r--r-- doc/lispref/searching.texi 54
1 files changed, 35 insertions, 19 deletions
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 7546863dde2..0cf527b6ac7 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -391,25 +391,18 @@ writing the starting and ending characters with a @samp{-} between them.
 Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter.
 Ranges may be intermixed freely with individual characters, as in
 @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter
-or @samp{$}, @samp{%} or period.
+or @samp{$}, @samp{%} or period. However, the ending character of one
+range should not be the starting point of another one; for example,
+@samp{[a-m-z]} should be avoided.
 
-If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
-matches upper-case letters. Note that a range like @samp{[a-z]} is
-not affected by the locale's collation sequence, it always represents
-a sequence in @acronym{ASCII} order.
-@c This wasn't obvious to me, since, e.g., the grep manual "Character
-@c Classes and Bracket Expressions" specifically notes the opposite
-@c behavior. But by experiment Emacs seems unaffected by LC_COLLATE
-@c in this regard.
-
-Note also that the usual regexp special characters are not special inside a
+The usual regexp special characters are not special inside a
 character alternative. A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
 
 To include a @samp{]} in a character alternative, you must make it the
 first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}.
 To include a @samp{-}, write @samp{-} as the first or last character of
-the character alternative, or put it after a range. Thus, @samp{[]-]}
+the character alternative, or as the upper bound of a range. Thus, @samp{[]-]}
 matches both @samp{]} and @samp{-}. (As explained below, you cannot
 use @samp{\]} to include a @samp{]} inside a character alternative,
 since @samp{\} is not special there.)
@@ -417,13 +410,34 @@ since @samp{\} is not special there.)
 To include @samp{^} in a character alternative, put it anywhere but at
 the beginning.
 
-@c What if it starts with a multibyte and ends with a unibyte?
-@c That doesn't seem to match anything...?
-If a range starts with a unibyte character @var{c} and ends with a
-multibyte character @var{c2}, the range is divided into two parts: one
-spans the unibyte characters @samp{@var{c}..?\377}, the other the
-multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the
-first character of the charset to which @var{c2} belongs.
+The following aspects of ranges are specific to Emacs, in that POSIX
+allows but does not require this behavior and programs other than
+Emacs may behave differently:
+
+@enumerate
+@item
+If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also
+matches upper-case letters.
+
+@item
+A range is not affected by the locale's collation sequence: it always
+represents the set of characters with codepoints ranging between those
+of its bounds, so that @samp{[a-z]} matches only ASCII letters, even
+outside the C or POSIX locale.
+
+@item
+As a special case, if either bound of a range is a raw 8-bit byte, the
+other bound should be a unibyte character, and the range matches only
+unibyte characters.
+
+@item
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters. Thus, @samp{[b-a]}
+always fails to match, and @samp{[^b-a]} matches any character,
+including newline. However, the lower bound should be at most one
+greater than the upper bound; for example, @samp{[c-a]} should be
+avoided.
+@end enumerate
 
 A character alternative can also specify named character classes
 (@pxref{Char Classes}). This is a POSIX feature. For example,
@@ -431,6 +445,8 @@ A character alternative can also specify named character classes
 Using a character class is equivalent to mentioning each of the
 characters in that class; but the latter is not feasible in practice,
 since some classes include thousands of different characters.
+A character class should not appear as the lower or upper bound
+of a range.
 
 @item @samp{[^ @dots{} ]}
 @cindex @samp{^} in regexp