| field | value | date |
|---|---|---|
| author | Paul Eggert | 2019-03-20 14:43:30 -0700 |
| committer | Paul Eggert | 2019-03-20 14:44:14 -0700 |
| commit | 0924b27bca40d219e34529144ea04a581428f1f7 | |
| tree | 8cacf247cd9a1e3fa6f4f6d98ea2f1b9270b85e2 /doc | |
| parent | 297a141ca33f7fb25c17ba0b6ed7834dfe111c48 | |
Say which regexp ranges should be avoided
* doc/lispref/searching.texi (Regexp Special): Say that
regular expressions like "[a-m-z]" and "[[:alpha:]-~]" should
be avoided, for the same reason that regular expressions like
"+" and "*" should be avoided: POSIX says their behavior is
undefined, and they are confusing anyway. Also, explain
better what happens when the bound of a range is a raw 8-bit
byte; the old explanation appears to have been obsolete
anyway. Finally, say that ranges like "[\u00FF-\xFF]" that
mix non-ASCII characters and raw 8-bit bytes should be
avoided, since it’s not clear what they should mean.
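
As an illustration of the well-defined forms this change documents, here is a minimal Emacs Lisp sketch using `string-match-p`. It is not part of the commit: the sample strings are made up for the example, and the values in the comments follow the behavior described in the revised manual text.

```elisp
;; Illustrative only: not part of the commit.
(let ((case-fold-search nil))
  (list
   ;; A range mixed with individual characters.
   (string-match-p "[a-z$%.]" "%")      ; => 0
   ;; `]' must come first; `-' may come last.
   (string-match-p "[]a]" "]")          ; => 0
   (string-match-p "[]-]" "-")          ; => 0
   ;; A reversed range is empty, so it never matches ...
   (string-match-p "[b-a]" "ab")        ; => nil
   ;; ... and its complement matches anything, newline included.
   (string-match-p "[^b-a]" "\n")))     ; => 0

;; With case folding enabled, `[a-z]' also matches upper-case letters.
(let ((case-fold-search t))
  (string-match-p "[a-z]" "Q"))         ; => 0
```

The forms the commit says to avoid, such as "[a-m-z]", "[[:alpha:]-~]" and "[c-a]", are deliberately left out, since their behavior is not something to rely on.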
Diffstat (limited to 'doc')
| -rw-r--r-- | doc/lispref/searching.texi | 54 |
1 files changed, 35 insertions, 19 deletions
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 7546863dde2..0cf527b6ac7 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
| @@ -391,25 +391,18 @@ writing the starting and ending characters with a @samp{-} between them. | |||
| 391 | Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter. | 391 | Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter. |
| 392 | Ranges may be intermixed freely with individual characters, as in | 392 | Ranges may be intermixed freely with individual characters, as in |
| 393 | @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter | 393 | @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter |
| 394 | or @samp{$}, @samp{%} or period. | 394 | or @samp{$}, @samp{%} or period. However, the ending character of one |
| 395 | range should not be the starting point of another one; for example, | ||
| 396 | @samp{[a-m-z]} should be avoided. | ||
| 395 | 397 | ||
| 396 | If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also | 398 | The usual regexp special characters are not special inside a |
| 397 | matches upper-case letters. Note that a range like @samp{[a-z]} is | ||
| 398 | not affected by the locale's collation sequence, it always represents | ||
| 399 | a sequence in @acronym{ASCII} order. | ||
| 400 | @c This wasn't obvious to me, since, e.g., the grep manual "Character | ||
| 401 | @c Classes and Bracket Expressions" specifically notes the opposite | ||
| 402 | @c behavior. But by experiment Emacs seems unaffected by LC_COLLATE | ||
| 403 | @c in this regard. | ||
| 404 | |||
| 405 | Note also that the usual regexp special characters are not special inside a | ||
| 406 | character alternative. A completely different set of characters is | 399 | character alternative. A completely different set of characters is |
| 407 | special inside character alternatives: @samp{]}, @samp{-} and @samp{^}. | 400 | special inside character alternatives: @samp{]}, @samp{-} and @samp{^}. |
| 408 | 401 | ||
| 409 | To include a @samp{]} in a character alternative, you must make it the | 402 | To include a @samp{]} in a character alternative, you must make it the |
| 410 | first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. | 403 | first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. |
| 411 | To include a @samp{-}, write @samp{-} as the first or last character of | 404 | To include a @samp{-}, write @samp{-} as the first or last character of |
| 412 | the character alternative, or put it after a range. Thus, @samp{[]-]} | 405 | the character alternative, or as the upper bound of a range. Thus, @samp{[]-]} |
| 413 | matches both @samp{]} and @samp{-}. (As explained below, you cannot | 406 | matches both @samp{]} and @samp{-}. (As explained below, you cannot |
| 414 | use @samp{\]} to include a @samp{]} inside a character alternative, | 407 | use @samp{\]} to include a @samp{]} inside a character alternative, |
| 415 | since @samp{\} is not special there.) | 408 | since @samp{\} is not special there.) |
| @@ -417,13 +410,34 @@ since @samp{\} is not special there.) | |||
| 417 | To include @samp{^} in a character alternative, put it anywhere but at | 410 | To include @samp{^} in a character alternative, put it anywhere but at |
| 418 | the beginning. | 411 | the beginning. |
| 419 | 412 | ||
| 420 | @c What if it starts with a multibyte and ends with a unibyte? | 413 | The following aspects of ranges are specific to Emacs, in that POSIX |
| 421 | @c That doesn't seem to match anything...? | 414 | allows but does not require this behavior and programs other than |
| 422 | If a range starts with a unibyte character @var{c} and ends with a | 415 | Emacs may behave differently: |
| 423 | multibyte character @var{c2}, the range is divided into two parts: one | 416 | |
| 424 | spans the unibyte characters @samp{@var{c}..?\377}, the other the | 417 | @enumerate |
| 425 | multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the | 418 | @item |
| 426 | first character of the charset to which @var{c2} belongs. | 419 | If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also |
| 420 | matches upper-case letters. | ||
| 421 | |||
| 422 | @item | ||
| 423 | A range is not affected by the locale's collation sequence: it always | ||
| 424 | represents the set of characters with codepoints ranging between those | ||
| 425 | of its bounds, so that @samp{[a-z]} matches only ASCII letters, even | ||
| 426 | outside the C or POSIX locale. | ||
| 427 | |||
| 428 | @item | ||
| 429 | As a special case, if either bound of a range is a raw 8-bit byte, the | ||
| 430 | other bound should be a unibyte character, and the range matches only | ||
| 431 | unibyte characters. | ||
| 432 | |||
| 433 | @item | ||
| 434 | If the lower bound of a range is greater than its upper bound, the | ||
| 435 | range is empty and represents no characters. Thus, @samp{[b-a]} | ||
| 436 | always fails to match, and @samp{[^b-a]} matches any character, | ||
| 437 | including newline. However, the lower bound should be at most one | ||
| 438 | greater than the upper bound; for example, @samp{[c-a]} should be | ||
| 439 | avoided. | ||
| 440 | @end enumerate | ||
| 427 | 441 | ||
| 428 | A character alternative can also specify named character classes | 442 | A character alternative can also specify named character classes |
| 429 | (@pxref{Char Classes}). This is a POSIX feature. For example, | 443 | (@pxref{Char Classes}). This is a POSIX feature. For example, |
| @@ -431,6 +445,8 @@ A character alternative can also specify named character classes | |||
| 431 | Using a character class is equivalent to mentioning each of the | 445 | Using a character class is equivalent to mentioning each of the |
| 432 | characters in that class; but the latter is not feasible in practice, | 446 | characters in that class; but the latter is not feasible in practice, |
| 433 | since some classes include thousands of different characters. | 447 | since some classes include thousands of different characters. |
| 448 | A character class should not appear as the lower or upper bound | ||
| 449 | of a range. | ||
| 434 | 450 | ||
| 435 | @item @samp{[^ @dots{} ]} | 451 | @item @samp{[^ @dots{} ]} |
| 436 | @cindex @samp{^} in regexp | 452 | @cindex @samp{^} in regexp |