| author | Paul Eggert | 2019-04-01 23:43:57 -0700 |
|---|---|---|
| committer | Paul Eggert | 2019-04-01 23:43:57 -0700 |
| commit | f81ec28f4fc122658e59c0ec99ca4d92a1fe439f | |
| tree | d5864857bbb8dcf48481673a74dc50c8ebddfa94 | |
| parent | f5d34496123ce6df53d50082159280da54f052c4 | |
| parent | 0924b27bca40d219e34529144ea04a581428f1f7 | |
Merge from origin/emacs-26
0924b27bca Say which regexp ranges should be avoided
# Conflicts:
# doc/lispref/searching.texi
| -rw-r--r-- | doc/lispref/searching.texi | 52 |
1 files changed, 34 insertions, 18 deletions
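
For readers who want to try the range behavior this change documents, the following Emacs Lisp forms are a minimal sketch, not part of the commit. They assume a reasonably recent Emacs and use only `string-match-p` and `case-fold-search`; the sample strings are chosen here for illustration. The ranges the text says to avoid, such as `[a-m-z]` and `[c-a]`, are deliberately left out.

```elisp
;; Illustrative sketch only; not part of the commit.
;; Each form can be evaluated with M-: or in the *scratch* buffer.

;; Case folding: with `case-fold-search' non-nil, [a-z] also matches
;; upper-case letters.
(let ((case-fold-search t))
  (string-match-p "[a-z]" "Q"))      ; => 0 (match at index 0)
(let ((case-fold-search nil))
  (string-match-p "[a-z]" "Q"))      ; => nil

;; Ranges follow codepoint order, not the locale's collation sequence,
;; so a non-ASCII letter such as "é" lies outside a..z.
(let ((case-fold-search nil))
  (string-match-p "[a-z]" "é"))      ; => nil

;; A range whose lower bound exceeds its upper bound is empty.
(string-match-p "[b-a]" "a")         ; => nil, never matches
(string-match-p "[^b-a]" "\n")       ; => 0, matches any char, even newline
```

`string-match-p` is used instead of `string-match` so that evaluating the forms does not disturb the match data.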
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi index e3f31fdf836..748ab586af9 100644 --- a/doc/lispref/searching.texi +++ b/doc/lispref/searching.texi | |||
| @@ -391,18 +391,11 @@ writing the starting and ending characters with a @samp{-} between them. | |||
| 391 | Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter. | 391 | Thus, @samp{[a-z]} matches any lower-case @acronym{ASCII} letter. |
| 392 | Ranges may be intermixed freely with individual characters, as in | 392 | Ranges may be intermixed freely with individual characters, as in |
| 393 | @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter | 393 | @samp{[a-z$%.]}, which matches any lower case @acronym{ASCII} letter |
| 394 | or @samp{$}, @samp{%} or period. | 394 | or @samp{$}, @samp{%} or period. However, the ending character of one |
| 395 | range should not be the starting point of another one; for example, | ||
| 396 | @samp{[a-m-z]} should be avoided. | ||
| 395 | 397 | ||
| 396 | If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also | 398 | The usual regexp special characters are not special inside a |
| 397 | matches upper-case letters. Note that a range like @samp{[a-z]} is | ||
| 398 | not affected by the locale's collation sequence, it always represents | ||
| 399 | a sequence in @acronym{ASCII} order. | ||
| 400 | @c This wasn't obvious to me, since, e.g., the grep manual "Character | ||
| 401 | @c Classes and Bracket Expressions" specifically notes the opposite | ||
| 402 | @c behavior. But by experiment Emacs seems unaffected by LC_COLLATE | ||
| 403 | @c in this regard. | ||
| 404 | |||
| 405 | Note also that the usual regexp special characters are not special inside a | ||
| 406 | character alternative. A completely different set of characters is | 399 | character alternative. A completely different set of characters is |
| 407 | special inside character alternatives: @samp{]}, @samp{-} and @samp{^}. | 400 | special inside character alternatives: @samp{]}, @samp{-} and @samp{^}. |
| 408 | 401 | ||
| @@ -417,13 +410,34 @@ special there.) | |||
| 417 | To include @samp{^} in a character alternative, put it anywhere but at | 410 | To include @samp{^} in a character alternative, put it anywhere but at |
| 418 | the beginning. | 411 | the beginning. |
| 419 | 412 | ||
| 420 | @c What if it starts with a multibyte and ends with a unibyte? | 413 | The following aspects of ranges are specific to Emacs, in that POSIX |
| 421 | @c That doesn't seem to match anything...? | 414 | allows but does not require this behavior and programs other than |
| 422 | If a range starts with a unibyte character @var{c} and ends with a | 415 | Emacs may behave differently: |
| 423 | multibyte character @var{c2}, the range is divided into two parts: one | 416 | |
| 424 | spans the unibyte characters @samp{@var{c}..?\377}, the other the | 417 | @enumerate |
| 425 | multibyte characters @samp{@var{c1}..@var{c2}}, where @var{c1} is the | 418 | @item |
| 426 | first character of the charset to which @var{c2} belongs. | 419 | If @code{case-fold-search} is non-@code{nil}, @samp{[a-z]} also |
| 420 | matches upper-case letters. | ||
| 421 | |||
| 422 | @item | ||
| 423 | A range is not affected by the locale's collation sequence: it always | ||
| 424 | represents the set of characters with codepoints ranging between those | ||
| 425 | of its bounds, so that @samp{[a-z]} matches only ASCII letters, even | ||
| 426 | outside the C or POSIX locale. | ||
| 427 | |||
| 428 | @item | ||
| 429 | As a special case, if either bound of a range is a raw 8-bit byte, the | ||
| 430 | other bound should be a unibyte character, and the range matches only | ||
| 431 | unibyte characters. | ||
| 432 | |||
| 433 | @item | ||
| 434 | If the lower bound of a range is greater than its upper bound, the | ||
| 435 | range is empty and represents no characters. Thus, @samp{[b-a]} | ||
| 436 | always fails to match, and @samp{[^b-a]} matches any character, | ||
| 437 | including newline. However, the lower bound should be at most one | ||
| 438 | greater than the upper bound; for example, @samp{[c-a]} should be | ||
| 439 | avoided. | ||
| 440 | @end enumerate | ||
| 427 | 441 | ||
| 428 | A character alternative can also specify named character classes | 442 | A character alternative can also specify named character classes |
| 429 | (@pxref{Char Classes}). This is a POSIX feature. For example, | 443 | (@pxref{Char Classes}). This is a POSIX feature. For example, |
| @@ -431,6 +445,8 @@ A character alternative can also specify named character classes | |||
| 431 | Using a character class is equivalent to mentioning each of the | 445 | Using a character class is equivalent to mentioning each of the |
| 432 | characters in that class; but the latter is not feasible in practice, | 446 | characters in that class; but the latter is not feasible in practice, |
| 433 | since some classes include thousands of different characters. | 447 | since some classes include thousands of different characters. |
| 448 | A character class should not appear as the lower or upper bound | ||
| 449 | of a range. | ||
| 434 | 450 | ||
| 435 | @item @samp{[^ @dots{} ]} | 451 | @item @samp{[^ @dots{} ]} |
| 436 | @cindex @samp{^} in regexp | 452 | @cindex @samp{^} in regexp |