| -rw-r--r-- | doc/lispref/nonascii.texi | 247 |
1 files changed, 247 insertions, 0 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 256d2c8f38a..c967c28f631 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi | |||
| @@ -19,6 +19,8 @@ how they are stored in strings and buffers. | |||
| 19 | * Selecting a Representation:: Treating a byte sequence as unibyte or multi. | 19 | * Selecting a Representation:: Treating a byte sequence as unibyte or multi. |
| 20 | * Character Codes:: How unibyte and multibyte relate to | 20 | * Character Codes:: How unibyte and multibyte relate to |
| 21 | codes of individual characters. | 21 | codes of individual characters. |
| 22 | * Character Properties:: Character attributes that define their | ||
| 23 | behavior and handling. | ||
| 22 | * Character Sets:: The space of possible character codes | 24 | * Character Sets:: The space of possible character codes |
| 23 | is divided into various character sets. | 25 | is divided into various character sets. |
| 24 | * Scanning Charsets:: Which character sets are used in a buffer? | 26 | * Scanning Charsets:: Which character sets are used in a buffer? |
| @@ -344,6 +346,184 @@ The optional argument @var{string} means to get a byte value from that | |||
| 344 | string instead of the current buffer. | 346 | string instead of the current buffer. |
| 345 | @end defun | 347 | @end defun |
| 346 | 348 | ||
| 349 | @node Character Properties | ||
| 350 | @section Character Properties | ||
| 351 | @cindex character properties | ||
| 352 | A @dfn{character property} is a named attribute of a character that | ||
| 353 | specifies how the character behaves and how it should be handled | ||
| 354 | during text processing and display. Thus, character properties are an | ||
| 355 | important part of specifying the character's semantics. | ||
| 356 | |||
| 357 | Emacs generally follows the Unicode Standard in its implementation | ||
| 358 | of character properties. In particular, Emacs supports the | ||
| 359 | @uref{http://www.unicode.org/reports/tr23/, Unicode Character Property | ||
| 360 | Model}, and the Emacs character property database is derived from the | ||
| 361 | Unicode Character Database (@acronym{UCD}). See the | ||
| 362 | @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character | ||
| 363 | Properties chapter of the Unicode Standard}, for more details about | ||
| 364 | Unicode character properties and their meaning. | ||
| 365 | |||
| 366 | The facilities documented in this section are useful for setting and | ||
| 367 | retrieving properties of characters. | ||
| 368 | |||
| 369 | In Emacs, each property has a name, which is a symbol, and a set of | ||
| 370 | possible values, whose types depend on the property. Here's the full | ||
| 371 | list of character properties that Emacs knows about: | ||
| 372 | |||
| 373 | @table @code | ||
| 374 | @item name | ||
| 375 | The character's canonical unique name. The value of the property is a | ||
| 376 | string consisting of upper-case Latin letters A to Z, digits, spaces, | ||
| 377 | and hyphen @samp{-} characters. | ||
| 378 | |||
| 379 | @item general-category | ||
| 380 | This property assigns the character to one of the major classes, such | ||
| 381 | as letters, punctuation, and symbols, and its important subclasses. | ||
| 382 | The value is a symbol whose name is a 2-letter abbreviation. The | ||
| 383 | first letter specifies the character's major class and the second | ||
| 384 | letter designates a subclass of that major class. | ||
| 385 | |||
| 386 | @item canonical-combining-class | ||
| 387 | This property classifies combining characters into several classes, | ||
| 388 | depending on the details of their behavior in sequences of combining | ||
| 389 | characters. The property's value is an integer number. | ||
| 390 | |||
| 391 | @item bidi-class | ||
| 392 | This property specifies character attributes required for correct | ||
| 393 | display of @dfn{bidirectional text} used by right-to-left scripts, | ||
| 394 | such as Arabic and Hebrew. The value is a symbol whose name is the | ||
| 395 | Unicode @dfn{directional type} of the character. | ||
| 396 | |||
| 397 | @item decomposition | ||
| 398 | This property defines a mapping from a character to a sequence of one | ||
| 399 | or more characters that is a canonical or compatibility equivalent to | ||
| 400 | it. The value is a list, whose first element may be a symbol | ||
| 401 | representing a compatibility formatting tag, such as @code{<small>}; | ||
| 402 | the other elements are characters that give the compatibility | ||
| 403 | decomposition sequence. | ||
| 404 | |||
| 405 | @item decimal-digit-value | ||
| 406 | This property specifies a numeric value of characters that represent | ||
| 407 | decimal digits. The value is an integer number. | ||
| 408 | |||
| 409 | @item digit-value | |
| 410 | This property specifies a numeric value of characters that represent | ||
| 411 | digits, but not necessarily decimal. Examples include compatibility | ||
| 412 | subscript and superscript digits. The value is an integer number. | ||
| 413 | |||
| 414 | @item numeric-value | ||
| 415 | This property specifies whether the character represents a number. | ||
| 416 | Examples of characters that do include fractions, subscripts, | ||
| 417 | superscripts, Roman numerals, currency numerators, and encircled | ||
| 418 | numbers. The value is a symbol whose name gives the numeric value; | ||
| 419 | for example, the value of this property for the character | ||
| 420 | @code{U+2155} (@sc{vulgar fraction one fifth}) is the symbol | ||
| 421 | @samp{1/5}. | ||
| 422 | |||
| 423 | @item mirrored | ||
| 424 | This is a property of characters such as parentheses, which need to be | ||
| 425 | mirrored horizontally in right-to-left scripts. The value is a | |
| 426 | symbol, either @samp{Y} or @samp{N}. | ||
| 427 | |||
| 428 | @item old-name | ||
| 429 | This property's value specifies the name, if any, of the character in | ||
| 430 | the old version 1.0 of the Unicode Standard. The value is a string. | ||
| 431 | |||
| 432 | @item iso-10646-comment | ||
| 433 | The character's comment field from the ISO 10646 standard. The value | |
| 434 | is a string, or @code{nil} if there's no comment. | ||
| 435 | |||
| 436 | @item uppercase | ||
| 437 | If this character has an upper-case equivalent that is a single | ||
| 438 | character, then the value of this property is that upper-case | ||
| 439 | equivalent. Otherwise, the value is @code{nil}. | ||
| 440 | |||
| 441 | @item lowercase | ||
| 442 | If this character has a lower-case equivalent that is a single | |
| 443 | character, then the value of this property is that lower-case | ||
| 444 | equivalent. Otherwise, the value is @code{nil}. | ||
| 445 | |||
| 446 | @item titlecase | ||
| 447 | @dfn{Title case} is a special form of a character used when the first | ||
| 448 | character of a word needs to be capitalized. If a character has a | ||
| 449 | title-case equivalent that is a single character, then the value of | ||
| 450 | this property is that title-case equivalent. Otherwise, the value is | ||
| 451 | @code{nil}. | ||
| 452 | @end table | ||
| 453 | |||
| 454 | @defun get-char-code-property char propname | ||
| 455 | This function returns the value of @var{char}'s @var{propname} property. | ||
| 456 | |||
| 457 | @example | ||
| 458 | @group | ||
| 459 | (get-char-code-property ?\s 'general-category) | |
| 460 | @result{} Zs | ||
| 461 | @end group | ||
| 462 | @group | ||
| 463 | (get-char-code-property ?1 'general-category) | ||
| 464 | @result{} Nd | ||
| 465 | @end group | ||
| 466 | @group | ||
| 467 | (get-char-code-property ?\u2084 'digit-value) ; subscript 4 | ||
| 468 | @result{} 4 | ||
| 469 | @end group | ||
| 470 | @group | ||
| 471 | (get-char-code-property ?\u2155 'numeric-value) ; one fifth | ||
| 472 | @result{} 1/5 | ||
| 473 | @end group | ||
| 474 | @group | ||
| 475 | (get-char-code-property ?\u2163 'numeric-value) ; Roman IV | ||
| 476 | @result{} \4 | ||
| 477 | @end group | ||
| 478 | @end example | ||
| 479 | @end defun | ||
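As a further illustration, the sketch below reads several of the properties listed above for a single character. The helper name @code{my-char-summary} is hypothetical and not part of Emacs, and the values returned depend on the Unicode data shipped with a given Emacs version.

```elisp
;; Hypothetical helper (not part of Emacs): collect a few properties
;; of CHAR into an alist using `get-char-code-property'.
(defun my-char-summary (char)
  "Return an alist of selected character properties of CHAR."
  (mapcar (lambda (prop)
            (cons prop (get-char-code-property char prop)))
          '(name general-category bidi-class mirrored lowercase)))

;; For ?A the `name' entry is "LATIN CAPITAL LETTER A" and the
;; `general-category' entry is the symbol Lu.
(my-char-summary ?A)
```

Collecting the values in an alist keeps the property name next to each value, which is convenient when printing or comparing characters.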
| 480 | |||
| 481 | @defun char-code-property-description prop value | ||
| 482 | This function returns the description string of property @var{prop}'s | ||
| 483 | @var{value}, or @code{nil} if @var{value} has no description. | ||
| 484 | |||
| 485 | @example | ||
| 486 | @group | ||
| 487 | (char-code-property-description 'general-category 'Zs) | ||
| 488 | @result{} "Separator, Space" | ||
| 489 | @end group | ||
| 490 | @group | ||
| 491 | (char-code-property-description 'general-category 'Nd) | ||
| 492 | @result{} "Number, Decimal Digit" | ||
| 493 | @end group | ||
| 494 | @group | ||
| 495 | (char-code-property-description 'numeric-value '1/5) | ||
| 496 | @result{} nil | ||
| 497 | @end group | ||
| 498 | @end example | ||
| 499 | @end defun | ||
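The two functions above can be combined to turn a character's property value into readable text. The following is a minimal sketch; the helper name @code{my-char-category-description} is made up for illustration.

```elisp
;; Hypothetical helper: describe the general category of CHAR in words
;; by feeding the property value to `char-code-property-description'.
(defun my-char-category-description (char)
  "Return a description string for CHAR's general category, or nil."
  (char-code-property-description
   'general-category
   (get-char-code-property char 'general-category)))

;; ?1 has general category Nd, so this yields "Number, Decimal Digit".
(my-char-category-description ?1)
```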
| 500 | |||
| 501 | @defun put-char-code-property char propname value | ||
| 502 | This function stores @var{value} as the value of the property | ||
| 503 | @var{propname} for the character @var{char}. | ||
| 504 | @end defun | ||
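A small round trip may make the relation between storing and retrieving clearer. This sketch assumes that a property name outside the list above is simply stored and returned as given; @code{my-note} is a made-up property name, not one that Emacs defines.

```elisp
;; `my-note' is a hypothetical property used only for illustration;
;; it is not one of the properties listed in the table above.
(put-char-code-property ?\u2603 'my-note "snowman")

;; Reading the property back should return the stored value.
(get-char-code-property ?\u2603 'my-note)
```

Using a private property name avoids overwriting values that come from the Unicode Character Database.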
| 505 | |||
| 506 | @defvar char-script-table | ||
| 507 | The value of this variable is a char-table (@pxref{Char-Tables}) that | ||
| 508 | specifies, for each character, a symbol whose name is the script to | ||
| 509 | which the character belongs, according to the Unicode Standard | ||
| 510 | classification of the Unicode code space into script-specific blocks. | ||
| 511 | This char-table has a single extra slot whose value is the list of all | ||
| 512 | script symbols. | ||
| 513 | @end defvar | ||
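For example, the table can be indexed with @code{aref}, and the list of scripts can be read from its extra slot. The script symbols named in the comments are what current Emacs versions return; treat them as illustrative.

```elisp
(aref char-script-table ?a)        ; Latin small letter a => latin
(aref char-script-table ?\u03b1)   ; Greek small letter alpha => greek

;; The single extra slot holds the list of all script symbols,
;; so this returns a non-nil tail of that list.
(memq 'greek (char-table-extra-slot char-script-table 0))
```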
| 514 | |||
| 515 | @defvar char-width-table | ||
| 516 | The value of this variable is a char-table that specifies the width of | ||
| 517 | each character in columns that it will occupy on the screen. | ||
| 518 | @end defvar | ||
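As a brief sketch, the width of a character can be looked up directly in this table. The widths shown assume the default table, without language-environment or user customizations.

```elisp
(aref char-width-table ?a)        ; ordinary Latin letter: 1 column
(aref char-width-table ?\u4e2d)   ; CJK ideograph: 2 columns
```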
| 519 | |||
| 520 | @defvar printable-chars | ||
| 521 | The value of this variable is a char-table that specifies, for each | ||
| 522 | character, whether it is printable or not. That is, if evaluating | ||
| 523 | @code{(aref printable-chars char)} results in @code{t}, the character | ||
| 524 | is printable, and if it results in @code{nil}, it is not. | ||
| 525 | @end defvar | ||
| 526 | |||
| 347 | @node Character Sets | 527 | @node Character Sets |
| 348 | @section Character Sets | 528 | @section Character Sets |
| 349 | @cindex character sets | 529 | @cindex character sets |
| @@ -692,6 +872,10 @@ The value of the @code{:mime-charset} property is also defined | |||
| 692 | as an alias for the coding system. | 872 | as an alias for the coding system. |
| 693 | @end defun | 873 | @end defun |
| 694 | 874 | ||
| 875 | @defun coding-system-aliases coding-system | ||
| 876 | This function returns the list of aliases of @var{coding-system}. | ||
| 877 | @end defun | ||
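A short usage sketch follows. The exact aliases depend on which coding systems are defined in a given Emacs, so no return value is shown.

```elisp
;; List the aliases recorded for two common coding systems.
(coding-system-aliases 'utf-8)
(coding-system-aliases 'iso-latin-1)

;; Check whether a given name is an alias of a coding system.
(memq 'latin-1 (coding-system-aliases 'iso-latin-1))
```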
| 878 | |||
| 695 | @node Encoding and I/O | 879 | @node Encoding and I/O |
| 696 | @subsection Encoding and I/O | 880 | @subsection Encoding and I/O |
| 697 | 881 | ||
| @@ -865,6 +1049,22 @@ This function returns a list of coding systems that could be used to | |||
| 865 | encode all the character sets in the list @var{charsets}. | 1049 | encode all the character sets in the list @var{charsets}. |
| 866 | @end defun | 1050 | @end defun |
| 867 | 1051 | ||
| 1052 | @defun check-coding-systems-region start end coding-system-list | ||
| 1053 | This function checks whether coding systems in the list | ||
| 1054 | @var{coding-system-list} can encode all the characters in the region | |
| 1055 | between @var{start} and @var{end}. If all of the coding systems in | ||
| 1056 | the list can encode the specified text, the function returns | ||
| 1057 | @code{nil}. If some coding systems cannot encode some of the | ||
| 1058 | characters, the value is an alist, each element of which has the form | ||
| 1059 | @code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning | ||
| 1060 | that @var{coding-system1} cannot encode characters at buffer positions | ||
| 1061 | @var{pos1}, @var{pos2}, @enddots{}. | ||
| 1062 | |||
| 1063 | @var{start} may be a string, in which case @var{end} is ignored and | ||
| 1064 | the returned value references string indices instead of buffer | ||
| 1065 | positions. | ||
| 1066 | @end defun | ||
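The string form is convenient for a quick check. In this sketch the sample text and the choice of coding systems are arbitrary; the result should mention only the coding systems that cannot encode some character.

```elisp
;; With a string as START, END is ignored, so nil is passed here.
;; The string contains two non-ASCII characters, so `us-ascii' is
;; expected to be reported while `utf-8' is not.
(check-coding-systems-region "na\u00efve caf\u00e9" nil
                             '(us-ascii utf-8))
```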
| 1067 | |||
| 868 | @defun detect-coding-region start end &optional highest | 1068 | @defun detect-coding-region start end &optional highest |
| 869 | This function chooses a plausible coding system for decoding the text | 1069 | This function chooses a plausible coding system for decoding the text |
| 870 | from @var{start} to @var{end}. This text should be a byte sequence, | 1070 | from @var{start} to @var{end}. This text should be a byte sequence, |
| @@ -888,6 +1088,26 @@ This function is like @code{detect-coding-region} except that it | |||
| 888 | operates on the contents of @var{string} instead of bytes in the buffer. | 1088 | operates on the contents of @var{string} instead of bytes in the buffer. |
| 889 | @end defun | 1089 | @end defun |
| 890 | 1090 | ||
| 1091 | @defun coding-system-charset-list coding-system | ||
| 1092 | This function returns the list of character sets (@pxref{Character | ||
| 1093 | Sets}) supported by @var{coding-system}. Some coding systems that | ||
| 1094 | support too many character sets to list them all yield special values: | ||
| 1095 | @itemize @bullet | ||
| 1096 | @item | ||
| 1097 | If @var{coding-system} supports all the ISO-2022 charsets, the value | ||
| 1098 | is @code{iso-2022}. | ||
| 1099 | @item | ||
| 1100 | If @var{coding-system} supports all Emacs characters, the value is | ||
| 1101 | @code{(emacs)}. | ||
| 1102 | @item | ||
| 1103 | If @var{coding-system} supports all emacs-mule characters, the value | ||
| 1104 | is @code{emacs-mule}. | ||
| 1105 | @item | ||
| 1106 | If @var{coding-system} supports all Unicode characters, the value is | ||
| 1107 | @code{(unicode)}. | ||
| 1108 | @end itemize | ||
| 1109 | @end defun | ||
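The following sketch shows both an ordinary result and two of the special values described above. The coding systems are chosen only as familiar examples.

```elisp
(coding-system-charset-list 'iso-latin-1)    ; an ordinary charset list
(coding-system-charset-list 'utf-8)          ; => (unicode)
(coding-system-charset-list 'iso-2022-7bit)  ; => iso-2022
```

Callers that iterate over the result should therefore check for the symbol values before treating it as a list.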
| 1110 | |||
| 891 | @xref{Coding systems for a subprocess,, Process Information}, in | 1111 | @xref{Coding systems for a subprocess,, Process Information}, in |
| 892 | particular the description of the functions | 1112 | particular the description of the functions |
| 893 | @code{process-coding-system} and @code{set-process-coding-system}, for | 1113 | @code{process-coding-system} and @code{set-process-coding-system}, for |
| @@ -1179,6 +1399,33 @@ Emacs I/O and subprocess primitives, and to the explicit encoding and | |||
| 1179 | decoding functions (@pxref{Explicit Encoding}). | 1399 | decoding functions (@pxref{Explicit Encoding}). |
| 1180 | @end defvar | 1400 | @end defvar |
| 1181 | 1401 | ||
| 1402 | @cindex priority order of coding systems | ||
| 1403 | @cindex coding systems, priority | ||
| 1404 | Sometimes, you need to prefer several coding systems for some | ||
| 1405 | operation, rather than fix a single one. Emacs lets you specify a | ||
| 1406 | priority order for using coding systems. This ordering affects the | ||
| 1407 | sorting of lists of coding systems returned by functions such as | |
| 1408 | @code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}). | ||
| 1409 | |||
| 1410 | @defun coding-system-priority-list &optional highestp | ||
| 1411 | This function returns the list of coding systems in the order of their | ||
| 1412 | current priorities. Optional argument @var{highestp}, if | ||
| 1413 | non-@code{nil}, means return only the highest priority coding system. | ||
| 1414 | @end defun | ||
| 1415 | |||
| 1416 | @defun set-coding-system-priority &rest coding-systems | ||
| 1417 | This function puts @var{coding-systems} at the beginning of the | ||
| 1418 | priority list for coding systems, thus making their priority higher | ||
| 1419 | than all the rest. | ||
| 1420 | @end defun | ||
| 1421 | |||
| 1422 | @defmac with-coding-priority coding-systems &rest body@dots{} | ||
| 1423 | This macro executes @var{body}, like @code{progn} does | |
| 1424 | (@pxref{Sequencing, progn}), with @var{coding-systems} at the front of | ||
| 1425 | the priority list for coding systems. @var{coding-systems} should be | ||
| 1426 | a list of coding systems to prefer during execution of @var{body}. | ||
| 1427 | @end defmac | ||
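A minimal sketch of temporarily preferring one coding system is shown here. It assumes @code{utf-8} is defined, as it is in a standard Emacs.

```elisp
;; Inside the body, utf-8 heads the priority list, so asking for the
;; highest-priority coding system returns the symbol utf-8.  The
;; previous order is restored when the body exits.
(with-coding-priority '(utf-8)
  (coding-system-priority-list t))
```

Compared with calling @code{set-coding-system-priority} directly, the macro leaves the global priority order unchanged once the body has finished.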
| 1428 | |||
| 1182 | @node Explicit Encoding | 1429 | @node Explicit Encoding |
| 1183 | @subsection Explicit Encoding and Decoding | 1430 | @subsection Explicit Encoding and Decoding |
| 1184 | @cindex encoding in coding systems | 1431 | @cindex encoding in coding systems |