| author | Eli Zaretskii | 2008-11-28 13:26:43 +0000 |
|---|---|---|
| committer | Eli Zaretskii | 2008-11-28 13:26:43 +0000 |
| commit | 8b80cdf500c514dc9c448b4fe37265cf16127ae5 (patch) | |
| tree | 516ab434ca3c00418d94c5f6d303f7500b41129d | |
| parent | e8e2bd93103909d092205b95d72bfeb8d8f6d129 (diff) | |
(Text Representations, Converting Representations, Character Sets,
Scanning Charsets, Translation of Characters): Make text more accurate.
| -rw-r--r-- | doc/lispref/ChangeLog | 6 | ||||
| -rw-r--r-- | doc/lispref/nonascii.texi | 68 |
2 files changed, 51 insertions, 23 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index e0d465a0a73..3b6f5fb33fa 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
| @@ -1,3 +1,9 @@ | |||
| 1 | 2008-11-28 Eli Zaretskii <eliz@gnu.org> | ||
| 2 | |||
| 3 | * nonascii.texi (Text Representations, Converting Representations) | ||
| 4 | (Character Sets, Scanning Charsets, Translation of Characters): | ||
| 5 | Make text more accurate. | ||
| 6 | |||
| 1 | 2008-11-28 Glenn Morris <rgm@gnu.org> | 7 | 2008-11-28 Glenn Morris <rgm@gnu.org> |
| 2 | 8 | ||
| 3 | * files.texi (Format Conversion Round-Trip): Improve previous change. | 9 | * files.texi (Format Conversion Round-Trip): Improve previous change. |
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index f2656806bdb..eab748bab8d 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
| @@ -44,7 +44,7 @@ text in most any known written language. | |||
| 44 | follows the @dfn{Unicode Standard}. The Unicode Standard assigns a | 44 | follows the @dfn{Unicode Standard}. The Unicode Standard assigns a |
| 45 | unique number, called a @dfn{codepoint}, to each and every character. | 45 | unique number, called a @dfn{codepoint}, to each and every character. |
| 46 | The range of codepoints defined by Unicode, or the Unicode | 46 | The range of codepoints defined by Unicode, or the Unicode |
| 47 | @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs | 47 | @dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs |
| 48 | extends this range with codepoints in the range @code{110000..3FFFFF}, | 48 | extends this range with codepoints in the range @code{110000..3FFFFF}, |
| 49 | which it uses for representing characters that are not unified with | 49 | which it uses for representing characters that are not unified with |
| 50 | Unicode and raw 8-bit bytes that cannot be interpreted as characters | 50 | Unicode and raw 8-bit bytes that cannot be interpreted as characters |
| @@ -62,7 +62,8 @@ bytes, depending on the magnitude of its codepoint@footnote{ | |||
| 62 | This internal representation is based on one of the encodings defined | 62 | This internal representation is based on one of the encodings defined |
| 63 | by the Unicode Standard, called @dfn{UTF-8}, for representing any | 63 | by the Unicode Standard, called @dfn{UTF-8}, for representing any |
| 64 | Unicode codepoint, but Emacs extends UTF-8 to represent the additional | 64 | Unicode codepoint, but Emacs extends UTF-8 to represent the additional |
| 65 | codepoints it uses for raw 8-bit bytes.}. | 65 | codepoints it uses for raw 8-bit bytes and characters not unified with |
| 66 | Unicode.}. | ||
| 66 | For example, any @acronym{ASCII} character takes up only 1 byte, a | 67 | For example, any @acronym{ASCII} character takes up only 1 byte, a |
| 67 | Latin-1 character takes up 2 bytes, etc. We call this representation | 68 | Latin-1 character takes up 2 bytes, etc. We call this representation |
| 68 | of text @dfn{multibyte}, because it uses several bytes for each | 69 | of text @dfn{multibyte}, because it uses several bytes for each |
| @@ -157,7 +158,7 @@ result a unibyte string. | |||
| 157 | 158 | ||
| 158 | Emacs can convert unibyte text to multibyte; it can also convert | 159 | Emacs can convert unibyte text to multibyte; it can also convert |
| 159 | multibyte text to unibyte, provided that the multibyte text contains | 160 | multibyte text to unibyte, provided that the multibyte text contains |
| 160 | only @acronym{ASCII} and 8-bit characters. In general, these | 161 | only @acronym{ASCII} and 8-bit raw bytes. In general, these |
| 161 | conversions happen when inserting text into a buffer, or when putting | 162 | conversions happen when inserting text into a buffer, or when putting |
| 162 | text from several strings together in one string. You can also | 163 | text from several strings together in one string. You can also |
| 163 | explicitly convert a string's contents to either representation. | 164 | explicitly convert a string's contents to either representation. |
| @@ -194,25 +195,32 @@ newly created string with no text properties. | |||
| 194 | @defun string-to-multibyte string | 195 | @defun string-to-multibyte string |
| 195 | This function returns a multibyte string containing the same sequence | 196 | This function returns a multibyte string containing the same sequence |
| 196 | of characters as @var{string}. If @var{string} is a multibyte string, | 197 | of characters as @var{string}. If @var{string} is a multibyte string, |
| 197 | it is returned unchanged. | 198 | it is returned unchanged. The function assumes that @var{string} |
| 199 | includes only @acronym{ASCII} characters and raw 8-bit bytes; the | ||
| 200 | latter are converted to their multibyte representation corresponding | ||
| 201 | to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text | ||
| 202 | Representations, codepoints}). | ||
| 198 | @end defun | 203 | @end defun |
| 199 | 204 | ||
| 200 | @defun string-to-unibyte string | 205 | @defun string-to-unibyte string |
| 201 | This function returns a unibyte string containing the same sequence of | 206 | This function returns a unibyte string containing the same sequence of |
| 202 | characters as @var{string}. It signals an error if @var{string} | 207 | characters as @var{string}. It signals an error if @var{string} |
| 203 | contains a non-@acronym{ASCII} character. If @var{string} is a | 208 | contains a non-@acronym{ASCII} character. If @var{string} is a |
| 204 | unibyte string, it is returned unchanged. | 209 | unibyte string, it is returned unchanged. Use this function for |
| 210 | @var{string} arguments that contain only @acronym{ASCII} and eight-bit | ||
| 211 | characters. | ||
| 205 | @end defun | 212 | @end defun |
| 206 | 213 | ||
| 207 | @defun multibyte-char-to-unibyte char | 214 | @defun multibyte-char-to-unibyte char |
| 208 | This convert the multibyte character @var{char} to a unibyte | 215 | This convert the multibyte character @var{char} to a unibyte |
| 209 | character. If @var{char} is a non-@acronym{ASCII} character, the | 216 | character. If @var{char} is a character that is neither |
| 210 | value is -1. | 217 | @acronym{ASCII} nor eight-bit, the value is -1. |
| 211 | @end defun | 218 | @end defun |
| 212 | 219 | ||
| 213 | @defun unibyte-char-to-multibyte char | 220 | @defun unibyte-char-to-multibyte char |
| 214 | This convert the unibyte character @var{char} to a multibyte | 221 | This convert the unibyte character @var{char} to a multibyte |
| 215 | character. | 222 | character, assuming @var{char} is either @acronym{ASCII} or raw 8-bit |
| 223 | byte. | ||
| 216 | @end defun | 224 | @end defun |
| 217 | 225 | ||
| 218 | @node Selecting a Representation | 226 | @node Selecting a Representation |
| @@ -320,7 +328,7 @@ string instead of the current buffer. | |||
| 320 | @cindex coded character set | 328 | @cindex coded character set |
| 321 | An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters | 329 | An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters |
| 322 | in which each character is assigned a numeric code point. (The | 330 | in which each character is assigned a numeric code point. (The |
| 323 | Unicode standard calls this a @dfn{coded character set}.) Each | 331 | Unicode standard calls this a @dfn{coded character set}.) Each Emacs |
| 324 | charset has a name which is a symbol. A single character can belong | 332 | charset has a name which is a symbol. A single character can belong |
| 325 | to any number of different character sets, but it will generally have | 333 | to any number of different character sets, but it will generally have |
| 326 | a different code point in each charset. Examples of character sets | 334 | a different code point in each charset. Examples of character sets |
| @@ -387,30 +395,42 @@ This command displays a list of characters in the character set | |||
| 387 | @var{charset}. | 395 | @var{charset}. |
| 388 | @end deffn | 396 | @end deffn |
| 389 | 397 | ||
| 398 | Emacs can convert between its internal representation of a character | ||
| 399 | and the character's codepoint in a specific charset. The following | ||
| 400 | two functions support these conversions. | ||
| 401 | |||
| 402 | @c FIXME: decode-char and encode-char accept and ignore an additional | ||
| 403 | @c argument @var{restriction}. When that argument actually makes a | ||
| 404 | @c difference, it should be documented here. | ||
| 390 | @defun decode-char charset code-point | 405 | @defun decode-char charset code-point |
| 391 | This function decodes a character that is assigned a @var{code-point} | 406 | This function decodes a character that is assigned a @var{code-point} |
| 392 | in @var{charset}, to the corresponding Emacs character, and returns | 407 | in @var{charset}, to the corresponding Emacs character, and returns |
| 393 | that character. If @var{charset} doesn't contain a character of that | 408 | it. If @var{charset} doesn't contain a character of that code point, |
| 394 | code point, the value is @code{nil}. If @var{code-point} doesnt't fit | 409 | the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp |
| 395 | in a Lisp integer (@pxref{Integer Basics, most-positive-fixnum}), it | 410 | integer (@pxref{Integer Basics, most-positive-fixnum}), it can be |
| 396 | can be specified as a cons cell @code{(@var{high} . @var{low})}, where | 411 | specified as a cons cell @code{(@var{high} . @var{low})}, where |
| 397 | @var{low} are the lower 16 bits of the value and @var{high} are the | 412 | @var{low} are the lower 16 bits of the value and @var{high} are the |
| 398 | high 16 bits. | 413 | high 16 bits. |
| 399 | @end defun | 414 | @end defun |
| 400 | 415 | ||
| 401 | @defun encode-char char charset | 416 | @defun encode-char char charset |
| 402 | This function returns the code point assigned to the character | 417 | This function returns the code point assigned to the character |
| 403 | @var{char} in @var{charset}. If @var{charset} doesn't contain | 418 | @var{char} in @var{charset}. If the result does not fit in a Lisp |
| 404 | @var{char}, the value is @code{nil}. | 419 | integer, it is returned as a cons cell @code{(@var{high} . @var{low})} |
| 420 | that fits the second argument of @code{decode-char} above. If | ||
| 421 | @var{charset} doesn't have a codepoint for @var{char}, the value is | ||
| 422 | @code{nil}. | ||
| 405 | @end defun | 423 | @end defun |
| 406 | 424 | ||
| 407 | @node Scanning Charsets | 425 | @node Scanning Charsets |
| 408 | @section Scanning for Character Sets | 426 | @section Scanning for Character Sets |
| 409 | 427 | ||
| 410 | Sometimes it is useful to find out which character sets appear in a | 428 | Sometimes it is useful to find out, for characters that appear in a |
| 411 | part of a buffer or a string. One use for this is in determining which | 429 | certain part of a buffer or a string, to which character sets they |
| 412 | coding systems (@pxref{Coding Systems}) are capable of representing all | 430 | belong. One use for this is in determining which coding systems |
| 413 | of the text in question. | 431 | (@pxref{Coding Systems}) are capable of representing all of the text |
| 432 | in question; another is to determine the font(s) for displaying that | ||
| 433 | text. | ||
| 414 | 434 | ||
| 415 | @defun charset-after &optional pos | 435 | @defun charset-after &optional pos |
| 416 | This function returns the charset of highest priority containing the | 436 | This function returns the charset of highest priority containing the |
| @@ -421,7 +441,7 @@ If @var{pos} is out of range, the value is @code{nil}. | |||
| 421 | 441 | ||
| 422 | @defun find-charset-region beg end &optional translation | 442 | @defun find-charset-region beg end &optional translation |
| 423 | This function returns a list of the character sets of highest priority | 443 | This function returns a list of the character sets of highest priority |
| 424 | that contain charcters in the current buffer between positions | 444 | that contain characters in the current buffer between positions |
| 425 | @var{beg} and @var{end}. | 445 | @var{beg} and @var{end}. |
| 426 | 446 | ||
| 427 | The optional argument @var{translation} specifies a translation table to | 447 | The optional argument @var{translation} specifies a translation table to |
| @@ -453,7 +473,8 @@ systems. | |||
| 453 | A translation table has two extra slots. The first is either | 473 | A translation table has two extra slots. The first is either |
| 454 | @code{nil} or a translation table that performs the reverse | 474 | @code{nil} or a translation table that performs the reverse |
| 455 | translation; the second is the maximum number of characters to look up | 475 | translation; the second is the maximum number of characters to look up |
| 456 | for translation. | 476 | for translating sequences of characters (see the description of |
| 477 | @code{make-translation-table-from-alist} below). | ||
| 457 | 478 | ||
| 458 | @defun make-translation-table &rest translations | 479 | @defun make-translation-table &rest translations |
| 459 | This function returns a translation table based on the argument | 480 | This function returns a translation table based on the argument |
| @@ -504,7 +525,7 @@ This function returns a translation table made from @var{vec} that is | |||
| 504 | an array of 256 elements to map byte values 0 through 255 to | 525 | an array of 256 elements to map byte values 0 through 255 to |
| 505 | characters. Elements may be @code{nil} for untranslated bytes. The | 526 | characters. Elements may be @code{nil} for untranslated bytes. The |
| 506 | returned table has a translation table for reverse mapping in the | 527 | returned table has a translation table for reverse mapping in the |
| 507 | first extra slot. | 528 | first extra slot, and the value @code{1} in the second extra slot. |
| 508 | 529 | ||
| 509 | This function provides an easy way to make a private coding system | 530 | This function provides an easy way to make a private coding system |
| 510 | that maps each byte to a specific character. You can specify the | 531 | that maps each byte to a specific character. You can specify the |
| @@ -524,7 +545,8 @@ character, that character is translated to @var{to} (i.e.@: to a | |||
| 524 | character or a character sequence). If @var{from} is a vector of | 545 | character or a character sequence). If @var{from} is a vector of |
| 525 | characters, that sequence is translated to @var{to}. The returned | 546 | characters, that sequence is translated to @var{to}. The returned |
| 526 | table has a translation table for reverse mapping in the first extra | 547 | table has a translation table for reverse mapping in the first extra |
| 527 | slot. | 548 | slot, and the maximum length of all the @var{from} character sequences |
| 549 | in the second extra slot. | ||
| 528 | @end defun | 550 | @end defun |
| 529 | 551 | ||
| 530 | @node Coding Systems | 552 | @node Coding Systems |
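
Two numeric details in the patched text can be modelled outside Emacs: raw 8-bit bytes occupy the codepoints 3FFF80..3FFFFF, and a code point too large for a Lisp integer is passed as a cons cell of its high and low 16 bits. The sketch below is a Python model, not Emacs source. The function names are hypothetical, and it assumes that bytes 0x80..0xFF map onto the range in ascending order.

```python
# Illustrative model only; these helpers are hypothetical, not Emacs APIs.

RAW_BYTE_FIRST = 0x3FFF80  # start of the raw-byte area named in the patch
RAW_BYTE_LAST = 0x3FFFFF   # end of the raw-byte area named in the patch


def raw_byte_to_codepoint(byte):
    """Map a raw 8-bit byte (0x80..0xFF) into the 3FFF80..3FFFFF area.

    Assumes the bytes are laid out in ascending order within the area.
    """
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("expected a raw 8-bit byte in 0x80..0xFF")
    return RAW_BYTE_FIRST + (byte - 0x80)


def codepoint_to_raw_byte(codepoint):
    """Inverse of raw_byte_to_codepoint; None outside the raw-byte area."""
    if not RAW_BYTE_FIRST <= codepoint <= RAW_BYTE_LAST:
        return None
    return codepoint - RAW_BYTE_FIRST + 0x80


def split_code_point(value):
    """Split a 32-bit code point into (high, low) 16-bit halves."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def join_code_point(high, low):
    """Rebuild the code point from its (high, low) 16-bit halves."""
    return (high << 16) | low
```

Here `split_code_point` and `join_code_point` mirror the `(high . low)` form described for `decode-char` and `encode-char`, so a value split by one can be rebuilt by the other.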