| author | Eli Zaretskii | 2008-11-29 12:18:14 +0000 |
|---|---|---|
| committer | Eli Zaretskii | 2008-11-29 12:18:14 +0000 |
| commit | 800702607a4a0e84eb2ccea967d6819d4073a3ac (patch) | |
| tree | dcf9b29df4ff01975259ac13a3f2f42e31617121 | |
| parent | 2543eb396b7c5b2754ed10c46e333e144c1967ce (diff) | |
(Explicit Encoding): Update for Emacs 23.
(Character Codes): Document `max-char'.
| -rw-r--r-- | doc/lispref/nonascii.texi | 192 |
1 files changed, 123 insertions, 69 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index eab748bab8d..256d2c8f38a 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
| @@ -298,12 +298,36 @@ This returns @code{t} if @var{charcode} is a valid character, and | |||
| 298 | @code{nil} otherwise. | 298 | @code{nil} otherwise. |
| 299 | 299 | ||
| 300 | @example | 300 | @example |
| 301 | @group | ||
| 301 | (characterp 65) | 302 | (characterp 65) |
| 302 | @result{} t | 303 | @result{} t |
| 304 | @end group | ||
| 305 | @group | ||
| 303 | (characterp 4194303) | 306 | (characterp 4194303) |
| 304 | @result{} t | 307 | @result{} t |
| 308 | @end group | ||
| 309 | @group | ||
| 305 | (characterp 4194304) | 310 | (characterp 4194304) |
| 306 | @result{} nil | 311 | @result{} nil |
| 312 | @end group | ||
| 313 | @end example | ||
| 314 | @end defun | ||
| 315 | |||
| 316 | @cindex maximum value of character codepoint | ||
| 317 | @cindex codepoint, largest value | ||
| 318 | @defun max-char | ||
| 319 | This function returns the largest value that a valid character | ||
| 320 | codepoint can have. | ||
| 321 | |||
| 322 | @example | ||
| 323 | @group | ||
| 324 | (characterp (max-char)) | ||
| 325 | @result{} t | ||
| 326 | @end group | ||
| 327 | @group | ||
| 328 | (characterp (1+ (max-char))) | ||
| 329 | @result{} nil | ||
| 330 | @end group | ||
| 307 | @end example | 331 | @end example |
| 308 | @end defun | 332 | @end defun |
| 309 | 333 | ||
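The relationship between `characterp` and `max-char` documented in this hunk can be checked with a short sketch. This is an illustration, not part of the commit; the helper name `my-codepoint-in-range-p` is hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs: test whether N lies in the
;; range of valid character codepoints, using `max-char' as the bound.
(defun my-codepoint-in-range-p (n)
  "Return non-nil if integer N is between 0 and (max-char), inclusive."
  (and (integerp n) (<= 0 n (max-char))))

(my-codepoint-in-range-p 65)               ; non-nil
(my-codepoint-in-range-p (1+ (max-char)))  ; nil
(= (max-char) 4194303)                     ; same literal as in the characterp example
```

Calling `(max-char)` instead of hard-coding 4194303 keeps such a check independent of the concrete limit.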
| @@ -579,48 +603,51 @@ documented here. | |||
| 579 | @subsection Basic Concepts of Coding Systems | 603 | @subsection Basic Concepts of Coding Systems |
| 580 | 604 | ||
| 581 | @cindex character code conversion | 605 | @cindex character code conversion |
| 582 | @dfn{Character code conversion} involves conversion between the encoding | 606 | @dfn{Character code conversion} involves conversion between the |
| 583 | used inside Emacs and some other encoding. Emacs supports many | 607 | internal representation of characters used inside Emacs and some other |
| 584 | different encodings, in that it can convert to and from them. For | 608 | encoding. Emacs supports many different encodings, in that it can |
| 585 | example, it can convert text to or from encodings such as Latin 1, Latin | 609 | convert to and from them. For example, it can convert text to or from |
| 586 | 2, Latin 3, Latin 4, Latin 5, and several variants of ISO 2022. In some | 610 | encodings such as Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and |
| 587 | cases, Emacs supports several alternative encodings for the same | 611 | several variants of ISO 2022. In some cases, Emacs supports several |
| 588 | characters; for example, there are three coding systems for the Cyrillic | 612 | alternative encodings for the same characters; for example, there are |
| 589 | (Russian) alphabet: ISO, Alternativnyj, and KOI8. | 613 | three coding systems for the Cyrillic (Russian) alphabet: ISO, |
| 590 | 614 | Alternativnyj, and KOI8. | |
| 615 | |||
| 616 | @c I think this paragraph is no longer correct. | ||
| 617 | @ignore | ||
| 591 | Most coding systems specify a particular character code for | 618 | Most coding systems specify a particular character code for |
| 592 | conversion, but some of them leave the choice unspecified---to be chosen | 619 | conversion, but some of them leave the choice unspecified---to be chosen |
| 593 | heuristically for each file, based on the data. | 620 | heuristically for each file, based on the data. |
| 621 | @end ignore | ||
| 594 | 622 | ||
| 595 | In general, a coding system doesn't guarantee roundtrip identity: | 623 | In general, a coding system doesn't guarantee roundtrip identity: |
| 596 | decoding a byte sequence using a coding system, then encoding the | 624 | decoding a byte sequence using a coding system, then encoding the |
| 597 | resulting text in the same coding system, can produce a different byte | 625 | resulting text in the same coding system, can produce a different byte |
| 598 | sequence. However, the following coding systems do guarantee that the | 626 | sequence. But some coding systems do guarantee that the byte sequence |
| 599 | byte sequence will be the same as what you originally decoded: | 627 | will be the same as what you originally decoded. Here are a few |
| 628 | examples: | ||
| 600 | 629 | ||
| 601 | @quotation | 630 | @quotation |
| 602 | chinese-big5 chinese-iso-8bit cyrillic-iso-8bit emacs-mule | 631 | iso-8859-1, utf-8, big5, shift_jis, euc-jp |
| 603 | greek-iso-8bit hebrew-iso-8bit iso-latin-1 iso-latin-2 iso-latin-3 | ||
| 604 | iso-latin-4 iso-latin-5 iso-latin-8 iso-latin-9 iso-safe | ||
| 605 | japanese-iso-8bit japanese-shift-jis korean-iso-8bit raw-text | ||
| 606 | @end quotation | 632 | @end quotation |
| 607 | 633 | ||
| 608 | Encoding buffer text and then decoding the result can also fail to | 634 | Encoding buffer text and then decoding the result can also fail to |
| 609 | reproduce the original text. For instance, if you encode Latin-2 | 635 | reproduce the original text. For instance, if you encode a character |
| 610 | characters with @code{utf-8} and decode the result using the same | 636 | with a coding system which does not support that character, the result |
| 611 | coding system, you'll get Unicode characters (of charset | 637 | is unpredictable, and thus decoding it using the same coding system |
| 612 | @code{mule-unicode-0100-24ff}). If you encode Unicode characters with | 638 | may produce a different text. Currently, Emacs can't report errors |
| 613 | @code{iso-latin-2} and decode the result with the same coding system, | 639 | that result from encoding unsupported characters. |
| 614 | you'll get Latin-2 characters. | ||
| 615 | 640 | ||
| 616 | @cindex EOL conversion | 641 | @cindex EOL conversion |
| 617 | @cindex end-of-line conversion | 642 | @cindex end-of-line conversion |
| 618 | @cindex line end conversion | 643 | @cindex line end conversion |
| 619 | @dfn{End of line conversion} handles three different conventions used | 644 | @dfn{End of line conversion} handles three different conventions |
| 620 | on various systems for representing end of line in files. The Unix | 645 | used on various systems for representing end of line in files. The |
| 621 | convention is to use the linefeed character (also called newline). The | 646 | Unix convention, used on GNU and Unix systems, is to use the linefeed |
| 622 | DOS convention is to use a carriage-return and a linefeed at the end of | 647 | character (also called newline). The DOS convention, used on |
| 623 | a line. The Mac convention is to use just carriage-return. | 648 | MS-Windows and MS-DOS systems, is to use a carriage-return and a |
| 649 | linefeed at the end of a line. The Mac convention is to use just | ||
| 650 | carriage-return. | ||
| 624 | 651 | ||
| 625 | @cindex base coding system | 652 | @cindex base coding system |
| 626 | @cindex variant coding system | 653 | @cindex variant coding system |
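The roundtrip and end-of-line behavior described in this hunk can be sketched as follows. This is an illustration, not part of the commit; `my-roundtrip-p` is a hypothetical helper. The sketch only checks whether text survives an encode/decode cycle and makes no claim about what the lossy result looks like.

```elisp
;; Hypothetical helper, not part of Emacs: does TEXT survive being
;; encoded and then decoded with the same CODING-SYSTEM?
(defun my-roundtrip-p (text coding-system)
  "Return non-nil if TEXT is unchanged by encoding and then decoding."
  (equal text
         (decode-coding-string
          (encode-coding-string text coding-system)
          coding-system)))

;; utf-8 supports U+00EF, so the text is reproduced.
(my-roundtrip-p "na\u00efve" 'utf-8)

;; Latin-1 cannot represent U+3042 (HIRAGANA LETTER A), so information
;; is lost during encoding and the decoded text differs.
(my-roundtrip-p "\u3042" 'iso-8859-1)

;; The -unix, -dos and -mac variants differ only in the end-of-line bytes.
(encode-coding-string "a\nb" 'utf-8-unix)  ; LF
(encode-coding-string "a\nb" 'utf-8-dos)   ; CR LF
(encode-coding-string "a\nb" 'utf-8-mac)   ; CR
```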
| @@ -639,7 +666,8 @@ data, and has the usual three variants which specify the end-of-line | |||
| 639 | conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}: | 666 | conversion. @code{no-conversion} is equivalent to @code{raw-text-unix}: |
| 640 | it specifies no conversion of either character codes or end-of-line. | 667 | it specifies no conversion of either character codes or end-of-line. |
| 641 | 668 | ||
| 642 | The coding system @code{emacs-mule} specifies that the data is | 669 | @vindex emacs-internal@r{ coding system} |
| 670 | The coding system @code{emacs-internal} specifies that the data is | ||
| 643 | represented in the internal Emacs encoding. This is like | 671 | represented in the internal Emacs encoding. This is like |
| 644 | @code{raw-text} in that no code conversion happens, but different in | 672 | @code{raw-text} in that no code conversion happens, but different in |
| 645 | that the result is multibyte data. | 673 | that the result is multibyte data. |
| @@ -647,20 +675,20 @@ that the result is multibyte data. | |||
| 647 | @defun coding-system-get coding-system property | 675 | @defun coding-system-get coding-system property |
| 648 | This function returns the specified property of the coding system | 676 | This function returns the specified property of the coding system |
| 649 | @var{coding-system}. Most coding system properties exist for internal | 677 | @var{coding-system}. Most coding system properties exist for internal |
| 650 | purposes, but one that you might find useful is @code{mime-charset}. | 678 | purposes, but one that you might find useful is @code{:mime-charset}. |
| 651 | That property's value is the name used in MIME for the character coding | 679 | That property's value is the name used in MIME for the character coding |
| 652 | which this coding system can read and write. Examples: | 680 | which this coding system can read and write. Examples: |
| 653 | 681 | ||
| 654 | @example | 682 | @example |
| 655 | (coding-system-get 'iso-latin-1 'mime-charset) | 683 | (coding-system-get 'iso-latin-1 :mime-charset) |
| 656 | @result{} iso-8859-1 | 684 | @result{} iso-8859-1 |
| 657 | (coding-system-get 'iso-2022-cn 'mime-charset) | 685 | (coding-system-get 'iso-2022-cn :mime-charset) |
| 658 | @result{} iso-2022-cn | 686 | @result{} iso-2022-cn |
| 659 | (coding-system-get 'cyrillic-koi8 'mime-charset) | 687 | (coding-system-get 'cyrillic-koi8 :mime-charset) |
| 660 | @result{} koi8-r | 688 | @result{} koi8-r |
| 661 | @end example | 689 | @end example |
| 662 | 690 | ||
| 663 | The value of the @code{mime-charset} property is also defined | 691 | The value of the @code{:mime-charset} property is also defined |
| 664 | as an alias for the coding system. | 692 | as an alias for the coding system. |
| 665 | @end defun | 693 | @end defun |
| 666 | 694 | ||
| @@ -763,9 +791,11 @@ name or @code{nil}. | |||
| 763 | @end defun | 791 | @end defun |
| 764 | 792 | ||
| 765 | @defun check-coding-system coding-system | 793 | @defun check-coding-system coding-system |
| 766 | This function checks the validity of @var{coding-system}. | 794 | This function checks the validity of @var{coding-system}. If that is |
| 767 | If that is valid, it returns @var{coding-system}. | 795 | valid, it returns @var{coding-system}. If @var{coding-system} is |
| 768 | Otherwise it signals an error with condition @code{coding-system-error}. | 796 | @code{nil}, the function returns @code{nil}. For any other values, it |
| 797 | signals an error whose @code{error-symbol} is @code{coding-system-error} | ||
| 798 | (@pxref{Signaling Errors, signal}). | ||
| 769 | @end defun | 799 | @end defun |
| 770 | 800 | ||
| 771 | @defun coding-system-eol-type coding-system | 801 | @defun coding-system-eol-type coding-system |
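The three outcomes of `check-coding-system` described in the new text (valid, `nil`, anything else) can be told apart with `condition-case`. This is an illustration, not part of the commit; the helper name `my-coding-system-status` and the symbol `no-such-coding-system` are made up.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-coding-system-status (coding-system)
  "Classify CODING-SYSTEM as `valid', `none' (nil) or `invalid'."
  (condition-case nil
      (if (check-coding-system coding-system) 'valid 'none)
    ;; Anything that is neither nil nor a coding system ends up here.
    (coding-system-error 'invalid)))

(my-coding-system-status 'utf-8)                  ; valid
(my-coding-system-status nil)                     ; none
(my-coding-system-status 'no-such-coding-system)  ; invalid
```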
| @@ -837,8 +867,9 @@ encode all the character sets in the list @var{charsets}. | |||
| 837 | 867 | ||
| 838 | @defun detect-coding-region start end &optional highest | 868 | @defun detect-coding-region start end &optional highest |
| 839 | This function chooses a plausible coding system for decoding the text | 869 | This function chooses a plausible coding system for decoding the text |
| 840 | from @var{start} to @var{end}. This text should be a byte sequence | 870 | from @var{start} to @var{end}. This text should be a byte sequence, |
| 841 | (@pxref{Explicit Encoding}). | 871 | i.e.@: unibyte text or multibyte text with only @acronym{ASCII} and |
| 872 | eight-bit characters (@pxref{Explicit Encoding}). | ||
| 842 | 873 | ||
| 843 | Normally this function returns a list of coding systems that could | 874 | Normally this function returns a list of coding systems that could |
| 844 | handle decoding the text that was scanned. They are listed in order of | 875 | handle decoding the text that was scanned. They are listed in order of |
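A minimal sketch of feeding a byte sequence to `detect-coding-region`, as the revised text requires. This is an illustration, not part of the commit; `my-detect-candidates` is a hypothetical helper. The list returned depends on the current coding-system priorities, so no particular result is shown.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-detect-candidates (bytes)
  "Return the coding systems `detect-coding-region' proposes for BYTES.
BYTES should be a unibyte string."
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; keep the buffer a byte sequence
    (insert bytes)
    (detect-coding-region (point-min) (point-max))))

;; Three CJK characters encoded as UTF-8 bytes.
(my-detect-candidates
 (encode-coding-string "\u65e5\u672c\u8a9e" 'utf-8))
```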
| @@ -1160,10 +1191,12 @@ in this section. | |||
| 1160 | 1191 | ||
| 1161 | The result of encoding, and the input to decoding, are not ordinary | 1192 | The result of encoding, and the input to decoding, are not ordinary |
| 1162 | text. They logically consist of a series of byte values; that is, a | 1193 | text. They logically consist of a series of byte values; that is, a |
| 1163 | series of characters whose codes are in the range 0 through 255. In a | 1194 | series of @acronym{ASCII} and eight-bit characters. In unibyte |
| 1164 | multibyte buffer or string, character codes 128 through 159 are | 1195 | buffers and strings, these characters have codes in the range 0 |
| 1165 | represented by multibyte sequences, but this is invisible to Lisp | 1196 | through 255. In a multibyte buffer or string, eight-bit characters |
| 1166 | programs. | 1197 | have character codes higher than 255 (@pxref{Text Representations}), |
| 1198 | but Emacs transparently converts them to their single-byte values when | ||
| 1199 | you encode or decode such text. | ||
| 1167 | 1200 | ||
| 1168 | The usual way to read a file into a buffer as a sequence of bytes, so | 1201 | The usual way to read a file into a buffer as a sequence of bytes, so |
| 1169 | you can decode the contents explicitly, is with | 1202 | you can decode the contents explicitly, is with |
| @@ -1181,19 +1214,28 @@ encoding by binding @code{coding-system-for-write} to | |||
| 1181 | Here are the functions to perform explicit encoding or decoding. The | 1214 | Here are the functions to perform explicit encoding or decoding. The |
| 1182 | encoding functions produce sequences of bytes; the decoding functions | 1215 | encoding functions produce sequences of bytes; the decoding functions |
| 1183 | are meant to operate on sequences of bytes. All of these functions | 1216 | are meant to operate on sequences of bytes. All of these functions |
| 1184 | discard text properties. | 1217 | discard text properties. They also set @code{last-coding-system-used} |
| 1218 | to the precise coding system they used. | ||
| 1185 | 1219 | ||
| 1186 | @deffn Command encode-coding-region start end coding-system | 1220 | @deffn Command encode-coding-region start end coding-system &optional destination |
| 1187 | This command encodes the text from @var{start} to @var{end} according | 1221 | This command encodes the text from @var{start} to @var{end} according |
| 1188 | to coding system @var{coding-system}. The encoded text replaces the | 1222 | to coding system @var{coding-system}. Normally, the encoded text |
| 1189 | original text in the buffer. The result of encoding is logically a | 1223 | replaces the original text in the buffer, but the optional argument |
| 1190 | sequence of bytes, but the buffer remains multibyte if it was multibyte | 1224 | @var{destination} can change that. If @var{destination} is a buffer, |
| 1191 | before. | 1225 | the encoded text is inserted in that buffer after point (point does |
| 1192 | 1226 | not move); if it is @code{t}, the command returns the encoded text as | |
| 1193 | This command returns the length of the encoded text. | 1227 | a unibyte string without inserting it. |
| 1228 | |||
| 1229 | If encoded text is inserted in some buffer, this command returns the | ||
| 1230 | length of the encoded text. | ||
| 1231 | |||
| 1232 | The result of encoding is logically a sequence of bytes, but the | ||
| 1233 | buffer remains multibyte if it was multibyte before, and any 8-bit | ||
| 1234 | bytes are converted to their multibyte representation (@pxref{Text | ||
| 1235 | Representations}). | ||
| 1194 | @end deffn | 1236 | @end deffn |
| 1195 | 1237 | ||
| 1196 | @defun encode-coding-string string coding-system &optional nocopy | 1238 | @defun encode-coding-string string coding-system &optional nocopy buffer |
| 1197 | This function encodes the text in @var{string} according to coding | 1239 | This function encodes the text in @var{string} according to coding |
| 1198 | system @var{coding-system}. It returns a new string containing the | 1240 | system @var{coding-system}. It returns a new string containing the |
| 1199 | encoded text, except when @var{nocopy} is non-@code{nil}, in which | 1241 | encoded text, except when @var{nocopy} is non-@code{nil}, in which |
| @@ -1201,24 +1243,36 @@ case the function may return @var{string} itself if the encoding | |||
| 1201 | operation is trivial. The result of encoding is a unibyte string. | 1243 | operation is trivial. The result of encoding is a unibyte string. |
| 1202 | @end defun | 1244 | @end defun |
| 1203 | 1245 | ||
| 1204 | @deffn Command decode-coding-region start end coding-system | 1246 | @deffn Command decode-coding-region start end coding-system &optional destination |
| 1205 | This command decodes the text from @var{start} to @var{end} according | 1247 | This command decodes the text from @var{start} to @var{end} according |
| 1206 | to coding system @var{coding-system}. The decoded text replaces the | 1248 | to coding system @var{coding-system}. To make explicit decoding |
| 1207 | original text in the buffer. To make explicit decoding useful, the text | 1249 | useful, the text before decoding ought to be a sequence of byte |
| 1208 | before decoding ought to be a sequence of byte values, but both | 1250 | values, but both multibyte and unibyte buffers are acceptable (in the |
| 1209 | multibyte and unibyte buffers are acceptable. | 1251 | multibyte case, the raw byte values should be represented as eight-bit |
| 1210 | 1252 | characters). Normally, the decoded text replaces the original text in | |
| 1211 | This command returns the length of the decoded text. | 1253 | the buffer, but the optional argument @var{destination} can change |
| 1254 | that. If @var{destination} is a buffer, the decoded text is inserted | ||
| 1255 | in that buffer after point (point does not move); if it is @code{t}, | ||
| 1256 | the command returns the decoded text as a multibyte string without | ||
| 1257 | inserting it. | ||
| 1258 | |||
| 1259 | If decoded text is inserted in some buffer, this command returns the | ||
| 1260 | length of the decoded text. | ||
| 1212 | @end deffn | 1261 | @end deffn |
| 1213 | 1262 | ||
| 1214 | @defun decode-coding-string string coding-system &optional nocopy | 1263 | @defun decode-coding-string string coding-system &optional nocopy buffer |
| 1215 | This function decodes the text in @var{string} according to coding | 1264 | This function decodes the text in @var{string} according to |
| 1216 | system @var{coding-system}. It returns a new string containing the | 1265 | @var{coding-system}. It returns a new string containing the decoded |
| 1217 | decoded text, except when @var{nocopy} is non-@code{nil}, in which | 1266 | text, except when @var{nocopy} is non-@code{nil}, in which case the |
| 1218 | case the function may return @var{string} itself if the decoding | 1267 | function may return @var{string} itself if the decoding operation is |
| 1219 | operation is trivial. To make explicit decoding useful, the contents | 1268 | trivial. To make explicit decoding useful, the contents of |
| 1220 | of @var{string} ought to be a sequence of byte values, but a multibyte | 1269 | @var{string} ought to be a unibyte string with a sequence of byte |
| 1221 | string is acceptable. | 1270 | values, but a multibyte string is also acceptable (assuming it |
| 1271 | contains 8-bit bytes in their multibyte form). | ||
| 1272 | |||
| 1273 | If optional argument @var{buffer} specifies a buffer, the decoded text | ||
| 1274 | is inserted in that buffer after point (point does not move). In this | ||
| 1275 | case, the return value is the length of the decoded text. | ||
| 1222 | @end defun | 1276 | @end defun |
| 1223 | 1277 | ||
| 1224 | @defun decode-coding-inserted-region from to filename &optional visit beg end replace | 1278 | @defun decode-coding-inserted-region from to filename &optional visit beg end replace |
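The new @var{destination} and @var{buffer} arguments documented in this hunk can be exercised as follows. This is an illustration, not part of the commit; both helper names are hypothetical.

```elisp
;; Hypothetical helpers, not part of Emacs.
(defun my-encode-buffer-text (text coding-system)
  "Encode TEXT from a temporary buffer, leaving the buffer untouched.
Return a cons (ENCODED-STRING . BUFFER-TEXT-AFTERWARDS)."
  (with-temp-buffer
    (insert text)
    ;; DESTINATION t: return the bytes as a unibyte string instead of
    ;; replacing the region.
    (cons (encode-coding-region (point-min) (point-max) coding-system t)
          (buffer-string))))

(defun my-decode-into-temp-buffer (bytes coding-system)
  "Decode BYTES into a temporary buffer.
Return a cons (RETURN-VALUE . BUFFER-TEXT)."
  (with-temp-buffer
    ;; BUFFER given: the decoded text is inserted there and the return
    ;; value is its length.
    (cons (decode-coding-string bytes coding-system nil (current-buffer))
          (buffer-string))))

(my-encode-buffer-text "h\u00e9llo" 'utf-8)
;; => ("h\303\251llo" . "h\u00e9llo")
(my-decode-into-temp-buffer "h\303\251llo" 'utf-8)
;; => (5 . "h\u00e9llo")
```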
| @@ -1236,10 +1290,10 @@ decoding, you can call this function. | |||
| 1236 | @subsection Terminal I/O Encoding | 1290 | @subsection Terminal I/O Encoding |
| 1237 | 1291 | ||
| 1238 | Emacs can decode keyboard input using a coding system, and encode | 1292 | Emacs can decode keyboard input using a coding system, and encode |
| 1239 | terminal output. This is useful for terminals that transmit or display | 1293 | terminal output. This is useful for terminals that transmit or |
| 1240 | text using a particular encoding such as Latin-1. Emacs does not set | 1294 | display text using a particular encoding such as Latin-1. Emacs does |
| 1241 | @code{last-coding-system-used} for encoding or decoding for the | 1295 | not set @code{last-coding-system-used} for encoding or decoding of |
| 1242 | terminal. | 1296 | terminal I/O. |
| 1243 | 1297 | ||
| 1244 | @defun keyboard-coding-system | 1298 | @defun keyboard-coding-system |
| 1245 | This function returns the coding system that is in use for decoding | 1299 | This function returns the coding system that is in use for decoding |