| author | Eli Zaretskii | 2008-11-01 16:36:10 +0000 |
|---|---|---|
| committer | Eli Zaretskii | 2008-11-01 16:36:10 +0000 |
| commit | c4526e933cdf0e55387767b32b2f18c0abbdae70 (patch) | |
| tree | b5f030325cd5425babe61acf5fa089420ade697a | |
| parent | d41784eef44d7c34becb4c35f29ac1215dfb15ab (diff) | |
(Text Representations): Rewrite to make consistent with Emacs 23
internal representation of characters. Document `unibyte-string'.
| -rw-r--r-- | doc/lispref/ChangeLog | 6 | ||||
| -rw-r--r-- | doc/lispref/nonascii.texi | 112 | ||||
| -rw-r--r-- | etc/NEWS | 2 |
3 files changed, 78 insertions, 42 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog index 68d4996a39b..0037eccc6b5 100644 --- a/doc/lispref/ChangeLog +++ b/doc/lispref/ChangeLog | |||
| @@ -1,3 +1,9 @@ | |||
| 1 | 2008-11-01 Eli Zaretskii <eliz@gnu.org> | ||
| 2 | |||
| 3 | * nonascii.texi (Text Representations): Rewrite to make consistent | ||
| 4 | with Emacs 23 internal representation of characters. Document | ||
| 5 | `unibyte-string'. | ||
| 6 | |||
| 1 | 2008-10-28 Chong Yidong <cyd@stupidchicken.com> | 7 | 2008-10-28 Chong Yidong <cyd@stupidchicken.com> |
| 2 | 8 | ||
| 3 | * processes.texi (Process Information): Note that process-status | 9 | * processes.texi (Process Information): Note that process-status |
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi index 4a8205c178d..c70f8e56973 100644 --- a/doc/lispref/nonascii.texi +++ b/doc/lispref/nonascii.texi | |||
| @@ -10,11 +10,11 @@ | |||
| 10 | @cindex characters, multi-byte | 10 | @cindex characters, multi-byte |
| 11 | @cindex non-@acronym{ASCII} characters | 11 | @cindex non-@acronym{ASCII} characters |
| 12 | 12 | ||
| 13 | This chapter covers the special issues relating to non-@acronym{ASCII} | 13 | This chapter covers the special issues relating to characters and |
| 14 | characters and how they are stored in strings and buffers. | 14 | how they are stored in strings and buffers. |
| 15 | 15 | ||
| 16 | @menu | 16 | @menu |
| 17 | * Text Representations:: Unibyte and multibyte representations | 17 | * Text Representations:: How Emacs represents text. |
| 18 | * Converting Representations:: Converting unibyte to multibyte and vice versa. | 18 | * Converting Representations:: Converting unibyte to multibyte and vice versa. |
| 19 | * Selecting a Representation:: Treating a byte sequence as unibyte or multi. | 19 | * Selecting a Representation:: Treating a byte sequence as unibyte or multi. |
| 20 | * Character Codes:: How unibyte and multibyte relate to | 20 | * Character Codes:: How unibyte and multibyte relate to |
| @@ -33,41 +33,62 @@ characters and how they are stored in strings and buffers. | |||
| 33 | 33 | ||
| 34 | @node Text Representations | 34 | @node Text Representations |
| 35 | @section Text Representations | 35 | @section Text Representations |
| 36 | @cindex text representations | 36 | @cindex text representation |
| 37 | 37 | ||
| 38 | Emacs has two @dfn{text representations}---two ways to represent text | 38 | Emacs buffers and strings support a large repertoire of characters |
| 39 | in a string or buffer. These are called @dfn{unibyte} and | 39 | from many different scripts. This is so users could type and display |
| 40 | @dfn{multibyte}. Each string, and each buffer, uses one of these two | 40 | text in most any known written language. |
| 41 | representations. For most purposes, you can ignore the issue of | 41 | |
| 42 | representations, because Emacs converts text between them as | 42 | @cindex character codepoint |
| 43 | appropriate. Occasionally in Lisp programming you will need to pay | 43 | @cindex codespace |
| 44 | attention to the difference. | 44 | @cindex Unicode |
| 45 | To support this multitude of characters and scripts, Emacs closely | ||
| 46 | follows the @dfn{Unicode Standard}. The Unicode Standard assigns a | ||
| 47 | unique number, called a @dfn{codepoint}, to each and every character. | ||
| 48 | The range of codepoints defined by Unicode, or the Unicode | ||
| 49 | @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs | ||
| 50 | extends this range with codepoints in the range @code{3FFF80..3FFFFF}, | ||
| 51 | which it uses for representing raw 8-bit bytes that cannot be | ||
| 52 | interpreted as characters. Thus, a character codepoint in Emacs is a | ||
| 53 | 22-bit integer number. | ||
| 54 | |||
| 55 | @cindex internal representation of characters | ||
| 56 | @cindex characters, representation in buffers and strings | ||
| 57 | @cindex multibyte text | ||
| 58 | To conserve memory, Emacs does not hold fixed-length 22-bit numbers | ||
| 59 | that are codepoints of text characters within buffers and strings. | ||
| 60 | Rather, Emacs uses a variable-length internal representation of | ||
| 61 | characters, that stores each character as a sequence of 1 to 5 8-bit | ||
| 62 | bytes, depending on the magnitude of its codepoint@footnote{ | ||
| 63 | This internal representation is based on one of the encodings defined | ||
| 64 | by the Unicode Standard, called @dfn{UTF-8}, for representing any | ||
| 65 | Unicode codepoint, but Emacs extends UTF-8 to represent the additional | ||
| 66 | codepoints it uses for raw 8-bit bytes.}. | ||
| 67 | For example, any @acronym{ASCII} character takes up only 1 byte, a | ||
| 68 | Latin-1 character takes up 2 bytes, etc. We call this representation | ||
| 69 | of text @dfn{multibyte}, because it uses several bytes for each | ||
| 70 | character. | ||
| 71 | |||
| 72 | Outside Emacs, characters can be represented in many different | ||
| 73 | encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts | ||
| 74 | between these external encodings and the internal representation, as | ||
| 75 | appropriate, when it reads text into a buffer or a string, or when it | ||
| 76 | writes text to a disk file or passes it to some other process. | ||
| 77 | |||
| 78 | Occasionally, Emacs needs to hold and manipulate encoded text or | ||
| 79 | binary non-text data in its buffer or string. For example, when Emacs | ||
| 80 | visits a file, it first reads the file's text verbatim into a buffer, | ||
| 81 | and only then converts it to the internal representation. Before the | ||
| 82 | conversion, the buffer holds encoded text. | ||
| 45 | 83 | ||
| 46 | @cindex unibyte text | 84 | @cindex unibyte text |
| 47 | In unibyte representation, each character occupies one byte and | 85 | Encoded text is not really text, as far as Emacs is concerned, but |
| 48 | therefore the possible character codes range from 0 to 255. Codes 0 | 86 | rather a sequence of raw 8-bit bytes. We call buffers and strings |
| 49 | through 127 are @acronym{ASCII} characters; the codes from 128 through 255 | 87 | that hold encoded text @dfn{unibyte} buffers and strings, because |
| 50 | are used for one non-@acronym{ASCII} character set (you can choose which | 88 | Emacs treats them as a sequence of individual bytes. In particular, |
| 51 | character set by setting the variable @code{nonascii-insert-offset}). | 89 | Emacs usually displays unibyte buffers and strings as octal codes such |
| 52 | 90 | as @code{\237}. We recommend that you never use unibyte buffers and | |
| 53 | @cindex leading code | 91 | strings except for manipulating encoded text or binary non-text data. |
| 54 | @cindex multibyte text | ||
| 55 | @cindex trailing codes | ||
| 56 | In multibyte representation, a character may occupy more than one | ||
| 57 | byte, and as a result, the full range of Emacs character codes can be | ||
| 58 | stored. The first byte of a multibyte character is always in the range | ||
| 59 | 128 through 159 (octal 0200 through 0237). These values are called | ||
| 60 | @dfn{leading codes}. The second and subsequent bytes of a multibyte | ||
| 61 | character are always in the range 160 through 255 (octal 0240 through | ||
| 62 | 0377); these values are @dfn{trailing codes}. | ||
| 63 | |||
| 64 | Some sequences of bytes are not valid in multibyte text: for example, | ||
| 65 | a single isolated byte in the range 128 through 159 is not allowed. But | ||
| 66 | character codes 128 through 159 can appear in multibyte text, | ||
| 67 | represented as two-byte sequences. All the character codes 128 through | ||
| 68 | 255 are possible (though slightly abnormal) in multibyte text; they | ||
| 69 | appear in multibyte buffers and strings when you do explicit encoding | ||
| 70 | and decoding (@pxref{Explicit Encoding}). | ||
| 71 | 92 | ||
| 72 | In a buffer, the buffer-local value of the variable | 93 | In a buffer, the buffer-local value of the variable |
| 73 | @code{enable-multibyte-characters} specifies the representation used. | 94 | @code{enable-multibyte-characters} specifies the representation used. |
| @@ -77,7 +98,7 @@ when the string is constructed. | |||
| 77 | @defvar enable-multibyte-characters | 98 | @defvar enable-multibyte-characters |
| 78 | This variable specifies the current buffer's text representation. | 99 | This variable specifies the current buffer's text representation. |
| 79 | If it is non-@code{nil}, the buffer contains multibyte text; otherwise, | 100 | If it is non-@code{nil}, the buffer contains multibyte text; otherwise, |
| 80 | it contains unibyte text. | 101 | it contains unibyte encoded text or binary non-text data. |
| 81 | 102 | ||
| 82 | You cannot set this variable directly; instead, use the function | 103 | You cannot set this variable directly; instead, use the function |
| 83 | @code{set-buffer-multibyte} to change a buffer's representation. | 104 | @code{set-buffer-multibyte} to change a buffer's representation. |
| @@ -96,20 +117,22 @@ default value to @code{nil} early in startup. | |||
| 96 | @end defvar | 117 | @end defvar |
| 97 | 118 | ||
| 98 | @defun position-bytes position | 119 | @defun position-bytes position |
| 99 | Return the byte-position corresponding to buffer position | 120 | Buffer positions are measured in character units. This function |
| 121 | returns the byte-position corresponding to buffer position | ||
| 100 | @var{position} in the current buffer. This is 1 at the start of the | 122 | @var{position} in the current buffer. This is 1 at the start of the |
| 101 | buffer, and counts upward in bytes. If @var{position} is out of | 123 | buffer, and counts upward in bytes. If @var{position} is out of |
| 102 | range, the value is @code{nil}. | 124 | range, the value is @code{nil}. |
| 103 | @end defun | 125 | @end defun |
| 104 | 126 | ||
| 105 | @defun byte-to-position byte-position | 127 | @defun byte-to-position byte-position |
| 106 | Return the buffer position corresponding to byte-position | 128 | Return the buffer position, in character units, corresponding to |
| 107 | @var{byte-position} in the current buffer. If @var{byte-position} is | 129 | byte-position @var{byte-position} in the current buffer. If |
| 108 | out of range, the value is @code{nil}. | 130 | @var{byte-position} is out of range, the value is @code{nil}. |
| 109 | @end defun | 131 | @end defun |
| 110 | 132 | ||
| 111 | @defun multibyte-string-p string | 133 | @defun multibyte-string-p string |
| 112 | Return @code{t} if @var{string} is a multibyte string. | 134 | Return @code{t} if @var{string} is a multibyte string, @code{nil} |
| 135 | otherwise. | ||
| 113 | @end defun | 136 | @end defun |
| 114 | 137 | ||
| 115 | @defun string-bytes string | 138 | @defun string-bytes string |
| @@ -119,6 +142,11 @@ If @var{string} is a multibyte string, this can be greater than | |||
| 119 | @code{(length @var{string})}. | 142 | @code{(length @var{string})}. |
| 120 | @end defun | 143 | @end defun |
| 121 | 144 | ||
| 145 | @defun unibyte-string &rest bytes | ||
| 146 | This function concatenates all its argument @var{bytes} and makes the | ||
| 147 | result a unibyte string. | ||
| 148 | @end defun | ||
| 149 | |||
| 122 | @node Converting Representations | 150 | @node Converting Representations |
| 123 | @section Converting Text Representations | 151 | @section Converting Text Representations |
| 124 | 152 | ||
| @@ -1347,6 +1347,7 @@ returns its output as a list of lines. | |||
| 1347 | 1347 | ||
| 1348 | ** Character code, representation, and charset changes. | 1348 | ** Character code, representation, and charset changes. |
| 1349 | 1349 | ||
| 1350 | +++ | ||
| 1350 | The character code space is now 0x0..0x3FFFFF with no gap. | 1351 | The character code space is now 0x0..0x3FFFFF with no gap. |
| 1351 | Characters of code 0x0..0x10FFFF are Unicode characters of the same code points. | 1352 | Characters of code 0x0..0x10FFFF are Unicode characters of the same code points. |
| 1352 | Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes. | 1353 | Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes. |
| @@ -1354,6 +1355,7 @@ Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes. | |||
| 1354 | +++ | 1355 | +++ |
| 1355 | Generic characters no longer exist. | 1356 | Generic characters no longer exist. |
| 1356 | 1357 | ||
| 1358 | +++ | ||
| 1357 | In buffers and strings, characters are represented by UTF-8 byte | 1359 | In buffers and strings, characters are represented by UTF-8 byte |
| 1358 | sequences in a multibyte buffer/string. | 1360 | sequences in a multibyte buffer/string. |
| 1359 | 1361 | ||
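
The variable-length representation described in the nonascii.texi hunk can be illustrated outside Emacs. The following is a minimal sketch in Python that uses the language's standard UTF-8 encoder, not Emacs's internal representation. The helper names `utf8_length` and `byte_position` are hypothetical and only mimic, by analogy, what `string-bytes` and `position-bytes` report. The sketch covers Unicode codepoints only: the raw-byte range 3FFF80..3FFFFF is an Emacs extension that standard UTF-8 does not encode, and surrogate codepoints are rejected by Python's encoder.

```python
from typing import Optional


def utf8_length(codepoint: int) -> int:
    """Number of bytes standard UTF-8 uses for one Unicode codepoint."""
    return len(chr(codepoint).encode("utf-8"))


def byte_position(text: str, position: int) -> Optional[int]:
    """Map a 1-based character position to a 1-based byte position.

    Returns None when the position is out of range, by analogy with the
    nil result documented for position-bytes.
    """
    if position < 1 or position > len(text) + 1:
        return None
    # Bytes occupied by all characters before `position`, plus one.
    return len(text[: position - 1].encode("utf-8")) + 1


sample = "na\u00efve"  # four ASCII characters plus one Latin-1 character
sizes = [utf8_length(ord(ch)) for ch in sample]

print(sizes)                                     # bytes per character
print(len(sample), len(sample.encode("utf-8")))  # characters vs. bytes
print(byte_position(sample, 4))                  # byte position of "v"
```

Character positions and byte positions coincide only while the text is pure ASCII. After the first two-byte character they diverge, which is why the documentation distinguishes positions in character units from byte positions.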