author Eli Zaretskii 2008-11-01 16:36:10 +0000
committer Eli Zaretskii 2008-11-01 16:36:10 +0000
commit c4526e933cdf0e55387767b32b2f18c0abbdae70
tree b5f030325cd5425babe61acf5fa089420ade697a
parent d41784eef44d7c34becb4c35f29ac1215dfb15ab
(Text Representations): Rewrite to make consistent with Emacs 23
internal representation of characters. Document `unibyte-string'.
-rw-r--r-- doc/lispref/ChangeLog 6
-rw-r--r-- doc/lispref/nonascii.texi 112
-rw-r--r-- etc/NEWS 2
3 files changed, 78 insertions, 42 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index 68d4996a39b..0037eccc6b5 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,9 @@
+2008-11-01 Eli Zaretskii <eliz@gnu.org>
+
+ * nonascii.texi (Text Representations): Rewrite to make consistent
+ with Emacs 23 internal representation of characters. Document
+ `unibyte-string'.
+
 2008-10-28 Chong Yidong <cyd@stupidchicken.com>
 
  * processes.texi (Process Information): Note that process-status
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 4a8205c178d..c70f8e56973 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -10,11 +10,11 @@
 @cindex characters, multi-byte
 @cindex non-@acronym{ASCII} characters
 
- This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+ This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
 
 @menu
-* Text Representations:: Unibyte and multibyte representations
+* Text Representations:: How Emacs represents text.
 * Converting Representations:: Converting unibyte to multibyte and vice versa.
 * Selecting a Representation:: Treating a byte sequence as unibyte or multi.
 * Character Codes:: How unibyte and multibyte relate to
@@ -33,41 +33,62 @@ characters and how they are stored in strings and buffers.
 
 @node Text Representations
 @section Text Representations
-@cindex text representations
+@cindex text representation
 
- Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer. These are called @dfn{unibyte} and
-@dfn{multibyte}. Each string, and each buffer, uses one of these two
-representations. For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate. Occasionally in Lisp programming you will need to pay
-attention to the difference.
+ Emacs buffers and strings support a large repertoire of characters
+from many different scripts. This is so users could type and display
+text in most any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+ To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
+extends this range with codepoints in the range @code{3FFF80..3FFFFF},
+which it uses for representing raw 8-bit bytes that cannot be
+interpreted as characters. Thus, a character codepoint in Emacs is a
+22-bit integer number.
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+ To conserve memory, Emacs does not hold fixed-length 22-bit numbers
+that are codepoints of text characters within buffers and strings.
+Rather, Emacs uses a variable-length internal representation of
+characters, that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes.}.
+For example, any @acronym{ASCII} character takes up only 1 byte, a
+Latin-1 character takes up 2 bytes, etc. We call this representation
+of text @dfn{multibyte}, because it uses several bytes for each
+character.
+
+ Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
+between these external encodings and the internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+ Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffer or string. For example, when Emacs
+visits a file, it first reads the file's text verbatim into a buffer,
+and only then converts it to the internal representation. Before the
+conversion, the buffer holds encoded text.
 
 @cindex unibyte text
- In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255. Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
- In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored. The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237). These values are called
-@dfn{leading codes}. The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
- Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed. But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences. All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+ Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes. We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes. In particular,
+Emacs usually displays unibyte buffers and strings as octal codes such
+as @code{\237}. We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
 
  In a buffer, the buffer-local value of the variable
 @code{enable-multibyte-characters} specifies the representation used.
@@ -77,7 +98,7 @@ when the string is constructed.
 @defvar enable-multibyte-characters
 This variable specifies the current buffer's text representation.
 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
 
 You cannot set this variable directly; instead, use the function
 @code{set-buffer-multibyte} to change a buffer's representation.
@@ -96,20 +117,22 @@ default value to @code{nil} early in startup.
 @end defvar
 
 @defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units. This function
+returns the byte-position corresponding to buffer position
 @var{position} in the current buffer. This is 1 at the start of the
 buffer, and counts upward in bytes. If @var{position} is out of
 range, the value is @code{nil}.
 @end defun
 
 @defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
-@var{byte-position} in the current buffer. If @var{byte-position} is
-out of range, the value is @code{nil}.
+Return the buffer position, in character units, corresponding to
+byte-position @var{byte-position} in the current buffer. If
+@var{byte-position} is out of range, the value is @code{nil}.
 @end defun
 
 @defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
 @end defun
 
 @defun string-bytes string
@@ -119,6 +142,11 @@ If @var{string} is a multibyte string, this can be greater than
 @code{(length @var{string})}.
 @end defun
 
+@defun unibyte-string &rest bytes
+This function concatenates all its argument @var{bytes} and makes the
+result a unibyte string.
+@end defun
+
 @node Converting Representations
 @section Converting Text Representations
 
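The rewritten Text Representations section says that multibyte text stores each character in a variable number of bytes, based on UTF-8, with ASCII taking 1 byte and Latin-1 taking 2. The sketch below only illustrates that size rule for ordinary Unicode codepoints, using Python rather than Emacs Lisp. The function name `utf8_length` is made up for this illustration, and the Emacs-specific extension for raw 8-bit bytes is deliberately not modeled.

```python
# Illustrative model only: byte counts of standard UTF-8, the encoding on
# which the patch says the Emacs internal representation is based.
# This is not Emacs code, and it ignores Emacs's extension for raw bytes.

UNICODE_MAX = 0x10FFFF  # top of the Unicode codespace, as stated in the patch


def utf8_length(codepoint: int) -> int:
    """Return how many bytes standard UTF-8 needs for a Unicode codepoint."""
    if not 0 <= codepoint <= UNICODE_MAX:
        raise ValueError("outside the Unicode codespace")
    if codepoint < 0x80:       # ASCII
        return 1
    if codepoint < 0x800:      # includes the Latin-1 range 0x80..0xFF
        return 2
    if codepoint < 0x10000:    # rest of the Basic Multilingual Plane
        return 3
    return 4                   # supplementary planes


if __name__ == "__main__":
    # Cross-check the size rule against Python's own UTF-8 encoder.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
        print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```

This size difference is why the patch distinguishes character positions from byte positions (`position-bytes`, `byte-to-position`) and why `string-bytes` can exceed `length` for a multibyte string.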
diff --git a/etc/NEWS b/etc/NEWS
index 6e4273cae42..b0f2177e547 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1347,6 +1347,7 @@ returns its output as a list of lines.
 
 ** Character code, representation, and charset changes.
 
++++
 The character code space is now 0x0..0x3FFFFF with no gap.
 Characters of code 0x0..0x10FFFF are Unicode characters of the same code points.
 Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
@@ -1354,6 +1355,7 @@ Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
 +++
 Generic characters no longer exist.
 
++++
 In buffers and strings, characters are represented by UTF-8 byte
 sequences in a multibyte buffer/string.
 
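The NEWS entries above describe three ranges: the whole code space 0x0..0x3FFFFF, Unicode characters at 0x0..0x10FFFF, and raw 8-bit bytes at 0x3FFF80..0x3FFFFF. The Python sketch below restates those ranges as a small classifier. The names `classify` and `raw_byte_to_code` are hypothetical. The byte-to-code mapping is an assumption inferred from the raw-byte range holding exactly 128 codes (one for each byte value 0x80..0xFF); the diff itself does not state it.

```python
# Illustrative model of the code space described in etc/NEWS; not Emacs code.

UNICODE_MAX = 0x10FFFF   # last Unicode codepoint (from NEWS)
RAW_BYTE_MIN = 0x3FFF80  # first raw 8-bit byte code (from NEWS)
CODE_MAX = 0x3FFFFF      # last code in the Emacs code space (from NEWS)


def classify(code: int) -> str:
    """Name the range of the code space that a character code falls in."""
    if not 0 <= code <= CODE_MAX:
        raise ValueError("outside the 0x0..0x3FFFFF code space")
    if code <= UNICODE_MAX:
        return "unicode"
    if code >= RAW_BYTE_MIN:
        return "raw-byte"
    # Codes between the two ranges are valid but not described by the diff.
    return "other"


def raw_byte_to_code(byte: int) -> int:
    """Map a byte 0x80..0xFF to a raw-byte code.

    ASSUMPTION: the mapping is a plain offset, so that 0x80 lands on
    0x3FFF80 and 0xFF on 0x3FFFFF. The diff gives only the range.
    """
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are modeled here")
    return RAW_BYTE_MIN + (byte - 0x80)


if __name__ == "__main__":
    # The top of the code space fits in 22 bits, matching the manual text.
    assert CODE_MAX.bit_length() == 22
    print(classify(0x41), classify(0x3FFF9F), hex(raw_byte_to_code(0x9F)))
```

The 22-bit check matches the statement in nonascii.texi that a character codepoint in Emacs is a 22-bit integer.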