(Text Representations): Rewrite to make consistent with Emacs 23 internal representation of characters. Document `unibyte-string'.
author: Eli Zaretskii 2008-11-01 16:36:10 +0000
committer: Eli Zaretskii 2008-11-01 16:36:10 +0000
commit: c4526e933cdf0e55387767b32b2f18c0abbdae70 (patch)
tree: b5f030325cd5425babe61acf5fa089420ade697a
parent: d41784eef44d7c34becb4c35f29ac1215dfb15ab (diff)
3 files changed, 78 insertions, 42 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index 68d4996a39b..0037eccc6b5 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,9 @@
+2008-11-01  Eli Zaretskii  <eliz@gnu.org>
+        * nonascii.texi (Text Representations): Rewrite to make consistent
+        with Emacs 23 internal representation of characters.  Document
+        `unibyte-string'.
 2008-10-28  Chong Yidong  <cyd@stupidchicken.com>
        * processes.texi (Process Information): Note that process-status
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 4a8205c178d..c70f8e56973 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -10,11 +10,11 @@
 @cindex characters, multi-byte
 @cindex non-@acronym{ASCII} characters
-  This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+  This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
 @menu
-* Text Representations::    Unibyte and multibyte representations
+* Text Representations::    How Emacs represents text.
 * Converting Representations::  Converting unibyte to multibyte and vice versa.
 * Selecting a Representation::  Treating a byte sequence as unibyte or multi.
 * Character Codes::         How unibyte and multibyte relate to
@@ -33,41 +33,62 @@ characters and how they are stored in strings and buffers.
 @node Text Representations
 @section Text Representations
-@cindex text representations
+@cindex text representation
-  Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer.  These are called @dfn{unibyte} and
-@dfn{multibyte}.  Each string, and each buffer, uses one of these two
-representations.  For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate.  Occasionally in Lisp programming you will need to pay
-attention to the difference.
+  Emacs buffers and strings support a large repertoire of characters
+from many different scripts.  This is so users could type and display
+text in most any known written language.
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+  To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive.  Emacs
+extends this range with codepoints in the range @code{3FFF80..3FFFFF},
+which it uses for representing raw 8-bit bytes that cannot be
+interpreted as characters.  Thus, a character codepoint in Emacs is a
+22-bit integer number.
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
+that are codepoints of text characters within buffers and strings.
+Rather, Emacs uses a variable-length internal representation of
+characters, that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes.}.
+For example, any @acronym{ASCII} character takes up only 1 byte, a
+Latin-1 character takes up 2 bytes, etc.  We call this representation
+of text @dfn{multibyte}, because it uses several bytes for each
+character.
+  Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
+between these external encodings and the internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+  Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffer or string.  For example, when Emacs
+visits a file, it first reads the file's text verbatim into a buffer,
+and only then converts it to the internal representation.  Before the
+conversion, the buffer holds encoded text.
 @cindex unibyte text
-  In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255.  Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
-  In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored.  The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237).  These values are called
-@dfn{leading codes}.  The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-  Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed.  But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences.  All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+  Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes.  We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes.  In particular,
+Emacs usually displays unibyte buffers and strings as octal codes such
+as @code{\237}.  We recommend that you never use unibyte buffers and
+strings except for manipulating encoded text or binary non-text data.
  In a buffer, the buffer-local value of the variable
 @code{enable-multibyte-characters} specifies the representation used.
@@ -77,7 +98,7 @@ when the string is constructed.
 @defvar enable-multibyte-characters
 This variable specifies the current buffer's text representation.
 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
 You cannot set this variable directly; instead, use the function
 @code{set-buffer-multibyte} to change a buffer's representation.
@@ -96,20 +117,22 @@ default value to @code{nil} early in startup.
 @end defvar
 @defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units.  This function
+returns the byte-position corresponding to buffer position
 @var{position} in the current buffer.  This is 1 at the start of the
 buffer, and counts upward in bytes.  If @var{position} is out of
 range, the value is @code{nil}.
 @end defun
 @defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
-@var{byte-position} in the current buffer.  If @var{byte-position} is
-out of range, the value is @code{nil}.
+Return the buffer position, in character units, corresponding to
+byte-position @var{byte-position} in the current buffer.  If
+@var{byte-position} is out of range, the value is @code{nil}.
 @end defun
 @defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
 @end defun
 @defun string-bytes string
@@ -119,6 +142,11 @@ If @var{string} is a multibyte string, this can be greater than
 @code{(length @var{string})}.
 @end defun
+@defun unibyte-string &rest bytes
+This function concatenates all its argument @var{bytes} and makes the
+result a unibyte string.
+@end defun
 @node Converting Representations
 @section Converting Text Representations
diff --git a/etc/NEWS b/etc/NEWS
index 6e4273cae42..b0f2177e547 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1347,6 +1347,7 @@ returns its output as a list of lines.
 ** Character code, representation, and charset changes.
+++
 The character code space is now 0x0..0x3FFFFF with no gap.
 Characters of code 0x0..0x10FFFF are Unicode characters of the same code points.
 Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
@@ -1354,6 +1355,7 @@ Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
 +++
 Generic characters no longer exist.
+++
 In buffers and strings, characters are represented by UTF-8 byte
 sequences in a multibyte buffer/string.
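
The behaviour described by the rewritten node can be checked interactively. The following is only a minimal sketch, not part of this commit: `text-rep-demo' is a hypothetical helper name, and the expected values in the comments assume Emacs 23 or later with `enable-multibyte-characters' left at its default, so that temporary buffers are multibyte.

```elisp
;; Hypothetical helper for illustration; not part of Emacs or of this commit.
(defun text-rep-demo ()
  "Return a list of facts about multibyte and unibyte text."
  (let* ((latin1 (string #xE9))             ; "é", U+00E9, a Latin-1 character
         (raw (unibyte-string #xC3 #xA9))   ; the UTF-8 encoding of é, as raw bytes
         (decoded (decode-coding-string raw 'utf-8)))
    (list
     (string-bytes "a")            ; 1: an ASCII character takes up 1 byte
     (string-bytes latin1)         ; 2: a Latin-1 character takes up 2 bytes
     (length latin1)               ; 1: but it is still one character
     (multibyte-string-p latin1)   ; t
     (multibyte-string-p raw)      ; nil: `unibyte-string' makes a unibyte string
     (length raw)                  ; 2: two individual bytes
     (length decoded)              ; 1: decoding turns them into one character
     (with-temp-buffer             ; assumed multibyte (the default)
       (insert latin1 "a")
       (list (position-bytes 2)    ; 3: position 2 follows the 2-byte é
             (byte-to-position 3))))))  ; 2: byte 3 maps back to position 2
```

Evaluating `(text-rep-demo)' in `*scratch*' shows the difference between character units (`length', buffer positions) and byte units (`string-bytes', `position-bytes') that the patched text draws.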