author: Chong Yidong, 2009-04-10 01:16:27 +0000
committer: Chong Yidong, 2009-04-10 01:16:27 +0000
commit: 97d8273fa2687731d652687cf6b4c7c48dd0661a
tree: d65f28226463eb7dd59f7c4aa925db18d2d8dc3c
parent: c872c51e2b8805ca4ee674ee7600f5b914492a68
* nonascii.texi (Text Representations): Copyedits.
(Coding System Basics): Also mention utf-8-emacs.
(Converting Representations, Selecting a Representation)
(Scanning Charsets, Translation of Characters, Encoding and I/O): Copyedits.
(Character Codes): Mention role of codepoints 1114112 to 4194175.
-rw-r--r--  doc/lispref/ChangeLog        9
-rw-r--r--  doc/lispref/nonascii.texi  153
2 files changed, 83 insertions, 79 deletions
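
Note on the Character Codes change: the revised text splits the character code space into four ranges. The following is a minimal Python sketch of that split, not code from Emacs itself. The helper name `classify_char_code` and its labels are hypothetical; only the numeric boundaries are taken from the patched paragraph.

```python
# Range boundaries quoted from the Character Codes paragraph in this commit.
MAX_ASCII = 127
MAX_UNICODE = 0x10FFFF       # 1114111
MAX_NON_UNIFIED = 0x3FFF7F   # 4194175
MAX_CHAR = 0x3FFFFF          # 4194303


def classify_char_code(code: int) -> str:
    """Label the range that CODE falls in (hypothetical helper and labels)."""
    if not 0 <= code <= MAX_CHAR:
        return "invalid"
    if code <= MAX_ASCII:
        return "ascii"
    if code <= MAX_UNICODE:
        return "unicode"
    if code <= MAX_NON_UNIFIED:
        return "not-unified"
    return "raw-byte"


if __name__ == "__main__":
    for code in (65, 0x10FFFF, 1114112, 4194175, 4194176, 4194303):
        print(code, classify_char_code(code))
```

ASCII codes are also Unicode codepoints; the sketch reports them separately only because the paragraph names the 0 through 127 range on its own.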
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index 50e87de8332..283598c2137 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,12 @@
+2009-04-10 Chong Yidong <cyd@stupidchicken.com>
+
+ * nonascii.texi (Text Representations): Copyedits.
+ (Coding System Basics): Also mention utf-8-emacs.
+ (Converting Representations, Selecting a Representation)
+ (Scanning Charsets, Translation of Characters, Encoding and I/O):
+ Copyedits.
+ (Character Codes): Mention role of codepoints 1114112 to 4194175.
+
 2009-04-09 Chong Yidong <cyd@stupidchicken.com>
 
  * text.texi (Yank Commands): Note that yank uses push-mark.
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 478a9eca060..818cc096b83 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -36,8 +36,8 @@ how they are stored in strings and buffers.
 @cindex text representation
 
  Emacs buffers and strings support a large repertoire of characters
-from many different scripts. This is so users could type and display
-text in most any known written language.
+from many different scripts, allowing users to type and display text
+in most any known written language.
 
 @cindex character codepoint
 @cindex codespace
@@ -65,15 +65,13 @@ This internal representation is based on one of the encodings defined
 by the Unicode Standard, called @dfn{UTF-8}, for representing any
 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
 codepoints it uses for raw 8-bit bytes and characters not unified with
-Unicode.}.
-For example, any @acronym{ASCII} character takes up only 1 byte, a
-Latin-1 character takes up 2 bytes, etc. We call this representation
-of text @dfn{multibyte}, because it uses several bytes for each
-character.
+Unicode.}. For example, any @acronym{ASCII} character takes up only 1
+byte, a Latin-1 character takes up 2 bytes, etc. We call this
+representation of text @dfn{multibyte}.
 
  Outside Emacs, characters can be represented in many different
 encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
-between these external encodings and the internal representation, as
+between these external encodings and its internal representation, as
 appropriate, when it reads text into a buffer or a string, or when it
 writes text to a disk file or passes it to some other process.
 
@@ -87,9 +85,9 @@ Before the conversion, the buffer holds encoded text.
  Encoded text is not really text, as far as Emacs is concerned, but
 rather a sequence of raw 8-bit bytes. We call buffers and strings
 that hold encoded text @dfn{unibyte} buffers and strings, because
-Emacs treats them as a sequence of individual bytes. In particular,
-Emacs usually displays unibyte buffers and strings as octal codes such
-as @code{\237}. We recommend that you never use unibyte buffers and
+Emacs treats them as a sequence of individual bytes. Usually, Emacs
+displays unibyte buffers and strings as octal codes such as
+@code{\237}. We recommend that you never use unibyte buffers and
 strings except for manipulating encoded text or binary non-text data.
 
  In a buffer, the buffer-local value of the variable
@@ -165,10 +163,10 @@ conversions happen when inserting text into a buffer, or when putting
 text from several strings together in one string. You can also
 explicitly convert a string's contents to either representation.
 
- Emacs chooses the representation for a string based on the text that
-it is constructed from. The general rule is to convert unibyte text to
-multibyte text when combining it with other multibyte text, because the
-multibyte representation is more general and can hold whatever
+ Emacs chooses the representation for a string based on the text from
+which it is constructed. The general rule is to convert unibyte text
+to multibyte text when combining it with other multibyte text, because
+the multibyte representation is more general and can hold whatever
 characters the unibyte text has.
 
  When inserting text into a buffer, Emacs converts the text to the
@@ -181,9 +179,9 @@ alternative, to convert the buffer contents to multibyte, is not
 acceptable because the buffer's representation is a choice made by the
 user that cannot be overridden automatically.
 
- Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
-unchanged, and converts bytes with codes 128 through 159 to the
-multibyte representation of raw eight-bit bytes.
+ Converting unibyte text to multibyte text leaves @acronym{ASCII}
+characters unchanged, and converts bytes with codes 128 through 159 to
+the multibyte representation of raw eight-bit bytes.
 
  Converting multibyte text to unibyte converts all @acronym{ASCII}
 and eight-bit characters to their single-byte form, but loses
@@ -214,9 +212,9 @@ characters.
 @end defun
 
 @defun multibyte-char-to-unibyte char
-This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a character that is neither
-@acronym{ASCII} nor eight-bit, the value is -1.
+This converts the multibyte character @var{char} to a unibyte
+character, and returns that character. If @var{char} is neither
+@acronym{ASCII} nor eight-bit, the function returns -1.
 @end defun
 
 @defun unibyte-char-to-multibyte char
@@ -238,9 +236,9 @@ is @code{nil}, the buffer becomes unibyte.
 
 This function leaves the buffer contents unchanged when viewed as a
 sequence of bytes. As a consequence, it can change the contents
-viewed as characters; a sequence of three bytes which is treated as
-one character in multibyte representation will count as three
-characters in unibyte representation. Eight-bit characters
+viewed as characters; for instance, a sequence of three bytes which is
+treated as one character in multibyte representation will count as
+three characters in unibyte representation. Eight-bit characters
 representing raw bytes are an exception. They are represented by one
 byte in a unibyte buffer, but when the buffer is set to multibyte,
 they are converted to two-byte sequences, and vice versa.
@@ -256,28 +254,24 @@ base buffer.
 @end defun
 
 @defun string-as-unibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each byte as a character. This means that the value may have
-more characters than @var{string} has. Eight-bit characters
-representing raw bytes are an exception: each one of them is converted
-to a single byte.
-
-If @var{string} is already a unibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
+If @var{string} is already a unibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each byte as a separate character
+(so that the value may have more characters than @var{string}); as an
+exception, each eight-bit character representing a raw byte is
+converted into a single byte. The newly-created string contains no
 text properties.
 @end defun
 
 @defun string-as-multibyte string
-This function returns a string with the same bytes as @var{string} but
-treating each multibyte sequence as one character. This means that
-the value may have fewer characters than @var{string} has. If a byte
-sequence in @var{string} is invalid as a multibyte representation of a
-single character, each byte in the sequence is treated as raw 8-bit
-byte.
-
-If @var{string} is already a multibyte string, then the value is
-@var{string} itself. Otherwise it is a newly created string, with no
-text properties.
+If @var{string} is a multibyte string, this function returns
+@var{string} itself. Otherwise, it returns a new string with the same
+bytes as @var{string}, but treating each multibyte sequence as one
+character. This means that the value may have fewer characters than
+@var{string} has. If a byte sequence in @var{string} is invalid as a
+multibyte representation of a single character, each byte in the
+sequence is treated as a raw 8-bit byte. The newly-created string
+contains no text properties.
 @end defun
 
 @node Character Codes
@@ -291,9 +285,10 @@ character codes for multibyte representation range from 0 to 4194303
 (#x3FFFFF). In this code space, values 0 through 127 are for
 @acronym{ASCII} charcters, and values 129 through 4194175 (#x3FFF7F)
 are for non-@acronym{ASCII} characters. Values 0 through 1114111
-(#10FFFF) corresponds to Unicode characters of the same codepoint,
-while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
-representing eight-bit raw bytes.
+(#10FFFF) correspond to Unicode characters of the same codepoint;
+values 1114112 (#110000) through 4194175 (#x3FFF7F) represent
+characters that are not unified with Unicode; and values 4194176
+(#x3FFF80) through 4194303 (#x3FFFFF) represent eight-bit raw bytes.
 
 @defun characterp charcode
 This returns @code{t} if @var{charcode} is a valid character, and
@@ -334,9 +329,9 @@ codepoint can have.
 @end defun
 
 @defun get-byte pos &optional string
-This function returns the byte at current buffer's character position
-@var{pos}. If the current buffer is unibyte, this is literally the
-byte at that position. If the buffer is multibyte, byte values of
+This function returns the byte at character position @var{pos} in the
+current buffer. If the current buffer is unibyte, this is literally
+the byte at that position. If the buffer is multibyte, byte values of
 @acronym{ASCII} characters are the same as character codepoints,
 whereas eight-bit raw bytes are converted to their 8-bit codes. The
 function signals an error if the character at @var{pos} is
@@ -360,13 +355,11 @@ of character properties. In particular, Emacs supports the
 Model}, and the Emacs character property database is derived from the
 Unicode Character Database (@acronym{UCD}). See the
 @uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
-Properties chapter of the Unicode Standard}, for detailed description
-of Unicode character properties and their meaning. This section
-assumes you are already familiar with that chapter of the Unicode
-Standard, and want to apply that knowledge to Emacs Lisp programs.
-
- The facilities documented in this section are useful for setting and
-retrieving properties of characters.
+Properties chapter of the Unicode Standard}, for a detailed
+description of Unicode character properties and their meaning. This
+section assumes you are already familiar with that chapter of the
+Unicode Standard, and want to apply that knowledge to Emacs Lisp
+programs.
 
  In Emacs, each property has a name, which is a symbol, and a set of
 possible values, whose types depend on the property; if a character
@@ -378,8 +371,8 @@ replacing each @samp{_} character with a dash @samp{-}. For example,
 @code{canonical-combining-class}. However, sometimes we shorten the
 names to make their use easier.
 
- Here's the full list of value types for all the character properties
-that Emacs knows about:
+ Here is the full list of value types for all the character
+properties that Emacs knows about:
 
 @table @code
 @item name
@@ -428,7 +421,7 @@ corresponding number.
 @item numeric-value
 Corresponds to the Unicode @code{Numeric_Value} property for
 characters whose @code{Numeric_Type} is @samp{Numeric}. The value of
-this property is an integer of a floating-point number. Examples of
+this property is an integer or a floating-point number. Examples of
 characters that have this property include fractions, subscripts,
 superscripts, Roman numerals, currency numerators, and encircled
 numbers. For example, the value of this property for the character
@@ -656,16 +649,15 @@ or last codepoint of @var{charset}, respectively.
 @node Scanning Charsets
 @section Scanning for Character Sets
 
- Sometimes it is useful to find out, for characters that appear in a
-certain part of a buffer or a string, to which character sets they
-belong. One use for this is in determining which coding systems
-(@pxref{Coding Systems}) are capable of representing all of the text
-in question; another is to determine the font(s) for displaying that
-text.
+ Sometimes it is useful to find out which character set a particular
+character belongs to. One use for this is in determining which coding
+systems (@pxref{Coding Systems}) are capable of representing all of
+the text in question; another is to determine the font(s) for
+displaying that text.
 
 @defun charset-after &optional pos
 This function returns the charset of highest priority containing the
-character in the current buffer at position @var{pos}. If @var{pos}
+character at position @var{pos} in the current buffer. If @var{pos}
 is omitted or @code{nil}, it defaults to the current value of point.
 If @var{pos} is out of range, the value is @code{nil}.
 @end defun
@@ -675,15 +667,15 @@ This function returns a list of the character sets of highest priority
 that contain characters in the current buffer between positions
 @var{beg} and @var{end}.
 
-The optional argument @var{translation} specifies a translation table to
-be used in scanning the text (@pxref{Translation of Characters}). If it
-is non-@code{nil}, then each character in the region is translated
+The optional argument @var{translation} specifies a translation table
+to use for scanning the text (@pxref{Translation of Characters}). If
+it is non-@code{nil}, then each character in the region is translated
 through this table, and the value returned describes the translated
 characters instead of the characters actually in the buffer.
 @end defun
 
 @defun find-charset-string string &optional translation
-This function returns a list of the character sets of highest priority
+This function returns a list of character sets of highest priority
 that contain characters in @var{string}. It is just like
 @code{find-charset-region}, except that it applies to the contents of
 @var{string} instead of part of the current buffer.
@@ -721,7 +713,7 @@ character, say @var{to-alt}, @var{from} is also translated to
 
  During decoding, the translation table's translations are applied to
 the characters that result from ordinary decoding. If a coding system
-has property @code{:decode-translation-table}, that specifies the
+has the property @code{:decode-translation-table}, that specifies the
 translation table to use, or a list of translation tables to apply in
 sequence. (This is a property of the coding system, as returned by
 @code{coding-system-get}, not a property of the symbol that is the
@@ -779,8 +771,8 @@ respectively in the @var{props} argument to
 This function is similar to @code{make-translation-table} but returns
 a complex translation table rather than a simple one-to-one mapping.
 Each element of @var{alist} is of the form @code{(@var{from}
-. @var{to})}, where @var{from} and @var{to} are either a character or
-a vector specifying a sequence of characters. If @var{from} is a
+. @var{to})}, where @var{from} and @var{to} are either characters or
+vectors specifying a sequence of characters. If @var{from} is a
 character, that character is translated to @var{to} (i.e.@: to a
 character or a character sequence). If @var{from} is a vector of
 characters, that sequence is translated to @var{to}. The returned
@@ -891,10 +883,13 @@ end-of-line conversion.
 codes or end-of-line.
 
 @vindex emacs-internal@r{ coding system}
- The coding system @code{emacs-internal} specifies that the data is
-represented in the internal Emacs encoding. This is like
-@code{raw-text} in that no code conversion happens, but different in
-that the result is multibyte data.
+@vindex utf-8-emacs@r{ coding system}
+ The coding system @code{utf-8-emacs} specifies that the data is
+represented in the internal Emacs encoding (@pxref{Text
+Representations}). This is like @code{raw-text} in that no code
+conversion happens, but different in that the result is multibyte
+data. The name @code{emacs-internal} is an alias for
+@code{utf-8-emacs}.
 
 @defun coding-system-get coding-system property
 This function returns the specified property of the coding system
@@ -924,9 +919,9 @@ This function returns the list of aliases of @var{coding-system}.
 @subsection Encoding and I/O
 
  The principal purpose of coding systems is for use in reading and
-writing files. The function @code{insert-file-contents} uses
-a coding system for decoding the file data, and @code{write-region}
-uses one to encode the buffer contents.
+writing files. The function @code{insert-file-contents} uses a coding
+system to decode the file data, and @code{write-region} uses one to
+encode the buffer contents.
 
  You can specify the coding system to use either explicitly
 (@pxref{Specifying Coding Systems}), or implicitly using a default