author Eli Zaretskii 2008-11-28 13:26:43 +0000
committer Eli Zaretskii 2008-11-28 13:26:43 +0000
commit 8b80cdf500c514dc9c448b4fe37265cf16127ae5
tree 516ab434ca3c00418d94c5f6d303f7500b41129d
parent e8e2bd93103909d092205b95d72bfeb8d8f6d129
(Text Representations, Converting Representations, Character Sets,
Scanning Charsets, Translation of Characters): Make text more accurate.
-rw-r--r-- doc/lispref/ChangeLog 6
-rw-r--r-- doc/lispref/nonascii.texi 68
2 files changed, 51 insertions, 23 deletions
diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index e0d465a0a73..3b6f5fb33fa 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,9 @@
+2008-11-28 Eli Zaretskii <eliz@gnu.org>
+
+	* nonascii.texi (Text Representations, Converting Representations)
+	(Character Sets, Scanning Charsets, Translation of Characters):
+	Make text more accurate.
+
 2008-11-28 Glenn Morris <rgm@gnu.org>
 
 	* files.texi (Format Conversion Round-Trip): Improve previous change.
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index f2656806bdb..eab748bab8d 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -44,7 +44,7 @@ text in most any known written language.
 follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
 unique number, called a @dfn{codepoint}, to each and every character.
 The range of codepoints defined by Unicode, or the Unicode
-@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
+@dfn{codespace}, is @code{0..10FFFF} (in hex), inclusive. Emacs
 extends this range with codepoints in the range @code{110000..3FFFFF},
 which it uses for representing characters that are not unified with
 Unicode and raw 8-bit bytes that cannot be interpreted as characters
@@ -62,7 +62,8 @@ bytes, depending on the magnitude of its codepoint@footnote{
 This internal representation is based on one of the encodings defined
 by the Unicode Standard, called @dfn{UTF-8}, for representing any
 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
-codepoints it uses for raw 8-bit bytes.}.
+codepoints it uses for raw 8-bit bytes and characters not unified with
+Unicode.}.
 For example, any @acronym{ASCII} character takes up only 1 byte, a
 Latin-1 character takes up 2 bytes, etc. We call this representation
 of text @dfn{multibyte}, because it uses several bytes for each
@@ -157,7 +158,7 @@ result a unibyte string.
 
  Emacs can convert unibyte text to multibyte; it can also convert
 multibyte text to unibyte, provided that the multibyte text contains
-only @acronym{ASCII} and 8-bit characters. In general, these
+only @acronym{ASCII} and 8-bit raw bytes. In general, these
 conversions happen when inserting text into a buffer, or when putting
 text from several strings together in one string. You can also
 explicitly convert a string's contents to either representation.
@@ -194,25 +195,32 @@ newly created string with no text properties.
 @defun string-to-multibyte string
 This function returns a multibyte string containing the same sequence
 of characters as @var{string}. If @var{string} is a multibyte string,
-it is returned unchanged.
+it is returned unchanged. The function assumes that @var{string}
+includes only @acronym{ASCII} characters and raw 8-bit bytes; the
+latter are converted to their multibyte representation corresponding
+to the codepoints in the @code{3FFF80..3FFFFF} area (@pxref{Text
+Representations, codepoints}).
 @end defun
 
 @defun string-to-unibyte string
 This function returns a unibyte string containing the same sequence of
 characters as @var{string}. It signals an error if @var{string}
 contains a non-@acronym{ASCII} character. If @var{string} is a
-unibyte string, it is returned unchanged.
+unibyte string, it is returned unchanged. Use this function for
+@var{string} arguments that contain only @acronym{ASCII} and eight-bit
+characters.
 @end defun
 
 @defun multibyte-char-to-unibyte char
 This convert the multibyte character @var{char} to a unibyte
-character. If @var{char} is a non-@acronym{ASCII} character, the
-value is -1.
+character. If @var{char} is a character that is neither
+@acronym{ASCII} nor eight-bit, the value is -1.
 @end defun
 
 @defun unibyte-char-to-multibyte char
 This convert the unibyte character @var{char} to a multibyte
-character.
+character, assuming @var{char} is either @acronym{ASCII} or raw 8-bit
+byte.
 @end defun
 
 @node Selecting a Representation
@@ -320,7 +328,7 @@ string instead of the current buffer.
 @cindex coded character set
 An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
 in which each character is assigned a numeric code point. (The
-Unicode standard calls this a @dfn{coded character set}.) Each
+Unicode standard calls this a @dfn{coded character set}.) Each Emacs
 charset has a name which is a symbol. A single character can belong
 to any number of different character sets, but it will generally have
 a different code point in each charset. Examples of character sets
@@ -387,30 +395,42 @@ This command displays a list of characters in the character set
 @var{charset}.
 @end deffn
 
+ Emacs can convert between its internal representation of a character
+and the character's codepoint in a specific charset. The following
+two functions support these conversions.
+
+@c FIXME: decode-char and encode-char accept and ignore an additional
+@c argument @var{restriction}. When that argument actually makes a
+@c difference, it should be documented here.
 @defun decode-char charset code-point
 This function decodes a character that is assigned a @var{code-point}
 in @var{charset}, to the corresponding Emacs character, and returns
-that character. If @var{charset} doesn't contain a character of that
-code point, the value is @code{nil}. If @var{code-point} doesnt't fit
-in a Lisp integer (@pxref{Integer Basics, most-positive-fixnum}), it
-can be specified as a cons cell @code{(@var{high} . @var{low})}, where
+it. If @var{charset} doesn't contain a character of that code point,
+the value is @code{nil}. If @var{code-point} doesn't fit in a Lisp
+integer (@pxref{Integer Basics, most-positive-fixnum}), it can be
+specified as a cons cell @code{(@var{high} . @var{low})}, where
 @var{low} are the lower 16 bits of the value and @var{high} are the
 high 16 bits.
 @end defun
 
 @defun encode-char char charset
 This function returns the code point assigned to the character
-@var{char} in @var{charset}. If @var{charset} doesn't contain
-@var{char}, the value is @code{nil}.
+@var{char} in @var{charset}. If the result does not fit in a Lisp
+integer, it is returned as a cons cell @code{(@var{high} . @var{low})}
+that fits the second argument of @code{decode-char} above. If
+@var{charset} doesn't have a codepoint for @var{char}, the value is
+@code{nil}.
 @end defun
 
 @node Scanning Charsets
 @section Scanning for Character Sets
 
- Sometimes it is useful to find out which character sets appear in a
-part of a buffer or a string. One use for this is in determining which
-coding systems (@pxref{Coding Systems}) are capable of representing all
-of the text in question.
+ Sometimes it is useful to find out, for characters that appear in a
+certain part of a buffer or a string, to which character sets they
+belong. One use for this is in determining which coding systems
+(@pxref{Coding Systems}) are capable of representing all of the text
+in question; another is to determine the font(s) for displaying that
+text.
 
 @defun charset-after &optional pos
 This function returns the charset of highest priority containing the
@@ -421,7 +441,7 @@ If @var{pos} is out of range, the value is @code{nil}.
 
 @defun find-charset-region beg end &optional translation
 This function returns a list of the character sets of highest priority
-that contain charcters in the current buffer between positions
+that contain characters in the current buffer between positions
 @var{beg} and @var{end}.
 
 The optional argument @var{translation} specifies a translation table to
@@ -453,7 +473,8 @@ systems.
  A translation table has two extra slots. The first is either
 @code{nil} or a translation table that performs the reverse
 translation; the second is the maximum number of characters to look up
-for translation.
+for translating sequences of characters (see the description of
+@code{make-translation-table-from-alist} below).
 
 @defun make-translation-table &rest translations
 This function returns a translation table based on the argument
@@ -504,7 +525,7 @@ This function returns a translation table made from @var{vec} that is
 an array of 256 elements to map byte values 0 through 255 to
 characters. Elements may be @code{nil} for untranslated bytes. The
 returned table has a translation table for reverse mapping in the
-first extra slot.
+first extra slot, and the value @code{1} in the second extra slot.
 
 This function provides an easy way to make a private coding system
 that maps each byte to a specific character. You can specify the
@@ -524,7 +545,8 @@ character, that character is translated to @var{to} (i.e.@: to a
 character or a character sequence). If @var{from} is a vector of
 characters, that sequence is translated to @var{to}. The returned
 table has a translation table for reverse mapping in the first extra
-slot.
+slot, and the maximum length of all the @var{from} character sequences
+in the second extra slot.
 @end defun
 
 @node Coding Systems