Diffstat (limited to 'doc/lispref')
-rw-r--r-- doc/lispref/nonascii.texi 476
1 files changed, 215 insertions, 261 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index c70f8e56973..f2656806bdb 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -21,8 +21,6 @@ how they are stored in strings and buffers.
21 codes of individual characters. 21 codes of individual characters.
22* Character Sets:: The space of possible character codes 22* Character Sets:: The space of possible character codes
23 is divided into various character sets. 23 is divided into various character sets.
24* Chars and Bytes:: More information about multibyte encodings.
25* Splitting Characters:: Converting a character to its byte sequence.
26* Scanning Charsets:: Which character sets are used in a buffer? 24* Scanning Charsets:: Which character sets are used in a buffer?
27* Translation of Characters:: Translation tables are used for conversion. 25* Translation of Characters:: Translation tables are used for conversion.
28* Coding Systems:: Coding systems are conversions for saving files. 26* Coding Systems:: Coding systems are conversions for saving files.
@@ -47,10 +45,11 @@ follows the @dfn{Unicode Standard}. The Unicode Standard assigns a
47unique number, called a @dfn{codepoint}, to each and every character. 45unique number, called a @dfn{codepoint}, to each and every character.
48The range of codepoints defined by Unicode, or the Unicode 46The range of codepoints defined by Unicode, or the Unicode
49@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs 47@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs
50extends this range with codepoints in the range @code{3FFF80..3FFFFF}, 48extends this range with codepoints in the range @code{110000..3FFFFF},
51which it uses for representing raw 8-bit bytes that cannot be 49which it uses for representing characters that are not unified with
52interpreted as characters. Thus, a character codepoint in Emacs is a 50Unicode and raw 8-bit bytes that cannot be interpreted as characters
5322-bit integer number. 51(the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a
52character codepoint in Emacs is a 22-bit integer number.
54 53
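The ranges named in the revised paragraph can be restated as a small classifier. This is only an illustrative sketch: `my-codepoint-class` is a hypothetical helper, not an Emacs function, and it encodes nothing beyond the three ranges quoted above.

```elisp
;; Hypothetical helper (not part of Emacs): classify an integer by the
;; codepoint ranges described in the revised text.
(defun my-codepoint-class (code)
  "Return a symbol naming the codepoint range that CODE falls into."
  (cond ((not (and (integerp code) (<= 0 code #x3FFFFF))) 'invalid)
        ((<= code #x10FFFF) 'unicode)        ; 0..10FFFF
        ((<= code #x3FFF7F) 'non-unified)    ; 110000..3FFF7F
        (t 'raw-byte)))                      ; 3FFF80..3FFFFF

(my-codepoint-class ?A)        ; => unicode
(my-codepoint-class #x3FFF80)  ; => raw-byte
(my-codepoint-class #x400000)  ; => invalid
```

The upper bound #x3FFFFF is the largest value that fits in 22 bits, which is why the text calls a codepoint a 22-bit integer.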
55@cindex internal representation of characters 54@cindex internal representation of characters
56@cindex characters, representation in buffers and strings 55@cindex characters, representation in buffers and strings
@@ -76,10 +75,10 @@ appropriate, when it reads text into a buffer or a string, or when it
76writes text to a disk file or passes it to some other process. 75writes text to a disk file or passes it to some other process.
77 76
78 Occasionally, Emacs needs to hold and manipulate encoded text or 77 Occasionally, Emacs needs to hold and manipulate encoded text or
79binary non-text data in its buffer or string. For example, when Emacs 78binary non-text data in its buffers or strings. For example, when
80visits a file, it first reads the file's text verbatim into a buffer, 79Emacs visits a file, it first reads the file's text verbatim into a
81and only then converts it to the internal representation. Before the 80buffer, and only then converts it to the internal representation.
82conversion, the buffer holds encoded text. 81Before the conversion, the buffer holds encoded text.
83 82
84@cindex unibyte text 83@cindex unibyte text
85 Encoded text is not really text, as far as Emacs is concerned, but 84 Encoded text is not really text, as far as Emacs is concerned, but
@@ -125,9 +124,15 @@ range, the value is @code{nil}.
125@end defun 124@end defun
126 125
127@defun byte-to-position byte-position 126@defun byte-to-position byte-position
128Return the buffer position, in character units, corresponding to 127Return the buffer position, in character units, corresponding to the given
129byte-position @var{byte-position} in the current buffer. If 128@var{byte-position} in the current buffer. If @var{byte-position} is
130@var{byte-position} is out of range, the value is @code{nil}. 129out of range, the value is @code{nil}. In a multibyte buffer, an
130arbitrary value of @var{byte-position} might not be at a character
131boundary, but instead inside a multibyte sequence representing a single
132character; in this case, this function returns the buffer position of
133the character whose multibyte sequence includes @var{byte-position}.
134In other words, the value does not change for all byte positions that
135belong to the same character.
131@end defun 136@end defun
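A short sketch of the behavior the new text describes, assuming a multibyte temporary buffer in which U+00E9 occupies two bytes. `my-byte-position-demo` is a hypothetical name used only for this illustration.

```elisp
;; Hypothetical demo function: the buffer holds "a", U+00E9, "b".
;; U+00E9 takes two bytes, so byte positions 2 and 3 belong to the
;; same character.
(defun my-byte-position-demo ()
  (with-temp-buffer
    (insert ?a #xE9 ?b)
    (list (position-bytes 2)     ; byte position where character 2 starts
          (position-bytes 3)     ; byte position where character 3 starts
          (byte-to-position 2)   ; first byte of U+00E9
          (byte-to-position 3)   ; second byte of U+00E9
          (byte-to-position 4)))) ; the byte holding "b"

(my-byte-position-demo)  ; => (2 4 2 2 3)
```

Byte positions 2 and 3 both map to character position 2, which is the "value does not change" property stated above.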
132 137
133@defun multibyte-string-p string 138@defun multibyte-string-p string
@@ -151,10 +156,11 @@ result a unibyte string.
151@section Converting Text Representations 156@section Converting Text Representations
152 157
153 Emacs can convert unibyte text to multibyte; it can also convert 158 Emacs can convert unibyte text to multibyte; it can also convert
154multibyte text to unibyte, though this conversion loses information. In 159multibyte text to unibyte, provided that the multibyte text contains
155general these conversions happen when inserting text into a buffer, or 160only @acronym{ASCII} and 8-bit characters. In general, these
156when putting text from several strings together in one string. You can 161conversions happen when inserting text into a buffer, or when putting
157also explicitly convert a string's contents to either representation. 162text from several strings together in one string. You can also
163explicitly convert a string's contents to either representation.
158 164
159 Emacs chooses the representation for a string based on the text that 165 Emacs chooses the representation for a string based on the text that
160it is constructed from. The general rule is to convert unibyte text to 166it is constructed from. The general rule is to convert unibyte text to
@@ -173,89 +179,40 @@ acceptable because the buffer's representation is a choice made by the
173user that cannot be overridden automatically. 179user that cannot be overridden automatically.
174 180
175 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters 181 Converting unibyte text to multibyte text leaves @acronym{ASCII} characters
176unchanged, and likewise character codes 128 through 159. It converts 182unchanged, and converts bytes with codes 128 through 159 to the
177the non-@acronym{ASCII} codes 160 through 255 by adding the value 183multibyte representation of raw eight-bit bytes.
178@code{nonascii-insert-offset} to each character code. By setting this
179variable, you specify which character set the unibyte characters
180correspond to (@pxref{Character Sets}). For example, if
181@code{nonascii-insert-offset} is 2048, which is @code{(- (make-char
182'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters
183correspond to Latin 1. If it is 2688, which is @code{(- (make-char
184'greek-iso8859-7) 128)}, then they correspond to Greek letters.
185
186 Converting multibyte text to unibyte is simpler: it discards all but
187the low 8 bits of each character code. If @code{nonascii-insert-offset}
188has a reasonable value, corresponding to the beginning of some character
189set, this conversion is the inverse of the other: converting unibyte
190text to multibyte and back to unibyte reproduces the original unibyte
191text.
192
193@defvar nonascii-insert-offset
194This variable specifies the amount to add to a non-@acronym{ASCII} character
195when converting unibyte text to multibyte. It also applies when
196@code{self-insert-command} inserts a character in the unibyte
197non-@acronym{ASCII} range, 128 through 255. However, the functions
198@code{insert} and @code{insert-char} do not perform this conversion.
199
200The right value to use to select character set @var{cs} is @code{(-
201(make-char @var{cs}) 128)}. If the value of
202@code{nonascii-insert-offset} is zero, then conversion actually uses the
203value for the Latin 1 character set, rather than zero.
204@end defvar
205 184
206@defvar nonascii-translation-table 185 Converting multibyte text to unibyte converts all @acronym{ASCII}
207This variable provides a more general alternative to 186and eight-bit characters to their single-byte form, but loses
208@code{nonascii-insert-offset}. You can use it to specify independently 187information for non-@acronym{ASCII} characters by discarding all but
209how to translate each code in the range of 128 through 255 into a 188the low 8 bits of each character's codepoint. Converting unibyte text
210multibyte character. The value should be a char-table, or @code{nil}. 189to multibyte and back to unibyte reproduces the original unibyte text.
211If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}.
212@end defvar
213 190
214The next three functions either return the argument @var{string}, or a 191The next two functions either return the argument @var{string}, or a
215newly created string with no text properties. 192newly created string with no text properties.
216 193
217@defun string-make-unibyte string
218This function converts the text of @var{string} to unibyte
219representation, if it isn't already, and returns the result. If
220@var{string} is a unibyte string, it is returned unchanged. Multibyte
221character codes are converted to unibyte according to
222@code{nonascii-translation-table} or, if that is @code{nil}, using
223@code{nonascii-insert-offset}. If the lookup in the translation table
224fails, this function takes just the low 8 bits of each character.
225@end defun
226
227@defun string-make-multibyte string
228This function converts the text of @var{string} to multibyte
229representation, if it isn't already, and returns the result. If
230@var{string} is a multibyte string or consists entirely of
231@acronym{ASCII} characters, it is returned unchanged. In particular,
232if @var{string} is unibyte and entirely @acronym{ASCII}, the returned
233string is unibyte. (When the characters are all @acronym{ASCII},
234Emacs primitives will treat the string the same way whether it is
235unibyte or multibyte.) If @var{string} is unibyte and contains
236non-@acronym{ASCII} characters, the function
237@code{unibyte-char-to-multibyte} is used to convert each unibyte
238character to a multibyte character.
239@end defun
240
241@defun string-to-multibyte string 194@defun string-to-multibyte string
242This function returns a multibyte string containing the same sequence 195This function returns a multibyte string containing the same sequence
243of character codes as @var{string}. Unlike 196of characters as @var{string}. If @var{string} is a multibyte string,
244@code{string-make-multibyte}, this function unconditionally returns a 197it is returned unchanged.
245multibyte string. If @var{string} is a multibyte string, it is 198@end defun
246returned unchanged. 199
200@defun string-to-unibyte string
201This function returns a unibyte string containing the same sequence of
202characters as @var{string}. It signals an error if @var{string}
203contains a non-@acronym{ASCII} character. If @var{string} is a
204unibyte string, it is returned unchanged.
247@end defun 205@end defun
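The two remaining conversion functions can be exercised as follows. This is a hedged sketch: `my-string-conversion-demo` is a hypothetical name, and the literal `"\200abc"` is assumed to be read as a unibyte string holding the raw byte 128 followed by ASCII text.

```elisp
;; Hypothetical demo function for string-to-multibyte / string-to-unibyte.
(defun my-string-conversion-demo ()
  (let* ((uni "\200abc")                      ; unibyte: raw byte 128 + ASCII
         (multi (string-to-multibyte uni)))
    (list (multibyte-string-p uni)            ; unibyte input
          (multibyte-string-p multi)          ; multibyte result
          (aref multi 0)                      ; raw byte as an eight-bit character
          (equal (string-to-unibyte multi) uni) ; round trip restores the bytes
          ;; A genuine non-ASCII character cannot be made unibyte.
          (condition-case nil
              (string-to-unibyte (string #xE9))
            (error 'signaled)))))

(my-string-conversion-demo)
;; => (nil t 4194176 t signaled)
```

The third element, 4194176, is #x3FFF80, the first codepoint in the raw-byte range.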
248 206
249@defun multibyte-char-to-unibyte char 207@defun multibyte-char-to-unibyte char
250This converts the multibyte character @var{char} to a unibyte 208This converts the multibyte character @var{char} to a unibyte
251character, based on @code{nonascii-translation-table} and 209character. If @var{char} is a non-@acronym{ASCII} character, the
252@code{nonascii-insert-offset}. 210value is -1.
253@end defun 211@end defun
254 212
255@defun unibyte-char-to-multibyte char 213@defun unibyte-char-to-multibyte char
256This converts the unibyte character @var{char} to a multibyte 214This converts the unibyte character @var{char} to a multibyte
257character, based on @code{nonascii-translation-table} and 215character.
258@code{nonascii-insert-offset}.
259@end defun 216@end defun
260 217
261@node Selecting a Representation 218@node Selecting a Representation
@@ -270,13 +227,13 @@ is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte}
270is @code{nil}, the buffer becomes unibyte. 227is @code{nil}, the buffer becomes unibyte.
271 228
272This function leaves the buffer contents unchanged when viewed as a 229This function leaves the buffer contents unchanged when viewed as a
273sequence of bytes. As a consequence, it can change the contents viewed 230sequence of bytes. As a consequence, it can change the contents
274as characters; a sequence of two bytes which is treated as one character 231viewed as characters; a sequence of three bytes which is treated as
275in multibyte representation will count as two characters in unibyte 232one character in multibyte representation will count as three
276representation. Character codes 128 through 159 are an exception. They 233characters in unibyte representation. Eight-bit characters
277are represented by one byte in a unibyte buffer, but when the buffer is 234representing raw bytes are an exception. They are represented by one
278set to multibyte, they are converted to two-byte sequences, and vice 235byte in a unibyte buffer, but when the buffer is set to multibyte,
279versa. 236they are converted to two-byte sequences, and vice versa.
280 237
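The effect described above can be observed by counting characters before and after switching a buffer to unibyte. The sketch below uses a hypothetical helper, `my-set-multibyte-demo`, and assumes that U+20AC occupies three bytes in the internal representation.

```elisp
;; Hypothetical demo function: insert TEXT into a multibyte temp buffer,
;; switch the buffer to unibyte, and report the character counts.
(defun my-set-multibyte-demo (text)
  (with-temp-buffer
    (insert text)
    (let ((before (buffer-size)))
      (set-buffer-multibyte nil)
      (list before (buffer-size)))))

;; One three-byte character becomes three one-byte characters.
(my-set-multibyte-demo (string #x20AC))              ; => (1 3)
;; An eight-bit character is the exception: it stays one character.
(my-set-multibyte-demo (string-to-multibyte "\200")) ; => (1 1)
```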
281This function sets @code{enable-multibyte-characters} to record which 238This function sets @code{enable-multibyte-characters} to record which
282representation is in use. It also adjusts various data in the buffer 239representation is in use. It also adjusts various data in the buffer
@@ -291,26 +248,26 @@ base buffer.
291@defun string-as-unibyte string 248@defun string-as-unibyte string
292This function returns a string with the same bytes as @var{string} but 249This function returns a string with the same bytes as @var{string} but
293treating each byte as a character. This means that the value may have 250treating each byte as a character. This means that the value may have
294more characters than @var{string} has. 251more characters than @var{string} has. Eight-bit characters
252representing raw bytes are an exception: each one of them is converted
253to a single byte.
295 254
296If @var{string} is already a unibyte string, then the value is 255If @var{string} is already a unibyte string, then the value is
297@var{string} itself. Otherwise it is a newly created string, with no 256@var{string} itself. Otherwise it is a newly created string, with no
298text properties. If @var{string} is multibyte, any characters it 257text properties.
299contains of charset @code{eight-bit-control} or @code{eight-bit-graphic}
300are converted to the corresponding single byte.
301@end defun 258@end defun
302 259
303@defun string-as-multibyte string 260@defun string-as-multibyte string
304This function returns a string with the same bytes as @var{string} but 261This function returns a string with the same bytes as @var{string} but
305treating each multibyte sequence as one character. This means that the 262treating each multibyte sequence as one character. This means that
306value may have fewer characters than @var{string} has. 263the value may have fewer characters than @var{string} has. If a byte
264sequence in @var{string} is invalid as a multibyte representation of a
265single character, each byte in the sequence is treated as a raw 8-bit
266byte.
307 267
308If @var{string} is already a multibyte string, then the value is 268If @var{string} is already a multibyte string, then the value is
309@var{string} itself. Otherwise it is a newly created string, with no 269@var{string} itself. Otherwise it is a newly created string, with no
310text properties. If @var{string} is unibyte and contains any individual 270text properties.
3118-bit bytes (i.e.@: not part of a multibyte form), they are converted to
312the corresponding multibyte character of charset @code{eight-bit-control}
313or @code{eight-bit-graphic}.
314@end defun 271@end defun
315 272
316@node Character Codes 273@node Character Codes
@@ -320,13 +277,13 @@ or @code{eight-bit-graphic}.
320 The unibyte and multibyte text representations use different 277 The unibyte and multibyte text representations use different
321character codes. The valid character codes for unibyte representation 278character codes. The valid character codes for unibyte representation
322range from 0 to 255---the values that can fit in one byte. The valid 279range from 0 to 255---the values that can fit in one byte. The valid
323character codes for multibyte representation range from 0 to 4194303, 280character codes for multibyte representation range from 0 to 4194303
324but not all values in that range are valid. The values 128 through 281(#x3FFFFF). In this code space, values 0 through 127 are for
325255 do not usually show up in multibyte text, but they can occur if 282@acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F)
326you do explicit encoding and decoding (@pxref{Explicit Encoding}). 283are for non-@acronym{ASCII} characters. Values 0 through 1114111
327Some other character codes cannot occur at all in multibyte text. 284(#x10FFFF) correspond to Unicode characters of the same codepoint,
328Only the @acronym{ASCII} codes 0 through 127 are completely legitimate 285while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for
329in both representations. 286representing eight-bit raw bytes.
330 287
331@defun characterp charcode 288@defun characterp charcode
332This returns @code{t} if @var{charcode} is a valid character, and 289This returns @code{t} if @var{charcode} is a valid character, and
@@ -335,8 +292,6 @@ This returns @code{t} if @var{charcode} is a valid character, and
335@example 292@example
336(characterp 65) 293(characterp 65)
337 @result{} t 294 @result{} t
338(characterp 256)
339 @result{} nil
340(characterp 4194303) 295(characterp 4194303)
341 @result{} t 296 @result{} t
342(characterp 4194304) 297(characterp 4194304)
@@ -344,27 +299,45 @@ This returns @code{t} if @var{charcode} is a valid character, and
344@end example 299@end example
345@end defun 300@end defun
346 301
302@defun get-byte pos &optional string
303This function returns the byte at character position @var{pos} in
304the current buffer. If the current buffer is unibyte, this is literally the
305byte at that position. If the buffer is multibyte, byte values of
306@acronym{ASCII} characters are the same as character codepoints,
307whereas eight-bit raw bytes are converted to their 8-bit codes. The
308function signals an error if the character at @var{pos} is
309non-@acronym{ASCII}.
310
311The optional argument @var{string} means to get a byte value from that
312string instead of the current buffer.
313@end defun
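A brief sketch of `get-byte` applied to strings, covering the three cases the description distinguishes. `my-get-byte-demo` is a hypothetical name; string indices are assumed to start at 0, as with `aref`.

```elisp
;; Hypothetical demo function for get-byte on strings.
(defun my-get-byte-demo ()
  (list (get-byte 0 "A")                           ; ASCII: same as the codepoint
        (get-byte 0 "\200")                        ; unibyte: the byte itself
        (get-byte 0 (string-to-multibyte "\200"))  ; eight-bit char: its 8-bit code
        ;; Any other non-ASCII character signals an error.
        (condition-case nil
            (get-byte 0 (string #xE9))
          (error 'signaled))))

(my-get-byte-demo)  ; => (65 128 128 signaled)
```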
314
347@node Character Sets 315@node Character Sets
348@section Character Sets 316@section Character Sets
349@cindex character sets 317@cindex character sets
350 318
351 Emacs classifies characters into various @dfn{character sets}, each of 319@cindex charset
352which has a name which is a symbol. Each character belongs to one and 320@cindex coded character set
353only one character set. 321An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters
354 322in which each character is assigned a numeric code point. (The
355 In general, there is one character set for each distinct script. For 323Unicode standard calls this a @dfn{coded character set}.) Each
356example, @code{latin-iso8859-1} is one character set, 324charset has a name which is a symbol. A single character can belong
357@code{greek-iso8859-7} is another, and @code{ascii} is another. An 325to any number of different character sets, but it will generally have
358Emacs character set can hold at most 9025 characters; therefore, in some 326a different code point in each charset. Examples of character sets
359cases, characters that would logically be grouped together are split 327include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and
360into several character sets. For example, one set of Chinese 328@code{windows-1255}. The code point assigned to a character in a
361characters, generally known as Big 5, is divided into two Emacs 329charset is usually different from its code point used in Emacs buffers
362character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. 330and strings.
363 331
364 @acronym{ASCII} characters are in character set @code{ascii}. The 332@cindex @code{emacs}, a charset
365non-@acronym{ASCII} characters 128 through 159 are in character set 333@cindex @code{unicode}, a charset
366@code{eight-bit-control}, and codes 160 through 255 are in character set 334@cindex @code{eight-bit}, a charset
367@code{eight-bit-graphic}. 335 Emacs defines several special character sets. The character set
336@code{unicode} includes all the characters whose Emacs code points are
337in the range @code{0..10FFFF}. The character set @code{emacs}
338includes all @acronym{ASCII} and non-@acronym{ASCII} characters.
339Finally, the @code{eight-bit} charset includes the 8-bit raw bytes;
340Emacs uses it to represent raw bytes encountered in text.
368 341
369@defun charsetp object 342@defun charsetp object
370Returns @code{t} if @var{object} is a symbol that names a character set, 343Returns @code{t} if @var{object} is a symbol that names a character set,
@@ -375,110 +348,60 @@ Returns @code{t} if @var{object} is a symbol that names a character set,
375The value is a list of all defined character set names. 348The value is a list of all defined character set names.
376@end defvar 349@end defvar
377 350
378@defun charset-list 351@defun charset-priority-list &optional highestp
379This function returns the value of @code{charset-list}. It is only 352This function returns a list of all defined character sets ordered by
380provided for backward compatibility. 353their priority. If @var{highestp} is non-@code{nil}, the function
354returns a single character set of the highest priority.
355@end defun
356
357@defun set-charset-priority &rest charsets
358This function makes @var{charsets} the highest priority character sets.
381@end defun 359@end defun
382 360
383@defun char-charset character 361@defun char-charset character
384This function returns the name of the character set that @var{character} 362This function returns the name of the character set of highest
385belongs to, or the symbol @code{unknown} if @var{character} is not a 363priority that @var{character} belongs to. @acronym{ASCII} characters
386valid character. 364are an exception: for them, this function always returns @code{ascii}.
387@end defun 365@end defun
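The priority-related functions can be probed without assuming any particular priority order, which depends on the language environment. In this sketch `my-charset-priority-demo` is a hypothetical name; only the @acronym{ASCII} result is fixed by the description above.

```elisp
;; Hypothetical demo function: check relationships that hold regardless
;; of how the charset priorities happen to be configured.
(defun my-charset-priority-demo ()
  (let ((highest (charset-priority-list t))
        (all (charset-priority-list)))
    (list (char-charset ?A)                 ; ASCII characters always give `ascii'
          (eq highest (car all))            ; the single value is the head of the list
          (charsetp highest)                ; and it names a charset
          (charsetp (char-charset #xE9))))) ; some charset contains U+00E9

(my-charset-priority-demo)  ; => (ascii t t t)
```

Which charset `char-charset` reports for U+00E9 is deliberately not shown, since it varies with the priority list.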
388 366
389@defun charset-plist charset 367@defun charset-plist charset
390This function returns the charset property list of the character set 368This function returns the property list of the character set
391@var{charset}. Although @var{charset} is a symbol, this is not the same 369@var{charset}. Although @var{charset} is a symbol, this is not the
392as the property list of that symbol. Charset properties are used for 370same as the property list of that symbol. Charset properties include
393special purposes within Emacs. 371important information about the charset, such as its documentation
372string, short name, etc.
394@end defun 373@end defun
395 374
396@deffn Command list-charset-chars charset 375@defun put-charset-property charset propname value
397This command displays a list of characters in the character set 376This function sets the @var{propname} property of @var{charset} to the
398@var{charset}. 377given @var{value}.
399@end deffn
400
401@node Chars and Bytes
402@section Characters and Bytes
403@cindex bytes and characters
404
405@cindex introduction sequence (of character)
406@cindex dimension (of character set)
407 In multibyte representation, each character occupies one or more
408bytes. Each character set has an @dfn{introduction sequence}, which is
409normally one or two bytes long. (Exception: the @code{ascii} character
410set and the @code{eight-bit-graphic} character set have a zero-length
411introduction sequence.) The introduction sequence is the beginning of
412the byte sequence for any character in the character set. The rest of
413the character's bytes distinguish it from the other characters in the
414same character set. Depending on the character set, there are either
415one or two distinguishing bytes; the number of such bytes is called the
416@dfn{dimension} of the character set.
417
418@defun charset-dimension charset
419This function returns the dimension of @var{charset}; at present, the
420dimension is always 1 or 2.
421@end defun 378@end defun
422 379
423@defun charset-bytes charset 380@defun get-charset-property charset propname
424This function returns the number of bytes used to represent a character 381This function returns the value of @var{charset}'s property
425in character set @var{charset}. 382@var{propname}.
426@end defun 383@end defun
427 384
428 This is the simplest way to determine the byte length of a character 385@deffn Command list-charset-chars charset
429set's introduction sequence: 386This command displays a list of characters in the character set
430 387@var{charset}.
431@example 388@end deffn
432(- (charset-bytes @var{charset})
433 (charset-dimension @var{charset}))
434@end example
435
436@node Splitting Characters
437@section Splitting Characters
438@cindex character as bytes
439
440 The functions in this section convert between characters and the byte
441values used to represent them. For most purposes, there is no need to
442be concerned with the sequence of bytes used to represent a character,
443because Emacs translates automatically when necessary.
444
445@defun split-char character
446Return a list containing the name of the character set of
447@var{character}, followed by one or two byte values (integers) which
448identify @var{character} within that character set. The number of byte
449values is the character set's dimension.
450
451If @var{character} is invalid as a character code, @code{split-char}
452returns a list consisting of the symbol @code{unknown} and @var{character}.
453 389
454@example 390@defun decode-char charset code-point
455(split-char 2248) 391This function decodes a character that is assigned a @var{code-point}
456 @result{} (latin-iso8859-1 72) 392in @var{charset}, to the corresponding Emacs character, and returns
457(split-char 65) 393that character. If @var{charset} doesn't contain a character of that
458 @result{} (ascii 65) 394code point, the value is @code{nil}. If @var{code-point} doesn't fit
459(split-char 128) 395in a Lisp integer (@pxref{Integer Basics, most-positive-fixnum}), it
460 @result{} (eight-bit-control 128) 396can be specified as a cons cell @code{(@var{high} . @var{low})}, where
461@end example 397@var{low} are the lower 16 bits of the value and @var{high} are the
398high 16 bits.
462@end defun 399@end defun
463 400
464@c FIXME: update split-char and make-char 401@defun encode-char char charset
465@cindex generate characters in charsets 402This function returns the code point assigned to the character
466@defun make-char charset &optional code1 code2 403@var{char} in @var{charset}. If @var{charset} doesn't contain
467This function returns the character in character set @var{charset} whose 404@var{char}, the value is @code{nil}.
468position codes are @var{code1} and @var{code2}. This is roughly the
469inverse of @code{split-char}. Normally, you should specify either one
470or both of @var{code1} and @var{code2} according to the dimension of
471@var{charset}. For example,
472
473@example
474(make-char 'latin-iso8859-1 72)
475 @result{} 2248
476@end example
477
478Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed
479before they are used to index @var{charset}. Thus you may use, for
480instance, an ISO 8859 character code rather than subtracting 128, as
481is necessary to index the corresponding Emacs charset.
482@end defun 405@end defun
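The pair `decode-char` and `encode-char` can be illustrated with charsets whose code points are easy to predict. This is a sketch under that assumption; `my-charset-code-demo` is a hypothetical name, and only the `ascii`, `iso-8859-1`, and `unicode` charsets mentioned in this section are used.

```elisp
;; Hypothetical demo function for decode-char / encode-char.
(defun my-charset-code-demo ()
  (list (decode-char 'iso-8859-1 #xE9)  ; Latin-1 code E9 -> character U+00E9
        (encode-char ?A 'ascii)         ; `A' has code 65 in ascii
        (encode-char #xE9 'ascii)       ; U+00E9 is not in ascii -> nil
        (decode-char 'ascii 200)        ; ascii has no code point 200 -> nil
        ;; Decoding and re-encoding in the same charset round-trips.
        (encode-char (decode-char 'unicode #x3B1) 'unicode)))

(my-charset-code-demo)  ; => (233 65 nil nil 945)
```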
483 406
484@node Scanning Charsets 407@node Scanning Charsets
@@ -490,15 +413,16 @@ coding systems (@pxref{Coding Systems}) are capable of representing all
490of the text in question. 413of the text in question.
491 414
492@defun charset-after &optional pos 415@defun charset-after &optional pos
493This function return the charset of a character in the current buffer 416This function returns the charset of highest priority containing the
494at position @var{pos}. If @var{pos} is omitted or @code{nil}, it 417character in the current buffer at position @var{pos}. If @var{pos}
495defaults to the current value of point. If @var{pos} is out of range, 418is omitted or @code{nil}, it defaults to the current value of point.
496the value is @code{nil}. 419If @var{pos} is out of range, the value is @code{nil}.
497@end defun 420@end defun
498 421
499@defun find-charset-region beg end &optional translation 422@defun find-charset-region beg end &optional translation
500This function returns a list of the character sets that appear in the 423This function returns a list of the character sets of highest priority
501current buffer between positions @var{beg} and @var{end}. 424that contain characters in the current buffer between positions
425@var{beg} and @var{end}.
502 426
503The optional argument @var{translation} specifies a translation table to 427The optional argument @var{translation} specifies a translation table to
504be used in scanning the text (@pxref{Translation of Characters}). If it 428be used in scanning the text (@pxref{Translation of Characters}). If it
@@ -508,10 +432,10 @@ characters instead of the characters actually in the buffer.
508@end defun 432@end defun
509 433
510@defun find-charset-string string &optional translation 434@defun find-charset-string string &optional translation
511This function returns a list of the character sets that appear in the 435This function returns a list of the character sets of highest priority
512string @var{string}. It is just like @code{find-charset-region}, except 436that contain characters in @var{string}. It is just like
513that it applies to the contents of @var{string} instead of part of the 437@code{find-charset-region}, except that it applies to the contents of
514current buffer. 438@var{string} instead of part of the current buffer.
515@end defun 439@end defun
516 440
517@node Translation of Characters 441@node Translation of Characters
@@ -519,19 +443,17 @@ current buffer.
519@cindex character translation tables 443@cindex character translation tables
520@cindex translation tables 444@cindex translation tables
521 445
522 A @dfn{translation table} is a char-table that specifies a mapping 446 A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that
523of characters into characters. These tables are used in encoding and 447specifies a mapping of characters into characters. These tables are
524decoding, and for other purposes. Some coding systems specify their 448used in encoding and decoding, and for other purposes. Some coding
525own particular translation tables; there are also default translation 449systems specify their own particular translation tables; there are
526tables which apply to all other coding systems. 450also default translation tables which apply to all other coding
451systems.
527 452
528 For instance, the coding-system @code{utf-8} has a translation table 453 A translation table has two extra slots. The first is either
529that maps characters of various charsets (e.g., 454@code{nil} or a translation table that performs the reverse
530@code{latin-iso8859-@var{x}}) into Unicode character sets. This way, 455translation; the second is the maximum number of characters to look up
531it can encode Latin-2 characters into UTF-8. Meanwhile, 456for translation.
532@code{unify-8859-on-decoding-mode} operates by specifying
533@code{standard-translation-table-for-decode} to translate
534Latin-@var{x} characters into corresponding Unicode characters.
535 457
536@defun make-translation-table &rest translations 458@defun make-translation-table &rest translations
537This function returns a translation table based on the argument 459This function returns a translation table based on the argument
@@ -545,34 +467,66 @@ character, say @var{to-alt}, @var{from} is also translated to
545@var{to-alt}. 467@var{to-alt}.
546@end defun 468@end defun
547 469
548 In decoding, the translation table's translations are applied to the 470 During decoding, the translation table's translations are applied to
549characters that result from ordinary decoding. If a coding system has 471the characters that result from ordinary decoding. If a coding system
550property @code{translation-table-for-decode}, that specifies the 472has property @code{:decode-translation-table}, that specifies the
551translation table to use. (This is a property of the coding system, 473translation table to use, or a list of translation tables to apply in
552as returned by @code{coding-system-get}, not a property of the symbol 474sequence. (This is a property of the coding system, as returned by
553that is the coding system's name. @xref{Coding System Basics,, Basic 475@code{coding-system-get}, not a property of the symbol that is the
554Concepts of Coding Systems}.) Otherwise, if 476coding system's name. @xref{Coding System Basics,, Basic Concepts of
555@code{standard-translation-table-for-decode} is non-@code{nil}, 477Coding Systems}.) Finally, if
556decoding uses that table. 478@code{standard-translation-table-for-decode} is non-@code{nil}, the
557 479resulting characters are translated by that table.
558 In encoding, the translation table's translations are applied to the 480
559characters in the buffer, and the result of translation is actually 481 During encoding, the translation table's translations are applied to
560encoded. If a coding system has property 482the characters in the buffer, and the result of translation is
561@code{translation-table-for-encode}, that specifies the translation 483actually encoded. If a coding system has property
562table to use. Otherwise the variable 484@code{:encode-translation-table}, that specifies the translation table
563@code{standard-translation-table-for-encode} specifies the translation 485to use, or a list of translation tables to apply in sequence. In
564table. 486addition, if the variable @code{standard-translation-table-for-encode}
487is non-@code{nil}, it specifies the translation table to use for
488translating the result.
565 489
566@defvar standard-translation-table-for-decode 490@defvar standard-translation-table-for-decode
567This is the default translation table for decoding, for 491This is the default translation table for decoding. If a coding
568coding systems that don't specify any other translation table. 492system specifies its own translation tables, the table that is the
493value of this variable, if non-@code{nil}, is applied after them.
569@end defvar 494@end defvar
570 495
571@defvar standard-translation-table-for-encode 496@defvar standard-translation-table-for-encode
572This is the default translation table for encoding, for 497This is the default translation table for encoding. If a coding
573coding systems that don't specify any other translation table. 498system specifies its own translation tables, the table that is the
499value of this variable, if non-@code{nil}, is applied after them.
574@end defvar 500@end defvar
575 501
502@defun make-translation-table-from-vector vec
503This function returns a translation table made from @var{vec} that is
504an array of 256 elements to map byte values 0 through 255 to
505characters. Elements may be @code{nil} for untranslated bytes. The
506returned table has a translation table for reverse mapping in the
507first extra slot.
508
509This function provides an easy way to make a private coding system
510that maps each byte to a specific character. You can specify the
511returned table and the reverse translation table using the properties
512@code{:decode-translation-table} and @code{:encode-translation-table}
513respectively in the @var{props} argument to
514@code{define-coding-system}.
515@end defun
516
517@defun make-translation-table-from-alist alist
518This function is similar to @code{make-translation-table} but returns
519a complex translation table rather than a simple one-to-one mapping.
520Each element of @var{alist} is of the form @code{(@var{from}
521. @var{to})}, where @var{from} and @var{to} are either a character or
522a vector specifying a sequence of characters. If @var{from} is a
523character, that character is translated to @var{to} (i.e.@: to a
524character or a character sequence). If @var{from} is a vector of
525characters, that sequence is translated to @var{to}. The returned
526table has a translation table for reverse mapping in the first extra
527slot.
528@end defun
529
576@node Coding Systems 530@node Coding Systems
577@section Coding Systems 531@section Coding Systems
578 532