Diffstat (limited to 'doc/lispref')
| -rw-r--r-- | doc/lispref/nonascii.texi | 476 |
1 files changed, 215 insertions, 261 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index c70f8e56973..f2656806bdb 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
| @@ -21,8 +21,6 @@ how they are stored in strings and buffers. | |||
| 21 | codes of individual characters. | 21 | codes of individual characters. |
| 22 | * Character Sets:: The space of possible character codes | 22 | * Character Sets:: The space of possible character codes |
| 23 | is divided into various character sets. | 23 | is divided into various character sets. |
| 24 | * Chars and Bytes:: More information about multibyte encodings. | ||
| 25 | * Splitting Characters:: Converting a character to its byte sequence. | ||
| 26 | * Scanning Charsets:: Which character sets are used in a buffer? | 24 | * Scanning Charsets:: Which character sets are used in a buffer? |
| 27 | * Translation of Characters:: Translation tables are used for conversion. | 25 | * Translation of Characters:: Translation tables are used for conversion. |
| 28 | * Coding Systems:: Coding systems are conversions for saving files. | 26 | * Coding Systems:: Coding systems are conversions for saving files. |
| @@ -47,10 +45,11 @@ follows the @dfn{Unicode Standard}. The Unicode Standard assigns a | |||
| 47 | unique number, called a @dfn{codepoint}, to each and every character. | 45 | unique number, called a @dfn{codepoint}, to each and every character. |
| 48 | The range of codepoints defined by Unicode, or the Unicode | 46 | The range of codepoints defined by Unicode, or the Unicode |
| 49 | @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs | 47 | @dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive. Emacs |
| 50 | extends this range with codepoints in the range @code{3FFF80..3FFFFF}, | 48 | extends this range with codepoints in the range @code{110000..3FFFFF}, |
| 51 | which it uses for representing raw 8-bit bytes that cannot be | 49 | which it uses for representing characters that are not unified with |
| 52 | interpreted as characters. Thus, a character codepoint in Emacs is a | 50 | Unicode and raw 8-bit bytes that cannot be interpreted as characters |
| 53 | 22-bit integer number. | 51 | (the latter occupy the range @code{3FFF80..3FFFFF}). Thus, a |
| 52 | character codepoint in Emacs is a 22-bit integer number. | ||
| 54 | 53 | ||
| 55 | @cindex internal representation of characters | 54 | @cindex internal representation of characters |
| 56 | @cindex characters, representation in buffers and strings | 55 | @cindex characters, representation in buffers and strings |
| @@ -76,10 +75,10 @@ appropriate, when it reads text into a buffer or a string, or when it | |||
| 76 | writes text to a disk file or passes it to some other process. | 75 | writes text to a disk file or passes it to some other process. |
| 77 | 76 | ||
| 78 | Occasionally, Emacs needs to hold and manipulate encoded text or | 77 | Occasionally, Emacs needs to hold and manipulate encoded text or |
| 79 | binary non-text data in its buffer or string. For example, when Emacs | 78 | binary non-text data in its buffers or strings. For example, when |
| 80 | visits a file, it first reads the file's text verbatim into a buffer, | 79 | Emacs visits a file, it first reads the file's text verbatim into a |
| 81 | and only then converts it to the internal representation. Before the | 80 | buffer, and only then converts it to the internal representation. |
| 82 | conversion, the buffer holds encoded text. | 81 | Before the conversion, the buffer holds encoded text. |
| 83 | 82 | ||
| 84 | @cindex unibyte text | 83 | @cindex unibyte text |
| 85 | Encoded text is not really text, as far as Emacs is concerned, but | 84 | Encoded text is not really text, as far as Emacs is concerned, but |
| @@ -125,9 +124,15 @@ range, the value is @code{nil}. | |||
| 125 | @end defun | 124 | @end defun |
| 126 | 125 | ||
| 127 | @defun byte-to-position byte-position | 126 | @defun byte-to-position byte-position |
| 128 | Return the buffer position, in character units, corresponding to | 127 | Return the buffer position, in character units, corresponding to the given |
| 129 | byte-position @var{byte-position} in the current buffer. If | 128 | @var{byte-position} in the current buffer. If @var{byte-position} is |
| 130 | @var{byte-position} is out of range, the value is @code{nil}. | 129 | out of range, the value is @code{nil}. In a multibyte buffer, an |
| 130 | arbitrary value of @var{byte-position} might not be at a character | ||
| 131 | boundary, but inside a multibyte sequence representing a single | ||
| 132 | character; in this case, this function returns the buffer position of | ||
| 133 | the character whose multibyte sequence includes @var{byte-position}. | ||
| 134 | In other words, the value does not change for all byte positions that | ||
| 135 | belong to the same character. | ||
| 131 | @end defun | 136 | @end defun |
| 132 | 137 | ||
| 133 | @defun multibyte-string-p string | 138 | @defun multibyte-string-p string |
| @@ -151,10 +156,11 @@ result a unibyte string. | |||
| 151 | @section Converting Text Representations | 156 | @section Converting Text Representations |
| 152 | 157 | ||
| 153 | Emacs can convert unibyte text to multibyte; it can also convert | 158 | Emacs can convert unibyte text to multibyte; it can also convert |
| 154 | multibyte text to unibyte, though this conversion loses information. In | 159 | multibyte text to unibyte, provided that the multibyte text contains |
| 155 | general these conversions happen when inserting text into a buffer, or | 160 | only @acronym{ASCII} and 8-bit characters. In general, these |
| 156 | when putting text from several strings together in one string. You can | 161 | conversions happen when inserting text into a buffer, or when putting |
| 157 | also explicitly convert a string's contents to either representation. | 162 | text from several strings together in one string. You can also |
| 163 | explicitly convert a string's contents to either representation. | ||
| 158 | 164 | ||
| 159 | Emacs chooses the representation for a string based on the text that | 165 | Emacs chooses the representation for a string based on the text that |
| 160 | it is constructed from. The general rule is to convert unibyte text to | 166 | it is constructed from. The general rule is to convert unibyte text to |
| @@ -173,89 +179,40 @@ acceptable because the buffer's representation is a choice made by the | |||
| 173 | user that cannot be overridden automatically. | 179 | user that cannot be overridden automatically. |
| 174 | 180 | ||
| 175 | Converting unibyte text to multibyte text leaves @acronym{ASCII} characters | 181 | Converting unibyte text to multibyte text leaves @acronym{ASCII} characters |
| 176 | unchanged, and likewise character codes 128 through 159. It converts | 182 | unchanged, and converts bytes with codes 128 through 159 to the |
| 177 | the non-@acronym{ASCII} codes 160 through 255 by adding the value | 183 | multibyte representation of raw eight-bit bytes. |
| 178 | @code{nonascii-insert-offset} to each character code. By setting this | ||
| 179 | variable, you specify which character set the unibyte characters | ||
| 180 | correspond to (@pxref{Character Sets}). For example, if | ||
| 181 | @code{nonascii-insert-offset} is 2048, which is @code{(- (make-char | ||
| 182 | 'latin-iso8859-1) 128)}, then the unibyte non-@acronym{ASCII} characters | ||
| 183 | correspond to Latin 1. If it is 2688, which is @code{(- (make-char | ||
| 184 | 'greek-iso8859-7) 128)}, then they correspond to Greek letters. | ||
| 185 | |||
| 186 | Converting multibyte text to unibyte is simpler: it discards all but | ||
| 187 | the low 8 bits of each character code. If @code{nonascii-insert-offset} | ||
| 188 | has a reasonable value, corresponding to the beginning of some character | ||
| 189 | set, this conversion is the inverse of the other: converting unibyte | ||
| 190 | text to multibyte and back to unibyte reproduces the original unibyte | ||
| 191 | text. | ||
| 192 | |||
| 193 | @defvar nonascii-insert-offset | ||
| 194 | This variable specifies the amount to add to a non-@acronym{ASCII} character | ||
| 195 | when converting unibyte text to multibyte. It also applies when | ||
| 196 | @code{self-insert-command} inserts a character in the unibyte | ||
| 197 | non-@acronym{ASCII} range, 128 through 255. However, the functions | ||
| 198 | @code{insert} and @code{insert-char} do not perform this conversion. | ||
| 199 | |||
| 200 | The right value to use to select character set @var{cs} is @code{(- | ||
| 201 | (make-char @var{cs}) 128)}. If the value of | ||
| 202 | @code{nonascii-insert-offset} is zero, then conversion actually uses the | ||
| 203 | value for the Latin 1 character set, rather than zero. | ||
| 204 | @end defvar | ||
| 205 | 184 | ||
| 206 | @defvar nonascii-translation-table | 185 | Converting multibyte text to unibyte converts all @acronym{ASCII} |
| 207 | This variable provides a more general alternative to | 186 | and eight-bit characters to their single-byte form, but loses |
| 208 | @code{nonascii-insert-offset}. You can use it to specify independently | 187 | information for non-@acronym{ASCII} characters by discarding all but |
| 209 | how to translate each code in the range of 128 through 255 into a | 188 | the low 8 bits of each character's codepoint. Converting unibyte text |
| 210 | multibyte character. The value should be a char-table, or @code{nil}. | 189 | to multibyte and back to unibyte reproduces the original unibyte text. |
| 211 | If this is non-@code{nil}, it overrides @code{nonascii-insert-offset}. | ||
| 212 | @end defvar | ||
| 213 | 190 | ||
| 214 | The next three functions either return the argument @var{string}, or a | 191 | The next two functions either return the argument @var{string}, or a |
| 215 | newly created string with no text properties. | 192 | newly created string with no text properties. |
| 216 | 193 | ||
| 217 | @defun string-make-unibyte string | ||
| 218 | This function converts the text of @var{string} to unibyte | ||
| 219 | representation, if it isn't already, and returns the result. If | ||
| 220 | @var{string} is a unibyte string, it is returned unchanged. Multibyte | ||
| 221 | character codes are converted to unibyte according to | ||
| 222 | @code{nonascii-translation-table} or, if that is @code{nil}, using | ||
| 223 | @code{nonascii-insert-offset}. If the lookup in the translation table | ||
| 224 | fails, this function takes just the low 8 bits of each character. | ||
| 225 | @end defun | ||
| 226 | |||
| 227 | @defun string-make-multibyte string | ||
| 228 | This function converts the text of @var{string} to multibyte | ||
| 229 | representation, if it isn't already, and returns the result. If | ||
| 230 | @var{string} is a multibyte string or consists entirely of | ||
| 231 | @acronym{ASCII} characters, it is returned unchanged. In particular, | ||
| 232 | if @var{string} is unibyte and entirely @acronym{ASCII}, the returned | ||
| 233 | string is unibyte. (When the characters are all @acronym{ASCII}, | ||
| 234 | Emacs primitives will treat the string the same way whether it is | ||
| 235 | unibyte or multibyte.) If @var{string} is unibyte and contains | ||
| 236 | non-@acronym{ASCII} characters, the function | ||
| 237 | @code{unibyte-char-to-multibyte} is used to convert each unibyte | ||
| 238 | character to a multibyte character. | ||
| 239 | @end defun | ||
| 240 | |||
| 241 | @defun string-to-multibyte string | 194 | @defun string-to-multibyte string |
| 242 | This function returns a multibyte string containing the same sequence | 195 | This function returns a multibyte string containing the same sequence |
| 243 | of character codes as @var{string}. Unlike | 196 | of characters as @var{string}. If @var{string} is a multibyte string, |
| 244 | @code{string-make-multibyte}, this function unconditionally returns a | 197 | it is returned unchanged. |
| 245 | multibyte string. If @var{string} is a multibyte string, it is | 198 | @end defun |
| 246 | returned unchanged. | 199 | |
| 200 | @defun string-to-unibyte string | ||
| 201 | This function returns a unibyte string containing the same sequence of | ||
| 202 | characters as @var{string}. It signals an error if @var{string} | ||
| 203 | contains a non-@acronym{ASCII} character. If @var{string} is a | ||
| 204 | unibyte string, it is returned unchanged. | ||
| 247 | @end defun | 205 | @end defun |
| 248 | 206 | ||
| 249 | @defun multibyte-char-to-unibyte char | 207 | @defun multibyte-char-to-unibyte char |
| 250 | This converts the multibyte character @var{char} to a unibyte | 208 | This converts the multibyte character @var{char} to a unibyte |
| 251 | character, based on @code{nonascii-translation-table} and | 209 | character. If @var{char} is a non-@acronym{ASCII} character, the |
| 252 | @code{nonascii-insert-offset}. | 210 | value is -1. |
| 253 | @end defun | 211 | @end defun |
| 254 | 212 | ||
| 255 | @defun unibyte-char-to-multibyte char | 213 | @defun unibyte-char-to-multibyte char |
| 256 | This converts the unibyte character @var{char} to a multibyte | 214 | This converts the unibyte character @var{char} to a multibyte |
| 257 | character, based on @code{nonascii-translation-table} and | 215 | character. |
| 258 | @code{nonascii-insert-offset}. | ||
| 259 | @end defun | 216 | @end defun |
| 260 | 217 | ||
| 261 | @node Selecting a Representation | 218 | @node Selecting a Representation |
| @@ -270,13 +227,13 @@ is non-@code{nil}, the buffer becomes multibyte. If @var{multibyte} | |||
| 270 | is @code{nil}, the buffer becomes unibyte. | 227 | is @code{nil}, the buffer becomes unibyte. |
| 271 | 228 | ||
| 272 | This function leaves the buffer contents unchanged when viewed as a | 229 | This function leaves the buffer contents unchanged when viewed as a |
| 273 | sequence of bytes. As a consequence, it can change the contents viewed | 230 | sequence of bytes. As a consequence, it can change the contents |
| 274 | as characters; a sequence of two bytes which is treated as one character | 231 | viewed as characters; a sequence of three bytes which is treated as |
| 275 | in multibyte representation will count as two characters in unibyte | 232 | one character in multibyte representation will count as three |
| 276 | representation. Character codes 128 through 159 are an exception. They | 233 | characters in unibyte representation. Eight-bit characters |
| 277 | are represented by one byte in a unibyte buffer, but when the buffer is | 234 | representing raw bytes are an exception. They are represented by one |
| 278 | set to multibyte, they are converted to two-byte sequences, and vice | 235 | byte in a unibyte buffer, but when the buffer is set to multibyte, |
| 279 | versa. | 236 | they are converted to two-byte sequences, and vice versa. |
| 280 | 237 | ||
| 281 | This function sets @code{enable-multibyte-characters} to record which | 238 | This function sets @code{enable-multibyte-characters} to record which |
| 282 | representation is in use. It also adjusts various data in the buffer | 239 | representation is in use. It also adjusts various data in the buffer |
| @@ -291,26 +248,26 @@ base buffer. | |||
| 291 | @defun string-as-unibyte string | 248 | @defun string-as-unibyte string |
| 292 | This function returns a string with the same bytes as @var{string} but | 249 | This function returns a string with the same bytes as @var{string} but |
| 293 | treating each byte as a character. This means that the value may have | 250 | treating each byte as a character. This means that the value may have |
| 294 | more characters than @var{string} has. | 251 | more characters than @var{string} has. Eight-bit characters |
| 252 | representing raw bytes are an exception: each one of them is converted | ||
| 253 | to a single byte. | ||
| 295 | 254 | ||
| 296 | If @var{string} is already a unibyte string, then the value is | 255 | If @var{string} is already a unibyte string, then the value is |
| 297 | @var{string} itself. Otherwise it is a newly created string, with no | 256 | @var{string} itself. Otherwise it is a newly created string, with no |
| 298 | text properties. If @var{string} is multibyte, any characters it | 257 | text properties. |
| 299 | contains of charset @code{eight-bit-control} or @code{eight-bit-graphic} | ||
| 300 | are converted to the corresponding single byte. | ||
| 301 | @end defun | 258 | @end defun |
| 302 | 259 | ||
| 303 | @defun string-as-multibyte string | 260 | @defun string-as-multibyte string |
| 304 | This function returns a string with the same bytes as @var{string} but | 261 | This function returns a string with the same bytes as @var{string} but |
| 305 | treating each multibyte sequence as one character. This means that the | 262 | treating each multibyte sequence as one character. This means that |
| 306 | value may have fewer characters than @var{string} has. | 263 | the value may have fewer characters than @var{string} has. If a byte |
| 264 | sequence in @var{string} is invalid as a multibyte representation of a | ||
| 265 | single character, each byte in the sequence is treated as a raw 8-bit | ||
| 266 | byte. | ||
| 307 | 267 | ||
| 308 | If @var{string} is already a multibyte string, then the value is | 268 | If @var{string} is already a multibyte string, then the value is |
| 309 | @var{string} itself. Otherwise it is a newly created string, with no | 269 | @var{string} itself. Otherwise it is a newly created string, with no |
| 310 | text properties. If @var{string} is unibyte and contains any individual | 270 | text properties. |
| 311 | 8-bit bytes (i.e.@: not part of a multibyte form), they are converted to | ||
| 312 | the corresponding multibyte character of charset @code{eight-bit-control} | ||
| 313 | or @code{eight-bit-graphic}. | ||
| 314 | @end defun | 271 | @end defun |
| 315 | 272 | ||
| 316 | @node Character Codes | 273 | @node Character Codes |
| @@ -320,13 +277,13 @@ or @code{eight-bit-graphic}. | |||
| 320 | The unibyte and multibyte text representations use different | 277 | The unibyte and multibyte text representations use different |
| 321 | character codes. The valid character codes for unibyte representation | 278 | character codes. The valid character codes for unibyte representation |
| 322 | range from 0 to 255---the values that can fit in one byte. The valid | 279 | range from 0 to 255---the values that can fit in one byte. The valid |
| 323 | character codes for multibyte representation range from 0 to 4194303, | 280 | character codes for multibyte representation range from 0 to 4194303 |
| 324 | but not all values in that range are valid. The values 128 through | 281 | (#x3FFFFF). In this code space, values 0 through 127 are for |
| 325 | 255 do not usually show up in multibyte text, but they can occur if | 282 | @acronym{ASCII} characters, and values 128 through 4194175 (#x3FFF7F) |
| 326 | you do explicit encoding and decoding (@pxref{Explicit Encoding}). | 283 | are for non-@acronym{ASCII} characters. Values 0 through 1114111 |
| 327 | Some other character codes cannot occur at all in multibyte text. | 284 | (#x10FFFF) correspond to Unicode characters of the same codepoint, |
| 328 | Only the @acronym{ASCII} codes 0 through 127 are completely legitimate | 285 | while values 4194176 (#x3FFF80) through 4194303 (#x3FFFFF) are for |
| 329 | in both representations. | 286 | representing eight-bit raw bytes. |
| 330 | 287 | ||
| 331 | @defun characterp charcode | 288 | @defun characterp charcode |
| 332 | This returns @code{t} if @var{charcode} is a valid character, and | 289 | This returns @code{t} if @var{charcode} is a valid character, and |
| @@ -335,8 +292,6 @@ This returns @code{t} if @var{charcode} is a valid character, and | |||
| 335 | @example | 292 | @example |
| 336 | (characterp 65) | 293 | (characterp 65) |
| 337 | @result{} t | 294 | @result{} t |
| 338 | (characterp 256) | ||
| 339 | @result{} nil | ||
| 340 | (characterp 4194303) | 295 | (characterp 4194303) |
| 341 | @result{} t | 296 | @result{} t |
| 342 | (characterp 4194304) | 297 | (characterp 4194304) |
| @@ -344,27 +299,45 @@ This returns @code{t} if @var{charcode} is a valid character, and | |||
| 344 | @end example | 299 | @end example |
| 345 | @end defun | 300 | @end defun |
| 346 | 301 | ||
| 302 | @defun get-byte pos &optional string | ||
| 303 | This function returns the byte at current buffer's character position | ||
| 304 | @var{pos}. If the current buffer is unibyte, this is literally the | ||
| 305 | byte at that position. If the buffer is multibyte, byte values of | ||
| 306 | @acronym{ASCII} characters are the same as character codepoints, | ||
| 307 | whereas eight-bit raw bytes are converted to their 8-bit codes. The | ||
| 308 | function signals an error if the character at @var{pos} is | ||
| 309 | non-@acronym{ASCII}. | ||
| 310 | |||
| 311 | The optional argument @var{string} means to get a byte value from that | ||
| 312 | string instead of the current buffer. | ||
| 313 | @end defun | ||
| 314 | |||
| 347 | @node Character Sets | 315 | @node Character Sets |
| 348 | @section Character Sets | 316 | @section Character Sets |
| 349 | @cindex character sets | 317 | @cindex character sets |
| 350 | 318 | ||
| 351 | Emacs classifies characters into various @dfn{character sets}, each of | 319 | @cindex charset |
| 352 | which has a name which is a symbol. Each character belongs to one and | 320 | @cindex coded character set |
| 353 | only one character set. | 321 | An Emacs @dfn{character set}, or @dfn{charset}, is a set of characters |
| 354 | 322 | in which each character is assigned a numeric code point. (The | |
| 355 | In general, there is one character set for each distinct script. For | 323 | Unicode standard calls this a @dfn{coded character set}.) Each |
| 356 | example, @code{latin-iso8859-1} is one character set, | 324 | charset has a name which is a symbol. A single character can belong |
| 357 | @code{greek-iso8859-7} is another, and @code{ascii} is another. An | 325 | to any number of different character sets, but it will generally have |
| 358 | Emacs character set can hold at most 9025 characters; therefore, in some | 326 | a different code point in each charset. Examples of character sets |
| 359 | cases, characters that would logically be grouped together are split | 327 | include @code{ascii}, @code{iso-8859-1}, @code{greek-iso8859-7}, and |
| 360 | into several character sets. For example, one set of Chinese | 328 | @code{windows-1255}. The code point assigned to a character in a |
| 361 | characters, generally known as Big 5, is divided into two Emacs | 329 | charset is usually different from its code point used in Emacs buffers |
| 362 | character sets, @code{chinese-big5-1} and @code{chinese-big5-2}. | 330 | and strings. |
| 363 | 331 | ||
| 364 | @acronym{ASCII} characters are in character set @code{ascii}. The | 332 | @cindex @code{emacs}, a charset |
| 365 | non-@acronym{ASCII} characters 128 through 159 are in character set | 333 | @cindex @code{unicode}, a charset |
| 366 | @code{eight-bit-control}, and codes 160 through 255 are in character set | 334 | @cindex @code{eight-bit}, a charset |
| 367 | @code{eight-bit-graphic}. | 335 | Emacs defines several special character sets. The character set |
| 336 | @code{unicode} includes all the characters whose Emacs code points are | ||
| 337 | in the range @code{0..10FFFF}. The character set @code{emacs} | ||
| 338 | includes all @acronym{ASCII} and non-@acronym{ASCII} characters. | ||
| 339 | Finally, the @code{eight-bit} charset includes the 8-bit raw bytes; | ||
| 340 | Emacs uses it to represent raw bytes encountered in text. | ||
| 368 | 341 | ||
| 369 | @defun charsetp object | 342 | @defun charsetp object |
| 370 | Returns @code{t} if @var{object} is a symbol that names a character set, | 343 | Returns @code{t} if @var{object} is a symbol that names a character set, |
| @@ -375,110 +348,60 @@ Returns @code{t} if @var{object} is a symbol that names a character set, | |||
| 375 | The value is a list of all defined character set names. | 348 | The value is a list of all defined character set names. |
| 376 | @end defvar | 349 | @end defvar |
| 377 | 350 | ||
| 378 | @defun charset-list | 351 | @defun charset-priority-list &optional highestp |
| 379 | This function returns the value of @code{charset-list}. It is only | 352 | This function returns a list of all defined character sets ordered by |
| 380 | provided for backward compatibility. | 353 | their priority. If @var{highestp} is non-@code{nil}, the function |
| 354 | returns a single character set of the highest priority. | ||
| 355 | @end defun | ||
| 356 | |||
| 357 | @defun set-charset-priority &rest charsets | ||
| 358 | This function makes @var{charsets} the highest priority character sets. | ||
| 381 | @end defun | 359 | @end defun |
| 382 | 360 | ||
| 383 | @defun char-charset character | 361 | @defun char-charset character |
| 384 | This function returns the name of the character set that @var{character} | 362 | This function returns the name of the character set of highest |
| 385 | belongs to, or the symbol @code{unknown} if @var{character} is not a | 363 | priority that @var{character} belongs to. @acronym{ASCII} characters |
| 386 | valid character. | 364 | are an exception: for them, this function always returns @code{ascii}. |
| 387 | @end defun | 365 | @end defun |
| 388 | 366 | ||
| 389 | @defun charset-plist charset | 367 | @defun charset-plist charset |
| 390 | This function returns the charset property list of the character set | 368 | This function returns the property list of the character set |
| 391 | @var{charset}. Although @var{charset} is a symbol, this is not the same | 369 | @var{charset}. Although @var{charset} is a symbol, this is not the |
| 392 | as the property list of that symbol. Charset properties are used for | 370 | same as the property list of that symbol. Charset properties include |
| 393 | special purposes within Emacs. | 371 | important information about the charset, such as its documentation |
| 372 | string, short name, etc. | ||
| 394 | @end defun | 373 | @end defun |
| 395 | 374 | ||
| 396 | @deffn Command list-charset-chars charset | 375 | @defun put-charset-property charset propname value |
| 397 | This command displays a list of characters in the character set | 376 | This function sets the @var{propname} property of @var{charset} to the |
| 398 | @var{charset}. | 377 | given @var{value}. |
| 399 | @end deffn | ||
| 400 | |||
| 401 | @node Chars and Bytes | ||
| 402 | @section Characters and Bytes | ||
| 403 | @cindex bytes and characters | ||
| 404 | |||
| 405 | @cindex introduction sequence (of character) | ||
| 406 | @cindex dimension (of character set) | ||
| 407 | In multibyte representation, each character occupies one or more | ||
| 408 | bytes. Each character set has an @dfn{introduction sequence}, which is | ||
| 409 | normally one or two bytes long. (Exception: the @code{ascii} character | ||
| 410 | set and the @code{eight-bit-graphic} character set have a zero-length | ||
| 411 | introduction sequence.) The introduction sequence is the beginning of | ||
| 412 | the byte sequence for any character in the character set. The rest of | ||
| 413 | the character's bytes distinguish it from the other characters in the | ||
| 414 | same character set. Depending on the character set, there are either | ||
| 415 | one or two distinguishing bytes; the number of such bytes is called the | ||
| 416 | @dfn{dimension} of the character set. | ||
| 417 | |||
| 418 | @defun charset-dimension charset | ||
| 419 | This function returns the dimension of @var{charset}; at present, the | ||
| 420 | dimension is always 1 or 2. | ||
| 421 | @end defun | 378 | @end defun |
| 422 | 379 | ||
| 423 | @defun charset-bytes charset | 380 | @defun get-charset-property charset propname |
| 424 | This function returns the number of bytes used to represent a character | 381 | This function returns the value of @var{charset}'s property |
| 425 | in character set @var{charset}. | 382 | @var{propname}. |
| 426 | @end defun | 383 | @end defun |
| 427 | 384 | ||
| 428 | This is the simplest way to determine the byte length of a character | 385 | @deffn Command list-charset-chars charset |
| 429 | set's introduction sequence: | 386 | This command displays a list of characters in the character set |
| 430 | 387 | @var{charset}. | |
| 431 | @example | 388 | @end deffn |
| 432 | (- (charset-bytes @var{charset}) | ||
| 433 | (charset-dimension @var{charset})) | ||
| 434 | @end example | ||
| 435 | |||
| 436 | @node Splitting Characters | ||
| 437 | @section Splitting Characters | ||
| 438 | @cindex character as bytes | ||
| 439 | |||
| 440 | The functions in this section convert between characters and the byte | ||
| 441 | values used to represent them. For most purposes, there is no need to | ||
| 442 | be concerned with the sequence of bytes used to represent a character, | ||
| 443 | because Emacs translates automatically when necessary. | ||
| 444 | |||
| 445 | @defun split-char character | ||
| 446 | Return a list containing the name of the character set of | ||
| 447 | @var{character}, followed by one or two byte values (integers) which | ||
| 448 | identify @var{character} within that character set. The number of byte | ||
| 449 | values is the character set's dimension. | ||
| 450 | |||
| 451 | If @var{character} is invalid as a character code, @code{split-char} | ||
| 452 | returns a list consisting of the symbol @code{unknown} and @var{character}. | ||
| 453 | 389 | ||
| 454 | @example | 390 | @defun decode-char charset code-point |
| 455 | (split-char 2248) | 391 | This function decodes a character that is assigned a @var{code-point} |
| 456 | @result{} (latin-iso8859-1 72) | 392 | in @var{charset}, to the corresponding Emacs character, and returns |
| 457 | (split-char 65) | 393 | that character. If @var{charset} doesn't contain a character of that |
| 458 | @result{} (ascii 65) | 394 | code point, the value is @code{nil}. If @var{code-point} doesn't fit |
| 459 | (split-char 128) | 395 | in a Lisp integer (@pxref{Integer Basics, most-positive-fixnum}), it |
| 460 | @result{} (eight-bit-control 128) | 396 | can be specified as a cons cell @code{(@var{high} . @var{low})}, where |
| 461 | @end example | 397 | @var{low} are the lower 16 bits of the value and @var{high} are the |
| 398 | high 16 bits. | ||
| 462 | @end defun | 399 | @end defun |
| 463 | 400 | ||
| 464 | @c FIXME: update split-char and make-char | 401 | @defun encode-char char charset |
| 465 | @cindex generate characters in charsets | 402 | This function returns the code point assigned to the character |
| 466 | @defun make-char charset &optional code1 code2 | 403 | @var{char} in @var{charset}. If @var{charset} doesn't contain |
| 467 | This function returns the character in character set @var{charset} whose | 404 | @var{char}, the value is @code{nil}. |
| 468 | position codes are @var{code1} and @var{code2}. This is roughly the | ||
| 469 | inverse of @code{split-char}. Normally, you should specify either one | ||
| 470 | or both of @var{code1} and @var{code2} according to the dimension of | ||
| 471 | @var{charset}. For example, | ||
| 472 | |||
| 473 | @example | ||
| 474 | (make-char 'latin-iso8859-1 72) | ||
| 475 | @result{} 2248 | ||
| 476 | @end example | ||
| 477 | |||
| 478 | Actually, the eighth bit of both @var{code1} and @var{code2} is zeroed | ||
| 479 | before they are used to index @var{charset}. Thus you may use, for | ||
| 480 | instance, an ISO 8859 character code rather than subtracting 128, as | ||
| 481 | is necessary to index the corresponding Emacs charset. | ||
| 482 | @end defun | 405 | @end defun |
| 483 | 406 | ||
| 484 | @node Scanning Charsets | 407 | @node Scanning Charsets |
| @@ -490,15 +413,16 @@ coding systems (@pxref{Coding Systems}) are capable of representing all | |||
| 490 | of the text in question. | 413 | of the text in question. |
| 491 | 414 | ||
| 492 | @defun charset-after &optional pos | 415 | @defun charset-after &optional pos |
| 493 | This function return the charset of a character in the current buffer | 416 | This function returns the charset of highest priority containing the |
| 494 | at position @var{pos}. If @var{pos} is omitted or @code{nil}, it | 417 | character in the current buffer at position @var{pos}. If @var{pos} |
| 495 | defaults to the current value of point. If @var{pos} is out of range, | 418 | is omitted or @code{nil}, it defaults to the current value of point. |
| 496 | the value is @code{nil}. | 419 | If @var{pos} is out of range, the value is @code{nil}. |
| 497 | @end defun | 420 | @end defun |
| 498 | 421 | ||
| 499 | @defun find-charset-region beg end &optional translation | 422 | @defun find-charset-region beg end &optional translation |
| 500 | This function returns a list of the character sets that appear in the | 423 | This function returns a list of the character sets of highest priority |
| 501 | current buffer between positions @var{beg} and @var{end}. | 424 | that contain characters in the current buffer between positions |
| 425 | @var{beg} and @var{end}. | ||
| 502 | 426 | ||
| 503 | The optional argument @var{translation} specifies a translation table to | 427 | The optional argument @var{translation} specifies a translation table to |
| 504 | be used in scanning the text (@pxref{Translation of Characters}). If it | 428 | be used in scanning the text (@pxref{Translation of Characters}). If it |
| @@ -508,10 +432,10 @@ characters instead of the characters actually in the buffer. | |||
| 508 | @end defun | 432 | @end defun |
| 509 | 433 | ||
| 510 | @defun find-charset-string string &optional translation | 434 | @defun find-charset-string string &optional translation |
| 511 | This function returns a list of the character sets that appear in the | 435 | This function returns a list of the character sets of highest priority |
| 512 | string @var{string}. It is just like @code{find-charset-region}, except | 436 | that contain characters in @var{string}. It is just like |
| 513 | that it applies to the contents of @var{string} instead of part of the | 437 | @code{find-charset-region}, except that it applies to the contents of |
| 514 | current buffer. | 438 | @var{string} instead of part of the current buffer. |
| 515 | @end defun | 439 | @end defun |
| 516 | 440 | ||
| 517 | @node Translation of Characters | 441 | @node Translation of Characters |
| @@ -519,19 +443,17 @@ current buffer. | |||
| 519 | @cindex character translation tables | 443 | @cindex character translation tables |
| 520 | @cindex translation tables | 444 | @cindex translation tables |
| 521 | 445 | ||
| 522 | A @dfn{translation table} is a char-table that specifies a mapping | 446 | A @dfn{translation table} is a char-table (@pxref{Char-Tables}) that |
| 523 | of characters into characters. These tables are used in encoding and | 447 | specifies a mapping of characters into characters. These tables are |
| 524 | decoding, and for other purposes. Some coding systems specify their | 448 | used in encoding and decoding, and for other purposes. Some coding |
| 525 | own particular translation tables; there are also default translation | 449 | systems specify their own particular translation tables; there are |
| 526 | tables which apply to all other coding systems. | 450 | also default translation tables which apply to all other coding |
| 451 | systems. | ||
| 527 | 452 | ||
| 528 | For instance, the coding-system @code{utf-8} has a translation table | 453 | A translation table has two extra slots. The first is either |
| 529 | that maps characters of various charsets (e.g., | 454 | @code{nil} or a translation table that performs the reverse |
| 530 | @code{latin-iso8859-@var{x}}) into Unicode character sets. This way, | 455 | translation; the second is the maximum number of characters to look up |
| 531 | it can encode Latin-2 characters into UTF-8. Meanwhile, | 456 | for translation. |
| 532 | @code{unify-8859-on-decoding-mode} operates by specifying | ||
| 533 | @code{standard-translation-table-for-decode} to translate | ||
| 534 | Latin-@var{x} characters into corresponding Unicode characters. | ||
| 535 | 457 | ||
| 536 | @defun make-translation-table &rest translations | 458 | @defun make-translation-table &rest translations |
| 537 | This function returns a translation table based on the argument | 459 | This function returns a translation table based on the argument |
| @@ -545,34 +467,66 @@ character, say @var{to-alt}, @var{from} is also translated to | |||
| 545 | @var{to-alt}. | 467 | @var{to-alt}. |
| 546 | @end defun | 468 | @end defun |
| 547 | 469 | ||
| 548 | In decoding, the translation table's translations are applied to the | 470 | During decoding, the translation table's translations are applied to |
| 549 | characters that result from ordinary decoding. If a coding system has | 471 | the characters that result from ordinary decoding. If a coding system |
| 550 | property @code{translation-table-for-decode}, that specifies the | 472 | has property @code{:decode-translation-table}, that specifies the |
| 551 | translation table to use. (This is a property of the coding system, | 473 | translation table to use, or a list of translation tables to apply in |
| 552 | as returned by @code{coding-system-get}, not a property of the symbol | 474 | sequence. (This is a property of the coding system, as returned by |
| 553 | that is the coding system's name. @xref{Coding System Basics,, Basic | 475 | @code{coding-system-get}, not a property of the symbol that is the |
| 554 | Concepts of Coding Systems}.) Otherwise, if | 476 | coding system's name. @xref{Coding System Basics,, Basic Concepts of |
| 555 | @code{standard-translation-table-for-decode} is non-@code{nil}, | 477 | Coding Systems}.) Finally, if |
| 556 | decoding uses that table. | 478 | @code{standard-translation-table-for-decode} is non-@code{nil}, the |
| 557 | 479 | resulting characters are translated by that table. | |
| 558 | In encoding, the translation table's translations are applied to the | 480 | |
| 559 | characters in the buffer, and the result of translation is actually | 481 | During encoding, the translation table's translations are applied to |
| 560 | encoded. If a coding system has property | 482 | the characters in the buffer, and the result of translation is |
| 561 | @code{translation-table-for-encode}, that specifies the translation | 483 | actually encoded. If a coding system has property |
| 562 | table to use. Otherwise the variable | 484 | @code{:encode-translation-table}, that specifies the translation table |
| 563 | @code{standard-translation-table-for-encode} specifies the translation | 485 | to use, or a list of translation tables to apply in sequence. In |
| 564 | table. | 486 | addition, if the variable @code{standard-translation-table-for-encode} |
| 487 | is non-@code{nil}, it specifies the translation table to use for | ||
| 488 | translating the result. | ||
| 565 | 489 | ||
| 566 | @defvar standard-translation-table-for-decode | 490 | @defvar standard-translation-table-for-decode |
| 567 | This is the default translation table for decoding, for | 491 | This is the default translation table for decoding. If a coding |
| 568 | coding systems that don't specify any other translation table. | 492 | system specifies its own translation tables, the table that is the |
| 493 | value of this variable, if non-@code{nil}, is applied after them. | ||
| 569 | @end defvar | 494 | @end defvar |
| 570 | 495 | ||
| 571 | @defvar standard-translation-table-for-encode | 496 | @defvar standard-translation-table-for-encode |
| 572 | This is the default translation table for encoding, for | 497 | This is the default translation table for encoding. If a coding |
| 573 | coding systems that don't specify any other translation table. | 498 | system specifies its own translation tables, the table that is the |
| 499 | value of this variable, if non-@code{nil}, is applied after them. | ||
| 574 | @end defvar | 500 | @end defvar |
| 575 | 501 | ||
| 502 | @defun make-translation-table-from-vector vec | ||
| 503 | This function returns a translation table made from @var{vec}, which is | ||
| 504 | an array of 256 elements to map byte values 0 through 255 to | ||
| 505 | characters. Elements may be @code{nil} for untranslated bytes. The | ||
| 506 | returned table has a translation table for reverse mapping in the | ||
| 507 | first extra slot. | ||
| 508 | |||
| 509 | This function provides an easy way to make a private coding system | ||
| 510 | that maps each byte to a specific character. You can specify the | ||
| 511 | returned table and the reverse translation table using the properties | ||
| 512 | @code{:decode-translation-table} and @code{:encode-translation-table} | ||
| 513 | respectively in the @var{props} argument to | ||
| 514 | @code{define-coding-system}. | ||
| 515 | @end defun | ||
| 516 | |||
| 517 | @defun make-translation-table-from-alist alist | ||
| 518 | This function is similar to @code{make-translation-table} but returns | ||
| 519 | a complex translation table rather than a simple one-to-one mapping. | ||
| 520 | Each element of @var{alist} is of the form @code{(@var{from} | ||
| 521 | . @var{to})}, where @var{from} and @var{to} are either a character or | ||
| 522 | a vector specifying a sequence of characters. If @var{from} is a | ||
| 523 | character, that character is translated to @var{to} (i.e.@: to a | ||
| 524 | character or a character sequence). If @var{from} is a vector of | ||
| 525 | characters, that sequence is translated to @var{to}. The returned | ||
| 526 | table has a translation table for reverse mapping in the first extra | ||
| 527 | slot. | ||
| 528 | @end defun | ||
| 529 | |||
| 576 | @node Coding Systems | 530 | @node Coding Systems |
| 577 | @section Coding Systems | 531 | @section Coding Systems |
| 578 | 532 | ||