-rw-r--r-- doc/lispref/nonascii.texi 247
1 files changed, 247 insertions, 0 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 256d2c8f38a..c967c28f631 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -19,6 +19,8 @@ how they are stored in strings and buffers.
19* Selecting a Representation:: Treating a byte sequence as unibyte or multi.
20* Character Codes:: How unibyte and multibyte relate to
21 codes of individual characters.
22* Character Properties:: Character attributes that define their
23 behavior and handling.
24* Character Sets:: The space of possible character codes
25 is divided into various character sets.
26* Scanning Charsets:: Which character sets are used in a buffer?
@@ -344,6 +346,184 @@ The optional argument @var{string} means to get a byte value from that
346string instead of the current buffer.
347@end defun
348
349@node Character Properties
350@section Character Properties
351@cindex character properties
352A @dfn{character property} is a named attribute of a character that
353specifies how the character behaves and how it should be handled
354during text processing and display. Thus, character properties are an
355important part of specifying the character's semantics.
356
357 Emacs generally follows the Unicode Standard in its implementation
358of character properties. In particular, Emacs supports the
359@uref{http://www.unicode.org/reports/tr23/, Unicode Character Property
360Model}, and the Emacs character property database is derived from the
361Unicode Character Database (@acronym{UCD}). See the
362@uref{http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf, Character
363Properties chapter of the Unicode Standard}, for more details about
364Unicode character properties and their meaning.
365
366 The facilities documented in this section are useful for setting and
367retrieving properties of characters.
368
369 In Emacs, each property has a name, which is a symbol, and a set of
370possible values, whose types depend on the property. Here's the full
371list of character properties that Emacs knows about:
372
373@table @code
374@item name
375The character's canonical unique name. The value of the property is a
376string consisting of upper-case Latin letters A to Z, digits, spaces,
377and hyphen @samp{-} characters.
378
379@item general-category
380This property assigns the character to one of the major classes, such
381as letters, punctuation, and symbols, and its important subclasses.
382The value is a symbol whose name is a 2-letter abbreviation. The
383first letter specifies the character's major class and the second
384letter designates a subclass of that major class.
385
386@item canonical-combining-class
387This property classifies combining characters into several classes,
388depending on the details of their behavior in sequences of combining
389characters. The property's value is an integer number.
390
391@item bidi-class
392This property specifies character attributes required for correct
393display of @dfn{bidirectional text} used by right-to-left scripts,
394such as Arabic and Hebrew. The value is a symbol whose name is the
395Unicode @dfn{directional type} of the character.
396
397@item decomposition
398This property defines a mapping from a character to a sequence of one
399or more characters that is a canonical or compatibility equivalent to
400it. The value is a list, whose first element may be a symbol
401representing a compatibility formatting tag, such as @code{<small>};
402the other elements are characters that give the compatibility
403decomposition sequence.
404
405@item decimal-digit-value
406This property specifies a numeric value of characters that represent
407decimal digits. The value is an integer number.
408
409@item digit-value
410This property specifies a numeric value of characters that represent
411digits, but not necessarily decimal. Examples include compatibility
412subscript and superscript digits. The value is an integer number.
413
414@item numeric-value
415This property specifies whether the character represents a number.
416Examples of characters that do include fractions, subscripts,
417superscripts, Roman numerals, currency numerators, and encircled
418numbers. The value is a symbol whose name gives the numeric value;
419for example, the value of this property for the character
420@code{U+2155} (@sc{vulgar fraction one fifth}) is the symbol
421@samp{1/5}.
422
423@item mirrored
424This is a property of characters such as parentheses, which need to be
425mirrored horizontally in right to left scripts. The value is a
426symbol, either @samp{Y} or @samp{N}.
427
428@item old-name
429This property's value specifies the name, if any, of the character in
430the old version 1.0 of the Unicode Standard. The value is a string.
431
432@item iso-10646-comment
433The value of this property is the character's comment field from the
434ISO 10646 standard. It is a string, or @code{nil} if there's no comment.
435
436@item uppercase
437If this character has an upper-case equivalent that is a single
438character, then the value of this property is that upper-case
439equivalent. Otherwise, the value is @code{nil}.
440
441@item lowercase
442If this character has a lower-case equivalent that is a single
443character, then the value of this property is that lower-case
444equivalent. Otherwise, the value is @code{nil}.
445
446@item titlecase
447@dfn{Title case} is a special form of a character used when the first
448character of a word needs to be capitalized. If a character has a
449title-case equivalent that is a single character, then the value of
450this property is that title-case equivalent. Otherwise, the value is
451@code{nil}.
452@end table
453
454@defun get-char-code-property char propname
455This function returns the value of @var{char}'s @var{propname} property.
456
457@example
458@group
459(get-char-code-property ?\s 'general-category)
460 @result{} Zs
461@end group
462@group
463(get-char-code-property ?1 'general-category)
464 @result{} Nd
465@end group
466@group
467(get-char-code-property ?\u2084 'digit-value) ; subscript 4
468 @result{} 4
469@end group
470@group
471(get-char-code-property ?\u2155 'numeric-value) ; one fifth
472 @result{} 1/5
473@end group
474@group
475(get-char-code-property ?\u2163 'numeric-value) ; Roman IV
476 @result{} \4
477@end group
478@end example
479@end defun
480
481@defun char-code-property-description prop value
482This function returns the description string of property @var{prop}'s
483@var{value}, or @code{nil} if @var{value} has no description.
484
485@example
486@group
487(char-code-property-description 'general-category 'Zs)
488 @result{} "Separator, Space"
489@end group
490@group
491(char-code-property-description 'general-category 'Nd)
492 @result{} "Number, Decimal Digit"
493@end group
494@group
495(char-code-property-description 'numeric-value '1/5)
496 @result{} nil
497@end group
498@end example
499@end defun
500
501@defun put-char-code-property char propname value
502This function stores @var{value} as the value of the property
503@var{propname} for the character @var{char}.
504@end defun
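
  As a minimal sketch of how the two functions fit together, the
following stores a value and reads it back. The property name
@code{my-note} is hypothetical and used only for illustration; the
sketch assumes that a name absent from the list above is accepted and
simply stored alongside the standard properties.

@example
@group
;; `my-note' is a made-up property name, not one Emacs defines.
(put-char-code-property ?a 'my-note "first Latin letter")
;; Read the stored value back for the same character and property.
(get-char-code-property ?a 'my-note)
@end group
@end example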
505
506@defvar char-script-table
507The value of this variable is a char-table (@pxref{Char-Tables}) that
508specifies, for each character, a symbol whose name is the script to
509which the character belongs, according to the Unicode Standard
510classification of the Unicode code space into script-specific blocks.
511This char-table has a single extra slot whose value is the list of all
512script symbols.
513@end defvar
514
515@defvar char-width-table
516The value of this variable is a char-table that specifies the width of
517each character in columns that it will occupy on the screen.
518@end defvar
519
520@defvar printable-chars
521The value of this variable is a char-table that specifies, for each
522character, whether it is printable or not. That is, if evaluating
523@code{(aref printable-chars char)} results in @code{t}, the character
524is printable, and if it results in @code{nil}, it is not.
525@end defvar
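
  Because these three variables hold char-tables, they can be queried
with @code{aref}, as the description of @code{printable-chars}
suggests. The following is a sketch only; the exact values depend on
the Emacs version and its Unicode data.

@example
@group
;; Script symbol for a character; for ?a this should be `latin'.
(aref char-script-table ?a)
;; Display width of the character, in columns.
(aref char-width-table ?a)
;; Non-nil if the character is printable.
(aref printable-chars ?a)
;; The single extra slot of `char-script-table' lists all scripts.
(char-table-extra-slot char-script-table 0)
@end group
@end example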
526
527@node Character Sets
528@section Character Sets
529@cindex character sets
@@ -692,6 +872,10 @@ The value of the @code{:mime-charset} property is also defined
872as an alias for the coding system.
873@end defun
874
875@defun coding-system-aliases coding-system
876This function returns the list of aliases of @var{coding-system}.
877@end defun
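
  For instance, one might test whether a name is an alias of a coding
system as sketched below. The choice of @code{iso-latin-1} and
@code{latin-1} is only illustrative, and the result depends on the
aliases defined in the running Emacs.

@example
@group
;; Non-nil if `latin-1' is among the aliases of `iso-latin-1'.
(memq 'latin-1 (coding-system-aliases 'iso-latin-1))
@end group
@end example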
878
879@node Encoding and I/O
880@subsection Encoding and I/O
881
@@ -865,6 +1049,22 @@ This function returns a list of coding systems that could be used to
1049encode all the character sets in the list @var{charsets}.
1050@end defun
1051
1052@defun check-coding-systems-region start end coding-system-list
1053This function checks whether coding systems in the list
1054@var{coding-system-list} can encode all the characters in the region
1055between @var{start} and @var{end}. If all of the coding systems in
1056the list can encode the specified text, the function returns
1057@code{nil}. If some coding systems cannot encode some of the
1058characters, the value is an alist, each element of which has the form
1059@code{(@var{coding-system1} @var{pos1} @var{pos2} @dots{})}, meaning
1060that @var{coding-system1} cannot encode characters at buffer positions
1061@var{pos1}, @var{pos2}, @enddots{}.
1062
1063@var{start} may be a string, in which case @var{end} is ignored and
1064the returned value references string indices instead of buffer
1065positions.
1066@end defun
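
  The string form is convenient for a quick check. In the sketch
below, the sample text and the two candidate coding systems are
arbitrary choices made for illustration:

@example
@group
;; END is ignored because START is a string.
(check-coding-systems-region "caf\u00e9" nil '(us-ascii utf-8))
@end group
@end example

@noindent
Since @code{us-ascii} cannot encode the last character of the string,
the result should be an alist with an element for @code{us-ascii} and
none for @code{utf-8}.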
1067
1068@defun detect-coding-region start end &optional highest
1069This function chooses a plausible coding system for decoding the text
1070from @var{start} to @var{end}. This text should be a byte sequence,
@@ -888,6 +1088,26 @@ This function is like @code{detect-coding-region} except that it
1088operates on the contents of @var{string} instead of bytes in the buffer.
1089@end defun
1090
1091@defun coding-system-charset-list coding-system
1092This function returns the list of character sets (@pxref{Character
1093Sets}) supported by @var{coding-system}. Some coding systems that
1094support too many character sets to list them all yield special values:
1095@itemize @bullet
1096@item
1097If @var{coding-system} supports all the ISO-2022 charsets, the value
1098is @code{iso-2022}.
1099@item
1100If @var{coding-system} supports all Emacs characters, the value is
1101@code{(emacs)}.
1102@item
1103If @var{coding-system} supports all emacs-mule characters, the value
1104is @code{emacs-mule}.
1105@item
1106If @var{coding-system} supports all Unicode characters, the value is
1107@code{(unicode)}.
1108@end itemize
1109@end defun
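
  A sketch of how a caller might distinguish an explicit list of
charsets from the special values described above; the helper name
@code{my-coding-system-universal-p} is hypothetical:

@example
@group
;; Hypothetical helper: non-nil if CODING-SYSTEM yields one of the
;; special values instead of an ordinary list of charsets.
(defun my-coding-system-universal-p (coding-system)
  (let ((charsets (coding-system-charset-list coding-system)))
    (or (memq charsets '(iso-2022 emacs-mule))
        (member charsets '((emacs) (unicode))))))
@end group
@end example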
1110
1111 @xref{Coding systems for a subprocess,, Process Information}, in
1112particular the description of the functions
1113@code{process-coding-system} and @code{set-process-coding-system}, for
@@ -1179,6 +1399,33 @@ Emacs I/O and subprocess primitives, and to the explicit encoding and
1399decoding functions (@pxref{Explicit Encoding}).
1400@end defvar
1401
1402@cindex priority order of coding systems
1403@cindex coding systems, priority
1404 Sometimes, you need to prefer several coding systems for some
1405operation, rather than fix a single one. Emacs lets you specify a
1406priority order for using coding systems. This ordering affects the
1407sorting of lists of coding systems returned by functions such as
1408@code{find-coding-systems-region} (@pxref{Lisp and Coding Systems}).
1409
1410@defun coding-system-priority-list &optional highestp
1411This function returns the list of coding systems in the order of their
1412current priorities. Optional argument @var{highestp}, if
1413non-@code{nil}, means return only the highest priority coding system.
1414@end defun
1415
1416@defun set-coding-system-priority &rest coding-systems
1417This function puts @var{coding-systems} at the beginning of the
1418priority list for coding systems, thus making their priority higher
1419than all the rest.
1420@end defun
1421
1422@defmac with-coding-priority coding-systems &rest body@dots{}
1423This macro executes @var{body}, like @code{progn} does
1424(@pxref{Sequencing, progn}), with @var{coding-systems} at the front of
1425the priority list for coding systems. @var{coding-systems} should be
1426a list of coding systems to prefer during execution of @var{body}.
1427@end defmac
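
  As a sketch, the following temporarily prefers two coding systems
(chosen arbitrarily for illustration) and inspects the head of the
priority list while the preference is in effect:

@example
@group
(with-coding-priority '(utf-8 iso-latin-1)
  ;; Inside BODY, the preferred systems lead the priority list.
  (car (coding-system-priority-list)))
@end group
@end example

@noindent
The previous priority order is expected to be restored once
@var{body} finishes.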
1428
1429@node Explicit Encoding
1430@subsection Explicit Encoding and Decoding
1431@cindex encoding in coding systems