author Michal Nazarewicz 2016-10-05 00:06:01 +0200
committer Michal Nazarewicz 2017-04-06 20:54:58 +0200
commit b3b9b258c4026baa1cad3f2e617f1a637fc8d205
tree 1520ef9f5a3204784c597fcf2bf7a7c7fc1b8d7c
parent 2c87dabd0460cce83d2345b4ddff159969674fef
Support casing characters which map into multiple code points (bug#24603)
Implement unconditional special casing rules defined in the Unicode
standard.

Among other things, they deal with cases when a single code point is
replaced by multiple ones because a single character does not exist
(e.g. the ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used
(e.g. ß turning into SS).

* admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
standard distribution.
* admin/unidata/README: Mention SpecialCasing.txt.
* admin/unidata/unidata-gen.el (unidata-gen-table-special-casing,
unidata-gen-table-special-casing--do-load): New functions generating
‘special-uppercase’, ‘special-lowercase’ and ‘special-titlecase’
character Unicode properties built from the SpecialCasing.txt Unicode
data file.
* src/casefiddle.c (struct casing_str_buf): New structure for
representing short strings used to handle one-to-many character
mappings.
(case_character_impl): New function which can handle one-to-many
character mappings.
(case_character, case_single_character): Wrappers for the above
function. The former may map one character to multiple (or no) code
points while the latter does what the former used to do (i.e. handles
one-to-one mappings only).
(do_casify_natnum, do_casify_unibyte_string,
do_casify_unibyte_region): Use case_single_character.
(do_casify_multibyte_string, do_casify_multibyte_region): Support new
features of case_character.
(do_casify_region): Updated to reflect do_casify_multibyte_string
changes.
(casify_word): Handle the situation where the character length of
a word changes, which affects where the end of the word is.
(upcase, capitalize, upcase-initials): Update documentation to mention
limitations when working on characters.
* test/src/casefiddle-tests.el (casefiddle-tests-char-properties): Add
test cases for the newly introduced character properties.
(casefiddle-tests-casing): Update test cases which are now passing.
* test/lisp/char-fold-tests.el (char-fold--ascii-upcase,
char-fold--ascii-downcase): New functions which behave like old
‘upcase’ and ‘downcase’.
(char-fold--test-match-exactly): Use the new functions. This is needed
because otherwise ﬁ and similar characters are turned into their
multi-character representation.
* doc/lispref/strings.texi: Describe issue with casing characters
versus strings.
* doc/lispref/nonascii.texi: Describe the new character properties.
Diffstat (limited to 'doc')
-rw-r--r-- doc/lispref/nonascii.texi | 23
-rw-r--r-- doc/lispref/strings.texi | 27
2 files changed, 50 insertions, 0 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 05c08c6dbe5..039201feca1 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -619,6 +619,29 @@ Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
 character of a word needs to be capitalized. The value of this
 property is a single character. For unassigned codepoints, the value
 is @code{nil}, which means the character itself.
+
+@item special-uppercase
+Corresponds to Unicode language- and context-independent special upper-casing
+rules. The value of this property is a string (which may be empty). For
+example, the mapping for @code{U+00DF} (@sc{latin small letter sharp s}) is
+@code{"SS"}. For characters with no special mapping, the value is @code{nil},
+which means the @code{uppercase} property needs to be consulted instead.
+
+@item special-lowercase
+Corresponds to Unicode language- and context-independent special lower-casing
+rules. The value of this property is a string (which may be empty). For
+example, for @code{U+0130} (@sc{latin capital letter i with dot above})
+the value is @code{"i\u0307"} (i.e. a 2-character string consisting of @sc{latin
+small letter i} followed by @sc{combining dot above}). For characters with no
+special mapping, the value is @code{nil}, which means the @code{lowercase}
+property needs to be consulted instead.
+
+@item special-titlecase
+Corresponds to Unicode unconditional special title-casing rules. The value of
+this property is a string (which may be empty). For example, for
+@code{U+FB01} (@sc{latin small ligature fi}) the value is @code{"Fi"}. For
+characters with no special mapping, the value is @code{nil}, which means
+the @code{titlecase} property needs to be consulted instead.
 @end table
 
 @defun get-char-code-property char propname
diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
index ae2b31c5418..1d766869b1f 100644
--- a/doc/lispref/strings.texi
+++ b/doc/lispref/strings.texi
@@ -1177,6 +1177,33 @@ When the argument to @code{upcase-initials} is a character,
 @end example
 @end defun
 
+  Note that case conversion is not a one-to-one mapping of codepoints
+and the length of the result may differ from the length of the argument.
+Furthermore, because passing a character forces the return type to be
+a character, the functions are unable to perform proper substitution and
+the result may differ from that of converting a one-character string. For
+example:
+
+@example
+@group
+(upcase "ﬁ")  ; note: single character, ligature "fi"
+     @result{} "FI"
+@end group
+@group
+(upcase ?ﬁ)
+     @result{} 64257  ; i.e. ?ﬁ
+@end group
+@end example
+
+  To avoid this, a character must first be converted into a string,
+using the @code{string} function, before being passed to one of the casing
+functions. Of course, no assumptions on the length of the result may
+be made.
+
+  Mappings for such special cases are taken from the
+@code{special-uppercase}, @code{special-lowercase} and
+@code{special-titlecase} properties. @xref{Character Properties}.
+
   @xref{Text Comparison}, for functions that compare strings; some of
 them ignore case differences, or can optionally ignore case differences.
 