Support casing characters which map into multiple code points (bug#24603)

Implement unconditional special casing rules defined in the Unicode standard.  Among other things, they deal with cases where a single code point is replaced by multiple ones because no single-character equivalent exists (e.g. the ‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning into SS).

* admin/unidata/SpecialCasing.txt: New data file pulled from the Unicode standard distribution.
* admin/unidata/README: Mention SpecialCasing.txt.
* admin/unidata/unidata-gen.el (unidata-gen-table-special-casing, unidata-gen-table-special-casing--do-load): New functions generating the ‘special-uppercase’, ‘special-lowercase’ and ‘special-titlecase’ character Unicode properties built from the SpecialCasing.txt Unicode data file.
* src/casefiddle.c (struct casing_str_buf): New structure for representing short strings used to handle one-to-many character mappings.
(case_character_impl): New function which can handle one-to-many character mappings.
(case_character, case_single_character): Wrappers for the above function.  The former may map one character to multiple (or no) code points while the latter does what the former used to do (i.e. handles one-to-one mappings only).
(do_casify_natnum, do_casify_unibyte_string, do_casify_unibyte_region): Use case_single_character.
(do_casify_multibyte_string, do_casify_multibyte_region): Support new features of case_character.
(do_casify_region): Update to reflect do_casify_multibyte_string changes.
(casify_word): Handle the situation where the character length of a word can change, affecting where the end of the word is.
(upcase, capitalize, upcase-initials): Update documentation to mention limitations when working on characters.
* test/src/casefiddle-tests.el (casefiddle-tests-char-properties): Add test cases for the newly introduced character properties.
(casefiddle-tests-casing): Update test cases which are now passing.
* test/lisp/char-fold-tests.el (char-fold--ascii-upcase, char-fold--ascii-downcase): New functions which behave like the old ‘upcase’ and ‘downcase’.
(char-fold--test-match-exactly): Use the new functions.  This is needed because otherwise ﬁ and similar characters are turned into their multi-character representation.
* doc/lispref/strings.texi: Describe the issue with casing characters versus strings.
* doc/lispref/nonascii.texi: Describe the new character properties.
author: Michal Nazarewicz 2016-10-05 00:06:01 +0200
committer: Michal Nazarewicz 2017-04-06 20:54:58 +0200
commit: b3b9b258c4026baa1cad3f2e617f1a637fc8d205 (patch)
tree: 1520ef9f5a3204784c597fcf2bf7a7c7fc1b8d7c /doc
parent: 2c87dabd0460cce83d2345b4ddff159969674fef (diff)
download: emacs-b3b9b258c4026baa1cad3f2e617f1a637fc8d205.tar.gz emacs-b3b9b258c4026baa1cad3f2e617f1a637fc8d205.zip
2 files changed, 50 insertions, 0 deletions
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 05c08c6dbe5..039201feca1 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -619,6 +619,29 @@ Corresponds to the Unicode @code{Simple_Titlecase_Mapping} property.
 character of a word needs to be capitalized.  The value of this
 property is a single character.  For unassigned codepoints, the value
 is @code{nil}, which means the character itself.
+
+@item special-uppercase
+Corresponds to Unicode language- and context-independent special upper-casing
+rules.  The value of this property is a string (which may be empty).  For
+example mapping for @code{U+00DF} (@sc{latin small letter sharp s}) is
+@code{"SS"}.  For characters with no special mapping, the value is @code{nil}
+which means @code{uppercase} property needs to be consulted instead.
+
+@item special-lowercase
+Corresponds to Unicode language- and context-independent special lower-casing
+rules.  The value of this property is a string (which may be empty).  For
+example mapping for @code{U+0130} (@sc{latin capital letter i with dot above})
+the value is @code{"i\u0307"} (i.e. 2-character string consisting of @sc{latin
+small letter i} followed by @sc{combining dot above}).  For characters with no
+special mapping, the value is @code{nil} which means @code{lowercase} property
+needs to be consulted instead.
+
+@item special-titlecase
+Corresponds to Unicode unconditional special title-casing rules.  The value of
+this property is a string (which may be empty).  For example mapping for
+@code{U+FB01} (@sc{latin small ligature fi}) the value is @code{"Fi"}.  For
+characters with no special mapping, the value is @code{nil} which means
+@code{titlecase} property needs to be consulted instead.
 @end table
 
 @defun get-char-code-property char propname
diff --git a/doc/lispref/strings.texi b/doc/lispref/strings.texi
index ae2b31c5418..1d766869b1f 100644
--- a/doc/lispref/strings.texi
+++ b/doc/lispref/strings.texi
@@ -1177,6 +1177,33 @@ When the argument to @code{upcase-initials} is a character,
 @end example
 @end defun
 
+  Note that case conversion is not a one-to-one mapping of codepoints
+and length of the result may differ from length of the argument.
+Furthermore, because passing a character forces return type to be
+a character, functions are unable to perform proper substitution and
+result may differ compared to treating a one-character string.  For
+example:
+
+@example
+@group
+(upcase "ﬁ")  ; note: single character, ligature "fi"
+     @result{} "FI"
+@end group
+@group
+(upcase ?ﬁ)
+     @result{} 64257  ; i.e. ?ﬁ
+@end group
+@end example
+
+  To avoid this, a character must first be converted into a string,
+using @code{string} function, before being passed to one of the casing
+functions.  Of course, no assumptions on the length of the result may
+be made.
+
+  Mapping for such special cases are taken from
+@code{special-uppercase}, @code{special-lowercase} and
+@code{special-titlecase} @xref{Character Properties}.
+
  @xref{Text Comparison}, for functions that compare strings; some of
 them ignore case differences, or can optionally ignore case differences.
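
The lookup order described in the commit message and in the new property documentation (consult the special mapping first, fall back to the one-to-one mapping when there is none) can be sketched roughly in C as below. This is an illustration only and not the code in src/casefiddle.c: the names `cased_buf`, `special_upper`, `simple_upcase`, `full_upcase` and `upcase_codepoints` are hypothetical, the table holds just the two mappings mentioned in this commit, the simple mapping covers ASCII only, and the three-code-point buffer size is an assumption about the longest unconditional mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch; all names here are made up for illustration. */

enum { MAX_CASED_LEN = 3 };  /* assumed upper bound on a mapping's length */

/* Short fixed-size buffer holding the result of casing one character. */
struct cased_buf {
  uint32_t cp[MAX_CASED_LEN];
  size_t len;
};

struct special_entry {
  uint32_t from;
  struct cased_buf to;
};

/* Tiny stand-in for a special upper-casing table. */
static const struct special_entry special_upper[] = {
  { 0x00DF, { { 'S', 'S' }, 2 } },  /* LATIN SMALL LETTER SHARP S -> "SS" */
  { 0xFB01, { { 'F', 'I' }, 2 } },  /* LATIN SMALL LIGATURE FI -> "FI" */
};

/* One-to-one mapping; ASCII only in this sketch. */
uint32_t simple_upcase(uint32_t c)
{
  return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* One-to-many mapping: special table first, simple mapping otherwise. */
struct cased_buf full_upcase(uint32_t c)
{
  for (size_t i = 0; i < sizeof special_upper / sizeof special_upper[0]; i++)
    if (special_upper[i].from == c)
      return special_upper[i].to;
  struct cased_buf one = { { simple_upcase(c) }, 1 };
  return one;
}

/* Upcase N code points from IN into OUT (capacity OUT_CAP).
   Returns the number of code points written, or (size_t)-1 if OUT
   is too small.  The result may be longer than the input.  */
size_t upcase_codepoints(const uint32_t *in, size_t n,
                         uint32_t *out, size_t out_cap)
{
  size_t w = 0;
  for (size_t i = 0; i < n; i++) {
    struct cased_buf b = full_upcase(in[i]);
    if (b.len > out_cap - w)
      return (size_t)-1;
    for (size_t j = 0; j < b.len; j++)
      out[w++] = b.cp[j];
  }
  return w;
}
```

In this sketch `simple_upcase` must return exactly one code point, so it leaves U+FB01 unchanged, which mirrors why `(upcase ?ﬁ)` returns the ligature itself; `upcase_codepoints` is free to grow the output, which mirrors `(upcase "ﬁ")` returning a two-character string.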