Diffstat (limited to 'admin/notes/unicode')
-rw-r--r-- admin/notes/unicode 130
1 files changed, 94 insertions, 36 deletions
diff --git a/admin/notes/unicode b/admin/notes/unicode
index bc7279150a9..014bfb9b0d5 100644
--- a/admin/notes/unicode
+++ b/admin/notes/unicode
@@ -1,6 +1,6 @@
1 -*-mode: text; coding: utf-8;-*- 1 -*-mode: text; coding: utf-8;-*-
2 2
3Copyright (C) 2002-2017 Free Software Foundation, Inc. 3Copyright (C) 2002-2022 Free Software Foundation, Inc.
4See the end of the file for license conditions. 4See the end of the file for license conditions.
5 5
6Importing a new Unicode Standard version into Emacs 6Importing a new Unicode Standard version into Emacs
@@ -11,15 +11,38 @@ Emacs uses the following files from the Unicode Character Database
11 11
12 . UnicodeData.txt 12 . UnicodeData.txt
13 . Blocks.txt 13 . Blocks.txt
14 . BidiMirroring.txt
15 . BidiBrackets.txt 14 . BidiBrackets.txt
15 . BidiMirroring.txt
16 . IVD_Sequences.txt 16 . IVD_Sequences.txt
17 . NormalizationTest.txt 17 . NormalizationTest.txt
18 . PropertyValueAliases.txt
19 . ScriptExtensions.txt
20 . Scripts.txt
18 . SpecialCasing.txt 21 . SpecialCasing.txt
22 . confusables.txt
23 . emoji-data.txt
24 . emoji-zwj-sequences.txt
25 . emoji-sequences.txt
19 . BidiCharacterTest.txt 26 . BidiCharacterTest.txt
20 27
21First, the first 7 files need to be copied into admin/unidata/, and 28Emacs also uses the file emoji-test.txt which should be imported from
22then Emacs should be rebuilt for them to take effect. Rebuilding 29the Unicode's Public/emoji/ directory, and IdnaMappingTable.txt from
30the Public/idna/ directory.
31
32First, the first 14 files, emoji-test.txt and IdnaMappingTable.txt
33need to be copied into admin/unidata/, and the file
34https://www.unicode.org/copyright.html should be copied over
35copyright.html in admin/unidata (some of them might need trailing
36whitespace removed before they can be committed to the Emacs
37repository).
38
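The trailing-whitespace cleanup mentioned above could be sketched as
follows.  This is only an illustration: the function name
my-ucd-strip-trailing-whitespace is hypothetical and not part of
Emacs, and the sketch assumes the copied files are UTF-8 with Unix
line endings.

```elisp
;; Hypothetical helper, not part of the Emacs sources.
(defun my-ucd-strip-trailing-whitespace (file)
  "Remove trailing whitespace from FILE, rewriting it only if it changed.
Return non-nil if FILE was rewritten."
  ;; Assumption: the UCD files are UTF-8 with Unix line endings.
  (let ((coding-system-for-read 'utf-8-unix)
        (coding-system-for-write 'utf-8-unix))
    (with-temp-buffer
      (insert-file-contents file)
      (let ((before (buffer-string)))
        ;; Explicit bounds mean only whitespace at the ends of lines
        ;; is removed; blank lines at the end of the file are kept.
        (delete-trailing-whitespace (point-min) (point-max))
        (unless (string= before (buffer-string))
          (write-region (point-min) (point-max) file nil 'silent)
          t)))))
```

For example, evaluating
(my-ucd-strip-trailing-whitespace "admin/unidata/emoji-test.txt")
from the top of the source tree would clean that one file; reviewing
the result with "git diff" before committing is still advisable.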
39Next, review the assignment of default values of the Bidi Class
40property to blocks in the file extracted/DerivedBidiClass.txt from the
41UCD (search for "unassigned" in that file). Any changes should be
42reflected in the unidata-gen.el file, where it sets up the default
43values around line 210.
44
45Then Emacs should be rebuilt for them to take effect. Rebuilding
23Emacs updates several derived files elsewhere in the Emacs source 46Emacs updates several derived files elsewhere in the Emacs source
24tree, mainly in lisp/international/. 47tree, mainly in lisp/international/.
25 48
@@ -28,7 +51,10 @@ files, pay attention to any warning or error messages. In particular,
28admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines 51admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines
29new bidirectional attributes of characters, because unidata-gen.el, 52new bidirectional attributes of characters, because unidata-gen.el,
30bidi.c and dispextern.h need to be updated in that case; failure to do 53bidi.c and dispextern.h need to be updated in that case; failure to do
31so will cause aborts in redisplay. 54so will cause aborts in redisplay. unidata-gen.el will also complain
55if the format of the Unicode Copyright notice in copyright.html
56changed in significant ways; in that case, update the regular
57expression in unidata-gen-file used to extract the copyright string.
32 58
33Next, review the changes in UnicodeData.txt vs the previous version 59Next, review the changes in UnicodeData.txt vs the previous version
34used by Emacs. Any changes, be it introduction of new scripts or 60used by Emacs. Any changes, be it introduction of new scripts or
@@ -40,15 +66,23 @@ and see if any changes in admin/unidata/blocks.awk are required.
40 66
41The setting of char-width-table around line 1200 of characters.el 67The setting of char-width-table around line 1200 of characters.el
42should be checked against the latest version of the Unicode file 68should be checked against the latest version of the Unicode file
43EastAsianWidth.txt, and any discrepancies fixed. 69EastAsianWidth.txt, and any discrepancies fixed: double-width
70characters are those marked with W or F in that file. Zero-width
71characters are not taken from EastAsianWidth.txt, they are those whose
72Unicode General Category property is one of Mn, Me, or Cf, and also
73Hangul jungseong and jongseong characters (a.k.a. "Jamo medial vowels"
74and "Jamo final consonants").
44 75
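The zero-width rule described above can be expressed as a small
predicate, which may help when spot-checking char-width-table after an
update.  This is a sketch only: the function name
my-zero-width-candidate-p is hypothetical, and the Hangul range
U+1160..U+11FF (jungseong and jongseong in the Hangul Jamo block) is an
assumption that ignores the Hangul Jamo Extended-B block.

```elisp
;; Hypothetical helper, not part of the Emacs sources.
(defun my-zero-width-candidate-p (char)
  "Return non-nil if CHAR meets the zero-width criteria described above."
  (or
   ;; General Category is one of Mn, Me, or Cf.
   (memq (get-char-code-property char 'general-category) '(Mn Me Cf))
   ;; Assumption: Hangul jungseong and jongseong in U+1160..U+11FF only.
   (<= #x1160 char #x11FF)))
```

For instance, (my-zero-width-candidate-p #x0301) returns non-nil,
because U+0301 COMBINING ACUTE ACCENT has General Category Mn.  The
result can then be compared with (char-width #x0301) in the rebuilt
Emacs.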
45Any new scripts added by UnicodeData.txt will also need updates to 76Any new scripts added by UnicodeData.txt will also need updates to
46script-representative-chars defined in fontset.el, and also the list 77script-representative-chars defined in fontset.el, and also the list
47of OTF script tags in otf-script-alist, whose source is on this page: 78of OTF script tags in otf-script-alist, whose source is on this page:
48 79
49 https://www.microsoft.com/typography/otspec/scripttags.htm 80 https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags
50 81
51Other databases in fontset.el might also need to be updated as needed. 82Other databases in fontset.el might also need to be updated as needed.
83One notable place to check is the function setup-default-fontset,
84where new scripts will generally need some addition, most probably to
85the list of "simple" scripts (search for "Simple").
52 86
53The function 'ucs-names', defined in lisp/international/mule-cmds.el, 87The function 'ucs-names', defined in lisp/international/mule-cmds.el,
54might need to be updated because it knows about used and unused ranges 88might need to be updated because it knows about used and unused ranges
@@ -65,7 +99,51 @@ regarding failing lines.
65 99
66The file BidiCharacterTest.txt should be copied to the test suite, and 100The file BidiCharacterTest.txt should be copied to the test suite, and
67if its format has changed, the file biditest.el there should be 101if its format has changed, the file biditest.el there should be
68modified to follow suit. 102modified to follow suit. If there's trailing whitespace in
103BidiCharacterTest.txt, it should be removed before committing the new
104version.
105
106src/macuvs.h is a generated file, but if it has changed as a result
107of the updates, please commit it as well (see
108admin/unidata/Makefile.in for an explanation).
109
110Visit "emoji-data.txt" with the rebuilt Emacs, and check that an
111appropriate font is being used for the emoji (by default Emacs uses
112"Noto Color Emoji"). Running the following command in that buffer
113will give you an idea of which codepoints are not supported by
114whichever font Emacs is using.
115
116(defun check-emoji-coverage (font-name-regexp)
117  "Display a buffer containing emoji codepoints for which FONT-NAME is not used.
118This must be run from a buffer in the format of emoji-data.txt.
119FONT-NAME-REGEXP is checked using `string-match'."
120  (interactive "MFont Name: ")
121  (save-excursion
122    (goto-char (point-min))
123    (let (res char name ifont)
124      (while (re-search-forward "; Emoji_Presentation [^(]+(\\(.\\)[).]" nil t)
125        (setq char (aref (match-string 1) 0))
126        (setq ifont (car (internal-char-font nil char)))
127        (when ifont
128          (setq name (font-xlfd-name ifont)))
129        (if (or (not ifont) (not (string-match font-name-regexp name)))
130            (setq res (concat (string char) res))))
131      (when res
132        (with-output-to-temp-buffer "*Check-Emoji-Coverage*"
133          (princ (format "Font not matching '%s' was used for the following characters:\n%s"
134                         font-name-regexp (reverse res))))))))
135
136Visit "emoji-zwj-sequences.txt" and "emoji-sequences.txt" with the
137rebuilt Emacs, and check that the sample sequences are composed
138properly. Also check the Unicode style chart file available at
139https://unicode.org/emoji/charts/emoji-style.txt for any issues
140involving VS-15 and VS-16, if so you may need to update the value
141generated for auto-composition-emoji-eligible-codepoints by
142admin/unidata/emoji-zwj.awk. Note that your emoji font might not have
143glyphs for the newest codepoints yet.
144
145Finally, etc/NEWS should be updated to announce the support for the
146new Unicode version.
69 147
70Problems, fixmes and other unicode-related issues 148Problems, fixmes and other unicode-related issues
71------------------------------------------------------------- 149-------------------------------------------------------------
@@ -85,7 +163,7 @@ regard to completeness.
85 code (keymap.c and print.c). 163 code (keymap.c and print.c).
86 164
87 * Rationalize character syntax and its relationship to the Unicode 165 * Rationalize character syntax and its relationship to the Unicode
88 database. (Applies mainly to symbol an punctuation syntax.) 166 database. (Applies mainly to symbol and punctuation syntax.)
89 167
90 * Fontset handling and customization needs work. We want to relate 168 * Fontset handling and customization needs work. We want to relate
91 fonts to scripts, probably based on the Unicode blocks. The 169 fonts to scripts, probably based on the Unicode blocks. The
@@ -230,36 +308,15 @@ nontrivial changes to the build process.
230 308
231 admin/charsets/mapfiles/cns2ucsdkw.txt 309 admin/charsets/mapfiles/cns2ucsdkw.txt
232 310
233 * iso-2022-7bit 311 * iso-2022-jp
234
235 This file switches between CJK charsets, which is not encoded in UTF-8.
236 312
237 etc/HELLO 313 This contains just one CJK charset, but Emacs currently has no
238 314 easy way to specify set-charset-priority on a per-file basis, so
239 Each of these files contains just one CJK charset, but Emacs 315 converting this file to UTF-8 might change the file's appearance
240 currently has no easy way to specify set-charset-priority on a 316 when viewed by an Emacs that is operating in some other language
241 per-file basis, so converting any of these files to UTF-8 might 317 environment.
242 change the file's appearance when viewed by an Emacs that is
243 operating in some other language environment.
244 318
245 etc/tutorials/TUTORIAL.ja 319 etc/tutorials/TUTORIAL.ja
246 lisp/international/ja-dic-cnv.el
247 lisp/international/ja-dic-utl.el
248 lisp/international/kinsoku.el
249 lisp/international/kkc.el
250 lisp/international/titdic-cnv.el
251 lisp/language/japan-util.el
252 lisp/language/japanese.el
253 lisp/leim/quail/cyril-jis.el
254 lisp/leim/quail/hanja-jis.el
255 lisp/leim/quail/japanese.el
256 lisp/leim/quail/py-punct.el
257 lisp/leim/quail/pypunct-b5.el
258
259 This file contains just Chinese characters, and has same problem.
260 Also, it contains characters that cannot be encoded in UTF-8.
261
262 lisp/international/titdic-cnv.el
263 320
264 * utf-8-emacs 321 * utf-8-emacs
265 322
@@ -272,6 +329,7 @@ nontrivial changes to the build process.
272 lisp/language/tibetan.el 329 lisp/language/tibetan.el
273 lisp/leim/quail/ethiopic.el 330 lisp/leim/quail/ethiopic.el
274 lisp/leim/quail/tibetan.el 331 lisp/leim/quail/tibetan.el
332 lisp/international/titdic-cnv.el
275 333
276 * binary files 334 * binary files
277 335