Diffstat (limited to 'admin/notes/unicode')
| -rw-r--r-- | admin/notes/unicode | 130 |
1 files changed, 94 insertions, 36 deletions
diff --git a/admin/notes/unicode b/admin/notes/unicode
index bc7279150a9..014bfb9b0d5 100644
--- a/admin/notes/unicode
+++ b/admin/notes/unicode
| @@ -1,6 +1,6 @@ | |||
| 1 | -*-mode: text; coding: utf-8;-*- | 1 | -*-mode: text; coding: utf-8;-*- |
| 2 | 2 | ||
| 3 | Copyright (C) 2002-2017 Free Software Foundation, Inc. | 3 | Copyright (C) 2002-2022 Free Software Foundation, Inc. |
| 4 | See the end of the file for license conditions. | 4 | See the end of the file for license conditions. |
| 5 | 5 | ||
| 6 | Importing a new Unicode Standard version into Emacs | 6 | Importing a new Unicode Standard version into Emacs |
| @@ -11,15 +11,38 @@ Emacs uses the following files from the Unicode Character Database | |||
| 11 | 11 | ||
| 12 | . UnicodeData.txt | 12 | . UnicodeData.txt |
| 13 | . Blocks.txt | 13 | . Blocks.txt |
| 14 | . BidiMirroring.txt | ||
| 15 | . BidiBrackets.txt | 14 | . BidiBrackets.txt |
| 15 | . BidiMirroring.txt | ||
| 16 | . IVD_Sequences.txt | 16 | . IVD_Sequences.txt |
| 17 | . NormalizationTest.txt | 17 | . NormalizationTest.txt |
| 18 | . PropertyValueAliases.txt | ||
| 19 | . ScriptExtensions.txt | ||
| 20 | . Scripts.txt | ||
| 18 | . SpecialCasing.txt | 21 | . SpecialCasing.txt |
| 22 | . confusables.txt | ||
| 23 | . emoji-data.txt | ||
| 24 | . emoji-zwj-sequences.txt | ||
| 25 | . emoji-sequences.txt | ||
| 19 | . BidiCharacterTest.txt | 26 | . BidiCharacterTest.txt |
| 20 | 27 | ||
| 21 | First, the first 7 files need to be copied into admin/unidata/, and | 28 | Emacs also uses the file emoji-test.txt which should be imported from |
| 22 | then Emacs should be rebuilt for them to take effect. Rebuilding | 29 | the Unicode's Public/emoji/ directory, and IdnaMappingTable.txt from |
| 30 | the Public/idna/ directory. | ||
| 31 | |||
| 32 | First, the first 14 files, emoji-test.txt and IdnaMappingTable.txt | ||
| 33 | need to be copied into admin/unidata/, and the file | ||
| 34 | https://www.unicode.org/copyright.html should be copied over | ||
| 35 | copyright.html in admin/unidata (some of them might need trailing | ||
| 36 | whitespace removed before they can be committed to the Emacs | ||
| 37 | repository). | ||
| 38 | |||
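The trailing-whitespace cleanup mentioned in the added text above could be done along these lines. This is a minimal Python sketch, not something shipped in the Emacs tree: the names `strip_trailing_whitespace` and `clean_file` are hypothetical, and it assumes the imported files are UTF-8 text.

```python
import re

# Spaces/tabs at the end of a line; an optional CR is captured so that
# CRLF line endings survive the substitution unchanged.
_TRAILING = re.compile(r"[ \t]+(\r?)$", re.MULTILINE)


def strip_trailing_whitespace(text: str) -> str:
    """Return TEXT with spaces and tabs removed from the end of every line."""
    return _TRAILING.sub(r"\1", text)


def clean_file(path: str) -> bool:
    """Rewrite the file at PATH in place; return True if anything changed.

    Hypothetical helper: assumes the file is UTF-8 encoded.
    """
    # newline="" disables newline translation, so line endings are kept as-is.
    with open(path, encoding="utf-8", newline="") as f:
        original = f.read()
    cleaned = strip_trailing_whitespace(original)
    if cleaned == original:
        return False
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(cleaned)
    return True
```

Running `clean_file` over each file copied into admin/unidata/ before committing would report which files actually needed the cleanup.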
| 39 | Next, review the assignment of default values of the Bidi Class | ||
| 40 | property to blocks in the file extracted/DerivedBidiClass.txt from the | ||
| 41 | UCD (search for "unassigned" in that file). Any changes should be | ||
| 42 | reflected in the unidata-gen.el file, where it sets up the default | ||
| 43 | values around line 210. | ||
| 44 | |||
| 45 | Then Emacs should be rebuilt for them to take effect. Rebuilding | ||
| 23 | Emacs updates several derived files elsewhere in the Emacs source | 46 | Emacs updates several derived files elsewhere in the Emacs source |
| 24 | tree, mainly in lisp/international/. | 47 | tree, mainly in lisp/international/. |
| 25 | 48 | ||
| @@ -28,7 +51,10 @@ files, pay attention to any warning or error messages. In particular, | |||
| 28 | admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines | 51 | admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines |
| 29 | new bidirectional attributes of characters, because unidata-gen.el, | 52 | new bidirectional attributes of characters, because unidata-gen.el, |
| 30 | bidi.c and dispextern.h need to be updated in that case; failure to do | 53 | bidi.c and dispextern.h need to be updated in that case; failure to do |
| 31 | so will cause aborts in redisplay. | 54 | so will cause aborts in redisplay. unidata-gen.el will also complain |
| 55 | if the format of the Unicode Copyright notice in copyright.html | ||
| 56 | changed in significant ways; in that case, update the regular | ||
| 57 | expression in unidata-gen-file used to extract the copyright string. | ||
| 32 | 58 | ||
| 33 | Next, review the changes in UnicodeData.txt vs the previous version | 59 | Next, review the changes in UnicodeData.txt vs the previous version |
| 34 | used by Emacs. Any changes, be it introduction of new scripts or | 60 | used by Emacs. Any changes, be it introduction of new scripts or |
| @@ -40,15 +66,23 @@ and see if any changes in admin/unidata/blocks.awk are required. | |||
| 40 | 66 | ||
| 41 | The setting of char-width-table around line 1200 of characters.el | 67 | The setting of char-width-table around line 1200 of characters.el |
| 42 | should be checked against the latest version of the Unicode file | 68 | should be checked against the latest version of the Unicode file |
| 43 | EastAsianWidth.txt, and any discrepancies fixed. | 69 | EastAsianWidth.txt, and any discrepancies fixed: double-width |
| 70 | characters are those marked with W or F in that file. Zero-width | ||
| 71 | characters are not taken from EastAsianWidth.txt, they are those whose | ||
| 72 | Unicode General Category property is one of Mn, Me, or Cf, and also | ||
| 73 | Hangul jungseong and jongseong characters (a.k.a. "Jamo medial vowels" | ||
| 74 | and "Jamo final consonants"). | ||
| 44 | 75 | ||
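To compare char-width-table against a new EastAsianWidth.txt, it can help to list the W and F ranges mechanically. The following is a minimal Python sketch under stated assumptions: the function name `wide_ranges` is hypothetical, and the input is assumed to use the usual UCD layout of `code-or-range ; value # comment`. It only covers the double-width part; the zero-width characters come from the General Category and the Hangul Jamo ranges, as described above, and are not handled here.

```python
def wide_ranges(lines):
    """Yield (first, last) code point pairs whose East Asian Width is W or F.

    LINES is any iterable of lines in the EastAsianWidth.txt layout
    (assumed: "XXXX;V # comment" or "XXXX..YYYY;V # comment").
    """
    for line in lines:
        # Drop the trailing comment, then skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] not in ("W", "F"):
            continue
        # A single code point is treated as a one-element range.
        first, _, last = fields[0].partition("..")
        yield int(first, 16), int(last or first, 16)
```

The resulting pairs are candidates to check by hand against the ranges in characters.el; the sketch does not read or modify that file.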
| 45 | Any new scripts added by UnicodeData.txt will also need updates to | 76 | Any new scripts added by UnicodeData.txt will also need updates to |
| 46 | script-representative-chars defined in fontset.el, and also the list | 77 | script-representative-chars defined in fontset.el, and also the list |
| 47 | of OTF script tags in otf-script-alist, whose source is on this page: | 78 | of OTF script tags in otf-script-alist, whose source is on this page: |
| 48 | 79 | ||
| 49 | https://www.microsoft.com/typography/otspec/scripttags.htm | 80 | https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags |
| 50 | 81 | ||
| 51 | Other databases in fontset.el might also need to be updated as needed. | 82 | Other databases in fontset.el might also need to be updated as needed. |
| 83 | One notable place to check is the function setup-default-fontset, | ||
| 84 | where new scripts will generally need some addition, most probably to | ||
| 85 | the list of "simple" scripts (search for "Simple"). | ||
| 52 | 86 | ||
| 53 | The function 'ucs-names', defined in lisp/international/mule-cmds.el, | 87 | The function 'ucs-names', defined in lisp/international/mule-cmds.el, |
| 54 | might need to be updated because it knows about used and unused ranges | 88 | might need to be updated because it knows about used and unused ranges |
| @@ -65,7 +99,51 @@ regarding failing lines. | |||
| 65 | 99 | ||
| 66 | The file BidiCharacterTest.txt should be copied to the test suite, and | 100 | The file BidiCharacterTest.txt should be copied to the test suite, and |
| 67 | if its format has changed, the file biditest.el there should be | 101 | if its format has changed, the file biditest.el there should be |
| 68 | modified to follow suit. | 102 | modified to follow suit. If there's trailing whitespace in |
| 103 | BidiCharacterTest.txt, it should be removed before committing the new | ||
| 104 | version. | ||
| 105 | |||
| 106 | src/macuvs.h is a generated file, but if it has changed as a result | ||
| 107 | of the updates, please commit it as well (see | ||
| 108 | admin/unidata/Makefile.in for an explanation). | ||
| 109 | |||
| 110 | Visit "emoji-data.txt" with the rebuilt Emacs, and check that an | ||
| 111 | appropriate font is being used for the emoji (by default Emacs uses | ||
| 112 | "Noto Color Emoji"). Running the following command in that buffer | ||
| 113 | will give you an idea of which codepoints are not supported by | ||
| 114 | whichever font Emacs is using. | ||
| 115 | |||
| 116 | (defun check-emoji-coverage (font-name-regexp) | ||
| 117 |   "Display a buffer containing emoji codepoints for which FONT-NAME is not used. | ||
| 118 | This must be run from a buffer in the format of emoji-data.txt. | ||
| 119 | FONT-NAME-REGEXP is checked using `string-match'." | ||
| 120 |   (interactive "MFont Name: ") | ||
| 121 |   (save-excursion | ||
| 122 |     (goto-char (point-min)) | ||
| 123 |     (let (res char name ifont) | ||
| 124 |       (while (re-search-forward "; Emoji_Presentation [^(]+(\\(.\\)[).]" nil t) | ||
| 125 |         (setq char (aref (match-string 1) 0)) | ||
| 126 |         (setq ifont (car (internal-char-font nil char))) | ||
| 127 |         (when ifont | ||
| 128 |           (setq name (font-xlfd-name ifont))) | ||
| 129 |         (if (or (not ifont) (not (string-match font-name-regexp name))) | ||
| 130 |             (setq res (concat (string char) res)))) | ||
| 131 |       (when res | ||
| 132 |         (with-output-to-temp-buffer "*Check-Emoji-Coverage*" | ||
| 133 |           (princ (format "Font not matching '%s' was used for the following characters:\n%s" | ||
| 134 |                          font-name-regexp (reverse res)))))))) | ||
| 135 | |||
| 136 | Visit "emoji-zwj-sequences.txt" and "emoji-sequences.txt" with the | ||
| 137 | rebuilt Emacs, and check that the sample sequences are composed | ||
| 138 | properly. Also check the Unicode style chart file available at | ||
| 139 | https://unicode.org/emoji/charts/emoji-style.txt for any issues | ||
| 140 | involving VS-15 and VS-16, if so you may need to update the value | ||
| 141 | generated for auto-composition-emoji-eligible-codepoints by | ||
| 142 | admin/unidata/emoji-zwj.awk. Note that your emoji font might not have | ||
| 143 | glyphs for the newest codepoints yet. | ||
| 144 | |||
| 145 | Finally, etc/NEWS should be updated to announce the support for the | ||
| 146 | new Unicode version. | ||
| 69 | 147 | ||
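The Emacs command added above looks only at the first character shown in parentheses on each Emoji_Presentation line. For a fuller list to test fonts against, the ranges in emoji-data.txt can be expanded outside Emacs. This is a rough Python sketch, with the hypothetical function name `emoji_presentation_chars`, assuming the `code-or-range ; property # comment` layout of emoji-data.txt.

```python
def emoji_presentation_chars(lines):
    """Return every character listed with the Emoji_Presentation property.

    LINES is any iterable of lines in the emoji-data.txt layout
    (assumed: "XXXX ; Property # comment" or "XXXX..YYYY ; Property # ...").
    """
    chars = []
    for line in lines:
        # Everything after the first "#" is a comment.
        data = line.split("#", 1)[0]
        if ";" not in data:
            continue
        code_points, prop = (part.strip() for part in data.split(";", 1))
        if prop != "Emoji_Presentation":
            continue
        first, _, last = code_points.partition("..")
        low, high = int(first, 16), int(last or first, 16)
        # Expand the inclusive range into individual characters.
        chars.extend(chr(cp) for cp in range(low, high + 1))
    return chars
```

Joining the result into a string and inserting it into a buffer of the rebuilt Emacs gives a quick visual check of which glyphs the emoji font lacks.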
| 70 | Problems, fixmes and other unicode-related issues | 148 | Problems, fixmes and other unicode-related issues |
| 71 | ------------------------------------------------------------- | 149 | ------------------------------------------------------------- |
| @@ -85,7 +163,7 @@ regard to completeness. | |||
| 85 | code (keymap.c and print.c). | 163 | code (keymap.c and print.c). |
| 86 | 164 | ||
| 87 | * Rationalize character syntax and its relationship to the Unicode | 165 | * Rationalize character syntax and its relationship to the Unicode |
| 88 | database. (Applies mainly to symbol an punctuation syntax.) | 166 | database. (Applies mainly to symbol and punctuation syntax.) |
| 89 | 167 | ||
| 90 | * Fontset handling and customization needs work. We want to relate | 168 | * Fontset handling and customization needs work. We want to relate |
| 91 | fonts to scripts, probably based on the Unicode blocks. The | 169 | fonts to scripts, probably based on the Unicode blocks. The |
| @@ -230,36 +308,15 @@ nontrivial changes to the build process. | |||
| 230 | 308 | ||
| 231 | admin/charsets/mapfiles/cns2ucsdkw.txt | 309 | admin/charsets/mapfiles/cns2ucsdkw.txt |
| 232 | 310 | ||
| 233 | * iso-2022-7bit | 311 | * iso-2022-jp |
| 234 | |||
| 235 | This file switches between CJK charsets, which is not encoded in UTF-8. | ||
| 236 | 312 | ||
| 237 | etc/HELLO | 313 | This contains just one CJK charset, but Emacs currently has no |
| 238 | 314 | easy way to specify set-charset-priority on a per-file basis, so | |
| 239 | Each of these files contains just one CJK charset, but Emacs | 315 | converting this file to UTF-8 might change the file's appearance |
| 240 | currently has no easy way to specify set-charset-priority on a | 316 | when viewed by an Emacs that is operating in some other language |
| 241 | per-file basis, so converting any of these files to UTF-8 might | 317 | environment. |
| 242 | change the file's appearance when viewed by an Emacs that is | ||
| 243 | operating in some other language environment. | ||
| 244 | 318 | ||
| 245 | etc/tutorials/TUTORIAL.ja | 319 | etc/tutorials/TUTORIAL.ja |
| 246 | lisp/international/ja-dic-cnv.el | ||
| 247 | lisp/international/ja-dic-utl.el | ||
| 248 | lisp/international/kinsoku.el | ||
| 249 | lisp/international/kkc.el | ||
| 250 | lisp/international/titdic-cnv.el | ||
| 251 | lisp/language/japan-util.el | ||
| 252 | lisp/language/japanese.el | ||
| 253 | lisp/leim/quail/cyril-jis.el | ||
| 254 | lisp/leim/quail/hanja-jis.el | ||
| 255 | lisp/leim/quail/japanese.el | ||
| 256 | lisp/leim/quail/py-punct.el | ||
| 257 | lisp/leim/quail/pypunct-b5.el | ||
| 258 | |||
| 259 | This file contains just Chinese characters, and has same problem. | ||
| 260 | Also, it contains characters that cannot be encoded in UTF-8. | ||
| 261 | |||
| 262 | lisp/international/titdic-cnv.el | ||
| 263 | 320 | ||
| 264 | * utf-8-emacs | 321 | * utf-8-emacs |
| 265 | 322 | ||
| @@ -272,6 +329,7 @@ nontrivial changes to the build process. | |||
| 272 | lisp/language/tibetan.el | 329 | lisp/language/tibetan.el |
| 273 | lisp/leim/quail/ethiopic.el | 330 | lisp/leim/quail/ethiopic.el |
| 274 | lisp/leim/quail/tibetan.el | 331 | lisp/leim/quail/tibetan.el |
| 332 | lisp/international/titdic-cnv.el | ||
| 275 | 333 | ||
| 276 | * binary files | 334 | * binary files |
| 277 | 335 | ||