1 files changed, 94 insertions, 36 deletions
diff --git a/admin/notes/unicode b/admin/notes/unicode
index bc7279150a9..014bfb9b0d5 100644
--- a/admin/notes/unicode
+++ b/admin/notes/unicode
@@ -1,6 +1,6 @@
                                            -*-mode: text; coding: utf-8;-*-
-Copyright (C) 2002-2017 Free Software Foundation, Inc.
+Copyright (C) 2002-2022 Free Software Foundation, Inc.
 See the end of the file for license conditions.
 Importing a new Unicode Standard version into Emacs
@@ -11,15 +11,38 @@ Emacs uses the following files from the Unicode Character Database
  . UnicodeData.txt
  . Blocks.txt
-  . BidiMirroring.txt
  . BidiBrackets.txt
+  . BidiMirroring.txt
  . IVD_Sequences.txt
  . NormalizationTest.txt
+  . PropertyValueAliases.txt
+  . ScriptExtensions.txt
+  . Scripts.txt
  . SpecialCasing.txt
+  . confusables.txt
+  . emoji-data.txt
+  . emoji-zwj-sequences.txt
+  . emoji-sequences.txt
  . BidiCharacterTest.txt
-First, the first 7 files need to be copied into admin/unidata/, and
-then Emacs should be rebuilt for them to take effect.  Rebuilding
+Emacs also uses the file emoji-test.txt, which should be imported from
+the Unicode Public/emoji/ directory, and IdnaMappingTable.txt from
+the Public/idna/ directory.
+First, the first 14 files, emoji-test.txt and IdnaMappingTable.txt
+need to be copied into admin/unidata/, and the file
+https://www.unicode.org/copyright.html should be copied over
+copyright.html in admin/unidata (some of them might need trailing
+whitespace removed before they can be committed to the Emacs
+repository).
+Next, review the assignment of default values of the Bidi Class
+property to blocks in the file extracted/DerivedBidiClass.txt from the
+UCD (search for "unassigned" in that file).  Any changes should be
+reflected in the unidata-gen.el file, where it sets up the default
+values around line 210.
+Then Emacs should be rebuilt for them to take effect.  Rebuilding
 Emacs updates several derived files elsewhere in the Emacs source
 tree, mainly in lisp/international/.
@@ -28,7 +51,10 @@ files, pay attention to any warning or error messages.  In particular,
 admin/unidata/unidata-gen.el will complain if UnicodeData.txt defines
 new bidirectional attributes of characters, because unidata-gen.el,
 bidi.c and dispextern.h need to be updated in that case; failure to do
-so will cause aborts in redisplay.
+so will cause aborts in redisplay.  unidata-gen.el will also complain
+if the format of the Unicode Copyright notice in copyright.html
+changed in significant ways; in that case, update the regular
+expression in unidata-gen-file used to extract the copyright string.
 Next, review the changes in UnicodeData.txt vs the previous version
 used by Emacs.  Any changes, be it introduction of new scripts or
@@ -40,15 +66,23 @@ and see if any changes in admin/unidata/blocks.awk are required.
 The setting of char-width-table around line 1200 of characters.el
 should be checked against the latest version of the Unicode file
-EastAsianWidth.txt, and any discrepancies fixed.
+EastAsianWidth.txt, and any discrepancies fixed: double-width
+characters are those marked with W or F in that file.  Zero-width
+characters are not taken from EastAsianWidth.txt; they are those whose
+Unicode General Category property is one of Mn, Me, or Cf, and also
+Hangul jungseong and jongseong characters (a.k.a. "Jamo medial vowels"
+and "Jamo final consonants").
 Any new scripts added by UnicodeData.txt will also need updates to
 script-representative-chars defined in fontset.el, and also the list
 of OTF script tags in otf-script-alist, whose source is on this page:
-  https://www.microsoft.com/typography/otspec/scripttags.htm
+  https://docs.microsoft.com/en-us/typography/opentype/spec/scripttags
 Other databases in fontset.el might also need to be updated as needed.
+One notable place to check is the function setup-default-fontset,
+where new scripts will generally need some addition, most probably to
+the list of "simple" scripts (search for "Simple").
 The function 'ucs-names', defined in lisp/international/mule-cmds.el,
 might need to be updated because it knows about used and unused ranges
@@ -65,7 +99,51 @@ regarding failing lines.
 The file BidiCharacterTest.txt should be copied to the test suite, and
 if its format has changed, the file biditest.el there should be
-modified to follow suit.
+modified to follow suit.  If there's trailing whitespace in
+BidiCharacterTest.txt, it should be removed before committing the new
+version.
+src/macuvs.h is a generated file, but if it has changed as a result
+of the updates, please commit it as well (see
+admin/unidata/Makefile.in for an explanation).
+Visit "emoji-data.txt" with the rebuilt Emacs, and check that an
+appropriate font is being used for the emoji (by default Emacs uses
+"Noto Color Emoji").  Running the following command in that buffer
+will give you an idea of which codepoints are not supported by
+whichever font Emacs is using.
+(defun check-emoji-coverage (font-name-regexp)
+  "Display emoji codepoints not rendered with a font matching FONT-NAME-REGEXP.
+This must be run from a buffer in the format of emoji-data.txt.
+FONT-NAME-REGEXP is checked using `string-match'."
+  (interactive "MFont Name: ")
+  (save-excursion
+    (goto-char (point-min))
+    (let (res char name ifont)
+      (while (re-search-forward "; Emoji_Presentation [^(]+(\\(.\\)[).]" nil t)
+        (setq char (aref (match-string 1) 0))
+        (setq ifont (car (internal-char-font nil char)))
+        (when ifont
+          (setq name (font-xlfd-name ifont)))
+        (if (or (not ifont) (not (string-match font-name-regexp name)))
+            (setq res (concat (string char) res))))
+      (when res
+        (with-output-to-temp-buffer "*Check-Emoji-Coverage*"
+          (princ (format "Font not matching '%s' was used for the following characters:\n%s"
+                         font-name-regexp (reverse res))))))))
+Visit "emoji-zwj-sequences.txt" and "emoji-sequences.txt" with the
+rebuilt Emacs, and check that the sample sequences are composed
+properly.  Also check the Unicode style chart file available at
+https://unicode.org/emoji/charts/emoji-style.txt for any issues
+involving VS-15 and VS-16; if you find any, you may need to update the
+value generated for auto-composition-emoji-eligible-codepoints by
+admin/unidata/emoji-zwj.awk.  Note that your emoji font might not have
+glyphs for the newest codepoints yet.
+Finally, etc/NEWS should be updated to announce the support for the
+new Unicode version.
 Problems, fixmes and other unicode-related issues
 -------------------------------------------------------------
@@ -85,7 +163,7 @@ regard to completeness.
        code (keymap.c and print.c).
 * Rationalize character syntax and its relationship to the Unicode
-   database.  (Applies mainly to symbol an punctuation syntax.)
+   database.  (Applies mainly to symbol and punctuation syntax.)
 * Fontset handling and customization needs work.  We want to relate
   fonts to scripts, probably based on the Unicode blocks.  The
@@ -230,36 +308,15 @@ nontrivial changes to the build process.
        admin/charsets/mapfiles/cns2ucsdkw.txt
- * iso-2022-7bit
-     This file switches between CJK charsets, which is not encoded in UTF-8.
-        etc/HELLO
-     Each of these files contains just one CJK charset, but Emacs
-     currently has no easy way to specify set-charset-priority on a
-     per-file basis, so converting any of these files to UTF-8 might
-     change the file's appearance when viewed by an Emacs that is
-     operating in some other language environment.
+ * iso-2022-jp
+     This contains just one CJK charset, but Emacs currently has no
+     easy way to specify set-charset-priority on a per-file basis, so
+     converting this file to UTF-8 might change the file's appearance
+     when viewed by an Emacs that is operating in some other language
+     environment.
        etc/tutorials/TUTORIAL.ja
-        lisp/international/ja-dic-cnv.el
-        lisp/international/ja-dic-utl.el
-        lisp/international/kinsoku.el
-        lisp/international/kkc.el
-        lisp/international/titdic-cnv.el
-        lisp/language/japan-util.el
-        lisp/language/japanese.el
-        lisp/leim/quail/cyril-jis.el
-        lisp/leim/quail/hanja-jis.el
-        lisp/leim/quail/japanese.el
-        lisp/leim/quail/py-punct.el
-        lisp/leim/quail/pypunct-b5.el
-     This file contains just Chinese characters, and has same problem.
-     Also, it contains characters that cannot be encoded in UTF-8.
-        lisp/international/titdic-cnv.el
 * utf-8-emacs
@@ -272,6 +329,7 @@ nontrivial changes to the build process.
        lisp/language/tibetan.el
        lisp/leim/quail/ethiopic.el
        lisp/leim/quail/tibetan.el
+        lisp/international/titdic-cnv.el
 * binary files
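
The updated notes say that double-width characters are those marked W or
F in EastAsianWidth.txt.  As a rough aid for comparing that file against
char-width-table, the following is a minimal sketch (not part of Emacs)
that collects the W/F ranges from lines in the UCD "code point or range ;
value # comment" layout.  The names wide_ranges and sample are
hypothetical, the sample lines are an illustrative excerpt rather than
real file contents, and the merging step assumes the input is sorted by
code point.

```python
import re

# One data line: "XXXX[..YYYY] ; VALUE  # comment".
# Whitespace around ';' is treated as optional.
_LINE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")


def wide_ranges(lines):
    """Return merged (start, end) code point ranges whose value is W or F.

    Assumes LINES are sorted by code point, as in the UCD data files.
    """
    ranges = []
    for line in lines:
        m = _LINE.match(line)
        if not m or m.group(3) not in ("W", "F"):
            continue  # comment, blank line, or a non-wide width class
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)
        if ranges and start <= ranges[-1][1] + 1:
            # Adjacent to or overlapping the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


# Hypothetical excerpt in the EastAsianWidth.txt layout.
sample = [
    "# EastAsianWidth sample",
    "0020;Na          # Zs         SPACE",
    "1100..115F;W     # Lo    [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER",
    "3000;F           # Zs         IDEOGRAPHIC SPACE",
    "3001..3003;W     # Po     [3] IDEOGRAPHIC COMMA..DITTO MARK",
]

for start, end in wide_ranges(sample):
    print(f"{start:04X}..{end:04X}")
```

The printed ranges can then be compared by eye with the double-width
entries in characters.el; zero-width characters are not covered, since
the notes derive those from the General Category instead.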
