Merge from trunk.

author: Stephen Berman 2013-06-14 22:07:55 +0200
committer: Stephen Berman 2013-06-14 22:07:55 +0200
commit: bd358779861f265a7acff31ead40172735af693e
tree: 345217a9889dbd29b09bdc80a94265c17719d41f /admin/notes/unicode
parent: 2a97b47f0878cbda86cb6ba0e7e744924810b70e
parent: f7394b12358ae453a0c8b85fc307afc1b740010d
1 files changed, 134 insertions, 6 deletions
diff --git a/admin/notes/unicode b/admin/notes/unicode
index dda6ec4cc93..6db5bb7d05c 100644
--- a/admin/notes/unicode
+++ b/admin/notes/unicode
@@ -1,6 +1,6 @@
-                                            -*-mode: text; coding: latin-1;-*-
+                                            -*-mode: text; coding: utf-8;-*-
-Copyright (C) 2002-2012 Free Software Foundation, Inc.
+Copyright (C) 2002-2013 Free Software Foundation, Inc.
 See the end of the file for license conditions.
 Problems, fixmes and other unicode-related issues
@@ -12,9 +12,9 @@ regard to completeness.
 * SINGLE_BYTE_CHAR_P returns true for Latin-1 characters, which has
   undesirable effects.  E.g.:
-   (multibyte-string-p (let ((s "x")) (aset s 0 ?�) s)) => nil
+   (multibyte-string-p (let ((s "x")) (aset s 0 ?£) s)) => nil
-   (multibyte-string-p (concat [?�])) => nil
+   (multibyte-string-p (concat [?£])) => nil
-   (text-char-description ?�) => "M-#"
+   (text-char-description ?£) => "M-#"
        These examples are all fixed by the change of 2002-10-14, but
        there still exist questionable SINGLE_BYTE_CHAR_P in the
@@ -77,7 +77,7 @@ regard to completeness.
   spelling and calendar, but that's not a Unicode issue.)
 * Handle Unicode combining characters usefully, e.g. diacritics, and
-   handle more scripts specifically (� la Devanagari).  There are
+   handle more scripts specifically (à la Devanagari).  There are
   issues with canonicalization.
 * We need tabular input methods, e.g. for maths symbols.  (Not
@@ -98,6 +98,134 @@ regard to completeness.
 * Old auto-save files, and similar files, such as Gnus drafts,
   containing non-ASCII characters probably won't be re-read correctly.
+Source file encoding
+--------------------
+Most Emacs source files are encoded in UTF-8 (or in ASCII, which is a
+subset), but there are a few exceptions, listed below.  Perhaps
+someday many of these files will be converted to UTF-8, for
+convenience when using tools like 'grep -r', but this might need
+nontrivial changes to the build process.
+ * chinese-big5
+     These are verbatim copies of files taken from external sources.
+     They haven't been converted to UTF-8.
+        leim/CXTERM-DIC/4Corner.tit
+        leim/CXTERM-DIC/ARRAY30.tit
+        leim/CXTERM-DIC/ECDICT.tit
+        leim/CXTERM-DIC/ETZY.tit
+        leim/CXTERM-DIC/PY-b5.tit
+        leim/CXTERM-DIC/Punct-b5.tit
+        leim/CXTERM-DIC/QJ-b5.tit
+        leim/CXTERM-DIC/ZOZY.tit
+        leim/MISC-DIC/CTLau-b5.html
+        leim/MISC-DIC/cangjie-table.b5
+ * chinese-iso-8bit
+     These are verbatim copies of files taken from external sources.
+     They haven't been converted to UTF-8.
+        leim/CXTERM-DIC/CCDOSPY.tit
+        leim/CXTERM-DIC/Punct.tit
+        leim/CXTERM-DIC/QJ.tit
+        leim/CXTERM-DIC/SW.tit
+        leim/CXTERM-DIC/TONEPY.tit
+        leim/MISC-DIC/pinyin.map
+        leim/MISC-DIC/CTLau.html
+        leim/MISC-DIC/ziranma.cin
+ * cp850
+     This file contains non-ASCII characters in unibyte strings.  When
+     editing a keyboard layout it's more convenient to see 'é' than
+     '\202', and the MS-DOS compiler requires the single byte if a
+     backslash escape is not being used.
+        src/msdos.c
+ * iso-2022-cn-ext
+     This file is externally generated from leim/MISC-DIC/cangjie-table.b5
+     by Big5->CNS converter.  It hasn't been converted to UTF-8.
+        leim/MISC-DIC/cangjie-table.cns
+ * iso-latin-2
+     These files are processed by csplain, a program that requires
+     Latin-2 input.  In 2012 the csplain maintainers started
+     recommending UTF-8, but these files haven't been converted yet.
+        etc/refcards/cs-dired-ref.tex
+        etc/refcards/cs-refcard.tex
+        etc/refcards/cs-survival.tex
+        etc/refcards/sk-dired-ref.tex
+        etc/refcards/sk-refcard.tex
+        etc/refcards/sk-survival.tex
+ * japanese-iso-8bit
+     SKK-JISYO.L is a verbatim copy of a file taken from an external source.
+     It hasn't been converted to UTF-8.
+        leim/SKK-DIC/SKK-JISYO.L
+ * japanese-shift-jis
+     This is a verbatim copy of a file taken from an external source.
+     It hasn't been converted to UTF-8.
+        admin/charsets/mapfiles/cns2ucsdkw.txt
+ * no-conversion
+     This file purposely contains arbitrary bytes interspersed within text,
+     to test whether the Emacs distribution is corrupted.
+        lib-src/testfile
+ * iso-2022-7bit
+     This file switches between CJK charsets, which is not encoded in UTF-8.
+        etc/HELLO
+     Each of these files contains just one CJK charset, but Emacs
+     currently has no easy way to specify set-charset-priority on a
+     per-file basis, so converting any of these files to UTF-8 might
+     change the file's appearance when viewed by an Emacs that is
+     operating in some other language environment.
+        etc/tutorials/TUTORIAL.ja
+        leim/quail/cyril-jis.el
+        leim/quail/hanja-jis.el
+        leim/quail/japanese.el
+        leim/quail/py-punct.el
+        leim/quail/pypunct-b5.el
+        lisp/international/ja-dic-cnv.el
+        lisp/international/ja-dic-utl.el
+        lisp/international/kinsoku.el
+        lisp/international/kkc.el
+        lisp/international/titdic-cnv.el
+        lisp/language/japan-util.el
+        lisp/language/japanese.el
+        lisp/term/x-win.el
+ * utf-8-emacs
+     These files contain characters that cannot be encoded in UTF-8.
+        leim/quail/tibetan.el
+        leim/quail/ethiopic.el
+        lisp/international/titdic-cnv.el
+        lisp/language/tibetan.el
+        lisp/language/tibet-util.el
+        lisp/language/ind-util.el
 This file is part of GNU Emacs.
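
The "Source file encoding" notes added above can be illustrated with a small sketch. It is not part of Emacs or of this commit: the helper names and the table mapping Emacs coding-system names to Python codec names are assumptions made for illustration, and the mapping is only approximate. Coding systems with no obvious Python counterpart (iso-2022-7bit, iso-2022-cn-ext, utf-8-emacs, no-conversion) are left out. The sketch shows why a byte such as '\202' in a cp850 file is not valid UTF-8 on its own, which is the kind of mismatch that makes tools like 'grep -r' awkward on these files.

```python
# Hypothetical helpers for illustration only; not part of Emacs.

# Assumed, approximate mapping from Emacs coding-system names (as listed
# in the notes) to Python codec names.
CODEC_FOR = {
    "chinese-big5": "big5",
    "chinese-iso-8bit": "gb2312",
    "cp850": "cp850",
    "iso-latin-2": "iso8859_2",
    "japanese-iso-8bit": "euc_jp",
    "japanese-shift-jis": "shift_jis",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if DATA decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_as(data: bytes, coding_system: str) -> str:
    """Decode DATA using the Python codec assumed for CODING_SYSTEM."""
    return data.decode(CODEC_FOR[coding_system])


if __name__ == "__main__":
    raw = bytes([0o202])  # the '\202' byte mentioned in the cp850 entry
    print(is_valid_utf8(raw))
    print(decode_as(raw, "cp850"))
```

Running the script decodes the single byte under cp850 and reports whether the same byte is acceptable as UTF-8; a real survey of the tree would read each listed file in binary mode and apply the same two checks.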