Added commentary about the overall design and its limitations.

author: Eli Zaretskii 2013-12-05 22:59:23 +0200
committer: Eli Zaretskii 2013-12-05 22:59:23 +0200
commit: 0cd7a14e577cae9c0713d1cfa549cfca3f0ca06c (patch)
tree: 31828931b3b0e4b1bf40e68e96f2601967c1266f /src
parent: a22205d67caa1cbf666a703d2bc26afd5a2704b6 (diff)
1 files changed, 92 insertions, 0 deletions
diff --git a/src/w32.c b/src/w32.c
index 7d1ebebc68b..47c4f04b152 100644
--- a/src/w32.c
+++ b/src/w32.c
@@ -1290,6 +1290,98 @@ w32_valid_pointer_p (void *p, int size)
+/* Here's an overview of how the Windows build supports file names
+   that cannot be encoded by the current system codepage.
+   From the POV of Lisp and layers of C code above the functions here,
+   Emacs on Windows pretends that its file names are encoded in UTF-8;
+   see encode_file and decode_file in coding.c.  Any file name that is
+   passed as a unibyte string to C functions defined here is assumed
+   to be in UTF-8 encoding.  Any file name returned by functions
+   defined here must be in UTF-8 encoding, with only a few exceptions
+   reserved for a couple of special cases.  (Be sure to use
+   MAX_UTF8_PATH for char arrays that store UTF-8 encoded file names,
+   as they can be much longer than MAX_PATH!)
+   The UTF-8 encoded file names cannot be passed to system APIs, as
+   Windows does not support that.  Therefore, they are converted
+   either to UTF-16 or to the ANSI codepage, depending on the value of
+   w32-unicode-filenames, before calling any system APIs or CRT library
+   functions.  The default value of that variable is determined by the
+   OS on which Emacs runs: nil on Windows 9X and t otherwise, but the
+   user can change that default (although I don't see why she would
+   want to).
+   The 4 functions defined below, filename_to_utf16, filename_to_ansi,
+   filename_from_utf16, and filename_from_ansi, are the workhorses of
+   these conversions.  They rely on Windows native APIs
+   MultiByteToWideChar and WideCharToMultiByte; we cannot use
+   functions from coding.c here, because they allocate memory, which
+   is a bad idea on the level of libc, which is what the functions
+   here emulate.  (If you worry about performance due to constant
+   conversion back and forth from UTF-8 to UTF-16, then don't: first,
+   it was measured to take only a few microseconds on a not-so-fast
+   machine, and second, that's exactly what the ANSI APIs we used
+   before do anyway, because they are just thin wrappers around the
+   Unicode APIs.)
+   The variables file-name-coding-system and default-file-name-coding-system
+   still exist, but are actually used only when a file name needs to
+   be converted to the ANSI codepage.  This happens all the time when
+   w32-unicode-filenames is nil, but can also happen from time to time
+   when it is t.  Otherwise, these variables have no effect on file-name
+   encoding when w32-unicode-filenames is t; this is similar to
+   selection-coding-system.
+   This arrangement works very well, but it has a few gotchas:
+   . Lisp code that encodes or decodes file names manually should
+     normally use 'utf-8' as the coding-system on Windows,
+     disregarding file-name-coding-system.  This is a somewhat
+     unpleasant consequence, but it cannot be avoided.  Fortunately,
+     very few Lisp packages need to do that.
+     More generally, passing to library functions (e.g., fopen or
+     opendir) file names already encoded in the ANSI codepage is
+     explicitly *verboten*, as all those functions, as shadowed and
+     emulated here, assume they will receive UTF-8 encoded file names.
+     For the same reasons, no CRT function or Win32 API can be called
+     directly in Emacs sources, without either converting the file
+     name from UTF-8 to either UTF-16 or the ANSI codepage, or going
+     through some shadowing function defined here.
+   . File names passed to external libraries, like the image libraries
+     and GnuTLS, need special handling.  These libraries generally
+     don't support UTF-16 or UTF-8 file names, so they must get file
+     names encoded in the ANSI codepage.  To facilitate using these
+     libraries with file names that are not encodable in the ANSI
+     codepage, use the function ansi_encode_filename, which will try
+     to use the short 8+3 alias of a file name if that file name is
+     not encodable in the ANSI codepage.  See image.c and gnutls.c for
+     examples of how this should be done.
+   . Running subprocesses in non-ASCII directories and with non-ASCII
+     file arguments is limited to the current codepage (even though
+     Emacs is perfectly capable of finding an executable program file
+     even in a directory whose name cannot be encoded in the current
+     codepage).  This is because the command-line arguments are
+     encoded _before_ they get to the w32-specific level, and the
+     encoding is not known in advance (it doesn't have to be the
+     current ANSI codepage), so w32proc.c functions cannot re-encode
+     them in UTF-16.  This should be fixed, but will also require
+     changes in cmdproxy.  The current limitation is not terribly bad
+     anyway, since very few, if any, Windows console programs that are
+     likely to be invoked by Emacs support UTF-16 encoded command
+     lines.
+   . For similar reasons, server.el and emacsclient are also limited
+     to the current ANSI codepage for now.
+*/
 /* Converting file names from UTF-8 to either UTF-16 or the ANSI
   codepage defined by file-name-coding-system.  */
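
The commentary above says the conversion helpers must not allocate memory and that UTF-8 buffers should be sized with MAX_UTF8_PATH rather than MAX_PATH. The following is a minimal, portable sketch of that idea, not the code in w32.c: the real helpers rely on MultiByteToWideChar and WideCharToMultiByte, whereas this sketch substitutes a hand-written UTF-8 decoder so it compiles outside Windows. The names sketch_utf8_to_utf16, SKETCH_MAX_PATH and SKETCH_MAX_UTF8_PATH, as well as the values of the two constants, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits for this sketch; not the definitions Emacs uses.
   A code point takes at most 4 bytes in UTF-8, so a UTF-8 buffer can
   need several times as many bytes as its UTF-16 form has code units.  */
#define SKETCH_MAX_PATH 260                         /* UTF-16 code units */
#define SKETCH_MAX_UTF8_PATH (SKETCH_MAX_PATH * 4)  /* bytes */

/* Convert the NUL-terminated UTF-8 string IN to UTF-16 in OUT, which
   has room for OUT_SIZE code units including the terminator.  Return
   the number of code units written, not counting the terminator, or
   -1 if IN is malformed or does not fit.  No memory is allocated; the
   caller supplies the buffer, typically on the stack.  */
static int
sketch_utf8_to_utf16 (const char *in, uint16_t *out, size_t out_size)
{
  /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes.  */
  static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
  const unsigned char *p = (const unsigned char *) in;
  size_t n = 0;

  assert (in != NULL && out != NULL);

  while (*p)
    {
      uint32_t cp;
      int extra, i;

      /* Decode the lead byte and learn how many bytes follow.  */
      if (*p < 0x80)
        { cp = *p; extra = 0; }
      else if ((*p & 0xE0) == 0xC0)
        { cp = *p & 0x1F; extra = 1; }
      else if ((*p & 0xF0) == 0xE0)
        { cp = *p & 0x0F; extra = 2; }
      else if ((*p & 0xF8) == 0xF0)
        { cp = *p & 0x07; extra = 3; }
      else
        return -1;
      p++;

      /* A NUL here fails the continuation test, so we never read
         past the end of a truncated sequence.  */
      for (i = 0; i < extra; i++, p++)
        {
          if ((*p & 0xC0) != 0x80)
            return -1;
          cp = (cp << 6) | (*p & 0x3F);
        }

      /* Reject overlong forms, surrogates, and out-of-range values.  */
      if (cp < min_cp[extra] || cp > 0x10FFFF
          || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

      if (cp < 0x10000)
        {
          if (n + 1 >= out_size)
            return -1;
          out[n++] = (uint16_t) cp;
        }
      else
        {
          /* Code points above the BMP become a surrogate pair.  */
          if (n + 2 >= out_size)
            return -1;
          cp -= 0x10000;
          out[n++] = (uint16_t) (0xD800 | (cp >> 10));
          out[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        }
    }

  if (n >= out_size)
    return -1;
  out[n] = 0;
  return (int) n;
}
```

A caller would declare something like `uint16_t wide[SKETCH_MAX_PATH];` on the stack, pass it together with its size, and treat a return value of -1 as "this name cannot be used", much as a libc-level function would fail rather than allocate a larger buffer.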