| author | Eli Zaretskii | 2013-12-05 22:59:23 +0200 |
|---|---|---|
| committer | Eli Zaretskii | 2013-12-05 22:59:23 +0200 |
| commit | 0cd7a14e577cae9c0713d1cfa549cfca3f0ca06c (patch) | |
| tree | 31828931b3b0e4b1bf40e68e96f2601967c1266f /src | |
| parent | a22205d67caa1cbf666a703d2bc26afd5a2704b6 (diff) | |
Added commentary about the overall design and its limitations.
Diffstat (limited to 'src')
| -rw-r--r-- | src/w32.c | 92 |
1 files changed, 92 insertions, 0 deletions
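The commentary added by this commit makes two points that a small example can illustrate: file-name conversions must write into caller-supplied buffers rather than allocate memory, and buffers for UTF-8 file names must be larger than MAX_PATH. The following is a minimal, portable sketch of those two ideas only. Emacs itself relies on the Windows APIs MultiByteToWideChar and WideCharToMultiByte; the hand-written converter, the name `sketch_utf16_to_utf8`, and the `SKETCH_` constants (including the factor of 4) are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch.  One UTF-16 code unit can
   expand to as many as 3 bytes of UTF-8, so a buffer sized in bytes
   must be a multiple of the limit counted in UTF-16 units; the
   factor of 4 used here is an assumption with some headroom.  */
#define SKETCH_MAX_PATH 260
#define SKETCH_MAX_UTF8_PATH (SKETCH_MAX_PATH * 4)

/* Convert the NUL-terminated UTF-16 string SRC to UTF-8, writing into
   the caller-supplied buffer DST of DST_SIZE bytes.  No memory is
   allocated.  Return the number of bytes written, not counting the
   terminating NUL, or -1 if SRC is malformed or does not fit.  */
static int
sketch_utf16_to_utf8 (const uint16_t *src, char *dst, size_t dst_size)
{
  size_t out = 0;

  assert (src != NULL && dst != NULL);

  while (*src)
    {
      uint32_t cp = *src++;
      size_t need;

      if (cp >= 0xD800 && cp <= 0xDBFF)
        {
          /* A high surrogate must be followed by a low surrogate.  */
          if (*src < 0xDC00 || *src > 0xDFFF)
            return -1;
          cp = 0x10000 + ((cp - 0xD800) << 10) + (*src++ - 0xDC00);
        }
      else if (cp >= 0xDC00 && cp <= 0xDFFF)
        return -1;              /* Lone low surrogate.  */

      need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
      /* Always leave room for the terminating NUL.  */
      if (out + need + 1 > dst_size)
        return -1;

      if (need == 1)
        dst[out++] = (char) cp;
      else if (need == 2)
        {
          dst[out++] = (char) (0xC0 | (cp >> 6));
          dst[out++] = (char) (0x80 | (cp & 0x3F));
        }
      else if (need == 3)
        {
          dst[out++] = (char) (0xE0 | (cp >> 12));
          dst[out++] = (char) (0x80 | ((cp >> 6) & 0x3F));
          dst[out++] = (char) (0x80 | (cp & 0x3F));
        }
      else
        {
          dst[out++] = (char) (0xF0 | (cp >> 18));
          dst[out++] = (char) (0x80 | ((cp >> 12) & 0x3F));
          dst[out++] = (char) (0x80 | ((cp >> 6) & 0x3F));
          dst[out++] = (char) (0x80 | (cp & 0x3F));
        }
    }

  if (out >= dst_size)
    return -1;                  /* No room even for the NUL.  */
  dst[out] = '\0';
  return (int) out;
}
```

A caller would declare something like `char buf[SKETCH_MAX_UTF8_PATH];` on the stack and pass `sizeof buf` as the size. A name made only of characters such as U+20AC occupies one UTF-16 unit but three UTF-8 bytes per character, which is why a MAX_PATH-sized char array is not enough.
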
| @@ -1290,6 +1290,98 @@ w32_valid_pointer_p (void *p, int size) | |||
| 1290 | 1290 | ||
| 1291 | 1291 | ||
| 1292 | 1292 | ||
| 1293 | /* Here's an overview of how the Windows build supports file names | ||
| 1294 | that cannot be encoded by the current system codepage. | ||
| 1295 | |||
| 1296 | From the POV of Lisp and layers of C code above the functions here, | ||
| 1297 | Emacs on Windows pretends that its file names are encoded in UTF-8; | ||
| 1298 | see encode_file and decode_file in coding.c. Any file name that is | ||
| 1299 | passed as a unibyte string to C functions defined here is assumed | ||
| 1300 | to be in UTF-8 encoding. Any file name returned by functions | ||
| 1301 | defined here must be in UTF-8 encoding, with only a few exceptions | ||
| 1302 | reserved for a couple of special cases. (Be sure to use | ||
| 1303 | MAX_UTF8_PATH for char arrays that store UTF-8 encoded file names, | ||
| 1304 | as they can be much longer than MAX_PATH!) | ||
| 1305 | |||
| 1306 | The UTF-8 encoded file names cannot be passed to system APIs, as | ||
| 1307 | Windows does not support that. Therefore, they are converted | ||
| 1308 | either to UTF-16 or to the ANSI codepage, depending on the value of | ||
| 1309 | w32-unicode-filenames, before calling any system APIs or CRT library | ||
| 1310 | functions. The default value of that variable is determined by the | ||
| 1311 | OS on which Emacs runs: nil on Windows 9X and t otherwise, but the | ||
| 1312 | user can change that default (although I don't see why she would | ||
| 1313 | want to). | ||
| 1314 | |||
| 1315 | The 4 functions defined below, filename_to_utf16, filename_to_ansi, | ||
| 1316 | filename_from_utf16, and filename_from_ansi, are the workhorses of | ||
| 1317 | these conversions. They rely on Windows native APIs | ||
| 1318 | MultiByteToWideChar and WideCharToMultiByte; we cannot use | ||
| 1319 | functions from coding.c here, because they allocate memory, which | ||
| 1320 | is a bad idea at the level of libc, which is what the functions | ||
| 1321 | here emulate. (If you worry about performance due to constant | ||
| 1322 | conversion back and forth from UTF-8 to UTF-16, then don't: first, | ||
| 1323 | it was measured to take only a few microseconds on a not-so-fast | ||
| 1324 | machine, and second, that's exactly what the ANSI APIs we used | ||
| 1325 | before do anyway, because they are just thin wrappers around the | ||
| 1326 | Unicode APIs.) | ||
| 1327 | |||
| 1328 | The variables file-name-coding-system and default-file-name-coding-system | ||
| 1329 | still exist, but are actually used only when a file name needs to | ||
| 1330 | be converted to the ANSI codepage. This happens all the time when | ||
| 1331 | w32-unicode-filenames is nil, but can also happen from time to time | ||
| 1332 | when it is t. Otherwise, these variables have no effect on file-name | ||
| 1333 | encoding when w32-unicode-filenames is t; this is similar to | ||
| 1334 | selection-coding-system. | ||
| 1335 | |||
| 1336 | This arrangement works very well, but it has a few gotchas: | ||
| 1337 | |||
| 1338 | . Lisp code that encodes or decodes file names manually should | ||
| 1339 | normally use 'utf-8' as the coding-system on Windows, | ||
| 1340 | disregarding file-name-coding-system. This is a somewhat | ||
| 1341 | unpleasant consequence, but it cannot be avoided. Fortunately, | ||
| 1342 | very few Lisp packages need to do that. | ||
| 1343 | |||
| 1344 | More generally, passing to library functions (e.g., fopen or | ||
| 1345 | opendir) file names already encoded in the ANSI codepage is | ||
| 1346 | explicitly *verboten*, as all those functions, as shadowed and | ||
| 1347 | emulated here, assume they will receive UTF-8 encoded file names. | ||
| 1348 | |||
| 1349 | For the same reasons, no CRT function or Win32 API can be called | ||
| 1350 | directly in Emacs sources, without either converting the file | ||
| 1351 | name from UTF-8 to UTF-16 or the ANSI codepage, or going | ||
| 1352 | through some shadowing function defined here. | ||
| 1353 | |||
| 1354 | . File names passed to external libraries, like the image libraries | ||
| 1355 | and GnuTLS, need special handling. These libraries generally | ||
| 1356 | don't support UTF-16 or UTF-8 file names, so they must get file | ||
| 1357 | names encoded in the ANSI codepage. To facilitate using these | ||
| 1358 | libraries with file names that are not encodable in the ANSI | ||
| 1359 | codepage, use the function ansi_encode_filename, which will try | ||
| 1360 | to use the short 8+3 alias of a file name if that file name is | ||
| 1361 | not encodable in the ANSI codepage. See image.c and gnutls.c for | ||
| 1362 | examples of how this should be done. | ||
| 1363 | |||
| 1364 | . Running subprocesses in non-ASCII directories and with non-ASCII | ||
| 1365 | file arguments is limited to the current codepage (even though | ||
| 1366 | Emacs is perfectly capable of finding an executable program file | ||
| 1367 | even in a directory whose name cannot be encoded in the current | ||
| 1368 | codepage). This is because the command-line arguments are | ||
| 1369 | encoded _before_ they get to the w32-specific level, and the | ||
| 1370 | encoding is not known in advance (it doesn't have to be the | ||
| 1371 | current ANSI codepage), so w32proc.c functions cannot re-encode | ||
| 1372 | them in UTF-16. This should be fixed, but will also require | ||
| 1373 | changes in cmdproxy. The current limitation is not terribly bad | ||
| 1374 | anyway, since very few, if any, Windows console programs that are | ||
| 1375 | likely to be invoked by Emacs support UTF-16 encoded command | ||
| 1376 | lines. | ||
| 1377 | |||
| 1378 | . For similar reasons, server.el and emacsclient are also limited | ||
| 1379 | to the current ANSI codepage for now. | ||
| 1380 | |||
| 1381 | */ | ||
| 1382 | |||
| 1383 | |||
| 1384 | |||
| 1293 | /* Converting file names from UTF-8 to either UTF-16 or the ANSI | 1385 | /* Converting file names from UTF-8 to either UTF-16 or the ANSI |
| 1294 | codepage defined by file-name-coding-system. */ | 1386 | codepage defined by file-name-coding-system. */ |
| 1295 | 1387 | ||