From ffca4d278ee9c95c03357761fe36cf3b81f11017 Mon Sep 17 00:00:00 2001 From: Corinna Vinschen Date: Wed, 30 Sep 2009 09:45:01 +0000 Subject: [PATCH] * new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N. * pathnames.sgml (pathnames-unusual): Ditto. * setup2.sgml (setup-locale-ov): Change description according to latest changes. (setup-locale-how): Rewrite. (setup-locale-console): Enable section again. Change to reflect recent changes. (setup-locale-problems): Change to reflect recent changes. --- winsup/doc/ChangeLog | 11 +++ winsup/doc/new-features.sgml | 2 +- winsup/doc/pathnames.sgml | 8 +- winsup/doc/setup2.sgml | 139 ++++++++++++++++++----------------- 4 files changed, 86 insertions(+), 74 deletions(-) diff --git a/winsup/doc/ChangeLog b/winsup/doc/ChangeLog index 49754267f..19e7ec866 100644 --- a/winsup/doc/ChangeLog +++ b/winsup/doc/ChangeLog @@ -1,3 +1,14 @@ +2009-09-30 Corinna Vinschen + + * new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N. + * pathnames.sgml (pathnames-unusual): Ditto. + * setup2.sgml (setup-locale-ov): Change description according to + latest changes. + (setup-locale-how): Rewrite. + (setup-locale-console): Enable section again. Change to reflect + recent changes. + (setup-locale-problems): Change to reflect recent changes. + 2009-09-26 Eric Blake * new-features.sgml (ov-new1.7-file): Mention fexecve, execvpe. diff --git a/winsup/doc/new-features.sgml b/winsup/doc/new-features.sgml index 5c3a4e4ba..dda067ac3 100644 --- a/winsup/doc/new-features.sgml +++ b/winsup/doc/new-features.sgml @@ -22,7 +22,7 @@ /etc/fstab. - If a filename cannot be represented in the current character set, - the character will be converted to a sequence Ctrl-N + UTF-8 representation + the character will be converted to a sequence Ctrl-X + UTF-8 representation of the character. This allows to access all files, even those not having a valid representation of their filename in the current character set (codepage). 
To always have a valid string, use the UTF-8 charset diff --git a/winsup/doc/pathnames.sgml b/winsup/doc/pathnames.sgml index c6fd792d8..527096fcb 100644 --- a/winsup/doc/pathnames.sgml +++ b/winsup/doc/pathnames.sgml @@ -424,14 +424,14 @@ reason, you will nevertheless be able to access the file. How does that work? When Cygwin converts the filename from UTF-16 to your character set, it recognizes characters which can't be converted. If that occurs, Cygwin replaces the non-convertible character with a special character -sequence. The sequence starts with an ASCII SO character (hex code -0x0e, equivalent Control-N), followed by the UTF-8 representation of the +sequence. The sequence starts with an ASCII CAN character (hex code +0x18, equivalent Control-X), followed by the UTF-8 representation of the character. The result is a filename containing some ugly looking characters. While it doesn't look nice, it is nice, because Cygwin knows how to convert this filename back to UTF-16. The filename will be converted using your -usual character set. However, when Cygwin recognizes an ASCII SO -character, it skips over the ASCII SO and handles the following bytes as +usual character set. However, when Cygwin recognizes an ASCII CAN +character, it skips over the ASCII CAN and handles the following bytes as a UTF-8 character. Thus, the filename is symmetrically converted back to UTF-16 and you can access the file. diff --git a/winsup/doc/setup2.sgml b/winsup/doc/setup2.sgml index 78ebc2e9c..15e581768 100644 --- a/winsup/doc/setup2.sgml +++ b/winsup/doc/setup2.sgml @@ -170,11 +170,37 @@ manual pages on the homepage of the -And let's not forget the default locale called "C" or "POSIX" -which basically only supports plain ASCII code. If the aforementioned -environment variables are not set, or set to "C" or "POSIX", you get the -default ASCII-only behaviour. - +At application startup, the application's locale is set to the default +"C" or "POSIX" locale. 
Under Cygwin, this locale defaults to the UTF-8 +character set. If you want to stick to the "C" locale and only change to +another charset, you can define this by setting one of the locale environment +variables to "C.charset". For instance + + + "C.ISO-8859-1" + + +Windows uses the UTF-16 charset exclusively to store the names +of any object used by the Operating System. This is especially important +with filenames. Cygwin uses the setting of the locale environment variables +LC_ALL, LC_CTYPE, and LANG, to +determine how to convert Windows filenames from their UTF-16 representation +to the singlebyte or multibyte character set used by Cygwin. Setting +the environment variables to another value changes the way filenames are +converted in subsequently started programs. + + +However, even if one of the locale environment variables is set to +some other value than "C", this does only affect +how Cygwin itself converts filenames. As the POSIX standard requires, +it's the application's responsibility to activate that locale for its +own purpose, typically by using the call + + + setlocale (LC_ALL, ""); + + +early in the application code. Right now the language and territory, as well as the modifier, are not @@ -187,7 +213,7 @@ these characters have a width of 2. Kind of explains why they are called "ambiguous"... -The problem has been fixed for now like this. wcwidth/wcswidth usually +The problem has been fixed like this. wcwidth/wcswidth usually return 1 as the width of these characters. However, if the language is specifed as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth returns 2 for these characters. Unfortunately this isn't correct in @@ -197,6 +223,7 @@ ambiguous width characters to return 1 even in those languages. Other than that, the only important part so far is the character set. + How does that work? @@ -206,31 +233,18 @@ How does that work? -The default locale is the "C" or "POSIX" locale. 
In this locale, basically -only ASCII characters are supported. Even if one of the aforementioned -environment variables are set to something else, it's the application's -responsibility to call the function setlocale, -typically like this - - - setlocale (LC_ALL, ""); - - -to switch to another locale according to the settings of the -internationalization environment variables. - +The default locale is the "C" or "POSIX" locale. Under Cygwin this locale +defaults to the UTF-8 character set. + Assume that you've set one of the aforementioned environment variables to some -valid POSIX locale value, other than "C" and "POSIX", and assume that you -call an application which calls setlocale as above. - -Assume further that you're living in Japan. You might want to use -the language code "ja" and the territory "JP", thus setting, say, -LANG to "ja_JP". You didn't set a character set, so -what will Cygwin use now? Easy! It will use the default Windows ANSI -codepage of your system, if it's supported by Cygwin. Hopefully Cygwin -supports all relevant default ANSI codepages... +valid POSIX locale value, other than "C" and "POSIX". Assume further that +you're living in Japan. You might want to use the language code "ja" and the +territory "JP", thus setting, say, LANG to "ja_JP". You didn't +set a character set, so what will Cygwin use now? Easy! It will use the +default Windows ANSI codepage of your system, if it's supported by Cygwin. +Hopefully Cygwin supports all relevant default ANSI codepages... For a list of supported character sets, see @@ -240,10 +254,10 @@ supports all relevant default ANSI codepages... You don't want to use the default Windows codepage as character set? In that case you have to specify the charset explicitly. For instance, -assume you're from Italy and don't want to use the default Windows codepage -1252, but the more portable ISO-8859-15 character set. 
What you can do is -to set the LANG variable in the -C:\cygwin\Cygwin.bat file which is the batch file +assume you're from Italy and don't want to use the Italian default Windows +ANSI codepage 1252, but the more portable ISO-8859-15 character set. +What you can do, for instance, is to set the LANG variable +in the C:\cygwin\Cygwin.bat file which is the batch file to start a Cygwin session from the "Cygwin" desktop shortcut. @@ -257,14 +271,16 @@ to start a Cygwin session from the "Cygwin" desktop shortcut. -Most singlebyte or doublebyte charsets have a disadvantage. Windows -filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters +Last, but not least, most singlebyte or doublebyte charsets have a big +disadvantage. Windows filesystems use the Unicode character set in the +UTF-16 encoding to store filename information. Not all characters from the Unicode character set are available in a singlebyte or doublebyte charset. While Cygwin has a workaround to access files with unusual characters (see ), a better -workaround is to use always the UTF-8 character set. UTF-8 is the only -multibyte character set which can represent every -Unicode character. +workaround is to use always the UTF-8 character set. + +UTF-8 is the only multibyte character set which can represent +every Unicode character. set LANG=es_MX.UTF-8 @@ -278,7 +294,6 @@ Unicode character. - Potential Problems when using Locales @@ -330,22 +336,17 @@ set, and yet another. In bash for instance: However, here's a problem. At the start of the first Cygwin process -in a session, the Windows environment has to be converted from UTF-16 to -some singlebyte or multibyte charset. If the internationalization environment -variable hasn't been set before starting this process, -Cygwin has to make an educated guess which charset to use to convert -the environment itself. 
The only reproducible way to do that in the absence -of LC_ALL, LC_CTYPE, or LANG, -is to use the "C" locale. The default conversion in the "C" locale -used by Cygwin internally is UTF-8. So, in the absence of any -internationalization environment variable, the environment will be converted -to UTF-8. +in a session, the Windows environment is converted from UTF-16 to UTF-8. +The environment is another of the system objects stored in UTF-16 in +Windows. As long as the environment only contains ASCII characters, this is no problem at all. But if it contains native characters, and you're planning to use, say, GBK, the environment will result in invalid characters in the GBK charset. This would be especially a problem in variables like -PATH. +PATH. To circumvent the worst problems, Cygwin converts +the PATH environment variable to the charset set in the +environment, if it's different from the UTF-8 charset. Per POSIX, the name of an environment variable should only consist of valid ASCII characters, and only of uppercase letters, digits, and