* new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N.

* pathnames.sgml (pathnames-unusual): Ditto.
	* setup2.sgml (setup-locale-ov): Change description according to
	latest changes.
	(setup-locale-how): Rewrite.
	(setup-locale-console): Enable section again.  Change to reflect
	recent changes.
	(setup-locale-problems): Change to reflect recent changes.
Corinna Vinschen 2009-09-30 09:45:01 +00:00
parent 4180b64df4
commit ffca4d278e
4 changed files with 86 additions and 74 deletions

@@ -1,3 +1,14 @@
+2009-09-30 Corinna Vinschen <corinna@vinschen.de>
+* new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N.
+* pathnames.sgml (pathnames-unusual): Ditto.
+* setup2.sgml (setup-locale-ov): Change description according to
+latest changes.
+(setup-locale-how): Rewrite.
+(setup-locale-console): Enable section again. Change to reflect
+recent changes.
+(setup-locale-problems): Change to reflect recent changes.
 2009-09-26 Eric Blake <ebb9@byu.net>
 * new-features.sgml (ov-new1.7-file): Mention fexecve, execvpe.

@@ -22,7 +22,7 @@
 /etc/fstab.
 - If a filename cannot be represented in the current character set,
-the character will be converted to a sequence Ctrl-N + UTF-8 representation
+the character will be converted to a sequence Ctrl-X + UTF-8 representation
 of the character. This allows to access all files, even those not
 having a valid representation of their filename in the current character
 set (codepage). To always have a valid string, use the UTF-8 charset

@@ -424,14 +424,14 @@ reason, you will nevertheless be able to access the file. How does that
 work? When Cygwin converts the filename from UTF-16 to your character
 set, it recognizes characters which can't be converted. If that occurs,
 Cygwin replaces the non-convertible character with a special character
-sequence. The sequence starts with an ASCII SO character (hex code
-0x0e, equivalent Control-N), followed by the UTF-8 representation of the
+sequence. The sequence starts with an ASCII CAN character (hex code
+0x18, equivalent Control-X), followed by the UTF-8 representation of the
 character. The result is a filename containing some ugly looking
 characters. While it doesn't <emphasis>look</emphasis> nice, it
 <emphasis>is</emphasis> nice, because Cygwin knows how to convert this
 filename back to UTF-16. The filename will be converted using your
-usual character set. However, when Cygwin recognizes an ASCII SO
-character, it skips over the ASCII SO and handles the following bytes as
+usual character set. However, when Cygwin recognizes an ASCII CAN
+character, it skips over the ASCII CAN and handles the following bytes as
 a UTF-8 character. Thus, the filename is symmetrically converted back to
 UTF-16 and you can access the file.</para>

@@ -170,11 +170,37 @@ manual pages on the homepage of the
 </screen>
 <para>
-And let's not forget the default locale called "C" or "POSIX"
-which basically only supports plain ASCII code. If the aforementioned
-environment variables are not set, or set to "C" or "POSIX", you get the
-default ASCII-only behaviour.
-</para>
+At application startup, the application's locale is set to the default
+"C" or "POSIX" locale. Under Cygwin, this locale defaults to the UTF-8
+character set. If you want to stick to the "C" locale and only change to
+another charset, you can define this by setting one of the locale environment
+variables to "C.charset". For instance</para>
+<screen>
+"C.ISO-9959-1"
+</screen>
+<para>Windows uses the UTF-16 charset exclusively to store the names
+of any object used by the Operating System. This is especially important
+with filenames. Cygwin uses the setting of the locale environment variables
+<envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, and <envar>LANG</envar>, to
+determine how to convert Windows filenames from their UTF-16 representation
+to the singlebyte or multibyte character set used by Cygwin. Setting
+the environment variables to another value changes the way filenames are
+converted in subsequently stated programs.</para>
+<para>
+However, even if one of the locale environment variables is set to
+some other value than "C", this does <emphasis>only</emphasis> affect
+how Cygwin itself converts filenames. As the POSIX standard requires,
+it's the applications responsibility to activate that locale for its
+own purpose, typically by using the call</para>
+<screen>
+setlocale (LC_ALL, "");
+</screen>
+<para>early in the application code.</para>
 <para>
 Right now the language and territory, as well as the modifier, are not
@@ -187,7 +213,7 @@ these characters have a width of 2. Kind of explains why they are
 called "ambiguous"...</para>
 <para>
-The problem has been fixed for now like this. wcwidth/wcswidth usually
+The problem has been fixed like this. wcwidth/wcswidth usually
 return 1 as the width of these characters. However, if the language is
 specifed as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth
 returns 2 for these characters. Unfortunately this isn't correct in
@@ -197,6 +223,7 @@ ambiguous width characters to return 1 even in those languages.</para>
 <para>
 Other than that, the only important part so far is the character set.
 How does that work?</para>
 </sect2>
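
Illustration (not part of the commit): the setup-locale-ov text names the variables `LC_ALL`, `LC_CTYPE` and `LANG` and the value forms `"C.charset"` and `"ja_JP"`. A rough Python sketch of how such a setting can be picked and split into its parts follows. The helper names `effective_locale` and `parse_locale` are hypothetical, the lookup order assumes the usual POSIX precedence (`LC_ALL`, then `LC_CTYPE`, then `LANG`, with an empty value counting as unset), and a missing charset is reported as `None` rather than resolved to the Cygwin default.

```python
from typing import Mapping, NamedTuple, Optional


class LocaleSpec(NamedTuple):
    language: str
    territory: Optional[str]
    charset: Optional[str]
    modifier: Optional[str]


def effective_locale(env: Mapping[str, str]) -> str:
    """Return the first non-empty setting among LC_ALL, LC_CTYPE and LANG, else "C"."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def parse_locale(value: str) -> LocaleSpec:
    """Split language[_TERRITORY][.charset][@modifier] into its components."""
    rest, _, modifier = value.partition("@")
    rest, _, charset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return LocaleSpec(language, territory or None, charset or None, modifier or None)
```

For example, `parse_locale(effective_locale({"LANG": "ja_JP"}))` yields language `ja`, territory `JP` and no charset, which is the case where the text says Cygwin falls back to the default Windows ANSI codepage.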
@@ -206,31 +233,18 @@ How does that work?</para>
 <itemizedlist mark="bullet">
 <listitem><para>
-The default locale is the "C" or "POSIX" locale. In this locale, basically
-only ASCII characters are supported. Even if one of the aforementioned
-environment variables are set to something else, it's the application's
-responsibility to call the function <function>setlocale</function>,
-typically like this</para>
-<screen>
-setlocale (LC_ALL, "");
-</screen>
-<para>to switch to another locale according to the settings of the
-internationalization environment variables.
-</para></listitem>
+The default locale is the "C" or "POSIX" locale. Under Cygwin this locale
+defaults to the UTF-8 character set.</para>
+</listitem>
 <listitem><para>
 Assume that you've set one of the aforementioned environment variables to some
-valid POSIX locale value, other than "C" and "POSIX", and assume that you
-call an application which calls <function>setlocale</function> as above.</para>
-<para>Assume further that you're living in Japan. You might want to use
-the language code "ja" and the territory "JP", thus setting, say,
-<envar>LANG</envar> to "ja_JP". You didn't set a character set, so
-what will Cygwin use now? Easy! It will use the default Windows ANSI
-codepage of your system, if it's supported by Cygwin. Hopefully Cygwin
-supports all relevant default ANSI codepages...</para>
+valid POSIX locale value, other than "C" and "POSIX". Assume further that
+you're living in Japan. You might want to use the language code "ja" and the
+territory "JP", thus setting, say, <envar>LANG</envar> to "ja_JP". You didn't
+set a character set, so what will Cygwin use now? Easy! It will use the
+default Windows ANSI codepage of your system, if it's supported by Cygwin.
+Hopefully Cygwin supports all relevant default ANSI codepages...</para>
 <note><para>For a list of supported character sets, see
 <xref linkend="setup-locale-charsetlist"></xref>
@@ -240,10 +254,10 @@ supports all relevant default ANSI codepages...</para>
 <listitem><para>
 You don't want to use the default Windows codepage as character set?
 In that case you have to specify the charset explicitly. For instance,
-assume you're from Italy and don't want to use the default Windows codepage
-1252, but the more portable ISO-8859-15 character set. What you can do is
-to set the <envar>LANG</envar> variable in the
-<filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
+assume you're from Italy and don't want to use the Italian default Windows
+ANSI codepage 1252, but the more portable ISO-8859-15 character set.
+What you can do, for instance, is to set the <envar>LANG</envar> variable
+in the <filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
 to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
 <screen>
@@ -257,14 +271,16 @@ to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
 </listitem>
 <listitem><para>
-Most singlebyte or doublebyte charsets have a disadvantage. Windows
-filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters
+Last, but not least, most singlebyte or doublebyte charsets have a big
+disadvantage. Windows filesystems use the Unicode character set in the
+UTF-16 encoding to store filename information. Not all characters
 from the Unicode character set are available in a singlebyte or doublebyte
 charset. While Cygwin has a workaround to access files with unusual
 characters (see <xref linkend="pathnames-unusual"></xref>), a better
-workaround is to use always the UTF-8 character set. UTF-8 is the only
-multibyte character set which can represent <emphasis>every</emphasis>
-Unicode character.</para>
+workaround is to use always the UTF-8 character set.i</para>
+<para><emphasis>UTF-8 is the only multibyte character set which can represent
+every Unicode character.</emphasis></para>
 <screen>
 set LANG=es_MX.UTF-8
@@ -278,7 +294,6 @@ Unicode character.</para>
 </sect2>
-<!-- TODO: This is not correct anymore.
 <sect2 id="setup-locale-console"><title>The Windows Console character set</title>
 <para>Most of the time the Windows console is used to run Cygwin applications.
@@ -287,7 +302,7 @@ While terminal emulations like <command>xterm</command> or
 used for in- and output, the Windows console hasn't such a way, since it's
 not an application in its own right.</para>
-<para>This problem is solved in Cygwin as follows. When the first Cygwin
+<para>This problem is solved in Cygwin as follows. When a Cygwin
 process is started in a Windows console (either explicitly from cmd.exe,
 or implicitly by, for instance, clicking on the Cygwin desktop icon, or
 running the Cygwin.bat file), the Console character set is determined by the
@@ -295,27 +310,18 @@ setting of the aforementioned internationalization environment variables,
 the same way as described in <xref linkend="setup-locale-how"></xref>.
 </para>
-<para>However, in contrast to the application's character set, which is
-determined by the <function>setlocale</function> call, the console
-character set stays fixed for all subsequent Cygwin processes started
-from this first Cygwin process in the console. So, for instance, if
-<envar>LANG</envar> was set to "en_US.UTF-8" when the first Cygwin process
-started, the console is a UTF-8 terminal for the entire Cygwin process
-tree started from this first Cygwin process.</para>
-<para>You're asking "What is that good for? Why not switch the console
-character set with the applications requirements? After all, the
-application knows if it uses localization or not." That's true, but
-what if the non-localized application calls a remote application which
-itself is localized? This can happen with <command>ssh</command> or
-<command>rlogin</command>. Both commands don't have and don't need
-localization and they never call <function>setlocale</function>. This
-would have the unfortunate effect, that the console would run with the
-ASCII character set alone. Native characters printed from the remote
-application would not show up correctly on your local console.</para>
+<para>What is that good for? Why not switch the console character set with
+the applications requirements? After all, the application knows if it uses
+localization or not. However, what if a non-localized application calls
+a remote application which itself is localized? This can happen with
+<command>ssh</command> or <command>rlogin</command>. Both commands don't
+have and don't need localization and they never call
+<function>setlocale</function>. Setting one of the internationalization
+environment variable to the same charset as the remote machine before
+starting <command>ssh</command> or <command>rlogin</command> fixes that
+problem.</para>
 </sect2>
--->
 <sect2 id="setup-locale-problems"><title>Potential Problems when using Locales</title>
@@ -330,22 +336,17 @@ set, and yet another. In bash for instance:</para>
 </screen>
 <para>However, here's a problem. At the start of the first Cygwin process
-in a session, the Windows environment has to be converted from UTF-16 to
-some singlebyte or multibyte charset. If the internationalization environment
-variable hasn't been set <emphasis>before</emphasis> starting this process,
-Cygwin has to make an educated guess which charset to use to convert
-the environment itself. The only reproducible way to do that in the absence
-of <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, or <envar>LANG</envar>,
-is to use the "C" locale. The default conversion in the "C" locale
-used by Cygwin internally is UTF-8. So, in the absence of any
-internationalization environment variable, the environment will be converted
-to UTF-8.</para>
+in a session, the Windows environment is converted from UTF-16 to UTF-8.
+The environment is another of the system objects stored in UTF-16 in
+Windows.</para>
 <para>As long as the environment only contains ASCII characters, this is
 no problem at all. But if it contains native characters, and you're planning
 to use, say, GBK, the environment will result in invalid characters in
 the GBK charset. This would be especially a problem in variables like
-<envar>PATH</envar>.</para>
+<envar>PATH</envar>. To circumvent the worst problems, Cygwin converts
+the <envar>PATH</envar> environment variable to the charset set in the
+environment, if it's different from the UTF-8 charset.</para>
 <note><para>Per POSIX, the name of an environment variable should only
 consist of valid ASCII characters, and only of uppercase letters, digits, and