* new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N.
* pathnames.sgml (pathnames-unusual): Ditto.
* setup2.sgml (setup-locale-ov): Change description according to latest changes.
(setup-locale-how): Rewrite.
(setup-locale-console): Enable section again. Change to reflect recent changes.
(setup-locale-problems): Change to reflect recent changes.
parent
4180b64df4
commit
ffca4d278e
@ -1,3 +1,14 @@
|
|||||||
|
2009-09-30 Corinna Vinschen <corinna@vinschen.de>
|
||||||
|
|
||||||
|
* new-features.sgml (ov-new1.7-file): Ctrl-X, not Ctrl-N.
|
||||||
|
* pathnames.sgml (pathnames-unusual): Ditto.
|
||||||
|
* setup2.sgml (setup-locale-ov): Change description according to
|
||||||
|
latest changes.
|
||||||
|
(setup-locale-how): Rewrite.
|
||||||
|
(setup-locale-console): Enable section again. Change to reflect
|
||||||
|
recent changes.
|
||||||
|
(setup-locale-problems): Change to reflect recent changes.
|
||||||
|
|
||||||
2009-09-26 Eric Blake <ebb9@byu.net>
|
2009-09-26 Eric Blake <ebb9@byu.net>
|
||||||
|
|
||||||
* new-features.sgml (ov-new1.7-file): Mention fexecve, execvpe.
|
* new-features.sgml (ov-new1.7-file): Mention fexecve, execvpe.
|
||||||
|
@ -22,7 +22,7 @@
|
|||||||
/etc/fstab.
|
/etc/fstab.
|
||||||
|
|
||||||
- If a filename cannot be represented in the current character set,
|
- If a filename cannot be represented in the current character set,
|
||||||
the character will be converted to a sequence Ctrl-N + UTF-8 representation
|
the character will be converted to a sequence Ctrl-X + UTF-8 representation
|
||||||
of the character. This allows to access all files, even those not
|
of the character. This allows to access all files, even those not
|
||||||
having a valid representation of their filename in the current character
|
having a valid representation of their filename in the current character
|
||||||
set (codepage). To always have a valid string, use the UTF-8 charset
|
set (codepage). To always have a valid string, use the UTF-8 charset
|
||||||
|
@ -424,14 +424,14 @@ reason, you will nevertheless be able to access the file. How does that
|
|||||||
work? When Cygwin converts the filename from UTF-16 to your character
|
work? When Cygwin converts the filename from UTF-16 to your character
|
||||||
set, it recognizes characters which can't be converted. If that occurs,
|
set, it recognizes characters which can't be converted. If that occurs,
|
||||||
Cygwin replaces the non-convertible character with a special character
|
Cygwin replaces the non-convertible character with a special character
|
||||||
sequence. The sequence starts with an ASCII SO character (hex code
|
sequence. The sequence starts with an ASCII CAN character (hex code
|
||||||
0x0e, equivalent Control-N), followed by the UTF-8 representation of the
|
0x18, equivalent Control-X), followed by the UTF-8 representation of the
|
||||||
character. The result is a filename containing some ugly looking
|
character. The result is a filename containing some ugly looking
|
||||||
characters. While it doesn't <emphasis>look</emphasis> nice, it
|
characters. While it doesn't <emphasis>look</emphasis> nice, it
|
||||||
<emphasis>is</emphasis> nice, because Cygwin knows how to convert this
|
<emphasis>is</emphasis> nice, because Cygwin knows how to convert this
|
||||||
filename back to UTF-16. The filename will be converted using your
|
filename back to UTF-16. The filename will be converted using your
|
||||||
usual character set. However, when Cygwin recognizes an ASCII SO
|
usual character set. However, when Cygwin recognizes an ASCII CAN
|
||||||
character, it skips over the ASCII SO and handles the following bytes as
|
character, it skips over the ASCII CAN and handles the following bytes as
|
||||||
a UTF-8 character. Thus, the filename is symmetrically converted back to
|
a UTF-8 character. Thus, the filename is symmetrically converted back to
|
||||||
UTF-16 and you can access the file.</para>
|
UTF-16 and you can access the file.</para>
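The escape scheme described in this hunk (ASCII CAN, 0x18, followed by the UTF-8 bytes of the character) can be sketched roughly as follows. This is an illustration only, not Cygwin's actual implementation: the function names `to_charset`, `from_charset`, and `utf8_length` are hypothetical, the sketch assumes a singlebyte target charset such as ISO-8859-1, and it does not handle filenames that already contain a literal 0x18 byte.

```python
CAN = 0x18  # ASCII CAN (Control-X), the escape byte named in the text


def to_charset(name: str, charset: str) -> bytes:
    """Encode a filename; unrepresentable characters become CAN + UTF-8."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Character has no representation in the target charset:
            # emit the escape byte, then the character's UTF-8 bytes.
            out.append(CAN)
            out += ch.encode("utf-8")
    return bytes(out)


def utf8_length(lead: int) -> int:
    """Return the length of a UTF-8 sequence given its lead byte."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


def from_charset(data: bytes, charset: str) -> str:
    """Reverse to_charset (hypothetical helper, singlebyte charsets only)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == CAN and i + 1 < len(data):
            # Skip the CAN byte and treat what follows as one UTF-8 character.
            n = utf8_length(data[i + 1])
            out.append(data[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            out.append(data[i:i + 1].decode(charset))
            i += 1
    return "".join(out)


name = "caf\u00e9_\u65e5\u672c"
encoded = to_charset(name, "iso-8859-1")
decoded = from_charset(encoded, "iso-8859-1")
```

Here `\u00e9` is representable in ISO-8859-1 and is encoded directly, while the two CJK characters are each written as CAN plus three UTF-8 bytes, so the round trip recovers the original name.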
|
||||||
|
|
||||||
|
@ -170,11 +170,37 @@ manual pages on the homepage of the
|
|||||||
</screen>
|
</screen>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
And let's not forget the default locale called "C" or "POSIX"
|
At application startup, the application's locale is set to the default
|
||||||
which basically only supports plain ASCII code. If the aforementioned
|
"C" or "POSIX" locale. Under Cygwin, this locale defaults to the UTF-8
|
||||||
environment variables are not set, or set to "C" or "POSIX", you get the
|
character set. If you want to stick to the "C" locale and only change to
|
||||||
default ASCII-only behaviour.
|
another charset, you can define this by setting one of the locale environment
|
||||||
</para>
|
variables to "C.charset". For instance</para>
|
||||||
|
|
||||||
|
<screen>
|
||||||
|
"C.ISO-8859-1"
|
||||||
|
</screen>
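The shape of a locale value such as "C.charset" or "language_TERRITORY.charset@modifier" can be illustrated with a small parser. This is a minimal sketch for illustration, not how Cygwin parses the variables: the helper names `split_locale` and `default_charset` are hypothetical, and the default rule only restates what the surrounding text says (UTF-8 for "C"/"POSIX", otherwise the system's ANSI codepage, which the sketch cannot know and reports as `None`).

```python
def split_locale(value: str) -> dict:
    """Split a POSIX-style locale value into its four parts."""
    rest, _, modifier = value.partition("@")
    rest, _, charset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return {
        "language": language,
        "territory": territory,
        "charset": charset,
        "modifier": modifier,
    }


def default_charset(value: str):
    """Charset implied by a locale value, following the text above.

    Returns the explicit charset if one is given, "UTF-8" for a bare
    "C" or "POSIX" locale, and None where the text says the Windows
    ANSI codepage applies (system dependent, so not guessed here).
    """
    parts = split_locale(value)
    if parts["charset"]:
        return parts["charset"]
    if parts["language"] in ("C", "POSIX"):
        return "UTF-8"
    return None
```

For example, `split_locale("es_MX.UTF-8")` yields language "es", territory "MX", and charset "UTF-8", while `default_charset("ja_JP")` returns `None` because the charset then depends on the system's ANSI codepage.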
|
||||||
|
|
||||||
|
<para>Windows uses the UTF-16 charset exclusively to store the names
|
||||||
|
of any object used by the Operating System. This is especially important
|
||||||
|
with filenames. Cygwin uses the setting of the locale environment variables
|
||||||
|
<envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, and <envar>LANG</envar>, to
|
||||||
|
determine how to convert Windows filenames from their UTF-16 representation
|
||||||
|
to the singlebyte or multibyte character set used by Cygwin. Setting
|
||||||
|
the environment variables to another value changes the way filenames are
|
||||||
|
converted in subsequently started programs.</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
However, even if one of the locale environment variables is set to
|
||||||
|
some other value than "C", this does <emphasis>only</emphasis> affect
|
||||||
|
how Cygwin itself converts filenames. As the POSIX standard requires,
|
||||||
|
it's the application's responsibility to activate that locale for its
|
||||||
|
own purpose, typically by using the call</para>
|
||||||
|
|
||||||
|
<screen>
|
||||||
|
setlocale (LC_ALL, "");
|
||||||
|
</screen>
|
||||||
|
|
||||||
|
<para>early in the application code.</para>
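The documentation shows the C call; the same activation step exists in other runtimes. As an illustration only, here is the equivalent in Python, whose standard `locale.setlocale` takes the same empty-string argument to pick up `LC_ALL`, `LC_CTYPE`, and `LANG` from the environment. The wrapper name `activate_environment_locale` and the fallback to "C" are assumptions made for this sketch, not something the commit describes.

```python
import locale


def activate_environment_locale() -> str:
    """Activate the locale named by the environment, early at startup.

    Falls back to the "C" locale if the environment names a locale the
    system does not support (the fallback is this sketch's own choice).
    """
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return locale.setlocale(locale.LC_ALL, "C")


active = activate_environment_locale()
```

Until a call like this is made, the program keeps running in its startup locale regardless of what the environment variables say.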
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
Right now the language and territory, as well as the modifier, are not
|
Right now the language and territory, as well as the modifier, are not
|
||||||
@ -187,7 +213,7 @@ these characters have a width of 2. Kind of explains why they are
|
|||||||
called "ambiguous"...</para>
|
called "ambiguous"...</para>
|
||||||
|
|
||||||
<para>
|
<para>
|
||||||
The problem has been fixed for now like this. wcwidth/wcswidth usually
|
The problem has been fixed like this. wcwidth/wcswidth usually
|
||||||
return 1 as the width of these characters. However, if the language is
|
return 1 as the width of these characters. However, if the language is
|
||||||
specified as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth
|
specified as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth
|
||||||
returns 2 for these characters. Unfortunately this isn't correct in
|
returns 2 for these characters. Unfortunately this isn't correct in
|
||||||
@ -197,6 +223,7 @@ ambiguous width characters to return 1 even in those languages.</para>
|
|||||||
|
|
||||||
<para>
|
<para>
|
||||||
Other than that, the only important part so far is the character set.
|
Other than that, the only important part so far is the character set.
|
||||||
|
|
||||||
How does that work?</para>
|
How does that work?</para>
|
||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
@ -206,31 +233,18 @@ How does that work?</para>
|
|||||||
<itemizedlist mark="bullet">
|
<itemizedlist mark="bullet">
|
||||||
|
|
||||||
<listitem><para>
|
<listitem><para>
|
||||||
The default locale is the "C" or "POSIX" locale. In this locale, basically
|
The default locale is the "C" or "POSIX" locale. Under Cygwin this locale
|
||||||
only ASCII characters are supported. Even if one of the aforementioned
|
defaults to the UTF-8 character set.</para>
|
||||||
environment variables are set to something else, it's the application's
|
</listitem>
|
||||||
responsibility to call the function <function>setlocale</function>,
|
|
||||||
typically like this</para>
|
|
||||||
|
|
||||||
<screen>
|
|
||||||
setlocale (LC_ALL, "");
|
|
||||||
</screen>
|
|
||||||
|
|
||||||
<para>to switch to another locale according to the settings of the
|
|
||||||
internationalization environment variables.
|
|
||||||
</para></listitem>
|
|
||||||
|
|
||||||
<listitem><para>
|
<listitem><para>
|
||||||
Assume that you've set one of the aforementioned environment variables to some
|
Assume that you've set one of the aforementioned environment variables to some
|
||||||
valid POSIX locale value, other than "C" and "POSIX", and assume that you
|
valid POSIX locale value, other than "C" and "POSIX". Assume further that
|
||||||
call an application which calls <function>setlocale</function> as above.</para>
|
you're living in Japan. You might want to use the language code "ja" and the
|
||||||
|
territory "JP", thus setting, say, <envar>LANG</envar> to "ja_JP". You didn't
|
||||||
<para>Assume further that you're living in Japan. You might want to use
|
set a character set, so what will Cygwin use now? Easy! It will use the
|
||||||
the language code "ja" and the territory "JP", thus setting, say,
|
default Windows ANSI codepage of your system, if it's supported by Cygwin.
|
||||||
<envar>LANG</envar> to "ja_JP". You didn't set a character set, so
|
Hopefully Cygwin supports all relevant default ANSI codepages...</para>
|
||||||
what will Cygwin use now? Easy! It will use the default Windows ANSI
|
|
||||||
codepage of your system, if it's supported by Cygwin. Hopefully Cygwin
|
|
||||||
supports all relevant default ANSI codepages...</para>
|
|
||||||
|
|
||||||
<note><para>For a list of supported character sets, see
|
<note><para>For a list of supported character sets, see
|
||||||
<xref linkend="setup-locale-charsetlist"></xref>
|
<xref linkend="setup-locale-charsetlist"></xref>
|
||||||
@ -240,10 +254,10 @@ supports all relevant default ANSI codepages...</para>
|
|||||||
<listitem><para>
|
<listitem><para>
|
||||||
You don't want to use the default Windows codepage as character set?
|
You don't want to use the default Windows codepage as character set?
|
||||||
In that case you have to specify the charset explicitly. For instance,
|
In that case you have to specify the charset explicitly. For instance,
|
||||||
assume you're from Italy and don't want to use the default Windows codepage
|
assume you're from Italy and don't want to use the Italian default Windows
|
||||||
1252, but the more portable ISO-8859-15 character set. What you can do is
|
ANSI codepage 1252, but the more portable ISO-8859-15 character set.
|
||||||
to set the <envar>LANG</envar> variable in the
|
What you can do, for instance, is to set the <envar>LANG</envar> variable
|
||||||
<filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
|
in the <filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
|
||||||
to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
|
to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
|
||||||
|
|
||||||
<screen>
|
<screen>
|
||||||
@ -257,14 +271,16 @@ to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
|
|||||||
</listitem>
|
</listitem>
|
||||||
|
|
||||||
<listitem><para>
|
<listitem><para>
|
||||||
Most singlebyte or doublebyte charsets have a disadvantage. Windows
|
Last, but not least, most singlebyte or doublebyte charsets have a big
|
||||||
filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters
|
disadvantage. Windows filesystems use the Unicode character set in the
|
||||||
|
UTF-16 encoding to store filename information. Not all characters
|
||||||
from the Unicode character set are available in a singlebyte or doublebyte
|
from the Unicode character set are available in a singlebyte or doublebyte
|
||||||
charset. While Cygwin has a workaround to access files with unusual
|
charset. While Cygwin has a workaround to access files with unusual
|
||||||
characters (see <xref linkend="pathnames-unusual"></xref>), a better
|
characters (see <xref linkend="pathnames-unusual"></xref>), a better
|
||||||
workaround is to use always the UTF-8 character set. UTF-8 is the only
|
workaround is to use always the UTF-8 character set.</para>
|
||||||
multibyte character set which can represent <emphasis>every</emphasis>
|
|
||||||
Unicode character.</para>
|
<para><emphasis>UTF-8 is the only multibyte character set which can represent
|
||||||
|
every Unicode character.</emphasis></para>
|
||||||
|
|
||||||
<screen>
|
<screen>
|
||||||
set LANG=es_MX.UTF-8
|
set LANG=es_MX.UTF-8
|
||||||
@ -278,7 +294,6 @@ Unicode character.</para>
|
|||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
|
|
||||||
<!-- TODO: This is not correct anymore.
|
|
||||||
<sect2 id="setup-locale-console"><title>The Windows Console character set</title>
|
<sect2 id="setup-locale-console"><title>The Windows Console character set</title>
|
||||||
|
|
||||||
<para>Most of the time the Windows console is used to run Cygwin applications.
|
<para>Most of the time the Windows console is used to run Cygwin applications.
|
||||||
@ -287,7 +302,7 @@ While terminal emulations like <command>xterm</command> or
|
|||||||
used for in- and output, the Windows console hasn't such a way, since it's
|
used for in- and output, the Windows console hasn't such a way, since it's
|
||||||
not an application in its own right.</para>
|
not an application in its own right.</para>
|
||||||
|
|
||||||
<para>This problem is solved in Cygwin as follows. When the first Cygwin
|
<para>This problem is solved in Cygwin as follows. When a Cygwin
|
||||||
process is started in a Windows console (either explicitly from cmd.exe,
|
process is started in a Windows console (either explicitly from cmd.exe,
|
||||||
or implicitly by, for instance, clicking on the Cygwin desktop icon, or
|
or implicitly by, for instance, clicking on the Cygwin desktop icon, or
|
||||||
running the Cygwin.bat file), the Console character set is determined by the
|
running the Cygwin.bat file), the Console character set is determined by the
|
||||||
@ -295,27 +310,18 @@ setting of the aforementioned internationalization environment variables,
|
|||||||
the same way as described in <xref linkend="setup-locale-how"></xref>.
|
the same way as described in <xref linkend="setup-locale-how"></xref>.
|
||||||
</para>
|
</para>
|
||||||
|
|
||||||
<para>However, in contrast to the application's character set, which is
|
<para>What is that good for? Why not switch the console character set with
|
||||||
determined by the <function>setlocale</function> call, the console
|
the application's requirements? After all, the application knows if it uses
|
||||||
character set stays fixed for all subsequent Cygwin processes started
|
localization or not. However, what if a non-localized application calls
|
||||||
from this first Cygwin process in the console. So, for instance, if
|
a remote application which itself is localized? This can happen with
|
||||||
<envar>LANG</envar> was set to "en_US.UTF-8" when the first Cygwin process
|
<command>ssh</command> or <command>rlogin</command>. Both commands don't
|
||||||
started, the console is a UTF-8 terminal for the entire Cygwin process
|
have and don't need localization and they never call
|
||||||
tree started from this first Cygwin process.</para>
|
<function>setlocale</function>. Setting one of the internationalization
|
||||||
|
environment variables to the same charset as the remote machine before
|
||||||
<para>You're asking "What is that good for? Why not switch the console
|
starting <command>ssh</command> or <command>rlogin</command> fixes that
|
||||||
character set with the applications requirements? After all, the
|
problem.</para>
|
||||||
application knows if it uses localization or not." That's true, but
|
|
||||||
what if the non-localized application calls a remote application which
|
|
||||||
itself is localized? This can happen with <command>ssh</command> or
|
|
||||||
<command>rlogin</command>. Both commands don't have and don't need
|
|
||||||
localization and they never call <function>setlocale</function>. This
|
|
||||||
would have the unfortunate effect, that the console would run with the
|
|
||||||
ASCII character set alone. Native characters printed from the remote
|
|
||||||
application would not show up correctly on your local console.</para>
|
|
||||||
|
|
||||||
</sect2>
|
</sect2>
|
||||||
-->
|
|
||||||
|
|
||||||
<sect2 id="setup-locale-problems"><title>Potential Problems when using Locales</title>
|
<sect2 id="setup-locale-problems"><title>Potential Problems when using Locales</title>
|
||||||
|
|
||||||
@ -330,22 +336,17 @@ set, and yet another. In bash for instance:</para>
|
|||||||
</screen>
|
</screen>
|
||||||
|
|
||||||
<para>However, here's a problem. At the start of the first Cygwin process
|
<para>However, here's a problem. At the start of the first Cygwin process
|
||||||
in a session, the Windows environment has to be converted from UTF-16 to
|
in a session, the Windows environment is converted from UTF-16 to UTF-8.
|
||||||
some singlebyte or multibyte charset. If the internationalization environment
|
The environment is another of the system objects stored in UTF-16 in
|
||||||
variable hasn't been set <emphasis>before</emphasis> starting this process,
|
Windows.</para>
|
||||||
Cygwin has to make an educated guess which charset to use to convert
|
|
||||||
the environment itself. The only reproducible way to do that in the absence
|
|
||||||
of <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, or <envar>LANG</envar>,
|
|
||||||
is to use the "C" locale. The default conversion in the "C" locale
|
|
||||||
used by Cygwin internally is UTF-8. So, in the absence of any
|
|
||||||
internationalization environment variable, the environment will be converted
|
|
||||||
to UTF-8.</para>
|
|
||||||
|
|
||||||
<para>As long as the environment only contains ASCII characters, this is
|
<para>As long as the environment only contains ASCII characters, this is
|
||||||
no problem at all. But if it contains native characters, and you're planning
|
no problem at all. But if it contains native characters, and you're planning
|
||||||
to use, say, GBK, the environment will result in invalid characters in
|
to use, say, GBK, the environment will result in invalid characters in
|
||||||
the GBK charset. This would be especially a problem in variables like
|
the GBK charset. This would be especially a problem in variables like
|
||||||
<envar>PATH</envar>.</para>
|
<envar>PATH</envar>. To circumvent the worst problems, Cygwin converts
|
||||||
|
the <envar>PATH</envar> environment variable to the charset set in the
|
||||||
|
environment, if it's different from the UTF-8 charset.</para>
|
||||||
|
|
||||||
<note><para>Per POSIX, the name of an environment variable should only
|
<note><para>Per POSIX, the name of an environment variable should only
|
||||||
consist of valid ASCII characters, and only of uppercase letters, digits, and
|
consist of valid ASCII characters, and only of uppercase letters, digits, and
|
||||||
|