newlib/winsup/doc/setup2.sgml
Corinna Vinschen 0b8e38dd8b * setup2.sgml (setup-locale): Mention three character codes per
ISO 639-3.

	* setup2.sgml (setup-locale): Adapt description to the C using ASCII
	change in 1.7.2.
2010-01-17 14:55:57 +00:00

<sect1 id="setup-env"><title>Environment Variables</title>
<para>
You may wish to specify settings of several important environment
variables that affect Cygwin's operation. Some of these settings need
to be in effect prior to launching the initial Cygwin session (before
starting your bash shell, for instance). They should therefore be set
in the Windows environment; all Windows environment variables are
imported when Cygwin starts. Such settings can be
placed in a .bat file. An initial file is named Cygwin.bat and is created
in the Cygwin root directory that you specified during setup. Note that
the "Cygwin" option of the Start Menu points to Cygwin.bat. Edit
Cygwin.bat to your liking or create your own .bat files to start
Cygwin processes.</para>
<para>
The <envar>CYGWIN</envar> variable is used to configure many global
settings for the Cygwin runtime system. Initially you can leave
<envar>CYGWIN</envar> unset or set it to <literal>tty</literal> (e.g.
to support job control with ^Z etc...) using a syntax like this in the
DOS shell, before launching bash.</para>
<screen>
<prompt>C:\&gt;</prompt> <userinput>set CYGWIN=tty notitle glob</userinput>
</screen>
<para>
Locale support is controlled by the <envar>LANG</envar> and
<envar>LC_xxx</envar> environment variables. You can set all of them
but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
to the POSIX standard. The first one that is set takes precedence. For a more
detailed description see <xref linkend="setup-locale"></xref>.
</para>
<para>
The <envar>PATH</envar> environment variable is used by Cygwin
applications as a list of directories to search for executable files
to run. This environment variable is converted from Windows format
(e.g. <filename>C:\Windows\system32;C:\Windows</filename>) to UNIX format
(e.g., <filename>/cygdrive/c/Windows/system32:/cygdrive/c/Windows</filename>)
when a Cygwin process first starts.
Set it so that it contains at least the <filename>x:\cygwin\bin</filename>
directory, where <filename>x:\cygwin</filename> is the "root" of your
Cygwin installation, if you wish to use Cygwin tools outside of bash.
This is usually done by the batch file you're starting your shell with.
</para>
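<para>
As a rough illustration of this conversion, the <command>cygpath</command>
utility can translate a Windows-style path list by hand. The sketch below
assumes the <literal>-u</literal> (UNIX format) and <literal>-p</literal>
(treat the argument as a path list) options; the directories are the same
illustrative ones used above, and the command should print a colon-separated
list of the form shown there.</para>
<screen>
<prompt>bash$</prompt> <userinput>cygpath -u -p 'C:\Windows\system32;C:\Windows'</userinput>
</screen>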
<para>
The <envar>HOME</envar> environment variable is used by many programs to
determine the location of your home directory and we recommend that it be
defined. This environment variable is also converted from Windows format
when a Cygwin process first starts. It's usually set in the shell
profile scripts in the /etc directory.
</para>
<para>
The <envar>TERM</envar> environment variable specifies your terminal
type. It is automatically set to <literal>cygwin</literal> if you have
not set it to something else.
</para>
<para>The <envar>LD_LIBRARY_PATH</envar> environment variable is used by
the Cygwin function <function>dlopen ()</function> as a list of
directories to search for .dll files to load. This environment variable
is converted from Windows format to UNIX format when a Cygwin process
first starts. Most Cygwin applications do not make use of the
<function>dlopen ()</function> call and do not need this variable.
</para>
<para>
In addition to <envar>PATH</envar>, <envar>HOME</envar>,
and <envar>LD_LIBRARY_PATH</envar>, there are three other environment
variables which, if they exist in the Windows environment, are
converted to UNIX format: <envar>TMPDIR</envar>, <envar>TMP</envar>,
and <envar>TEMP</envar>. The first is not set by default in the
Windows environment but the other two are, and they point to the
default Windows temporary directory. If set, these variables will be
used by some Cygwin applications, possibly with unexpected results.
You may therefore want to unset them by adding the following two lines
to your <filename>~/.bashrc</filename> file:
<screen>
unset TMP
unset TEMP
</screen>
This is done in the default <filename>~/.bashrc</filename> file.
Alternatively, you could set <envar>TMP</envar>
and <envar>TEMP</envar> to point to <filename>/tmp</filename> or to
any other temporary directory of your choice. For example:
<screen>
export TMP=/tmp
export TEMP=/tmp
</screen>
</para>
</sect1>
<sect1 id="setup-maxmem"><title>Changing Cygwin's Maximum Memory</title>
<para>
Cygwin's heap is extensible. However, it does start out at a fixed size
and attempts to extend it may run into memory which has been previously
allocated by Windows. In some cases, this problem can be solved by
adding an entry in either the <literal>HKEY_LOCAL_MACHINE</literal>
(to change the limit for all users) or
<literal>HKEY_CURRENT_USER</literal> (for just the current user) section
of the registry. </para>
<para>
Add the <literal>DWORD</literal> value <literal>heap_chunk_in_mb</literal>
and set it to the desired memory limit in decimal MB. It is preferred to do
this in Cygwin using the <command>regtool</command> program included in the
Cygwin package.
(For more information about <command>regtool</command> or the other Cygwin
utilities, see <xref linkend="using-utils"></xref> or use the
<literal>--help</literal> option of each util.) You should always be careful
when using <command>regtool</command> since damaging your system registry can
result in an unusable system. This example sets the memory limit to 1024 MB:
<screen>
regtool -i set /HKLM/Software/Cygwin/heap_chunk_in_mb 1024
regtool -v list /HKLM/Software/Cygwin
</screen>
</para>
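<para>
As a sketch of the per-user variant mentioned above, the same value can be
written under <literal>HKEY_CURRENT_USER</literal> instead. This assumes that
the <literal>/HKCU</literal> prefix and the <literal>get</literal> action of
<command>regtool</command> are available in your installation; the 1024 MB
figure is only an illustration. If the <literal>Software/Cygwin</literal> key
does not exist yet for the current user, it may have to be created first with
<command>regtool add</command>.
<screen>
# Set the limit for the current user only
regtool -i set /HKCU/Software/Cygwin/heap_chunk_in_mb 1024
# Read the value back to check it
regtool get /HKCU/Software/Cygwin/heap_chunk_in_mb
</screen>
</para>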
<para>
Exit all running Cygwin processes and restart them. Memory can be allocated up
to the size of the system swap space minus the size of any running
processes. The system swap should be at least as large as the physically
installed RAM and can be modified under the System category of the
Control Panel.
</para>
<para>
Here is a small program written by DJ Delorie that tests the
memory allocation limit on your system:
<screen>
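#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

int main(void)
{
  /* Start with a 1 GB request and halve the request size each round. */
  unsigned int bit = 0x40000000, sum = 0;
  char *x;

  while (bit > 4096)
    {
      /* Blocks are deliberately never freed, so sum is the total obtained. */
      x = malloc(bit);
      if (x)
        sum += bit;
      bit >>= 1;
    }
  printf("%08x bytes (%.1fMB)\n", sum, sum/1024.0/1024.0);
  return 0;
}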
</screen>
You can compile this program using:
<screen>
gcc max_memory.c -o max_memory.exe
</screen>
Run the program and it will output the maximum amount of allocatable memory.
</para>
</sect1>
<sect1 id="setup-locale"><title>Internationalization</title>
<sect2 id="setup-locale-ov"><title>Overview</title>
<para>
Internationalization support is controlled by the <envar>LANG</envar> and
<envar>LC_xxx</envar> environment variables. You can set all of them
but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
to the POSIX standard. The content of these variables should follow the
POSIX standard for a locale specifier. The correct form of a locale
specifier is</para>
<screen>
language[[_TERRITORY][.charset][@modifier]]
</screen>
<para>"language" is a lowercase two character string per ISO 639-1, or,
if there is no ISO 639-1 code for the language (for instance, "Lower Sorbian"),
a three character string per ISO 639-3.</para>
<para>"TERRITORY" is an uppercase two character string per ISO 3166, charset is
one of a list of supported character sets, and the modifier doesn't matter
here (though it might for some applications). If you're interested in the
exact description, you can find it in the online publication of the POSIX
manual pages on the homepage of the
<ulink url="http://www.opengroup.org/">Open Group</ulink>.</para>
<para>Typical locale specifiers are</para>
<screen>
"de_CH"        language = German,  territory = Switzerland,  default charset
"fr_FR.UTF-8"  language = French,  territory = France,       charset = UTF-8
"ko_KR.eucKR"  language = Korean,  territory = South Korea,  charset = eucKR
"syr_SY"       language = Syriac,  territory = Syria,        default charset
</screen>
<para>
At application startup, the application's locale is set to the default
"C" or "POSIX" locale. Under Cygwin 1.7.2 and later, this locale defaults
to the ASCII character set on the application level. If you want to stick
to the "C" locale and only change to another charset, you can define this
by setting one of the locale environment variables to "C.charset". For
instance</para>
<screen>
"C.ISO-8859-1"
</screen>
<note><para>The default locale in the absence of the aforementioned locale
environment variables is "C.UTF-8".</para></note>
<para>Windows uses the UTF-16 charset exclusively to store the names
of any object used by the Operating System. This is especially important
with filenames. Cygwin uses the setting of the locale environment variables
<envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, and <envar>LANG</envar>, to
determine how to convert Windows filenames from their UTF-16 representation
to the singlebyte or multibyte character set used by Cygwin.</para>
<para>
The setting of the locale environment variables at process startup
is effective for Cygwin's internal conversions to and from the Windows UTF-16
object names for the entire lifetime of the current process. Changing
the environment variables to another value changes the way filenames are
converted in subsequently started child processes, but not within the same
process.</para>
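<para>
A minimal sketch of what this means in a bash session follows. The locale
value is only an illustration. An assignment placed in front of a command
applies to that one child process, while an exported variable applies to every
child process started afterwards; the running shell itself keeps the filename
conversion it started with.</para>
<screen>
# Only this ls process is started with the Japanese UTF-8 locale
LANG=ja_JP.UTF-8 ls

# Every child process started from now on inherits the new setting
export LANG=ja_JP.UTF-8
ls
</screen>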
<para>
However, even if one of the locale environment variables is set to
some other value than "C", this <emphasis>only</emphasis> affects
how Cygwin itself converts filenames. As the POSIX standard requires,
it's the application's responsibility to activate that locale for its
own purposes, typically by using the call</para>
<screen>
setlocale (LC_ALL, "");
</screen>
<para>early in the application code. Again, so that this doesn't get
lost: If the application calls setlocale as above, and none
of the important locale variables is set in the environment, the locale
is set to the default locale, which is "C.UTF-8".</para>
<para>But what about applications which are not locale-aware? Per POSIX,
they are running in the "C" or "POSIX" locale, which implies the ASCII
charset. The Cygwin DLL itself, however, will nevertheless use the locale
set in the environment (or the "C.UTF-8" default locale) for converting
filenames etc.</para>
<para>When the locale set in the environment specifies an ASCII charset,
for example "C" or "en_US.ASCII", Cygwin will still use UTF-8
under the hood to translate filenames. This allows for easier
interoperability with applications running in the default "C.UTF-8" locale.
</para>
<para>
Right now the language and territory, as well as the modifier, are not
important to Cygwin, except to fix a single problem. There's a class of
characters in the Unicode character set, called the "CJK Ambiguous Width
Character set". For these characters the width returned by the
wcwidth/wcswidth function is usually 1. This is often a problem in
East-Asian languages, which historically use character sets in which
these characters have a width of 2. Kind of explains why they are
called "ambiguous"...</para>
<para>
The problem has been fixed like this. wcwidth/wcswidth usually
return 1 as the width of these characters. However, if the language is
specified as "ja" (Japanese), "ko" (Korean), or "zh" (Chinese), wcwidth
returns 2 for these characters. Unfortunately this isn't correct in
all circumstances, so the user can specify the modifier "@cjknarrow",
which modifies the behaviour of wcwidth/wcswidth to return 1 for the
ambiguous width characters even in those languages.</para>
<para>
Other than that, the only important part so far is the character set.
How does that work?</para>
</sect2>
<sect2 id="setup-locale-how"><title>How to set the locale</title>
<itemizedlist mark="bullet">
<listitem><para>
Assume that you've set one of the aforementioned environment variables to some
valid POSIX locale value, other than "C" and "POSIX". Assume further that
you're living in Japan. You might want to use the language code "ja" and the
territory "JP", thus setting, say, <envar>LANG</envar> to "ja_JP". You didn't
set a character set, so what will Cygwin use now? Easy! It will use the
default Windows ANSI codepage of your system, if it's supported by Cygwin.
Hopefully Cygwin supports all relevant default ANSI codepages...</para>
<note><para>For a list of supported character sets, see
<xref linkend="setup-locale-charsetlist"></xref>
</para></note>
</listitem>
<listitem><para>
You don't want to use the default Windows codepage as character set?
In that case you have to specify the charset explicitly. For instance,
assume you're from Italy and don't want to use the Italian default Windows
ANSI codepage 1252, but the more portable ISO-8859-15 character set.
What you can do, for instance, is to set the <envar>LANG</envar> variable
in the <filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
<screen>
@echo off
C:
chdir C:\cygwin\bin
set LANG=it_IT.ISO-8859-15
bash --login -i
</screen>
</listitem>
<listitem><para>
Last, but not least, most singlebyte or doublebyte charsets have a big
disadvantage. Windows filesystems use the Unicode character set in the
UTF-16 encoding to store filename information. Not all characters
from the Unicode character set are available in a singlebyte or doublebyte
charset. While Cygwin has a workaround to access files with unusual
characters (see <xref linkend="pathnames-unusual"></xref>), a better
workaround is to always use the UTF-8 character set.</para>
<para><emphasis>UTF-8 is the only multibyte character set which can represent
every Unicode character.</emphasis></para>
<screen>
set LANG=es_MX.UTF-8
</screen>
<para>For a description of the Unicode standard, see the homepage of the
<ulink url="http://www.unicode.org/">Unicode Consortium</ulink>.
</para></listitem>
</itemizedlist>
</sect2>
<sect2 id="setup-locale-console"><title>The Windows Console character set</title>
<para>Most of the time the Windows console is used to run Cygwin applications.
While terminal emulations like <command>xterm</command> or
<command>mintty</command> have a distinct way to set the character set
used for in- and output, the Windows console has no such way, since it's
not an application in its own right.</para>
<para>This problem is solved in Cygwin as follows. When a Cygwin
process is started in a Windows console (either explicitly from cmd.exe,
or implicitly by, for instance, clicking on the Cygwin desktop icon, or
running the Cygwin.bat file), the Console character set is determined by the
setting of the aforementioned internationalization environment variables,
the same way as described in <xref linkend="setup-locale-how"></xref>.
</para>
<para>What is that good for? Why not switch the console character set with
the application's requirements? After all, the application knows if it uses
localization or not. However, what if a non-localized application calls
a remote application which itself is localized? This can happen with
<command>ssh</command> or <command>rlogin</command>. Neither command
has or needs localization, and neither ever calls
<function>setlocale</function>. Setting one of the internationalization
environment variables to the same charset as the remote machine before
starting <command>ssh</command> or <command>rlogin</command> fixes that
problem.</para>
</sect2>
<sect2 id="setup-locale-problems"><title>Potential Problems when using Locales</title>
<para>
You can set the above internationalization variables not only in
<filename>Cygwin.bat</filename> or in the Windows environment, but also
in your Cygwin shell on the fly, even switch to yet another character
set, and yet another. In bash for instance:</para>
<screen>
<prompt>bash$</prompt> export LC_CTYPE="nl_BE.UTF-8"
</screen>
<para>However, here's a problem. At the start of the first Cygwin process
in a session, the Windows environment is converted from UTF-16 to UTF-8.
The environment is another of the system objects stored in UTF-16 in
Windows.</para>
<para>As long as the environment only contains ASCII characters, this is
no problem at all. But if it contains native characters, and you're planning
to use, say, GBK, the environment will result in invalid characters in
the GBK charset. This would be especially a problem in variables like
<envar>PATH</envar>. To circumvent the worst problems, Cygwin converts
the <envar>PATH</envar> environment variable to the charset set in the
environment, if it's different from the UTF-8 charset.</para>
<note><para>Per POSIX, the name of an environment variable should only
consist of valid ASCII characters, and only of uppercase letters, digits, and
the underscore for maximum portability.</para></note>
<para>Symbolic links, too, may pose a problem when switching charsets on
the fly. A symbolic link contains the filename of the target file the
symlink points to. When a symlink was created with older versions
of Cygwin, the current ANSI or OEM character set was used to store
the target filename, dependent on the old <envar>CYGWIN</envar>
environment variable setting <envar>codepage</envar> (see <xref
linkend="cygwinenv-removed-options"></xref>). If the target filename
contains non-ASCII characters and you use another character set than
your default ANSI/OEM charset, the target filename of the symlink is now
potentially an invalid character sequence in the new character set.
This behaviour is not different from the behaviour in other Operating
Systems. So, if you suddenly can't access a symlink anymore which
worked all these years before, maybe it's because you switched to
another character set. This doesn't occur with symlinks created with
Cygwin 1.7 or later. </para>
<para>Another problem you might encounter is that older versions of
Windows did not install all charsets by default. If you are running
Windows XP or older, you can open the "Regional and Language Options"
portion of the Control Panel, select the "Advanced" tab, and select
entries from the "Code page conversion tables" list. The following
entries are useful to Cygwin: 932/SJIS, 936/GBK, 949/EUC-KR, 950/Big5,
20932/EUC-JP.</para>
</sect2>
<sect2 id="setup-locale-missing"><title>What does not work?</title>
<para>
Except for <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>,
and <envar>LANG</envar>, all other LC_xxx environment variables,
<envar>LC_COLLATE</envar>, <envar>LC_MESSAGES</envar>,
<envar>LC_MONETARY</envar>, <envar>LC_NUMERIC</envar>,
and <envar>LC_TIME</envar>, are ignored right now. This means, while Cygwin
supports different character sets, it does <emphasis>not</emphasis> support
real localization so far. There's no support for locale-specific monetary
symbols, no support for a decimal point other than '.', no support for native time
formats, and no support for native language sorting orders.
</para>
<para>Cygwin's internationalization support is work in progress and we would
be glad for coding help in this area.</para>
</sect2>
<sect2 id="setup-locale-charsetlist"><title>List of supported character sets</title>
<para>Last but not least, here's the list of currently supported character
sets. The left-hand expression is the name of the charset, as you would use
it in the internationalization environment variables as outlined above.
Note that charset specifiers are case-insensitive. <literal>EUCJP</literal>
is equivalent to <literal>eucJP</literal> or <literal>eUcJp</literal>.
Writing the charset in the exact case as given in the list below is a
good convention, though.
</para>
<para>The right-hand side is the number of the equivalent Windows
codepage as well as the Windows name of the codepage. They are only
noted here for reference. Don't try to use the bare codepage number or
the Windows name of the codepage as charset in locale specifiers, unless
they happen to be identical with the left-hand side. Especially in case
of the "CPxxx" style charsets, always use them with the leading "CP".</para>
<para>This works:</para>
<screen>
set LC_ALL=en_US.CP437
</screen>
<para>This does <emphasis>not</emphasis> work:</para>
<screen>
set LC_ALL=en_US.437
</screen>
<para>You can find a full list of Windows codepages on the Microsoft MSDN page
<ulink url="http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx">Code Page Identifiers</ulink>.</para>
<screen>
Charset         Codepage

CP437           437       (OEM United States)
CP720           720       (DOS Arabic)
CP737           737       (OEM Greek)
CP775           775       (OEM Baltic)
CP850           850       (OEM Latin 1, Western European)
CP852           852       (OEM Latin 2, Central European)
CP855           855       (OEM Cyrillic)
CP857           857       (OEM Turkish)
CP858           858       (OEM Latin 1 + Euro Symbol)
CP862           862       (OEM Hebrew)
CP866           866       (OEM Russian)
CP874           874       (ANSI/OEM Thai)
CP1125          1125      (OEM Ukraine)
CP1250          1250      (ANSI Central European)
CP1251          1251      (ANSI Cyrillic)
CP1252          1252      (ANSI Latin 1, Western European)
CP1253          1253      (ANSI Greek)
CP1254          1254      (ANSI Turkish)
CP1255          1255      (ANSI Hebrew)
CP1256          1256      (ANSI Arabic)
CP1257          1257      (ANSI Baltic)
CP1258          1258      (ANSI/OEM Vietnamese)

ISO-8859-1      28591     (ISO-8859-1)
ISO-8859-2      28592     (ISO-8859-2)
ISO-8859-3      28593     (ISO-8859-3)
ISO-8859-4      28594     (ISO-8859-4)
ISO-8859-5      28595     (ISO-8859-5)
ISO-8859-6      28596     (ISO-8859-6)
ISO-8859-7      28597     (ISO-8859-7)
ISO-8859-8      28598     (ISO-8859-8)
ISO-8859-9      28599     (ISO-8859-9)
ISO-8859-10     -         (not available)
ISO-8859-11     -         (not available)
ISO-8859-13     28603     (ISO-8859-13)
ISO-8859-14     -         (not available)
ISO-8859-15     28605     (ISO-8859-15)
ISO-8859-16     -         (not available)

KOI8-R          20866     (KOI8-R Russian Cyrillic)
KOI8-U          21866     (KOI8-U Ukrainian Cyrillic)

SJIS            932       (ANSI/OEM Japanese)
GBK             936       (ANSI/OEM Simplified Chinese)
Big5            950       (ANSI/OEM Traditional Chinese)
eucJP           20932     (EUC Japanese)
eucKR           949       (EUC Korean)

UTF-8 or UTF8   65001     (UTF-8)
</screen>
</sect2>
</sect1>
<sect1 id="setup-files"><title>Customizing bash</title>
<para>
To set up bash so that cut and paste work properly, click on the
"Properties" button of the window, then on the "Misc" tab. Make sure
that "QuickEdit mode" and "Insert mode" are checked. These settings
will be remembered next time you run bash from that shortcut. Similarly
you can set the working directory inside the "Program" tab. The entry
"%HOME%" is valid, but requires that you set <envar>HOME</envar> in
the Windows environment.
</para>
<para>
Your home directory should contain three initialization files
that control the behavior of bash. They are
<filename>.profile</filename>, <filename>.bashrc</filename> and
<filename>.inputrc</filename>. The Cygwin base installation creates
stub files when you start bash for the first time.</para>
<para>
<filename>.profile</filename> (other names are also valid, see the bash man
page) contains bash commands. It is executed when bash is started as a login
shell, e.g. from the command <command>bash --login</command>.
This is a useful place to define and
export environment variables and bash functions that will be used by bash
and the programs invoked by bash. It is a good place to redefine
<envar>PATH</envar> if needed. We recommend adding a ":." to the end of
<envar>PATH</envar> to also search the current working directory (contrary
to DOS, the local directory is not searched by default). Also to avoid
delays you should either <command>unset</command> <envar>MAILCHECK</envar>
or define <envar>MAILPATH</envar> to point to your existing mail inbox.
</para>
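<para>
The recommendations above could be written in <filename>.profile</filename>
roughly as follows. This is only a sketch; adjust it to your own setup, for
example by setting <envar>MAILPATH</envar> instead of unsetting
<envar>MAILCHECK</envar> if you do want bash to check a mail inbox.
<screen>
# Also search the current working directory, after all other entries
export PATH="${PATH}:."
# Switch off bash's periodic mail check to avoid delays
unset MAILCHECK
</screen>
</para>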
<para>
<filename>.bashrc</filename> is similar to
<filename>.profile</filename> but is executed each time an interactive
bash shell is launched. It serves to define elements that are not
inherited through the environment, such as aliases. If you do not use
login shells, you may want to put the contents of
<filename>.profile</filename> as discussed above in this file
instead.
</para>
<para>
For example, adding
<screen>
shopt -s nocaseglob
</screen>
to <filename>.bashrc</filename> will allow bash to glob filenames in a
case-insensitive manner.
Note that <filename>.bashrc</filename> is not called automatically for login
shells. You can source it from <filename>.profile</filename>.
</para>
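<para>
One common way to source <filename>.bashrc</filename> from
<filename>.profile</filename> is sketched below. The existence test is an
added precaution, not something bash requires.
<screen>
# Read .bashrc in login shells too, if the file exists
if [ -f "${HOME}/.bashrc" ]; then
  . "${HOME}/.bashrc"
fi
</screen>
</para>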
<para>
<filename>.inputrc</filename> controls how programs using the readline
library (including <command>bash</command>) behave. It is loaded
automatically. For full details see the <literal>Function and Variable
Index</literal> section of the GNU <systemitem>readline</systemitem> manual.
Consider the following settings:
<screen>
# Ignore case while completing
set completion-ignore-case on
# Make Bash 8bit clean
set meta-flag on
set convert-meta off
set output-meta on
</screen>
The first command makes filename completion case insensitive, which can
be convenient in a Windows environment. The next three commands allow
<command>bash</command> to display 8-bit characters, useful for
languages with accented characters. Note that tools that do not use
<systemitem>readline</systemitem> for display, such as
<command>less</command> and <command>ls</command>, require additional
settings, which could be put in your <filename>.bashrc</filename>:
<screen>
alias less='/bin/less -r'
alias ls='/bin/ls -F --color=tty --show-control-chars'
</screen>
</para>
</sect1>