newlib/winsup/doc/textbinary.sgml
Joshua Daniel Franklin aff8b4f9aa 2004-01-24 Joshua Daniel Franklin <joshuadfranklin@yahoo.com>
* cygwinenv.sgml: Cleanup minor markup problems.
	* dll.sgml: Cleanup minor markup problems.
	* effectively.sgml: Cleanup minor markup problems.
	* gcc.sgml: Cleanup minor markup problems.
	* ntsec.sgml: Cleanup minor markup problems.
	* pathnames.sgml: Cleanup minor markup problems.
	* setup-net.sgml: Cleanup minor markup problems.
	* textbinary.sgml: Cleanup minor markup problems.
	* windres.sgml: Cleanup minor markup problems.
2004-01-24 08:09:45 +00:00


<sect1 id="using-textbinary"><title>Text and Binary modes</title>
<sect2> <title>The Issue</title>
<para>On a UNIX system, when an application reads from a file it gets
exactly what's in the file on disk and the converse is true for writing.
The situation is different in the DOS/Windows world where a file can
be opened in one of two modes, binary or text. In the binary mode the
system behaves exactly as in UNIX. However, in text mode there are
major differences:</para>
<OrderedList Numeration="Loweralpha" Spacing="Compact">
<listitem>
<para>
On writing in text mode, a NL (\n, ^J) is transformed into the
sequence CR (\r, ^M) NL.</para>
</listitem>
<listitem>
<para>
On reading in text mode, a CR that is immediately followed by an NL is
deleted, and a ^Z character signals the end of the file.</para>
</listitem>
</OrderedList>
<para>This can wreak havoc with the lseek/fseek calls since the number
of bytes actually in the file may differ from that seen by the
application.</para>
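<para>The byte-count mismatch can be reproduced without a text-mode mount.
The following is a minimal sketch, not Cygwin's actual implementation: the
helper names <literal>text_write</literal> and <literal>text_read</literal>
are hypothetical, they emulate only the CR/NL translation with
<command>awk</command> (^Z handling is not modelled), and they assume the
data passes through binary pipes.</para>
<screen>
<![CDATA[
```shell
#!/bin/sh
# Hypothetical helpers that imitate the text-mode translation on a stream.

# What a text-mode write does: every NL becomes CR NL.
text_write() {
    awk '{ printf "%s\r\n", $0 }'
}

# What a text-mode read does: a CR directly before an NL is dropped.
text_read() {
    awk '{ sub(/\r$/, ""); print }'
}

# The application writes 8 bytes, but 10 bytes would reach the disk.
printf 'one\ntwo\n' | wc -c
printf 'one\ntwo\n' | text_write | wc -c
```
]]>
</screen>
<para>The two extra bytes are the reason an offset computed from what the
application saw need not match the offset in the file on disk.</para>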
<para>The mode can be specified explicitly as explained in the Programming
section below. In an ideal DOS/Windows world, all programs using lines as
records (such as <command>bash</command>, <command>make</command>,
<command>sed</command> ...) would open files (and change the mode of their
standard input and output) as text. All other programs (such as
<command>cat</command>, <command>cmp</command>, <command>tr</command> ...)
would use binary mode. In practice with Cygwin, programs that deal
explicitly with object files specify binary mode (this is the case of
<command>od</command>, which is helpful to diagnose CR problems). Most
other programs (such as <command>cat</command>, <command>cmp</command>,
<command>tr</command>) use the default mode.</para>
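<para>Since <command>od</command> specifies binary mode, it can be used to
check whether a file really contains CRs. A small sketch follows; the
function name <literal>count_cr</literal> is made up for this example, and
the file name is passed to <command>od</command> directly so that no shell
redirection is involved.</para>
<screen>
<![CDATA[
```shell
#!/bin/sh
# count_cr FILE: print the number of CR (hex 0d) bytes in FILE.
# (count_cr is a hypothetical name used only in this example.)
count_cr() {
    # Dump every byte as two hex digits, then count the "0d" fields.
    od -A n -v -t x1 "$1" |
        awk '{ for (i = 1; i <= NF; i++) if ($i == "0d") n++ }
             END { print n + 0 }'
}

# Example: count_cr Makefile
```
]]>
</screen>
<para>A result of 0 means the file uses plain NL line endings.</para>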
</sect2>
<sect2><title>The default Cygwin behavior</title>
<para>The Cygwin system gives us some flexibility in deciding how files
are to be opened when the mode is not specified explicitly.
The rules are evolving; this section gives the design goals.</para>
<OrderedList Numeration="Loweralpha">
<listitem>
<para>If the file appears to reside on a file system that is mounted
(i.e. if its pathname starts with a directory displayed by
<command>mount</command>), then the default is specified by the mount
flag. If the file is a symbolic link, the mode of the target file system
applies.</para>
</listitem>
<listitem>
<para>If the file appears to reside on a file system that is not mounted
(as can happen when the path contains a drive letter), the default is text.
</para>
</listitem>
<listitem>
<para>Pipes and non-file devices are opened in binary mode,
except if the <EnVar>CYGWIN</EnVar> environment variable contains
<literal>nobinmode</literal>.</para>
<warning><Title>Warning!</Title><para>In b20.1 of 12/98, a file will be opened
in binary mode if any of the following conditions hold:</para>
<OrderedList Numeration="arabic" Spacing="Compact">
<listitem><para>binary mode is specified in the open call</para>
</listitem>
<listitem><para><envar>CYGWIN</envar> contains <literal>binmode</literal></para>
</listitem>
<listitem><para>the file resides in a binary mounted partition</para>
</listitem>
<listitem><para>the file is not a disk file</para>
</listitem>
</OrderedList>
</warning>
</listitem>
<listitem>
<para>When a Cygwin program is launched by a shell, its standard input,
output and error are in binary mode if the <envar>CYGWIN</envar> variable
contains <literal>tty</literal>, else in text mode, except if they are piped
or redirected.</para>
<para>When redirecting, the Cygwin shells use rules (a-c). For
these shells the relevant value of <envar>CYGWIN</envar> is that at the time
the shell was launched and not that at the time the program is executed.
Non-Cygwin shells always pipe and redirect with binary mode. With
non-Cygwin shells the commands <command> cat filename | program </command>
and <command> program &lt; filename </command> are not equivalent when
<filename>filename</filename> is on a text-mounted partition. </para>
</listitem>
</OrderedList>
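<para>Rules (a) and (b) can be pictured as a prefix lookup in the mount
table. The sketch below is only an approximation for illustration:
<literal>default_mode</literal> is a hypothetical helper, it assumes that the
mount table is supplied on standard input with lines of the form
<literal>DEVICE on DIR type TYPE (FLAGS)</literal>, that binary mounts carry
the flag <literal>binmode</literal>, and that the most specific mount point
wins. It ignores symbolic links, pipes and the <envar>CYGWIN</envar>
variable, and it does not handle paths containing spaces.</para>
<screen>
<![CDATA[
```shell
#!/bin/sh
# default_mode PATH: read a mount table on stdin, print "binary" or "text".
# Assumed (not verified) line format: DEVICE on DIR type TYPE (FLAGS)
default_mode() {
    awk -v path="$1" '
        BEGIN { best = 0; mode = "text" }   # rule (b): not mounted means text
        $2 == "on" && $4 == "type" {
            dir = $3
            # A mount point matches the path itself or any path below it.
            prefix = (dir == "/") ? "/" : dir "/"
            if ((path == dir || index(path, prefix) == 1) && length(dir) > best) {
                best = length(dir)          # keep the longest matching DIR
                mode = ($0 ~ /binmode/) ? "binary" : "text"   # rule (a)
            }
        }
        END { print mode }
    '
}
```
]]>
</screen>
<para>On a Cygwin system it might be called as
<command>mount | default_mode /home/joe/data</command>, provided the output
of <command>mount</command> matches the assumed format.</para>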
</sect2>
<sect2><title>Example</title>
<para>To illustrate the various rules, we provide scripts to delete CRs
from files by using the <command>tr</command> program, which can only write
to standard output.
The script</para>
<screen>
<![CDATA[
#!/bin/sh
# Remove \r from the file given as argument
tr -d '\r' < "$1" > "$1".nocr
]]>
</screen>
<para>will not work on a text-mounted system because the \r will be
reintroduced on writing. However, scripts such as</para>
<screen>
<![CDATA[
#!/bin/sh
# Remove \r from the file given as argument
tr -d '\r' < "$1" | gzip | gunzip > "$1".nocr
]]>
</screen>
<para>and the .bat file</para>
<screen>
<![CDATA[
@echo off
REM Remove \r from the file given as argument
tr -d \r < %1 > %1.nocr
]]>
</screen>
<para>work fine. In the shell script (assuming the pipes are binary)
we rely on <command>gunzip</command> to set its output to binary mode,
possibly overriding the mode used by the shell.
In the .bat file we rely on the DOS shell to redirect in binary mode.
</para>
</sect2>
<sect2><title>Binary or text?</title>
<para>UNIX programs that have been written for maximum portability
will know the difference between text and binary files and act
appropriately under Cygwin. For those programs, the text mode default
is a good choice. Programs included in official Cygwin distributions
should work well in the default mode. </para>
<para>Text mode makes it much easier to mix files between Cygwin and
Windows programs, since Windows programs will usually use the CRLF
format. Unfortunately you may still have some problems with text
mode. First, some of the utilities included with Cygwin do not yet
specify binary mode when they should, e.g. <command>cat</command> will
not work with binary files (input will stop at ^Z, CRs will be
introduced in the output). Second, you will introduce CRs in text
files you write, which can cause problems when moving them back to a
UNIX system. </para>
<para>If you are mounting a remote file system from a UNIX machine,
or moving files back and forth to a UNIX machine, you may want to
access the files in binary mode. The text files found there will normally
be in UNIX NL format, and you would want any files put there by Cygwin
programs to be stored in a format understood by UNIX.
Be sure to remove CRs from all Makefiles and
shell scripts, and edit the files only with
DOS/Windows editors that can cope with and preserve NL-terminated lines.
</para>
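<para>One possible way to strip the CRs from a set of files is sketched
below. The function name <literal>strip_cr</literal> and the temporary-file
suffix are inventions of this example, and the sketch assumes that the shell
redirects in binary mode (for instance on a binary-mounted file system);
otherwise the CRs would be reintroduced on writing, as described in the
Example section.</para>
<screen>
<![CDATA[
```shell
#!/bin/sh
# strip_cr FILE...: delete every CR from each FILE, in place.
# (strip_cr is a hypothetical name used only in this example.)
strip_cr() {
    for f in "$@"; do
        tmp="$f.nocr.$$"        # hypothetical temporary file name
        # Write the CR-free copy first, then copy it back over the
        # original so that the file keeps its permissions.
        tr -d '\r' < "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

# Example: strip_cr Makefile configure
```
]]>
</screen>
<para>Note that this removes every CR, not only those that precede an NL,
which is normally what is wanted for Makefiles and shell scripts.</para>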
<para>Note that you can decide this on a disk-by-disk basis (for
example, mounting local disks in text mode and network disks in binary
mode). You can also partition a disk, for example by mounting
<filename>c:</filename> in text mode, and <filename>c:\home</filename>
in binary mode.</para>
</sect2>
<sect2><title>Programming</title>
<para>In the <function>open()</function> function call, binary mode can be
specified with the flag <literal>O_BINARY</literal> and text mode with
<literal>O_TEXT</literal>. These symbols are defined in
<filename>fcntl.h</filename>.</para>
<para>In the <function>fopen()</function> function call, binary mode can be
specified by adding a <literal>b</literal> to the mode string. There is no
direct way to specify text mode.</para>
<para>The mode of a file can be changed by the call
<function>setmode(fd,mode)</function> where <literal>fd</literal> is a file
descriptor (an integer) and <literal>mode</literal> is
<literal>O_BINARY</literal> or <literal>O_TEXT</literal>. The function
returns <literal>O_BINARY</literal> or <literal>O_TEXT</literal> depending
on the mode before the call, and <literal>EOF</literal> on error.</para>
</sect2>
</sect1>