| <html lang="en">
 | ||
| <head>
 | ||
| <title>Lzip Manual</title>
 | ||
| <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">
 | ||
| <meta name="description" content="Lzip Manual">
 | ||
| <meta name="generator" content="makeinfo 4.13+">
 | ||
| <link title="Top" rel="top" href="#Top">
 | ||
| <link href="http://www.gnu.org/software/texinfo/" rel="generator-home" title="Texinfo Homepage">
 | ||
| <meta http-equiv="Content-Style-Type" content="text/css">
 | ||
| <style type="text/css"><!--
 | ||
|   pre.display { font-family:inherit }
 | ||
|   pre.format  { font-family:inherit }
 | ||
|   pre.smalldisplay { font-family:inherit; font-size:smaller }
 | ||
|   pre.smallformat  { font-family:inherit; font-size:smaller }
 | ||
|   pre.smallexample { font-size:smaller }
 | ||
|   pre.smalllisp    { font-size:smaller }
 | ||
|   span.sc    { font-variant:small-caps }
 | ||
|   span.roman { font-family:serif; font-weight:normal; } 
 | ||
|   span.sansserif { font-family:sans-serif; font-weight:normal; } 
 | ||
| --></style>
 | ||
| </head>
 | ||
| <body>
 | ||
| <div class="node">
 | ||
| <a name="Top"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Introduction">Introduction</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#dir">(dir)</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="unnumbered">Lzip Manual</h2>
 | ||
| 
 | ||
| <p>This manual is for Lzip (version 1.24, 7 April 2024).
 | ||
| 
 | ||
| <ul class="menu">
 | ||
| <li><a accesskey="1" href="#Introduction">Introduction</a>:            Purpose and features of lzip
 | ||
| <li><a accesskey="2" href="#Output">Output</a>:                  Meaning of lzip's output
 | ||
| <li><a accesskey="3" href="#Invoking-lzip">Invoking lzip</a>:           Command-line interface
 | ||
| <li><a accesskey="4" href="#Quality-assurance">Quality assurance</a>:       Design, development, and testing of lzip
 | ||
| <li><a accesskey="5" href="#Algorithm">Algorithm</a>:               How lzip compresses the data
 | ||
| <li><a accesskey="6" href="#File-format">File format</a>:             Detailed format of the compressed file
 | ||
| <li><a accesskey="7" href="#Stream-format">Stream format</a>:           Format of the LZMA stream in lzip files
 | ||
| <li><a accesskey="8" href="#Trailing-data">Trailing data</a>:           Extra data appended to the file
 | ||
| <li><a accesskey="9" href="#Examples">Examples</a>:                A small tutorial with examples
 | ||
| <li><a href="#Problems">Problems</a>:                Reporting bugs
 | ||
| <li><a href="#Reference-source-code">Reference source code</a>:   Source code illustrating stream format
 | ||
| <li><a href="#Concept-index">Concept index</a>:           Index of concepts
 | ||
| </ul>
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Copyright © 2008-2024 Antonio Diaz Diaz.
 | ||
| 
 | ||
|    <p>This manual is free documentation: you have unlimited permission to copy,
 | ||
| distribute, and modify it.
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Introduction"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Output">Output</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Top">Top</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">1 Introduction</h2>
 | ||
| 
 | ||
| <p><a name="index-introduction-1"></a>
 | ||
| <a href="http://www.nongnu.org/lzip/lzip.html">Lzip</a>
 | ||
| is a lossless data compressor with a user interface similar to that
 | ||
| of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov
 | ||
| chain-Algorithm' (LZMA) stream format to maximize interoperability. The
 | ||
| maximum dictionary size is 512 MiB so that any lzip file can be decompressed
 | ||
| on 32-bit machines. Lzip provides accurate and robust 3-factor integrity
 | ||
| checking. Lzip can compress about as fast as gzip (lzip -0)<!-- /@w --> or compress most
 | ||
| files more than bzip2 (lzip -9)<!-- /@w -->. Decompression speed is intermediate between
 | ||
| gzip and bzip2. Lzip is better than gzip and bzip2 from a data recovery
 | ||
| perspective. Lzip has been designed, written, and tested with great care to
 | ||
| replace gzip and bzip2 as the standard general-purpose compressed format for
 | ||
| Unix-like systems.
 | ||
| 
 | ||
|    <p>For compressing/decompressing large files on multiprocessor machines
 | ||
| <a href="http://www.nongnu.org/lzip/manual/plzip_manual.html">plzip</a> can be
 | ||
| much faster than lzip at the cost of a slightly reduced compression ratio.
 | ||
| 
 | ||
|    <p>For creation and manipulation of compressed tar archives
 | ||
| <a href="http://www.nongnu.org/lzip/manual/tarlz_manual.html">tarlz</a> can be more
 | ||
| efficient than using tar and plzip because tarlz is able to keep the
 | ||
| alignment between tar members and lzip members.
 | ||
| 
 | ||
|    <p>The lzip file format is designed for data sharing and long-term archiving,
 | ||
| taking into account both data integrity and decoder availability:
 | ||
| 
 | ||
|      <ul>
 | ||
| <li>The lzip format provides very safe integrity checking and some data
 | ||
| recovery means. The program
 | ||
| <a href="http://www.nongnu.org/lzip/manual/lziprecover_manual.html#Data-safety">lziprecover</a>
 | ||
| can repair bit flip errors (one of the most common forms of data corruption)
 | ||
| in lzip files, and provides data recovery capabilities, including
 | ||
| error-checked merging of damaged copies of a file.
 | ||
| 
 | ||
|      <li>The lzip format is as simple as possible (but not simpler). The lzip
 | ||
| manual provides the source code of a simple decompressor along with a
 | ||
| detailed explanation of how it works, so that with only the help of the
 | ||
| lzip manual it would be possible for a digital archaeologist to extract
 | ||
| the data from a lzip file long after quantum computers eventually
 | ||
| render LZMA obsolete.
 | ||
| 
 | ||
|      <li>Additionally the lzip reference implementation is copylefted, which
 | ||
| guarantees that it will remain free forever. 
 | ||
| </ul>
 | ||
| 
 | ||
|    <p>A nice feature of the lzip format is that a corrupt byte is easier to repair
 | ||
| the nearer it is to the beginning of the file. Therefore, with the help of
 | ||
| lziprecover, losing an entire archive just because of a corrupt byte near
 | ||
| the beginning is a thing of the past.
 | ||
| 
 | ||
|    <p>The member trailer stores the 32-bit CRC of the original data, the size of
 | ||
| the original data, and the size of the member. These values, together with
 | ||
| the "End Of Stream" marker, provide 3-factor integrity checking which
 | ||
| guarantees that the decompressed version of the data is identical to the
 | ||
| original. This guards against corruption of the compressed data, and against
 | ||
| undetected bugs in lzip (hopefully very unlikely). The chances of data
 | ||
| corruption going undetected are microscopic. Be aware, though, that the
 | ||
| check occurs upon decompression, so it can only tell you that something is
 | ||
| wrong. It can't help you recover the original uncompressed data.
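   <p>A minimal sketch of these three checks, in Python and for illustration
only. It assumes that the trailer occupies the last 20 bytes of a member (a
4-byte CRC32 followed by the 8-byte data size and the 8-byte member size, all
little endian) and that <code>zlib.crc32</code> computes the same CRC32 that
lzip stores. The names <code>parse_trailer</code> and <code>check_member</code>
are hypothetical and are not part of lzip.

```python
import struct
import zlib

TRAILER_SIZE = 20  # assumed layout: CRC32 (4) + data size (8) + member size (8)


def parse_trailer(member: bytes):
    """Return (crc32, data_size, member_size) read from the end of a member."""
    if len(member) < TRAILER_SIZE:
        raise ValueError("member too short to contain a trailer")
    # "<IQQ" = one little-endian 32-bit value followed by two 64-bit values.
    return struct.unpack("<IQQ", member[-TRAILER_SIZE:])


def check_member(member: bytes, decompressed: bytes) -> list:
    """Compare the decoded data against the three values in the trailer.

    Returns a list of mismatch descriptions; an empty list means that all
    three checks passed.
    """
    crc, data_size, member_size = parse_trailer(member)
    problems = []
    if zlib.crc32(decompressed) != crc:
        problems.append("CRC mismatch")
    if len(decompressed) != data_size:
        problems.append("data size mismatch")
    if len(member) != member_size:
        problems.append("member size mismatch")
    return problems
```

   <p>The sketch only reports which value disagrees. As stated above, a failed
check says that something is wrong, not how to repair it.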
 | ||
| 
 | ||
|    <p>Lzip uses the same well-defined exit status values used by bzip2, which
 | ||
| makes it safer than compressors returning ambiguous warning values (like
 | ||
| gzip) when it is used as a back end for other programs like tar or zutils.
 | ||
| 
 | ||
|    <p>For each file, lzip automatically uses the largest dictionary size that
 | ||
| exceeds neither the file size nor the limit given. Keep in mind that the
 | ||
| decompression memory requirement is affected at compression time by the
 | ||
| choice of dictionary size limit.
 | ||
| 
 | ||
|    <p>The amount of memory required for compression is about 1 or 2 times the
 | ||
| dictionary size limit (1 if input file size is less than dictionary size
 | ||
| limit, else 2) plus 9 times the dictionary size really used. The option
 | ||
| <samp><span class="option">-0</span></samp> is special and only requires about 1.5 MiB<!-- /@w --> at most. The
 | ||
| amount of memory required for decompression is about 46 kB<!-- /@w --> larger
 | ||
| than the dictionary size really used.
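   <p>The two estimates above can be written down as a short Python sketch. It
is a rough calculator under the stated approximations, not a measurement of
what lzip actually allocates. The function names are hypothetical, and the
special case of option <samp><span class="option">-0</span></samp> is not modelled.

```python
KB = 1000        # SI kilobyte, as in "46 kB"
MIB = 1 << 20    # binary mebibyte


def compression_memory(dict_limit: int, dict_used: int, input_size: int) -> int:
    """Approximate number of bytes needed to compress one file."""
    # 1 x limit if the input is smaller than the limit, else 2 x limit.
    factor = 1 if input_size < dict_limit else 2
    return factor * dict_limit + 9 * dict_used


def decompression_memory(dict_used: int) -> int:
    """Approximate number of bytes needed to decompress one file."""
    return dict_used + 46 * KB
```

   <p>For example, under these formulas a 100 MiB file compressed with an
8 MiB limit (and an 8 MiB dictionary really used) needs about
2 * 8 + 9 * 8 = 88 MiB to compress, and about 8 MiB plus 46 kB to decompress.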
 | ||
| 
 | ||
|    <p>When compressing, lzip replaces every file given in the command line
 | ||
| with a compressed version of itself, with the name "original_name.lz". 
 | ||
| When decompressing, lzip attempts to guess the name for the decompressed
 | ||
| file from that of the compressed file as follows:
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><td valign="top">filename.lz  </td><td valign="top">becomes </td><td valign="top">filename
 | ||
| <br></td></tr><tr align="left"><td valign="top">filename.tlz </td><td valign="top">becomes </td><td valign="top">filename.tar
 | ||
| <br></td></tr><tr align="left"><td valign="top">anyothername </td><td valign="top">becomes </td><td valign="top">anyothername.out
 | ||
|    <br></td></tr></table>
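   <p>The renaming table above can be sketched in Python as follows. The
function name is hypothetical, and the requirement that the name be longer
than the extension itself is an assumption made here to avoid producing an
empty file name.

```python
def decompressed_name(name: str) -> str:
    """Guess the output file name for a compressed file name."""
    # (known extension, replacement) pairs taken from the table.
    for extension, replacement in ((".lz", ""), (".tlz", ".tar")):
        if len(name) > len(extension) and name.endswith(extension):
            return name[: -len(extension)] + replacement
    # Any other name just gets ".out" appended.
    return name + ".out"
```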
 | ||
| 
 | ||
|    <p>(De)compressing a file is much like copying or moving it. Therefore lzip
 | ||
| preserves the access and modification dates, permissions, and, if you have
 | ||
| appropriate privileges, ownership of the file just as '<samp><span class="samp">cp -p</span></samp>'<!-- /@w --> does. 
 | ||
| (If the user ID or the group ID can't be duplicated, the file permission
 | ||
| bits S_ISUID and S_ISGID are cleared).
 | ||
| 
 | ||
|    <p>Lzip is able to read from some types of non-regular files if either the
 | ||
| option <samp><span class="option">-c</span></samp> or the option <samp><span class="option">-o</span></samp> is specified.
 | ||
| 
 | ||
|    <p>Lzip refuses to read compressed data from a terminal or write compressed
 | ||
| data to a terminal, as this would be entirely incomprehensible and might
 | ||
| leave the terminal in an abnormal state.
 | ||
| 
 | ||
|    <p>Lzip correctly decompresses a file which is the concatenation of two or
 | ||
| more compressed files. The result is the concatenation of the corresponding
 | ||
| decompressed files. Integrity testing of concatenated compressed files is
 | ||
| also supported.
 | ||
| 
 | ||
|    <p>Lzip can produce multimember files, and lziprecover can safely recover the
 | ||
| undamaged members in case of file damage. Lzip can also split the compressed
 | ||
| output in volumes of a given size, even when reading from standard input. 
 | ||
| This allows the direct creation of multivolume compressed tar archives.
 | ||
| 
 | ||
|    <p>Lzip is able to compress and decompress streams of unlimited size by
 | ||
| automatically creating multimember output. The members so created are large,
 | ||
| about 2 PiB<!-- /@w --> each.
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Output"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Invoking-lzip">Invoking lzip</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Introduction">Introduction</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">2 Meaning of lzip's output</h2>
 | ||
| 
 | ||
| <p><a name="index-output-2"></a>
 | ||
| The output of lzip looks like this:
 | ||
| 
 | ||
| <pre class="example">     lzip -v foo
 | ||
|        foo:  6.676:1, 14.98% ratio, 85.02% saved, 450560 in, 67493 out.
 | ||
|      
 | ||
|      lzip -tvvv foo.lz
 | ||
|        foo.lz:  6.676:1, 14.98% ratio, 85.02% saved.  450560 out,  67493 in. ok
 | ||
| </pre>
 | ||
|    <p>The meaning of each field is as follows:
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt><code>N:1</code><dd>The compression ratio (uncompressed_size / compressed_size)<!-- /@w -->, shown as
 | ||
| N to 1<!-- /@w -->.
 | ||
| 
 | ||
|      <br><dt><code>ratio</code><dd>The inverse compression ratio (compressed_size / uncompressed_size)<!-- /@w -->,
 | ||
| shown as a percentage. A decimal ratio is easily obtained by moving the
 | ||
| decimal point two places to the left; 14.98% = 0.1498<!-- /@w -->.
 | ||
| 
 | ||
|      <br><dt><code>saved</code><dd>The space saved by compression (1 - ratio)<!-- /@w -->, shown as a percentage.
 | ||
| 
 | ||
|      <br><dt><code>in</code><dd>Size of the input data. This is the uncompressed size when compressing, or
 | ||
| the compressed size when decompressing or testing. Note that lzip always
 | ||
| prints the uncompressed size before the compressed size when compressing,
 | ||
| decompressing, testing, or listing.
 | ||
| 
 | ||
|      <br><dt><code>out</code><dd>Size of the output data. This is the compressed size when compressing, or
 | ||
| the decompressed size when decompressing or testing.
 | ||
| 
 | ||
|    </dl>
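   <p>As a worked example, the first three fields can be recomputed from the
two sizes shown in the sample output. This Python sketch only reproduces the
arithmetic; the function name and the exact output layout are illustrative,
and both sizes are assumed to be positive.

```python
def size_report(uncompressed_size: int, compressed_size: int) -> str:
    """Format the N:1, ratio, and saved fields from the two sizes."""
    if uncompressed_size <= 0 or compressed_size <= 0:
        raise ValueError("both sizes must be positive")
    n_to_1 = uncompressed_size / compressed_size          # N:1
    ratio = 100.0 * compressed_size / uncompressed_size   # inverse, in percent
    saved = 100.0 - ratio                                 # (1 - ratio), in percent
    return f"{n_to_1:.3f}:1, {ratio:.2f}% ratio, {saved:.2f}% saved"


print(size_report(450560, 67493))
# prints "6.676:1, 14.98% ratio, 85.02% saved"
```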
 | ||
| 
 | ||
|    <p>When decompressing or testing at verbosity level 4 (-vvvv), the dictionary
 | ||
| size used to compress the file and the CRC32 of the uncompressed data are
 | ||
| also shown.
 | ||
| 
 | ||
|    <p>LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never have
 | ||
| been compressed. Decompressed is used to refer to data which have undergone
 | ||
| the process of decompression.
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Invoking-lzip"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Quality-assurance">Quality assurance</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Output">Output</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">3 Invoking lzip</h2>
 | ||
| 
 | ||
| <p><a name="index-invoking-3"></a><a name="index-options-4"></a><a name="index-usage-5"></a><a name="index-version-6"></a>
 | ||
| The format for running lzip is:
 | ||
| 
 | ||
| <pre class="example">     lzip [<var>options</var>] [<var>files</var>]
 | ||
| </pre>
 | ||
|    <p class="noindent">If no file names are specified, lzip compresses (or decompresses) from
 | ||
| standard input to standard output. A hyphen '<samp><span class="samp">-</span></samp>' used as a <var>file</var>
 | ||
| argument means standard input. It can be mixed with other <var>files</var> and is
 | ||
| read just once, the first time it appears in the command line. Remember to
 | ||
| prepend <samp><span class="file">./</span></samp> to any file name beginning with a hyphen, or use '<samp><span class="samp">--</span></samp>'.
 | ||
| 
 | ||
|    <p>lzip supports the following
 | ||
| <a href="http://www.nongnu.org/arg-parser/manual/arg_parser_manual.html#Argument-syntax">options</a>:
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt><code>-h</code><dt><code>--help</code><dd>Print an informative help message describing the options and exit.
 | ||
| 
 | ||
|      <br><dt><code>-V</code><dt><code>--version</code><dd>Print the version number of lzip on the standard output and exit. 
 | ||
| This version number should be included in all bug reports.
 | ||
| 
 | ||
|      <p><a name="g_t_002d_002dtrailing_002derror"></a><br><dt><code>-a</code><dt><code>--trailing-error</code><dd>Exit with error status 2 if any remaining input is detected after
 | ||
| decompressing the last member. Such remaining input is usually trailing
 | ||
| garbage that can be safely ignored. See <a href="#concat_002dexample">concat-example</a>.
 | ||
| 
 | ||
|      <br><dt><code>-b </code><var>bytes</var><dt><code>--member-size=</code><var>bytes</var><dd>When compressing, set the member size limit to <var>bytes</var>. If <var>bytes</var>
 | ||
| is smaller than the compressed size, a multimember file is produced. It is
 | ||
| advisable to keep members smaller than RAM size so that they can be repaired
 | ||
| with lziprecover in case of corruption. A small member size may degrade
 | ||
| compression ratio, so use it only when needed. Valid values range from
 | ||
| 100 kB<!-- /@w --> to 2 PiB<!-- /@w -->. Defaults to 2 PiB<!-- /@w -->.
 | ||
| 
 | ||
|      <br><dt><code>-c</code><dt><code>--stdout</code><dd>Compress or decompress to standard output; keep input files unchanged. If
 | ||
| compressing several files, each file is compressed independently. (The
 | ||
| output consists of a sequence of independently compressed members). This
 | ||
| option (or <samp><span class="option">-o</span></samp>) is needed when reading from a named pipe (fifo) or
 | ||
| from a device. Use it also to recover as much of the decompressed data as
 | ||
| possible when decompressing a corrupt file. <samp><span class="option">-c</span></samp> overrides <samp><span class="option">-o</span></samp>
 | ||
| and <samp><span class="option">-S</span></samp>. <samp><span class="option">-c</span></samp> has no effect when testing or listing.
 | ||
| 
 | ||
|      <br><dt><code>-d</code><dt><code>--decompress</code><dd>Decompress the files specified. The integrity of the files specified is
 | ||
| checked. If a file does not exist, can't be opened, or the destination file
 | ||
| already exists and <samp><span class="option">--force</span></samp> has not been specified, lzip continues
 | ||
| decompressing the rest of the files and exits with error status 1. If a file
 | ||
| fails to decompress, or is a terminal, lzip exits immediately with error
 | ||
| status 2 without decompressing the rest of the files. A terminal is
 | ||
| considered an uncompressed file, and therefore invalid.
 | ||
| 
 | ||
|      <br><dt><code>-f</code><dt><code>--force</code><dd>Force overwrite of output files.
 | ||
| 
 | ||
|      <br><dt><code>-F</code><dt><code>--recompress</code><dd>When compressing, force re-compression of files whose name already has
 | ||
| the '<samp><span class="samp">.lz</span></samp>' or '<samp><span class="samp">.tlz</span></samp>' suffix.
 | ||
| 
 | ||
|      <br><dt><code>-k</code><dt><code>--keep</code><dd>Keep (don't delete) input files during compression or decompression.
 | ||
| 
 | ||
|      <br><dt><code>-l</code><dt><code>--list</code><dd>Print the uncompressed size, compressed size, and percentage saved of the
 | ||
| files specified. Trailing data are ignored. The values produced are correct
 | ||
| even for multimember files. If more than one file is given, a final line
 | ||
| containing the cumulative sizes is printed. With <samp><span class="option">-v</span></samp>, the dictionary
 | ||
| size, the number of members in the file, and the amount of trailing data (if
 | ||
| any) are also printed. With <samp><span class="option">-vv</span></samp>, the positions and sizes of each
 | ||
| member in multimember files are also printed.
 | ||
| 
 | ||
|      <p>If any file is damaged, does not exist, can't be opened, or is not regular,
 | ||
| the final exit status is > 0<!-- /@w -->. <samp><span class="option">-lq</span></samp> can be used to check quickly
 | ||
| (without decompressing) the structural integrity of the files specified. 
 | ||
| (Use <samp><span class="option">--test</span></samp> to check the data integrity). <samp><span class="option">-alq</span></samp>
 | ||
| additionally checks that none of the files specified contain trailing data.
 | ||
| 
 | ||
|      <br><dt><code>-m </code><var>bytes</var><dt><code>--match-length=</code><var>bytes</var><dd>When compressing, set the match length limit in bytes. After a match this
 | ||
| long is found, the search is finished. Valid values range from 5 to 273. 
 | ||
| Larger values usually give better compression ratios but longer compression
 | ||
| times.
 | ||
| 
 | ||
|      <br><dt><code>-o </code><var>file</var><dt><code>--output=</code><var>file</var><dd>If <samp><span class="option">-c</span></samp> has not also been specified, write the (de)compressed output
 | ||
| to <var>file</var>, automatically creating any missing parent directories; keep
 | ||
| input files unchanged. If compressing several files, each file is compressed
 | ||
| independently. (The output consists of a sequence of independently
 | ||
| compressed members). This option (or <samp><span class="option">-c</span></samp>) is needed when reading
 | ||
| from a named pipe (fifo) or from a device. <samp><span class="option">-o -</span></samp><!-- /@w --> is equivalent
 | ||
| to <samp><span class="option">-c</span></samp>. <samp><span class="option">-o</span></samp> has no effect when testing or listing.
 | ||
| 
 | ||
|      <p>In order to keep backward compatibility with lzip versions prior to 1.22,
 | ||
| when compressing from standard input and no other file names are given, the
 | ||
| extension '<samp><span class="samp">.lz</span></samp>' is appended to <var>file</var> unless it already ends in
 | ||
| '<samp><span class="samp">.lz</span></samp>' or '<samp><span class="samp">.tlz</span></samp>'. This feature will be removed in a future version
 | ||
| of lzip. Meanwhile, redirection may be used instead of <samp><span class="option">-o</span></samp> to write
 | ||
| the compressed output to a file without the extension '<samp><span class="samp">.lz</span></samp>' in its
 | ||
| name: '<samp><span class="samp">lzip &lt; file &gt; foo</span></samp>'<!-- /@w -->.
 | ||
| 
 | ||
|      <p>When compressing and splitting the output in volumes, <var>file</var> is used as
 | ||
| a prefix, and several files named '<samp><var>file</var><span class="samp">00001.lz</span></samp>',
 | ||
| '<samp><var>file</var><span class="samp">00002.lz</span></samp>', etc, are created. In this case, only one input
 | ||
| file is allowed.
 | ||
| 
 | ||
|      <br><dt><code>-q</code><dt><code>--quiet</code><dd>Quiet operation. Suppress all messages.
 | ||
| 
 | ||
|      <br><dt><code>-s </code><var>bytes</var><dt><code>--dictionary-size=</code><var>bytes</var><dd>When compressing, set the dictionary size limit in bytes. Lzip uses for
 | ||
| each file the largest dictionary size that exceeds neither the file
 | ||
| size nor this limit. Valid values range from 4 KiB<!-- /@w --> to 512 MiB<!-- /@w -->. 
 | ||
| Values 12 to 29 are interpreted as powers of two, meaning 2^12 to 2^29
 | ||
| bytes. Dictionary sizes are quantized so that they can be coded in just one
 | ||
| byte (see <a href="#coded_002ddict_002dsize">coded-dict-size</a>). If the size specified does not match one of
 | ||
| the valid sizes, it is rounded upwards by adding up to (<var>bytes</var> / 8)<!-- /@w -->
 | ||
| to it.
 | ||
| 
 | ||
|      <p>For maximum compression you should use a dictionary size limit as large
 | ||
| as possible, but keep in mind that the decompression memory requirement
 | ||
| is affected at compression time by the choice of dictionary size limit.
 | ||
| 
 | ||
|      <br><dt><code>-S </code><var>bytes</var><dt><code>--volume-size=</code><var>bytes</var><dd>When compressing, and <samp><span class="option">-c</span></samp> has not also been specified, split the
 | ||
| compressed output into several volume files with names
 | ||
| '<samp><span class="samp">original_name00001.lz</span></samp>', '<samp><span class="samp">original_name00002.lz</span></samp>', etc, and set the
 | ||
| volume size limit to <var>bytes</var>. Input files are kept unchanged. Each
 | ||
| volume is a complete, maybe multimember, lzip file. A small volume size may
 | ||
| degrade compression ratio, so use it only when needed. Valid values range
 | ||
| from 100 kB<!-- /@w --> to 4 EiB<!-- /@w -->.
 | ||
| 
 | ||
|      <br><dt><code>-t</code><dt><code>--test</code><dd>Check integrity of the files specified, but don't decompress them. This
 | ||
| really performs a trial decompression and throws away the result. Use it
 | ||
| together with <samp><span class="option">-v</span></samp> to see information about the files. If a file
 | ||
| fails the test, does not exist, can't be opened, or is a terminal, lzip
 | ||
| continues testing the rest of the files. A final diagnostic is shown at
 | ||
| verbosity level 1 or higher if any file fails the test when testing multiple
 | ||
| files.
 | ||
| 
 | ||
|      <br><dt><code>-v</code><dt><code>--verbose</code><dd>Verbose mode.<br>
 | ||
| When compressing, show the compression ratio and size for each file
 | ||
| processed.<br>
 | ||
| When decompressing or testing, further -v's (up to 4) increase the verbosity
 | ||
| level, showing status, compression ratio, dictionary size, trailer contents
 | ||
| (CRC, data size, member size), and up to 6 bytes of trailing data (if any)
 | ||
| both in hexadecimal and as a string of printable ASCII characters.<br>
 | ||
| Two or more <samp><span class="option">-v</span></samp> options show the progress of (de)compression.
 | ||
| 
 | ||
|      <br><dt><code>-0 .. -9</code><dd>Compression level. Set the compression parameters (dictionary size and
 | ||
| match length limit) as shown in the table below. The default compression
 | ||
| level is <samp><span class="option">-6</span></samp>, equivalent to <samp><span class="option">-s8MiB -m36</span></samp><!-- /@w -->. Note that
 | ||
| <samp><span class="option">-9</span></samp> can be much slower than <samp><span class="option">-0</span></samp>. These options have no
 | ||
| effect when decompressing, testing, or listing.
 | ||
| 
 | ||
|      <p>The bidimensional parameter space of LZMA can't be mapped to a linear scale
 | ||
| optimal for all files. If your files are large, very repetitive, etc, you
 | ||
| may need to use the options <samp><span class="option">--dictionary-size</span></samp> and
 | ||
| <samp><span class="option">--match-length</span></samp> directly to achieve optimal performance.
 | ||
| 
 | ||
|      <p>If several compression levels or <samp><span class="option">-s</span></samp> or <samp><span class="option">-m</span></samp> options are
 | ||
| given, the last setting is used. For example <samp><span class="option">-9 -s64MiB</span></samp><!-- /@w --> is
 | ||
| equivalent to <samp><span class="option">-s64MiB -m273</span></samp><!-- /@w -->.
 | ||
| 
 | ||
|      <p><table summary=""><tr align="left"><td valign="top">Level </td><td valign="top">Dictionary size (-s) </td><td valign="top">Match length limit (-m)
 | ||
| <br></td></tr><tr align="left"><td valign="top">-0 </td><td valign="top">64 KiB </td><td valign="top">16 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-1 </td><td valign="top">1 MiB </td><td valign="top">5 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-2 </td><td valign="top">1.5 MiB </td><td valign="top">6 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-3 </td><td valign="top">2 MiB </td><td valign="top">8 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-4 </td><td valign="top">3 MiB </td><td valign="top">12 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-5 </td><td valign="top">4 MiB </td><td valign="top">20 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-6 </td><td valign="top">8 MiB </td><td valign="top">36 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-7 </td><td valign="top">16 MiB </td><td valign="top">68 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-8 </td><td valign="top">24 MiB </td><td valign="top">132 bytes
 | ||
| <br></td></tr><tr align="left"><td valign="top">-9 </td><td valign="top">32 MiB </td><td valign="top">273 bytes
 | ||
|      <br></td></tr></table>
 | ||
| 
 | ||
|      <br><dt><code>--fast</code><dt><code>--best</code><dd>Aliases for GNU gzip compatibility.
 | ||
| 
 | ||
|      <br><dt><code>--empty-error</code><dd>Exit with error status 2 if any empty member is found in the input files.
 | ||
| 
 | ||
|      <br><dt><code>--marking-error</code><dd>Exit with error status 2 if the first LZMA byte is non-zero in any member of
 | ||
| the input files. This may be caused by data corruption or by deliberate
 | ||
| insertion of tracking information in the file. Use
 | ||
| '<samp><span class="samp">lziprecover --clear-marking</span></samp>'<!-- /@w --> to clear any such non-zero bytes.
 | ||
| 
 | ||
|      <br><dt><code>--loose-trailing</code><dd>When decompressing, testing, or listing, allow trailing data whose first
 | ||
| bytes are so similar to the magic bytes of a lzip header that they can
 | ||
| be confused with a corrupt header. Use this option if a file triggers a
 | ||
| "corrupt header" error and the cause is not indeed a corrupt header.
 | ||
| 
 | ||
|    </dl>
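   <p>The compression level table and the "last setting is used" rule described
under <samp><span class="option">-0 .. -9</span></samp> can be modelled with a small Python sketch.
It is one reading of that rule, not lzip's own option parser: the function
name and the <code>(kind, value)</code> representation of the options are
assumptions made for illustration.

```python
KIB = 1 << 10
MIB = 1 << 20

# level: (dictionary size in bytes, match length limit in bytes)
LEVELS = {
    0: (64 * KIB, 16),
    1: (1 * MIB, 5),
    2: (3 * MIB // 2, 6),
    3: (2 * MIB, 8),
    4: (3 * MIB, 12),
    5: (4 * MIB, 20),
    6: (8 * MIB, 36),
    7: (16 * MIB, 68),
    8: (24 * MIB, 132),
    9: (32 * MIB, 273),
}


def resolve_parameters(options):
    """Return (dictionary size, match length limit) for a list of options.

    'options' holds ("level", n), ("s", bytes), or ("m", bytes) pairs in
    command-line order; each one overrides what came before it.
    """
    dict_size, match_len = LEVELS[6]  # default level is -6
    for kind, value in options:
        if kind == "level":
            dict_size, match_len = LEVELS[value]
        elif kind == "s":
            dict_size = value
        elif kind == "m":
            match_len = value
        else:
            raise ValueError(f"unknown option kind: {kind!r}")
    return dict_size, match_len
```

   <p>With this model, <code>resolve_parameters([("level", 9), ("s", 64 * MIB)])</code>
gives a 64 MiB dictionary and a match length limit of 273, matching the
<samp><span class="option">-9 -s64MiB</span></samp> example.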
 | ||
| 
 | ||
|    <p>Numbers given as arguments to options may be expressed in decimal,
 | ||
| hexadecimal, or octal (using the same syntax as integer constants in C++),
 | ||
| and may be followed by a multiplier and an optional '<samp><span class="samp">B</span></samp>' for "byte".
 | ||
| 
 | ||
|    <p>Table of SI and binary prefixes (unit multipliers):
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><td valign="top">Prefix </td><td valign="top">Value               </td><td valign="top">| </td><td valign="top">Prefix </td><td valign="top">Value
 | ||
| <br></td></tr><tr align="left"><td valign="top">k </td><td valign="top">kilobyte   (10^3 = 1000) </td><td valign="top">| </td><td valign="top">Ki </td><td valign="top">kibibyte  (2^10 = 1024)
 | ||
| <br></td></tr><tr align="left"><td valign="top">M </td><td valign="top">megabyte   (10^6)        </td><td valign="top">| </td><td valign="top">Mi </td><td valign="top">mebibyte  (2^20)
 | ||
| <br></td></tr><tr align="left"><td valign="top">G </td><td valign="top">gigabyte   (10^9)        </td><td valign="top">| </td><td valign="top">Gi </td><td valign="top">gibibyte  (2^30)
 | ||
| <br></td></tr><tr align="left"><td valign="top">T </td><td valign="top">terabyte   (10^12)       </td><td valign="top">| </td><td valign="top">Ti </td><td valign="top">tebibyte  (2^40)
 | ||
| <br></td></tr><tr align="left"><td valign="top">P </td><td valign="top">petabyte   (10^15)       </td><td valign="top">| </td><td valign="top">Pi </td><td valign="top">pebibyte  (2^50)
 | ||
| <br></td></tr><tr align="left"><td valign="top">E </td><td valign="top">exabyte    (10^18)       </td><td valign="top">| </td><td valign="top">Ei </td><td valign="top">exbibyte  (2^60)
 | ||
| <br></td></tr><tr align="left"><td valign="top">Z </td><td valign="top">zettabyte  (10^21)       </td><td valign="top">| </td><td valign="top">Zi </td><td valign="top">zebibyte  (2^70)
 | ||
| <br></td></tr><tr align="left"><td valign="top">Y </td><td valign="top">yottabyte  (10^24)       </td><td valign="top">| </td><td valign="top">Yi </td><td valign="top">yobibyte  (2^80)
 | ||
| <br></td></tr><tr align="left"><td valign="top">R </td><td valign="top">ronnabyte  (10^27)       </td><td valign="top">| </td><td valign="top">Ri </td><td valign="top">robibyte  (2^90)
 | ||
| <br></td></tr><tr align="left"><td valign="top">Q </td><td valign="top">quettabyte (10^30)       </td><td valign="top">| </td><td valign="top">Qi </td><td valign="top">quebibyte (2^100)
 | ||
|    <br></td></tr></table>
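   <p>The following Python sketch shows how such numerical arguments could be
interpreted using the prefixes in the table. It is not lzip's parser: the
function name is hypothetical, and rejecting a bare '<samp><span class="samp">B</span></samp>'
without a multiplier is an assumption. The special meaning of the values 12 to
29 for <samp><span class="option">-s</span></samp> is not handled here.

```python
import re

# Build the multiplier table: "k" -> 1000, "Ki" -> 1024, "M" -> 1000**2, ...
MULTIPLIERS = {}
for power, prefix in enumerate("kMGTPEZYRQ", start=1):
    MULTIPLIERS[prefix] = 1000 ** power
    MULTIPLIERS[prefix.upper() + "i"] = 1024 ** power

# A hexadecimal or decimal/octal number followed by an optional suffix.
_NUMBER = re.compile(r"(0[xX][0-9a-fA-F]+|[0-9]+)([A-Za-z]*)$")


def parse_size(text: str) -> int:
    """Parse a number such as '8MiB', '100kB', '0x1000', or '010'."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"bad numerical argument: {text!r}")
    digits, suffix = match.groups()
    # Same base rules as C++ integer constants: 0x = hex, leading 0 = octal.
    if digits[:2].lower() == "0x":
        value = int(digits, 16)
    elif len(digits) > 1 and digits[0] == "0":
        value = int(digits, 8)
    else:
        value = int(digits, 10)
    # Drop the optional trailing 'B' only when a multiplier precedes it.
    if len(suffix) > 1 and suffix.endswith("B"):
        suffix = suffix[:-1]
    if suffix:
        if suffix not in MULTIPLIERS:
            raise ValueError(f"bad multiplier in: {text!r}")
        value *= MULTIPLIERS[suffix]
    return value
```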
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Exit status: 0 for a normal exit, 1 for environmental problems
 | ||
| (file not found, invalid command-line options, I/O errors, etc), 2 to
 | ||
| indicate a corrupt or invalid input file, 3 for an internal consistency
 | ||
| error (e.g., bug) which caused lzip to panic.
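   <p>A program using lzip as a back end can branch on these values. The Python
sketch below is one possible wrapper, assuming that an executable named
<code>lzip</code> is available in the search path; the function names are
hypothetical. It uses only the <samp><span class="option">-t</span></samp> option and the
'<samp><span class="samp">--</span></samp>' separator described above.

```python
import subprocess

# Meaning of each exit status, as listed in the manual.
EXIT_STATUS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid options, I/O error)",
    2: "corrupt or invalid input file",
    3: "internal consistency error (bug in lzip)",
}


def describe_exit_status(code: int) -> str:
    """Translate an lzip exit status into a short description."""
    return EXIT_STATUS.get(code, f"unexpected exit status {code}")


def check_file(path: str, lzip: str = "lzip") -> str:
    """Run 'lzip -t' on one file and describe the outcome."""
    # '--' keeps a file name beginning with a hyphen from being
    # parsed as an option.
    result = subprocess.run([lzip, "-t", "--", path])
    return describe_exit_status(result.returncode)
```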
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Quality-assurance"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Algorithm">Algorithm</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Invoking-lzip">Invoking lzip</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">4 Design, development, and testing of lzip</h2>
 | ||
| 
 | ||
| <p><a name="index-quality-assurance-7"></a>
 | ||
| There are two ways of constructing a software design: One way is to make it
 | ||
| so simple that there are obviously no deficiencies and the other way is to
 | ||
| make it so complicated that there are no obvious deficiencies. The first
 | ||
| method is far more difficult.<br>
 | ||
| -- C.A.R. Hoare
 | ||
| 
 | ||
|    <p>Lzip has been designed, written, and tested with great care to replace gzip
 | ||
| and bzip2 as the standard general-purpose compressed format for Unix-like
 | ||
| systems. This chapter describes the lessons learned from these previous
 | ||
| formats, and their application to the design of lzip. The lzip format
 | ||
| specification has been reviewed carefully and is believed to be free from
 | ||
| design errors.
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h3 class="section">4.1 Format design</h3>
 | ||
| 
 | ||
| <p>When gzip was designed in 1992, computers and operating systems were much
 | ||
| less capable than they are today. The designers of gzip tried to work around
 | ||
| some of those limitations, like 8.3 file names, with additional fields in
 | ||
| the file format.
 | ||
| 
 | ||
|    <p>Today those limitations have mostly disappeared, and the format of gzip has
 | ||
| proved to be unnecessarily complicated. It includes fields that were never
 | ||
| used, others that have lost their usefulness, and finally others that have
 | ||
| become too limited.
 | ||
| 
 | ||
|    <p>Bzip2 was designed 5 years later, and its format is simpler than that of
 | ||
| gzip.
 | ||
| 
 | ||
|    <p>Probably the worst defect of the gzip format from the point of view of data
 | ||
| safety is the variable size of its header. If the byte at offset 3 (flags)
 | ||
| of a gzip member gets corrupted, it may become difficult to recover the
 | ||
| data, even if the compressed blocks are intact, because it can't be known
 | ||
| with certainty where the compressed blocks begin.
 | ||
| 
 | ||
|    <p>By contrast, the header of a lzip member has a fixed length of 6 bytes. The LZMA
 | ||
| stream in a lzip member always starts at offset 6, making it trivial to
 | ||
| recover the data even if the whole header becomes corrupt.
 | ||
| 
 | ||
|    <p>Bzip2 also provides a header of fixed length and marks the beginning and end of
 | ||
| each compressed block with six magic bytes, making it possible to find the
 | ||
| compressed blocks even in case of file damage. But bzip2 does not store the
 | ||
| size of each compressed block, as lzip does.
 | ||
| 
 | ||
|    <p>Lziprecover is able to provide unique data recovery capabilities because the
 | ||
| lzip format is extraordinarily safe. The simple and safe design of the file
 | ||
| format complements the embedded error detection provided by the LZMA data
 | ||
| stream. Any distance larger than the dictionary size acts as a forbidden
 | ||
| symbol, allowing the decompressor to detect the approximate position of
 | ||
| errors, and leaving very little work for the check sequence (CRC and data
 | ||
| sizes) in the detection of errors. Lzip is usually able to detect all
 | ||
| possible bit flips in the compressed data without resorting to the check
 | ||
| sequence. It would be difficult to write an automatic recovery tool like
 | ||
| lziprecover for the gzip format and, as far as I know, none has ever been
 | ||
| written.
 | ||
| 
 | ||
|    <p>Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
 | ||
| decompressed data because it provides optimal accuracy in the detection of
 | ||
| errors up to a compressed size of about 16 GiB<!-- /@w -->, a size larger than that
 | ||
| of most files. In the case of lzip, the additional detection capability of
 | ||
| the decompressor reduces the probability of undetected errors several
 | ||
| million times more, resulting in a combined integrity checking optimally
 | ||
| accurate for any member size produced by lzip. Preliminary results suggest
 | ||
| that the lzip format is safe enough to be used in safety-critical avionics
 | ||
| systems.
 | ||
| 
 | ||
|    <p>The lzip format is designed for long-term archiving. Therefore it excludes
 | ||
| any unneeded features that may interfere with the future extraction of the
 | ||
| decompressed data.
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h4 class="subsection">4.1.1 Gzip format (mis)features not present in lzip</h4>
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt>'<samp><span class="samp">Multiple algorithms</span></samp>'<dd>
 | ||
| Gzip provides a CM (Compression Method) field that has never been used
 | ||
| because it is a bad idea to begin with. New compression methods may require
 | ||
| additional fields, making it impossible to implement new methods and, at the
 | ||
| same time, keep the same format. This field does not solve the problem of
 | ||
| format proliferation; it just makes the problem less obvious.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Optional fields in header</span></samp>'<dd>
 | ||
| Unless special precautions are taken, optional fields are generally a bad
 | ||
| idea because they produce a header of variable size. The gzip header has 2
 | ||
| fields that, in addition to being optional, are zero-terminated. This means
 | ||
| that if any byte inside the field gets zeroed, or if the terminating zero
 | ||
| gets altered, gzip will be able to find neither the header CRC nor the
 | ||
| compressed blocks.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Optional CRC for the header</span></samp>'<dd>
 | ||
| Using an optional CRC for the header is not only a bad idea, it is an error;
 | ||
| it circumvents the Hamming distance (HD) of the CRC and may prevent the
 | ||
| extraction of perfectly good data. For example, if the CRC is used and the
 | ||
| bit enabling it is reset by a bit flip, then the header seems to be intact
 | ||
| (in spite of being corrupt) while the compressed blocks seem to be totally
 | ||
| unrecoverable (in spite of being intact). Very misleading indeed.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Metadata</span></samp>'<dd>
 | ||
| The gzip format stores some metadata, like the modification time of the
 | ||
| original file or the operating system on which compression took place. This
 | ||
| complicates reproducible compression (obtaining identical compressed output
 | ||
| from identical input).
 | ||
| 
 | ||
| </dl>
 | ||
| 
 | ||
| <h4 class="subsection">4.1.2 Lzip format improvements over gzip and bzip2</h4>
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt>'<samp><span class="samp">64-bit size field</span></samp>'<dd>
 | ||
| Probably the most frequently reported shortcoming of the gzip format is that
 | ||
| it only stores the least significant 32 bits of the uncompressed size. The
 | ||
| size of any file larger than or equal to 4 GiB<!-- /@w --> gets truncated.
 | ||
| 
 | ||
|      <p>Bzip2 does not store the uncompressed size of the file.
 | ||
| 
 | ||
|      <p>The lzip format provides a 64-bit field for the uncompressed size. 
 | ||
| Additionally, lzip produces multimember output automatically when the size
 | ||
| is too large for a single member, allowing for an unlimited uncompressed
 | ||
| size.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Distributed index</span></samp>'<dd>
 | ||
| The lzip format provides a distributed index that, among other things, helps
 | ||
| plzip to decompress several times faster than pigz and helps lziprecover do
 | ||
| its job. Neither the gzip format nor the bzip2 format provides an index.
 | ||
| 
 | ||
|      <p>A distributed index is safer and more scalable than a monolithic index. The
 | ||
| monolithic index introduces a single point of failure in the compressed file
 | ||
| and may limit the number of members or the total uncompressed size.
 | ||
| 
 | ||
| </dl>
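   <p>The way the member size field acts as a distributed index can be sketched
in Python as follows. This is a minimal illustration, not lzip's code: it
assumes the member layout given in the File format chapter (a 6-byte header
and a 20-byte trailer whose last 8 bytes hold the little-endian member size)
and a file without trailing data. The names '<samp><span class="samp">list_members</span></samp>' and
'<samp><span class="samp">fake_member</span></samp>' are hypothetical.

```python
import struct

HEADER_SIZE = 6    # ID string (4) + VN (1) + DS (1)
TRAILER_SIZE = 20  # CRC32 (4) + data size (8) + member size (8)


def list_members(data: bytes):
    """Return (offset, member_size, data_size) for each member, first to last.

    The file is walked backwards: the member size stored at the end of each
    trailer tells where that member starts, and therefore where the previous
    member ends.
    """
    members = []
    end = len(data)
    while end > 0:
        if end < HEADER_SIZE + TRAILER_SIZE:
            raise ValueError("truncated member")
        _crc, data_size, member_size = struct.unpack_from(
            "<IQQ", data, end - TRAILER_SIZE)
        start = end - member_size
        if member_size < HEADER_SIZE + TRAILER_SIZE or start < 0:
            raise ValueError("bad member size")
        if data[start:start + 4] != b"LZIP":
            raise ValueError("missing ID string")
        members.append((start, member_size, data_size))
        end = start
    members.reverse()
    return members


def fake_member(payload: bytes, data_size: int) -> bytes:
    """Build a structurally plausible member around an opaque payload.

    For demonstration only: the payload is NOT a valid LZMA stream and the
    CRC is left at 0.
    """
    header = b"LZIP" + bytes([1, 0x0C])
    member_size = HEADER_SIZE + len(payload) + TRAILER_SIZE
    return header + payload + struct.pack("<IQQ", 0, data_size, member_size)
```

   <p>Because each member carries its own size, damage to one trailer only
affects the walk at that point; there is no central table whose loss would
hide every member at once.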
 | ||
| 
 | ||
| <h3 class="section">4.2 Quality of implementation</h3>
 | ||
| 
 | ||
| <p>Our civilization depends critically on software; it had better be quality
 | ||
| software.<br>
 | ||
| -- Bjarne Stroustrup
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt>'<samp><span class="samp">Accurate and robust error detection</span></samp>'<dd>
 | ||
| The lzip format provides 3-factor integrity checking, and the decompressors
 | ||
| report mismatches in each factor separately. This method detects most false
 | ||
| positives for corruption. If just one byte in one factor fails but the other
 | ||
| two factors match the data, it probably means that the data are intact and
 | ||
| the corruption just affects the mismatching factor (CRC, data size, or
 | ||
| member size) in the member trailer.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Multiple implementations</span></samp>'<dd>
 | ||
| Just as the lzip format provides 3-factor protection against undetected
 | ||
| data corruption, the development methodology of the lzip family of
 | ||
| compressors provides 3-factor protection against undetected programming
 | ||
| errors.
 | ||
| 
 | ||
|      <p>Three related but independent compressor implementations, lzip, clzip, and
 | ||
| minilzip/lzlib, are developed concurrently. Every stable release of any of
 | ||
| them is tested to check that it produces identical output to the other two. 
 | ||
| This guarantees that all three implement the same algorithm, and makes it
 | ||
| unlikely that any of them may contain serious undiscovered errors. In fact,
 | ||
| no errors have been discovered in lzip since 2009.
 | ||
| 
 | ||
|      <p>Additionally, the three implementations have been extensively tested with
 | ||
| <a href="http://www.nongnu.org/lzip/manual/lziprecover_manual.html#Unzcrash">unzcrash</a>,
 | ||
| valgrind, and '<samp><span class="samp">american fuzzy lop</span></samp>' without finding a single
 | ||
| vulnerability or false negative.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Dictionary size</span></samp>'<dd>
 | ||
| Lzip automatically adapts the dictionary size to the size of each file. 
 | ||
| In addition to reducing the amount of memory required for decompression,
 | ||
| this feature also minimizes the probability of being affected by RAM errors
 | ||
| during compression. <!-- key4_mask -->
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Exit status</span></samp>'<dd>
 | ||
| Returning a warning status of 2 is a design flaw of compress that leaked
 | ||
| into the design of gzip. Both bzip2 and lzip are free from this flaw.
 | ||
| 
 | ||
|    </dl>
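   <p>The 3-factor check described under 'Accurate and robust error detection'
can be sketched in Python as follows. This is a hedged illustration, not the
code of any lzip decompressor: it assumes the trailer layout given in the File
format chapter and assumes that Python's '<samp><span class="samp">zlib.crc32</span></samp>' computes the
same CRC32 as lzip. The function name '<samp><span class="samp">check_trailer</span></samp>' is hypothetical.

```python
import struct
import zlib


def check_trailer(decoded: bytes, member: bytes):
    """Compare each trailer factor separately and list the ones that mismatch.

    'decoded' is the data obtained by decompressing 'member'.
    Assumption: zlib.crc32 matches the CRC32 variant used by lzip.
    """
    stored_crc, stored_data_size, stored_member_size = struct.unpack(
        "<IQQ", member[-20:])
    mismatches = []
    if zlib.crc32(decoded) != stored_crc:
        mismatches.append("CRC")
    if len(decoded) != stored_data_size:
        mismatches.append("data size")
    if len(member) != stored_member_size:
        mismatches.append("member size")
    return mismatches
```

   <p>If the returned list contains a single factor while the other two match,
the damage is probably confined to that field of the trailer, as explained
above.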
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Algorithm"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#File-format">File format</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Quality-assurance">Quality assurance</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">5 Algorithm</h2>
 | ||
| 
 | ||
| <p><a name="index-algorithm-8"></a>
 | ||
| In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
 | ||
| concrete algorithm; it is more like "any algorithm using the LZMA coding
 | ||
| scheme". LZMA compression consists of describing the uncompressed data as a
 | ||
| succession of coding sequences from the set shown in Section '<samp><span class="samp">What is
 | ||
| coded</span></samp>' (see <a href="#what_002dis_002dcoded">what-is-coded</a>), and then encoding them using a range
 | ||
| encoder. For example, the option <samp><span class="option">-0</span></samp> of lzip uses the scheme in almost
 | ||
| the simplest way possible; issuing the longest match it can find, or a
 | ||
| literal byte if it can't find a match. Conversely, a much more elaborate way
 | ||
| of finding coding sequences of minimum size than the one currently used by
 | ||
| lzip could be developed, and the resulting sequence could also be coded
 | ||
| using the LZMA coding scheme.
 | ||
| 
 | ||
|    <p>Lzip currently implements two variants of the LZMA algorithm: fast
 | ||
| (used by option <samp><span class="option">-0</span></samp>) and normal (used by all other compression levels).
 | ||
| 
 | ||
|    <p>The high compression of LZMA comes from combining two basic, well-proven
 | ||
| compression ideas: sliding dictionaries (LZ77) and Markov models (the thing
 | ||
| used by every compression algorithm that uses a range encoder or similar
 | ||
| order-0 entropy coder as its last stage) with segregation of contexts
 | ||
| according to what the bits are used for.
 | ||
| 
 | ||
|    <p>Lzip is a two-stage compressor. The first stage is a Lempel-Ziv coder,
 | ||
| which reduces redundancy by translating chunks of data to their
 | ||
| corresponding distance-length pairs. The second stage is a range encoder
 | ||
| that uses a different probability model for each type of data:
 | ||
| distances, lengths, literal bytes, etc.
 | ||
| 
 | ||
|    <p>Here is how it works, step by step:
 | ||
| 
 | ||
|    <p>1) The member header is written to the output stream.
 | ||
| 
 | ||
|    <p>2) The first byte is coded literally, because there are no previous
 | ||
| bytes to which the match finder can refer.
 | ||
| 
 | ||
|    <p>3) The main encoder advances to the next byte in the input data and
 | ||
| calls the match finder.
 | ||
| 
 | ||
|    <p>4) The match finder fills an array with the minimum distances before the
 | ||
| current byte where a match of a given length can be found.
 | ||
| 
 | ||
|    <p>5) Go back to step 3 until a sequence (formed of pairs, repeated
 | ||
| distances, and literal bytes) of minimum price has been formed, where the
 | ||
| price represents the number of output bits produced.
 | ||
| 
 | ||
|    <p>6) The range encoder encodes the sequence produced by the main encoder
 | ||
| and sends the bytes produced to the output stream.
 | ||
| 
 | ||
|    <p>7) Go back to step 3 until the input data are finished or until the
 | ||
| member or volume size limits are reached.
 | ||
| 
 | ||
|    <p>8) The range encoder is flushed.
 | ||
| 
 | ||
|    <p>9) The member trailer is written to the output stream.
 | ||
| 
 | ||
|    <p>10) If there are more data to compress, go back to step 1.
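   <p>The first stage can be illustrated with a deliberately naive Python
sketch. It follows the simplest strategy mentioned above (issue the longest
match found, or a literal byte if there is none) and omits price optimization,
repeated distances, and the range encoder. It is not lzip's match finder; the
function names, the quadratic search, and the default dictionary size are
assumptions made for the example.

```python
MIN_MATCH_LEN = 2    # shortest length the format can code
MAX_MATCH_LEN = 273  # longest length the format can code


def greedy_parse(data: bytes, dictionary_size: int = 4096):
    """Split data into ('literal', byte) and ('match', distance, length) tokens.

    Distances are 0-based: distance 0 refers to the byte just before the
    current position. The search is O(n^2) and only meant as an illustration.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dis = 0, 0
        window_start = max(0, pos - dictionary_size)
        # Nearest candidates first, so that ties keep the minimum distance.
        for cand in range(pos - 1, window_start - 1, -1):
            length = 0
            while (length < MAX_MATCH_LEN and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dis = length, pos - cand - 1
        if best_len >= MIN_MATCH_LEN:
            tokens.append(("match", best_dis, best_len))
            pos += best_len
        else:
            tokens.append(("literal", data[pos]))
            pos += 1
    return tokens


def expand(tokens) -> bytes:
    """Rebuild the original data from the tokens."""
    out = bytearray()
    for token in tokens:
        if token[0] == "literal":
            out.append(token[1])
        else:
            _, distance, length = token
            for _ in range(length):
                out.append(out[-1 - distance])  # source may overlap the output
    return bytes(out)
```

   <p>Note that the first byte is always emitted as a literal, as in step 2,
because no earlier byte exists for a match to refer to.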
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| During compression, lzip reads data in large blocks (one dictionary size at
 | ||
| a time). Therefore it may block for up to tens of seconds any process
 | ||
| feeding data to it through a pipe. This is normal. The blocking intervals
 | ||
| get longer with higher compression levels because dictionary size increases
 | ||
| (and compression speed decreases) with compression level.
 | ||
| 
 | ||
| <p class="noindent">The ideas embodied in lzip are due to (at least) the following people:
 | ||
| Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrei Markov (for the
 | ||
| definition of Markov chains), G.N.N. Martin (for the definition of range
 | ||
| encoding), Igor Pavlov (for putting all the above together in LZMA), and
 | ||
| Julian Seward (for bzip2's CLI).
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="File-format"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Stream-format">Stream format</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Algorithm">Algorithm</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">6 File format</h2>
 | ||
| 
 | ||
| <p><a name="index-file-format-9"></a>
 | ||
| Perfection is reached, not when there is no longer anything to add, but
 | ||
| when there is no longer anything to take away.<br>
 | ||
| -- Antoine de Saint-Exupery
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| In the diagram below, a box like this:
 | ||
| 
 | ||
| <pre class="verbatim">+---+
 | ||
| |   | <-- the vertical bars might be missing
 | ||
| +---+
 | ||
| </pre>
 | ||
| 
 | ||
|    <p>represents one byte; a box like this:
 | ||
| 
 | ||
| <pre class="verbatim">+==============+
 | ||
| |              |
 | ||
| +==============+
 | ||
| </pre>
 | ||
| 
 | ||
|    <p>represents a variable number of bytes.
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| A lzip file consists of one or more independent "members" (compressed data
 | ||
| sets). The members simply appear one after another in the file, with no
 | ||
| additional information before, between, or after them. Each member can
 | ||
| encode in compressed form up to 16 EiB - 1 byte<!-- /@w --> of uncompressed data. 
 | ||
| The size of a multimember file is unlimited.
 | ||
| 
 | ||
|    <p>Each member has the following structure:
 | ||
| 
 | ||
| <pre class="verbatim">+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | ||
| | ID string | VN | DS | LZMA stream | CRC32 |   Data size   |  Member size  |
 | ||
| +--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | ||
| </pre>
 | ||
| 
 | ||
|    <p>All multibyte values are stored in little endian order.
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt>'<samp><span class="samp">ID string (the "magic" bytes)</span></samp>'<dd>A four byte string, identifying the lzip format, with the value "LZIP"
 | ||
| (0x4C, 0x5A, 0x49, 0x50).
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">VN (version number, 1 byte)</span></samp>'<dd>Just in case something needs to be modified in the future. 1 for now.
 | ||
| 
 | ||
|      <p><a name="coded_002ddict_002dsize"></a><br><dt>'<samp><span class="samp">DS (coded dictionary size, 1 byte)</span></samp>'<dd>The dictionary size is calculated by taking a power of 2 (the base size)
 | ||
| and subtracting from it a fraction between 0/16 and 7/16 of the base size.<br>
 | ||
| Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).<br>
 | ||
| Bits 7-5 contain the numerator of the fraction (0 to 7) to subtract
 | ||
| from the base size to obtain the dictionary size.<br>
 | ||
| Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB<br>
 | ||
| Valid values for dictionary size range from 4 KiB to 512 MiB.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">LZMA stream</span></samp>'<dd>The LZMA stream, finished by an "End Of Stream" marker. Uses default values
 | ||
| for encoder properties. See <a href="#Stream-format">Stream format</a>, for a complete description.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">CRC32 (4 bytes)</span></samp>'<dd>Cyclic Redundancy Check (CRC) of the original uncompressed data.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Data size (8 bytes)</span></samp>'<dd>Size of the original uncompressed data.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">Member size (8 bytes)</span></samp>'<dd>Total size of the member, including header and trailer. This field acts
 | ||
| as a distributed index, improves the checking of stream integrity, and
 | ||
| facilitates the safe recovery of undamaged members from multimember files. 
 | ||
| Lzip limits the member size to 2 PiB<!-- /@w --> to prevent the data size field from
 | ||
| overflowing.
 | ||
| 
 | ||
|    </dl>
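   <p>The header fields described above can be decoded with a few lines of
Python. This is a minimal sketch, not code taken from lzip: the function names
are hypothetical, and rejecting any decoded size outside the 4 KiB to 512 MiB
range is a choice made for this example.

```python
import struct

MIN_DICTIONARY_SIZE = 1 << 12  # 4 KiB
MAX_DICTIONARY_SIZE = 1 << 29  # 512 MiB


def decode_dictionary_size(ds: int) -> int:
    """Decode the DS byte into a dictionary size in bytes."""
    base_log2 = ds & 0x1F         # bits 4-0: base 2 logarithm of the base size
    numerator = (ds >> 5) & 0x07  # bits 7-5: sixteenths to subtract
    if not 12 <= base_log2 <= 29:
        raise ValueError("base size out of range")
    base_size = 1 << base_log2
    size = base_size - numerator * (base_size // 16)
    if not MIN_DICTIONARY_SIZE <= size <= MAX_DICTIONARY_SIZE:
        raise ValueError("dictionary size out of range")
    return size


def parse_header(member: bytes):
    """Return (version, dictionary_size) from the 6-byte member header."""
    if len(member) < 6:
        raise ValueError("header too short")
    magic, version, ds = struct.unpack_from("<4sBB", member, 0)
    if magic != b"LZIP":
        raise ValueError("bad ID string")
    return version, decode_dictionary_size(ds)
```

   <p>With the example value from the DS description,
'<samp><span class="samp">decode_dictionary_size(0xD3)</span></samp>' yields 327680 bytes, that is, 320 KiB.
The LZMA stream always begins at offset 6, immediately after these fields.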
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Stream-format"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Trailing-data">Trailing data</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#File-format">File format</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">7 Format of the LZMA stream in lzip files</h2>
 | ||
| 
 | ||
| <p><a name="index-format-of-the-LZMA-stream-10"></a>
 | ||
| The LZMA algorithm has three parameters, called "special LZMA
 | ||
| properties", to adjust it for some kinds of binary data. These
 | ||
| parameters are: '<samp><span class="samp">literal_context_bits</span></samp>' (with a default value of 3),
 | ||
| '<samp><span class="samp">literal_pos_state_bits</span></samp>' (with a default value of 0), and
 | ||
| '<samp><span class="samp">pos_state_bits</span></samp>' (with a default value of 2). As a general purpose
 | ||
| compressor, lzip only uses the default values for these parameters. In
 | ||
| particular '<samp><span class="samp">literal_pos_state_bits</span></samp>' has been optimized away and
 | ||
| does not even appear in the code.
 | ||
| 
 | ||
|    <p>Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker (the
 | ||
| distance-length pair 0xFFFFFFFFU, 2<!-- /@w -->), which in conjunction with the
 | ||
| '<samp><span class="samp">member size</span></samp>' field in the member trailer allows the checking of stream
 | ||
| integrity. The EOS marker is the only LZMA marker allowed in lzip files. The
 | ||
| LZMA stream in lzip files always has these two features (default properties
 | ||
| and EOS marker) and is referred to in this document as LZMA-302eos. This
 | ||
| simplified and marker-terminated form of the LZMA stream format has been
 | ||
| chosen to maximize interoperability and safety.
 | ||
| 
 | ||
|    <p>The second stage of LZMA is a range encoder that uses a different
 | ||
| probability model for each type of symbol: distances, lengths, literal
 | ||
| bytes, etc. Range encoding conceptually encodes all the symbols of the
 | ||
| message into one number. Unlike Huffman coding, which assigns to each
 | ||
| symbol a bit-pattern and concatenates all the bit-patterns together,
 | ||
| range encoding can compress one symbol to less than one bit. Therefore
 | ||
| the compressed data produced by a range encoder can't be split in pieces
 | ||
| that could be described individually.
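   <p>The claim that a symbol can cost less than one bit follows from the
ideal code length of a symbol with probability p, which is -log2(p) bits. The
following Python sketch only evaluates that formula; it is not a range
encoder, and the function name is hypothetical.

```python
import math


def cost_in_bits(probability: float) -> float:
    """Ideal code length, in bits, of a symbol with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)
```

   <p>A bit predicted with probability 0.9 ideally costs about 0.15 bits,
whereas a code that assigns a whole bit-pattern to each symbol cannot spend
less than 1 bit on it.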
 | ||
| 
 | ||
|    <p>It seems that the only way of describing the LZMA-302eos stream is to
 | ||
| describe the algorithm that decodes it. And given the many details
 | ||
| about the range decoder that need to be described accurately, the source
 | ||
| code of a real decompressor seems the only appropriate reference to use.
 | ||
| 
 | ||
|    <p>What follows is a description of the decoding algorithm for LZMA-302eos
 | ||
| streams using as reference the source code of "lzd", an educational
 | ||
| decompressor for lzip files, included in appendix A. See <a href="#Reference-source-code">Reference source code</a>. Lzd is written in C++11 and can be downloaded from the lzip download
 | ||
| directory.
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h3 class="section">7.1 What is coded</h3>
 | ||
| 
 | ||
| <p><a name="what_002dis_002dcoded"></a>The LZMA stream includes literals, matches, and repeated matches (matches
 | ||
| reusing a recently used distance). There are 7 different coding sequences:
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="35%">Bit sequence </th><th valign="top" width="14%">Name </th><th valign="top" width="51%">Description
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="35%">0 + byte </td><td valign="top" width="14%">literal </td><td valign="top" width="51%">literal byte
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 0 + len + dis </td><td valign="top" width="14%">match </td><td valign="top" width="51%">distance-length pair
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 1 + 0 + 0 </td><td valign="top" width="14%">shortrep </td><td valign="top" width="51%">1 byte match at latest used distance
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 1 + 0 + 1 + len </td><td valign="top" width="14%">rep0 </td><td valign="top" width="51%">len bytes match at latest used distance
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 1 + 1 + 0 + len </td><td valign="top" width="14%">rep1 </td><td valign="top" width="51%">len bytes match at second
 | ||
| latest used distance
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 1 + 1 + 1 + 0 + len </td><td valign="top" width="14%">rep2 </td><td valign="top" width="51%">len bytes match at third
 | ||
| latest used distance
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="35%">1 + 1 + 1 + 1 + 1 + len </td><td valign="top" width="14%">rep3 </td><td valign="top" width="51%">len bytes match at fourth
 | ||
| latest used distance
 | ||
|    <br></td></tr></table>
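   <p>The table above is a prefix code, so the type of a sequence can be found
by reading bits until one of the seven patterns is complete. The Python sketch
below shows that decision tree on a plain list of bits. In a real stream each
of these bits comes from the range decoder, which is not modelled here, and
the function name is hypothetical.

```python
def sequence_type(bits):
    """Consume the type bits from an iterable of 0/1 values and name the type.

    Only the bits selecting the type are examined; the payload that follows
    ('byte', 'len', 'dis') is ignored.
    """
    bits = iter(bits)
    if next(bits) == 0:
        return "literal"
    if next(bits) == 0:
        return "match"
    if next(bits) == 0:
        return "shortrep" if next(bits) == 0 else "rep0"
    if next(bits) == 0:
        return "rep1"
    return "rep2" if next(bits) == 0 else "rep3"
```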
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| In the following tables, multibit sequences are coded in normal order,
 | ||
| from most significant bit (MSB) to least significant bit (LSB), except
 | ||
| where noted otherwise.
 | ||
| 
 | ||
|    <p>Lengths (the '<samp><span class="samp">len</span></samp>' in the table above) are coded as follows:
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="50%">Bit sequence </th><th valign="top" width="50%">Description
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="50%">0 + 3 bits </td><td valign="top" width="50%">lengths from 2 to 9
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="50%">1 + 0 + 3 bits </td><td valign="top" width="50%">lengths from 10 to 17
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="50%">1 + 1 + 8 bits </td><td valign="top" width="50%">lengths from 18 to 273
 | ||
|    <br></td></tr></table>
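   <p>As a hedged illustration of the length table, the Python sketch below
maps a length to its prefix bits and payload. It assumes that the payload
holds the offset of the length from the start of its range, which is
consistent with the sizes of the ranges (8, 8, and 256 values) but is an
assumption of this example. The function name is hypothetical.

```python
def length_code(length: int):
    """Return (prefix_bits, payload_width, payload_value) for a match length."""
    if 2 <= length <= 9:
        return ((0,), 3, length - 2)
    if 10 <= length <= 17:
        return ((1, 0), 3, length - 10)
    if 18 <= length <= 273:
        return ((1, 1), 8, length - 18)
    raise ValueError("length out of range")
```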
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| The coding of distances is a little more complicated, so I'll begin by
 | ||
| explaining a simpler version of the encoding.
 | ||
| 
 | ||
|    <p>Imagine you need to encode a number from 0 to 2^32 - 1<!-- /@w -->, and you want to
 | ||
| do it in a way that produces shorter codes for the smaller numbers. You may
 | ||
| first encode the position of the most significant bit that is set to 1,
 | ||
| which you may find by making a bit scan from the left (from the MSB). A
 | ||
| position of 0 means that the number is 0 (no bit is set), 1 means the LSB is
 | ||
| the first bit set (the number is 1), and 32 means the MSB is set (i.e., the
 | ||
| number is >= 0x80000000<!-- /@w -->). Then, if the position is >= 2<!-- /@w -->, you encode
 | ||
| the remaining position - 1<!-- /@w --> bits. Let's call these bits "direct bits"
 | ||
| because they are coded directly by value instead of indirectly by position.
 | ||
| 
 | ||
|    <p>The drawback of this simple method is that it needs 6 bits to encode the
 | ||
| position, but it just uses 33 of the 64 possible values, wasting almost half
 | ||
| of the codes.
 | ||
| 
 | ||
|    <p>The intelligent trick of LZMA is that it encodes in what it calls a "slot"
 | ||
| the position of the most significant bit set, along with the value of the
 | ||
| next bit, using the same 6 bits that it would take to encode the position
 | ||
| alone. This seems to need 66 slots (twice the number of positions), but for
 | ||
| positions 0 and 1 there is no next bit, so the number of slots needed is 64
 | ||
| (0 to 63).
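   <p>The slot computation described above can be written as a short Python
function. This is a sketch derived from the description, not code from lzip;
the name '<samp><span class="samp">distance_slot</span></samp>' is hypothetical. The variable
'<samp><span class="samp">msb</span></samp>' is the 0-based index of the most significant bit set, that is,
the position used in the text minus 1.

```python
def distance_slot(distance: int) -> int:
    """Return the slot (0 to 63) of a distance in the range 0 to 2^32 - 1."""
    if not 0 <= distance < 2**32:
        raise ValueError("distance out of range")
    if distance < 2:
        return distance              # positions 0 and 1 have no next bit
    msb = distance.bit_length() - 1  # index of the most significant bit set
    next_bit = (distance >> (msb - 1)) & 1
    return 2 * msb + next_bit
```

   <p>Distances 0 to 3 map to slots 0 to 3, distance 127 maps to slot 13, and
distance 128 maps to slot 14.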
 | ||
| 
 | ||
|    <p>The 6 bits representing this "slot number" are then context-coded. If
 | ||
| the distance is >= 4<!-- /@w -->, the remaining bits are encoded as follows. 
 | ||
| '<samp><span class="samp">direct_bits</span></samp>' is the number of remaining bits (from 1 to 30) needed
 | ||
| to form a complete distance, and is calculated as (slot >> 1) - 1<!-- /@w -->. 
 | ||
| If a distance needs 6 or more direct_bits, the last 4 bits are encoded
 | ||
| separately. The last piece (all the direct_bits for distances 4 to 127
 | ||
| (slots 4 to 13), or the last 4 bits for distances >= 128<!-- /@w -->
 | ||
| (slot >= 14)<!-- /@w -->) is context-coded in reverse order (from LSB to MSB). For
 | ||
| distances >= 128<!-- /@w -->, the '<samp><span class="samp">direct_bits - 4</span></samp>'<!-- /@w --> part is encoded with
 | ||
| fixed 0.5 probability.
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="50%">Bit sequence </th><th valign="top" width="50%">Description
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="50%">slot </td><td valign="top" width="50%">distances from 0 to 3
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="50%">slot + direct_bits </td><td valign="top" width="50%">distances from 4 to 127
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="50%">slot + (direct_bits - 4) + 4 bits </td><td valign="top" width="50%">distances from 128 to 2^32 - 1
 | ||
|    <br></td></tr></table>
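   <p>The relation between slot, direct bits, and distance can be sketched in
Python as follows. The sketch works on plain integers: the value of the direct
bits is passed in normal order, and neither the context coding nor the reverse
bit order of the last piece is modelled. All function names are hypothetical.

```python
def direct_bits_for_slot(slot: int) -> int:
    """Number of direct bits that follow a slot (0 for slots 0 to 3)."""
    if not 0 <= slot <= 63:
        raise ValueError("slot out of range")
    return (slot >> 1) - 1 if slot >= 4 else 0


def split_direct_bits(slot: int):
    """Return (fixed_probability_bits, context_coded_bits) for a slot."""
    direct_bits = direct_bits_for_slot(slot)
    if slot < 14:
        return (0, direct_bits)  # distances 0 to 127
    return (direct_bits - 4, 4)  # distances >= 128


def distance_from_slot(slot: int, direct_value: int = 0) -> int:
    """Rebuild a distance from its slot and the value of its direct bits."""
    direct_bits = direct_bits_for_slot(slot)
    if slot < 4:
        return slot
    if not 0 <= direct_value < (1 << direct_bits):
        raise ValueError("direct value does not fit in direct_bits")
    # The slot supplies the two leading bits: a 1 followed by (slot & 1).
    return ((2 | (slot & 1)) << direct_bits) + direct_value
```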
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h3 class="section">7.2 The coding contexts</h3>
 | ||
| 
 | ||
|    <p>These contexts ('<samp><span class="samp">Bit_model</span></samp>' in the source) are integers or arrays
 | ||
| of integers representing the probability of the corresponding bit being 0.
 | ||
| 
 | ||
|    <p>The indices used in these arrays are:
 | ||
| 
 | ||
|      <dl>
 | ||
| <dt>'<samp><span class="samp">state</span></samp>'<dd>A state machine ('<samp><span class="samp">State</span></samp>' in the source) with 12 states (0 to 11),
 | ||
| coding the latest 2 to 4 types of sequences processed. The initial state
 | ||
| is 0.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">pos_state</span></samp>'<dd>Value of the 2 least significant bits of the current position in the
 | ||
| decoded data.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">literal_state</span></samp>'<dd>Value of the 3 most significant bits of the latest byte decoded.
 | ||
| 
 | ||
|      <br><dt>'<samp><span class="samp">len_state</span></samp>'<dd>Coded value of the current match length (length - 2)<!-- /@w -->, with a maximum
 | ||
| of 3. The resulting value is in the range 0 to 3.
 | ||
| 
 | ||
|    </dl>
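   <p>Three of these indices are simple functions of values the decoder already
has. The Python sketch below computes them from the definitions above; the
function names mirror the index names but are otherwise hypothetical and are
not taken from the lzip source.

```python
def pos_state(position: int) -> int:
    """The 2 least significant bits of the current position."""
    return position & 3


def literal_state(prev_byte: int) -> int:
    """The 3 most significant bits of the latest byte decoded."""
    return (prev_byte & 0xFF) >> 5


def len_state(length: int) -> int:
    """Match length minus 2, limited to a maximum of 3."""
    return min(length - 2, 3)
```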
 | ||
| 
 | ||
|    <p>The types of previous sequences corresponding to each state are shown in the
 | ||
| following table. '<samp><span class="samp">!literal</span></samp>' is any sequence except a literal byte. 
 | ||
| '<samp><span class="samp">rep</span></samp>' is any one of '<samp><span class="samp">rep0</span></samp>', '<samp><span class="samp">rep1</span></samp>', '<samp><span class="samp">rep2</span></samp>', or
 | ||
| '<samp><span class="samp">rep3</span></samp>'. The last type in each line is the most recent.
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top">State </th><th valign="top">Types of previous sequences
 | ||
| <br></th></tr><tr align="left"><td valign="top">0 </td><td valign="top">literal, literal, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">1 </td><td valign="top">match, literal, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">2 </td><td valign="top">rep or (!literal, shortrep), literal, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">3 </td><td valign="top">literal, shortrep, literal, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">4 </td><td valign="top">match, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">5 </td><td valign="top">rep or (!literal, shortrep), literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">6 </td><td valign="top">literal, shortrep, literal
 | ||
| <br></td></tr><tr align="left"><td valign="top">7 </td><td valign="top">literal, match
 | ||
| <br></td></tr><tr align="left"><td valign="top">8 </td><td valign="top">literal, rep
 | ||
| <br></td></tr><tr align="left"><td valign="top">9 </td><td valign="top">literal, shortrep
 | ||
| <br></td></tr><tr align="left"><td valign="top">10 </td><td valign="top">!literal, match
 | ||
| <br></td></tr><tr align="left"><td valign="top">11 </td><td valign="top">!literal, (rep or shortrep)
 | ||
|    <br></td></tr></table>
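   <p>The transitions implied by the table above can be sketched in Python as
follows. The transition rules are derived from the table (for example, state 7
is 'literal, match', so a literal leads to state 4, 'match, literal'); the
names '<samp><span class="samp">NEXT_AFTER_LITERAL</span></samp>' and '<samp><span class="samp">next_state</span></samp>' are hypothetical.

```python
# New state after a literal, indexed by the current state (0 to 11).
NEXT_AFTER_LITERAL = (0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5)


def next_state(state: int, sequence: str) -> int:
    """Return the state reached after processing one sequence.

    'sequence' is one of 'literal', 'match', 'rep' (rep0 to rep3),
    or 'shortrep'.
    """
    if not 0 <= state <= 11:
        raise ValueError("state out of range")
    if sequence == "literal":
        return NEXT_AFTER_LITERAL[state]
    previous_was_literal = state < 7  # states 0 to 6 all end with a literal
    if sequence == "match":
        return 7 if previous_was_literal else 10
    if sequence == "rep":
        return 8 if previous_was_literal else 11
    if sequence == "shortrep":
        return 9 if previous_was_literal else 11
    raise ValueError("unknown sequence type")
```

   <p>Starting from the initial state 0, a match followed by three literals
passes through states 7, 4, and 1 and returns to state 0.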
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| The contexts for decoding the type of coding sequence are:
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="20%">Name </th><th valign="top" width="35%">Indices </th><th valign="top" width="45%">Used when
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="20%">bm_match </td><td valign="top" width="35%">state, pos_state </td><td valign="top" width="45%">sequence start
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_rep </td><td valign="top" width="35%">state </td><td valign="top" width="45%">after sequence 1
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_rep0 </td><td valign="top" width="35%">state </td><td valign="top" width="45%">after sequence 11
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_rep1 </td><td valign="top" width="35%">state </td><td valign="top" width="45%">after sequence 111
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_rep2 </td><td valign="top" width="35%">state </td><td valign="top" width="45%">after sequence 1111
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_len </td><td valign="top" width="35%">state, pos_state </td><td valign="top" width="45%">after sequence 110
 | ||
|    <br></td></tr></table>
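<p>The bit sequences in the table select the type of coding sequence. As a hedged sketch, the helper below maps a prefix of decoded bits to the type it selects, following the order of the tests in '<samp><span class="samp">decode_member</span></samp>'. The function name and the string representation of the bits are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical helper: 'bits' holds the decoded bits as '0' and '1' characters.
// Returns the selected sequence type, or "incomplete" if more bits are needed.
inline std::string sequence_type( const std::string & bits )
  {
  struct Row { const char * prefix; const char * type; };
  static const Row rows[] = {
    { "0",     "literal"  },	// bm_match == 0
    { "10",    "match"    },	// bm_match == 1, bm_rep == 0
    { "1100",  "shortrep" },	// bm_rep0 == 0, bm_len == 0
    { "1101",  "rep0"     },	// bm_rep0 == 0, bm_len == 1
    { "1110",  "rep1"     },	// bm_rep0 == 1, bm_rep1 == 0
    { "11110", "rep2"     },	// bm_rep1 == 1, bm_rep2 == 0
    { "11111", "rep3"     } };	// bm_rep1 == 1, bm_rep2 == 1
  for( const Row & r : rows )
    {
    const std::string prefix( r.prefix );
    if( bits.compare( 0, prefix.size(), prefix ) == 0 ) return r.type;
    }
  return "incomplete";
  }
```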
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| The contexts for decoding distances are:
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="20%">Name </th><th valign="top" width="30%">Indices </th><th valign="top" width="50%">Used when
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="20%">bm_dis_slot </td><td valign="top" width="30%">len_state, bit tree </td><td valign="top" width="50%">distance start
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_dis </td><td valign="top" width="30%">reverse bit tree </td><td valign="top" width="50%">after slots 4 to 13
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_align </td><td valign="top" width="30%">reverse bit tree </td><td valign="top" width="50%">for distances >= 128, after
 | ||
| fixed probability bits
 | ||
|    <br></td></tr></table>
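<p>The relation between distance slots and contexts can be sketched as follows. The arithmetic is the one used in '<samp><span class="samp">decode_member</span></samp>'. The struct '<samp><span class="samp">Slot_info</span></samp>' and the function '<samp><span class="samp">slot_info</span></samp>' are hypothetical names introduced only for this example, and the distance values are the ones stored in '<samp><span class="samp">rep0</span></samp>' by the reference source.

```cpp
enum { start_dis_model = 4, end_dis_model = 14, dis_align_bits = 4 };

struct Slot_info			// hypothetical, for illustration only
  {
  unsigned base;		// smallest distance value coded by this slot
  int direct_bits;		// number of extra bits following the slot
  const char * context;		// how the extra bits are decoded
  };

inline Slot_info slot_info( const unsigned dis_slot )
  {
  // Slots 0 to 3 are the distance value itself; no extra bits follow.
  if( dis_slot < start_dis_model ) return { dis_slot, 0, "none" };
  const int direct_bits = ( dis_slot >> 1 ) - 1;
  const unsigned base = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
  // Slots 4 to 13: all the extra bits use the reverse bit tree bm_dis.
  if( dis_slot < end_dis_model ) return { base, direct_bits, "bm_dis" };
  // Slots 14 and up: fixed probability bits, then dis_align_bits in bm_align.
  return { base, direct_bits, "fixed probability bits + bm_align" };
  }
```

<p>With these definitions slot 13 starts at 96 with 5 extra bits, and slot 14 starts at 128 with 6 extra bits, which is where '<samp><span class="samp">bm_align</span></samp>' begins to be used.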
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| There are two separate sets of contexts for lengths ('<samp><span class="samp">Len_model</span></samp>' in
 | ||
| the source): one for normal matches, the other for repeated matches. The
 | ||
| contexts in each Len_model are (see '<samp><span class="samp">decode_len</span></samp>' in the source):
 | ||
| 
 | ||
|    <p><table summary=""><tr align="left"><th valign="top" width="20%">Name </th><th valign="top" width="40%">Indices </th><th valign="top" width="40%">Used when
 | ||
| <br></th></tr><tr align="left"><td valign="top" width="20%">choice1 </td><td valign="top" width="40%">none </td><td valign="top" width="40%">length start
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">choice2 </td><td valign="top" width="40%">none </td><td valign="top" width="40%">after sequence 1
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_low </td><td valign="top" width="40%">pos_state, bit tree </td><td valign="top" width="40%">after sequence 0
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_mid </td><td valign="top" width="40%">pos_state, bit tree </td><td valign="top" width="40%">after sequence 10
 | ||
| <br></td></tr><tr align="left"><td valign="top" width="20%">bm_high </td><td valign="top" width="40%">bit tree </td><td valign="top" width="40%">after sequence 11
 | ||
|    <br></td></tr></table>
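<p>As a worked example of the table, the sketch below computes the decoded length from the two choice bits and the symbol produced by the selected bit tree. The constants are taken from the reference source. The function '<samp><span class="samp">decoded_len</span></samp>' is a hypothetical helper that takes already decoded values instead of calling the range decoder.

```cpp
enum {
  min_match_len = 2,
  len_low_bits = 3,
  len_mid_bits = 3,
  len_low_symbols = 1 << len_low_bits,
  len_mid_symbols = 1 << len_mid_bits };

// Hypothetical helper: 'symbol' is the value decoded from bm_low, bm_mid,
// or bm_high, depending on the choice bits.
inline int decoded_len( const bool choice1, const bool choice2,
                        const int symbol )
  {
  if( !choice1 )			// bm_low: lengths 2 to 9
    return min_match_len + symbol;
  if( !choice2 )			// bm_mid: lengths 10 to 17
    return min_match_len + len_low_symbols + symbol;
  // bm_high (8 bits): lengths 18 to 273
  return min_match_len + len_low_symbols + len_mid_symbols + symbol;
  }
```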
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| The context array '<samp><span class="samp">bm_literal</span></samp>' is special. In principle it acts as
 | ||
| a normal bit tree context, the one selected by '<samp><span class="samp">literal_state</span></samp>'. But
 | ||
| if the previous decoded byte was not a literal, two other bit tree
 | ||
| contexts are used depending on the value of each bit in
 | ||
| '<samp><span class="samp">match_byte</span></samp>' (the byte at the latest used distance), until a bit is
 | ||
| decoded that is different from its corresponding bit in
 | ||
| '<samp><span class="samp">match_byte</span></samp>'. After the first difference is found, the rest of the
 | ||
| byte is decoded using the normal bit tree context. (See
 | ||
| '<samp><span class="samp">decode_matched</span></samp>' in the source).
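<p>The two index computations involved can be sketched in isolation. The expressions are the ones found in '<samp><span class="samp">decode_member</span></samp>' and '<samp><span class="samp">decode_matched</span></samp>'. Wrapping them in the functions '<samp><span class="samp">literal_state</span></samp>' and '<samp><span class="samp">matched_index</span></samp>' is an assumption made for this example.

```cpp
#include <cstdint>

enum { literal_context_bits = 3 };

// Selects one of the 8 bit trees in bm_literal from the 3 most significant
// bits of the previous byte.
inline int literal_state( const uint8_t prev_byte )
  { return prev_byte >> ( 8 - literal_context_bits ); }

// Index into the selected bit tree while the decoded bits still agree with
// match_byte. 'symbol' is the partial symbol (starting at 1) and 'match_bit'
// is the corresponding bit of match_byte. The result lies in 0x101 to 0x2FF,
// after the 0x100 entries of the normal bit tree.
inline int matched_index( const unsigned symbol, const bool match_bit )
  { return symbol + ( match_bit << 8 ) + 0x100; }
```

<p>This is why each entry of '<samp><span class="samp">bm_literal</span></samp>' holds 0x300 contexts: 0x100 for the normal bit tree plus 0x100 for each value of the match bit.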
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h3 class="section">7.3 The range decoder</h3>
 | ||
| 
 | ||
| <p>The LZMA stream is consumed one byte at a time by the range decoder. 
 | ||
| (See '<samp><span class="samp">normalize</span></samp>' in the source). Every byte consumed produces a
 | ||
| variable number of decoded bits, depending on how well these bits agree
 | ||
| with their context. (See '<samp><span class="samp">decode_bit</span></samp>' in the source).
 | ||
| 
 | ||
|    <p>The range decoder state consists of two unsigned 32-bit variables:
 | ||
| '<samp><span class="samp">range</span></samp>' (representing the most significant part of the range size
 | ||
| not yet decoded) and '<samp><span class="samp">code</span></samp>' (representing the current point within
 | ||
| '<samp><span class="samp">range</span></samp>'). '<samp><span class="samp">range</span></samp>' is initialized to 2^32 - 1<!-- /@w -->, and
 | ||
| '<samp><span class="samp">code</span></samp>' is initialized to 0.
 | ||
| 
 | ||
|    <p>The range encoder produces a first 0 byte that must be ignored by the
 | ||
| range decoder. (See the '<samp><span class="samp">Range_decoder</span></samp>' constructor in the source).
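<p>The following is a reduced sketch of the range decoder described above. It mirrors the initialization and '<samp><span class="samp">decode_bit</span></samp>' from the reference source, but reads from a memory buffer instead of standard input. The class name '<samp><span class="samp">Mem_range_decoder</span></samp>', the function '<samp><span class="samp">bytes_consumed</span></samp>', and the choice of returning 0 past the end of the buffer are assumptions of this example, not part of lzip.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum { bit_model_move_bits = 5,
       bit_model_total_bits = 11,
       bit_model_total = 1 << bit_model_total_bits };

struct Bit_model { int probability = bit_model_total / 2; };

class Mem_range_decoder			// hypothetical in-memory variant
  {
  const std::vector<uint8_t> & data;	// must outlive the decoder
  std::size_t pos;
  uint32_t code;
  uint32_t range;

  // Assumption of this sketch: bytes past the end of the buffer read as 0.
  uint8_t get_byte() { return ( pos < data.size() ) ? data[pos++] : 0; }

public:
  explicit Mem_range_decoder( const std::vector<uint8_t> & d )
    : data( d ), pos( 0 ), code( 0 ), range( 0xFFFFFFFFU )
    {
    get_byte();			// ignore the first byte of the LZMA stream
    for( int i = 0; i < 4; ++i ) code = ( code << 8 ) | get_byte();
    }

  std::size_t bytes_consumed() const { return pos; }

  bool decode_bit( Bit_model & bm )
    {
    bool symbol;
    const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
    if( code < bound )		// bit 0: keep the lower part of the range
      {
      range = bound;
      bm.probability +=
        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
      symbol = false;
      }
    else			// bit 1: keep the upper part of the range
      {
      code -= bound;
      range -= bound;
      bm.probability -= bm.probability >> bit_model_move_bits;
      symbol = true;
      }
    if( range <= 0x00FFFFFFU )	// normalize: consume one more byte
      { range <<= 8; code = ( code << 8 ) | get_byte(); }
    return symbol;
    }
  };
```

<p>A bit that agrees with its context shrinks '<samp><span class="samp">range</span></samp>' only slightly, so many such bits can be decoded before normalization consumes the next byte.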
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| 
 | ||
| <h3 class="section">7.4 Decoding and checking the LZMA stream</h3>
 | ||
| 
 | ||
| <p>After decoding the member header and obtaining the dictionary size, the
 | ||
| range decoder is initialized and then the LZMA decoder enters a loop
 | ||
| (see '<samp><span class="samp">decode_member</span></samp>' in the source) where it invokes the range
 | ||
| decoder with the appropriate contexts to decode the different coding
 | ||
| sequences (matches, repeated matches, and literal bytes), until the "End
 | ||
| Of Stream" marker is decoded.
 | ||
| 
 | ||
|    <p>Once the "End Of Stream" marker has been decoded, the decompressor reads and
 | ||
| decodes the member trailer, and checks that the three integrity factors
 | ||
| stored there (CRC, data size, and member size) match those computed from the
 | ||
| data.
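<p>The trailer check can be sketched as below. The field offsets and the little-endian byte order follow the reference source. The struct '<samp><span class="samp">Trailer_fields</span></samp>' and the functions '<samp><span class="samp">read_le</span></samp>', '<samp><span class="samp">parse_trailer</span></samp>', and '<samp><span class="samp">trailer_matches</span></samp>' are hypothetical names used only in this example.

```cpp
#include <cstdint>

enum { trailer_size = 20 };

struct Trailer_fields			// hypothetical, for illustration only
  {
  uint32_t crc;				// bytes  0-3
  unsigned long long data_size;		// bytes  4-11
  unsigned long long member_size;	// bytes 12-19
  };

// Reads 'size' bytes starting at 'p' as a little-endian integer.
inline unsigned long long read_le( const uint8_t * const p, const int size )
  {
  unsigned long long value = 0;
  for( int i = size - 1; i >= 0; --i ) value = ( value << 8 ) + p[i];
  return value;
  }

// 'trailer' must point to trailer_size bytes.
inline Trailer_fields parse_trailer( const uint8_t * const trailer )
  {
  Trailer_fields f;
  f.crc = read_le( trailer, 4 );
  f.data_size = read_le( trailer + 4, 8 );
  f.member_size = read_le( trailer + 12, 8 );
  return f;
  }

// True if the three stored values match those computed while decoding.
inline bool trailer_matches( const Trailer_fields & f, const uint32_t crc,
                             const unsigned long long data_size,
                             const unsigned long long member_size )
  {
  return f.crc == crc && f.data_size == data_size &&
         f.member_size == member_size;
  }
```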
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Trailing-data"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Examples">Examples</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Stream-format">Stream format</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">8 Extra data appended to the file</h2>
 | ||
| 
 | ||
| <p><a name="index-trailing-data-11"></a>
 | ||
| Sometimes extra data are found appended to a lzip file after the last
 | ||
| member. Such trailing data may be:
 | ||
| 
 | ||
|      <ul>
 | ||
| <li>Padding added to make the file size a multiple of some block size, for
 | ||
| example when writing to a tape. It is safe to append any amount of
 | ||
| padding zero bytes to a lzip file.
 | ||
| 
 | ||
|      <li>Useful data added by the user: an "End Of File" string (to check that the
 | ||
| file has not been truncated), a cryptographically secure hash, a description
 | ||
| of file contents, etc. It is safe to append any amount of text to a lzip
 | ||
| file as long as none of the first four bytes of the text matches the
 | ||
| corresponding byte in the string "LZIP", and the text does not contain any
 | ||
| zero bytes (null characters). Nonzero bytes and zero bytes can't be safely
 | ||
| mixed in trailing data.
 | ||
| 
 | ||
|      <li>Garbage added by a copy operation that did not fully succeed.
 | ||
| 
 | ||
|      <li>Malicious data added to the file in order to make its total size and
 | ||
| hash value (for a chosen hash) coincide with those of another file.
 | ||
| 
 | ||
|      <li>In rare cases, trailing data could be the corrupt header of another
 | ||
| member. In multimember or concatenated files the probability of
 | ||
| corruption happening in the magic bytes is 5 times smaller than the
 | ||
| probability of getting a false positive caused by the corruption of the
 | ||
| integrity information itself. Therefore it can be considered to be below
 | ||
| the noise level. Additionally, the test used by lzip to discriminate
 | ||
| trailing data from a corrupt header has a Hamming distance (HD) of 3,
 | ||
| and the 3 bit flips must happen in different magic bytes for the test to
 | ||
| fail. In any case, the option <samp><span class="option">--trailing-error</span></samp> guarantees that
 | ||
| any corrupt header is detected. 
 | ||
| </ul>
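<p>The rule given above for user-added text can be checked before appending it to a file. The sketch below is a hypothetical helper, not part of lzip. It only tests the two stated conditions: none of the first four bytes matches the corresponding byte of "LZIP", and the text contains no zero bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if 'text' satisfies both conditions for
// text appended to a lzip file. It does not cover zero padding.
inline bool is_safe_trailing_text( const std::string & text )
  {
  const char magic[4] = { 'L', 'Z', 'I', 'P' };
  for( std::size_t i = 0; i < 4 && i < text.size(); ++i )
    if( text[i] == magic[i] ) return false;	// byte i matches the magic
  return text.find( '\0' ) == std::string::npos;	// no zero bytes
  }
```

<p>For example, the string "End Of File" followed by a newline passes the check, while any text whose second byte is 'Z' does not.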
 | ||
| 
 | ||
|    <p>Trailing data are in no way part of the lzip file format, but tools
 | ||
| reading lzip files are expected to behave as correctly and usefully as
 | ||
| possible in the presence of trailing data.
 | ||
| 
 | ||
|    <p>Trailing data can be safely ignored in most cases. In some cases, like
 | ||
| that of user-added data, they are expected to be ignored. In those cases
 | ||
| where a file containing trailing data must be rejected, the option
 | ||
| <samp><span class="option">--trailing-error</span></samp> can be used. See <a href="#g_t_002d_002dtrailing_002derror">--trailing-error</a>.
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Examples"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Problems">Problems</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Trailing-data">Trailing data</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">9 A small tutorial with examples</h2>
 | ||
| 
 | ||
| <p><a name="index-examples-12"></a>
 | ||
| WARNING! Even if lzip is bug-free, other causes may result in a corrupt
 | ||
| compressed file (bugs in the system libraries, memory errors, etc). 
 | ||
| Therefore, if the data you are going to compress are important, give the
 | ||
| option <samp><span class="option">--keep</span></samp> to lzip and don't remove the original file until you
 | ||
| check the compressed file with a command like
 | ||
| '<samp><span class="samp">lzip -cd file.lz | cmp file -</span></samp>'<!-- /@w -->. Most RAM errors happening during
 | ||
| compression can only be detected by comparing the compressed file with the
 | ||
| original because the corruption happens before lzip compresses the RAM
 | ||
| contents, resulting in a valid compressed file containing wrong data.
 | ||
| 
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 1: Extract all the files from archive '<samp><span class="samp">foo.tar.lz</span></samp>'.
 | ||
| 
 | ||
| <pre class="example">       tar -xf foo.tar.lz
 | ||
|      or
 | ||
|        lzip -cd foo.tar.lz | tar -xf -
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 2: Replace a regular file with its compressed version '<samp><span class="samp">file.lz</span></samp>'
 | ||
| and show the compression ratio.
 | ||
| 
 | ||
| <pre class="example">     lzip -v file
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 3: Like example 2 but the created '<samp><span class="samp">file.lz</span></samp>' is multimember with
 | ||
| a member size of 1 MiB<!-- /@w -->. The compression ratio is not shown.
 | ||
| 
 | ||
| <pre class="example">     lzip -b 1MiB file
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 4: Restore a regular file from its compressed version
 | ||
| '<samp><span class="samp">file.lz</span></samp>'. If the operation is successful, '<samp><span class="samp">file.lz</span></samp>' is removed.
 | ||
| 
 | ||
| <pre class="example">     lzip -d file.lz
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 5: Check the integrity of the compressed file '<samp><span class="samp">file.lz</span></samp>' and
 | ||
| show status.
 | ||
| 
 | ||
| <pre class="example">     lzip -tv file.lz
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| <a name="concat_002dexample"></a>Example 6: The right way of concatenating the decompressed output of two or
 | ||
| more compressed files. See <a href="#Trailing-data">Trailing data</a>.
 | ||
| 
 | ||
| <pre class="example">     Don't do this
 | ||
|        cat file1.lz file2.lz file3.lz | lzip -d -
 | ||
|      Do this instead
 | ||
|        lzip -cd file1.lz file2.lz file3.lz
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 7: Decompress '<samp><span class="samp">file.lz</span></samp>' partially until 10 KiB<!-- /@w --> of
 | ||
| decompressed data are produced.
 | ||
| 
 | ||
| <pre class="example">     lzip -cd file.lz | dd bs=1024 count=10
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 8: Decompress '<samp><span class="samp">file.lz</span></samp>' partially from decompressed byte at
 | ||
| offset 10000 to decompressed byte at offset 14999 (5000 bytes are produced).
 | ||
| 
 | ||
| <pre class="example">     lzip -cd file.lz | dd bs=1000 skip=10 count=5
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 9: Compress the whole device /dev/sdc and send the output to
 | ||
| '<samp><span class="samp">file.lz</span></samp>'.
 | ||
| 
 | ||
| <pre class="example">       lzip -c /dev/sdc > file.lz
 | ||
|      or
 | ||
|        lzip /dev/sdc -o file.lz
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 10: Create a multivolume compressed tar archive with a volume size
 | ||
| of 1440 KiB<!-- /@w -->.
 | ||
| 
 | ||
| <pre class="example">     tar -c some_directory | lzip -S 1440KiB -o volume_name -
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 11: Extract a multivolume compressed tar archive.
 | ||
| 
 | ||
| <pre class="example">     lzip -cd volume_name*.lz | tar -xf -
 | ||
| </pre>
 | ||
|    <pre class="sp">
 | ||
| 
 | ||
| </pre>
 | ||
| Example 12: Create a multivolume compressed backup of a large database file
 | ||
| with a volume size of 650 MB<!-- /@w -->, where each volume is a multimember file
 | ||
| with a member size of 32 MiB<!-- /@w -->.
 | ||
| 
 | ||
| <pre class="example">     lzip -b 32MiB -S 650MB big_db
 | ||
| </pre>
 | ||
|    <div class="node">
 | ||
| <a name="Problems"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Reference-source-code">Reference source code</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Examples">Examples</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="chapter">10 Reporting bugs</h2>
 | ||
| 
 | ||
| <p><a name="index-bugs-13"></a><a name="index-getting-help-14"></a>
 | ||
| There are probably bugs in lzip. There are certainly errors and
 | ||
| omissions in this manual. If you report them, they will get fixed. If
 | ||
| you don't, no one will ever know about them and they will remain unfixed
 | ||
| for all eternity, if not longer.
 | ||
| 
 | ||
|    <p>If you find a bug in lzip, please send electronic mail to
 | ||
| <a href="mailto:lzip-bug@nongnu.org">lzip-bug@nongnu.org</a>. Include the version number, which you can
 | ||
| find by running '<samp><span class="samp">lzip --version</span></samp>'<!-- /@w -->.
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Reference-source-code"></a>
 | ||
| <p><hr>
 | ||
| Next: <a rel="next" accesskey="n" href="#Concept-index">Concept index</a>,
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Problems">Problems</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="appendix">Appendix A Reference source code</h2>
 | ||
| 
 | ||
| <p><a name="index-reference-source-code-15"></a>
 | ||
| <pre class="verbatim">/* Lzd - Educational decompressor for the lzip format
 | ||
|    Copyright (C) 2013-2024 Antonio Diaz Diaz.
 | ||
| 
 | ||
|    This program is free software. Redistribution and use in source and
 | ||
|    binary forms, with or without modification, are permitted provided
 | ||
|    that the following conditions are met:
 | ||
| 
 | ||
|    1. Redistributions of source code must retain the above copyright
 | ||
|    notice, this list of conditions, and the following disclaimer.
 | ||
| 
 | ||
|    2. Redistributions in binary form must reproduce the above copyright
 | ||
|    notice, this list of conditions, and the following disclaimer in the
 | ||
|    documentation and/or other materials provided with the distribution.
 | ||
| 
 | ||
|    This program is distributed in the hope that it will be useful,
 | ||
|    but WITHOUT ANY WARRANTY; without even the implied warranty of
 | ||
|    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 | ||
| */
 | ||
| /*
 | ||
|    Exit status: 0 for a normal exit, 1 for environmental problems
 | ||
|    (file not found, invalid command-line options, I/O errors, etc), 2 to
 | ||
|    indicate a corrupt or invalid input file.
 | ||
| */
 | ||
| 
 | ||
| #include <algorithm>
 | ||
| #include <cerrno>
 | ||
| #include <cstdio>
 | ||
| #include <cstdlib>
 | ||
| #include <cstring>
 | ||
| #include <stdint.h>
 | ||
| #include <unistd.h>
 | ||
| #if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
 | ||
| #include <fcntl.h>
 | ||
| #include <io.h>
 | ||
| #endif
 | ||
| 
 | ||
| 
 | ||
| class State
 | ||
|   {
 | ||
|   int st;
 | ||
| 
 | ||
| public:
 | ||
|   enum { states = 12 };
 | ||
|   State() : st( 0 ) {}
 | ||
|   int operator()() const { return st; }
 | ||
|   bool is_char() const { return st < 7; }
 | ||
| 
 | ||
|   void set_char()
 | ||
|     {
 | ||
|     const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
 | ||
|     st = next[st];
 | ||
|     }
 | ||
|   void set_match()     { st = ( st < 7 ) ? 7 : 10; }
 | ||
|   void set_rep()       { st = ( st < 7 ) ? 8 : 11; }
 | ||
|   void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
 | ||
|   };
 | ||
| 
 | ||
| 
 | ||
| enum {
 | ||
|   min_dictionary_size = 1 << 12,
 | ||
|   max_dictionary_size = 1 << 29,
 | ||
|   literal_context_bits = 3,
 | ||
|   literal_pos_state_bits = 0,				// not used
 | ||
|   pos_state_bits = 2,
 | ||
|   pos_states = 1 << pos_state_bits,
 | ||
|   pos_state_mask = pos_states - 1,
 | ||
| 
 | ||
|   len_states = 4,
 | ||
|   dis_slot_bits = 6,
 | ||
|   start_dis_model = 4,
 | ||
|   end_dis_model = 14,
 | ||
|   modeled_distances = 1 << ( end_dis_model / 2 ),	// 128
 | ||
|   dis_align_bits = 4,
 | ||
|   dis_align_size = 1 << dis_align_bits,
 | ||
| 
 | ||
|   len_low_bits = 3,
 | ||
|   len_mid_bits = 3,
 | ||
|   len_high_bits = 8,
 | ||
|   len_low_symbols = 1 << len_low_bits,
 | ||
|   len_mid_symbols = 1 << len_mid_bits,
 | ||
|   len_high_symbols = 1 << len_high_bits,
 | ||
|   max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols,
 | ||
| 
 | ||
|   min_match_len = 2,					// must be 2
 | ||
| 
 | ||
|   bit_model_move_bits = 5,
 | ||
|   bit_model_total_bits = 11,
 | ||
|   bit_model_total = 1 << bit_model_total_bits };
 | ||
| 
 | ||
| struct Bit_model
 | ||
|   {
 | ||
|   int probability;
 | ||
|   Bit_model() : probability( bit_model_total / 2 ) {}
 | ||
|   };
 | ||
| 
 | ||
| struct Len_model
 | ||
|   {
 | ||
|   Bit_model choice1;
 | ||
|   Bit_model choice2;
 | ||
|   Bit_model bm_low[pos_states][len_low_symbols];
 | ||
|   Bit_model bm_mid[pos_states][len_mid_symbols];
 | ||
|   Bit_model bm_high[len_high_symbols];
 | ||
|   };
 | ||
| 
 | ||
| 
 | ||
| class CRC32
 | ||
|   {
 | ||
|   uint32_t data[256];		// Table of CRCs of all 8-bit messages.
 | ||
| 
 | ||
| public:
 | ||
|   CRC32()
 | ||
|     {
 | ||
|     for( unsigned n = 0; n < 256; ++n )
 | ||
|       {
 | ||
|       unsigned c = n;
 | ||
|       for( int k = 0; k < 8; ++k )
 | ||
|         { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; }
 | ||
|       data[n] = c;
 | ||
|       }
 | ||
|     }
 | ||
| 
 | ||
|   void update_buf( uint32_t & crc, const uint8_t * const buffer,
 | ||
|                    const int size ) const
 | ||
|     {
 | ||
|     for( int i = 0; i < size; ++i )
 | ||
|       crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 );
 | ||
|     }
 | ||
|   };
 | ||
| 
 | ||
| const CRC32 crc32;
 | ||
| 
 | ||
| 
 | ||
| enum { header_size = 6, trailer_size = 20 };
 | ||
| typedef uint8_t Lzip_header[header_size]; // 0-3 magic bytes
 | ||
| 					  //   4 version
 | ||
| 					  //   5 coded dictionary size
 | ||
| typedef uint8_t Lzip_trailer[trailer_size];
 | ||
| 			//  0-3  CRC32 of the uncompressed data
 | ||
| 			//  4-11 size of the uncompressed data
 | ||
| 			// 12-19 member size including header and trailer
 | ||
| 
 | ||
| class Range_decoder
 | ||
|   {
 | ||
|   unsigned long long member_pos;
 | ||
|   uint32_t code;
 | ||
|   uint32_t range;
 | ||
| 
 | ||
| public:
 | ||
|   Range_decoder()
 | ||
|     : member_pos( header_size ), code( 0 ), range( 0xFFFFFFFFU )
 | ||
|     {
 | ||
|     get_byte();			// discard first byte of the LZMA stream
 | ||
|     for( int i = 0; i < 4; ++i ) code = ( code << 8 ) | get_byte();
 | ||
|     }
 | ||
| 
 | ||
|   uint8_t get_byte() { ++member_pos; return std::getc( stdin ); }
 | ||
|   unsigned long long member_position() const { return member_pos; }
 | ||
| 
 | ||
|   unsigned decode( const int num_bits )
 | ||
|     {
 | ||
|     unsigned symbol = 0;
 | ||
|     for( int i = num_bits; i > 0; --i )
 | ||
|       {
 | ||
|       range >>= 1;
 | ||
|       symbol <<= 1;
 | ||
|       if( code >= range ) { code -= range; symbol |= 1; }
 | ||
|       if( range <= 0x00FFFFFFU )			// normalize
 | ||
|         { range <<= 8; code = ( code << 8 ) | get_byte(); }
 | ||
|       }
 | ||
|     return symbol;
 | ||
|     }
 | ||
| 
 | ||
|   bool decode_bit( Bit_model & bm )
 | ||
|     {
 | ||
|     bool symbol;
 | ||
|     const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
 | ||
|     if( code < bound )
 | ||
|       {
 | ||
|       range = bound;
 | ||
|       bm.probability +=
 | ||
|         ( bit_model_total - bm.probability ) >> bit_model_move_bits;
 | ||
|       symbol = 0;
 | ||
|       }
 | ||
|     else
 | ||
|       {
 | ||
|       code -= bound;
 | ||
|       range -= bound;
 | ||
|       bm.probability -= bm.probability >> bit_model_move_bits;
 | ||
|       symbol = 1;
 | ||
|       }
 | ||
|     if( range <= 0x00FFFFFFU )				// normalize
 | ||
|       { range <<= 8; code = ( code << 8 ) | get_byte(); }
 | ||
|     return symbol;
 | ||
|     }
 | ||
| 
 | ||
|   unsigned decode_tree( Bit_model bm[], const int num_bits )
 | ||
|     {
 | ||
|     unsigned symbol = 1;
 | ||
|     for( int i = 0; i < num_bits; ++i )
 | ||
|       symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
 | ||
|     return symbol - ( 1 << num_bits );
 | ||
|     }
 | ||
| 
 | ||
|   unsigned decode_tree_reversed( Bit_model bm[], const int num_bits )
 | ||
|     {
 | ||
|     unsigned symbol = decode_tree( bm, num_bits );
 | ||
|     unsigned reversed_symbol = 0;
 | ||
|     for( int i = 0; i < num_bits; ++i )
 | ||
|       {
 | ||
|       reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 );
 | ||
|       symbol >>= 1;
 | ||
|       }
 | ||
|     return reversed_symbol;
 | ||
|     }
 | ||
| 
 | ||
|   unsigned decode_matched( Bit_model bm[], const unsigned match_byte )
 | ||
|     {
 | ||
|     unsigned symbol = 1;
 | ||
|     for( int i = 7; i >= 0; --i )
 | ||
|       {
 | ||
|       const bool match_bit = ( match_byte >> i ) & 1;
 | ||
|       const bool bit = decode_bit( bm[symbol+(match_bit<<8)+0x100] );
 | ||
|       symbol = ( symbol << 1 ) | bit;
 | ||
|       if( match_bit != bit )
 | ||
|         {
 | ||
|         while( symbol < 0x100 )
 | ||
|           symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
 | ||
|         break;
 | ||
|         }
 | ||
|       }
 | ||
|     return symbol & 0xFF;
 | ||
|     }
 | ||
| 
 | ||
|   unsigned decode_len( Len_model & lm, const int pos_state )
 | ||
|     {
 | ||
|     if( decode_bit( lm.choice1 ) == 0 )
 | ||
|       return min_match_len +
 | ||
|              decode_tree( lm.bm_low[pos_state], len_low_bits );
 | ||
|     if( decode_bit( lm.choice2 ) == 0 )
 | ||
|       return min_match_len + len_low_symbols +
 | ||
|              decode_tree( lm.bm_mid[pos_state], len_mid_bits );
 | ||
|     return min_match_len + len_low_symbols + len_mid_symbols +
 | ||
|            decode_tree( lm.bm_high, len_high_bits );
 | ||
|     }
 | ||
|   };
 | ||
| 
 | ||
| 
 | ||
| class LZ_decoder
 | ||
|   {
 | ||
|   unsigned long long partial_data_pos;
 | ||
|   Range_decoder rdec;
 | ||
|   const unsigned dictionary_size;
 | ||
|   uint8_t * const buffer;	// output buffer
 | ||
|   unsigned pos;			// current pos in buffer
 | ||
|   unsigned stream_pos;		// first byte not yet written to stdout
 | ||
|   uint32_t crc_;
 | ||
|   bool pos_wrapped;
 | ||
| 
 | ||
|   void flush_data();
 | ||
| 
 | ||
|   uint8_t peek( const unsigned distance ) const
 | ||
|     {
 | ||
|     if( pos > distance ) return buffer[pos - distance - 1];
 | ||
|     if( pos_wrapped ) return buffer[dictionary_size + pos - distance - 1];
 | ||
|     return 0;			// prev_byte of first byte
 | ||
|     }
 | ||
| 
 | ||
|   void put_byte( const uint8_t b )
 | ||
|     {
 | ||
|     buffer[pos] = b;
 | ||
|     if( ++pos >= dictionary_size ) flush_data();
 | ||
|     }
 | ||
| 
 | ||
| public:
 | ||
|   explicit LZ_decoder( const unsigned dict_size )
 | ||
|     :
 | ||
|     partial_data_pos( 0 ),
 | ||
|     dictionary_size( dict_size ),
 | ||
|     buffer( new uint8_t[dictionary_size] ),
 | ||
|     pos( 0 ),
 | ||
|     stream_pos( 0 ),
 | ||
|     crc_( 0xFFFFFFFFU ),
 | ||
|     pos_wrapped( false )
 | ||
|     {}
 | ||
| 
 | ||
|   ~LZ_decoder() { delete[] buffer; }
 | ||
| 
 | ||
|   unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
 | ||
|   unsigned long long data_position() const
 | ||
|     { return partial_data_pos + pos; }
 | ||
|   uint8_t get_byte() { return rdec.get_byte(); }
 | ||
|   unsigned long long member_position() const
 | ||
|     { return rdec.member_position(); }
 | ||
| 
 | ||
|   bool decode_member();
 | ||
|   };
 | ||
| 
 | ||
| 
 | ||
| void LZ_decoder::flush_data()
 | ||
|   {
 | ||
|   if( pos > stream_pos )
 | ||
|     {
 | ||
|     const unsigned size = pos - stream_pos;
 | ||
|     crc32.update_buf( crc_, buffer + stream_pos, size );
 | ||
|     if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size )
 | ||
|       { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) );
 | ||
|         std::exit( 1 ); }
 | ||
|     if( pos >= dictionary_size )
 | ||
|       { partial_data_pos += pos; pos = 0; pos_wrapped = true; }
 | ||
|     stream_pos = pos;
 | ||
|     }
 | ||
|   }
 | ||
| 
 | ||
| 
 | ||
| bool LZ_decoder::decode_member()	// Return false if error
 | ||
|   {
 | ||
|   Bit_model bm_literal[1<<literal_context_bits][0x300];
 | ||
|   Bit_model bm_match[State::states][pos_states];
 | ||
|   Bit_model bm_rep[State::states];
 | ||
|   Bit_model bm_rep0[State::states];
 | ||
|   Bit_model bm_rep1[State::states];
 | ||
|   Bit_model bm_rep2[State::states];
 | ||
|   Bit_model bm_len[State::states][pos_states];
 | ||
|   Bit_model bm_dis_slot[len_states][1<<dis_slot_bits];
 | ||
|   Bit_model bm_dis[modeled_distances-end_dis_model+1];
 | ||
|   Bit_model bm_align[dis_align_size];
 | ||
|   Len_model match_len_model;
 | ||
|   Len_model rep_len_model;
 | ||
|   unsigned rep0 = 0;		// rep[0-3] latest four distances
 | ||
|   unsigned rep1 = 0;		// used for efficient coding of
 | ||
|   unsigned rep2 = 0;		// repeated distances
 | ||
|   unsigned rep3 = 0;
 | ||
|   State state;
 | ||
| 
 | ||
|   while( !std::feof( stdin ) && !std::ferror( stdin ) )
 | ||
|     {
 | ||
|     const int pos_state = data_position() & pos_state_mask;
 | ||
|     if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 )	// 1st bit
 | ||
|       {
 | ||
|       // literal byte
 | ||
|       const uint8_t prev_byte = peek( 0 );
 | ||
|       const int literal_state = prev_byte >> ( 8 - literal_context_bits );
 | ||
|       Bit_model * const bm = bm_literal[literal_state];
 | ||
|       if( state.is_char() )
 | ||
|         put_byte( rdec.decode_tree( bm, 8 ) );
 | ||
|       else
 | ||
|         put_byte( rdec.decode_matched( bm, peek( rep0 ) ) );
 | ||
|       state.set_char();
 | ||
|       continue;
 | ||
|       }
 | ||
|     // match or repeated match
 | ||
|     int len;
 | ||
|     if( rdec.decode_bit( bm_rep[state()] ) != 0 )		// 2nd bit
 | ||
|       {
 | ||
|       if( rdec.decode_bit( bm_rep0[state()] ) == 0 )		// 3rd bit
 | ||
|         {
 | ||
|         if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit
 | ||
|           { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; }
 | ||
|         }
 | ||
|       else
 | ||
|         {
 | ||
|         unsigned distance;
 | ||
|         if( rdec.decode_bit( bm_rep1[state()] ) == 0 )		// 4th bit
 | ||
|           distance = rep1;
 | ||
|         else
 | ||
|           {
 | ||
|           if( rdec.decode_bit( bm_rep2[state()] ) == 0 )	// 5th bit
 | ||
|             distance = rep2;
 | ||
|           else
 | ||
|             { distance = rep3; rep3 = rep2; }
 | ||
|           rep2 = rep1;
 | ||
|           }
 | ||
|         rep1 = rep0;
 | ||
|         rep0 = distance;
 | ||
|         }
 | ||
|       state.set_rep();
 | ||
|       len = rdec.decode_len( rep_len_model, pos_state );
 | ||
|       }
 | ||
|     else					// match
 | ||
|       {
 | ||
|       rep3 = rep2; rep2 = rep1; rep1 = rep0;
 | ||
|       len = rdec.decode_len( match_len_model, pos_state );
 | ||
|       const int len_state = std::min( len - min_match_len, len_states - 1 );
 | ||
|       rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits );
 | ||
|       if( rep0 >= start_dis_model )
 | ||
|         {
 | ||
|         const unsigned dis_slot = rep0;
 | ||
|         const int direct_bits = ( dis_slot >> 1 ) - 1;
 | ||
|         rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
 | ||
|         if( dis_slot < end_dis_model )
 | ||
|           rep0 += rdec.decode_tree_reversed( bm_dis + ( rep0 - dis_slot ),
 | ||
|                                              direct_bits );
 | ||
|         else
 | ||
|           {
 | ||
|           rep0 +=
 | ||
|             rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits;
 | ||
|           rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits );
 | ||
|           if( rep0 == 0xFFFFFFFFU )		// marker found
 | ||
|             {
 | ||
|             flush_data();
 | ||
|             return len == min_match_len;	// End Of Stream marker
 | ||
|             }
 | ||
|           }
 | ||
|         }
 | ||
|       state.set_match();
 | ||
|       if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) )
 | ||
|         { flush_data(); return false; }
 | ||
|       }
 | ||
|     for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) );
 | ||
|     }
 | ||
|   flush_data();
 | ||
|   return false;
 | ||
|   }
 | ||
| 
 | ||
| 
 | ||
| int main( const int argc, const char * const argv[] )
 | ||
|   {
 | ||
|   if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) )
 | ||
|     {
 | ||
|     std::printf(
 | ||
|       "Lzd %s - Educational decompressor for the lzip format.\n"
 | ||
|       "Study the source code to learn how a lzip decompressor works.\n"
 | ||
|       "See the lzip manual for an explanation of the code.\n"
 | ||
|       "\nUsage: %s [-d] < file.lz > file\n"
 | ||
|       "Lzd decompresses from standard input to standard output.\n"
 | ||
|       "\nCopyright (C) 2024 Antonio Diaz Diaz.\n"
 | ||
|       "License 2-clause BSD.\n"
 | ||
|       "This is free software: you are free to change and redistribute it.\n"
 | ||
|       "There is NO WARRANTY, to the extent permitted by law.\n"
 | ||
|       "Report bugs to lzip-bug@nongnu.org\n"
 | ||
|       "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n",
 | ||
|       PROGVERSION, argv[0] );
 | ||
|     return 0;
 | ||
|     }
 | ||
| 
 | ||
| #if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
 | ||
|   setmode( STDIN_FILENO, O_BINARY );
 | ||
|   setmode( STDOUT_FILENO, O_BINARY );
 | ||
| #endif
 | ||
| 
 | ||
|   for( bool first_member = true; ; first_member = false )
 | ||
|     {
 | ||
|     Lzip_header header;				// check header
 | ||
|     for( int i = 0; i < header_size; ++i ) header[i] = std::getc( stdin );
 | ||
|     if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 )
 | ||
|       {
 | ||
|       if( first_member )
 | ||
|         { std::fputs( "Bad magic number (file not in lzip format).\n",
 | ||
|                       stderr ); return 2; }
 | ||
|       break;					// ignore trailing data
 | ||
|       }
 | ||
|     unsigned dict_size = 1 << ( header[5] & 0x1F );
 | ||
|     dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 );
 | ||
|     if( dict_size < min_dictionary_size || dict_size > max_dictionary_size )
 | ||
|       { std::fputs( "Invalid dictionary size in member header.\n", stderr );
 | ||
|         return 2; }
 | ||
| 
 | ||
|     LZ_decoder decoder( dict_size );		// decode LZMA stream
 | ||
|     if( !decoder.decode_member() )
 | ||
|       { std::fputs( "Data error\n", stderr ); return 2; }
 | ||
| 
 | ||
|     Lzip_trailer trailer;			// check trailer
 | ||
|     for( int i = 0; i < trailer_size; ++i ) trailer[i] = decoder.get_byte();
 | ||
|     int retval = 0;
 | ||
|     unsigned crc = 0;
 | ||
|     for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i];
 | ||
|     if( crc != decoder.crc() )
 | ||
|       { std::fputs( "CRC mismatch\n", stderr ); retval = 2; }
 | ||
| 
 | ||
|     unsigned long long data_size = 0;
 | ||
|     for( int i = 11; i >= 4; --i )
 | ||
|       data_size = ( data_size << 8 ) + trailer[i];
 | ||
|     if( data_size != decoder.data_position() )
 | ||
|       { std::fputs( "Data size mismatch\n", stderr ); retval = 2; }
 | ||
| 
 | ||
|     unsigned long long member_size = 0;
 | ||
|     for( int i = 19; i >= 12; --i )
 | ||
|       member_size = ( member_size << 8 ) + trailer[i];
 | ||
|     if( member_size != decoder.member_position() )
 | ||
|       { std::fputs( "Member size mismatch\n", stderr ); retval = 2; }
 | ||
|     if( retval ) return retval;
 | ||
|     }
 | ||
| 
 | ||
|   if( std::fclose( stdout ) != 0 )
 | ||
|     { std::fprintf( stderr, "Error closing stdout: %s\n",
 | ||
|                     std::strerror( errno ) ); return 1; }
 | ||
|   return 0;
 | ||
|   }
 | ||
| </pre>
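The two fixed-width decoding steps in main() above (the coded dictionary size in header byte 5, and the little-endian fields of the trailer) can be tried in isolation. The following is a minimal sketch, not part of lzd: the helper names decode_dict_size and read_le are hypothetical and introduced only for illustration, and the code assumes C++14 or later so that the functions can be constexpr.

```cpp
#include <cstdint>

// Hypothetical helper (not in lzd): decode the dictionary size stored in
// byte 5 of the member header, mirroring the two statements in main().
constexpr unsigned decode_dict_size( const std::uint8_t coded )
  {
  const unsigned base = 1U << ( coded & 0x1F );	// bits 4-0: log2 of base size
  const unsigned fraction = base / 16;		// one sixteenth of the base size
  return base - fraction * ( ( coded >> 5 ) & 7 );	// bits 7-5: sixteenths removed
  }

// Hypothetical helper (not in lzd): read an unsigned little-endian integer
// of 'size' bytes, as main() does for the CRC, data size, and member size.
constexpr unsigned long long read_le( const std::uint8_t * const buf,
                                      const int size )
  {
  unsigned long long value = 0;
  for( int i = size - 1; i >= 0; --i ) value = ( value << 8 ) + buf[i];
  return value;
  }
```

For example, decode_dict_size( 0xD3 ) yields 327680 (2^19 minus 6 * 2^15, that is, 320 KiB), and read_le applied to the bytes 0x78 0x56 0x34 0x12 with a size of 4 yields 0x12345678. The range check against min_dictionary_size and max_dictionary_size is deliberately left to the caller, as in main().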
 | ||
| 
 | ||
| <div class="node">
 | ||
| <a name="Concept-index"></a>
 | ||
| <p><hr>
 | ||
| Previous: <a rel="previous" accesskey="p" href="#Reference-source-code">Reference source code</a>,
 | ||
| Up: <a rel="up" accesskey="u" href="#Top">Top</a>
 | ||
| 
 | ||
| </div>
 | ||
| 
 | ||
| <h2 class="unnumbered">Concept index</h2>
 | ||
| 
 | ||
| <ul class="index-cp" compact>
 | ||
| <li><a href="#index-algorithm-8">algorithm</a>: <a href="#Algorithm">Algorithm</a></li>
 | ||
| <li><a href="#index-bugs-13">bugs</a>: <a href="#Problems">Problems</a></li>
 | ||
| <li><a href="#index-examples-12">examples</a>: <a href="#Examples">Examples</a></li>
 | ||
| <li><a href="#index-file-format-9">file format</a>: <a href="#File-format">File format</a></li>
 | ||
| <li><a href="#index-format-of-the-LZMA-stream-10">format of the LZMA stream</a>: <a href="#Stream-format">Stream format</a></li>
 | ||
| <li><a href="#index-getting-help-14">getting help</a>: <a href="#Problems">Problems</a></li>
 | ||
| <li><a href="#index-introduction-1">introduction</a>: <a href="#Introduction">Introduction</a></li>
 | ||
| <li><a href="#index-invoking-3">invoking</a>: <a href="#Invoking-lzip">Invoking lzip</a></li>
 | ||
| <li><a href="#index-options-4">options</a>: <a href="#Invoking-lzip">Invoking lzip</a></li>
 | ||
| <li><a href="#index-output-2">output</a>: <a href="#Output">Output</a></li>
 | ||
| <li><a href="#index-quality-assurance-7">quality assurance</a>: <a href="#Quality-assurance">Quality assurance</a></li>
 | ||
| <li><a href="#index-reference-source-code-15">reference source code</a>: <a href="#Reference-source-code">Reference source code</a></li>
 | ||
| <li><a href="#index-trailing-data-11">trailing data</a>: <a href="#Trailing-data">Trailing data</a></li>
 | ||
| <li><a href="#index-usage-5">usage</a>: <a href="#Invoking-lzip">Invoking lzip</a></li>
 | ||
| <li><a href="#index-version-6">version</a>: <a href="#Invoking-lzip">Invoking lzip</a></li>
 | ||
| </ul></body></html>
 | ||
| 
 | ||
| <!--
 | ||
| 
 | ||
| Local Variables:
 | ||
| coding: iso-8859-15
 | ||
| End:
 | ||
| 
 | ||
| -->
 |