diff --git a/COPYING b/COPYING
new file mode 100644
index 0000000..4ad17ae
--- /dev/null
+++ b/COPYING
@@ -0,0 +1,338 @@
+                    GNU GENERAL PUBLIC LICENSE
+                       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.) You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License. (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code. (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+                            NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+
+    Copyright (C)
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program. If not, see .
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C)
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  , 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
diff --git a/decoder.cc b/decoder.cc
new file mode 100644
index 0000000..16aa016
--- /dev/null
+++ b/decoder.cc
@@ -0,0 +1,276 @@
+/* Lzip - LZMA lossless data compressor
+   Copyright (C) 2008-2024 Antonio Diaz Diaz.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 2 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program. If not, see .
+*/
+
+#define _FILE_OFFSET_BITS 64
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "lzip.h"
+#include "decoder.h"
+
+
+/* Return the number of bytes really read.
+   If (value returned < size) and (errno == 0), means EOF was reached.
+*/
+int readblock( const int fd, uint8_t * const buf, const int size )
+  {
+  int sz = 0;
+  errno = 0;
+  while( sz < size )
+    {
+    const int n = read( fd, buf + sz, size - sz );
+    if( n > 0 ) sz += n;
+    else if( n == 0 ) break; // EOF
+    else if( errno != EINTR ) break;
+    errno = 0;
+    }
+  return sz;
+  }
+
+
+/* Return the number of bytes really written.
+   If (value returned < size), it is always an error.
+*/
+int writeblock( const int fd, const uint8_t * const buf, const int size )
+  {
+  int sz = 0;
+  errno = 0;
+  while( sz < size )
+    {
+    const int n = write( fd, buf + sz, size - sz );
+    if( n > 0 ) sz += n;
+    else if( n < 0 && errno != EINTR ) break;
+    errno = 0;
+    }
+  return sz;
+  }
+
+
+bool Range_decoder::read_block()
+  {
+  if( !at_stream_end )
+    {
+    stream_pos = readblock( infd, buffer, buffer_size );
+    if( stream_pos != buffer_size && errno ) throw Error( "Read error" );
+    at_stream_end = ( stream_pos < buffer_size );
+    partial_member_pos += pos;
+    pos = 0;
+    show_dprogress();
+    }
+  return pos < stream_pos;
+  }
+
+
+void LZ_decoder::flush_data()
+  {
+  if( pos > stream_pos )
+    {
+    const int size = pos - stream_pos;
+    crc32.update_buf( crc_, buffer + stream_pos, size );
+    if( outfd >= 0 && writeblock( outfd, buffer + stream_pos, size ) != size )
+      throw Error( "Write error" );
+    if( pos >= dictionary_size )
+      { partial_data_pos += pos; pos = 0; pos_wrapped = true; }
+    stream_pos = pos;
+    }
+  }
+
+
+int LZ_decoder::check_trailer( const Pretty_print & pp,
+                               const bool ignore_empty ) const
+  {
+  Lzip_trailer trailer;
+  int size = rdec.read_data( trailer.data, trailer.size );
+  bool error = false;
+
+  if( size < trailer.size )
+    {
+    error = true;
+    if( verbosity >= 0 )
+      { pp();
+        std::fprintf( stderr, "Trailer truncated at trailer position %d;"
+                      " some checks may fail.\n", size ); }
+    while( size < trailer.size ) trailer.data[size++] = 0;
+    }
+
+  const unsigned td_crc = trailer.data_crc();
+  if( td_crc != crc() )
+    {
+    error = true;
+    if( verbosity >= 0 )
+      { pp();
+        std::fprintf( stderr, "CRC mismatch; stored %08X, computed %08X\n",
+                      td_crc, crc() ); }
+    }
+  const unsigned long long data_size = data_position();
+  const unsigned long long td_size = trailer.data_size();
+  if( td_size != data_size )
+    {
+    error = true;
+    if( verbosity >= 0 )
+      { pp();
+        std::fprintf( stderr, "Data size mismatch; stored %llu (0x%llX), computed %llu (0x%llX)\n",
+                      td_size, td_size, data_size, data_size ); }
+    }
+  const unsigned long long member_size = rdec.member_position();
+  const unsigned long long tm_size = trailer.member_size();
+  if( tm_size != member_size )
+    {
+    error = true;
+    if( verbosity >= 0 )
+      { pp();
+        std::fprintf( stderr, "Member size mismatch; stored %llu (0x%llX), computed %llu (0x%llX)\n",
+                      tm_size, tm_size, member_size, member_size ); }
+    }
+  if( error ) return 3;
+  if( !ignore_empty && data_size == 0 ) return 5;
+  if( verbosity >= 2 )
+    {
+    if( verbosity >= 4 ) show_header( dictionary_size );
+    if( data_size == 0 || member_size == 0 )
+      std::fputs( "no data compressed. ", stderr );
+    else
+      std::fprintf( stderr, "%6.3f:1, %5.2f%% ratio, %5.2f%% saved. ",
+                    (double)data_size / member_size,
+                    ( 100.0 * member_size ) / data_size,
+                    100.0 - ( ( 100.0 * member_size ) / data_size ) );
+    if( verbosity >= 4 ) std::fprintf( stderr, "CRC %08X, ", td_crc );
+    if( verbosity >= 3 )
+      std::fprintf( stderr, "%9llu out, %8llu in. ", data_size, member_size );
+    }
+  return 0;
+  }
+
+
+/* Return value: 0 = OK, 1 = decoder error, 2 = unexpected EOF,
+                 3 = trailer error, 4 = unknown marker found,
+                 5 = empty member found, 6 = marked member found. */
+int LZ_decoder::decode_member( const Cl_options & cl_opts,
+                               const Pretty_print & pp )
+  {
+  Bit_model bm_literal[1<= start_dis_model )
+        {
+        const unsigned dis_slot = distance;
+        const int direct_bits = ( dis_slot >> 1 ) - 1;
+        distance = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
+        if( dis_slot < end_dis_model )
+          distance += rdec.decode_tree_reversed(
+                      bm_dis + ( distance - dis_slot ), direct_bits );
+        else
+          {
+          distance +=
+            rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits;
+          distance += rdec.decode_tree_reversed4( bm_align );
+          if( distance == 0xFFFFFFFFU ) // marker found
+            {
+            rdec.normalize();
+            flush_data();
+            if( len == min_match_len ) // End Of Stream marker
+              return check_trailer( pp, cl_opts.ignore_empty );
+            if( len == min_match_len + 1 ) // Sync Flush marker
+              { rdec.load(); continue; }
+            if( verbosity >= 0 )
+              {
+              pp();
+              std::fprintf( stderr, "Unsupported marker code '%d'\n", len );
+              }
+            return 4;
+            }
+          }
+        }
+      rep3 = rep2; rep2 = rep1; rep1 = rep0; rep0 = distance;
+      state.set_match();
+      if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) )
+        { flush_data(); return 1; }
+      }
+    copy_block( rep0, len );
+    }
+  flush_data();
+  return 2;
+  }
diff --git a/decoder.h b/decoder.h
new file mode 100644
index 0000000..cf10f3d
--- /dev/null
+++ b/decoder.h
@@ -0,0 +1,346 @@
+/* Lzip - LZMA lossless data compressor
+   Copyright (C) 2008-2024 Antonio Diaz Diaz.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 2 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program. If not, see .
+*/
+
+class Range_decoder
+  {
+  enum { buffer_size = 16384 };
+  unsigned long long partial_member_pos;
+  uint8_t * const buffer; // input buffer
+  int pos; // current pos in buffer
+  int stream_pos; // when reached, a new block must be read
+  uint32_t code;
+  uint32_t range;
+  const int infd; // input file descriptor
+  bool at_stream_end;
+
+  bool read_block();
+
+  Range_decoder( const Range_decoder & ); // declared as private
+  void operator=( const Range_decoder & ); // declared as private
+
+public:
+  explicit Range_decoder( const int ifd )
+    :
+    partial_member_pos( 0 ),
+    buffer( new uint8_t[buffer_size] ),
+    pos( 0 ),
+    stream_pos( 0 ),
+    code( 0 ),
+    range( 0xFFFFFFFFU ),
+    infd( ifd ),
+    at_stream_end( false )
+    {}
+
+  ~Range_decoder() { delete[] buffer; }
+
+  bool finished() { return pos >= stream_pos && !read_block(); }
+
+  unsigned long long member_position() const
+    { return partial_member_pos + pos; }
+
+  void reset_member_position()
+    { partial_member_pos = 0; partial_member_pos -= pos; }
+
+  uint8_t get_byte()
+    {
+    // 0xFF avoids decoder error if member is truncated at EOS marker
+    if( finished() ) return 0xFF;
+    return buffer[pos++];
+    }
+
+  int read_data( uint8_t * const outbuf, const int size )
+    {
+    int sz = 0;
+    while( sz < size && !finished() )
+      {
+      const int rd = std::min( size - sz, stream_pos - pos );
+      std::memcpy( outbuf + sz, buffer + pos, rd );
+      pos += rd;
+      sz += rd;
+      }
+    return sz;
+    }
+
+  bool load( const bool ignore_marking = true )
+    {
+    code = 0;
+    range = 0xFFFFFFFFU;
+    // check and discard first byte of the LZMA stream
+    if( get_byte() != 0 && !ignore_marking ) return false;
+    for( int i = 0; i < 4; ++i ) code = ( code << 8 ) | get_byte();
+    return true;
+    }
+
+  void normalize()
+    {
+    if( range <= 0x00FFFFFFU )
+      { range <<= 8; code = ( code << 8 ) | get_byte(); }
+    }
+
+  unsigned decode( const int num_bits )
+    {
+    unsigned symbol = 0;
+    for( int i = num_bits; i > 0; --i )
+      {
+      normalize();
+      range >>= 1;
+//      symbol <<= 1;
+//      if( code >= range ) { code -= range; symbol |= 1; }
+      const bool bit = ( code >= range );
+      symbol <<= 1; symbol += bit;
+      code -= range & ( 0U - bit );
+      }
+    return symbol;
+    }
+
+  bool decode_bit( Bit_model & bm )
+    {
+    normalize();
+    const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
+    if( code < bound )
+      {
+      range = bound;
+      bm.probability +=
+        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
+      return 0;
+      }
+    else
+      {
+      code -= bound;
+      range -= bound;
+      bm.probability -= bm.probability >> bit_model_move_bits;
+      return 1;
+      }
+    }
+
+  void decode_symbol_bit( Bit_model & bm, unsigned & symbol )
+    {
+    normalize();
+    symbol <<= 1;
+    const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
+    if( code < bound )
+      {
+      range = bound;
+      bm.probability +=
+        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
+      }
+    else
+      {
+      code -= bound;
+      range -= bound;
+      bm.probability -= bm.probability >> bit_model_move_bits;
+      symbol |= 1;
+      }
+    }
+
+  void decode_symbol_bit_reversed( Bit_model & bm, unsigned & model,
+                                   unsigned & symbol, const int i )
+    {
+    normalize();
+    model <<= 1;
+    const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
+    if( code < bound )
+      {
+      range = bound;
+      bm.probability +=
+        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
+      }
+    else
+      {
+      code -= bound;
+      range -= bound;
+      bm.probability -= bm.probability >> bit_model_move_bits;
+      model |= 1;
+      symbol |= 1 << i;
+      }
+    }
+
+  unsigned decode_tree6( Bit_model bm[] )
+    {
+    unsigned symbol = 1;
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    return symbol & 0x3F;
+    }
+
+  unsigned decode_tree8( Bit_model bm[] )
+    {
+    unsigned symbol = 1;
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    return symbol & 0xFF;
+    }
+
+  unsigned decode_tree_reversed( Bit_model bm[], const int num_bits )
+    {
+    unsigned model = 1;
+    unsigned symbol = 0;
+    for( int i = 0; i < num_bits; ++i )
+      decode_symbol_bit_reversed( bm[model], model, symbol, i );
+    return symbol;
+    }
+
+  unsigned decode_tree_reversed4( Bit_model bm[] )
+    {
+    unsigned model = 1;
+    unsigned symbol = 0;
+    decode_symbol_bit_reversed( bm[model], model, symbol, 0 );
+    decode_symbol_bit_reversed( bm[model], model, symbol, 1 );
+    decode_symbol_bit_reversed( bm[model], model, symbol, 2 );
+    decode_symbol_bit_reversed( bm[model], model, symbol, 3 );
+    return symbol;
+    }
+
+  unsigned decode_matched( Bit_model bm[], unsigned match_byte )
+    {
+    Bit_model * const bm1 = bm + 0x100;
+    unsigned symbol = 1;
+    while( symbol < 0x100 )
+      {
+      const unsigned match_bit = ( match_byte <<= 1 ) & 0x100;
+      const bool bit = decode_bit( bm1[symbol+match_bit] );
+      symbol <<= 1; symbol |= bit;
+      if( match_bit >> 8 != bit )
+        {
+        while( symbol < 0x100 ) decode_symbol_bit( bm[symbol], symbol );
+        break;
+        }
+      }
+    return symbol & 0xFF;
+    }
+
+  unsigned decode_len( Len_model & lm, const int pos_state )
+    {
+    Bit_model * bm;
+    unsigned mask, offset, symbol = 1;
+
+    if( decode_bit( lm.choice1 ) == 0 )
+      { bm = lm.bm_low[pos_state]; mask = 7; offset = 0; goto len3; }
+    if( decode_bit( lm.choice2 ) == 0 )
+      { bm = lm.bm_mid[pos_state]; mask = 7; offset = len_low_symbols; goto len3; }
+    bm = lm.bm_high; mask = 0xFF; offset = len_low_symbols + len_mid_symbols;
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+len3:
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    decode_symbol_bit( bm[symbol], symbol );
+    return ( symbol & mask ) + min_match_len + offset;
+    }
+  };
+
+
+class LZ_decoder
+  {
+  unsigned long long partial_data_pos;
+  Range_decoder & rdec;
+  const unsigned dictionary_size;
+  uint8_t * const buffer; // output buffer
+  unsigned pos; // current pos in buffer
+  unsigned stream_pos; // first byte not yet written to file
+  uint32_t crc_;
+  const int outfd; // output file descriptor
+  bool pos_wrapped;
+
+  void flush_data();
+  int check_trailer( const Pretty_print & pp, const bool ignore_empty ) const;
+
+  uint8_t peek_prev() const
+    { return buffer[((pos > 0) ? pos : dictionary_size)-1]; }
+
+  uint8_t peek( const unsigned distance ) const
+    {
+    const unsigned i = ( ( pos > distance ) ? 0 : dictionary_size ) +
+                       pos - distance - 1;
+    return buffer[i];
+    }
+
+  void put_byte( const uint8_t b )
+    {
+    buffer[pos] = b;
+    if( ++pos >= dictionary_size ) flush_data();
+    }
+
+  void copy_block( const unsigned distance, unsigned len )
+    {
+    unsigned lpos = pos, i = lpos - distance - 1;
+    bool fast, fast2;
+    if( lpos > distance )
+      {
+      fast = ( len < dictionary_size - lpos );
+      fast2 = ( fast && len <= lpos - i );
+      }
+    else
+      {
+      i += dictionary_size;
+      fast = ( len < dictionary_size - i ); // (i == pos) may happen
+      fast2 = ( fast && len <= i - lpos );
+      }
+    if( fast ) // no wrap
+      {
+      pos += len;
+      if( fast2 ) // no wrap, no overlap
+        std::memcpy( buffer + lpos, buffer + i, len );
+      else
+        for( ; len > 0; --len ) buffer[lpos++] = buffer[i++];
+      }
+    else for( ; len > 0; --len )
+      {
+      buffer[pos] = buffer[i];
+      if( ++pos >= dictionary_size ) flush_data();
+      if( ++i >= dictionary_size ) i = 0;
+      }
+    }
+
+  LZ_decoder( const LZ_decoder & ); // declared as private
+  void operator=( const LZ_decoder & ); // declared as private
+
+public:
+  LZ_decoder( Range_decoder & rde, const unsigned dict_size, const int ofd )
+    :
+    partial_data_pos( 0 ),
+    rdec( rde ),
+    dictionary_size( dict_size ),
+    buffer( new uint8_t[dictionary_size] ),
+    pos( 0 ),
+    stream_pos( 0 ),
+    crc_( 0xFFFFFFFFU ),
+    outfd( ofd ),
+    pos_wrapped( false )
+    // prev_byte of first byte; also for peek( 0 ) on corrupt file
+    { buffer[dictionary_size-1] = 0; }
+
+  ~LZ_decoder() { delete[] buffer; }
+
+  unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
+  unsigned long long data_position() const { return partial_data_pos + pos; }
+
+  int decode_member( const Cl_options & cl_opts, const Pretty_print & pp );
+  };
diff --git a/encoder.cc b/encoder.cc
new file mode 100644
index 0000000..11da3a0
--- /dev/null
+++ b/encoder.cc
@@ -0,0 +1,594 @@
+/* Lzip - LZMA lossless data compressor
+   Copyright (C) 2008-2024 Antonio Diaz Diaz.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 2 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program. If not, see .
+*/ + +#define _FILE_OFFSET_BITS 64 + +#include +#include +#include +#include +#include +#include +#include + +#include "lzip.h" +#include "encoder_base.h" +#include "encoder.h" + + +const CRC32 crc32; + + +int LZ_encoder::get_match_pairs( Pair * pairs ) + { + int len_limit = match_len_limit; + if( len_limit > available_bytes() ) + { + len_limit = available_bytes(); + if( len_limit < 4 ) return 0; + } + + int maxlen = 3; // only used if pairs != 0 + int num_pairs = 0; + const int min_pos = ( pos > dictionary_size ) ? pos - dictionary_size : 0; + const uint8_t * const data = ptr_to_current_pos(); + + unsigned tmp = crc32[data[0]] ^ data[1]; + const int key2 = tmp & ( num_prev_positions2 - 1 ); + tmp ^= (unsigned)data[2] << 8; + const int key3 = num_prev_positions2 + ( tmp & ( num_prev_positions3 - 1 ) ); + const int key4 = num_prev_positions23 + + ( ( tmp ^ ( crc32[data[3]] << 5 ) ) & key4_mask ); + + if( pairs ) + { + const int np2 = prev_positions[key2]; + const int np3 = prev_positions[key3]; + if( np2 > min_pos && buffer[np2-1] == data[0] ) + { + pairs[0].dis = pos - np2; + pairs[0].len = maxlen = 2 + ( np2 == np3 ); + num_pairs = 1; + } + if( np2 != np3 && np3 > min_pos && buffer[np3-1] == data[0] ) + { + maxlen = 3; + pairs[num_pairs++].dis = pos - np3; + } + if( num_pairs > 0 ) + { + const int delta = pairs[num_pairs-1].dis + 1; + while( maxlen < len_limit && data[maxlen-delta] == data[maxlen] ) + ++maxlen; + pairs[num_pairs-1].len = maxlen; + if( maxlen < 3 ) maxlen = 3; + if( maxlen >= len_limit ) pairs = 0; // done. 
now just skip + } + } + + const int pos1 = pos + 1; + prev_positions[key2] = pos1; + prev_positions[key3] = pos1; + int newpos1 = prev_positions[key4]; + prev_positions[key4] = pos1; + + int32_t * ptr0 = pos_array + ( cyclic_pos << 1 ); + int32_t * ptr1 = ptr0 + 1; + int len = 0, len0 = 0, len1 = 0; + + for( int count = cycles; ; ) + { + if( newpos1 <= min_pos || --count < 0 ) { *ptr0 = *ptr1 = 0; break; } + + const int delta = pos1 - newpos1; + int32_t * const newptr = pos_array + + ( ( cyclic_pos - delta + + ( ( cyclic_pos >= delta ) ? 0 : dictionary_size + 1 ) ) << 1 ); + if( data[len-delta] == data[len] ) + { + while( ++len < len_limit && data[len-delta] == data[len] ) {} + if( pairs && maxlen < len ) + { + pairs[num_pairs].dis = delta - 1; + pairs[num_pairs].len = maxlen = len; + ++num_pairs; + } + if( len >= len_limit ) + { + *ptr0 = newptr[0]; + *ptr1 = newptr[1]; + break; + } + } + if( data[len-delta] < data[len] ) + { + *ptr0 = newpos1; + ptr0 = newptr + 1; + newpos1 = *ptr0; + len0 = len; if( len1 < len ) len = len1; + } + else + { + *ptr1 = newpos1; + ptr1 = newptr; + newpos1 = *ptr1; + len1 = len; if( len0 < len ) len = len0; + } + } + return num_pairs; + } + + +void LZ_encoder::update_distance_prices() + { + for( int dis = start_dis_model; dis < modeled_distances; ++dis ) + { + const int dis_slot = dis_slots[dis]; + const int direct_bits = ( dis_slot >> 1 ) - 1; + const int base = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + const int price = price_symbol_reversed( bm_dis + ( base - dis_slot ), + dis - base, direct_bits ); + for( int len_state = 0; len_state < len_states; ++len_state ) + dis_prices[len_state][dis] = price; + } + + for( int len_state = 0; len_state < len_states; ++len_state ) + { + int * const dsp = dis_slot_prices[len_state]; + const Bit_model * const bmds = bm_dis_slot[len_state]; + int slot = 0; + for( ; slot < end_dis_model; ++slot ) + dsp[slot] = price_symbol6( bmds, slot ); + for( ; slot < num_dis_slots; ++slot ) + dsp[slot] = 
price_symbol6( bmds, slot ) + + (((( slot >> 1 ) - 1 ) - dis_align_bits ) << price_shift_bits ); + + int * const dp = dis_prices[len_state]; + int dis = 0; + for( ; dis < start_dis_model; ++dis ) + dp[dis] = dsp[dis]; + for( ; dis < modeled_distances; ++dis ) + dp[dis] += dsp[dis_slots[dis]]; + } + } + + +/* Return the number of bytes advanced (ahead). + trials[0]..trials[ahead-1] contain the steps to encode. + ( trials[0].dis4 == -1 ) means literal. + A match/rep longer or equal than match_len_limit finishes the sequence. +*/ +int LZ_encoder::sequence_optimizer( const int reps[num_rep_distances], + const State state ) + { + int num_pairs, num_trials; + + if( pending_num_pairs > 0 ) // from previous call + { + num_pairs = pending_num_pairs; + pending_num_pairs = 0; + } + else + num_pairs = read_match_distances(); + const int main_len = ( num_pairs > 0 ) ? pairs[num_pairs-1].len : 0; + + int replens[num_rep_distances]; + int rep_index = 0; + for( int i = 0; i < num_rep_distances; ++i ) + { + replens[i] = true_match_len( 0, reps[i] + 1 ); + if( replens[i] > replens[rep_index] ) rep_index = i; + } + if( replens[rep_index] >= match_len_limit ) + { + trials[0].price = replens[rep_index]; + trials[0].dis4 = rep_index; + move_and_update( replens[rep_index] ); + return replens[rep_index]; + } + + if( main_len >= match_len_limit ) + { + trials[0].price = main_len; + trials[0].dis4 = pairs[num_pairs-1].dis + num_rep_distances; + move_and_update( main_len ); + return main_len; + } + + const int pos_state = data_position() & pos_state_mask; + const uint8_t prev_byte = peek( 1 ); + const uint8_t cur_byte = peek( 0 ); + const uint8_t match_byte = peek( reps[0] + 1 ); + + trials[1].price = price0( bm_match[state()][pos_state] ); + if( state.is_char() ) + trials[1].price += price_literal( prev_byte, cur_byte ); + else + trials[1].price += price_matched( prev_byte, cur_byte, match_byte ); + trials[1].dis4 = -1; // literal + + const int match_price = price1( 
bm_match[state()][pos_state] ); + const int rep_match_price = match_price + price1( bm_rep[state()] ); + + if( match_byte == cur_byte ) + trials[1].update( rep_match_price + price_shortrep( state, pos_state ), 0, 0 ); + + num_trials = std::max( main_len, replens[rep_index] ); + + if( num_trials < min_match_len ) + { + trials[0].price = 1; + trials[0].dis4 = trials[1].dis4; + move_pos(); + return 1; + } + + trials[0].state = state; + for( int i = 0; i < num_rep_distances; ++i ) + trials[0].reps[i] = reps[i]; + + for( int len = min_match_len; len <= num_trials; ++len ) + trials[len].price = infinite_price; + + for( int rep = 0; rep < num_rep_distances; ++rep ) + { + if( replens[rep] < min_match_len ) continue; + const int price = rep_match_price + price_rep( rep, state, pos_state ); + for( int len = min_match_len; len <= replens[rep]; ++len ) + trials[len].update( price + rep_len_prices.price( len, pos_state ), + rep, 0 ); + } + + if( main_len > replens[0] ) + { + const int normal_match_price = match_price + price0( bm_rep[state()] ); + int i = 0, len = std::max( replens[0] + 1, (int)min_match_len ); + while( len > pairs[i].len ) ++i; + while( true ) + { + const int dis = pairs[i].dis; + trials[len].update( normal_match_price + price_pair( dis, len, pos_state ), + dis + num_rep_distances, 0 ); + if( ++len > pairs[i].len && ++i >= num_pairs ) break; + } + } + + int cur = 0; + while( true ) // price optimization loop + { + move_pos(); + if( ++cur >= num_trials ) // no more initialized trials + { + backward( cur ); + return cur; + } + + const int num_pairs = read_match_distances(); + const int newlen = ( num_pairs > 0 ) ? 
pairs[num_pairs-1].len : 0; + if( newlen >= match_len_limit ) + { + pending_num_pairs = num_pairs; + backward( cur ); + return cur; + } + + // give final values to current trial + Trial & cur_trial = trials[cur]; + State cur_state; + { + const int dis4 = cur_trial.dis4; + int prev_index = cur_trial.prev_index; + const int prev_index2 = cur_trial.prev_index2; + + if( prev_index2 == single_step_trial ) + { + cur_state = trials[prev_index].state; + if( prev_index + 1 == cur ) // len == 1 + { + if( dis4 == 0 ) cur_state.set_short_rep(); + else cur_state.set_char(); // literal + } + else if( dis4 < num_rep_distances ) cur_state.set_rep(); + else cur_state.set_match(); + } + else + { + if( prev_index2 == dual_step_trial ) // dis4 == 0 (rep0) + --prev_index; + else // prev_index2 >= 0 + prev_index = prev_index2; + cur_state.set_char_rep(); + } + cur_trial.state = cur_state; + for( int i = 0; i < num_rep_distances; ++i ) + cur_trial.reps[i] = trials[prev_index].reps[i]; + mtf_reps( dis4, cur_trial.reps ); // literal is ignored + } + + const int pos_state = data_position() & pos_state_mask; + const uint8_t prev_byte = peek( 1 ); + const uint8_t cur_byte = peek( 0 ); + const uint8_t match_byte = peek( cur_trial.reps[0] + 1 ); + + int next_price = cur_trial.price + + price0( bm_match[cur_state()][pos_state] ); + if( cur_state.is_char() ) + next_price += price_literal( prev_byte, cur_byte ); + else + next_price += price_matched( prev_byte, cur_byte, match_byte ); + + // try last updates to next trial + Trial & next_trial = trials[cur+1]; + + next_trial.update( next_price, -1, cur ); // literal + + const int match_price = cur_trial.price + price1( bm_match[cur_state()][pos_state] ); + const int rep_match_price = match_price + price1( bm_rep[cur_state()] ); + + if( match_byte == cur_byte && next_trial.dis4 != 0 && + next_trial.prev_index2 == single_step_trial ) + { + const int price = rep_match_price + price_shortrep( cur_state, pos_state ); + if( price <= next_trial.price ) + { 
+ next_trial.price = price; + next_trial.dis4 = 0; // rep0 + next_trial.prev_index = cur; + } + } + + const int triable_bytes = + std::min( available_bytes(), max_num_trials - 1 - cur ); + if( triable_bytes < min_match_len ) continue; + + const int len_limit = std::min( match_len_limit, triable_bytes ); + + // try literal + rep0 + if( match_byte != cur_byte && next_trial.prev_index != cur ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis = cur_trial.reps[0] + 1; + const int limit = std::min( match_len_limit + 1, triable_bytes ); + int len = 1; + while( len < limit && data[len-dis] == data[len] ) ++len; + if( --len >= min_match_len ) + { + const int pos_state2 = ( pos_state + 1 ) & pos_state_mask; + State state2 = cur_state; state2.set_char(); + const int price = next_price + + price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len, state2, pos_state2 ); + while( num_trials < cur + 1 + len ) + trials[++num_trials].price = infinite_price; + trials[cur+1+len].update2( price, cur + 1 ); + } + } + + int start_len = min_match_len; + + // try rep distances + for( int rep = 0; rep < num_rep_distances; ++rep ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis = cur_trial.reps[rep] + 1; + int len; + + if( data[0-dis] != data[0] || data[1-dis] != data[1] ) continue; + for( len = min_match_len; len < len_limit; ++len ) + if( data[len-dis] != data[len] ) break; + while( num_trials < cur + len ) + trials[++num_trials].price = infinite_price; + int price = rep_match_price + price_rep( rep, cur_state, pos_state ); + for( int i = min_match_len; i <= len; ++i ) + trials[cur+i].update( price + rep_len_prices.price( i, pos_state ), + rep, cur ); + + if( rep == 0 ) start_len = len + 1; // discard shorter matches + + // try rep + literal + rep0 + int len2 = len + 1; + const int limit = std::min( match_len_limit + len2, triable_bytes ); + while( len2 < limit && data[len2-dis] == data[len2] ) ++len2; 
+ len2 -= len + 1; + if( len2 < min_match_len ) continue; + + int pos_state2 = ( pos_state + len ) & pos_state_mask; + State state2 = cur_state; state2.set_rep(); + price += rep_len_prices.price( len, pos_state ) + + price0( bm_match[state2()][pos_state2] ) + + price_matched( data[len-1], data[len], data[len-dis] ); + pos_state2 = ( pos_state2 + 1 ) & pos_state_mask; + state2.set_char(); + price += price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len2, state2, pos_state2 ); + while( num_trials < cur + len + 1 + len2 ) + trials[++num_trials].price = infinite_price; + trials[cur+len+1+len2].update3( price, rep, cur + len + 1, cur ); + } + + // try matches + if( newlen >= start_len && newlen <= len_limit ) + { + const int normal_match_price = match_price + + price0( bm_rep[cur_state()] ); + + while( num_trials < cur + newlen ) + trials[++num_trials].price = infinite_price; + + int i = 0; + while( pairs[i].len < start_len ) ++i; + int dis = pairs[i].dis; + for( int len = start_len; ; ++len ) + { + int price = normal_match_price + price_pair( dis, len, pos_state ); + trials[cur+len].update( price, dis + num_rep_distances, cur ); + + // try match + literal + rep0 + if( len == pairs[i].len ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis2 = dis + 1; + int len2 = len + 1; + const int limit = std::min( match_len_limit + len2, triable_bytes ); + while( len2 < limit && data[len2-dis2] == data[len2] ) ++len2; + len2 -= len + 1; + if( len2 >= min_match_len ) + { + int pos_state2 = ( pos_state + len ) & pos_state_mask; + State state2 = cur_state; state2.set_match(); + price += price0( bm_match[state2()][pos_state2] ) + + price_matched( data[len-1], data[len], data[len-dis2] ); + pos_state2 = ( pos_state2 + 1 ) & pos_state_mask; + state2.set_char(); + price += price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len2, state2, pos_state2 ); + + while( num_trials < cur + len 
+ 1 + len2 ) + trials[++num_trials].price = infinite_price; + trials[cur+len+1+len2].update3( price, dis + num_rep_distances, + cur + len + 1, cur ); + } + if( ++i >= num_pairs ) break; + dis = pairs[i].dis; + } + } + } + } + } + + +bool LZ_encoder::encode_member( const unsigned long long member_size ) + { + const unsigned long long member_size_limit = + member_size - Lzip_trailer::size - max_marker_size; + const bool best = ( match_len_limit > 12 ); + const int dis_price_count = best ? 1 : 512; + const int align_price_count = best ? 1 : dis_align_size; + const int price_count = ( match_len_limit > 36 ) ? 1013 : 4093; + int price_counter = 0; // counters may decrement below 0 + int dis_price_counter = 0; + int align_price_counter = 0; + int reps[num_rep_distances]; + State state; + for( int i = 0; i < num_rep_distances; ++i ) reps[i] = 0; + + if( data_position() != 0 || renc.member_position() != Lzip_header::size ) + return false; // can be called only once + + if( !data_finished() ) // encode first byte + { + const uint8_t prev_byte = 0; + const uint8_t cur_byte = peek( 0 ); + renc.encode_bit( bm_match[state()][0], 0 ); + encode_literal( prev_byte, cur_byte ); + crc32.update_byte( crc_, cur_byte ); + get_match_pairs(); + move_pos(); + } + + while( !data_finished() ) + { + if( price_counter <= 0 && pending_num_pairs == 0 ) + { + price_counter = price_count; // recalculate prices every these bytes + if( dis_price_counter <= 0 ) + { dis_price_counter = dis_price_count; update_distance_prices(); } + if( align_price_counter <= 0 ) + { + align_price_counter = align_price_count; + for( int i = 0; i < dis_align_size; ++i ) + align_prices[i] = price_symbol_reversed( bm_align, i, dis_align_bits ); + } + match_len_prices.update_prices(); + rep_len_prices.update_prices(); + } + + int ahead = sequence_optimizer( reps, state ); + price_counter -= ahead; + + for( int i = 0; ahead > 0; ) + { + const int pos_state = ( data_position() - ahead ) & pos_state_mask; + const int len = 
trials[i].price; + int dis = trials[i].dis4; + + bool bit = ( dis < 0 ); + renc.encode_bit( bm_match[state()][pos_state], !bit ); + if( bit ) // literal byte + { + const uint8_t prev_byte = peek( ahead + 1 ); + const uint8_t cur_byte = peek( ahead ); + crc32.update_byte( crc_, cur_byte ); + if( state.is_char_set_char() ) + encode_literal( prev_byte, cur_byte ); + else + { + const uint8_t match_byte = peek( ahead + reps[0] + 1 ); + encode_matched( prev_byte, cur_byte, match_byte ); + } + } + else // match or repeated match + { + crc32.update_buf( crc_, ptr_to_current_pos() - ahead, len ); + mtf_reps( dis, reps ); + bit = ( dis < num_rep_distances ); + renc.encode_bit( bm_rep[state()], bit ); + if( bit ) // repeated match + { + bit = ( dis == 0 ); + renc.encode_bit( bm_rep0[state()], !bit ); + if( bit ) + renc.encode_bit( bm_len[state()][pos_state], len > 1 ); + else + { + renc.encode_bit( bm_rep1[state()], dis > 1 ); + if( dis > 1 ) + renc.encode_bit( bm_rep2[state()], dis > 2 ); + } + if( len == 1 ) state.set_short_rep(); + else + { + renc.encode_len( rep_len_model, len, pos_state ); + rep_len_prices.decrement_counter( pos_state ); + state.set_rep(); + } + } + else // match + { + dis -= num_rep_distances; + encode_pair( dis, len, pos_state ); + if( dis >= modeled_distances ) --align_price_counter; + --dis_price_counter; + match_len_prices.decrement_counter( pos_state ); + state.set_match(); + } + } + ahead -= len; i += len; + if( renc.member_position() >= member_size_limit ) + { + if( !dec_pos( ahead ) ) return false; + full_flush( state ); + return true; + } + } + } + full_flush( state ); + return true; + } diff --git a/encoder.h b/encoder.h new file mode 100644 index 0000000..2654753 --- /dev/null +++ b/encoder.h @@ -0,0 +1,290 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2024 Antonio Diaz Diaz. 
+ + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . +*/ + +class Len_prices + { + const Len_model & lm; + const int len_symbols; + const int count; + int prices[pos_states][max_len_symbols]; + int counters[pos_states]; // may decrement below 0 + + void update_low_mid_prices( const int pos_state ) + { + int * const pps = prices[pos_state]; + int tmp = price0( lm.choice1 ); + int len = 0; + for( ; len < len_low_symbols && len < len_symbols; ++len ) + pps[len] = tmp + price_symbol3( lm.bm_low[pos_state], len ); + if( len >= len_symbols ) return; + tmp = price1( lm.choice1 ) + price0( lm.choice2 ); + for( ; len < len_low_symbols + len_mid_symbols && len < len_symbols; ++len ) + pps[len] = tmp + + price_symbol3( lm.bm_mid[pos_state], len - len_low_symbols ); + } + + void update_high_prices() + { + const int tmp = price1( lm.choice1 ) + price1( lm.choice2 ); + for( int len = len_low_symbols + len_mid_symbols; len < len_symbols; ++len ) + // using 4 slots per value makes "price" faster + prices[3][len] = prices[2][len] = prices[1][len] = prices[0][len] = tmp + + price_symbol8( lm.bm_high, len - len_low_symbols - len_mid_symbols ); + } + +public: + void reset() { for( int i = 0; i < pos_states; ++i ) counters[i] = 0; } + + Len_prices( const Len_model & m, const int match_len_limit ) + : + lm( m ), + len_symbols( match_len_limit + 1 - min_match_len ), + count( ( match_len_limit > 12 ) ? 
1 : len_symbols ) + { reset(); } + + void decrement_counter( const int pos_state ) { --counters[pos_state]; } + + void update_prices() + { + bool high_pending = false; + for( int pos_state = 0; pos_state < pos_states; ++pos_state ) + if( counters[pos_state] <= 0 ) + { counters[pos_state] = count; + update_low_mid_prices( pos_state ); high_pending = true; } + if( high_pending && len_symbols > len_low_symbols + len_mid_symbols ) + update_high_prices(); + } + + int price( const int len, const int pos_state ) const + { return prices[pos_state][len - min_match_len]; } + }; + + +class LZ_encoder : public LZ_encoder_base + { + struct Pair // distance-length pair + { + int dis; + int len; + }; + + enum { infinite_price = 0x0FFFFFFF, + max_num_trials = 1 << 13, + single_step_trial = -2, + dual_step_trial = -1 }; + + struct Trial + { + State state; + int price; // dual use var; cumulative price, match length + int dis4; // -1 for literal, or rep, or match distance + 4 + int prev_index; // index of prev trial in trials[] + int prev_index2; // -2 trial is single step + // -1 literal + rep0 + // >= 0 ( rep or match ) + literal + rep0 + int reps[num_rep_distances]; + + void update( const int pr, const int distance4, const int p_i ) + { + if( pr < price ) + { price = pr; dis4 = distance4; prev_index = p_i; + prev_index2 = single_step_trial; } + } + + void update2( const int pr, const int p_i ) + { + if( pr < price ) + { price = pr; dis4 = 0; prev_index = p_i; + prev_index2 = dual_step_trial; } + } + + void update3( const int pr, const int distance4, const int p_i, + const int p_i2 ) + { + if( pr < price ) + { price = pr; dis4 = distance4; prev_index = p_i; + prev_index2 = p_i2; } + } + }; + + const int cycles; + const int match_len_limit; + Len_prices match_len_prices; + Len_prices rep_len_prices; + int pending_num_pairs; + Pair pairs[max_match_len+1]; + Trial trials[max_num_trials]; + + int dis_slot_prices[len_states][2*max_dictionary_bits]; + int 
dis_prices[len_states][modeled_distances]; + int align_prices[dis_align_size]; + const int num_dis_slots; + + bool dec_pos( const int ahead ) + { + if( ahead < 0 || pos < ahead ) return false; + pos -= ahead; + if( cyclic_pos < ahead ) cyclic_pos += dictionary_size + 1; + cyclic_pos -= ahead; + return true; + } + + int get_match_pairs( Pair * pairs = 0 ); + void update_distance_prices(); + + // move-to-front dis in/into reps; do nothing if( dis4 <= 0 ) + static void mtf_reps( const int dis4, int reps[num_rep_distances] ) + { + if( dis4 >= num_rep_distances ) // match + { + reps[3] = reps[2]; reps[2] = reps[1]; reps[1] = reps[0]; + reps[0] = dis4 - num_rep_distances; + } + else if( dis4 > 0 ) // repeated match + { + const int distance = reps[dis4]; + for( int i = dis4; i > 0; --i ) reps[i] = reps[i-1]; + reps[0] = distance; + } + } + + int price_shortrep( const State state, const int pos_state ) const + { + return price0( bm_rep0[state()] ) + price0( bm_len[state()][pos_state] ); + } + + int price_rep( const int rep, const State state, const int pos_state ) const + { + if( rep == 0 ) return price0( bm_rep0[state()] ) + + price1( bm_len[state()][pos_state] ); + int price = price1( bm_rep0[state()] ); + if( rep == 1 ) + price += price0( bm_rep1[state()] ); + else + { + price += price1( bm_rep1[state()] ); + price += price_bit( bm_rep2[state()], rep - 2 ); + } + return price; + } + + int price_rep0_len( const int len, const State state, const int pos_state ) const + { + return price_rep( 0, state, pos_state ) + + rep_len_prices.price( len, pos_state ); + } + + int price_pair( const int dis, const int len, const int pos_state ) const + { + const int price = match_len_prices.price( len, pos_state ); + const int len_state = get_len_state( len ); + if( dis < modeled_distances ) + return price + dis_prices[len_state][dis]; + else + return price + dis_slot_prices[len_state][get_slot( dis )] + + align_prices[dis & (dis_align_size - 1)]; + } + + int read_match_distances() + { 
+ const int num_pairs = get_match_pairs( pairs ); + if( num_pairs > 0 ) + { + const int len = pairs[num_pairs-1].len; + if( len == match_len_limit && len < max_match_len ) + pairs[num_pairs-1].len = + true_match_len( len, pairs[num_pairs-1].dis + 1 ); + } + return num_pairs; + } + + void move_and_update( int n ) + { + while( true ) + { + move_pos(); + if( --n <= 0 ) break; + get_match_pairs(); + } + } + + void backward( int cur ) + { + int dis4 = trials[cur].dis4; + while( cur > 0 ) + { + const int prev_index = trials[cur].prev_index; + Trial & prev_trial = trials[prev_index]; + + if( trials[cur].prev_index2 != single_step_trial ) + { + prev_trial.dis4 = -1; // literal + prev_trial.prev_index = prev_index - 1; + prev_trial.prev_index2 = single_step_trial; + if( trials[cur].prev_index2 >= 0 ) + { + Trial & prev_trial2 = trials[prev_index-1]; + prev_trial2.dis4 = dis4; dis4 = 0; // rep0 + prev_trial2.prev_index = trials[cur].prev_index2; + prev_trial2.prev_index2 = single_step_trial; + } + } + prev_trial.price = cur - prev_index; // len + cur = dis4; dis4 = prev_trial.dis4; prev_trial.dis4 = cur; + cur = prev_index; + } + } + + int sequence_optimizer( const int reps[num_rep_distances], + const State state ); + + enum { before_size = max_num_trials, + // bytes to keep in buffer after pos + after_size = ( 2 * max_match_len ) + 1, + dict_factor = 2, + num_prev_positions3 = 1 << 16, + num_prev_positions2 = 1 << 10, + num_prev_positions23 = num_prev_positions2 + num_prev_positions3, + pos_array_factor = 2 }; + +public: + LZ_encoder( const int dict_size, const int len_limit, + const int ifd, const int outfd ) + : + LZ_encoder_base( before_size, dict_size, after_size, dict_factor, + num_prev_positions23, pos_array_factor, ifd, outfd ), + cycles( ( len_limit < max_match_len ) ? 
16 + ( len_limit / 2 ) : 256 ), + match_len_limit( len_limit ), + match_len_prices( match_len_model, match_len_limit ), + rep_len_prices( rep_len_model, match_len_limit ), + pending_num_pairs( 0 ), + num_dis_slots( 2 * real_bits( dictionary_size - 1 ) ) + { + trials[1].prev_index = 0; + trials[1].prev_index2 = single_step_trial; + } + + void reset() + { + LZ_encoder_base::reset(); + match_len_prices.reset(); + rep_len_prices.reset(); + pending_num_pairs = 0; + } + + bool encode_member( const unsigned long long member_size ); + };
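Reviewer note: the move-to-front handling of repeated-match distances in `mtf_reps` (declared in `encoder.h` above) is easier to follow when run in isolation. The sketch below copies that function's body from the diff into a standalone translation unit. The value `num_rep_distances = 4` is an assumption, inferred from the `reps[3]` access and the "match distance + 4" comment in `Trial`; the real constant lives in `lzip.h`, which this diff does not show.

```cpp
#include <cassert>

// Assumed value; the real definition is in lzip.h (not part of this diff).
enum { num_rep_distances = 4 };

// Copied from the diff: update the history of the four most recent
// match distances.
//   dis4 >= 4 : new match; push (dis4 - 4) to the front, drop the oldest.
//   1..3      : repeated match; move reps[dis4] to the front.
//   <= 0      : literal (-1) or rep0 (0); the history is left unchanged.
static void mtf_reps( const int dis4, int reps[num_rep_distances] )
  {
  if( dis4 >= num_rep_distances )			// match
    {
    reps[3] = reps[2]; reps[2] = reps[1]; reps[1] = reps[0];
    reps[0] = dis4 - num_rep_distances;
    }
  else if( dis4 > 0 )				// repeated match
    {
    const int distance = reps[dis4];
    for( int i = dis4; i > 0; --i ) reps[i] = reps[i-1];
    reps[0] = distance;
    }
  }

// Hypothetical helper for the illustration only: compare a history
// against four expected values.
static bool reps_equal( const int reps[num_rep_distances],
                        const int a, const int b, const int c, const int d )
  { return reps[0] == a && reps[1] == b && reps[2] == c && reps[3] == d; }
```

Starting from `{ 10, 20, 30, 40 }`, calling `mtf_reps( 2, reps )` moves the third entry to the front, and `mtf_reps( num_rep_distances + 7, reps )` then pushes a new distance of 7 and discards the oldest entry. This is why `sequence_optimizer` and `encode_member` can call `mtf_reps` unconditionally: literals and rep0 fall through both branches.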