So, I'm implementing the LZ77 compression algorithm. To compress any file type, I take its binary representation and read it as chars (because 1 char is equal to 1 byte, afaik) into a std::string. The current version of the program compresses and decompresses files (.txt, .bmp, etc.) just fine: the size of the raw file in bytes matches the size of the decompressed file. Still, I started to wonder whether using the byte representation instead of bits is optimal at all:
- Is it optimal to use chars (bytes) instead of single bits? Is there any possible loss of bits?
- Also, is there a way to compare file sizes in bits instead of bytes? (Forgive me the stupid questions.)
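For context, here is a minimal sketch of how I read a file's raw bytes into a std::string (the path is a placeholder); since a file's size is always a whole number of bytes, its size in bits is just the byte count times 8:

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file's raw bytes into a std::string.
// Opening in binary mode prevents any newline translation,
// and reading via rdbuf() copies every byte, including '\0'.
std::string readFileBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Usage (path is a placeholder):
//   std::string data = readFileBytes("input.bmp");
//   std::size_t bits = data.size() * 8;  // size in bits
```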
And now for the actual code part. Short info on how LZ77 handles compression: the sliding window consists of 2 buffers, a history buffer and a lookahead buffer.
Below are the 2 main functions, compress and findLongestPrefix:
- compress moves char data between the 2 buffers and saves the encoded tuple ⟨offset, length, next char⟩
- findLongestPrefix finds the longest match for the lookahead buffer in the history buffer
- Is there a more elegant and efficient way of searching for the longest match? (In theory the algorithm should search right to left, but is there any complexity difference? Even if the offset ends up larger, it's still an int and still 4 bytes – I convert every int into 4 chars (bytes) when saving to the output binary file.)
LZ77::Triplet LZ77::slidingWindow::findLongestPrefix()
{
    // Minimal tuple (if no match > 1 is found)
    Triplet n(0, 0, lookheadBuffer[0]);
    size_t lookCurrLen = lookheadBuffer.length() - 1;
    size_t histCurrLen = historyBuffer.length();

    // Increase the substring (match) length on every iteration
    for (size_t i = 1; i <= std::min(lookCurrLen, histCurrLen); i++)
    {
        // Get the substring
        std::string s = lookheadBuffer.substr(0, i);
        size_t pos = historyBuffer.find(s);
        if (pos == std::string::npos)
            break;
        if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
            pos = histCurrLen - i;

        // If the match reaches the end of the history buffer, check for repeats
        // of the current longest substring following it in lookheadBuffer
        int extra = 0;
        if (histCurrLen == pos + i)
        {
            // Check for full following repeats
            while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
                extra += i;
            // Check for a partial following repeat
            int extraextra = i - 1;
            while (extraextra > 0)
            {
                if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
                    break;
                extraextra--;
            }
            extra += extraextra;
        }

        // Compare the lengths of matches
        if (n.length <= i + extra)
            n = Triplet(histCurrLen - pos, i + extra, lookheadBuffer[i + extra]);
    }
    return n;
}
void LZ77::compress()
{
    do
    {
        // Top up the lookheadBuffer from the remaining input data
        if ((window.lookheadBuffer.length() < window.lookBufferMax) && (byteDataString.length() != 0))
        {
            int len = window.lookBufferMax - window.lookheadBuffer.length();
            window.lookheadBuffer.append(byteDataString, 0, len);
            byteDataString.erase(0, len);
        }

        LZ77::Triplet triplet = window.findLongestPrefix();

        // Move the used part of lookheadBuffer to historyBuffer
        window.historyBuffer.append(window.lookheadBuffer, 0, triplet.length + 1);
        window.lookheadBuffer.erase(0, triplet.length + 1);

        // If historyBuffer's size exceeds max, delete the oldest part
        if (window.historyBuffer.length() > window.histBufferMax)
            window.historyBuffer.erase(0, window.historyBuffer.length() - window.histBufferMax);

        encoded.push_back(triplet);
    } while (window.lookheadBuffer.length());
}
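For reference, decompression is the mirror image of the loop above: replay each triplet against the growing output. Here is a minimal, self-contained sketch of that idea (it redeclares a Triplet struct with the same members as the one shown below; it is an illustration, not the original program's decompressor):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Triplet { int offset; int length; char next; };

// Rebuild the original data by replaying each triplet: copy `length` bytes
// starting `offset` positions back from the current end, then append `next`.
std::string decompress(const std::vector<Triplet>& encoded)
{
    std::string out;
    for (const Triplet& t : encoded)
    {
        size_t start = out.size() - t.offset;
        // Copy byte-by-byte: the match may overlap the bytes being appended
        // (offset < length), so a bulk copy of the source range is not safe.
        for (int i = 0; i < t.length; ++i)
            out += out[start + i];
        out += t.next;
    }
    return out;
}
```

The byte-by-byte inner loop is deliberate: LZ77 allows a match to extend into the data it is currently producing (that is exactly what the "following repeats" logic in findLongestPrefix exploits).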
Accessory functions:
int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);
    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}
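One thing worth knowing about these two helpers: because they memcpy the host's object representation, the byte order in the file is whatever the machine uses (little-endian on x86), so the output is not portable across endianness, though it round-trips fine on the machine that wrote it. A quick self-contained round-trip check through an in-memory stream:

```cpp
#include <cassert>
#include <cstring>
#include <sstream>

// Same helpers as in the post, reproduced here so the check is self-contained.
void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}

int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);
    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

// Write a value and read it back through a binary stringstream.
bool roundTrips(int v)
{
    std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
    intToBytes(ss, v);
    return intFromBytes(ss) == v;
}
```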
struct Triplet
{
    int offset;
    int length;
    char next;
};
1 Answer
In answer to your first question: it depends on whether you wish to optimise for speed or for compression ratio. If optimising for speed, bytes seem best – that is what the LZ4 algorithm does, and LZ4 is a variant of LZ77 highly optimised for speed. If optimising for compression ratio, I am unsure which is better, as I have never run a bitwise LZ77 compressor.
In answer to your second question: instead of having your historyBuffer.find() call return only the first position of a match, how about returning a std::vector (the C++ counterpart of Java's ArrayList) of all candidate match positions? If you find a match, you know you will perform another iteration of the loop (provided i is not already at its maximum value, which is unlikely). On the next iteration, instead of scanning the entire sliding window again, simply check which of the candidates in that vector still match once the string s has one additional character appended. This works because a match longer than the current one must build upon either the current match or some other, equally long match. This way you don't redo work that has already been done, and it means you can get rid of the lines
if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
    pos = histCurrLen - i;

// If the longest match is found, check if there are any repeats
// following the current longest substring in lookheadBuffer
int extra = 0;
if (histCurrLen == pos + i)
{
    // Check for full following repeats
    while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
        extra += i;
    // Check for partial following repeats
    int extraextra = i - 1;
    while (extraextra > 0)
    {
        if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
            break;
        extraextra--;
    }
    extra += extraextra;
}
without losing any performance.
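A sketch of that candidate-tracking idea (all names here are illustrative, not from the original code): keep the positions that matched at length i, and at length i + 1 keep only those whose next character also matches. The leftmost surviving candidate plays the role of the find() result.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Find the longest prefix of `look` occurring in `hist` by extending a
// shrinking set of candidate start positions one character at a time.
// Returns {offset from the end of hist, match length}; {0, 0} if no match.
std::pair<size_t, size_t> longestPrefixMatch(const std::string& hist,
                                             const std::string& look)
{
    // Seed: every position in hist matching the first character of look.
    std::vector<size_t> candidates;
    for (size_t p = 0; p < hist.size(); ++p)
        if (!look.empty() && hist[p] == look[0])
            candidates.push_back(p);

    std::pair<size_t, size_t> best{0, 0};
    size_t len = 1;
    while (!candidates.empty())
    {
        best = {hist.size() - candidates.front(), len};
        if (len >= look.size())
            break;
        // Keep only the candidates that still match with one more character.
        std::vector<size_t> next;
        for (size_t p : candidates)
            if (p + len < hist.size() && hist[p + len] == look[len])
                next.push_back(p);
        candidates.swap(next);
        ++len;
    }
    return best;
}
```

Each history position is inspected at most once per length it survives, so a position that stops matching early drops out of all later iterations, which is exactly the redundant work the current substr-and-find loop keeps redoing.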
With regard to your question about searching left-to-right versus right-to-left: the time complexity of the two is identical.
One final note: incorporating my suggestion, where you look for multiple matches rather than one, also influences which string-searching algorithm is optimal to implement in your historyBuffer.find() method. The Rabin-Karp substring-matching algorithm is generally best for finding multiple matches. It uses hashing to discard parts of the historyBuffer which definitely cannot match the substring, leaving you to check directly only the parts which are likely to match. However, if you are simply finding one match, as your current implementation does, then the Boyer-Moore algorithm is your best choice.
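For reference, a minimal Rabin-Karp sketch: hash the pattern once, slide a rolling hash over the text, and run a real comparison only on hash hits (this is a generic illustration, not drop-in code for the sliding window):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Rabin-Karp: return all start positions of `pat` in `text`.
// Uses a polynomial rolling hash in unsigned 64-bit arithmetic
// (overflow wraps, acting as an implicit modulus); equal hashes
// are confirmed with an actual substring compare.
std::vector<size_t> rabinKarp(const std::string& text, const std::string& pat)
{
    std::vector<size_t> hits;
    const size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return hits;

    const unsigned long long B = 257;            // hash base
    unsigned long long ph = 0, th = 0, pow = 1;  // pow becomes B^(m-1)
    for (size_t i = 0; i < m; ++i)
    {
        ph = ph * B + (unsigned char)pat[i];
        th = th * B + (unsigned char)text[i];
        if (i) pow *= B;
    }
    for (size_t i = 0; i + m <= n; ++i)
    {
        if (ph == th && text.compare(i, m, pat) == 0)
            hits.push_back(i);
        // Roll the window: drop text[i], add text[i + m].
        if (i + m < n)
            th = (th - (unsigned char)text[i] * pow) * B + (unsigned char)text[i + m];
    }
    return hits;
}
```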
- Thank you for your thorough answer. I'm kinda new to compression algorithms, so I hope you don't mind a couple more questions: 1. Is there a chance of data (bit) loss when reading blobs (bytes)? 2. Any way to compare the raw and decompressed files bitwise? – asymmetriq, Dec 2, 2019
- 1. No, there is no chance of losing bits of information when reading bytes as opposed to bits. Just make sure you don't perform signed arithmetic with unsigned variables. – waitaria, Dec 2, 2019
- 2. You could read the two files each into an array of bytes and compare them bytewise with XOR: if (c_1[i] ^ c_2[i]) == 0, the two arrays are equal at byte i; otherwise the set bits of c_1[i] ^ c_2[i] mark exactly the bit positions at which the two arrays differ. Does this answer your question? – waitaria, Dec 2, 2019
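That bitwise comparison can be sketched as follows (function name and MSB-first bit numbering are my own choices, not from the comment):

```cpp
#include <cassert>
#include <string>

// Return the index of the first differing bit between two equal-length
// byte strings, or -1 if they are identical. Bit 0 is the most significant
// bit of byte 0, so index = byte_index * 8 + bit_within_byte.
long firstDifferingBit(const std::string& a, const std::string& b)
{
    for (size_t i = 0; i < a.size(); ++i)
    {
        // Nonzero XOR marks exactly the bits where the two bytes differ.
        unsigned char diff = (unsigned char)a[i] ^ (unsigned char)b[i];
        if (diff)
            for (int bit = 7; bit >= 0; --bit)   // scan from the MSB down
                if (diff & (1u << bit))
                    return (long)(i * 8 + (7 - bit));
    }
    return -1;
}
```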