So, I'm implementing the LZ77 compression algorithm. To compress any file type, I take its binary representation and read it as chars (because 1 char is equal to 1 byte, afaik) into a std::string. The current version of the program compresses and decompresses files (.txt, .bmp, etc.) just fine: the size of the raw file in bytes matches the size of the decompressed file. Still, I started to wonder whether using the byte representation instead of bits is optimal at all:
- Is it optimal to use chars (bytes) instead of single bits? Is there any possible loss of bits?
- Also, is there a way to compare file sizes in bits instead of bytes? (Forgive me the stupid questions.)
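For context, here is a minimal sketch of how I read a file's raw bytes into a std::string (the path is a placeholder); since a file's size is always a whole number of bytes, its size in bits is just the byte count times 8:

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file's raw bytes into a std::string.
// Opening in binary mode prevents any newline translation,
// and reading via rdbuf() copies every byte, including '\0'.
std::string readFileBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Usage (path is a placeholder):
//   std::string data = readFileBytes("input.bmp");
//   std::size_t bits = data.size() * 8;  // size in bits
```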
And now for the actual code part. Short info on how LZ77 handles compression: the sliding window consists of 2 buffers, a history buffer and a lookahead buffer.
Below are the 2 main functions, compress and findLongestPrefix:
- compress moves char data between the 2 buffers and saves the encoded tuple ⟨offset, length, next char⟩
- findLongestPrefix finds the longest match for the lookahead buffer in the history buffer
- Is there a more elegant and efficient way of searching for the longest match? (In theory the algorithm should search right to left, but is there any complexity difference? Even if the offset ends up larger, it's still an int and still 4 bytes – I convert every int into 4 chars (bytes) when saving to the output binary file.)
LZ77::Triplet LZ77::slidingWindow::findLongestPrefix()
{
    // Minimal tuple (if no match > 1 is found)
    Triplet n(0, 0, lookheadBuffer[0]);
    size_t lookCurrLen = lookheadBuffer.length() - 1;
    size_t histCurrLen = historyBuffer.length();

    // Increase the substring (match) length on every iteration
    for (size_t i = 1; i <= std::min(lookCurrLen, histCurrLen); i++)
    {
        // Get the substring
        std::string s = lookheadBuffer.substr(0, i);
        size_t pos = historyBuffer.find(s);
        if (pos == std::string::npos)
            break;
        if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
            pos = histCurrLen - i;

        // If the match reaches the end of the history buffer, check for repeats
        // of the current longest substring following it in lookheadBuffer
        int extra = 0;
        if (histCurrLen == pos + i)
        {
            // Check for full following repeats
            while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
                extra += i;
            // Check for a partial following repeat
            int extraextra = i - 1;
            while (extraextra > 0)
            {
                if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
                    break;
                extraextra--;
            }
            extra += extraextra;
        }

        // Compare the lengths of matches
        if (n.length <= i + extra)
            n = Triplet(histCurrLen - pos, i + extra, lookheadBuffer[i + extra]);
    }
    return n;
}
void LZ77::compress()
{
    do
    {
        // Top up the lookheadBuffer from the remaining input data
        if ((window.lookheadBuffer.length() < window.lookBufferMax) && (byteDataString.length() != 0))
        {
            int len = window.lookBufferMax - window.lookheadBuffer.length();
            window.lookheadBuffer.append(byteDataString, 0, len);
            byteDataString.erase(0, len);
        }

        LZ77::Triplet triplet = window.findLongestPrefix();

        // Move the used part of lookheadBuffer to historyBuffer
        window.historyBuffer.append(window.lookheadBuffer, 0, triplet.length + 1);
        window.lookheadBuffer.erase(0, triplet.length + 1);

        // If historyBuffer's size exceeds max, delete the oldest part
        if (window.historyBuffer.length() > window.histBufferMax)
            window.historyBuffer.erase(0, window.historyBuffer.length() - window.histBufferMax);

        encoded.push_back(triplet);
    } while (window.lookheadBuffer.length());
}
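For reference, decompression is the mirror image of the loop above: replay each triplet against the growing output. Here is a minimal, self-contained sketch of that idea (it redeclares a Triplet struct with the same members as the one shown below; it is an illustration, not the original program's decompressor):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Triplet { int offset; int length; char next; };

// Rebuild the original data by replaying each triplet: copy `length` bytes
// starting `offset` positions back from the current end, then append `next`.
std::string decompress(const std::vector<Triplet>& encoded)
{
    std::string out;
    for (const Triplet& t : encoded)
    {
        size_t start = out.size() - t.offset;
        // Copy byte-by-byte: the match may overlap the bytes being appended
        // (offset < length), so a bulk copy of the source range is not safe.
        for (int i = 0; i < t.length; ++i)
            out += out[start + i];
        out += t.next;
    }
    return out;
}
```

The byte-by-byte inner loop is deliberate: LZ77 allows a match to extend into the data it is currently producing (that is exactly what the "following repeats" logic in findLongestPrefix exploits).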
Accessory functions:
int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);
    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}
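One thing worth knowing about these two helpers: because they memcpy the host's object representation, the byte order in the file is whatever the machine uses (little-endian on x86), so the output is not portable across endianness, though it round-trips fine on the machine that wrote it. A quick self-contained round-trip check through an in-memory stream:

```cpp
#include <cassert>
#include <cstring>
#include <sstream>

// Same helpers as in the post, reproduced here so the check is self-contained.
void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}

int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);
    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

// Write a value and read it back through a binary stringstream.
bool roundTrips(int v)
{
    std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
    intToBytes(ss, v);
    return intFromBytes(ss) == v;
}
```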
struct Triplet
{
    int offset;
    int length;
    char next;
};
1 Answer
In answer to your first question: it depends on whether you wish to optimise for speed or for compression ratio. If optimising for speed, bytes seem best – that is what the LZ4 algorithm does, and LZ4 is a variant of LZ77 highly optimised for speed. If optimising for compression ratio, I am unsure which is better, as I have never run a bitwise LZ77 compressor.
In answer to your second question: instead of having your historyBuffer.find() call return only the first position of a match, how about returning a std::vector (the C++ counterpart of Java's ArrayList) of all candidate match positions? If you find a match, you know you will perform another iteration of the loop (provided i is not already at its maximum value, which is unlikely). On the next iteration, instead of scanning the entire sliding window again, simply check which of the candidates in that vector still match once the string s has one additional character appended. This works because a match longer than the current one must build upon either the current match or some other, equally long match. This way you don't redo work that has already been done, and it means you can get rid of the lines
if ((historyBuffer.compare(histCurrLen - i, i, s) == 0) && (lookheadBuffer[0] == lookheadBuffer[i]))
    pos = histCurrLen - i;

// If the longest match is found, check if there are any repeats
// following the current longest substring in lookheadBuffer
int extra = 0;
if (histCurrLen == pos + i)
{
    // Check for full following repeats
    while ((lookCurrLen >= i + extra + i) && (lookheadBuffer.compare(i + extra, i, s) == 0))
        extra += i;
    // Check for partial following repeats
    int extraextra = i - 1;
    while (extraextra > 0)
    {
        if ((lookCurrLen >= i + extra + extraextra) && (lookheadBuffer.compare(i + extra, extraextra, s, 0, extraextra) == 0))
            break;
        extraextra--;
    }
    extra += extraextra;
}
without losing any performance.
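A sketch of that candidate-tracking idea (all names here are illustrative, not from the original code): keep the positions that matched at length i, and at length i + 1 keep only those whose next character also matches. The leftmost surviving candidate plays the role of the find() result.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Find the longest prefix of `look` occurring in `hist` by extending a
// shrinking set of candidate start positions one character at a time.
// Returns {offset from the end of hist, match length}; {0, 0} if no match.
std::pair<size_t, size_t> longestPrefixMatch(const std::string& hist,
                                             const std::string& look)
{
    // Seed: every position in hist matching the first character of look.
    std::vector<size_t> candidates;
    for (size_t p = 0; p < hist.size(); ++p)
        if (!look.empty() && hist[p] == look[0])
            candidates.push_back(p);

    std::pair<size_t, size_t> best{0, 0};
    size_t len = 1;
    while (!candidates.empty())
    {
        best = {hist.size() - candidates.front(), len};
        if (len >= look.size())
            break;
        // Keep only the candidates that still match with one more character.
        std::vector<size_t> next;
        for (size_t p : candidates)
            if (p + len < hist.size() && hist[p + len] == look[len])
                next.push_back(p);
        candidates.swap(next);
        ++len;
    }
    return best;
}
```

Each history position is inspected at most once per length it survives, so a position that stops matching early drops out of all later iterations, which is exactly the redundant work the current substr-and-find loop keeps redoing.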
With regard to your question about searching left-to-right versus right-to-left: the time complexity of the two is identical.
One final note: incorporating my suggestion, where you look for multiple matches rather than one, also influences which string-searching algorithm is optimal to implement in your historyBuffer.find() method. The Rabin-Karp substring-matching algorithm is generally best for finding multiple matches. It uses hashing to discard parts of the historyBuffer which definitely cannot match the substring, leaving you to check directly only the parts which are likely to match. However, if you are simply finding one match, as your current implementation does, then the Boyer-Moore algorithm is your best choice.
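For reference, a minimal Rabin-Karp sketch: hash the pattern once, slide a rolling hash over the text, and run a real comparison only on hash hits (this is a generic illustration, not drop-in code for the sliding window):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Rabin-Karp: return all start positions of `pat` in `text`.
// Uses a polynomial rolling hash in unsigned 64-bit arithmetic
// (overflow wraps, acting as an implicit modulus); equal hashes
// are confirmed with an actual substring compare.
std::vector<size_t> rabinKarp(const std::string& text, const std::string& pat)
{
    std::vector<size_t> hits;
    const size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return hits;

    const unsigned long long B = 257;            // hash base
    unsigned long long ph = 0, th = 0, pow = 1;  // pow becomes B^(m-1)
    for (size_t i = 0; i < m; ++i)
    {
        ph = ph * B + (unsigned char)pat[i];
        th = th * B + (unsigned char)text[i];
        if (i) pow *= B;
    }
    for (size_t i = 0; i + m <= n; ++i)
    {
        if (ph == th && text.compare(i, m, pat) == 0)
            hits.push_back(i);
        // Roll the window: drop text[i], add text[i + m].
        if (i + m < n)
            th = (th - (unsigned char)text[i] * pow) * B + (unsigned char)text[i + m];
    }
    return hits;
}
```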
- Thank you for your thorough answer. I'm kinda new to compression algorithms, so I hope you don't mind a couple more questions: 1. Is there a chance of data (bit) loss when reading blobs (bytes)? 2. Any way to compare the raw and decompressed files bitwise? – asymmetriq, Dec 2, 2019
- 1. No, there is no chance of losing bits of information when reading bytes as opposed to bits. Just make sure you don't perform signed arithmetic with unsigned variables. – waitaria, Dec 2, 2019
- 2. You could read the two files each into an array of bytes and compare them bytewise with XOR: if (c_1[i] ^ c_2[i]) == 0, the two arrays are equal at byte i; otherwise the set bits of c_1[i] ^ c_2[i] mark exactly the bit positions at which the two arrays differ. Does this answer your question? – waitaria, Dec 2, 2019
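That bitwise comparison can be sketched as follows (function name and MSB-first bit numbering are my own choices, not from the comment):

```cpp
#include <cassert>
#include <string>

// Return the index of the first differing bit between two equal-length
// byte strings, or -1 if they are identical. Bit 0 is the most significant
// bit of byte 0, so index = byte_index * 8 + bit_within_byte.
long firstDifferingBit(const std::string& a, const std::string& b)
{
    for (size_t i = 0; i < a.size(); ++i)
    {
        // Nonzero XOR marks exactly the bits where the two bytes differ.
        unsigned char diff = (unsigned char)a[i] ^ (unsigned char)b[i];
        if (diff)
            for (int bit = 7; bit >= 0; --bit)   // scan from the MSB down
                if (diff & (1u << bit))
                    return (long)(i * 8 + (7 - bit));
    }
    return -1;
}
```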