String compression implementation in C#

Question 1

I wrote a method that reduces a string like aabcccccaaa to a2b1c5a3.

My implementation:

string Compress(string str)
{
    StringBuilder builder = new StringBuilder();
    using(TextReader reader = new StringReader(str))
    {
        while(reader.Peek() != - 1){
            char c = (char)reader.Read();
            int n = 1;
            while(reader.Peek() == c) {
                reader.Read();
                n++;
            }
            builder.AppendFormat("{0}{1}",c,n);
        }
    }
    return builder.ToString();
}

What do you think about this? Is there a better way to do it?

Question 2

Looks fine. I'm sure you've already read the literature on the various ways of appending strings: dotnetperls.com/string-concat. The only idea I had was that this could maybe be made faster by using a MemoryStream and appending bytes. Strings have localization overhead; bytes wouldn't.

Question 3

@MatthewMartin Yeah, replacing the using statement with this: using(MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(str))) using(TextReader reader = new StreamReader(ms, Encoding.Unicode)) would do the job, but I'm not sure whether it would be faster.

Question 4

I put some more thought into this. To decompress and disambiguate counts from digits in the input (e.g. compress and decompress aaa1111111111111: a3113 could be read as a repeated 3113 times), you'd need fixed-width slots or a delimiter. You could also store the count in a byte (1 byte can represent counts up to 255, instead of up to 3 digit characters).
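The delimiter idea can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name DelimitedRle and the choice of ';' as the terminator are assumptions. Each run is written as the character, its decimal count, and a terminator, so the decoder always knows where a count ends, even when the input itself contains digits or semicolons.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class DelimitedRle
{
    // Assumed format: <char><decimal count>; for every run.
    private const char Terminator = ';';

    public static string Compress(string input)
    {
        var output = new StringBuilder();
        int start = 0;
        while (start < input.Length)
        {
            char c = input[start];
            int end = start + 1;
            while (end < input.Length && input[end] == c)
            {
                end++;
            }
            output.Append(c).Append(end - start).Append(Terminator);
            start = end;
        }
        return output.ToString();
    }

    public static string Decompress(string encoded)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < encoded.Length)
        {
            // The first character of a run is always a literal,
            // even if it happens to be a digit or the terminator.
            char c = encoded[i++];
            int terminator = encoded.IndexOf(Terminator, i);
            if (terminator < 0)
            {
                throw new FormatException("Missing run terminator.");
            }
            int count = int.Parse(encoded.Substring(i, terminator - i));
            output.Append(c, count);
            i = terminator + 1;
        }
        return output.ToString();
    }
}
```

With this format, three a's followed by thirteen 1s compress to a3;113; and decompress back to the original, at the cost of one extra character per run.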

Question 5

BTW, this operation is called run-length encoding (RLE).

Question 10

@mjolka builder.Append(c).Append(n); this is delightful.

Guffa · Answer 1 · 2014-10-06 23:50:32Z

You can use a regular expression to match a range of repeated characters and replace it with the character and the count:

public static string Compress(string str) {
    return Regex.Replace(str, @"(.)\1*", m => m.Groups[1].Value + m.Value.Length);
}

Whether that is better or not is up for debate, but it is at least a lot shorter.
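One caveat with the regular-expression approach: in .NET, `.` does not match `\n` unless RegexOptions.Singleline is set, so newlines in the input would be copied through without a count. A minimal sketch of a variant that handles them (the class name RegexRle is made up for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper, named for illustration only.
public static class RegexRle
{
    // (.) captures one character and \1* extends the match over its repeats.
    // RegexOptions.Singleline lets '.' match '\n' as well, so runs of
    // newlines are counted instead of being left as they are.
    public static string Compress(string str)
    {
        return Regex.Replace(
            str,
            @"(.)\1*",
            m => m.Groups[1].Value + m.Value.Length,
            RegexOptions.Singleline);
    }
}
```

For input without newlines, this behaves the same as the one-liner without the option.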

You can also make it simpler using a regular loop and access the characters by index instead of using a StringReader. That makes it easier to compare the characters next to each other:

public static string Compress(string str) {
    StringBuilder builder = new StringBuilder();
    for (int i = 1, cnt = 1; i <= str.Length; i++, cnt++) {
        if (i == str.Length || str[i] != str[i - 1]) {
            builder.Append(str[i - 1]).Append(cnt);
            cnt = 0;
        }
    }
    return builder.ToString();
}

Huh, never knew Regex.Replace took a delegate. That's pretty cool.

mjolka · Answer 2 · 2014-10-06 23:23:32Z

That's a perfectly cromulent implementation. I would just suggest making the style more consistent with regard to spacing and braces, and maybe using Append instead of AppendFormat:

var builder = new StringBuilder();
using (var reader = new StringReader(str))
{
    while (reader.Peek() != -1)
    {
        char c = (char)reader.Read();
        int n = 1;
        while (reader.Peek() == c)
        {
            reader.Read();
            n++;
        }
        builder.Append(c).Append(n);
    }
}
return builder.ToString();

StringReader might be overkill for the situation; I find it to be of most use for its ReadLine method, which we're not using here. You might want to consider something like this:

var compressed = new StringBuilder();
int start = 0;
while (start < input.Length)
{
    char c = input[start];
    int end = start + 1;
    while (end < input.Length && input[end] == c)
    {
        end++;
    }
    compressed.Append(c).Append(end - start);
    start = end;
}
return compressed.ToString();

That's an option, but I'm not sure if it would perform any better. StringReader just stores a reference to the string, so there's no memory overhead.

Anonymous Coward · Answer 3 · 2014-10-07 06:55:30Z

You need to consider the type of string that you wish to compress. If you are expecting long runs of repeated characters, then your system will work. However, if you wish to compress ordinary text, then this system will actually increase the size, as few words contain repeated adjacent characters (hello, for example, becomes h1e1l2o1).

If you can guarantee that your string will be low-ASCII (no Unicode, no box-drawing characters), then you can use the high bit to signal to your code that a character should be treated as a repeat count.

if (c > 127) n = c - 127;

Also, you should store the repeat count as a single character value ('A', i.e. 65) rather than as the digit string "65".
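A rough sketch of how the high-bit idea could look when writing bytes instead of a string. The class name HighBitRle, the byte layout (an optional count byte followed by the character), and the choice to omit the count byte for single characters are all assumptions made for illustration. A count byte stores 127 + n, so the decoder recovers n with the c - 127 rule above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class HighBitRle
{
    // Largest run one count byte can hold: 255 - 127 = 128.
    private const int MaxRun = 128;

    public static byte[] Compress(string input)
    {
        var output = new List<byte>();
        int start = 0;
        while (start < input.Length)
        {
            char c = input[start];
            if (c > 127)
            {
                throw new ArgumentException("Input must be 7-bit ASCII.", nameof(input));
            }
            int end = start + 1;
            while (end < input.Length && input[end] == c && end - start < MaxRun)
            {
                end++;
            }
            int n = end - start;
            if (n > 1)
            {
                output.Add((byte)(127 + n)); // count byte, high bit set
            }
            output.Add((byte)c); // the character itself, high bit clear
            start = end;
        }
        return output.ToArray();
    }

    public static string Decompress(byte[] data)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < data.Length)
        {
            int n = 1;
            if (data[i] > 127)
            {
                n = data[i] - 127; // same rule as: if (c > 127) n = c - 127;
                i++;
                if (i == data.Length)
                {
                    throw new FormatException("Count byte without a character.");
                }
            }
            output.Append((char)data[i], n);
            i++;
        }
        return output.ToString();
    }
}
```

Under these assumptions, aabcccccaaa (11 characters) encodes to 7 bytes, and text without repeats such as abc stays at one byte per character instead of growing.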

There are GNUish libraries available for RLE encoding, and it might be easier for you to use one of those.
