I wrote a method that reduces a string like aabcccccaaa
to a2b1c5a3
My implementation:
string Compress(string str)
{
StringBuilder builder = new StringBuilder();
using(TextReader reader = new StringReader(str))
{
while(reader.Peek() != - 1){
char c = (char)reader.Read();
int n = 1;
while(reader.Peek() == c) {
reader.Read();
n++;
}
builder.AppendFormat("{0}{1}",c,n);
}
}
return builder.ToString();
}
What do you think about this, is there a better way to do this?
-
1\$\begingroup\$ Looks fine. I'm sure you've already read the lit on the various ways of appending strings: dotnetperls.com/string-concat The only idea I had was maybe, this could be made faster by using memory stream and appending bytes. Strings have localization overhead, bytes wouldn't. \$\endgroup\$MatthewMartin– MatthewMartin2014年10月06日 22:46:35 +00:00Commented Oct 6, 2014 at 22:46
-
\$\begingroup\$ @MatthewMartin yeah replacing the using statement with this: using(MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(str))) using(TextReader reader = new StreamReader(ms)) would do the job but i'm not sure if it would be faster or not. \$\endgroup\$Rezo Megrelidze– Rezo Megrelidze2014年10月06日 23:04:29 +00:00Commented Oct 6, 2014 at 23:04
-
1\$\begingroup\$ I put some more thought into this. To decompress & disambiguate from numbers (e.g. compress & decompress aaa1111111111111, a3113 => a(3113 times)) you'd need fixed width slots or a delimiter. And you could store the number in byte (1 byte to represent 256 instead of 3) \$\endgroup\$MatthewMartin– MatthewMartin2014年10月07日 01:26:46 +00:00Commented Oct 7, 2014 at 1:26
-
2\$\begingroup\$ BTW, this operation is called run-length encoding (RLE). \$\endgroup\$Gabe– Gabe2014年10月07日 04:44:41 +00:00Commented Oct 7, 2014 at 4:44
3 Answers 3
You can use a regular expression to match a range of repeated characters and replace it with the character and the count:
public static string Compress(string str) {
return Regex.Replace(str, @"(.)1円*", m => m.Groups[1].Value + m.Value.Length);
}
If that is better or not is up for debate, but it is at least a lot shorter.
You can also make it simpler using a regular loop and access the characters by index instead of using a StringReader
. That makes it easier to compare the characters next to each other:
public static string Compress(string str) {
StringBuilder builder = new StringBuilder();
for (int i = 1, cnt = 1; i <= str.Length; i++, cnt++) {
if (i == str.Length || str[i] != str[i - 1]) {
builder.Append(str[i - 1]).Append(cnt);
cnt = 0;
}
}
return builder.ToString();
}
-
1\$\begingroup\$ Huh, never knew
Regex.Replace
took a delegate. That's pretty cool. \$\endgroup\$Bob– Bob2014年10月07日 09:10:32 +00:00Commented Oct 7, 2014 at 9:10
That's a perfectly cromulent implementation. I would just suggest making the style more consistent, with regards to spacing and braces, and maybe using Append
instead of AppendFormat
:
var builder = new StringBuilder();
using (var reader = new StringReader(str))
{
while (reader.Peek() != -1)
{
char c = (char)reader.Read();
int n = 1;
while (reader.Peek() == c)
{
reader.Read();
n++;
}
builder.Append(c).Append(n);
}
}
return builder.ToString();
StringReader
might be overkill for the situation; I find it to be of most use for its ReadLine
method, which we're not using here. You might want to consider something like this:
var compressed = new StringBuilder();
int start = 0;
while (start < input.Length)
{
char c = input[start];
int end = start + 1;
while (end < input.Length && input[end] == c)
{
end++;
}
compressed.Append(c).Append(end - start);
start = end;
}
return compressed.ToString();
-
\$\begingroup\$ That's an option but i'm not sure if it would perform any better. StringReader just stores a reference to the string so there's no memory overhead. \$\endgroup\$Rezo Megrelidze– Rezo Megrelidze2014年10月06日 23:38:21 +00:00Commented Oct 6, 2014 at 23:38
-
\$\begingroup\$ @mjolka
builder.Append(c).Append(n);
this is delightful. \$\endgroup\$Rezo Megrelidze– Rezo Megrelidze2014年10月07日 14:34:10 +00:00Commented Oct 7, 2014 at 14:34
You need to consider the type of String
that you wish to compress. If you are expecting multiple repeating characters then your system will work. However, if you wish to compress text, then this system will actually increase the size as few words have multiple repeating characters.
If you can guarantee that your string will be low-ASCII (no Unicode, no box-chars) then you can use the high-bit to signal your code to treat that character as a loop index.
if (c > 127) then n = c-127;
Also, you should be storing your repeat index as a value 'A'
not a string char "65"
.
There are GNUish libraries available for RLE encoding, and it might be easier for you to use one of those.