I get huge incoming files (up to 6GB) and they are littered with Control and non-ASCII characters. I need to strip them and I made this routine (below). The problem is that it is insanely slow. I would love any thoughts or advice on how I can speed it up.
public void StripHighBitCharacters(string fn)
{
string writeFile = fn + "B";
using (var reader = new StreamReader(fn))
using (var writer = new StreamWriter(writeFile))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (line.Length > 0)
{
writer.WriteLine(BuildClearString(line));
}
else
{
writer.WriteLine(line);
}
}
}
File.Copy(writeFile, fn, true);
File.Delete(writeFile);
}
public string BuildClearString(string line)
{
StringBuilder sb = new StringBuilder();
foreach (char c in line)
{
if (c >= 32 && c <= 175)
{
sb.Append(c);
}
}
return (sb.ToString());
}
5 Answers 5
General
- You are using
using
statements which is always good. - You have small and well named methods which is good as well but the
StripHighBitCharacters()
method doesn't do what the name implies. TheBuildClearString()
method is doing whatStripHighBitCharacters()
should do based on its name. - The method parameter of
StripHighBitCharacters
is poorly named. Why don't you name itfileName
? - You should be consistent with the usage of the
var
type. Why didn't you use it e.g for thestring writeFile
?
@1201ProgramAlarm mentioned in his/her answer reusing the StringBuilder
which is the way to go for a performance boost but I would take this further.
I would initialize the
StringBuilder
with a starting capacity of at least 4 kb, because usually your filesystem is storing its data in 4 kb blocks. But because you expect to get real big files you should increase the capacity to e.g 4mb.Instead of creating a new file with a filename of
fn + "B"
you should usePath.GetTempFileName()
and after the content is written delete the original and move the temp file to the original destination.
Implementing the mentioned points will lead to
private const int maxCapacity = 4096 * 1024;
private StringBuilder sb = new StringBuilder(maxCapacity);
public void CleanFile(string fileName)
{
var tempFileName = Path.GetTempFileName();
using (var reader = new StreamReader(fileName))
using (var writer = new StreamWriter(tempFileName))
{
sb.Length = 0;
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
if (line.Length + sb.Length > maxCapacity)
{
writer.Write(sb.ToString());
sb.Length = 0;
}
StripHighBitCharacters(line);
}
}
File.Delete(fileName);
File.Move(tempFileName, fileName);
}
public void StripHighBitCharacters(string value)
{
foreach (var c in value.Where(c => c > 31 && c < 176))
{
sb.Append(c);
}
sb.AppendLine();
}
After using poor man profiling (using Stopwatch) I figured that the provided StripHighBitCharacters()
method using linq took around 39 seconds.
Using just a loop and an if
like so
public void StripHighBitCharacters(string value)
{
foreach (var c in value)
{
if (c > 31 && c < 176)
{
sb.Append(c);
}
}
sb.AppendLine();
}
the measurements went better. It took only 22 seconds.
Both tests had been done using a file with 1.3 GB.
-
\$\begingroup\$ Nice general review. I've gone another step further and am reading and writing using a single fixed-size
char[]
. My (not very good) measurements suggest this gives a measly 20% reduction in time (buffer size over 16kB doesn't seem to help with my tiny 20MB test files (not properly tried messing with the filestream buffer sizes)), which seems to go up as the number of new-lines in the file increases (i.e. OP's code is working with smaller buffers). \$\endgroup\$VisualMelon– VisualMelon2018年02月05日 10:00:36 +00:00Commented Feb 5, 2018 at 10:00 -
\$\begingroup\$ Adding the attribute
[MethodImpl(MethodImplOptions.AggressiveInlining)]
toStripHighBitCharacters
should definitely make an improvement too. \$\endgroup\$Brad M– Brad M2018年02月06日 16:30:14 +00:00Commented Feb 6, 2018 at 16:30
One source of performance degradation is frequent memory allocations. In your case, the StringBuilder
will allocate space for every line in your file, and may allocate additional space (along with a data copy) for longer lines.
You can eliminate all of that by reusing the StringBuilder
object. At the start of BuildClearString
, call the clear method on it (sb.Clear();
or sb.Length = 0;
). Follow that up by a capacity check.
if (sb.Capacity < line.Length)
sb.Capacity = line.Length;
By changing the capacity you ensure that you have a buffer big enough to hold all the characters you will be adding, so you won't incur any memory allocations when processing a line. By reusing it, you keep the existing allocated memory, thus avoiding any allocations for later lines unless the line is longer than any you've already encountered. You can also set an initial capacity on the StringBuilder
object.
-
\$\begingroup\$ Are you saying to declare the SB object in the calling method then sending it via parameter to my BuildClearString function? \$\endgroup\$Missy– Missy2018年02月04日 14:10:13 +00:00Commented Feb 4, 2018 at 14:10
-
2\$\begingroup\$ @Missy That's one way to do it. Or you could declare it as a private member of the class your functions belong to. \$\endgroup\$1201ProgramAlarm– 1201ProgramAlarm2018年02月04日 18:29:45 +00:00Commented Feb 4, 2018 at 18:29
-
\$\begingroup\$ StringBuilder is going to add (and retain) Capacity as required. The check for sb.Capacity < line.Length may be more overhead than savings. \$\endgroup\$paparazzo– paparazzo2018年02月05日 15:24:04 +00:00Commented Feb 5, 2018 at 15:24
Why not just use the BinaryReader/BinaryWriter? If you have a lot of line breaks, you might end up with more iterations of your loop with ReadLine() and BinaryReader would minimize that overhead, and eliminate the need for StringBuilder or estimating the size of the buffer.
private void StripUnwantedChars(string InFile, string OutFile, int readSize = 1048576)
{
using (var fsInFile = File.Open(InFile, FileMode.Open, FileAccess.Read))
using (var bReader = new BinaryReader(fsInFile))
using (var fsOutfile = File.Open(OutFile, FileMode.Create))
using (var bWriter = new BinaryWriter(fsOutfile))
{
while (fsInFile.Position != fsInFile.Length)
{
byte[] bytes = bReader.ReadBytes(readSize);
foreach (byte checkByte in bytes)
{
if (((checkByte >= 32) && (checkByte <= 175)) || (checkByte == 13) || (checkByte == 10))
{
bWriter.Write(checkByte);
}
}
}
}
}
EDIT: Added check for line break and carriage return characters.
-
2\$\begingroup\$ @Heslacher - you're right, sorry. Was a little hasty in posting. Added the additional checks for the carriage return and line break characters. thanks for pointing that out (and it still runs faster than StreamReader). \$\endgroup\$jhilgeman– jhilgeman2018年02月05日 16:26:57 +00:00Commented Feb 5, 2018 at 16:26
-
1
-
2\$\begingroup\$ Old habits die hard. :) \$\endgroup\$jhilgeman– jhilgeman2018年02月05日 17:02:06 +00:00Commented Feb 5, 2018 at 17:02
-
\$\begingroup\$ Would this load a lot of data into RAM? The workstation this runs on won't have much RAM. \$\endgroup\$Missy– Missy2018年02月06日 14:26:29 +00:00Commented Feb 6, 2018 at 14:26
This should improve performance, so long as the StreamWriter.Write(char) implementation does not have particularly poor overheads.
NB, this will remove the need for any intermediate StringBuilder and associated temporary arrays.
public void StripHighBitCharacters(string fn)
{
string writeFile = fn + "B";
using (var reader = new StreamReader(fn))
using (var writer = new StreamWriter(writeFile))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (line.Length > 0)
{
foreach (var c in line.Where(c => c >= 32 && c <= 175)) { writer.Write(c); }
}
writer.WriteLine();
}
}
// You may wish to consider moving `fn` to a temp location and then deleting it after the `File.Move(writeFile, fn)` line succeeds
File.Delete(fn);
File.Move(writeFile, fn);
}
Consider a producer consumer pattern like BlockingCollection. Read a line and strip in the producer. Write the clean lines in the consumer. This keeps the disk active and strip is basically free. Use an UpperBound so the producer does not get too far ahead of the consumer.
As has been said just have one String builder
private StringBuilder sb = new StringBuilder();
public string BuildClearString(string line)
{
sb.clear();
If you don't need leading and trailing white-space characters then use String.Trim Method.
var line = reader.ReadLine().Trim();
This might be faster. But I doubt it.
foreach (char c in line)
{
if (c < 32)
{
continue;
}
if (c > 175)
{
continue;
}
sb.Append(c);
}
Without the producer consumer part I would trim it down. Those checks take time.
public void StripHighBitCharacters(string fn)
{
string writeFile = fn + "B";
using (var reader = new StreamReader(fn))
using (var writer = new StreamWriter(writeFile))
{
while (!reader.EndOfStream)
{
string line = reader.ReadLine().Trim();
writer.WriteLine(BuildClearString(line));
}
}
File.Delete(fn);
File.Move(writeFile, fn);
}
private StringBuilder sb = new StringBuilder();
public string BuildClearString(string line)
{
sb.Clear();
foreach (char c in line.Where(c => c >= 32 && c <= 175))
{
sb.Append(c);
}
return (sb.ToString());
}
ReadLine
andWriteline
, or can you accept just leaving\n
and\r
intact? \$\endgroup\$