I have to export data from one database in order to import it into another. Due to limitations of the available import tool, the TXT file I generate at export is too large to be imported in one go. I have been manually splitting the file in two, but with the addition of some extra fields to the exported data that's getting problematic (slow when it doesn't outright crash Notepad++).
To help I put together a small C# application. The application reads the text line by line & at each 300,000th line outputs to a new TXT. During testing I found the run time to be... slow (it ran for over an hour and hadn't even done half the test file). The code is below, but I was hoping for any ideas on ways to achieve the same outcome more quickly.
A note: I did find some posts on CodeProject & ForgetCode that suggested going through the entire contents of the file character by character, keeping count of how many target characters (in this case \n) have been seen, & splitting when that counter hits the magic number. Given that a line-by-line read is already slow, character by character just seems like it would be even worse. Or am I wrong about that?
private void ParseBtn_Click(object sender, EventArgs e)
{
    long Line_Cnt = 0;
    long Completed_Lines = 0;
    long Lines_To_Step = 0;
    string Header = "";
    int File_Nbr = 1;
    string Temp = "";
    //Reset the status text & bar
    ParseStatusText.Text = "";
    ParseStatusBar.Value = 0;
    ParseStatusText.Text = DateTime.Now + " - Beginning QSI data parsing";
    //Check if the provided path is valid
    if (File.Exists(InputFile.Text))
    {
        string line = "";
        //Read the file into the string array
        try
        {
            using (StreamReader file = new StreamReader(InputFile.Text))
            {
                //Update the status
                ParseStatusText.Text = DateTime.Now + " - Getting count of lines in the QSI results\n" + ParseStatusText.Text;
                //Get the record count to base the progress bar on
                while ((line = file.ReadLine()) != null)
                {
                    //Check if we're on the header
                    if (Line_Cnt == 0)
                    {
                        Header = line;
                    } //Else generic line & no special action needed
                    Line_Cnt = Line_Cnt + 1;
                }
                file.Close();
            }
            //Set the maximum size of the progress bar & its step size so that we don't have to worry about partial steps
            ParseStatusBar.Step = 1;
            ParseStatusBar.Maximum = 100;
            Lines_To_Step = Convert.ToInt16(Math.Ceiling(Convert.ToDouble(Line_Cnt/98)));
            Line_Cnt = 0;
            //Update the status
            ParseStatusText.Text = DateTime.Now + " - Parsing the results into manageable files\n" + ParseStatusText.Text;
            ParseStatusBar.Value = 2;
            Temp = Header;
            using (StreamReader file = new StreamReader(InputFile.Text))
            {
                //Read through the lines
                while ((line = file.ReadLine()) != null)
                {
                    //Check if we have filled Temp for the file we're on
                    if ((Completed_Lines % 300000 == 0) && (Completed_Lines > 0))
                    {
                        //Write the file Temp is meant for
                        using (StreamWriter Parse_File = new StreamWriter(InputFile.Text.Replace(".txt", " P" + File_Nbr + ".txt")))
                        {
                            Parse_File.WriteLine(Temp);
                        }
                        //Setup for the new file
                        File_Nbr++;
                        Temp = Header;
                    }
                    else
                    {
                        //Add the line to Temp
                        Temp = Temp + "/n" + line;
                    }
                    Completed_Lines++;
                    //Check if we need to update the progress bar
                    if ((Completed_Lines % Lines_To_Step) == 0 && ParseStatusBar.Value <= 100)
                    {
                        ParseStatusBar.PerformStep();
                    } //Else not time to step yet
                }
                file.Close();
            }
            //Final Status
            ParseStatusBar.Value = 100;
            ParseStatusText.Text = DateTime.Now + " - Parse completed!";
        }
        catch (Exception ex)
        {
            //Log the error
            if (ex.InnerException == null)
            {
                ParseStatusText.Text = DateTime.Now + " - Encountered an error while reading & parsing the contents of the provided file. Error Details: " + ex.Message +
                    ". No Inner Exception.\n" + ParseStatusText.Text;
            }
            else
            {
                ParseStatusText.Text = DateTime.Now + " - Encountered an error while reading & parsing the contents of the provided file. Error Details: " + ex.Message +
                    ". Inner Error: " + ex.InnerException.Message + ".\n" + Environment.NewLine + ParseStatusText.Text;
            }
            throw;
        }
    }
    else
    {
        //Log the bad file path
        ParseStatusText.Text = DateTime.Now + " - The provided file does not exist" + Environment.NewLine + ParseStatusText.Text;
    }
}
- "Given that I can't change anything about the content of the lines, I don't see the benefit. And unless I'm missing something, the only effect of differences between the lines would be that larger lines likely take longer to append than shorter ones. And if that is true, it seems like the answer is moving away from line by line & toward, ideally, just being able to directly get the index of the 300,000th instance of `\n` & substring based on that value. But to my knowledge there is no such method." – JMichael, Jul 28, 2017 at 20:00
3 Answers
But you are not splitting on a specific character, so reading the file character by character would gain you nothing.
This code has some serious problems:
- String is immutable. This `Temp = Temp + "/n" + line;` is killing performance. Use a StringBuilder.
- Counting all the lines up front just to drive a progress bar is a bit excessive. Just report the number of files.
- Resetting a counter is going to be faster than `Completed_Lines % 300000`.
- You fail to write out the last set.
- You don't add `line` on every 300,000th line.
- `300000` is hard-coded.
You could get fancy with TextWriter.WriteLineAsync (a sketch follows the code below), but I bet this solves your problem:
private static void Parse(string fileName)
{
    if (File.Exists(fileName))
    {
        int File_Nbr = 1;
        int count = 0;
        int size = 300000;
        StringBuilder sb = new StringBuilder();
        using (StreamReader file = new StreamReader(fileName))
        {
            string line;
            string header = file.ReadLine();
            sb.AppendLine(header);
            while ((line = file.ReadLine()) != null)
            {
                if (string.IsNullOrEmpty(line))
                    continue;
                sb.AppendLine(line.Trim());
                count++;
                if (count == size)
                {
                    //Flush the buffered chunk to the next output file
                    using (StreamWriter Parse_File = new StreamWriter(fileName.Replace(".txt", " P" + File_Nbr + ".txt")))
                    {
                        Parse_File.Write(sb.ToString());
                    }
                    count = 0;
                    File_Nbr++;
                    sb.Clear();
                    sb.AppendLine(header);
                }
            }
            //Write out the last, partial set
            if (count > 0)
            {
                using (StreamWriter Parse_File = new StreamWriter(fileName.Replace(".txt", " P" + File_Nbr + ".txt")))
                {
                    Parse_File.Write(sb.ToString());
                }
            }
        }
    }
    else
    {
        //The file does not exist; report that however the UI expects
    }
}
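For reference, a rough sketch of that TextWriter.WriteLineAsync route - one writer held open per chunk instead of buffering everything in a StringBuilder. The SplitAsync name, the lazy writer creation, and the default chunk size are illustrative assumptions, not code from this answer; it needs System.IO and System.Threading.Tasks.
private static async Task SplitAsync(string fileName, int size = 300000)
{
    int fileNbr = 1;
    int count = 0;
    using (StreamReader reader = new StreamReader(fileName))
    {
        string header = await reader.ReadLineAsync();
        StreamWriter writer = null;
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (writer == null)
            {
                //Open the next chunk lazily, so a line count that is an exact
                //multiple of size doesn't leave a trailing header-only file
                writer = new StreamWriter(fileName.Replace(".txt", " P" + fileNbr + ".txt"));
                await writer.WriteLineAsync(header);
            }
            await writer.WriteLineAsync(line);
            if (++count == size)
            {
                //Chunk is full; close it and start a new one next iteration
                writer.Dispose();
                writer = null;
                fileNbr++;
                count = 0;
            }
        }
        writer?.Dispose();
    }
}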
- "You should write that the line might be killing performance. You have no proof that it does. Well, I still prefer using a profiler then reading tea leaves." – t3chb0t, Jul 29, 2017 at 6:58
- "@t3chb0t I think you mean than. I don't mean might." – paparazzo, Jul 29, 2017 at 7:19
- "I did remove the buffer variable and just write each line as I read. I also switched to resetting my line counter that controls when a new file is started. Final code added to my original question." – JMichael, Jul 31, 2017 at 13:11
- "Did you compare to StringBuilder? I don't like opening the output for each line. If you are going down that path then open it once and use TextWriter.WriteLineAsync." – paparazzo, Jul 31, 2017 at 13:46
I understand that you want to show a progress bar - that's definitely good for a long-running operation - but you don't need to know how many lines are in the file to do that.
Instead of reading the whole file once just to count the lines, you can use the file size and keep track of how much of it you've processed in bytes. In C#, a char is always 2 bytes, so the length of a line * 2 gives you the size of the line you've just processed (exact if the file is UTF-16; for other encodings it's only an estimate, but an estimate is fine for a progress bar).
Then your progress is just (bytes_processed / total_bytes) * 100.
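A sketch of that byte-based progress, wired into the question's existing loop and controls (the UTF-16 assumption makes the count exact; for UTF-8 it over-counts, which the cap below absorbs):
long totalBytes = new FileInfo(InputFile.Text).Length;
long bytesProcessed = 0;
string line;
using (StreamReader file = new StreamReader(InputFile.Text))
{
    while ((line = file.ReadLine()) != null)
    {
        //Chars in the line plus the newline the reader consumed, at 2 bytes each
        bytesProcessed += (line.Length + Environment.NewLine.Length) * 2;
        ParseStatusBar.Value = (int)Math.Min(100, bytesProcessed * 100 / totalBytes);
        //... the split logic from the other answers goes here ...
    }
}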
Other answers have already addressed the string concatenation.
- "Don't even need to do it line by line. Could just do it when you write out the file and update then. But you can also just look at the output files and see where you are." – paparazzo, Jul 31, 2017 at 17:16
- "@Paparazzi - true. If you're only splitting the file say 3 times, you'd just see 33%, 66% and 100%, which might be a bit frustrating. If the file is being split into 10 chunks or more, I'd say updating once per file is a good option." – RobH, Aug 1, 2017 at 7:32
- "If you are doing it properly, how long can it take to split 300,000 lines of text?" – paparazzo, Aug 1, 2017 at 8:55
- "@Paparazzi - How long is a piece of string? I've never done it, so I have no idea. I wouldn't imagine that it takes very long, but what if you have a really slow computer with a hard-used HD on its last legs? I like my progress bars to be responsive. You have 100 graduations; it's nice to use them all." – RobH, Aug 1, 2017 at 10:05
As Paparazzo answered, the string concatenation is suspect: each Temp = Temp + ... copies the entire accumulated string, so the cost grows quadratically as the string gets larger.
I would suggest not even trying to buffer the writes by building up a large string.
Just read and write each line - the file writer will do its own buffering.
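A minimal sketch of that approach, keeping the question's header-repeat and file-naming scheme (the SplitByLines name and the size parameter are illustrative):
private static void SplitByLines(string fileName, int size = 300000)
{
    int fileNbr = 1;
    int count = 0;
    using (StreamReader reader = new StreamReader(fileName))
    {
        string header = reader.ReadLine();
        StreamWriter writer = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (writer == null)
            {
                //Start the next chunk and repeat the header
                writer = new StreamWriter(fileName.Replace(".txt", " P" + fileNbr + ".txt"));
                writer.WriteLine(header);
            }
            //Write straight through; StreamWriter buffers internally
            writer.WriteLine(line);
            if (++count == size)
            {
                writer.Dispose();
                writer = null;
                fileNbr++;
                count = 0;
            }
        }
        writer?.Dispose();
    }
}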