I have to export data from one database in order to import it into another. Due to limitations of the available import tool, the TXT file I generate at export is too large to be imported in one go. I have been manually splitting the file in two, but with the addition of some extra fields to the exported data that's getting problematic (slow when it doesn't outright crash Notepad++).
To help I put together a small C# application. The application reads the text line by line & at each 300,000th line outputs to a new TXT. During testing I found the run time to be... slow (it ran for over an hour and hadn't even done half the test file). The code is below, but I was hoping for any ideas on ways to achieve the same outcome more quickly.
A note: I did find some posts on CodeProject & ForgetCode that suggested going through the entire contents of the file character by character, keeping count of how many target characters (in this case \n) have been seen, & splitting when that counter hits the magic number. Given that a line-by-line read is already slow, character by character just seems like it would be even worse. Or am I wrong about that?
private void ParseBtn_Click(object sender, EventArgs e)
{
    long Line_Cnt = 0;
    long Completed_Lines = 0;
    long Lines_To_Step = 0;
    string Header = "";
    int File_Nbr = 1;
    string Temp = "";
    //Reset the status text & bar
    ParseStatusText.Text = "";
    ParseStatusBar.Value = 0;
    ParseStatusText.Text = DateTime.Now + " - Beginning QSI data parsing";
    //Check if the provided path is valid
    if (File.Exists(InputFile.Text))
    {
        string line = "";
        //Read the file into the string array
        try
        {
            using (StreamReader file = new StreamReader(InputFile.Text))
            {
                //Update the status
                ParseStatusText.Text = DateTime.Now + " - Getting count of lines in the QSI results\n" + ParseStatusText.Text;
                //Get the record count to base the progress bar on
                while ((line = file.ReadLine()) != null)
                {
                    //Check if we're on the header
                    if (Line_Cnt == 0)
                    {
                        Header = line;
                    } //Else generic line & no special action needed
                    Line_Cnt = Line_Cnt + 1;
                }
                file.Close();
            }
            //Set the maximum size of the progress bar & its step size so that we don't have to worry about partial steps
            ParseStatusBar.Step = 1;
            ParseStatusBar.Maximum = 100;
            Lines_To_Step = Convert.ToInt16(Math.Ceiling(Convert.ToDouble(Line_Cnt/98)));
            Line_Cnt = 0;
            //Update the status
            ParseStatusText.Text = DateTime.Now + " - Parsing the results into manageable files\n" + ParseStatusText.Text;
            ParseStatusBar.Value = 2;
            Temp = Header;
            using (StreamReader file = new StreamReader(InputFile.Text))
            {
                //Read through the lines
                while ((line = file.ReadLine()) != null)
                {
                    //Check if we have filled Temp for the file we're on
                    if ((Completed_Lines % 300000 == 0) && (Completed_Lines > 0))
                    {
                        //Write the file Temp is meant for
                        using (StreamWriter Parse_File = new StreamWriter(InputFile.Text.Replace(".txt", " P" + File_Nbr + ".txt")))
                        {
                            Parse_File.WriteLine(Temp);
                        }
                        //Setup for the new file
                        File_Nbr++;
                        Temp = Header;
                    }
                    else
                    {
                        //Add the line to Temp
                        Temp = Temp + "/n" + line;
                    }
                    Completed_Lines++;
                    //Check if we need to update the progress bar
                    if ((Completed_Lines % Lines_To_Step) == 0 && ParseStatusBar.Value <= 100)
                    {
                        ParseStatusBar.PerformStep();
                    } //Else not time to step yet
                }
                file.Close();
            }
            //Final Status
            ParseStatusBar.Value = 100;
            ParseStatusText.Text = DateTime.Now + " - Parse completed!";
        }
        catch (Exception ex)
        {
            //Log the error
            if (ex.InnerException == null)
            {
                ParseStatusText.Text = DateTime.Now + " - Encountered an error while reading & parsing the contents of the provided file. Error Details: " + ex.Message +
                    ". No Inner Exception.\n" + ParseStatusText.Text;
            }
            else
            {
                ParseStatusText.Text = DateTime.Now + " - Encountered an error while reading & parsing the contents of the provided file. Error Details: " + ex.Message +
                    ". Inner Error: " + ex.InnerException.Message + ".\n" + Environment.NewLine + ParseStatusText.Text;
            }
            throw;
        }
    }
    else
    {
        //Log the bad file path
        ParseStatusText.Text = DateTime.Now + " - The provided file does not exist" + Environment.NewLine + ParseStatusText.Text;
    }
}
- "Given that I can't change anything about the content of the lines, I don't see the benefit. And unless I'm missing something, the only effect of differences between the lines would be that larger lines likely take longer to append than shorter ones. And if that is true, it seems like the answer is moving away from line by line & toward, ideally, just being able to directly get the index of the 300,000th instance of `\n` & substring based on that value. But to my knowledge there is no such method." – JMichael, Jul 28, 2017 at 20:00
3 Answers
But you are not splitting on a specific character, so reading the file character by character would gain you nothing.
This code has some serious problems:
- String is immutable. This `Temp = Temp + "/n" + line;` is killing performance. Use a StringBuilder.
- Counting all the lines up front just to drive a progress bar is a bit excessive. Just report the number of files.
- Resetting a counter is going to be faster than `Completed_Lines % 300000`.
- You fail to write out the last set.
- You don't add `line` on every 300,000th line.
- `300000` is hard-coded.
You could get fancy with TextWriter.WriteLineAsync (a sketch follows the code below), but I bet this solves your problem:
private static void Parse(string fileName)
{
    if (File.Exists(fileName))
    {
        int File_Nbr = 1;
        int count = 0;
        int size = 300000;
        StringBuilder sb = new StringBuilder();
        using (StreamReader file = new StreamReader(fileName))
        {
            string line;
            string header = file.ReadLine();
            sb.AppendLine(header);
            while ((line = file.ReadLine()) != null)
            {
                if (string.IsNullOrEmpty(line))
                    continue;
                sb.AppendLine(line.Trim());
                count++;
                if (count == size)
                {
                    //Flush the buffered chunk to the next output file
                    using (StreamWriter Parse_File = new StreamWriter(fileName.Replace(".txt", " P" + File_Nbr + ".txt")))
                    {
                        Parse_File.Write(sb.ToString());
                    }
                    count = 0;
                    File_Nbr++;
                    sb.Clear();
                    sb.AppendLine(header);
                }
            }
            //Write out the last, partial set
            if (count > 0)
            {
                using (StreamWriter Parse_File = new StreamWriter(fileName.Replace(".txt", " P" + File_Nbr + ".txt")))
                {
                    Parse_File.Write(sb.ToString());
                }
            }
        }
    }
    else
    {
        //The file does not exist; report that however the UI expects
    }
}
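For reference, a rough sketch of that TextWriter.WriteLineAsync route - one writer held open per chunk instead of buffering everything in a StringBuilder. The SplitAsync name, the lazy writer creation, and the default chunk size are illustrative assumptions, not code from this answer; it needs System.IO and System.Threading.Tasks.
private static async Task SplitAsync(string fileName, int size = 300000)
{
    int fileNbr = 1;
    int count = 0;
    using (StreamReader reader = new StreamReader(fileName))
    {
        string header = await reader.ReadLineAsync();
        StreamWriter writer = null;
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            if (writer == null)
            {
                //Open the next chunk lazily, so a line count that is an exact
                //multiple of size doesn't leave a trailing header-only file
                writer = new StreamWriter(fileName.Replace(".txt", " P" + fileNbr + ".txt"));
                await writer.WriteLineAsync(header);
            }
            await writer.WriteLineAsync(line);
            if (++count == size)
            {
                //Chunk is full; close it and start a new one next iteration
                writer.Dispose();
                writer = null;
                fileNbr++;
                count = 0;
            }
        }
        writer?.Dispose();
    }
}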
- "You should write that the line might be killing performance. You have no proof that it does. Well, I still prefer using a profiler then reading tea leaves." – t3chb0t, Jul 29, 2017 at 6:58
- "@t3chb0t I think you mean than. I don't mean might." – paparazzo, Jul 29, 2017 at 7:19
- "I did remove the buffer variable and just write each line as I read. I also switched to resetting my line counter that controls when a new file is started. Final code added to my original question." – JMichael, Jul 31, 2017 at 13:11
- "Did you compare to StringBuilder? I don't like opening the output for each line. If you are going down that path then open it once and use TextWriter.WriteLineAsync." – paparazzo, Jul 31, 2017 at 13:46
I understand that you want to show a progress bar - that's definitely good for a long-running operation - but you don't need to know how many lines are in the file to do that.
Instead of reading the whole file once just to count the lines, you can use the file size and keep track of how much of it you've processed in bytes. In C#, a char is always 2 bytes, so the length of a line * 2 gives you the size of the line you've just processed (exact if the file is UTF-16; for other encodings it's only an estimate, but an estimate is fine for a progress bar).
Then your progress is just (bytes_processed / total_bytes) * 100.
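A sketch of that byte-based progress, wired into the question's existing loop and controls (the UTF-16 assumption makes the count exact; for UTF-8 it over-counts, which the cap below absorbs):
long totalBytes = new FileInfo(InputFile.Text).Length;
long bytesProcessed = 0;
string line;
using (StreamReader file = new StreamReader(InputFile.Text))
{
    while ((line = file.ReadLine()) != null)
    {
        //Chars in the line plus the newline the reader consumed, at 2 bytes each
        bytesProcessed += (line.Length + Environment.NewLine.Length) * 2;
        ParseStatusBar.Value = (int)Math.Min(100, bytesProcessed * 100 / totalBytes);
        //... the split logic from the other answers goes here ...
    }
}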
Other answers have already addressed the string concatenation.
- "Don't even need to do it line by line. Could just do it when you write out the file and update then. But you can also just look at the output files and see where you are." – paparazzo, Jul 31, 2017 at 17:16
- "@Paparazzi - true. If you're only splitting the file say 3 times, you'd just see 33%, 66% and 100%, which might be a bit frustrating. If the file is being split into 10 chunks or more, I'd say updating once per file is a good option." – RobH, Aug 1, 2017 at 7:32
- "If you are doing it properly, how long can it take to split 300,000 lines of text?" – paparazzo, Aug 1, 2017 at 8:55
- "@Paparazzi - How long is a piece of string? I've never done it, so I have no idea. I wouldn't imagine that it takes very long, but what if you have a really slow computer with a hard-used HD on its last legs? I like my progress bars to be responsive. You have 100 graduations; it's nice to use them all." – RobH, Aug 1, 2017 at 10:05
As Paparazzo answered, the string concatenation is suspect: each Temp = Temp + ... copies the entire accumulated string, so the cost grows quadratically as the string gets larger.
I would suggest not even trying to buffer the writes by building up a large string.
Just read and write each line - the file writer will do its own buffering.
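A minimal sketch of that approach, keeping the question's header-repeat and file-naming scheme (the SplitByLines name and the size parameter are illustrative):
private static void SplitByLines(string fileName, int size = 300000)
{
    int fileNbr = 1;
    int count = 0;
    using (StreamReader reader = new StreamReader(fileName))
    {
        string header = reader.ReadLine();
        StreamWriter writer = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (writer == null)
            {
                //Start the next chunk and repeat the header
                writer = new StreamWriter(fileName.Replace(".txt", " P" + fileNbr + ".txt"));
                writer.WriteLine(header);
            }
            //Write straight through; StreamWriter buffers internally
            writer.WriteLine(line);
            if (++count == size)
            {
                writer.Dispose();
                writer = null;
                fileNbr++;
                count = 0;
            }
        }
        writer?.Dispose();
    }
}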