I have a file format that I need to read which contains blocks of CSV data interspersed with non-CSV (or at least, inconsistent CSV) lines. Other than the very first line of the file (which has comma-separated headers), each repeated "block" has one line containing some metadata - including the number of actual CSV lines in that block - followed by the CSV data. The blocks are of varying length within the file. An example with dummy data can be seen at the end of the question.
I have written the following code, which uses the TextFileReader in pandas to parse the metadata line into a dictionary and then read each block. I need the metadata dicts in a list and the CSV data concatenated. This reader works and gives me what I need, but it seems a bit hokey. The non-CSV line is still read as if it were CSV; I can extract the info I need, but it's awkward. I also couldn't figure out how to exit the TextFileReader nicely without simply catching the StopIteration error, and I very much dislike using while True.
Is there a cleaner way to exit using the iterative TextFileReader, and/or any other tidying this code could use?
from io import StringIO
import pandas as pd


# This is the reader function
def read_mixed(fname):
    metalist = []
    dflist = []
    ibuff = 0
    with pd.read_csv(fname, skipinitialspace=True, iterator=True) as reader:
        try:
            while True:
                ibuff += 1
                # Read the meta info
                meta_buff = reader.read(1)
                nlines = int(meta_buff.iloc[0, 0].split('=')[-1])
                tmp_meta = {x.split('=')[0]: x.split('=')[1]
                            for x in meta_buff.iloc[0, 1:].dropna()}
                tmp_meta.update({'LINES': nlines})
                # Read the CSV info
                tmp_df = reader.read(nlines)
                tmp_df['ibuff'] = ibuff
                metalist.append(tmp_meta)
                dflist.append(tmp_df)
        except StopIteration:
            pass
    return metalist, pd.concat(dflist, axis=0).reset_index(drop=True)
# This is a tester that illustrates the format of the real file(s)
if __name__ == "__main__":
fstr = StringIO("""a1,a2,a3,b1,b2,b3,c1,c2,c3
Comment line: LINES=5, CAR1=Top Fuel, CAR2=Funny, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
Comment line: LINES=5, CAR1=Funny, CAR2=Pro Stock, CAR3=Top Fuel
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
Comment line: LINES=3, CAR1=Funny, CAR2=Top Fuel, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
""")
    output = read_mixed(fstr)
    print(output)
- \$\begingroup\$ python3.x? \$\endgroup\$ – Toby Speight, Apr 22, 2022 at 21:16
- \$\begingroup\$ How stringent are your files? E.g. do they largely vary from your example? This regex seems to match your data quite well: regex101.com/r/gO9Nus/2 \$\endgroup\$ – N3buchadnezzar, Apr 22, 2022 at 21:58
- \$\begingroup\$ @N3buchadnezzar The files vary very little from what I've put here, down to the colon in the meta line and the fact that the CSV data are always integers. I see the regex you made; are you suggesting just parsing the whole file using regex? I'd have to see how that compares efficiency-wise: the format is very precise, but the real files are also very large. There are ~3000 of these blocks with ~500 lines each. \$\endgroup\$ – Ajean, Apr 22, 2022 at 22:13
- \$\begingroup\$ @Ajean I am hesitant to post this as an answer, but can you test how well the following code performs on your data? pastebin.com/ShXyQaUH \$\endgroup\$ – N3buchadnezzar, Apr 22, 2022 at 22:32
- \$\begingroup\$ @N3buchadnezzar It looks like that works, in that it pulls out the lines okay. And on my small example file it's actually faster than what I wrote above! I just shut down access where the real files are, so I won't be able to test it on the big ones until Monday. I'll bring it in then and see how it works there. That'd be neat if the solution is so "low tech". \$\endgroup\$ – Ajean, Apr 22, 2022 at 22:48
2 Answers
While Reinderien posted an excellent solution to your problem, let me take a closer look and give you some feedback on your code.
Let us focus on the three biggest things that stand out to me and how to fix them:
- Your code has very low readability. Perhaps you know how it works and what it does, but would a colleague be able to, without you telling them?
- A big red flag is if you have try HUGE BLOCK OF CODE except ...; then you are not really looking for specific errors, just praying "please make it work" (a narrower alternative is sketched after this list).
- Overcomplicated code.
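As a sketch of that narrower alternative, the try can be confined to the single call that is expected to raise (read_metadata_row is a made-up helper name; reader is the TextFileReader from the question):

def read_metadata_row(reader):
    """Return the next raw metadata row, or None at end of file."""
    try:
        # reader.read(1) is the only call here that can raise StopIteration
        return reader.read(1)
    except StopIteration:
        return None

The loop then tests the return value for None instead of wrapping its whole body in one giant try.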
Let us look first at how we can fix point 1.
Comments should be docstrings
# This is the reader function
def read_mixed(fname):
Is better expressed as
def read_mixed(fname):
"""This is the reader function"""
Use proper names and variables
This is not something I can fully help you with, but ideally you should not use overly generic names. A name like read_mixed
tells me nothing. Ideally you should incorporate domain knowledge into your docstrings and names:
def car_data_from_file(file):
    """This function reads in a file containing ..."""
We do not care that the data is mixed; only we as programmers care that it is not a pure CSV. Really, we need to explain more about what the data is than how it is structured.
Remove useless comments
Documenting the code is good! However, the proper way of doing this is not through generic comments.
# Read the meta info
meta_buff = reader.read(1)
The comment adds nothing of value here; it just states the obvious. Comments should mainly be reserved for explaining why, not how. If you have a very terse line, then maybe. But even then, you should really be restructuring the code.
Secondly, comments add overhead: now we need to make sure every comment stays in sync with the code. Again, it is better to express the intent of the code through well-named functions, variables and classes, with proper docstrings.
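A minimal sketch of the principle, applied to the snippet above (metadata_row is just an illustrative rename):

# The comment disappears once the name carries the intent:
metadata_row = reader.read(1)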
You need to figure out why the code raises StopIteration
Is it because it has reached the end of the file? If so, a simple with block that iterates over the file line by line would work just as well. The question is how sure you are that your data is formatted properly. If you want to validate your data, my solution is not the way.
So, without any validation, we can do:
import csv
import pandas as pd

# Digits 0-9 as strings; every CSV data line starts with one of these
DIGITS = set(str(number) for number in range(10))


def car_data_from_file_2(path: str) -> tuple[list[str], pd.DataFrame]:
    csvs, metalist = [], []
    with open(path, "r") as f:
        header = next(f).strip().split(",")
        for line in f:
            if not (stripped := line.rstrip()):
                pass
            elif stripped[0] in DIGITS:
                csvs.append(stripped)
            else:
                metalist.append(stripped)
    df = pd.DataFrame(csv.reader(csvs), columns=header)
    return metalist, df
All this code does is go through each line in the file: if it starts with a digit, it must be part of the CSV; otherwise it is part of the metadata.
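Usage is then a single call; a sketch with a made-up file name:

metadata, df = car_data_from_file_2("car_blocks.csv")
print(metadata[0])  # first raw comment line of the file
print(df.head())    # first few CSV rows as a DataFrame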
Perhaps a better way is as follows
import re
import pandas as pd

# Uses regex to match any line containing the word LINE (the metadata lines)
METADATA = re.compile(r".*LINE.*")


def car_metadata_from_file(filename: str) -> list[str]:
    """Open a file with car data and pull out the metadata lines."""
    with open(filename, "r") as f:
        return METADATA.findall(f.read())


def car_data_from_file(filename: str) -> pd.DataFrame:
    """Open a file with car data and return only the CSV data rows."""
    with open(filename, "r") as f:
        df = pd.read_csv(f)
        # Metadata lines are stripped by looking at the rows whose
        # last column is NaN due to having insufficient columns:
        return df[df[df.columns[-1]].notna()]
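If you still want the metadata as dicts, as in your original code, the raw comment lines are easy to post-process. A sketch, assuming the 'Comment line: ' prefix and ', ' separators are as fixed as they look (parse_metadata is a made-up name):

def parse_metadata(line: str) -> dict[str, str]:
    """Split 'Comment line: LINES=5, CAR1=...' into a dict."""
    fields = line.removeprefix("Comment line: ").split(", ")
    return dict(field.split("=", 1) for field in fields)

metadata = [parse_metadata(line) for line in car_metadata_from_file("car_blocks.csv")]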
However, you need to do some benchmarks to figure out which one is best for you. Also take readability into consideration: document your code well and pick a solution that is readable to you.
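A rough way to compare them, sketched with timeit (the file name is made up):

import timeit

for fn in (car_data_from_file_2, car_data_from_file):
    seconds = timeit.timeit(lambda: fn("car_blocks.csv"), number=10)
    print(f"{fn.__name__}: {seconds / 10:.3f} s per call")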
- \$\begingroup\$ This is really good feedback, thank you! I think I'll take all of these suggestions and write them up, then do the benchmarking on the real files. I may end up skipping the non-validating version, though, because even though every file I've seen has the exact same format, it's not defined in a document (at least, that I can find), so I can't guarantee it won't change. But these are super helpful general tips, much appreciated! \$\endgroup\$ – Ajean, May 2, 2022 at 17:31
- \$\begingroup\$ These versions also allow me to use with; the only reason I had to resort to the try/catch for StopIteration instead is because with barfed when I tried to use it for the TextFileReader. Maybe that's a bug in my version, or something. \$\endgroup\$ – Ajean, May 2, 2022 at 17:36
- \$\begingroup\$ Also, I just realized and made the same comment under the other answer: how might I keep track of the block number with these? In my (admittedly poor) code, I add an 'ibuff' column to the DataFrame from the loop; this was the main reason for making it a loop, but I would love to do it differently. \$\endgroup\$ – Ajean, May 2, 2022 at 17:41
- \$\begingroup\$ @Ajean If you have a list of metadatas (as outlined in my answer), you already have the block number, no? The first metadata would be 1, the second 2, and so forth? \$\endgroup\$ – N3buchadnezzar, May 2, 2022 at 22:30
- \$\begingroup\$ Yes, but I meant included as part of the dataframe as well. Now I'm all good though; I was able to fill it out using the missing indices. Thank you! \$\endgroup\$ – Ajean, May 5, 2022 at 15:59
You're working too hard. You concatenate all of the CSV rows anyway, so LINES= has no value. Just read the CSV and choose an easy way to identify invalid rows, such as rows whose last column is NaN due to having insufficient columns.
Since you're using Pandas, your dictionary format is probably not a good idea. Consider instead forming a second dataframe whose column names are based on the keys. The key-value separation can be done with Pandas in a vectorised manner:
Suggested
from io import StringIO
import pandas as pd
def read_mixed(fname) -> tuple[pd.DataFrame, pd.DataFrame]:
    df = pd.read_csv(fname)               # Comment and data rows are mixed in
    is_data = df[df.columns[-1]].notna()  # Assume that "short" rows are comments, "long" rows are data
    data = df[is_data]                    # All data rows in one frame
    comments = df[~is_data].copy()        # Comment rows are "everything left"; copy to avoid chained-assignment warnings
    comments.dropna(axis='columns', inplace=True)  # Drop NA columns
    first_col = comments.columns[0]       # Name of the first column, disposable
    comments[first_col] = comments[first_col].str.removeprefix('Comment line: ')  # Drop the prefix on all rows
    comments.columns = comments.iloc[0, :].str.split('=', n=1).str[0]  # Set columns to be keys from comment cells
    for col in comments.columns:
        comments[col] = comments[col].str.split('=', n=1).str[1]
    return data, comments
# This is a tester that illustrates the format of the real file(s)
if __name__ == "__main__":
fstr = StringIO("""a1,a2,a3,b1,b2,b3,c1,c2,c3
Comment line: LINES=5, CAR1=Top Fuel, CAR2=Funny, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
Comment line: LINES=5, CAR1=Funny, CAR2=Pro Stock, CAR3=Top Fuel
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
Comment line: LINES=3, CAR1=Funny, CAR2=Top Fuel, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
""")
    output = read_mixed(fstr)
    print(output)
Output
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 12 1 23 5 23.0 5.0 3.0 4.0 5.0
2 12 1 23 5 23.0 5.0 3.0 4.0 5.0
3 12 1 23 5 23.0 5.0 3.0 4.0 5.0
4 12 1 23 5 23.0 5.0 3.0 4.0 5.0
5 12 1 23 5 23.0 5.0 3.0 4.0 5.0
7 24 2 46 10 46.0 10.0 6.0 8.0 10.0
8 24 2 46 10 46.0 10.0 6.0 8.0 10.0
9 24 2 46 10 46.0 10.0 6.0 8.0 10.0
10 24 2 46 10 46.0 10.0 6.0 8.0 10.0
11 24 2 46 10 46.0 10.0 6.0 8.0 10.0
13 12 1 23 5 23.0 5.0 3.0 4.0 5.0
14 12 1 23 5 23.0 5.0 3.0 4.0 5.0
15 12 1 23 5 23.0 5.0 3.0 4.0 5.0
0 LINES CAR1 CAR2 CAR3
0 5 Top Fuel Funny Pro Stock
6 5 Funny Pro Stock Top Fuel
12 3 Funny Top Fuel Pro Stock
- \$\begingroup\$ You are definitely right in that I'm positive that my current code is not the best (hence posted here, hehe). I guess I couldn't figure out how else to keep track of the block number; that's the only thing that's sort of missing here. I need to keep track of which block each CSV line came from (in my code I add the column 'ibuff' to the DataFrame). But given your version, I might see a way to get that in there; I can run with it a bit unless you have a suggestion :) \$\endgroup\$ – Ajean, May 2, 2022 at 17:39
- \$\begingroup\$ The block number can be derived from the indices in the dataframes shown above. \$\endgroup\$ – Reinderien, May 2, 2022 at 17:49
- \$\begingroup\$ Yep, I figured so; working on using that. Thank you! \$\endgroup\$ – Ajean, May 2, 2022 at 18:02
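For anyone following along, a sketch of that derivation, assuming data and comments are the two frames returned by read_mixed above (the column name ibuff is borrowed from the question): each data row's block number is the number of comment rows that precede it in the file, which numpy's searchsorted gives directly.

import numpy as np

data = data.copy()  # avoid mutating a view of the original frame
# Count how many comment indices fall strictly before each data index
# (1-based block numbers)
data['ibuff'] = np.searchsorted(comments.index.to_numpy(),
                                data.index.to_numpy(), side='right')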