I have a file format that I need to read which contains blocks of CSV data interspersed with non-CSV (or at least, inconsistent CSV) lines. Other than the very first line of the file (which has comma-separated headers), each repeated "block" has one line containing some metadata - including the number of actual CSV lines in that block - followed by the CSV data. The blocks are of varying length within the file. An example with dummy data can be seen at the end of the question.
I have written the following code, which uses the TextFileReader in pandas to parse the metadata line into a dictionary and then read each block. I need the metadata dicts in a list and the CSV data concatenated. This reader works and gives me what I need, but it seems a bit hokey. The non-CSV line is still read as if it were CSV; I can extract the info I need, but it's awkward. I also couldn't figure out how to exit the TextFileReader nicely without simply catching the StopIteration error, and I very much dislike using while True.
Is there a cleaner way to exit using the iterative TextFileReader, and/or any other tidying this code could use?
from io import StringIO
import pandas as pd


# This is the reader function
def read_mixed(fname):
    metalist = []
    dflist = []
    ibuff = 0
    with pd.read_csv(fname, skipinitialspace=True, iterator=True) as reader:
        try:
            while True:
                ibuff += 1
                # Read the meta info
                meta_buff = reader.read(1)
                nlines = int(meta_buff.iloc[0, 0].split('=')[-1])
                tmp_meta = {x.split('=')[0]: x.split('=')[1]
                            for x in meta_buff.iloc[0, 1:].dropna()}
                tmp_meta.update({'LINES': nlines})
                # Read the CSV info
                tmp_df = reader.read(nlines)
                tmp_df['ibuff'] = ibuff
                metalist.append(tmp_meta)
                dflist.append(tmp_df)
        except StopIteration:
            pass
    return metalist, pd.concat(dflist, axis=0).reset_index(drop=True)
# This is a tester that illustrates the format of the real file(s)
if __name__ == "__main__":
fstr = StringIO("""a1,a2,a3,b1,b2,b3,c1,c2,c3
Comment line: LINES=5, CAR1=Top Fuel, CAR2=Funny, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
Comment line: LINES=5, CAR1=Funny, CAR2=Pro Stock, CAR3=Top Fuel
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
Comment line: LINES=3, CAR1=Funny, CAR2=Top Fuel, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
""")
    output = read_mixed(fstr)
    print(output)
- \$\begingroup\$ python3.x? \$\endgroup\$ – Toby Speight, Apr 22, 2022 at 21:16
- \$\begingroup\$ How stringent are your files? E.g. do they largely vary from your example? This regex seems to match your data quite well: regex101.com/r/gO9Nus/2 \$\endgroup\$ – N3buchadnezzar, Apr 22, 2022 at 21:58
- \$\begingroup\$ @N3buchadnezzar The files vary very little from what I've put here, down to the colon in the meta line and the fact that the CSV data are always integers. I see the regex you made; are you suggesting just parsing the whole file using regex? I'd have to see how that compares efficiency-wise: the format is very precise, but the real files are also very large. There are ~3000 of these blocks with ~500 lines each. \$\endgroup\$ – Ajean, Apr 22, 2022 at 22:13
- \$\begingroup\$ @Ajean I am hesitant to post this as an answer, but can you test how well the following code performs on your data? pastebin.com/ShXyQaUH \$\endgroup\$ – N3buchadnezzar, Apr 22, 2022 at 22:32
- \$\begingroup\$ @N3buchadnezzar It looks like that works, in that it pulls out the lines okay. And on my small example file it's actually faster than what I wrote above! I just shut down access where the real files are, so I won't be able to test it on the big ones until Monday. I'll bring it in then and see how it works there. That'd be neat if the solution is so "low tech". \$\endgroup\$ – Ajean, Apr 22, 2022 at 22:48
2 Answers
While Reinderien posted an excellent solution to your problem, let me take a closer look and give you some feedback on your code.
Let us focus on the three biggest things that stand out to me and how to fix them:
- Your code has very low readability. Perhaps you know how it works and what it does, but would a colleague be able to, without you telling them?
- A big red flag is if you have try HUGE BLOCK OF CODE except ...; then you are not really looking for specific errors, just praying "please make it work" (a narrower alternative is sketched after this list).
- Overcomplicated code.
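As a sketch of that narrower alternative, the try can be confined to the single call that is expected to raise (read_metadata_row is a made-up helper name; reader is the TextFileReader from the question):

def read_metadata_row(reader):
    """Return the next raw metadata row, or None at end of file."""
    try:
        # reader.read(1) is the only call here that can raise StopIteration
        return reader.read(1)
    except StopIteration:
        return None

The loop then tests the return value for None instead of wrapping its whole body in one giant try.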
Let us look first at how we can fix point 1.
Comments should be docstrings
# This is the reader function
def read_mixed(fname):
Is better expressed as
def read_mixed(fname):
"""This is the reader function"""
Use proper names and variables
This is not something I can fully help you with, but ideally you should not use overly generic names. A name like read_mixed
tells me nothing. Ideally you should incorporate domain knowledge into your docstrings and names:
def car_data_from_file(file):
    """This function reads in a file containing ..."""
We do not care that the data is mixed; only we as programmers care that it is not a pure CSV. Really, we need to explain more about what the data is than how it is structured.
Remove useless comments
Documenting the code is good! However, the proper way of doing this is not through generic comments.
# Read the meta info
meta_buff = reader.read(1)
The comment adds nothing of value here; it just states the obvious. Comments should mainly be reserved for explaining why, not how. If you have a very terse line, then maybe. But even then, you should really be restructuring the code.
Secondly, comments add overhead: now we need to make sure every comment stays in sync with the code. Again, it is better to express the intent of the code through well-named functions, variables and classes, with proper docstrings.
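A minimal sketch of the principle, applied to the snippet above (metadata_row is just an illustrative rename):

# The comment disappears once the name carries the intent:
metadata_row = reader.read(1)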
You need to figure out why the code raises StopIteration
Is it because it has reached the end of the file? If so, a simple with block that iterates over the file line by line would work just as well. The question is how sure you are that your data is formatted properly. If you want to validate your data, my solution is not the way.
So, without any validation, we can do:
import csv
import pandas as pd

# Digits 0-9 as strings; every CSV data line starts with one of these
DIGITS = set(str(number) for number in range(10))


def car_data_from_file_2(path: str) -> tuple[list[str], pd.DataFrame]:
    csvs, metalist = [], []
    with open(path, "r") as f:
        header = next(f).strip().split(",")
        for line in f:
            if not (stripped := line.rstrip()):
                pass
            elif stripped[0] in DIGITS:
                csvs.append(stripped)
            else:
                metalist.append(stripped)
    df = pd.DataFrame(csv.reader(csvs), columns=header)
    return metalist, df
All this code does is go through each line in the file: if it starts with a digit, it must be part of the CSV; otherwise it is part of the metadata.
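Usage is then a single call; a sketch with a made-up file name:

metadata, df = car_data_from_file_2("car_blocks.csv")
print(metadata[0])  # first raw comment line of the file
print(df.head())    # first few CSV rows as a DataFrame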
Perhaps a better way is as follows
import re
import pandas as pd

# Uses regex to match any line containing the word LINE (the metadata lines)
METADATA = re.compile(r".*LINE.*")


def car_metadata_from_file(filename: str) -> list[str]:
    """Open a file with car data and pull out the metadata lines."""
    with open(filename, "r") as f:
        return METADATA.findall(f.read())


def car_data_from_file(filename: str) -> pd.DataFrame:
    """Open a file with car data and return only the CSV data rows."""
    with open(filename, "r") as f:
        df = pd.read_csv(f)
        # Metadata lines are stripped by looking at the rows whose
        # last column is NaN due to having insufficient columns:
        return df[df[df.columns[-1]].notna()]
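If you still want the metadata as dicts, as in your original code, the raw comment lines are easy to post-process. A sketch, assuming the 'Comment line: ' prefix and ', ' separators are as fixed as they look (parse_metadata is a made-up name):

def parse_metadata(line: str) -> dict[str, str]:
    """Split 'Comment line: LINES=5, CAR1=...' into a dict."""
    fields = line.removeprefix("Comment line: ").split(", ")
    return dict(field.split("=", 1) for field in fields)

metadata = [parse_metadata(line) for line in car_metadata_from_file("car_blocks.csv")]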
However, you need to do some benchmarks to figure out which one is best for you. Also take readability into consideration: document your code well and pick a solution that is readable to you.
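A rough way to compare them, sketched with timeit (the file name is made up):

import timeit

for fn in (car_data_from_file_2, car_data_from_file):
    seconds = timeit.timeit(lambda: fn("car_blocks.csv"), number=10)
    print(f"{fn.__name__}: {seconds / 10:.3f} s per call")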
- \$\begingroup\$ This is really good feedback, thank you! I think I'll take all of these suggestions and write them up, then do the benchmarking on the real files. I may end up skipping the non-validating version, though, because even though every file I've seen has the exact same format, it's not defined in a document (at least, that I can find), so I can't guarantee it won't change. But these are super helpful general tips, much appreciated! \$\endgroup\$ – Ajean, May 2, 2022 at 17:31
- \$\begingroup\$ These versions also allow me to use with; the only reason I had to resort to the try/catch for StopIteration instead is because with barfed when I tried to use it for the TextFileReader. Maybe that's a bug in my version, or something. \$\endgroup\$ – Ajean, May 2, 2022 at 17:36
- \$\begingroup\$ Also, I just realized and made the same comment under the other answer: how might I keep track of the block number with these? In my (admittedly poor) code, I add an 'ibuff' column to the DataFrame from the loop; this was the main reason for making it a loop, but I would love to do it differently. \$\endgroup\$ – Ajean, May 2, 2022 at 17:41
- \$\begingroup\$ @Ajean If you have a list of metadatas (as outlined in my answer), you already have the block number, no? The first metadata would be 1, the second 2, and so forth? \$\endgroup\$ – N3buchadnezzar, May 2, 2022 at 22:30
- \$\begingroup\$ Yes, but I meant included as part of the dataframe as well. Now I'm all good though; I was able to fill it out using the missing indices. Thank you! \$\endgroup\$ – Ajean, May 5, 2022 at 15:59
You're working too hard. You concatenate all of the CSV rows anyway, so LINES= has no value. Just read the CSV and choose an easy way to identify invalid rows, such as rows whose last column is NaN due to having insufficient columns.
Since you're using Pandas, your dictionary format is probably not a good idea. Consider instead forming a second dataframe whose column names are based on the keys. The key-value separation can be done with Pandas in a vectorised manner:
Suggested
from io import StringIO
import pandas as pd
def read_mixed(fname) -> tuple[pd.DataFrame, pd.DataFrame]:
    df = pd.read_csv(fname)               # Comment and data rows are mixed in
    is_data = df[df.columns[-1]].notna()  # Assume that "short" rows are comments, "long" rows are data
    data = df[is_data]                    # All data rows in one frame
    comments = df[~is_data].copy()        # Comment rows are "everything left"; copy to avoid chained-assignment warnings
    comments.dropna(axis='columns', inplace=True)  # Drop NA columns
    first_col = comments.columns[0]       # Name of the first column, disposable
    comments[first_col] = comments[first_col].str.removeprefix('Comment line: ')  # Drop the prefix on all rows
    comments.columns = comments.iloc[0, :].str.split('=', n=1).str[0]  # Set columns to be keys from comment cells
    for col in comments.columns:
        comments[col] = comments[col].str.split('=', n=1).str[1]
    return data, comments
# This is a tester that illustrates the format of the real file(s)
if __name__ == "__main__":
fstr = StringIO("""a1,a2,a3,b1,b2,b3,c1,c2,c3
Comment line: LINES=5, CAR1=Top Fuel, CAR2=Funny, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
Comment line: LINES=5, CAR1=Funny, CAR2=Pro Stock, CAR3=Top Fuel
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
24,2,46,10,46,10,6,8,10
Comment line: LINES=3, CAR1=Funny, CAR2=Top Fuel, CAR3=Pro Stock
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
12,1,23,5,23,5,3,4,5
""")
    output = read_mixed(fstr)
    print(output)
Output
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 12 1 23 5 23.0 5.0 3.0 4.0 5.0
2 12 1 23 5 23.0 5.0 3.0 4.0 5.0
3 12 1 23 5 23.0 5.0 3.0 4.0 5.0
4 12 1 23 5 23.0 5.0 3.0 4.0 5.0
5 12 1 23 5 23.0 5.0 3.0 4.0 5.0
7 24 2 46 10 46.0 10.0 6.0 8.0 10.0
8 24 2 46 10 46.0 10.0 6.0 8.0 10.0
9 24 2 46 10 46.0 10.0 6.0 8.0 10.0
10 24 2 46 10 46.0 10.0 6.0 8.0 10.0
11 24 2 46 10 46.0 10.0 6.0 8.0 10.0
13 12 1 23 5 23.0 5.0 3.0 4.0 5.0
14 12 1 23 5 23.0 5.0 3.0 4.0 5.0
15 12 1 23 5 23.0 5.0 3.0 4.0 5.0
0 LINES CAR1 CAR2 CAR3
0 5 Top Fuel Funny Pro Stock
6 5 Funny Pro Stock Top Fuel
12 3 Funny Top Fuel Pro Stock
- \$\begingroup\$ You are definitely right in that I'm positive that my current code is not the best (hence posted here, hehe). I guess I couldn't figure out how else to keep track of the block number; that's the only thing that's sort of missing here. I need to keep track of which block each CSV line came from (in my code I add the column 'ibuff' to the DataFrame). But given your version, I might see a way to get that in there; I can run with it a bit unless you have a suggestion :) \$\endgroup\$ – Ajean, May 2, 2022 at 17:39
- \$\begingroup\$ The block number can be derived from the indices in the dataframes shown above. \$\endgroup\$ – Reinderien, May 2, 2022 at 17:49
- \$\begingroup\$ Yep, I figured so; working on using that. Thank you! \$\endgroup\$ – Ajean, May 2, 2022 at 18:02
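For anyone following along, a sketch of that derivation, assuming data and comments are the two frames returned by read_mixed above (the column name ibuff is borrowed from the question): each data row's block number is the number of comment rows that precede it in the file, which numpy's searchsorted gives directly.

import numpy as np

data = data.copy()  # avoid mutating a view of the original frame
# Count how many comment indices fall strictly before each data index
# (1-based block numbers)
data['ibuff'] = np.searchsorted(comments.index.to_numpy(),
                                data.index.to_numpy(), side='right')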