I am reading text files that contain data from observations. The format is neither fixed-width nor delimited, so I built a generator that collects (Key, Value) pairs and yields a dictionary once it has read a full observation record (~75 pairs in my case). The main loop builds a list from these dictionaries and then loads the list into a DataFrame.
This code works, but it is slow, and I suspect that building a long list of dictionaries just to load it into a DataFrame at the end is not optimal.
Side notes:
- I am not a developer and am new to Python.
- I chose to use dictionaries because my data files may contain a different number of observations (Key, Value pairs) for each Record; generally though, this number should be consistent within one data file.
- I do not control the text file format; I have to live with it...
- The text file format vaguely resembles JSON: each Record (enclosed in [ ]) consists of (Key, Value) pairs and nested Sub-Records, both indented by 2 spaces; (Key, Value) pairs within Sub-Records are indented by 4 spaces and also enclosed in [ ]. Keys and Values are separated by [space]:[space]. Values can span multiple lines, in which case the line ends with 4 spaces and the following lines are indented by more than 4 spaces; but this can be broken if the Value contains a LF character. A small schematic follows.
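To illustrate those rules, here is a hypothetical schematic assembled from the description above (not a real record; the trailing 4 spaces on the continued line are invisible here):
Observer_Report : [
  Version : "5.0"
  Line_Report : [
    Filter_Type : 8N MIN
    Dead_Seis_Channels : 586:333(2152-2154)
      658:384(12979-12981)
  ]
]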
Is there a more efficient way to load my (Key, Value) dictionaries into my DataFrame?
import glob
import pandas as pd

raw_ext = '.raw'
raw_path = input('RAW Files Folder path: ')

OBREP = '=' * 15
SBREP = '[\n'
KVSEP = ' : '
VCONT = '    \n'
VENDS = '  \n'


def get_rec_dict(file):
    recs = {}
    f = False
    for line in file:
        if OBREP in line:
            f = True
        if KVSEP in line and not line.endswith(SBREP):
            vlist = line.split(KVSEP)
            k = vlist.pop(0).strip()
            if line.endswith(VCONT):
                for line in file:
                    vlist.append(line.strip())
                    if not line.endswith(VCONT):
                        break
            if k == 'User_Header':
                for line in file:
                    if line == VENDS:
                        break
                    else:
                        vlist.append(line.strip())
            v = '\n'.join(val.strip() for val in vlist)
            recs[k] = v
        if f:
            if recs:
                yield recs
            f = False


file_no = 0
raw_files = glob.glob('{0}*{1}'.format(raw_path, raw_ext))
rec_list = []
for raw in raw_files:
    with open(raw, 'r', encoding='latin-1') as infile:
        for rec_dict in get_rec_dict(infile):
            rec_list.append(rec_dict)
    file_no += 1
df = pd.DataFrame(rec_list)
if file_no > 0:
    print('{} RAW files loaded.'.format(file_no))
else:
    print('No file found.')
RAW file sample:
This sample contains 2 observation sets:
Obs_Report_Result : [
# ===== (1) =====
Observer_Report : [
# ===============
  Version : "5.0"
  Exploitation_Mode : NORMAL
  Line_Report : [
    Filter_Type : 8N MIN
    Aux_Nb_Trace : 1
    Seis_Nb_Trace : 16674
    Total_Nb_Trace : 16675
    Nb_Of_Dead_Seis_Channels : 9
    Nb_Of_Live_Seis_Channels : 16665
    Dead_Seis_Channels : 586:333(2152-2154) 658:384(12979-12981) 662:306(13345-13347)
    Live_Seis_Channels : 574:216-415(1-600) 578:216-415(601-1200) 582:216-415(1201-1800) 586:216-332(1801-2151)334-415(2155-2400) 590:216-415(2401-3000) 594:216-415(3001-3600) 598:216-415(3601-4200) 602:216-415(4201-4800) 606:216-415(4801-5400) 610:216-415(5401-6000) 614:216-415(6001-6600) 618:216-415(6601-7200) 622:216-415(7201-7800) 626:216-415(7801-8400) 630:216-279(8401-8592)291-415(8593-8967) 634:216-280(8968-9162)292-415(9163-9534) 638:216-280(9535-9729)291-415(9730-10104) 642:216-280(10105-10299)291-415(10300-10674) 646:216-415(10675-11274) 650:216-415(11275-11874) 654:216-415(11875-12474) 658:216-383(12475-12978)385-415(12982-13074) 662:216-305(13075-13344)307-415(13348-13674) 666:216-415(13675-14274) 670:216-415(14275-14874) 674:216-415(14875-15474) 678:216-415(15475-16074) 682:216-415(16075-16674)
    SFL : 574
    SFN : 216
    Spread_Nb : 1090
    Spread_Type : ABSOLUTE
    Acq_Error :
    ITB : FALSE
  ]
  Shot_Report : [
    Swath_Name : TDG_South
    Swath_ID : -2147483648
    Shot_Nb : 2448
    Line_Name : 317.0
    Point_Number : 362.0
    Point_Index : 1
    Acq_Length : 16500 # (msec)
    Sweep_Length : 0 # (ms)
    Pilot_Length : 0 # (ms)
    Record_Length : 16500 # (ms)
    Sample_Rate : 1000
    Total_Nb_Sample : 16501
    Type_Of_Source : EXPLO
    Source_Nb : 11
    Tb_Window : 2500
    Date : Sun Feb 17 18:12:04 2015
    Julian_Day : 1
    Cog_State : NO COG
    Cog_Easting : N/A
    Cog_Northing : N/A
    Cog_Elevation : 0.0
    Cog_Deviation : 0.0
    Uphole_Time : 0.00 # (msec)
  ]
  Noise_Report : [
    Noise_Elim_type : NE OFF
    Thres_Hold_Var : N/A
    Hist_Editing_Type : N/A
    Hist_Range : N/A # (dB)
    Hist_Taper_Length : N/A # (power)
    Hist_Thres_Init_Val : N/A # (dB)
    Hist_Zeroing_Length : N/A # (msec)
    Low_Trace_Value : 0 # (dB)
    Low_Trace_Percent : 0
    Noisy_Trace_Percent : N/A
    Low_Noisy_Verbose :
    Nb_Of_Window : 0
  ]
  Process_Report : [
    Type_Of_Process : IMPULSIVE
    Acq_Nb : 1
    Correl_Pilot_Nb : 0
    Auto_Cor_Peak_Time : 0
    Dump_Stacking_Fold : 1
    Max_Of_Max_Aux_Char : " -7.929688e+01"
    Max_Of_Max_Seis_Char : " 1.088968e+06"
    Max_Time_Value_Verbose : ""
  ]
  Record_Report : [
    File_Nb : 12221
    Type_Of_Dump : DUMP
    Type_Of_Test : N/A 3
    Tape_Nb : 36
    Tape_Label : "TD South"
    Record_Type : NORMAL
    Blocking_Mode : FALSE
    Device_Bypass : FALSE
    Tape_Error_Text : ""
    Tape_Time : "Sun Feb 17 18:13:03 2015 "
    File_Count : "17
    File_Per_Tape : "2000"
  ]
  Comment : "N/A"
  User_Header : "*SGD-S SP#2448/SL#317.0/SN#362.0/SI#1/SEQ#11/STA:1/CTB:00.000/UH:000.0
    ICIS #105. Hits: 6. Single Hit Rec: 2.0s. Total Rec Length: 16.5s.
    NMEA: 5717.5386,N,11201.3849,W,+00408.3,M,1,06,07.2,000.04,270.0
    TB=02182013,011205.6520652
    Hit=02182013,011206.2430453 HP= 63PSI
    Hit=02182013,011208.7418981 HP= 83PSI
    Hit=02182013,011211.2414192 HP= 64PSI
    Hit=02182013,011213.7418408 HP= 79PSI
    Hit=02182013,011216.2420402 HP= 90PSI
    Hit=02182013,011218.7414871 HP= 71PSI
    Acquisition Complete.
  "
]
# ===== (2) =====
Observer_Report : [
# ===============
  Version : "5.0"
  Exploitation_Mode : NORMAL
  Line_Report : [
    Filter_Type : 8N MIN
    Aux_Nb_Trace : 1
    Seis_Nb_Trace : 16674
    Total_Nb_Trace : 16675
    Nb_Of_Dead_Seis_Channels : 9
    Nb_Of_Live_Seis_Channels : 16665
    Dead_Seis_Channels : 586:333(2152-2154) 658:384(12979-12981) 662:306(13345-13347)
    Live_Seis_Channels : 574:216-415(1-600) 578:216-415(601-1200) 582:216-415(1201-1800) 586:216-332(1801-2151)334-415(2155-2400) 590:216-415(2401-3000) 594:216-415(3001-3600) 598:216-415(3601-4200) 602:216-415(4201-4800) 606:216-415(4801-5400) 610:216-415(5401-6000) 614:216-415(6001-6600) 618:216-415(6601-7200) 622:216-415(7201-7800) 626:216-415(7801-8400) 630:216-279(8401-8592)291-415(8593-8967) 634:216-280(8968-9162)292-415(9163-9534) 638:216-280(9535-9729)291-415(9730-10104) 642:216-280(10105-10299)291-415(10300-10674) 646:216-415(10675-11274) 650:216-415(11275-11874) 654:216-415(11875-12474) 658:216-383(12475-12978)385-415(12982-13074) 662:216-305(13075-13344)307-415(13348-13674) 666:216-415(13675-14274) 670:216-415(14275-14874) 674:216-415(14875-15474) 678:216-415(15475-16074) 682:216-415(16075-16674)
    SFL : 574
    SFN : 216
    Spread_Nb : 1090
    Spread_Type : ABSOLUTE
    Acq_Error :
    ITB : FALSE
  ]
  Shot_Report : [
    Swath_Name : TD_South
    Swath_ID : -2147483648
    Shot_Nb : 2448
    Line_Name : 317.0
    Point_Number : 362.0
    Point_Index : 1
    Acq_Length : 16500 # (msec)
    Sweep_Length : 0 # (ms)
    Pilot_Length : 0 # (ms)
    Record_Length : 16500 # (ms)
    Sample_Rate : 1000
    Total_Nb_Sample : 16501
    Type_Of_Source : EXPLO
    Source_Nb : 11
    Tb_Window : 2500
    Date : Sun Feb 17 18:12:04 2015
    Julian_Day : 1
    Cog_State : NO COG
    Cog_Easting : N/A
    Cog_Northing : N/A
    Cog_Elevation : 0.0
    Cog_Deviation : 0.0
    Uphole_Time : 0.00 # (msec)
  ]
  Noise_Report : [
    Noise_Elim_type : NE OFF
    Thres_Hold_Var : N/A
    Hist_Editing_Type : N/A
    Hist_Range : N/A # (dB)
    Hist_Taper_Length : N/A # (power)
    Hist_Thres_Init_Val : N/A # (dB)
    Hist_Zeroing_Length : N/A # (msec)
    Low_Trace_Value : 0 # (dB)
    Low_Trace_Percent : 0
    Noisy_Trace_Percent : N/A
    Low_Noisy_Verbose :
    Nb_Of_Window : 0
  ]
  Process_Report : [
    Type_Of_Process : IMPULSIVE
    Acq_Nb : 1
    Correl_Pilot_Nb : 0
    Auto_Cor_Peak_Time : 0
    Dump_Stacking_Fold : 1
    Max_Of_Max_Aux_Char : " -7.929688e+01"
    Max_Of_Max_Seis_Char : " 1.088968e+06"
    Max_Time_Value_Verbose : ""
  ]
  Record_Report : [
    File_Nb : 12221
    Type_Of_Dump : DUMP
    Type_Of_Test : N/A 3
    Tape_Nb : 36
    Tape_Label : "TDG South"
    Record_Type : NORMAL
    Blocking_Mode : FALSE
    Device_Bypass : FALSE
    Tape_Error_Text : ""
    Tape_Time : "Sun Feb 17 18:13:08 2015 "
    File_Count : "17
    File_Per_Tape : "2000"
  ]
  Comment : "N/A"
  User_Header : "*SGD-S SP#2448/SL#317.0/SN#362.0/SI#1/SEQ#11/STA:1/CTB:00.000/UH:000.0
    ICIS #105. Hits: 6. Single Hit Rec: 2.0s. Total Rec Length: 16.5s.
    NMEA: 5717.5386,N,11201.3849,W,+00408.3,M,1,06,07.2,000.04,270.0
    TB=02182013,011205.6520652
    Hit=02182013,011206.2430453 HP= 63PSI
    Hit=02182013,011208.7418981 HP= 83PSI
    Hit=02182013,011211.2414192 HP= 64PSI
    Hit=02182013,011213.7418408 HP= 79PSI
    Hit=02182013,011216.2420402 HP= 90PSI
    Hit=02182013,011218.7414871 HP= 71PSI
    Acquisition Complete.
  "
]
3 Answers
I will extend on both my comment and @SuperBiasedMan's answer.
Bugs?
To start with, I still believe that your code produces one dictionary fewer than there are records in each file, at least with the given input. If you rely on finding # =============== to yield the record that was just parsed, you will never yield the final one. Instead, I'd rather use the fact that records are always ordered in the same way and that 'User_Header' is the last field: you can thus yield your result right after parsing that specific field.
A second thing to note is that you are always using, overwriting and yielding the same dictionary. Thus you actually end up with \$n\$ references to the last record parsed. Let me show you why:
>>> def test():
...     recs = {}
...     yield recs
...     recs['one'] = 1
...     yield recs
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1, 'two': 2}, {'one': 1, 'two': 2}, {'one': 1, 'two': 2}]
You're basically doing the same thing, except you do it in a for loop so it is less visible. You need to switch to a fresh dictionary after each yield so your data is not overwritten.
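A minimal sketch of the fix, on the same toy example: rebind the name to a new dictionary after each yield, so the object yielded earlier is left untouched.
>>> def test_fixed():
...     recs = {}
...     recs['one'] = 1
...     yield recs
...     recs = {}  # fresh dict; the previously yielded one is untouched
...     recs['two'] = 2
...     yield recs
...
>>> list(test_fixed())
[{'one': 1}, {'two': 2}]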
Expand your generators
Generating data with the yield keyword can help reduce the memory footprint and increase the overall efficiency of your program, so let's take this approach further. Python has a yield from syntax that lets you "chain" iterators, meaning we can wrap a generator into another one and yield the same elements without added overhead. For instance:
def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from get_rec_dict(f)
Going one level further, we can wrap this into the iteration over the glob results:
def parse_directory(files):
    for filename in files:
        yield from parse_data(filename)
This lets you build your final rec_list using only list(parse_directory(get_raw_files())).
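Spelled out (reusing the get_raw_files helper suggested in the other answer), the whole pipeline stays lazy until the final list call:
# Files are opened, drained and closed one at a time while list() consumes the chain.
rec_list = list(parse_directory(get_raw_files()))
df = pd.DataFrame(rec_list)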
Use EAFP
Looking at your input file, you expect many more lines that can be split on ' : ' than lines that can't. In such cases, the EAFP (Easier to Ask Forgiveness than Permission) approach is recommended: split the line anyway, possibly producing a 1-element list, and try to unpack two elements regardless. If you fail (and you will, but not often), handle the exception that arises, knowing that you should have skipped this line.
Combine that with mapping strip over each part of the split, and you might end up with something like:
try:
    key, value = map(str.strip, line.split(KVSEP))
except ValueError:
    # Not enough values to unpack
    continue
else:
    recs[key] = value
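A further refinement, in case a Value itself contains ' : ': with a plain split the unpacking would then fail with too many values rather than too few, and the pair would be silently dropped. Capping the split keeps such lines:
try:
    # maxsplit=1: only the first ' : ' separates the Key from the Value
    key, value = map(str.strip, line.split(KVSEP, 1))
except ValueError:
    # no separator at all on this line: skip it
    continue
else:
    recs[key] = value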
Proposed improvements
Putting a bit of variable renaming into the equation, as well as using an if __name__ == '__main__' guard to protect the top-level code:
import os.path
import glob
import pandas as pd

IGNORED = '[\n'
SEPARATOR = ' : '
MULTILINE = '    \n'
END_OF_RECORDS = '  \n'


def get_files(ext='.raw'):
    path = input('RAW Files Folder path: ')
    pattern = os.path.join(path, '*{}'.format(ext))
    return glob.glob(pattern)


def parse_records(file):
    records = {}
    for line in file:
        if line.endswith(IGNORED):
            continue
        try:
            key, value = map(str.strip, line.split(SEPARATOR))
        except ValueError:
            # Not enough values to unpack
            continue
        else:
            records[key] = value
        if line.endswith(MULTILINE):
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if not line.endswith(MULTILINE):
                    break
            records[key] = '\n'.join(multiline_value)
        if key == 'User_Header':
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if line.endswith(END_OF_RECORDS):
                    break
            records[key] = '\n'.join(multiline_value)
            yield records
            records = {}


def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from parse_records(f)


def parse_files(file_paths):
    for filename in file_paths:
        yield from parse_data(filename)


if __name__ == '__main__':
    files = get_files()
    records = list(parse_files(files))
    if files:
        print('Processed', len(files), 'files and found', len(records), 'records')
    else:
        print('No files found.')
    df = pd.DataFrame(records)
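A note on the DataFrame part of the question: building pd.DataFrame(records) once from the complete list of dictionaries is already the efficient pattern; pandas infers the columns from the union of the keys and fills any missing values with NaN. What you want to avoid is growing a DataFrame row by row with repeated appends, which copies data each time. In other words, the list of dictionaries is unlikely to be your bottleneck; the parsing is.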
- I think you're right that the same record gets parsed over and over even if I don't really understand why at the moment. And I am also missing one record. I'll take a deeper look at your suggestions. (YeO, Jan 19, 2016 at 7:13)
I think you should refactor things a bit. raw_ext should be RAW_EXT, to clarify that it's a constant. Then raw_path should be down with raw_files, to make it clear that it's a user-defined value. I might even make a function for get_raw_files(), like this:
def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
This makes it easier to test individual parts of your code, and update them too if you realise that you need to test the user's input, for example like this:
def get_raw_files():
    while True:
        raw_path = input('RAW Files Folder path: ')
        if os.path.isdir(raw_path):
            break
        print('Path not found, please check that it exists')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
get_rec_dict is a very long function; separating it into individual tasks would make it much more readable.
It's good that you've clearly marked some strings as constants, but the names are terribly unclear. They're clearly shortened words, but that makes me unable to gather what they're supposed to mean. Sure, KVSEP is probably a separator, but of what? Try to make them clearer, even if it involves longer, more verbose lines.
In get_rec_dict, your f value seems pointless. Wouldn't it just be the same to end your loop with if OBREP in line? line is never modified in your loop, so it'll give the same result whether you test at the start or at the end of an iteration. If there is a real difference between what you have and what I suggest, then it's hacky and unclear, and you should clarify it with a comment. You should also combine if tests, rather than nesting them:
if OBREP in line and recs:
    yield recs
You can use map to run a function on every value of a list, so instead of this:
v = '\n'.join(val.strip() for val in vlist)
you can do this:
v = '\n'.join(map(str.strip, vlist))
It's a little better performance-wise, and easier to read.
I may be wrong, but I believe you could just call list(get_rec_dict(infile)) directly, rather than looping over the result and appending each value. Your loop doesn't allow breaking or catch errors, so I can't see any difference with it apart from inefficiency.
rec_list += list(get_rec_dict(infile))
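(A hedged aside, not part of the original suggestion: rec_list.extend(get_rec_dict(infile)) is equivalent and avoids materializing the intermediate list that list() builds before the concatenation.)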
It also seems silly to have file_no when all you care about is whether the raw_files list is empty. Use Python's truthiness for this test instead: an empty list evaluates as False, while a list that contains elements evaluates as True.
if raw_files:
    print('{} RAW files loaded.'.format(len(raw_files)))
else:
    print('No file found.')
- Thanks for all the suggestions! I will update the code shortly. You are correct about the constants: KVSEP (renamed KV_SEPARATOR) identifies the separation between Key and Value; VCONT (renamed V_CONTINUE) is a sign that the Value spans several lines. Unfortunately, this is not 100% reliable, as you can see with User_Header. VENDS (renamed V_TERMINATED) indicates the Value has been fully read at the previous line. The intent of the flag f was to know when to yield the dictionary (my logic was: when the dictionary is not empty and we are starting to read a new set of Observations). (YeO, Jan 18, 2016 at 15:27)
- Your two suggestions, v = '\n'.join(map(str.strip, vlist)) and rec_list += list(get_rec_dict(infile)), do seem to improve the performance overall, so I have integrated them as well. (YeO, Jan 18, 2016 at 15:41)
- @YeO Great! Glad to help. When the code is updated you can post a new question to ask for more feedback if you'd like. If you do, please look into the strange behaviour Mathias mentioned in the comments above, as code on CR should always be fully working as intended. (SuperBiasedMan, Jan 18, 2016 at 15:45)
- I finally understand the unnecessary complexity of the f flag, and your suggestion is spot on! Much smarter to test at the end, and the nesting becomes unnecessary as well. Thanks. (YeO, Jan 18, 2016 at 16:37)
- @YeO Ah, glad to hear it. I was starting to think I'd misread it somehow haha. (SuperBiasedMan, Jan 18, 2016 at 17:06)
Here's the edited (and fixed) code after the improvements suggested by @SuperBiasedMan, taking into consideration @Mathias Ettinger's comments. My code was indeed broken and was only returning the same record over and over.
After some more tests, I reverted to the for loop to build the records list, as it seems to be slightly faster; I have kept the suggested alternative as a comment, for reference.
To be noted: @Mathias Ettinger's code is faster. :-)
import glob
import pandas as pd

RAW_EXT = '.raw'
OBS_REPORT = '=' * 15  # identifies a set of Observations (Observer Report)
SUB_REPORT = '[\n'     # identifies a Sub-Report within the main set
KV_SEPARATOR = ' : '   # the Key-Value separator
V_CONTINUE = '    \n'  # if the line ends with four spaces, the Value continues on the next line
V_TERMINED = '  \n'    # if the line is 2 spaces and LF, we got to the end of the Value


def get_rec_dict(file):
    recs = {}
    for line in file:
        # if KV_SEPARATOR is found and the line is not a Sub-Report header,
        # then we have a Key and the start of a Value
        if KV_SEPARATOR in line and not line.endswith(SUB_REPORT):
            vlist = line.split(KV_SEPARATOR)  # the Key is to the left of the separator
            k = vlist.pop(0).strip()
            if line.endswith(V_CONTINUE):
                for line in file:
                    # add all lines ending with 4 spaces to the Value
                    vlist.append(line.strip())
                    if not line.endswith(V_CONTINUE):
                        break
            # User_Header may not use the 4 spaces to indicate multi-line,
            # so we read until we are sure the Value is all captured
            if k == 'User_Header':
                for line in file:
                    if line == V_TERMINED:
                        break  # a V_TERMINED line means we already got all the Value
                    else:
                        vlist.append(line.strip())
            ## recs[k] = '\n'.join(val.strip() for val in vlist) was slower
            recs[k] = '\n'.join(map(str.strip, vlist))
            if k == 'User_Header':
                yield recs  # we yield the record after having stored User_Header, the last field
                recs = {}   # fresh dictionary so the yielded record is not overwritten


def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    if not raw_path.endswith('\\'):
        raw_path = raw_path + '\\'
    return glob.glob('{0}*{1}'.format(raw_path, RAW_EXT))


rec_list = []
raw_files = get_raw_files()
# Main loop
for raw in raw_files:
    with open(raw, 'r', encoding='latin-1') as infile:
        for rec_dict in get_rec_dict(infile):
            rec_list.append(rec_dict)
        ## rec_list += list(get_rec_dict(infile))
df = pd.DataFrame(rec_list)
if raw_files:
    print('{} RAW files loaded.'.format(len(raw_files)))
else:
    print('No file found.')
Comments:
- ... VENDS after the closing " of User_Header) I only get 1 sample dictionary yielded. What happens is that f = True will execute when reaching the first # ===============, process nothing as there is no ' : ' in this line, and reach if f:; then turn f to False without yielding anything. The first observation is then parsed and, upon reaching the second # ===============, is yielded. The second observation is parsed but not yielded since there is no # =============== any more.
- ... with open('<path_to_file>') as f: print(len(list(get_rec_dict(f)))) to print 2. I get 0 with the original sample and 1 after adding two spaces after the closing " of User_Header.
- ... data.split('\n'), does that make any difference for you?
- ... 'Line_Report': '[' or 'Observer_Report': '[' that are getting into it due to the split removing the line ending that should be present for SBREP, VCONT or VENDS.