I am reading text files that contain data from observations. The format is neither fixed-width nor delimited, so I built a generator that collects (Key, Value) pairs and yields a dictionary once it has read a full observation record (~75 pairs in my case). The main loop builds a list from these dictionaries and then loads the list into a DataFrame.
This code works, but it is slow, and I suspect that building a long list of dictionaries just to load it into a DataFrame at the end is not optimal.
Side notes:
- I am not a developer and am new to Python.
- I chose to use dictionaries because my data files may contain a different number of observations (Key, Value pairs) for each Record; generally though, this number should be consistent within one data file.
- I do not control the text file format; I have to live with it...
- The text file format vaguely resembles JSON: each Record (enclosed in [ ]) consists of (Key, Value) pairs and nested Sub-Records, both indented by 2 spaces; (Key, Value) pairs within Sub-Records are indented by 4 spaces and also enclosed in [ ]. Keys and Values are separated by [space]:[space]. Values can span multiple lines, in which case the line ends with 4 spaces and the following lines are indented by more than 4 spaces; but this can be broken if the Value contains a LF character. A small schematic follows.
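To illustrate those rules, here is a hypothetical schematic assembled from the description above (not a real record; the trailing 4 spaces on the continued line are invisible here):
Observer_Report : [
  Version : "5.0"
  Line_Report : [
    Filter_Type : 8N MIN
    Dead_Seis_Channels : 586:333(2152-2154)
      658:384(12979-12981)
  ]
]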
Is there a more efficient way to load my (Key, Value) dictionaries into my DataFrame?
import glob
import pandas as pd

raw_ext = '.raw'
raw_path = input('RAW Files Folder path: ')

OBREP = '=' * 15
SBREP = '[\n'
KVSEP = ' : '
VCONT = '    \n'
VENDS = '  \n'


def get_rec_dict(file):
    recs = {}
    f = False
    for line in file:
        if OBREP in line:
            f = True
        if KVSEP in line and not line.endswith(SBREP):
            vlist = line.split(KVSEP)
            k = vlist.pop(0).strip()
            if line.endswith(VCONT):
                for line in file:
                    vlist.append(line.strip())
                    if not line.endswith(VCONT):
                        break
            if k == 'User_Header':
                for line in file:
                    if line == VENDS:
                        break
                    else:
                        vlist.append(line.strip())
            v = '\n'.join(val.strip() for val in vlist)
            recs[k] = v
        if f:
            if recs:
                yield recs
            f = False


file_no = 0
raw_files = glob.glob('{0}*{1}'.format(raw_path, raw_ext))
rec_list = []
for raw in raw_files:
    with open(raw, 'r', encoding='latin-1') as infile:
        for rec_dict in get_rec_dict(infile):
            rec_list.append(rec_dict)
    file_no += 1
df = pd.DataFrame(rec_list)
if file_no > 0:
    print('{} RAW files loaded.'.format(file_no))
else:
    print('No file found.')
RAW file sample:
This sample contains 2 observation sets:
Obs_Report_Result : [
# ===== (1) =====
Observer_Report : [
# ===============
  Version : "5.0"
  Exploitation_Mode : NORMAL
  Line_Report : [
    Filter_Type : 8N MIN
    Aux_Nb_Trace : 1
    Seis_Nb_Trace : 16674
    Total_Nb_Trace : 16675
    Nb_Of_Dead_Seis_Channels : 9
    Nb_Of_Live_Seis_Channels : 16665
    Dead_Seis_Channels : 586:333(2152-2154) 658:384(12979-12981) 662:306(13345-13347)
    Live_Seis_Channels : 574:216-415(1-600) 578:216-415(601-1200) 582:216-415(1201-1800) 586:216-332(1801-2151)334-415(2155-2400) 590:216-415(2401-3000) 594:216-415(3001-3600) 598:216-415(3601-4200) 602:216-415(4201-4800) 606:216-415(4801-5400) 610:216-415(5401-6000) 614:216-415(6001-6600) 618:216-415(6601-7200) 622:216-415(7201-7800) 626:216-415(7801-8400) 630:216-279(8401-8592)291-415(8593-8967) 634:216-280(8968-9162)292-415(9163-9534) 638:216-280(9535-9729)291-415(9730-10104) 642:216-280(10105-10299)291-415(10300-10674) 646:216-415(10675-11274) 650:216-415(11275-11874) 654:216-415(11875-12474) 658:216-383(12475-12978)385-415(12982-13074) 662:216-305(13075-13344)307-415(13348-13674) 666:216-415(13675-14274) 670:216-415(14275-14874) 674:216-415(14875-15474) 678:216-415(15475-16074) 682:216-415(16075-16674)
    SFL : 574
    SFN : 216
    Spread_Nb : 1090
    Spread_Type : ABSOLUTE
    Acq_Error :
    ITB : FALSE
  ]
  Shot_Report : [
    Swath_Name : TDG_South
    Swath_ID : -2147483648
    Shot_Nb : 2448
    Line_Name : 317.0
    Point_Number : 362.0
    Point_Index : 1
    Acq_Length : 16500 # (msec)
    Sweep_Length : 0 # (ms)
    Pilot_Length : 0 # (ms)
    Record_Length : 16500 # (ms)
    Sample_Rate : 1000
    Total_Nb_Sample : 16501
    Type_Of_Source : EXPLO
    Source_Nb : 11
    Tb_Window : 2500
    Date : Sun Feb 17 18:12:04 2015
    Julian_Day : 1
    Cog_State : NO COG
    Cog_Easting : N/A
    Cog_Northing : N/A
    Cog_Elevation : 0.0
    Cog_Deviation : 0.0
    Uphole_Time : 0.00 # (msec)
  ]
  Noise_Report : [
    Noise_Elim_type : NE OFF
    Thres_Hold_Var : N/A
    Hist_Editing_Type : N/A
    Hist_Range : N/A # (dB)
    Hist_Taper_Length : N/A # (power)
    Hist_Thres_Init_Val : N/A # (dB)
    Hist_Zeroing_Length : N/A # (msec)
    Low_Trace_Value : 0 # (dB)
    Low_Trace_Percent : 0
    Noisy_Trace_Percent : N/A
    Low_Noisy_Verbose :
    Nb_Of_Window : 0
  ]
  Process_Report : [
    Type_Of_Process : IMPULSIVE
    Acq_Nb : 1
    Correl_Pilot_Nb : 0
    Auto_Cor_Peak_Time : 0
    Dump_Stacking_Fold : 1
    Max_Of_Max_Aux_Char : " -7.929688e+01"
    Max_Of_Max_Seis_Char : " 1.088968e+06"
    Max_Time_Value_Verbose : ""
  ]
  Record_Report : [
    File_Nb : 12221
    Type_Of_Dump : DUMP
    Type_Of_Test : N/A 3
    Tape_Nb : 36
    Tape_Label : "TD South"
    Record_Type : NORMAL
    Blocking_Mode : FALSE
    Device_Bypass : FALSE
    Tape_Error_Text : ""
    Tape_Time : "Sun Feb 17 18:13:03 2015 "
    File_Count : "17
    File_Per_Tape : "2000"
  ]
  Comment : "N/A"
  User_Header : "*SGD-S SP#2448/SL#317.0/SN#362.0/SI#1/SEQ#11/STA:1/CTB:00.000/UH:000.0
    ICIS #105. Hits: 6. Single Hit Rec: 2.0s. Total Rec Length: 16.5s.
    NMEA: 5717.5386,N,11201.3849,W,+00408.3,M,1,06,07.2,000.04,270.0
    TB=02182013,011205.6520652
    Hit=02182013,011206.2430453 HP= 63PSI
    Hit=02182013,011208.7418981 HP= 83PSI
    Hit=02182013,011211.2414192 HP= 64PSI
    Hit=02182013,011213.7418408 HP= 79PSI
    Hit=02182013,011216.2420402 HP= 90PSI
    Hit=02182013,011218.7414871 HP= 71PSI
    Acquisition Complete.
  "
]
# ===== (2) =====
Observer_Report : [
# ===============
  Version : "5.0"
  Exploitation_Mode : NORMAL
  Line_Report : [
    Filter_Type : 8N MIN
    Aux_Nb_Trace : 1
    Seis_Nb_Trace : 16674
    Total_Nb_Trace : 16675
    Nb_Of_Dead_Seis_Channels : 9
    Nb_Of_Live_Seis_Channels : 16665
    Dead_Seis_Channels : 586:333(2152-2154) 658:384(12979-12981) 662:306(13345-13347)
    Live_Seis_Channels : 574:216-415(1-600) 578:216-415(601-1200) 582:216-415(1201-1800) 586:216-332(1801-2151)334-415(2155-2400) 590:216-415(2401-3000) 594:216-415(3001-3600) 598:216-415(3601-4200) 602:216-415(4201-4800) 606:216-415(4801-5400) 610:216-415(5401-6000) 614:216-415(6001-6600) 618:216-415(6601-7200) 622:216-415(7201-7800) 626:216-415(7801-8400) 630:216-279(8401-8592)291-415(8593-8967) 634:216-280(8968-9162)292-415(9163-9534) 638:216-280(9535-9729)291-415(9730-10104) 642:216-280(10105-10299)291-415(10300-10674) 646:216-415(10675-11274) 650:216-415(11275-11874) 654:216-415(11875-12474) 658:216-383(12475-12978)385-415(12982-13074) 662:216-305(13075-13344)307-415(13348-13674) 666:216-415(13675-14274) 670:216-415(14275-14874) 674:216-415(14875-15474) 678:216-415(15475-16074) 682:216-415(16075-16674)
    SFL : 574
    SFN : 216
    Spread_Nb : 1090
    Spread_Type : ABSOLUTE
    Acq_Error :
    ITB : FALSE
  ]
  Shot_Report : [
    Swath_Name : TD_South
    Swath_ID : -2147483648
    Shot_Nb : 2448
    Line_Name : 317.0
    Point_Number : 362.0
    Point_Index : 1
    Acq_Length : 16500 # (msec)
    Sweep_Length : 0 # (ms)
    Pilot_Length : 0 # (ms)
    Record_Length : 16500 # (ms)
    Sample_Rate : 1000
    Total_Nb_Sample : 16501
    Type_Of_Source : EXPLO
    Source_Nb : 11
    Tb_Window : 2500
    Date : Sun Feb 17 18:12:04 2015
    Julian_Day : 1
    Cog_State : NO COG
    Cog_Easting : N/A
    Cog_Northing : N/A
    Cog_Elevation : 0.0
    Cog_Deviation : 0.0
    Uphole_Time : 0.00 # (msec)
  ]
  Noise_Report : [
    Noise_Elim_type : NE OFF
    Thres_Hold_Var : N/A
    Hist_Editing_Type : N/A
    Hist_Range : N/A # (dB)
    Hist_Taper_Length : N/A # (power)
    Hist_Thres_Init_Val : N/A # (dB)
    Hist_Zeroing_Length : N/A # (msec)
    Low_Trace_Value : 0 # (dB)
    Low_Trace_Percent : 0
    Noisy_Trace_Percent : N/A
    Low_Noisy_Verbose :
    Nb_Of_Window : 0
  ]
  Process_Report : [
    Type_Of_Process : IMPULSIVE
    Acq_Nb : 1
    Correl_Pilot_Nb : 0
    Auto_Cor_Peak_Time : 0
    Dump_Stacking_Fold : 1
    Max_Of_Max_Aux_Char : " -7.929688e+01"
    Max_Of_Max_Seis_Char : " 1.088968e+06"
    Max_Time_Value_Verbose : ""
  ]
  Record_Report : [
    File_Nb : 12221
    Type_Of_Dump : DUMP
    Type_Of_Test : N/A 3
    Tape_Nb : 36
    Tape_Label : "TDG South"
    Record_Type : NORMAL
    Blocking_Mode : FALSE
    Device_Bypass : FALSE
    Tape_Error_Text : ""
    Tape_Time : "Sun Feb 17 18:13:08 2015 "
    File_Count : "17
    File_Per_Tape : "2000"
  ]
  Comment : "N/A"
  User_Header : "*SGD-S SP#2448/SL#317.0/SN#362.0/SI#1/SEQ#11/STA:1/CTB:00.000/UH:000.0
    ICIS #105. Hits: 6. Single Hit Rec: 2.0s. Total Rec Length: 16.5s.
    NMEA: 5717.5386,N,11201.3849,W,+00408.3,M,1,06,07.2,000.04,270.0
    TB=02182013,011205.6520652
    Hit=02182013,011206.2430453 HP= 63PSI
    Hit=02182013,011208.7418981 HP= 83PSI
    Hit=02182013,011211.2414192 HP= 64PSI
    Hit=02182013,011213.7418408 HP= 79PSI
    Hit=02182013,011216.2420402 HP= 90PSI
    Hit=02182013,011218.7414871 HP= 71PSI
    Acquisition Complete.
  "
]
3 Answers
I will extend on both my comment and @SuperBiasedMan's answer.
Bugs?
To start with, I still believe that your code produces one dictionary fewer than there are records in each file, at least with the given input. If you rely on finding # =============== to yield the record that was just parsed, you will never yield the final one. Instead, I'd rather use the fact that records are always ordered in the same way and that 'User_Header' is the last field: you can thus yield your result right after parsing that specific field.
A second thing to note is that you are always using, overwriting and yielding the same dictionary. Thus you actually end up with \$n\$ references to the last record parsed. Let me show you why:
>>> def test():
...     recs = {}
...     yield recs
...     recs['one'] = 1
...     yield recs
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1, 'two': 2}, {'one': 1, 'two': 2}, {'one': 1, 'two': 2}]
You're basically doing the same thing, except you do it in a for loop so it is less visible. You need to switch to a fresh dictionary after each yield so your data is not overwritten.
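A minimal sketch of the fix, on the same toy example: rebind the name to a new dictionary after each yield, so the object yielded earlier is left untouched.
>>> def test_fixed():
...     recs = {}
...     recs['one'] = 1
...     yield recs
...     recs = {}  # fresh dict; the previously yielded one is untouched
...     recs['two'] = 2
...     yield recs
...
>>> list(test_fixed())
[{'one': 1}, {'two': 2}]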
Expand your generators
Generating data with the yield keyword can help reduce the memory footprint and increase the overall efficiency of your program, so let's take this approach further. Python has a yield from syntax that lets you "chain" iterators, meaning we can wrap a generator into another one and yield the same elements without added overhead. For instance:
def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from get_rec_dict(f)
Going one level further, we can wrap this into the iteration over the glob results:
def parse_directory(files):
    for filename in files:
        yield from parse_data(filename)
This lets you build your final rec_list using only list(parse_directory(get_raw_files())).
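Spelled out (reusing the get_raw_files helper suggested in the other answer), the whole pipeline stays lazy until the final list call:
# Files are opened, drained and closed one at a time while list() consumes the chain.
rec_list = list(parse_directory(get_raw_files()))
df = pd.DataFrame(rec_list)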
Use EAFP
Looking at your input file, you expect many more lines that can be split on ' : ' than lines that can't. In such cases, the EAFP (Easier to Ask Forgiveness than Permission) approach is recommended: split the line anyway, possibly producing a 1-element list, and try to unpack two elements regardless. If you fail (and you will, but not often), handle the exception that arises, knowing that you should have skipped this line.
Combine that with mapping strip over each part of the split, and you might end up with something like:
try:
    key, value = map(str.strip, line.split(KVSEP))
except ValueError:
    # Not enough values to unpack
    continue
else:
    recs[key] = value
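A further refinement, in case a Value itself contains ' : ': with a plain split the unpacking would then fail with too many values rather than too few, and the pair would be silently dropped. Capping the split keeps such lines:
try:
    # maxsplit=1: only the first ' : ' separates the Key from the Value
    key, value = map(str.strip, line.split(KVSEP, 1))
except ValueError:
    # no separator at all on this line: skip it
    continue
else:
    recs[key] = value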
Proposed improvements
Putting a bit of variable renaming into the equation, as well as using an if __name__ == '__main__' guard to protect the top-level code:
import os.path
import glob
import pandas as pd

IGNORED = '[\n'
SEPARATOR = ' : '
MULTILINE = '    \n'
END_OF_RECORDS = '  \n'


def get_files(ext='.raw'):
    path = input('RAW Files Folder path: ')
    pattern = os.path.join(path, '*{}'.format(ext))
    return glob.glob(pattern)


def parse_records(file):
    records = {}
    for line in file:
        if line.endswith(IGNORED):
            continue
        try:
            key, value = map(str.strip, line.split(SEPARATOR))
        except ValueError:
            # Not enough values to unpack
            continue
        else:
            records[key] = value
        if line.endswith(MULTILINE):
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if not line.endswith(MULTILINE):
                    break
            records[key] = '\n'.join(multiline_value)
        if key == 'User_Header':
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if line.endswith(END_OF_RECORDS):
                    break
            records[key] = '\n'.join(multiline_value)
            yield records
            records = {}


def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from parse_records(f)


def parse_files(file_paths):
    for filename in file_paths:
        yield from parse_data(filename)


if __name__ == '__main__':
    files = get_files()
    records = list(parse_files(files))
    if files:
        print('Processed', len(files), 'files and found', len(records), 'records')
    else:
        print('No files found.')
    df = pd.DataFrame(records)
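A note on the DataFrame part of the question: building pd.DataFrame(records) once from the complete list of dictionaries is already the efficient pattern; pandas infers the columns from the union of the keys and fills any missing values with NaN. What you want to avoid is growing a DataFrame row by row with repeated appends, which copies data each time. In other words, the list of dictionaries is unlikely to be your bottleneck; the parsing is.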
- I think you're right that the same record gets parsed over and over even if I don't really understand why at the moment. And I am also missing one record. I'll take a deeper look at your suggestions. (YeO, Jan 19, 2016 at 7:13)
I think you should refactor things a bit. raw_ext should be RAW_EXT, to clarify that it's a constant. Then raw_path should be down with raw_files, to make it clear that it's a user-defined value. I might even make a function for get_raw_files(), like this:
def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
This makes it easier to test individual parts of your code, and update them too if you realise that you need to test the user's input, for example like this:
def get_raw_files():
    while True:
        raw_path = input('RAW Files Folder path: ')
        if os.path.isdir(raw_path):
            break
        print('Path not found, please check that it exists')
    return glob.glob('{0}*{1}'.format(raw_path, raw_ext))
get_rec_dict is a very long function; separating it into individual tasks would make it much more readable.
It's good that you've clearly marked some strings as constants, but the names are terribly unclear. They're clearly shortened words, but that makes me unable to gather what they're supposed to mean. Sure, KVSEP is probably a separator, but of what? Try to make them clearer, even if it involves longer, more verbose lines.
In get_rec_dict, your f value seems pointless. Wouldn't it just be the same to end your loop with if OBREP in line? line is never modified in your loop, so it'll give the same result whether you test at the start or at the end of an iteration. If there is a real difference between what you have and what I suggest, then it's hacky and unclear, and you should clarify it with a comment. You should also combine if tests, rather than nesting them:
if OBREP in line and recs:
    yield recs
You can use map to run a function on every value of a list, so instead of this:
v = '\n'.join(val.strip() for val in vlist)
you can do this:
v = '\n'.join(map(str.strip, vlist))
It's a little better performance-wise, and easier to read.
I may be wrong, but I believe you could just call list(get_rec_dict(infile)) directly, rather than looping over the result and appending each value. Your loop doesn't allow breaking or catch errors, so I can't see any difference with it apart from inefficiency.
rec_list += list(get_rec_dict(infile))
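(A hedged aside, not part of the original suggestion: rec_list.extend(get_rec_dict(infile)) is equivalent and avoids materializing the intermediate list that list() builds before the concatenation.)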
It also seems silly to have file_no when all you care about is whether the raw_files list is empty. Use Python's truthiness for this test instead: an empty list evaluates as False, while a list that contains elements evaluates as True.
if raw_files:
    print('{} RAW files loaded.'.format(len(raw_files)))
else:
    print('No file found.')
- Thanks for all the suggestions! I will update the code shortly. You are correct about the constants: KVSEP (renamed KV_SEPARATOR) identifies the separation between Key and Value; VCONT (renamed V_CONTINUE) is a sign that the Value spans several lines. Unfortunately, this is not 100% reliable, as you can see with User_Header. VENDS (renamed V_TERMINATED) indicates the Value has been fully read at the previous line. The intent of the flag f was to know when to yield the dictionary (my logic was: when the dictionary is not empty and we are starting to read a new set of Observations). (YeO, Jan 18, 2016 at 15:27)
- Your two suggestions, v = '\n'.join(map(str.strip, vlist)) and rec_list += list(get_rec_dict(infile)), do seem to improve the performance overall, so I have integrated them as well. (YeO, Jan 18, 2016 at 15:41)
- @YeO Great! Glad to help. When the code is updated you can post a new question to ask for more feedback if you'd like. If you do, please look into the strange behaviour Mathias mentioned in the comments above, as code on CR should always be fully working as intended. (SuperBiasedMan, Jan 18, 2016 at 15:45)
- I finally understand the unnecessary complexity of the f flag, and your suggestion is spot on! Much smarter to test at the end, and the nesting becomes unnecessary as well. Thanks. (YeO, Jan 18, 2016 at 16:37)
- @YeO Ah, glad to hear it. I was starting to think I'd misread it somehow haha. (SuperBiasedMan, Jan 18, 2016 at 17:06)
Here's the edited (and fixed) code after the improvements suggested by @SuperBiasedMan, taking into consideration @Mathias Ettinger's comments. My code was indeed broken and was only returning the same record over and over.
After some more tests, I reverted to the for loop to build the records list, as it seems to be slightly faster; I have kept the suggested alternative as a comment, for reference.
To be noted: @Mathias Ettinger's code is faster. :-)
import glob
import pandas as pd

RAW_EXT = '.raw'
OBS_REPORT = '=' * 15  # identifies a set of Observations (Observer Report)
SUB_REPORT = '[\n'     # identifies a Sub-Report within the main set
KV_SEPARATOR = ' : '   # the Key-Value separator
V_CONTINUE = '    \n'  # if the line ends with four spaces, the Value continues on the next line
V_TERMINED = '  \n'    # if the line is 2 spaces and LF, we got to the end of the Value


def get_rec_dict(file):
    recs = {}
    for line in file:
        # if KV_SEPARATOR is found and the line is not a Sub-Report header,
        # then we have a Key and the start of a Value
        if KV_SEPARATOR in line and not line.endswith(SUB_REPORT):
            vlist = line.split(KV_SEPARATOR)  # the Key is to the left of the separator
            k = vlist.pop(0).strip()
            if line.endswith(V_CONTINUE):
                for line in file:
                    # add all lines ending with 4 spaces to the Value
                    vlist.append(line.strip())
                    if not line.endswith(V_CONTINUE):
                        break
            # User_Header may not use the 4 spaces to indicate multi-line,
            # so we read until we are sure the Value is all captured
            if k == 'User_Header':
                for line in file:
                    if line == V_TERMINED:
                        break  # a V_TERMINED line means we already got all the Value
                    else:
                        vlist.append(line.strip())
            ## recs[k] = '\n'.join(val.strip() for val in vlist) was slower
            recs[k] = '\n'.join(map(str.strip, vlist))
            if k == 'User_Header':
                yield recs  # we yield the record after having stored User_Header, the last field
                recs = {}   # fresh dictionary so the yielded record is not overwritten


def get_raw_files():
    raw_path = input('RAW Files Folder path: ')
    if not raw_path.endswith('\\'):
        raw_path = raw_path + '\\'
    return glob.glob('{0}*{1}'.format(raw_path, RAW_EXT))


rec_list = []
raw_files = get_raw_files()
# Main loop
for raw in raw_files:
    with open(raw, 'r', encoding='latin-1') as infile:
        for rec_dict in get_rec_dict(infile):
            rec_list.append(rec_dict)
        ## rec_list += list(get_rec_dict(infile))
df = pd.DataFrame(rec_list)
if raw_files:
    print('{} RAW files loaded.'.format(len(raw_files)))
else:
    print('No file found.')
Comments:
- ... VENDS after the closing " of User_Header) I only get 1 sample dictionary yielded. What happens is that f = True will execute when reaching the first # ===============, process nothing as there is no ' : ' in this line, and reach if f:; then turn f to False without yielding anything. The first observation is then parsed and, upon reaching the second # ===============, is yielded. The second observation is parsed but not yielded since there is no # =============== any more.
- ... with open('<path_to_file>') as f: print(len(list(get_rec_dict(f)))) to print 2. I get 0 with the original sample and 1 after adding two spaces after the closing " of User_Header.
- ... data.split('\n'), does that make any difference for you?
- ... 'Line_Report': '[' or 'Observer_Report': '[' that are getting into it due to the split removing the line ending that should be present for SBREP, VCONT or VENDS.