I will extend on both my comment and @SuperBiasedMan's answer.
#Bugs?
To start with, I still believe that your code produces one dictionary fewer than there are records per file, at least with the given input. If you rely on finding `# ===============` to yield the record that was just parsed, you will never yield the final one. Instead, I'd rather use the fact that records are always ordered in the same way and that `'User_Header'` is the last field: you can thus yield your result right after parsing that specific field.
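A minimal sketch of that idea (simplified on purpose: multi-line values are ignored, and every line is assumed to be a `key : value` pair):

def get_rec_dict(file):
    rec = {}
    for line in file:
        key, _, value = line.partition(' : ')
        rec[key.strip()] = value.strip()
        if key.strip() == 'User_Header':
            yield rec  # last field of the record: emit it now
            rec = {}   # fresh dict for the next record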
A second thing to note is that you are always using, overriding, and yielding the same dictionary; thus, in the end, you only get the last record, \$n\$ times. Let me show you why:
>>> def test():
...     recs = {}
...     yield recs
...     recs['one'] = 1
...     yield recs
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1, 'two': 2}, {'one': 1, 'two': 2}, {'one': 1, 'two': 2}]
You're basically doing the same thing, except you do it in a `for` loop so it is less visible. You need to switch to a new dictionary after each `yield` so your data is not overwritten.
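For instance, a fixed version of the toy generator above, creating a fresh dictionary after each `yield`:

>>> def test():
...     recs = {}
...     recs['one'] = 1
...     yield recs
...     recs = {}  # new dict: the one already yielded stays untouched
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1}, {'two': 2}]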
#Expand your generators
Generating data with the `yield` keyword can help reduce the memory footprint and increase the overall efficiency of your program, so let's take this approach further. Python has a `yield from` syntax that lets you "chain" iterators, meaning we can wrap a generator into another one and yield the same elements without increased overhead. For instance:
def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from get_rec_dict(f)
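To make the equivalence concrete, here is a toy illustration (not part of your code): `yield from inner()` behaves like looping over `inner()` and re-yielding each element:

def inner():
    yield 1
    yield 2

def outer():
    yield from inner()  # same as: for x in inner(): yield x

assert list(outer()) == [1, 2]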
Going one level further, we can wrap this into the iteration over the `glob` results:
def parse_directory(files):
    for filename in files:
        yield from parse_data(filename)
This lets you build your final `rec_list` using only `list(parse_directory(get_raw_files()))`.
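And if you don't actually need the whole list in memory at once, you can consume the records lazily instead; a hypothetical sketch, where `process` stands in for whatever you do with a single record:

for rec in parse_directory(get_raw_files()):
    process(rec)  # placeholder: handle one record at a time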
#Use EAFP
Looking at your input file, you're expecting many more lines that can be split on `' : '` than lines that can't. In such cases, it is recommended to use the EAFP approach ("easier to ask forgiveness than permission"). Basically, you just split your line anyway, possibly producing a 1-element list, and you try to unpack two elements regardless. If it fails (and it will, but not often), you handle the exception that arises, knowing that you should have skipped this line.
Combine that with `map`ping `str.strip` over each part of the split, and you might end up with something like:
try:
    key, value = map(str.strip, line.split(KVSEP))
except ValueError:
    # Not enough values to unpack
    continue
else:
    recs[key] = value
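For contrast, the LBYL ("look before you leap") equivalent would inspect the split before unpacking it; a sketch using the same `KVSEP`:

parts = line.split(KVSEP)
if len(parts) == 2:  # check first, act second
    key, value = map(str.strip, parts)
    recs[key] = value

Since well-formed lines vastly outnumber malformed ones here, the `try`/`except` version avoids paying for that check on every single line.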
#Proposed improvements
Putting a bit of variable renaming into the equation, as well as using `if __name__ == '__main__'` to protect top-level code:
import os.path
import glob

import pandas as pd

IGNORED = '[\n'
SEPARATOR = ' : '
MULTILINE = ' \n'
END_OF_RECORDS = ' \n'


def get_files(ext='.raw'):
    path = input('RAW Files Folder path: ')
    pattern = os.path.join(path, '*{}'.format(ext))
    return glob.glob(pattern)


def parse_records(file):
    records = {}
    for line in file:
        if line.endswith(IGNORED):
            continue
        try:
            key, value = map(str.strip, line.split(SEPARATOR))
        except ValueError:
            # Not enough values to unpack: skip this line
            continue
        else:
            records[key] = value
        if line.endswith(MULTILINE):
            # The value continues on the following lines
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if not line.endswith(MULTILINE):
                    break
            records[key] = '\n'.join(multiline_value)
        if key == 'User_Header':
            # Last field of a record: consume the remaining lines,
            # yield the record and start over with a fresh dictionary
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if line.endswith(END_OF_RECORDS):
                    break
            records[key] = '\n'.join(multiline_value)
            yield records
            records = {}


def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from parse_records(f)


def parse_files(file_paths):
    for filename in file_paths:
        yield from parse_data(filename)


if __name__ == '__main__':
    files = get_files()
    records = list(parse_files(files))
    if files:
        print('Processed', len(files), 'files and found', len(records), 'records')
    else:
        print('No files found.')
    df = pd.DataFrame(records)
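If you want to eyeball the result quickly, a hypothetical sanity check could be:

print(df.head())            # first few records as rows
print(df.columns.tolist())  # the field names that were parsed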