I will extend on both my comment and @SuperBiasedMan's answer.
#Bugs?
To start with, I still believe that your code produces one dictionary fewer than there are records per file, at least with the given input. If you rely on finding `# ===============` to yield the record that was just parsed, you will never yield the final one. Instead, I'd rather use the fact that records are always ordered in the same way and that `'User_Header'` is the last field: you can thus yield your result right after parsing that specific field.
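A minimal sketch of that idea (simplified on purpose: multi-line values are ignored, and every line is assumed to be a `key : value` pair):

def get_rec_dict(file):
    rec = {}
    for line in file:
        key, _, value = line.partition(' : ')
        rec[key.strip()] = value.strip()
        if key.strip() == 'User_Header':
            yield rec  # last field of the record: emit it now
            rec = {}   # fresh dict for the next record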
A second thing to note is that you are always using, overriding, and yielding the same dictionary; thus, in the end, you only get the last record, \$n\$ times. Let me show you why:
>>> def test():
...     recs = {}
...     yield recs
...     recs['one'] = 1
...     yield recs
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1, 'two': 2}, {'one': 1, 'two': 2}, {'one': 1, 'two': 2}]
You're basically doing the same thing, except you do it in a `for` loop so it is less visible. You need to switch to a new dictionary after each `yield` so your data is not overwritten.
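For instance, a fixed version of the toy generator above, creating a fresh dictionary after each `yield`:

>>> def test():
...     recs = {}
...     recs['one'] = 1
...     yield recs
...     recs = {}  # new dict: the one already yielded stays untouched
...     recs['two'] = 2
...     yield recs
...
>>> list(test())
[{'one': 1}, {'two': 2}]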
#Expand your generators
Generating data with the `yield` keyword can help reduce the memory footprint and increase the overall efficiency of your program, so let's take this approach further. Python has a `yield from` syntax that lets you "chain" iterators, meaning we can wrap a generator into another one and yield the same elements without increased overhead. For instance:
def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from get_rec_dict(f)
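To make the equivalence concrete, here is a toy illustration (not part of your code): `yield from inner()` behaves like looping over `inner()` and re-yielding each element:

def inner():
    yield 1
    yield 2

def outer():
    yield from inner()  # same as: for x in inner(): yield x

assert list(outer()) == [1, 2]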
Going one level further, we can wrap this into the iteration over the `glob` results:
def parse_directory(files):
    for filename in files:
        yield from parse_data(filename)
This lets you build your final `rec_list` using only `list(parse_directory(get_raw_files()))`.
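And if you don't actually need the whole list in memory at once, you can consume the records lazily instead; a hypothetical sketch, where `process` stands in for whatever you do with a single record:

for rec in parse_directory(get_raw_files()):
    process(rec)  # placeholder: handle one record at a time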
#Use EAFP
Looking at your input file, you're expecting many more lines that can be split on `' : '` than lines that can't. In such cases, it is recommended to use the EAFP approach ("easier to ask forgiveness than permission"). Basically, you just split your line anyway, possibly producing a 1-element list, and you try to unpack two elements regardless. If it fails (and it will, but not often), you handle the exception that arises, knowing that you should have skipped this line.
Combine that with `map`ping `str.strip` over each part of the split, and you might end up with something like:
try:
    key, value = map(str.strip, line.split(KVSEP))
except ValueError:
    # Not enough values to unpack
    continue
else:
    recs[key] = value
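For contrast, the LBYL ("look before you leap") equivalent would inspect the split before unpacking it; a sketch using the same `KVSEP`:

parts = line.split(KVSEP)
if len(parts) == 2:  # check first, act second
    key, value = map(str.strip, parts)
    recs[key] = value

Since well-formed lines vastly outnumber malformed ones here, the `try`/`except` version avoids paying for that check on every single line.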
#Proposed improvements
Putting a bit of variable renaming into the equation, as well as using `if __name__ == '__main__'` to protect top-level code:
import os.path
import glob

import pandas as pd

IGNORED = '[\n'
SEPARATOR = ' : '
MULTILINE = ' \n'
END_OF_RECORDS = ' \n'


def get_files(ext='.raw'):
    path = input('RAW Files Folder path: ')
    pattern = os.path.join(path, '*{}'.format(ext))
    return glob.glob(pattern)


def parse_records(file):
    records = {}
    for line in file:
        if line.endswith(IGNORED):
            continue
        try:
            key, value = map(str.strip, line.split(SEPARATOR))
        except ValueError:
            # Not enough values to unpack: skip this line
            continue
        else:
            records[key] = value
        if line.endswith(MULTILINE):
            # The value continues on the following lines
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if not line.endswith(MULTILINE):
                    break
            records[key] = '\n'.join(multiline_value)
        if key == 'User_Header':
            # Last field of a record: consume the remaining lines,
            # yield the record and start over with a fresh dictionary
            multiline_value = [value]
            for line in file:
                multiline_value.append(line.strip())
                if line.endswith(END_OF_RECORDS):
                    break
            records[key] = '\n'.join(multiline_value)
            yield records
            records = {}


def parse_data(filename):
    with open(filename, 'r', encoding='latin-1') as f:
        yield from parse_records(f)


def parse_files(file_paths):
    for filename in file_paths:
        yield from parse_data(filename)


if __name__ == '__main__':
    files = get_files()
    records = list(parse_files(files))
    if files:
        print('Processed', len(files), 'files and found', len(records), 'records')
    else:
        print('No files found.')
    df = pd.DataFrame(records)
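If you want to eyeball the result quickly, a hypothetical sanity check could be:

print(df.head())            # first few records as rows
print(df.columns.tolist())  # the field names that were parsed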