Is there a better way to read data from files and combine it all into one big data structure than to use generators?
At the moment, I do the following:
- use generators to read data from the files,
- use NumPy to pack all files into a 3D array,
- use pandas to stack it into a 2D array that is readable for further operations (e.g. plotting).
In my example:
I split the file-reading generators across two modules, and have a third module for reading file names (this file is not connected with the other generator modules). In a last, separate file I import those generator modules and build the arrays with NumPy, then pandas.
In the module code I use os.walk to find the files and a regex to read only the data that is needed.
In the code that builds the arrays with NumPy and pandas:
- I enter the variable parameters needed to get the data from the generator modules.
I pass the data from the generators into an array with:
xdata = np.array([(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols])
I have four Python files, as below:
module1 (gen_enter):
import re, matplotlib as mpl, matplotlib.pyplot as plt, os, fnmatch

def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat): yield os.path.join(path,name)

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"): yield open(name)
        else: pass

def gen_get(sources):
    for s in sources:
        for item in s: yield item

def gen_grep(pattern, fileparse):
    for line in fileparse:
        inputlines = line[:].strip().replace(';',' ')
        if pattern.search(inputlines): yield inputlines
        else: pass

def field_map(dictseq,name,func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d
module2 (gen_returnlines):
from gen_enter import *

def lines_from_dir(filepat, dirname):
    findnames = gen_find(filepat, dirname)
    openfiles = gen_open(findnames)
    getlines = gen_get(openfiles)
    #patlines = gen_grep(pattern, getlines)
    return getlines
module3 (gen_shownames):
import re, matplotlib as mpl, matplotlib.pyplot as plt, os, fnmatch

def gen_shownames(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat): yield name
The main code:
import numpy as np, pandas as pd, os, fnmatch, re
from pylab import *
from gen_returnlines import *
from gen_shownames import *

def dict_cols(lines):
    groups = (patlines.match(line) for line in getlines)
    tuples = (group.groups() for group in groups if group)
    colnames = ('PRESSURE','CURVE')
    line = (dict(zip(colnames,t)) for t in tuples)
    line = (field_map(line,"PRESSURE", lambda s: float(s)))
    line = (field_map(line,"CURVE",float))
    return line

dir='C:\\Users\\REDHOOD\\workspace\\Politechnika_python\\silniki\\files' #note: all files from ../files/
pattern = re.compile(r'(\d{3}\.\d{1})\D*(\d{3}\.\d{1})\D*')
pats = '(\d{3}\.\d{1})\D*(\d{3}\.\d{1})\D*'
patlines = re.compile(pats)

if __name__ == '__main__':
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
    datapanel = pd.Panel(
        xrdata,
        items=[k for k in names],
    )
    datapaneldf = datapanel.to_frame() # pressure and curve -> 360 degrees
1 Answer
Is there a better way to read data from files and combine it all into one big data structure than to use generators?
Whenever possible, using generators is better than collecting data in one big object. One big object needs a lot of memory at once; with a generator you can build a pipeline where only the data needed for the current processing step is in memory, and each item is discarded when it is done with.
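To illustrate the difference, here is a minimal sketch (not taken from your code; 'a.txt' and 'b.txt' are placeholder file names):

def read_lines(paths):
    # Yields one line at a time; no file is ever loaded whole.
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line

def split_fields(lines):
    # Transforms records lazily, as the consumer asks for them.
    for line in lines:
        yield line.strip().split(';')

# Nothing is read until iteration starts, and each record can be
# discarded as soon as it has been processed.
for fields in split_fields(read_lines(['a.txt', 'b.txt'])):
    print(fields)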
I didn't look closely enough at your program to tell whether your processing can be fully streamed this way.
What I can see is that dict_cols builds the last stages of the pipeline, and its output is then collected in full:

def dict_cols(lines):
    groups = (patlines.match(line) for line in getlines)
    tuples = (group.groups() for group in groups if group)
Here, getlines and groups are generators, but you fully consume them downstream: the list comprehension that feeds np.array materializes every record at once. At that point it no longer matters whether getlines and groups were generators or lists, because everything gets loaded into memory anyway.
I don't know whether your processing can be changed so that the generators are never fully consumed, here or later. It would be great if it could. But even if it cannot, it's still good that getlines and groups are generators instead of lists, because you may later find a way to process the data in a streaming fashion, without collecting it all.
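If the records only need to end up in a NumPy array, one small step in that direction is np.fromiter, which pulls values from a generator one at a time instead of going through an intermediate Python list. A sketch, assuming dt_cols yields dicts with 'PRESSURE' and 'CURVE' keys as in your dict_cols:

import numpy as np

# Flatten each record into a stream of floats; np.fromiter consumes it
# value by value, so no temporary list of tuples is built.
values = (v for line in dt_cols
            for v in (float(line['PRESSURE']), float(line['CURVE'])))
xdata = np.fromiter(values, dtype=float).reshape(-1, 2)

The data still ends up in memory, of course, but the throwaway list disappears.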
Using __name__ properly
The purpose of the if __name__ == '__main__': ... check is to make it possible to import a module without executing its top-level code. However, you have this:
if __name__ == '__main__':
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
This defeats the purpose. It would be better to move everything inside the if statement (and what comes after it) into a main function, and call that from the if statement, like this:
def main():
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
    # ... and so on

if __name__ == '__main__':
    main()
Coding style
It seems you're not following PEP8, the official style guide of Python. I suggest giving it a good read and following it.
There shouldn't be multiple statements on one line:
for name in fnmatch.filter(filelist,filepat): yield os.path.join(path,name)
Break the line after the colon:

for name in fnmatch.filter(filelist,filepat):
    yield os.path.join(path,name)

Do the same in other places too; always break the line after a colon.
The else is pointless here and should be removed:

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"): yield open(name)
        else: pass

I see the same in other functions too; fix it everywhere.
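For instance, gen_open with the else dropped and the line broken after the colon (same behavior, just cleaner):

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"):
            yield open(name)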
Put a space after each comma between parameters, so instead of:

def field_map(dictseq,name,func):
    for d in dictseq:

write it like this:

def field_map(dictseq, name, func):
    for d in dictseq:
Avoid wildcard imports like this:
from gen_enter import *
This practice limits the ability of IDEs and static analysis tools to check for invalid references.
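For example, gen_returnlines only uses three names from gen_enter (gen_grep is commented out there), so it could import them explicitly:

from gen_enter import gen_find, gen_open, gen_get

Then the origin of every name is visible at a glance.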
Instead of this:
dir='C:\\Users\\REDHOOD'
you can write it more simply like this:
dir='C:/Users/REDHOOD'
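If you prefer to keep backslashes, a raw string also avoids the doubling (shown here with the shortened path from above):

dir = r'C:\Users\REDHOOD'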
You might also look at pandas' read_csv, which will probably be faster (especially in pandas 0.10).
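As a sketch only; the file name, the ';' separator and the column layout below are assumptions based on your regex and gen_grep, not something confirmed by your post:

import pandas as pd

# Hypothetical: 'data.XL' is a placeholder file name; separator and
# column names are guessed from the question's parsing code.
df = pd.read_csv('data.XL', sep=';', header=None,
                 names=['PRESSURE', 'CURVE'])
xdata = df.values  # an (N, 2) float array, like the original np.array(...)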