Is there a better way to read data from files and combine it all into one big data structure than to use generators?
At the moment, I do the following:
- use generators to read data from the files,
- use NumPy to pack all files into a 3D array,
- use pandas to stack it into a 2D array that is readable for further operations (e.g. plotting).
In my example:
I split the file-reading generators across two modules, and have a third module for reading file names (this file is not connected with the other generator modules). In a last, separate file I import those generator modules and build the arrays with NumPy, then pandas.
In the module code I use os.walk to find the files and a regex to read only the data that is needed.
In the code that builds the arrays with NumPy and pandas:
- I enter the variable parameters needed to get the data from the generator modules.
I pass the data from the generators into an array with:
xdata = np.array([(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols])
I have four Python files, as below:
module1 (gen_enter):
import re, matplotlib as mpl, matplotlib.pyplot as plt, os, fnmatch

def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat): yield os.path.join(path,name)

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"): yield open(name)
        else: pass

def gen_get(sources):
    for s in sources:
        for item in s: yield item

def gen_grep(pattern, fileparse):
    for line in fileparse:
        inputlines = line[:].strip().replace(';',' ')
        if pattern.search(inputlines): yield inputlines
        else: pass

def field_map(dictseq,name,func):
    for d in dictseq:
        d[name] = func(d[name])
        yield d
module2 (gen_returnlines):
from gen_enter import *

def lines_from_dir(filepat, dirname):
    findnames = gen_find(filepat, dirname)
    openfiles = gen_open(findnames)
    getlines = gen_get(openfiles)
    #patlines = gen_grep(pattern, getlines)
    return getlines
module3 (gen_shownames):
import re, matplotlib as mpl, matplotlib.pyplot as plt, os, fnmatch

def gen_shownames(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat): yield name
The main code:
import numpy as np, pandas as pd, os, fnmatch, re
from pylab import *
from gen_returnlines import *
from gen_shownames import *

def dict_cols(lines):
    groups = (patlines.match(line) for line in getlines)
    tuples = (group.groups() for group in groups if group)
    colnames = ('PRESSURE','CURVE')
    line = (dict(zip(colnames,t)) for t in tuples)
    line = (field_map(line,"PRESSURE", lambda s: float(s)))
    line = (field_map(line,"CURVE",float))
    return line

dir='C:\\Users\\REDHOOD\\workspace\\Politechnika_python\\silniki\\files' #note: all files from ../files/
pattern = re.compile(r'(\d{3}\.\d{1})\D*(\d{3}\.\d{1})\D*')
pats = '(\d{3}\.\d{1})\D*(\d{3}\.\d{1})\D*'
patlines = re.compile(pats)

if __name__ == '__main__':
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
    datapanel = pd.Panel(
        xrdata,
        items=[k for k in names],
    )
    datapaneldf = datapanel.to_frame() # pressure and curve -> 360 degrees
1 Answer
Is there a better way to read data from files and combine it all into one big data structure than to use generators?
Whenever possible, using generators is better than collecting data in one big object. One big object needs a lot of memory at once; with a generator you can build a pipeline where only the data needed for the current processing step is in memory, and each item is discarded when it is done with.
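To illustrate the difference, here is a minimal sketch (not taken from your code; 'a.txt' and 'b.txt' are placeholder file names):

def read_lines(paths):
    # Yields one line at a time; no file is ever loaded whole.
    for path in paths:
        with open(path) as f:
            for line in f:
                yield line

def split_fields(lines):
    # Transforms records lazily, as the consumer asks for them.
    for line in lines:
        yield line.strip().split(';')

# Nothing is read until iteration starts, and each record can be
# discarded as soon as it has been processed.
for fields in split_fields(read_lines(['a.txt', 'b.txt'])):
    print(fields)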
I didn't look closely enough at your program to tell whether your processing can be fully streamed this way.
What I can see is that dict_cols builds the last stages of the pipeline, and its output is then collected in full:

def dict_cols(lines):
    groups = (patlines.match(line) for line in getlines)
    tuples = (group.groups() for group in groups if group)
Here, getlines and groups are generators, but you fully consume them downstream: the list comprehension that feeds np.array materializes every record at once. At that point it no longer matters whether getlines and groups were generators or lists, because everything gets loaded into memory anyway.
I don't know whether your processing can be changed so that the generators are never fully consumed, here or later. It would be great if it could. But even if it cannot, it's still good that getlines and groups are generators instead of lists, because you may later find a way to process the data in a streaming fashion, without collecting it all.
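If the records only need to end up in a NumPy array, one small step in that direction is np.fromiter, which pulls values from a generator one at a time instead of going through an intermediate Python list. A sketch, assuming dt_cols yields dicts with 'PRESSURE' and 'CURVE' keys as in your dict_cols:

import numpy as np

# Flatten each record into a stream of floats; np.fromiter consumes it
# value by value, so no temporary list of tuples is built.
values = (v for line in dt_cols
            for v in (float(line['PRESSURE']), float(line['CURVE'])))
xdata = np.fromiter(values, dtype=float).reshape(-1, 2)

The data still ends up in memory, of course, but the throwaway list disappears.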
Using __name__ properly
The purpose of the if __name__ == '__main__': ... check is to make it possible to import a module without executing its top-level code. However, you have this:
if __name__ == '__main__':
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
This defeats the purpose. It would be better to move everything inside the if statement (and what comes after it) into a main function, and call that from the if statement, like this:
def main():
    names = gen_shownames('*', dir)
    getlines = lines_from_dir('*',dir)
    dt_cols = dict_cols(getlines)
    xdata = np.array( [(float(line['PRESSURE']), float(line['CURVE'])) for line in dt_cols] )
    xrdata = np.reshape(xdata,(17,360,2))
    # ... and so on

if __name__ == '__main__':
    main()
Coding style
It seems you're not following PEP8, the official style guide of Python. I suggest giving it a good read and following it.
There shouldn't be multiple statements on one line:
for name in fnmatch.filter(filelist,filepat): yield os.path.join(path,name)
Break the line after the colon:

for name in fnmatch.filter(filelist,filepat):
    yield os.path.join(path,name)

Do the same in other places too; always break the line after a colon.
The else is pointless here and should be removed:

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"): yield open(name)
        else: pass

I see the same in other functions too; fix it everywhere.
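For instance, gen_open with the else dropped and the line broken after the colon (same behavior, just cleaner):

def gen_open(filenames):
    for name in filenames:
        if name.endswith(".XL"):
            yield open(name)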
Put a space after each comma between parameters, so instead of:

def field_map(dictseq,name,func):
    for d in dictseq:

write it like this:

def field_map(dictseq, name, func):
    for d in dictseq:
Avoid wildcard imports like this:
from gen_enter import *
This practice limits the ability of IDEs and static analysis tools to check for invalid references.
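For example, gen_returnlines only uses three names from gen_enter (gen_grep is commented out there), so it could import them explicitly:

from gen_enter import gen_find, gen_open, gen_get

Then the origin of every name is visible at a glance.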
Instead of this:
dir='C:\\Users\\REDHOOD'
you can write it more simply like this:
dir='C:/Users/REDHOOD'
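If you prefer to keep backslashes, a raw string also avoids the doubling (shown here with the shortened path from above):

dir = r'C:\Users\REDHOOD'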
You might also look at pandas' read_csv, which will probably be faster (especially in pandas 0.10).
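As a sketch only; the file name, the ';' separator and the column layout below are assumptions based on your regex and gen_grep, not something confirmed by your post:

import pandas as pd

# Hypothetical: 'data.XL' is a placeholder file name; separator and
# column names are guessed from the question's parsing code.
df = pd.read_csv('data.XL', sep=';', header=None,
                 names=['PRESSURE', 'CURVE'])
xdata = df.values  # an (N, 2) float array, like the original np.array(...)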