I have millions of files in a Google Cloud Storage bucket. I want to search within the files that have a .index extension and retrieve their contents.
This is what I am currently doing, but the overall process takes a long time. Is there a better, faster way to do this?
class storage():
    ...
        self.indexFile = []
        self.indexFileIndex = []
    ...
    def get_content(self,param1,param2):
        c_n = []
        for value, index in zip(param1, param2):
            object_contents = StringIO.StringIO()
            srcObjURI = boto.storage_uri(value, self.storage)
            srcObjURI.get_key().get_file(object_contents)
            c_n.append(object_contents.getvalue())
            object_contents.close()
        return c_n
    def get_PATHs(self):
        paths=[]
        #pts=open("paths.txt","w")
        pts_log=open("paths_log.txt","a")
        pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
        indexFileContents = self.get_content(self.indexFile, self.indexFileIndex)
        for c,d in zip(indexFileContents,self.indexFile):
            regx = r"(.*)\/" + r"(.*?)\|"
            patternPathList = re.compile(regx)
            for match in patternPathList.finditer(c):
                p=match.group(1).strip() + "/"+ match.group(2).strip()
                tst_exst=""
                if p in paths:
                    tst_exst="Already exist !"
                else:
                    tst_exst="Added to PATHs list"
                    paths.append(p)
                    #pts.write(p)
                    #pts.write("\n")
                pts_log.write("FROM : %s --> %s %s"%(d,p,tst_exst))
                pts_log.write("\n")
        #pts.close()
        pts_log.close()
        return paths
The files that I am trying to search vary from 200 KB to 1 MB in size and sometimes contain Unicode characters.
A few remarks to add to @JoeWallis's already excellent answer:
If you want speed, consider using the cStringIO module instead of StringIO. Since your import statements are missing, I can't tell whether you have a try ... except to conditionally import it or not. You should really post the whole code.
There are very few occasions when you want to keep commented-out code. Generally speaking, the best thing to do is to remove dead code and let source control software remind you of what the old code was like.
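As a rough sketch of the conditional import mentioned in the cStringIO remark above: the fallback chain below is my assumption about how one might write it, not something taken from the question. It prefers the C implementation on Python 2 and falls back to io.StringIO where neither Python 2 module exists.

```python
try:
    # Python 2: the C implementation is faster than the pure-Python module.
    from cStringIO import StringIO
except ImportError:
    try:
        # Python 2 without cStringIO available.
        from StringIO import StringIO
    except ImportError:
        # Python 3: both modules were removed; io.StringIO replaces them.
        from io import StringIO

# The buffer is used the same way regardless of which import succeeded.
buf = StringIO()
buf.write("hello")
contents = buf.getvalue()
buf.close()
```

With this in place, the rest of the code would call StringIO() directly instead of StringIO.StringIO().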
Instead of zip, consider using itertools.izip, which uses lazy evaluation to zip the iterables. In your case it does not change much, since you never break out of the loop early (in case of error, maybe?), but in general it saves memory and sometimes avoids computing unused values.
You don't need the parentheses in class storage(): unless you explicitly inherit from a class. Since you don't inherit from anything, you can simply drop them so that your code looks cleaner.
You should use with to open files.
with open(...) as pts_log:
    ...
This implicitly calls pts_log.close(), even if the program fails.
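To illustrate the point about with closing the file even on failure, here is a minimal sketch. The temporary directory and the simulated RuntimeError are assumptions made only so the example runs on its own.

```python
import os
import tempfile

# Hypothetical log location inside a fresh temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "paths_log.txt")

try:
    with open(log_path, "a") as pts_log:
        pts_log.write("first entry\n")
        # Simulate a failure in the middle of the block.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The with statement closed the file although an exception escaped the block.
file_was_closed = pts_log.closed

# What was written before the failure has been flushed to disk.
with open(log_path) as pts_log:
    logged = pts_log.read()
```

Without with, the equivalent guarantee needs an explicit try ... finally around the open and close calls.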
You should use str.format to add values to strings.
pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
# To
pts_log.write('{0}{1}{0}\n'.format('-' * 20, time.ctime()))
Why pass self.* attributes to a method that already has access to self? Also, param1 is not descriptive.
Assuming you want it to be a commonly used function, just change the names of the parameters.
get_content(self, obj, obj_index)
You can reduce the number of variables in get_content, and you can make it more readable.
def get_content(self, obj, obj_index):
    for value, index in zip(obj, obj_index):
        obj_contents = StringIO.StringIO()
        (boto.storage_uri(value, self.storage)
             .get_key()
             .get_file(obj_contents))
        # yield turns this into a generator: one file's contents at a time.
        yield obj_contents.getvalue()
        obj_contents.close()
You need spaces after commas, and you need better variable names.
for c, d in zip(indexFileContents, self.indexFile):
This is very undescriptive.
Why split a regex into two raw strings, only to immediately combine them?
regx = r"(.*)\/(.*?)\|"
This is much easier to read and understand.
There is no point in compiling a regex on every iteration of the loop. Do it outside the loop.
Also, the name patternPathList is unpythonic. Use pattern_path_list instead.
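A minimal sketch of compiling the pattern once and reusing it across all index files follows. The function name extract_paths and the sample index lines (a path followed by | and a number) are hypothetical, since the real index format is not shown in the question.

```python
import re

# Compiled once at import time instead of once per index file.
PATTERN_PATH_LIST = re.compile(r"(.*)\/(.*?)\|")


def extract_paths(index_contents):
    """Return the unique paths found in the given index texts, in first-seen order."""
    paths = []
    for content in index_contents:
        for match in PATTERN_PATH_LIST.finditer(content):
            path = match.group(1).strip() + "/" + match.group(2).strip()
            # Same membership check as the original code, minus the logging.
            if path not in paths:
                paths.append(path)
    return paths


# Hypothetical index contents: one "path|something" entry per line.
sample = ["dir/sub/a.txt|10\ndir/b.txt|20\n", "dir/b.txt|30\n"]
found = extract_paths(sample)
```

The second occurrence of dir/b.txt is skipped, so found holds two entries.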
Python's style guide, PEP 8, says to surround assignment with a single space on either side, and to use the same amount of whitespace on both sides of any binary operator.
p=match.group(1).strip() + "/"+ match.group(2).strip()
# Should be
p = match.group(1).strip() + "/" + match.group(2).strip()
And again, p is undescriptive; path is a better name.
You use the % operator instead of str.format. Python's % operator has some well-known pitfalls (for example, a tuple on the right-hand side is unpacked into separate arguments), so str.format should be used instead.
And you can merge the two pts_log.write calls together.
pts_log.write("FROM : {} --> {} {}\n".format(d, p, tst_exst))
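The sketch below shows the merged log line and one commonly cited pitfall of the % operator: a tuple on the right-hand side is treated as a list of arguments. The helper name log_line and the sample values are made up for illustration.

```python
def log_line(source, path, status):
    """Build one log entry, including the trailing newline, in a single string."""
    return "FROM : {} --> {} {}\n".format(source, path, status)


line = log_line("a.index", "dir/b.txt", "Added to PATHs list")

# Pitfall: with %, a tuple is unpacked as two arguments for one placeholder.
value = ("dir", "b.txt")
try:
    "%s" % value
    percent_failed = False
except TypeError:
    percent_failed = True

# str.format treats the tuple as a single value.
formatted = "{}".format(value)
```

The % version can be made to work by wrapping the tuple as (value,), but that is easy to forget.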