4
\$\begingroup\$

I have millions of files in a Google Cloud Storage bucket. I want to search within the files that have a .index extension and retrieve their contents.

This is what I am currently doing, but the overall process takes a long time. Is there a better, faster way to do this?

class storage():
    ...
        self.indexFile = []
        self.indexFileIndex = []
    ...
    def get_content(self,param1,param2):
        c_n = []
        for value, index in zip(param1, param2):
            object_contents = StringIO.StringIO()
            srcObjURI = boto.storage_uri(value, self.storage)
            srcObjURI.get_key().get_file(object_contents)
            c_n.append(object_contents.getvalue())
            object_contents.close()
        return c_n
    def get_PATHs(self):
        paths=[]
        #pts=open("paths.txt","w")
        pts_log=open("paths_log.txt","a")
        pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
        indexFileContents = self.get_content(self.indexFile, self.indexFileIndex)
        for c,d in zip(indexFileContents,self.indexFile):
            regx = r"(.*)\/" + r"(.*?)\|"
            patternPathList = re.compile(regx)
            for match in patternPathList.finditer(c):
                p=match.group(1).strip() + "/"+ match.group(2).strip()
                tst_exst=""
                if p in paths:
                    tst_exst="Already exist !"
                else:
                    tst_exst="Added to PATHs list"
                    paths.append(p)
                    #pts.write(p)
                    #pts.write("\n")
                pts_log.write("FROM : %s --> %s %s"%(d,p,tst_exst))
                pts_log.write("\n")
        #pts.close()
        pts_log.close()
        return paths

The files that I am trying to search vary from 200 KB to 1 MB and sometimes contain Unicode characters.

asked Jul 30, 2015 at 17:57
\$\endgroup\$
0

2 Answers

4
\$\begingroup\$

A few remarks to add to @JoeWallis's already excellent answer:

  • If you want speed, consider using the module cStringIO instead of StringIO. Since we are missing your import statements, I can't tell whether you have a try ... except to conditionally import it or not. You should really post the whole code.

  • There are very few occasions when you want to keep commented out code. Generally speaking, the best thing to do is to remove dead code and let source control software remind you of what the old code was like.

  • Instead of zip, consider using itertools.izip, which zips the iterables lazily. In your case it does not change much, since you never break out of the loop early (except perhaps on an error), but in general it saves memory and sometimes avoids computing unused values.

  • You don't need the parentheses in class storage(): unless you explicitly inherit from a class. Since you don't inherit from anything, you can simply drop them so that your code looks cleaner.
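
The first and third points can be sketched as conditional imports. This is only an illustration, not the asker's actual import section (which was not posted): it assumes the code should prefer the faster Python 2 modules and fall back to the built-in equivalents where they are missing. The names `buffer`, `contents`, `pairs` and `first_pair` are made up for the demonstration.

```python
# Prefer the C implementation on Python 2, fall back otherwise.
try:
    from cStringIO import StringIO       # Python 2, fast C version
except ImportError:
    try:
        from StringIO import StringIO    # Python 2, pure Python version
    except ImportError:
        from io import StringIO          # Python 3

# itertools.izip only exists on Python 2; Python 3's zip is already lazy.
try:
    from itertools import izip
except ImportError:
    izip = zip

# In-memory buffer used the same way as in get_content.
buffer = StringIO()
buffer.write("index data")
contents = buffer.getvalue()
buffer.close()

# Lazy pairing: nothing is paired up until the iterator is consumed.
pairs = izip(["a.index", "b.index"], [0, 1])
first_pair = next(iter(pairs))
```

With this in place the rest of the code can call `StringIO()` and `izip(...)` without caring which implementation was picked.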

answered Jul 30, 2015 at 18:56
\$\endgroup\$
3
\$\begingroup\$

You should use with to open files.

with open(...) as pts_log:
    ...

This implicitly calls pts_log.close(), even if the code inside the block raises an exception.
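
To see that behaviour in isolation, here is a small sketch. The temporary directory, the log text and the simulated error are all made up for the demonstration; only the `with open(...) as pts_log` pattern comes from the advice above.

```python
import os
import tempfile

# Hypothetical log location inside a throwaway directory.
log_path = os.path.join(tempfile.mkdtemp(), "paths_log.txt")

try:
    with open(log_path, "a") as pts_log:
        pts_log.write("first line\n")
        # Simulate a failure in the middle of the block.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The file object was closed when the block was left, despite the error.
closed_after_error = pts_log.closed

# The line written before the error was flushed to disk on close.
with open(log_path) as log_file:
    written = log_file.read()
```

No explicit `close()` call is needed in either block.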


You should use str.format to add values to strings.

pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
# To
pts_log.write('{0}{1}{0}\n'.format('-' * 20, time.ctime()))
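
As a rough illustration of how the numbered placeholders work, the header can be built by a small helper. The function name `log_header` and its parameters are hypothetical; the format string is the one shown above, where `{0}` is reused on both sides of the timestamp.

```python
import time


def log_header(timestamp=None, width=20):
    """Build the dashed header line; uses the current time by default."""
    if timestamp is None:
        timestamp = time.ctime()
    # {0} appears twice, so the dash run is only passed once.
    return "{0}{1}{0}\n".format("-" * width, timestamp)


# A fixed, made-up timestamp keeps the result predictable.
header = log_header("Thu Jul 30 18:40:00 2015")
```

Calling `log_header()` with no arguments produces the same layout with the current time.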

Why pass self.indexFile and self.indexFileIndex to a method that already has access to self?
Also, param1 is not a descriptive name.

Assuming you want it to be a commonly used function, just change the name of the parameters.

get_content(self, obj, obj_index)

You can reduce the number of variables in get_content and make it more readable. Note that this version yields each file's contents instead of building a list, so it becomes a generator.

def get_content(self, obj, obj_index):
    for value, index in zip(obj, obj_index):
        obj_contents = StringIO.StringIO()
        # Chain the calls; the outer parentheses allow the line breaks.
        (boto.storage_uri(value, self.storage)
             .get_key()
             .get_file(obj_contents))
        yield obj_contents.getvalue()
        obj_contents.close()
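
The effect of yielding can be tried without a storage bucket. In the sketch below the boto download is replaced by a hypothetical stand-in, `fake_get_file`, that writes made-up text into the buffer; the file names are invented too. It is meant only to show that contents are produced one at a time, not how boto behaves.

```python
import io


def fake_get_file(name, buffer):
    """Hypothetical stand-in for the boto download; writes made-up text."""
    buffer.write(u"contents of " + name)


def get_content(names, get_file=fake_get_file):
    """Yield the contents of each named object, one at a time."""
    for name in names:
        obj_contents = io.StringIO()
        try:
            get_file(name, obj_contents)
            yield obj_contents.getvalue()
        finally:
            # Runs when the consumer asks for the next item or stops early.
            obj_contents.close()


contents = get_content([u"a.index", u"b.index"])
first = next(contents)    # only a.index has been "downloaded" so far
rest = list(contents)     # the remaining objects are fetched here
```

Because only one buffer is alive at a time, memory use no longer grows with the number of index files, which may matter with files of up to 1 MB each.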

You need spaces after commas. And you need better variable names.

for c, d in zip(indexFileContents, self.indexFile):

The names c and d are very undescriptive.


Why split a regex into two raw strings, only to immediately combine them?

regx = r"(.*)\/(.*?)\|"

This is much easier to read and understand.


There is no point in compiling a regex every loop. Do it outside the loop.

Also the name patternPathList is unpythonic. Use pattern_path_list instead.
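
Putting the regex advice together, a minimal sketch could look like this. The function name `extract_paths` and the sample index contents are made up, since the real .index format was not posted; the pattern and the duplicate check follow the code in the question.

```python
import re

# Compiled once at import time instead of once per index file.
PATH_PATTERN = re.compile(r"(.*)\/(.*?)\|")


def extract_paths(index_contents):
    """Collect unique 'directory/name' paths from the given index texts."""
    paths = []
    for content in index_contents:
        for match in PATH_PATTERN.finditer(content):
            path = match.group(1).strip() + "/" + match.group(2).strip()
            if path not in paths:
                paths.append(path)
    return paths


# Hypothetical index contents: one 'path|metadata' entry per line.
sample_contents = [
    "dir/sub/file.txt|meta\nother/data.bin|meta\n",
    "dir/sub/file.txt|meta\n",
]
found = extract_paths(sample_contents)
```

Here the greedy first group takes everything up to the last slash on a line, and the lazy second group stops at the first pipe after it.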


Python's style guide, PEP 8, recommends a single space on either side of assignment and other binary operators.

p=match.group(1).strip() + "/"+ match.group(2).strip()
# Should be
p = match.group(1).strip() + "/" + match.group(2).strip()

And again p is undescriptive. path is a better name.


You use the % operator instead of str.format. Python's % operator has some well-known quirks, such as misbehaving when the single value to format happens to be a tuple, so str.format should be used instead.

And you can merge the two pts_log.write together.

pts_log.write("FROM : {} --> {} {}\n".format(d, p, tst_exst))
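
One concrete case where the two styles differ is a tuple passed as the only value. The sketch below uses a hypothetical helper, `format_log_line`, and made-up values to show the merged log line and that pitfall side by side.

```python
def format_log_line(source, path, status):
    """Build one log line, newline included, with str.format."""
    return "FROM : {} --> {} {}\n".format(source, path, status)


line = format_log_line("a.index", "dir/file.txt", "Added to PATHs list")

# A tuple on the right of % is unpacked as the argument list,
# so a single %s with a two-item tuple raises TypeError.
pair = ("dir", "file.txt")
try:
    percent_result = "%s" % pair
except TypeError:
    percent_result = "TypeError"

# str.format treats the tuple as one value.
format_result = "{}".format(pair)
```

The usual workaround with % is to wrap the value in a one-item tuple, `"%s" % (pair,)`, which is easy to forget.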
answered Jul 30, 2015 at 18:40
\$\endgroup\$
0
