Searching within multiple objects of the bucket

Question 1

I have millions of files in a Google Cloud Storage bucket. I want to search within the files that have a .index extension and retrieve their contents.

This is what I am currently doing, but the overall process takes a long time. Is there a better and faster way to do this?

class storage():
    ...
        self.indexFile = []
        self.indexFileIndex = []
    ...
    def get_content(self,param1,param2):
        c_n = []
        for value, index in zip(param1, param2):
            object_contents = StringIO.StringIO()
            srcObjURI = boto.storage_uri(value, self.storage)
            srcObjURI.get_key().get_file(object_contents)
            c_n.append(object_contents.getvalue())
            object_contents.close()
        return c_n
    def get_PATHs(self):
        paths=[]
        #pts=open("paths.txt","w")
        pts_log=open("paths_log.txt","a")
        pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
        indexFileContents = self.get_content(self.indexFile, self.indexFileIndex)
        for c,d in zip(indexFileContents,self.indexFile):
            regx = r"(.*)\/" + r"(.*?)\|"
            patternPathList = re.compile(regx)
            for match in patternPathList.finditer(c):
                p=match.group(1).strip() + "/"+ match.group(2).strip()
                tst_exst=""
                if p in paths:
                    tst_exst="Already exist !"
                else:
                    tst_exst="Added to PATHs list"
                    paths.append(p)
                    #pts.write(p)
                    #pts.write("\n")
                pts_log.write("FROM : %s --> %s %s"%(d,p,tst_exst))
                pts_log.write("\n")
        #pts.close()
        pts_log.close()
        return paths

The files that I am trying to search range from 200 KB to 1 MB and sometimes contain Unicode characters.

Answer 1

A few remarks to add to @JoeWallis's already excellent answer:

- If you want speed, consider using the cStringIO module instead of StringIO. Since your import statements are missing, I can't tell whether you already have a try ... except block to import it conditionally. You should really post the whole code.
- There are very few occasions when you want to keep commented-out code. Generally speaking, the best thing to do is to remove dead code and let source control software remind you of what the old code was like.
- Instead of zip, consider using itertools.izip, which zips the iterables lazily. In your case it changes little, since you never break out of the loop early (except perhaps on an error), but in general it saves memory and sometimes avoids computing unused values.
- You don't need the parentheses in class storage(): unless you explicitly inherit from a class. Since you don't inherit from anything, you can simply drop them so that your code looks cleaner.
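
As a rough sketch of the first and third points, assuming the code has to run on both Python 2 and Python 3, the conditional imports could look something like this. The alias lazy_zip is just a name chosen for this example, not something from the original code.

```python
# Prefer the C implementation on Python 2 and fall back where it is missing.
try:
    from cStringIO import StringIO
except ImportError:
    try:
        from StringIO import StringIO  # Python 2, pure-Python module
    except ImportError:
        from io import StringIO  # Python 3

# itertools.izip only exists on Python 2; Python 3's built-in zip is already lazy.
try:
    from itertools import izip as lazy_zip
except ImportError:
    lazy_zip = zip

buffer = StringIO()
buffer.write("example")
contents = buffer.getvalue()
buffer.close()

pairs = list(lazy_zip(["a", "b"], [1, 2]))
```

The rest of the code can then use StringIO and lazy_zip without caring which implementation was picked.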

Answer 2

You should use with to open files.

with open(...) as pts_log:
    ...

This implicitly calls pts_log.close(), even if an exception is raised inside the block.
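
To illustrate, here is a small sketch that fails on purpose inside the with block. The temporary directory and the simulated error are assumptions made for the demonstration; only the file name paths_log.txt comes from the question.

```python
import os
import tempfile

# Hypothetical location for the log, so the example does not touch real files.
log_path = os.path.join(tempfile.mkdtemp(), "paths_log.txt")

try:
    with open(log_path, "a") as pts_log:
        pts_log.write("about to fail\n")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The file was closed on the way out of the block, despite the exception.
closed_after_error = pts_log.closed

with open(log_path) as check:
    written = check.read()
```

What was written before the error is still flushed to disk, and no explicit close call is needed.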

You should use str.format to add values to strings.

pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
# To
pts_log.write('{0}{1}{0}\n'.format('-' * 20, time.ctime()))
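
As a quick check of what the numbered fields do, here is a minimal sketch that builds the same header as a string instead of writing it to a file. Note that time.ctime has to be called; passing the function itself would format its repr. The names separator and header are only for this example.

```python
import time

separator = "-" * 20
timestamp = time.ctime()

# {0} is reused on both sides of the timestamp, so the dashes are written once.
header = "{0}{1}{0}\n".format(separator, timestamp)
```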

Why pass self.variables to a function that has self?
Also param1 is not descriptive.

Assuming you want it to be a commonly used function, just change the name of the parameters.

get_content(self, obj, obj_index)

You can reduce the number of variables in get_content and make it more readable. Yielding each result also removes the need for the c_n list.

def get_content(self, obj, obj_index):
    for value, index in zip(obj, obj_index):
        obj_contents = StringIO.StringIO()
        # Chain the calls instead of storing the intermediate URI.
        (boto.storage_uri(value, self.storage)
             .get_key()
             .get_file(obj_contents))
        yield obj_contents.getvalue()
        obj_contents.close()
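
Since the snippet above needs boto and a real bucket to run, here is a self-contained sketch of the same generator shape. The fetch_into function is a hypothetical stand-in for the boto get_file call, and io.BytesIO is used here in place of StringIO.StringIO so the example also runs on Python 3.

```python
import io


def fetch_into(name, buffer):
    """Hypothetical stand-in for boto's get_key().get_file(buffer)."""
    buffer.write(("contents of " + name).encode("utf-8"))


def get_content(names, fetch=fetch_into):
    """Lazily yield the contents of each named object, one at a time."""
    for name in names:
        # The buffer is closed when the generator moves on to the next name.
        with io.BytesIO() as buffer:
            fetch(name, buffer)
            yield buffer.getvalue()


contents = list(get_content(["a.index", "b.index"]))
```

Because the function is a generator, a caller that only needs the first few files never downloads the rest.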

You need spaces after commas, and you need better variable names.

for c, d in zip(indexFileContents, self.indexFile):

The names c and d are very undescriptive.

Why split a regex into two raw strings, only to combine them immediately?

regx = r"(.*)\/(.*?)\|"

This is much easier to read and understand.

There is no point in compiling a regex every loop. Do it outside the loop.

Also the name patternPathList is unpythonic. Use pattern_path_list instead.
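
A minimal sketch of compiling once, outside any loop, might look like this. The function name extract_paths, the constant name PATH_PATTERN and the sample strings are made up for illustration; the pattern itself is the one from the question.

```python
import re

# Compiled once at import time instead of once per index file.
PATH_PATTERN = re.compile(r"(.*)\/(.*?)\|")


def extract_paths(index_contents):
    """Collect each distinct 'directory/name' found before a '|' separator."""
    paths = []
    for content in index_contents:
        for match in PATH_PATTERN.finditer(content):
            path = match.group(1).strip() + "/" + match.group(2).strip()
            if path not in paths:
                paths.append(path)
    return paths


# Hypothetical index file contents, only to exercise the function.
sample = ["dir/a.txt|1\ndir/b.txt |2\n", "dir/a.txt|3\n"]
found = extract_paths(sample)
```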

Python's style guide, PEP 8, says to surround assignment operators with a single space on either side, and to space other binary operators consistently.

p=match.group(1).strip() + "/"+ match.group(2).strip()
# Should be
p = match.group(1).strip() + "/" + match.group(2).strip()

And again p is undescriptive. path is a better name.

You use the % operator instead of str.format. The % operator has some awkward edge cases (for example, a tuple on the right-hand side is treated as the argument list rather than as a single value), so str.format is the safer choice.

And you can merge the two pts_log.write calls together.

pts_log.write("FROM : {} --> {} {}\n".format(d, p, tst_exst))
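
One concrete difference between the two styles shows up when the value being formatted is a tuple. The sketch below uses made-up values; the variable names are only for this example.

```python
value = ("a", "b")

# With %, a tuple on the right is unpacked as the argument list, so one %s
# and two tuple items raise a TypeError.
try:
    "%s" % value
    percent_error = None
except TypeError as error:
    percent_error = str(error)

# str.format treats the tuple as one ordinary argument.
formatted = "{}".format(value)

# The merged log line from above, built as a string with hypothetical values.
log_line = "FROM : {} --> {} {}\n".format("x.index", "dir/a.txt", "Added to PATHs list")
```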

Morwenn · Answer 1 · 2015-07-30 18:56:55Z


Peilonrayz · Answer 2 · 2015-07-30 18:40:29Z

