
The process:

  1. I read the lookup file and create a list aa.

  2. I traverse the results of glob.glob("C:\\project resource\\dump\\*.html") and look for a match.

  3. If a match is found, I create the output file if it does not already exist; if it does, I append to it.

import glob,re
from bs4 import BeautifulSoup
import unidecode

lis=[]
aa=open("C:\project resource\lookup.txt","r")
aa=aa.readlines()
aa=[b.split("\t") for b in aa ]
print "started"
for sa in aa:
    for a in glob.glob("C:\\project resource\\dump\\*.html"):
        if sa[0] =="".join(re.findall("_(\d+).",a)):
            try:
                fi= open(sa[1].strip()+".txt","a")
                html=" ".join(open(a,"r").readlines()[1:])
                soup = BeautifulSoup(html)
                # kill all script and style elements
                for script in soup(["script", "style"]):
                    script.extract()    # rip it out
                # get text
                text = soup.get_text()
                # break into lines and remove leading and trailing space on each
                lines = (line.strip() for line in text.splitlines())
                # break multi-headlines into a line each
                chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
                # drop blank lines
                text = ' '.join(chunk for chunk in chunks if chunk)
                text=re.sub("\s+"," ",text)
                fi.write(unidecode.unidecode(text)+"\n")
            except:
                print a
            fi.close()
            break
    else:
        print sa,"no match"

lookup.txt sample:

123\tapple
124\tdad
125\tdude

Sample file names in dump:

12_123.html
12_124.html
123234_125.html

Time:

File count: 165,000
Time: 2 days

Is there a change that could increase the speed of this process?

asked Jul 30, 2015 at 10:17

1 Answer


Imports should be on different lines.

import glob
import re

lis should be removed, since it is never used.
If you did use it, it should be renamed; a descriptive name is best.


When opening a file, use with:

with open(...) as aa:
 ...

This automatically closes the file, even if there is an unhandled exception. So if, for example, you interrupt the program from the keyboard, the file will still be closed.

Also, change the variable name aa to something more descriptive; file_handler is better.
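
A minimal sketch combining both points, with the path taken from the question:

with open("C:\\project resource\\lookup.txt", "r") as lookup_file:
    lookup = [line.split("\t") for line in lookup_file]

The file is closed as soon as the with block ends, and the name says what the handle is for.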


You overwrite aa. This means that you cannot close it. You should always close a file handler.

You can avoid overwriting aa by changing the for statement:

for sa in (b.split("\t") for b in aa.readlines()):

Python's style guide, PEP8, says that binary operators should have a space on either side.

fi = open(sa[1].strip() + ".txt", "a")

Currently you use three spacing styles, only one of which is allowed.
The following single line shows two of them:

fi= open(sa[1].strip()+".txt","a")

Your try statement is massive, and it uses a bare except. Both are very bad.

You should try to keep the try as small as possible. This is so that you don't 'mask' bugs.

try:
    file_handler = open(sa[1].strip() + ".txt", "a")
except IOError:
    print "Error on opening " + a
else:
    try:
        ...
    finally:
        # close here, so it only runs when open() succeeded
        file_handler.close()

You could remove the print statements. They take a surprising amount of time to print.
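
If you want to see how much, here is a rough, hypothetical measurement with timeit; the exact numbers depend on your terminal:

import timeit

# Printing 10,000 short lines vs. collecting them in a list;
# printing to a terminal is usually far slower.
print(timeit.timeit('print("no match")', number=10000))
print(timeit.timeit('acc.append("no match")', setup='acc = []', number=10000))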


You should wrap your main code in an if __name__ == '__main__': guard.
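
A minimal sketch of the guard:

def main():
    print("started")
    # ... the rest of the script ...

if __name__ == '__main__':
    main()  # runs only when the script is executed directly, not when imported

On its own the guard is about importability, not speed, though moving the code into a function can help slightly in CPython, because local variable lookups are faster than global ones.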


Performance


Your massive amount of whitespace-stripping may be the problem, and you are opening a lot of files.


The former you can deal with.

text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = ' '.join(chunk for chunk in chunks if chunk)

First off this can be simplified.

text = ' '.join(
    chunk
    for chunk in (
        phrase.strip()
        for line in soup.get_text().splitlines()
        for phrase in line.strip().split(" ")
    )
    if chunk
)

You strip every time you split. This can't be good.

It seems that you want to replace every block of whitespace with a single space.
If this is true, then the function below should be simpler and faster.

def strip_entire(string):
    def inner():
        white_space = False
        for char in string:
            if char in ' \t\n':
                if white_space:
                    continue
                white_space = True
                yield ' '
            else:
                white_space = False
                yield char
    return ''.join(inner())

It is \$O(n)\$ and should be faster than yours.
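
If you want to check that, here is a rough timing sketch; strip_entire as defined above must be in scope, and the sample string is made up:

import timeit

SAMPLE = "some  text\n\twith   lots\n\n of   whitespace " * 1000

def original_way(text):
    # the question's strip/split/join chain
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    return ' '.join(chunk for chunk in chunks if chunk)

print(timeit.timeit(lambda: original_way(SAMPLE), number=100))
print(timeit.timeit(lambda: strip_entire(SAMPLE), number=100))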


If you are likely to open the same file multiple times, then you could save the stripped output to another file, say a + ".stripped".

This reduces the amount of duplicate work: your computer can then append the pre-stripped file to the output file almost effortlessly, instead of parsing the HTML again.

with open(sa[1].strip() + ".txt", "a") as file_handler:
    if os.path.exists(a + ".stripped"):
        with open(a + ".stripped", "r") as stripped_handler:
            file_handler.write(stripped_handler.read())
    else:
        # Other method

If you were to make all these changes, you should get something like:

import glob
import re
import unidecode
import os.path
from bs4 import BeautifulSoup


def strip_entire(string):
    def inner():
        white_space = False
        for char in string:
            if char in ' \t\n':
                if white_space:
                    continue
                white_space = True
                yield ' '
            else:
                white_space = False
                yield char
    return ''.join(inner())


def append_file(append_name, html_name):
    with open(append_name, "a") as file_handler:
        if os.path.exists(html_name + ".stripped"):
            with open(html_name + ".stripped", "r") as stripped_handler:
                file_handler.write(stripped_handler.read())
        else:
            with open(html_name, "r") as html_handler:
                soup = BeautifulSoup(html_handler)
                # If you can't have the first line, no matter what:
                #soup = BeautifulSoup(''.join(html_handler.readlines()[1:]))
            for to_remove in soup(("script", "style")):
                to_remove.extract()
            text = unidecode.unidecode(
                re.sub("\s+", " ", strip_entire(soup.get_text()))) + "\n"
            # Save the stripped text so later runs can reuse it.
            with open(html_name + ".stripped", "w") as stripped_handler:
                stripped_handler.write(text)
            file_handler.write(text)


if __name__ == '__main__':
    with open("C:\\project resource\\lookup.txt", "r") as file_handler:
        print("started")
        dump_files = glob.glob("C:\\project resource\\dump\\*.html")
        for split_line in (b.split("\t") for b in file_handler.readlines()):
            for dump_file in dump_files:
                if split_line[0] == "".join(re.findall("_(\d+).", dump_file)):
                    append_file(split_line[1].strip() + ".txt", dump_file)
                    break
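
Note that the rewrite also hoists the glob call out of the loop, where your original ran it once per lookup row. A possible further step, sketched here and not part of the code above, is to index the dump files by their id up front so the inner loop disappears entirely:

import glob
import re

# Build the index once: extracted id -> file path
# (same findall extraction as the main code).
files_by_id = {}
for path in glob.glob("C:\\project resource\\dump\\*.html"):
    file_id = "".join(re.findall("_(\d+).", path))
    if file_id:
        files_by_id[file_id] = path

# The inner for-loop then becomes a single dict lookup:
#     dump_file = files_by_id.get(split_line[0])
#     if dump_file is not None:
#         append_file(split_line[1].strip() + ".txt", dump_file)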

I don't fully understand what your regexes do, as I don't fully understand the re library.
They seem kind of odd; from re.findall("_(\d+).", dump_file) I gathered that you only have one number in each file name that matches that pattern.
And so I decided to leave them as is.
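
For what it's worth, running that findall over the sample file names from the question shows what it extracts:

import re

# The pattern picks out the digits that follow the underscore; join
# collapses the single-element result list into a plain string.
for name in ["12_123.html", "12_124.html", "123234_125.html"]:
    print(name + " -> " + "".join(re.findall("_(\d+).", name)))
# 12_123.html -> 123
# 12_124.html -> 124
# 123234_125.html -> 125

A stricter pattern such as r"_(\d+)\.html$" would arguably state the intent more clearly, but that is only a guess at what the names can look like.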

I also removed the try/except statement, as I didn't know what you were error handling: your loops, your opens, everything, keyboard interrupts? This is one reason you shouldn't have bare except statements.

If you were doing the opens then you should do this:

try:
    file_handler = open(...)
except IOError:
    # Raise an error, break, return, whatever.
else:
    # Code in the with
finally:
    file_handler.close()
answered Jul 30, 2015 at 22:21
  • with also closes the file on unhandled exceptions. (Commented Jul 30, 2015 at 22:51)
  • @JoeWallis First time on Code Review, and it is the best experience, mate. Thanks a lot for your time. (Commented Jul 31, 2015 at 5:01)
  • @JoeWallis I have a doubt: why should imports be on separate lines? (Commented Jul 31, 2015 at 5:05)
  • @JoeWallis I had a bare except because there were many errors, like Unicode errors, file-unavailable errors, and so on. I just wanted to create the files; I don't care about the errors. (Commented Jul 31, 2015 at 5:13)
  • @JoeWallis Last but not least, I thought calling functions would increase the time. Also, is there any performance increase from if __name__ == '__main__':? (Commented Jul 31, 2015 at 5:20)
