I am working with this code to parse HTML files stored on my computer and extract the text of elements matching a certain tag:
from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
                        break

trade_spider()
The code works perfectly, thanks to the help of the Stack Overflow community! As I am not an expert in Python, I am wondering whether some of you know tricks to speed up my code and reduce processing time, as it has to parse through ~4 million files.
In a nutshell, what my code does: open an output text file -> parse through all HTML documents in a set directory -> if the regex is found, print the result to the open text file -> break (no more than one match per file) and continue with the next file.
I am open to any suggestions on improving this code.
Update:
Further explanation: basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document, and if this attribute is found, I want the name of the file, the name attribute, and the corresponding HTML text printed into a separate text file.
An example string that I extracted from a single HTML file is:
<span class="fontid4"><span style="display: inline-block; white-space: nowrap; margin-right: 0px;"><span style="display: inline-block; width: 3.31pt; background-color: #ffffff;"><span style="display: inline-block; min-height: 1em;"></span></span><span style="width: 59.57pt; display: inline-block; text-align: right;"><ix:nonFraction name="f:AuditFeesExpenses" contextRef="c201" unitRef="u5" decimals="0" format="ixt:numcommadot">8,930</ix:nonFraction></span><span style="display: inline-block; width: 3.36pt; background-color: #ffffff;"><span style="display: inline-block; min-height: 1em;"></span></span></span></span>
2 Answers
I don't know if this would be significant, but a first suggestion would be to replace the relatively costly re operation with the basic string method item['name'].endswith("AuditFeesExpenses").
Another possible suggestion, based on @Dex'ter's comment, would be to change the stdout redirection into a regular .write() on the output file, as sketched below.
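A rough sketch of how both suggestions could fit into the original loop (the behaviour is meant to stay the same, but treat it as an outline rather than a tested drop-in replacement):

from bs4 import BeautifulSoup
import glob
import os


def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with open("auditfeesexpenses.txt", "w") as out:
        for file in glob.iglob("**/*.html", recursive=True):
            with open(file, encoding="utf8") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                # Plain suffix check instead of the regex.
                if item["name"].endswith("AuditFeesExpenses"):
                    # Write directly to the already-open output file
                    # instead of redirecting stdout.
                    out.write("{}| {}| {}\n".format(
                        os.path.basename(file), item["name"], item.get_text()))
                    break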
But what I'd really recommend is to profile the script to figure out the hot spots. I suspect that the bottleneck is within BeautifulSoup, and if that's the case (given that you're only searching for a substring and not parsing a full document), perhaps you could find an alternative search method.
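For example, with the standard cProfile and pstats modules, run over a small sample directory first (the file name spider.prof is arbitrary):

import cProfile
import pstats

# Profile one full run and dump the statistics to a file ...
cProfile.run("trade_spider()", "spider.prof")
# ... then show the 20 most expensive calls by cumulative time.
pstats.Stats("spider.prof").sort_stats("cumulative").print_stats(20)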
- You are certainly right that the re operation could be a part that can be improved. I'll check on that! I know that my redirection with stdout could be another issue. However, I don't know how to get my output automatically into a text file while processing all the files. What would be a suggestion for .write()? I'll check on profiling my script later on and see where the hotspots are. – Florian Schramm, May 17, 2016 at 6:33
Apart from profiling, which is certainly a good idea to get a better understanding of the bottlenecks, I'd also recommend looking into streaming parsing instead of reading all the files completely and building a full DOM every single time. The other thing would be to process more than one file at a time using the multiprocessing module (with multiple processes, not threads, so you don't run into problems with the GIL); see the sketch below. A similar result could probably be achieved with xargs handing multiple input files to several script instances.
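A minimal sketch of the multiprocessing idea, assuming the per-file work is factored into a process_file() helper (a name made up here) that returns a formatted result line or None. Only the parent process writes to the output file, so no locking is needed:

import glob
import multiprocessing
import os

from bs4 import BeautifulSoup


def process_file(path):
    """Parse one file; return a result line, or None if nothing matched."""
    with open(path, encoding="utf8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for item in soup.findAll("ix:nonfraction"):
        if item["name"].endswith("AuditFeesExpenses"):
            return "{}| {}| {}".format(
                os.path.basename(path), item["name"], item.get_text())
    return None


def main():
    files = glob.iglob("**/*.html", recursive=True)
    # Worker processes do the parsing; only the parent writes results.
    with multiprocessing.Pool() as pool, \
            open("auditfeesexpenses.txt", "w") as out:
        for line in pool.imap_unordered(process_file, files, chunksize=100):
            if line is not None:
                out.write(line + "\n")


if __name__ == "__main__":
    main()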
For Python 3 there's html.parser; maybe take a look at that, e.g. something like this:
from html.parser import HTMLParser

BUFFER_SIZE = 4096


def valid_tag(tag, attrs):
    # True for an ix:nonfraction tag whose name attribute has the right suffix.
    if tag == "ix:nonfraction":
        for name, value in attrs:
            if name == "name" and value.endswith("AuditFeesExpenses"):
                return True


class MyMatcher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record_data = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if valid_tag(tag, attrs):
            self.record_data = True
            self.data = []

    def handle_endtag(self, tag):
        if tag == "ix:nonfraction":
            self.record_data = False
            print("".join(self.data))

    def handle_data(self, data):
        # Collect text only while inside a matching tag.
        if self.record_data:
            self.data.append(data)


def trade_spider():
    matcher = MyMatcher()
    with open("foo.html", encoding="utf8") as f:
        matcher.reset()
        # Feed the file to the parser in chunks instead of reading it whole.
        chunk = f.read(BUFFER_SIZE)
        while chunk:
            matcher.feed(chunk)
            chunk = f.read(BUFFER_SIZE)
        matcher.close()


if __name__ == "__main__":
    trade_spider()
Note that it depends entirely on your HTML structure how complicated the parser instance will be - with multiple nested elements etc. you'd have to track the current nesting level to correctly collect and dump the text content; the example is quite limited in that respect. The main advantage is not reading the whole file into memory and not constructing a DOM in the first place.
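To run the matcher over the whole directory tree instead of the hard-coded foo.html, the original glob loop could be reused; a sketch along those lines (one matcher instance, reset before each file):

import glob


def trade_spider():
    matcher = MyMatcher()
    for path in glob.iglob("**/*.html", recursive=True):
        with open(path, encoding="utf8") as f:
            matcher.reset()  # reuse the same parser instance for each file
            chunk = f.read(BUFFER_SIZE)
            while chunk:
                matcher.feed(chunk)
                chunk = f.read(BUFFER_SIZE)
            matcher.close()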
Some more general remarks about the code as is:
- Preferably don't do the whole standard output redirection - just pass through the file object you want to write to, or redirect the standard output of the Python script in the shell.
- Use the __name__ variable to run your main function. That way the script could still be imported / loaded into a running Python instance without immediately executing it.
- Edited my OP. I know that my redirection will probably slow down the whole process, but I need a text file with my output in a certain format: Filename | Tagname | HTML Text; that's why I used stdout. Any idea how to do this more elegantly? I don't want to read each file into memory; my intention is to open each file, browse for this certain name attribute and, if it is found, print the corresponding HTML text (which is a number). I used Beautiful Soup as I had already used it for a web crawler I wrote, and I thought this would also work for locally stored HTML files. – Florian Schramm, May 17, 2016 at 6:51
- I'll check on your suggested code later on and give you some feedback on how it works. – Florian Schramm, May 17, 2016 at 6:51
- @FlorianSchramm I'm not saying that the redirection will cause a slowdown (pretty sure that's negligible, but feel free to measure it); I'm saying that it's cleaner to directly call output.write with the correct output file if you want to do that in Python. – ferada, May 17, 2016 at 7:21
- How would you integrate your output.write() approach into my code? I am not a big Python expert... Regarding the Beautiful Soup module: do you know whether an approach with html.parser (as you suggested) has any significant advantages over the BS4 module? What I heard is that BS4 is pretty popular for data extraction from HTML files. I simply need a fast way to extract information from my HTML files; I don't have any preferred module, BS4 has just been the first module I wrote my first web-crawler code with. – Florian Schramm, May 17, 2016 at 13:55
- @FlorianSchramm with open(...) as f: f.write(...); like I said, BS4 will construct a full representation of the HTML DOM in memory - the benefit of html.parser would be not to do that, which should be considerably faster. – ferada, May 17, 2016 at 14:05