I am working with this code to parse HTML files stored on my computer and extract the text of elements matching a certain tag:
from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with stdout2file("auditfeesexpenses.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())
                        break

trade_spider()
The code works perfectly, thanks to the help of the Stack Overflow community! As I am not an expert in Python, I am wondering whether some of you know tricks to speed up my code and reduce processing time, as it has to parse through ~4 million files.
In a nutshell, what my code does: open an output text file -> parse through all HTML documents in a set directory -> if the regex is found, print the result to the open text file -> break (no more than one match per file) and continue with the next file.
I am open to any suggestions on improving this code.
Update:
Further explanation: basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document, and if this attribute is found, I want the name of the file, the name attribute, and the corresponding HTML text printed into a separate text file.
An example string that I extracted from a single HTML file is:
<span class="fontid4"><span style="display: inline-block; white-space: nowrap; margin-right: 0px;"><span style="display: inline-block; width: 3.31pt; background-color: #ffffff;"><span style="display: inline-block; min-height: 1em;"></span></span><span style="width: 59.57pt; display: inline-block; text-align: right;"><ix:nonFraction name="f:AuditFeesExpenses" contextRef="c201" unitRef="u5" decimals="0" format="ixt:numcommadot">8,930</ix:nonFraction></span><span style="display: inline-block; width: 3.36pt; background-color: #ffffff;"><span style="display: inline-block; min-height: 1em;"></span></span></span></span>
2 Answers
I don't know if this would be significant, but a first suggestion would be to replace the relatively costly re operation with the basic string method item['name'].endswith("AuditFeesExpenses").
Another possible suggestion, based on @Dex'ter's comment, would be to change the stdout redirection into a regular .write() on the output file, as sketched below.
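A rough sketch of how both suggestions could fit into the original loop (the behaviour is meant to stay the same, but treat it as an outline rather than a tested drop-in replacement):

from bs4 import BeautifulSoup
import glob
import os


def trade_spider():
    os.chdir(r"C:\Users\Independent Auditors Report")
    with open("auditfeesexpenses.txt", "w") as out:
        for file in glob.iglob("**/*.html", recursive=True):
            with open(file, encoding="utf8") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                # Plain suffix check instead of the regex.
                if item["name"].endswith("AuditFeesExpenses"):
                    # Write directly to the already-open output file
                    # instead of redirecting stdout.
                    out.write("{}| {}| {}\n".format(
                        os.path.basename(file), item["name"], item.get_text()))
                    break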
But what I'd really recommend is to profile the script to figure out the hot spots. I suspect that the bottleneck is within BeautifulSoup, and if that's the case (given that you're only searching for a substring and not parsing a full document), perhaps you could find an alternative search method.
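For example, with the standard cProfile and pstats modules, run over a small sample directory first (the file name spider.prof is arbitrary):

import cProfile
import pstats

# Profile one full run and dump the statistics to a file ...
cProfile.run("trade_spider()", "spider.prof")
# ... then show the 20 most expensive calls by cumulative time.
pstats.Stats("spider.prof").sort_stats("cumulative").print_stats(20)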
- You are certainly right that the re operation could be a part that can be improved. I'll check on that! I know that my redirection with stdout could be another issue. However, I don't know how to get my output automatically into a text file while processing all the files. What would be a suggestion for .write()? I'll check on profiling my script later on and see where the hotspots are. – Florian Schramm, May 17, 2016 at 6:33
Apart from profiling, which is certainly a good idea to get a better understanding of the bottlenecks, I'd also recommend looking into streaming parsing instead of reading all the files completely and building a full DOM every single time. The other thing would be to process more than one file at a time using the multiprocessing module (with multiple processes, not threads, so you don't run into problems with the GIL); see the sketch below. A similar result could probably be achieved with xargs handing multiple input files to several script instances.
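A minimal sketch of the multiprocessing idea, assuming the per-file work is factored into a process_file() helper (a name made up here) that returns a formatted result line or None. Only the parent process writes to the output file, so no locking is needed:

import glob
import multiprocessing
import os

from bs4 import BeautifulSoup


def process_file(path):
    """Parse one file; return a result line, or None if nothing matched."""
    with open(path, encoding="utf8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for item in soup.findAll("ix:nonfraction"):
        if item["name"].endswith("AuditFeesExpenses"):
            return "{}| {}| {}".format(
                os.path.basename(path), item["name"], item.get_text())
    return None


def main():
    files = glob.iglob("**/*.html", recursive=True)
    # Worker processes do the parsing; only the parent writes results.
    with multiprocessing.Pool() as pool, \
            open("auditfeesexpenses.txt", "w") as out:
        for line in pool.imap_unordered(process_file, files, chunksize=100):
            if line is not None:
                out.write(line + "\n")


if __name__ == "__main__":
    main()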
For Python 3 there's html.parser; maybe take a look at that, e.g. something like this:
from html.parser import HTMLParser

BUFFER_SIZE = 4096


def valid_tag(tag, attrs):
    # True for an ix:nonfraction tag whose name attribute has the right suffix.
    if tag == "ix:nonfraction":
        for name, value in attrs:
            if name == "name" and value.endswith("AuditFeesExpenses"):
                return True


class MyMatcher(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record_data = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if valid_tag(tag, attrs):
            self.record_data = True
            self.data = []

    def handle_endtag(self, tag):
        if tag == "ix:nonfraction":
            self.record_data = False
            print("".join(self.data))

    def handle_data(self, data):
        # Collect text only while inside a matching tag.
        if self.record_data:
            self.data.append(data)


def trade_spider():
    matcher = MyMatcher()
    with open("foo.html", encoding="utf8") as f:
        matcher.reset()
        # Feed the file to the parser in chunks instead of reading it whole.
        chunk = f.read(BUFFER_SIZE)
        while chunk:
            matcher.feed(chunk)
            chunk = f.read(BUFFER_SIZE)
        matcher.close()


if __name__ == "__main__":
    trade_spider()
Note that it depends entirely on your HTML structure how complicated the parser instance will be - with multiple nested elements etc. you'd have to track the current nesting level to correctly collect and dump the text content; the example is quite limited in that respect. The main advantage is not reading the whole file into memory and not constructing a DOM in the first place.
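To run the matcher over the whole directory tree instead of the hard-coded foo.html, the original glob loop could be reused; a sketch along those lines (one matcher instance, reset before each file):

import glob


def trade_spider():
    matcher = MyMatcher()
    for path in glob.iglob("**/*.html", recursive=True):
        with open(path, encoding="utf8") as f:
            matcher.reset()  # reuse the same parser instance for each file
            chunk = f.read(BUFFER_SIZE)
            while chunk:
                matcher.feed(chunk)
                chunk = f.read(BUFFER_SIZE)
            matcher.close()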
Some more general remarks about the code as is:
- Preferably don't do the whole standard output redirection - just pass through the file object you want to write to, or redirect the standard output of the Python script in the shell.
- Use the __name__ variable to run your main function. That way the script could still be imported / loaded into a running Python instance without immediately executing it.
- Edited my OP. I know that my redirection will probably slow down the whole process, but I need a text file with my output in a certain format: Filename | Tagname | HTML Text; that's why I used stdout. Any idea how to do this more elegantly? I don't want to read each file into memory; my intention is to open each file, browse for this certain name attribute and, if it is found, print the corresponding HTML text (which is a number). I used Beautiful Soup as I had already used it for a web crawler I wrote, and I thought this would also work for locally stored HTML files. – Florian Schramm, May 17, 2016 at 6:51
- I'll check on your suggested code later on and give you some feedback on how it works. – Florian Schramm, May 17, 2016 at 6:51
- @FlorianSchramm I'm not saying that the redirection will cause a slowdown (pretty sure that's negligible, but feel free to measure it); I'm saying that it's cleaner to directly call output.write with the correct output file if you want to do that in Python. – ferada, May 17, 2016 at 7:21
- How would you integrate your output.write() approach into my code? I am not a big Python expert... Regarding the Beautiful Soup module: do you know whether an approach with html.parser (as you suggested) has any significant advantages over the BS4 module? What I heard is that BS4 is pretty popular for data extraction from HTML files. I simply need a fast way to extract information from my HTML files; I don't have any preferred module, BS4 has just been the first module I wrote my first web-crawler code with. – Florian Schramm, May 17, 2016 at 13:55
- @FlorianSchramm with open(...) as f: f.write(...); like I said, BS4 will construct a full representation of the HTML DOM in memory - the benefit of html.parser would be not to do that, which should be considerably faster. – ferada, May 17, 2016 at 14:05