I wrote this simple HTML parser in Ruby.
Basically I have a folder that has files. Every file has crawled web pages. The only addition within these files is before the HTML content there is a page metadata, #URL
is one of them.
What I am aiming is to have two files:
1)Will contain: currentURL, destinationURL, description
2)Will contain: currentURL, destinationURL, word, count
So if I did that for this current page and assuming that href that I am looking at is:
<a href="http://stackoverflow.com" title="professional and enthusiast programmers">Stack Overflow</a>
1st file would have
codereview,stackexchange.com, http://stackoverflow.com, Stack Overflow
2nd file would have
codereview,stackexchange.com, http://stackoverflow.com, Stack, 1
codereview,stackexchange.com, http://stackoverflow.com, Overflow, 2
However my code works really slow. It has been longer than an hour and still parsing a file with size of 31MB.
My code is:
beginning_time = Time.now
folder = "allParts/"
extension = "*.dat"
urlMet = "#URL:"
newLine = "\n"
files = Dir[folder+extension]
hrefTag = "<a href=\""
qtMark = "\""
descStart = "\">"
hrefEnd = "</a>"
firstOutput = "firstOut.txt"
secondOutput = "secondOut.txt"
File.open(firstOutput, 'w') {}
File.open(secondOutput, 'w') {}
countFile = 1
for file in files do
puts file + "#: "+countFile.to_s
countFile++
File.open(file, "r") do |f|
source = ""
f.each_line do |line|
dest = ""
desc = ""
if line.include? urlMet
source = line[/#{urlMet}(.*?)#{newLine}/m, 1]
end
if line.include? hrefTag
dest = line[/#{hrefTag}(.*?)#{qtMark}/m, 1]
descStIn = line.rindex(descStart)
descEndIn = line.rindex(hrefEnd)
desc = line[descStIn+2..descEndIn-1]
end
if (source != "" && dest != "")
occur = Hash.new(0)
mainEntry = "original-url=\"" + source +
"\", dest-url=\"" + dest + "\""
descEntry = ""
if (desc != nil && desc != "")
descEntry = ", desc=\"" + desc + "\""
words = desc.split(' ')
words.each { |word| occur[word] += 1 }
end
firstEntry = mainEntry+descEntry+"\n\n"
File.open(firstOutput, 'a') { |file|
file.write(firstEntry)
}
occur.each { |word, occurrance|
wordEntry = ", word=\"" + word +
"\", count=" + occurrance.to_s
secondEntry = mainEntry+wordEntry+"\n\n"
File.open(secondOutput, 'a') { |file|
file.write(secondEntry)
}
}
end
end
end
end
end_time = Time.now
puts "Time elapsed #{(end_time - beginning_time)} seconds"
1 Answer 1
You are trying to parse HTML with regexes, which is widely considered to be a bad idea. For example, all these tags would have equivalent link text and target URL:
<!-- "normal" -->
<a href="http://example.com">Link Text</a>
<!-- more than one space before the attribute -->
<a href="http://example.com">Link Text</a>
<!-- tags split across multiple lines -->
<a
href="http://example.com">
Link Text
</a>
<!-- universal attribute "id" in front of the "href" -->
<a id="foo" href="http://example.com">Link Text</a>
<!-- different quote character -->
<a href='http://example.com'>Link Text</a>
A line-based parser will fail here, and a regex-based parser will be very difficult to write correctly. So please use an existing HTML parser. The result will be fast and elegant, but as I don't use Ruby I can't give examples.
Let's move on to coding style. You are using multiple naming styles: beginning_time
and newLine
. Whatever you do, be consistent, although I'd recommend snake_case
for variable names.
Some of your variable names contain abbreviations where none are necessary: quote_character
is probably better than qtMark
, descStart
would probably be better described as description_start
.
You used many variables to make your regexes appear more modular, but they are not. Your regexes will become more readable if you write them down directly rather than assembling them. You could also specify them outside of your loops:
href_regex = /href="([^"]+)"/
Points to note here:
[^"]+
is generally preferable over.*?
- The
/m
option was highly unnecessary, because you process your input line by line. So there is no way that your.
could ever match a newline in your original regex.
In your regexes, you used #{...}
interpolation a lot. Strangely you didn't do any of that when assembling other strings, instead using less readable concatenation. E.g. we could rewrite
mainEntry = "original-url=\"" + source +
"\", dest-url=\"" + dest + "\""
as
main_entry = "original-url=\"#{source}\", dest-url=\"#{dest}\""
This is still pretty unreadable because of those pesky escapes – using the alternate %Q()
literal syntax we could simplify this to:
main_entry = %Q(original-url="#{source}", dest-url="#{dest}")
... and ditto in the similar cases.
A snippet like
words = desc.split(' ')
words.each { |word| occur[word] += 1 }
could be improved in two ways:
- get rid of the unnecessary variables. The code is already sufficiently self-documenting with the
word
parameter to the block - rename
occur
to something sensible likeword_count
You declare your source
variable outside of the each_line
loop. It would be better to declare your variables as close to their point of usage as possible. Furthermore, you initialize them with the empty string. I (as a Ruby newbie) would probably initialize them to nil
instead. This has the further advantage that we can remove those line.include? "some string"
tests which do not even guarantee that the regex will match. The String#[Regex, capture]
method returns either the string matched by that capture group, or nil
– which is exactly what we want.
Don't re-open a file all the time inside a tight loop, this is quite expensive. Open the file once, write to it multiple times. This will probably remove your main performance bottleneck.
If we stuff all this advice into your code, we might end up with the following refactoring:
start_time = Time.now
file_count = 1
File.open("firstOut.txt", 'w') do |out1|
File.open("secondOut.txt", 'w') do |out2|
for file_name in Dir["allParts/*.dat"] do
puts "#{file_name}#:#{file_count}"
file_count++
File.open(file_name, "r") do |file|
file.each_line do |line|
source = line[/#URL:(.*?)\n/, 1]
destination = line[/<a href="([^"]+)"/, 1]
if (source != nil && destination != nil)
word_count = Hash.new(0)
main_entry = %Q(original-url="#{source}", dest-url="#{dest}")
description_start = line.rindex('">') + 2
description_end = line.rindex('</a>') - 1
description = line[description_start .. description_end]
if (description != nil && description != "")
out1.write(%Q(#{main_entry}, desc="#{desc}"\n\n))
description.split(' ').each do |word|
word_count[word] += 1
end
else
out1.write(%Q(#{main_entry}\n\n))
end
word_count.each do |word, count|
out2.write(%Q(#{main_entry}, word="#{word}", count=#{count}\n\n))
end
end
end
end
end
end
end
puts "Time elapsed #{Time.now - start_time} seconds"
Note that this can be improved further – the high level of indentation suggests we should extract some parts as separate functions. I further want to reiterate that I'm not a proficient Rubyist, so there might be some egregious oversights in the above code.
-
\$\begingroup\$ I am new to Ruby. Thanks for pointing out many things. Most of my code(as you might have guessed from having different naming conventions) is copied and pasted from another source. Thank you very much for writing so detailed answer. I have a question, I tried to google HTML parser but all I can find is 3rd party libraries. Are you telling me to use in-built library that comes with Ruby? If so what is the library name/ \$\endgroup\$Sarp Kaya– Sarp Kaya2014年03月03日 13:12:46 +00:00Commented Mar 3, 2014 at 13:12
-
1\$\begingroup\$ By the way about this part "You declare your source variable outside of the each_line loop. It would be better to declare your variables as close to their point of usage as possible." The reason why source was declared outside of the scope is because at the beginning of the file you get the source, declaring it right next to where it was used would only return null. \$\endgroup\$Sarp Kaya– Sarp Kaya2014年03月03日 14:52:08 +00:00Commented Mar 3, 2014 at 14:52