Ruby html href parser working slow

Question 1

I wrote this simple HTML parser in Ruby.

Basically I have a folder that has files. Every file has crawled web pages. The only addition within these files is before the HTML content there is a page metadata, #URL is one of them.

What I am aiming is to have two files:

1)Will contain: currentURL, destinationURL, description

2)Will contain: currentURL, destinationURL, word, count

So if I did that for this current page and assuming that href that I am looking at is:

<a href="http://stackoverflow.com" title="professional and enthusiast programmers">Stack Overflow</a>

1st file would have

codereview,stackexchange.com, http://stackoverflow.com, Stack Overflow

2nd file would have

codereview,stackexchange.com, http://stackoverflow.com, Stack, 1

codereview,stackexchange.com, http://stackoverflow.com, Overflow, 2

However my code works really slow. It has been longer than an hour and still parsing a file with size of 31MB.

My code is:

beginning_time = Time.now
folder = "allParts/"
extension = "*.dat"
urlMet = "#URL:"
newLine = "\n"
files = Dir[folder+extension]
hrefTag = "<a href=\""
qtMark = "\""
descStart = "\">"
hrefEnd = "</a>"
firstOutput = "firstOut.txt"
secondOutput = "secondOut.txt"
File.open(firstOutput, 'w') {}
File.open(secondOutput, 'w') {}
countFile = 1
for file in files do
 puts file + "#: "+countFile.to_s
 countFile++
 File.open(file, "r") do |f|
 source = ""
 f.each_line do |line|
 dest = ""
 desc = ""
 if line.include? urlMet
 source = line[/#{urlMet}(.*?)#{newLine}/m, 1]
 end
 if line.include? hrefTag 
 dest = line[/#{hrefTag}(.*?)#{qtMark}/m, 1]
 descStIn = line.rindex(descStart)
 descEndIn = line.rindex(hrefEnd)
 desc = line[descStIn+2..descEndIn-1]
 end
 if (source != "" && dest != "")
 occur = Hash.new(0)
 mainEntry = "original-url=\"" + source + 
 "\", dest-url=\"" + dest + "\"" 
 descEntry = ""
 if (desc != nil && desc != "")
 descEntry = ", desc=\"" + desc + "\""
 words = desc.split(' ')
 words.each { |word| occur[word] += 1 }
 end
 firstEntry = mainEntry+descEntry+"\n\n"
 File.open(firstOutput, 'a') { |file| 
 file.write(firstEntry) 
 }
 occur.each { |word, occurrance| 
 wordEntry = ", word=\"" + word +
 "\", count=" + occurrance.to_s
 secondEntry = mainEntry+wordEntry+"\n\n"
 File.open(secondOutput, 'a') { |file| 
 file.write(secondEntry) 
 }
 }
 end
 end
 end
end
end_time = Time.now
puts "Time elapsed #{(end_time - beginning_time)} seconds"

Question 2

You are trying to parse HTML with regexes, which is widely considered to be a bad idea. For example, all these tags would have equivalent link text and target URL:

<!-- "normal" -->
<a href="http://example.com">Link Text</a>
<!-- more than one space before the attribute -->
<a href="http://example.com">Link Text</a>
<!-- tags split across multiple lines -->
<a
 href="http://example.com">
 Link Text
</a>
<!-- universal attribute "id" in front of the "href" -->
<a id="foo" href="http://example.com">Link Text</a>
<!-- different quote character -->
<a href='http://example.com'>Link Text</a>

A line-based parser will fail here, and a regex-based parser will be very difficult to write correctly. So please use an existing HTML parser. The result will be fast and elegant, but as I don't use Ruby I can't give examples.

Let's move on to coding style. You are using multiple naming styles: beginning_time and newLine. Whatever you do, be consistent, although I'd recommend snake_case for variable names.

Some of your variable names contain abbreviations where none are necessary: quote_character is probably better than qtMark, descStart would probably be better described as description_start.

You used many variables to make your regexes appear more modular, but they are not. Your regexes will become more readable if you write them down directly rather than assembling them. You could also specify them outside of your loops:

href_regex = /href="([^"]+)"/

Points to note here:

[^"]+ is generally preferable over .*?
The /m option was highly unnecessary, because you process your input line by line. So there is no way that your . could ever match a newline in your original regex.

In your regexes, you used #{...} interpolation a lot. Strangely you didn't do any of that when assembling other strings, instead using less readable concatenation. E.g. we could rewrite

mainEntry = "original-url=\"" + source + 
"\", dest-url=\"" + dest + "\""

as

main_entry = "original-url=\"#{source}\", dest-url=\"#{dest}\""

This is still pretty unreadable because of those pesky escapes – using the alternate %Q() literal syntax we could simplify this to:

main_entry = %Q(original-url="#{source}", dest-url="#{dest}")

... and ditto in the similar cases.

A snippet like

words = desc.split(' ')
words.each { |word| occur[word] += 1 }

could be improved in two ways:

get rid of the unnecessary variables. The code is already sufficiently self-documenting with the word parameter to the block
rename occur to something sensible like word_count

You declare your source variable outside of the each_line loop. It would be better to declare your variables as close to their point of usage as possible. Furthermore, you initialize them with the empty string. I (as a Ruby newbie) would probably initialize them to nil instead. This has the further advantage that we can remove those line.include? "some string" tests which do not even guarantee that the regex will match. The String#[Regex, capture] method returns either the string matched by that capture group, or nil – which is exactly what we want.

Don't re-open a file all the time inside a tight loop, this is quite expensive. Open the file once, write to it multiple times. This will probably remove your main performance bottleneck.

If we stuff all this advice into your code, we might end up with the following refactoring:

start_time = Time.now
file_count = 1
File.open("firstOut.txt", 'w') do |out1|
 File.open("secondOut.txt", 'w') do |out2|
 for file_name in Dir["allParts/*.dat"] do
 puts "#{file_name}#:#{file_count}"
 file_count++
 File.open(file_name, "r") do |file|
 file.each_line do |line|
 source = line[/#URL:(.*?)\n/, 1]
 destination = line[/<a href="([^"]+)"/, 1]
 if (source != nil && destination != nil)
 word_count = Hash.new(0)
 main_entry = %Q(original-url="#{source}", dest-url="#{dest}")
 description_start = line.rindex('">') + 2
 description_end = line.rindex('</a>') - 1
 description = line[description_start .. description_end]
 if (description != nil && description != "")
 out1.write(%Q(#{main_entry}, desc="#{desc}"\n\n)) 
 description.split(' ').each do |word|
 word_count[word] += 1
 end
 else
 out1.write(%Q(#{main_entry}\n\n)) 
 end
 word_count.each do |word, count|
 out2.write(%Q(#{main_entry}, word="#{word}", count=#{count}\n\n)) 
 end
 end
 end
 end
 end
 end
end
puts "Time elapsed #{Time.now - start_time} seconds"

Note that this can be improved further – the high level of indentation suggests we should extract some parts as separate functions. I further want to reiterate that I'm not a proficient Rubyist, so there might be some egregious oversights in the above code.

Question 3

I am new to Ruby. Thanks for pointing out many things. Most of my code(as you might have guessed from having different naming conventions) is copied and pasted from another source. Thank you very much for writing so detailed answer. I have a question, I tried to google HTML parser but all I can find is 3rd party libraries. Are you telling me to use in-built library that comes with Ruby? If so what is the library name/

Question 4

By the way about this part "You declare your source variable outside of the each_line loop. It would be better to declare your variables as close to their point of usage as possible." The reason why source was declared outside of the scope is because at the beginning of the file you get the source, declaring it right next to where it was used would only return null.

amon amon 12.7k37 silver badges67 bronze badges · Accepted Answer · 2014-03-03 12:59:53Z

You are trying to parse HTML with regexes, which is widely considered to be a bad idea. For example, all these tags would have equivalent link text and target URL:

<!-- "normal" -->
<a href="http://example.com">Link Text</a>
<!-- more than one space before the attribute -->
<a href="http://example.com">Link Text</a>
<!-- tags split across multiple lines -->
<a
 href="http://example.com">
 Link Text
</a>
<!-- universal attribute "id" in front of the "href" -->
<a id="foo" href="http://example.com">Link Text</a>
<!-- different quote character -->
<a href='http://example.com'>Link Text</a>

A line-based parser will fail here, and a regex-based parser will be very difficult to write correctly. So please use an existing HTML parser. The result will be fast and elegant, but as I don't use Ruby I can't give examples.

Let's move on to coding style. You are using multiple naming styles: beginning_time and newLine. Whatever you do, be consistent, although I'd recommend snake_case for variable names.

Some of your variable names contain abbreviations where none are necessary: quote_character is probably better than qtMark, descStart would probably be better described as description_start.

You used many variables to make your regexes appear more modular, but they are not. Your regexes will become more readable if you write them down directly rather than assembling them. You could also specify them outside of your loops:

href_regex = /href="([^"]+)"/

Points to note here:

[^"]+ is generally preferable over .*?
The /m option was highly unnecessary, because you process your input line by line. So there is no way that your . could ever match a newline in your original regex.

In your regexes, you used #{...} interpolation a lot. Strangely you didn't do any of that when assembling other strings, instead using less readable concatenation. E.g. we could rewrite

mainEntry = "original-url=\"" + source + 
"\", dest-url=\"" + dest + "\""

as

main_entry = "original-url=\"#{source}\", dest-url=\"#{dest}\""

This is still pretty unreadable because of those pesky escapes – using the alternate %Q() literal syntax we could simplify this to:

main_entry = %Q(original-url="#{source}", dest-url="#{dest}")

... and ditto in the similar cases.

A snippet like

words = desc.split(' ')
words.each { |word| occur[word] += 1 }

could be improved in two ways:

get rid of the unnecessary variables. The code is already sufficiently self-documenting with the word parameter to the block
rename occur to something sensible like word_count

You declare your source variable outside of the each_line loop. It would be better to declare your variables as close to their point of usage as possible. Furthermore, you initialize them with the empty string. I (as a Ruby newbie) would probably initialize them to nil instead. This has the further advantage that we can remove those line.include? "some string" tests which do not even guarantee that the regex will match. The String#[Regex, capture] method returns either the string matched by that capture group, or nil – which is exactly what we want.

Don't re-open a file all the time inside a tight loop, this is quite expensive. Open the file once, write to it multiple times. This will probably remove your main performance bottleneck.

If we stuff all this advice into your code, we might end up with the following refactoring:

start_time = Time.now
file_count = 1
File.open("firstOut.txt", 'w') do |out1|
 File.open("secondOut.txt", 'w') do |out2|
 for file_name in Dir["allParts/*.dat"] do
 puts "#{file_name}#:#{file_count}"
 file_count++
 File.open(file_name, "r") do |file|
 file.each_line do |line|
 source = line[/#URL:(.*?)\n/, 1]
 destination = line[/<a href="([^"]+)"/, 1]
 if (source != nil && destination != nil)
 word_count = Hash.new(0)
 main_entry = %Q(original-url="#{source}", dest-url="#{dest}")
 description_start = line.rindex('">') + 2
 description_end = line.rindex('</a>') - 1
 description = line[description_start .. description_end]
 if (description != nil && description != "")
 out1.write(%Q(#{main_entry}, desc="#{desc}"\n\n)) 
 description.split(' ').each do |word|
 word_count[word] += 1
 end
 else
 out1.write(%Q(#{main_entry}\n\n)) 
 end
 word_count.each do |word, count|
 out2.write(%Q(#{main_entry}, word="#{word}", count=#{count}\n\n)) 
 end
 end
 end
 end
 end
 end
end
puts "Time elapsed #{Time.now - start_time} seconds"

Note that this can be improved further – the high level of indentation suggests we should extract some parts as separate functions. I further want to reiterate that I'm not a proficient Rubyist, so there might be some egregious oversights in the above code.

I am new to Ruby. Thanks for pointing out many things. Most of my code(as you might have guessed from having different naming conventions) is copied and pasted from another source. Thank you very much for writing so detailed answer. I have a question, I tried to google HTML parser but all I can find is 3rd party libraries. Are you telling me to use in-built library that comes with Ruby? If so what is the library name/
By the way about this part "You declare your source variable outside of the each_line loop. It would be better to declare your variables as close to their point of usage as possible." The reason why source was declared outside of the scope is because at the beginning of the file you get the source, declaring it right next to where it was used would only return null.

Stack Exchange Network

Ruby html href parser working slow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Ruby html href parser working slow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions