I had a task for a programmer position which I failed. I am a newbie programmer and I accept that. The only problem is that the employer never told me what the actual problem with the code was. So maybe the community can give me a few hints on how to improve it.
The task was to write code that would parse a given webpage, fetch all the images, and save them to a given directory. The webpage address and the directory are command-line parameters. Performance was a critical issue for this task.
require 'open-uri'
require 'nokogiri'

class Grab
  def runner(url, path)
    threads = []
    doc = Nokogiri::HTML(open("http://#{url}"))
    img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
    img_srcs = rel_to_abs(url, img_srcs)
    img_srcs.each do |img_src|
      threads << Thread.new(img_src) do
        name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
        image = fetch img_src
        save(image, name, path)
      end
    end
    threads.each{ |thread| thread.join }
  end

  def fetch(img_src)
    puts "Fetching #{img_src}\n"
    image = open(img_src)
  end

  def save(image, name, path)
    File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
  end

  def rel_to_abs(url, img_srcs)
    img_srcs.each_with_index do |img_src, index|
      img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
    end
    img_srcs
  end
end
- Most "newbie programmers" wouldn't be able to write the code that you did. You probably don't want to work for these people anyway! (Unless the job was something other than entry-level...) – Pat, Jun 21, 2012
- Did the interviewer ask you to write code to accomplish the task, or simply get the task done? You could, for example, use wget to get it done with one line of shell code. – 200_success, Mar 30, 2014
- This is awesome, tell those people to shove it. – Bam, Jan 7, 2016
5 Answers
A few minor issues:

It would be cool if you supported a URL parameter that already includes the protocol (HTTP/HTTPS):

doc = Nokogiri::HTML(open("http://#{url}"))

For URL manipulations you can use standard tools - URI::HTTP/URI::HTTPS - instead of regexps:

name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
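For example, a minimal sketch of both points using only the standard library (absolute_image_url is a hypothetical helper, not from the original code):

require 'uri'

# Normalize a user-supplied address and resolve a (possibly relative)
# img src against it, with no regexps involved.
def absolute_image_url(page_url, img_src)
  page_url = "http://#{page_url}" unless page_url.start_with?("http://", "https://")
  URI.join(page_url, img_src).to_s
end

absolute_image_url("example.com", "/images/logo.png")
# => "http://example.com/images/logo.png"
absolute_image_url("example.com", "https://cdn.example.com/x.png")
# => "https://cdn.example.com/x.png"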
This is a slightly insecure operation; you should escape name:
File.open("#{path}/#{name}", "wb")
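One simple mitigation, as a sketch (File.basename is one option; proper escaping or sanitising may still be needed on top of it):

require 'uri'

# Reduce the name to its base name so a crafted src such as
# "../../etc/passwd" cannot escape the target directory.
name = File.basename(URI(img_src).path)
File.open(File.join(path, name), "wb") { |file| file.write(image.read) }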
You ignore links with the HTTPS protocol:
img_src.match(/http:\/\//)
This is not the Ruby way:

img_srcs.each_with_index do |img_src, index|
  img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
end
img_srcs

A little simpler:

img_srcs.map do |src|
  src.match(/http:\/\//) ? "http://#{url}/#{src}" : src
end
You load the page inefficiently:
image = open(img_src)
So a temp file is created for every image, which may not be efficient:

open("http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby").class
# => Tempfile
- I wouldn't call point 3 a minor security issue. It's an arbitrary file overwrite condition. – Polynomial, Feb 3, 2014
IMO it's just rude that you took the time to write code and got no feedback at all. Your code is not bad, but there's always room for improvement. Hints:
- make it more functional,
- use specific libraries for the task,
- return something useful for the main method (the saved file paths?),
- use the URI module to work with URLs.
Anyway, given enough time and access to documentation, this is what I would have written:
require 'nokogiri'
require 'typhoeus'

class Spider
  def self.download_images(url, destination_path, options = {})
    base_url = URI.join(url, "/").to_s
    body = Typhoeus::Request.get(url).body
    imgs = Nokogiri::HTML(body).css("img")
    image_srcs = imgs.map { |img| img["src"] }.compact.uniq

    hydra = Typhoeus::Hydra.new(:max_concurrency => options[:max_concurrency] || 50)
    image_paths = image_srcs.map do |image_src|
      image_url = URI.join(base_url, image_src).to_s
      path = File.join(destination_path, File.basename(image_url))
      request = Typhoeus::Request.new(image_url)
      request.on_complete { |response| File.write(path, response.body) }
      hydra.queue(request)
      path
    end
    hydra.run
    image_paths
  end
end
Note: images with the same file name will be overwritten; handling that has been left out for simplicity. You could just use group_by to add "-n" suffixes, as sketched below.
I'm not familiar with Ruby, so just a hint: if I'm right, the code creates one thread for every img tag, which could be too many. For example, the source of yahoo.com contains more than a hundred img tags, which means 100+ threads and 100+ connections to the same host. You should use a thread pool with a queue that limits the number of parallel threads and parallel connections.

Furthermore, you should use keep-alive connections to download more than one image over a single TCP connection. (I don't know whether Ruby supports this, nor whether the code above uses it.)
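As a rough illustration of both suggestions, a sketch in Ruby (plain-HTTP only; img_srcs and path are the variables from the original code, and the pool size is arbitrary):

require 'net/http'

POOL_SIZE = 8  # arbitrary; tune to the host and your bandwidth

queue = Queue.new
img_srcs.each { |src| queue << src }
POOL_SIZE.times { queue << :done }  # poison pills to stop the workers

workers = POOL_SIZE.times.map do
  Thread.new do
    connections = {}  # host => open (keep-alive) Net::HTTP connection
    while (src = queue.pop) != :done
      uri = URI(src)
      # Net::HTTP reuses the TCP connection for every request made on the
      # same started instance, which gives us keep-alive for free.
      http = connections[uri.host] ||= Net::HTTP.start(uri.host, uri.port)
      body = http.get(uri.request_uri).body
      File.binwrite(File.join(path, File.basename(uri.path)), body)
    end
    connections.each_value(&:finish)
  end
end
workers.each(&:join)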
- Just a quick note: the Ruby interpreter (at least the reference implementation) does not use system threads but so-called "green threads". I think it's not really an issue to create a hundred threads. Your point holds for the HTTP requests, though. – barjak, Oct 17, 2012
Bit rude of them to leave you hanging.
Anyway, my 2c would be that the code is actually quite alright. But I'd argue it's a basic web crawler, and they've outlined that performance is a key goal, so half of the problem to solve would be improving cache performance, i.e. detecting cache hits between runs (which also reduces the effort of resuming from a crashed run). This could be as simple as organising downloads correctly and checking for the existence of files (although you would need to be wary of only doing atomic writes for that to work).
Overall, there could be a plethora of reasons not to hire; it's not always the code, so don't take it too hard.
Either way, if you are mainly concerned with feedback on the code, the main issues that stand out to me are:
No error handling
I'd wager that if you failed based on the code and not on flaws in the solution, it was probably down to the lack of error handling, or (as below) not providing a comment about how x, y, z could fail. In this situation you're dealing with threading, network and file IO, each of which is potential dynamite and should be well error-handled if the program is to be performant. (E.g. if a thread throws an exception, you could lose all your work and have to start the crawl from the beginning again.) Of course, depending on the interview process, you may have been quizzed on such things, so it could be irrelevant.
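For instance, a sketch of what that could look like around the thread body (reusing fetch, save and path from the original code):

threads << Thread.new(img_src) do |src|
  begin
    name = src.match(/^http:\/\/.*\/(.*)$/)[1]
    save(fetch(src), name, path)
  rescue => e
    # A failed match, HTTP error or IO error kills only this download,
    # not the whole crawl; report it instead of dying silently.
    warn "Failed to fetch #{src}: #{e.message}"
  end
end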
Lack of any commenting
Depending on how harshly you think they were judging you and the time constraints of the exercise (written in the interview vs. done at home and sent in), they may have expected you to provide commentary in your code. From what I've seen, though, it's probably not the biggest issue in interviews.
- I've worked at places that considered comments a code smell. It was really painful to get comments past code review, so I stopped writing them. Unfortunately, this habit stuck :( – user341493, Mar 22, 2014
I am not a Ruby developer either, but here are a few questions I have after reading your code:
require 'open-uri'
require 'nokogiri'

class Grab
  def runner(url, path)
    threads = []
    doc = Nokogiri::HTML(open("http://#{url}"))
I assume you are getting an array here:
    img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
The rel_to_abs method is odd. I think what you really need is a map (explained why at that method below):
    img_srcs = rel_to_abs(url, img_srcs)
Why did you choose to do rel_to_abs outside, but the extraction of the name inside the thread?
    img_srcs.each do |img_src|
      threads << Thread.new(img_src) do
What happens when the name extraction fails?
Also, you assume that the extracted name is a valid file name. Is that always the case? (What happens if it is a query?) And I would much prefer a smaller match like .*\/([^\/]+)$ over the one used here:
        name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
Your fetch implementation feels like it is only partly done. I would prefer fetch to either read the image itself, or to move the opening of the URI completely into save. Right now, the resource (an open connection) is acquired in one method and used in another, while you could have done both in one go and avoided that. (I suppose there is nothing wrong with it; it just feels wrong.)
        image = fetch img_src
        save(image, name, path)
If it were up to me, I would move the entire name processing and saving into a save_to_disk(url, img_src, path) and do all the processing there. It would also have the advantage of not having to change a relative mypath to the http://domain/mypath format and then extracting the mypath again. (A sketch of what I mean follows the code below.)
      end
    end
    threads.each{ |thread| thread.join }
  end
  def fetch(img_src)
    puts "Fetching #{img_src}\n"
Here, I would much rather see open(src).read so that the open connection is not exposed outside:
    image = open(img_src)
  end
  def save(image, name, path)
    File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
  end
This particular method feels rather odd. All you need is a map that checks whether img_src matches http:// and, if not, returns an updated version. Here you are mutating a data structure within an each loop (it works, I agree) when you have no reason to do that:
  def rel_to_abs(url, img_srcs)
    img_srcs.map do |img_src|
      case img_src
      when /http:\/\// then img_src
      else "http://#{url}/#{img_src}"
      end
    end
  end
end
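For what it's worth, a minimal sketch of the save_to_disk idea from above (the name and signature are my guess; it keeps open-uri, as in the original):

require 'open-uri'

def save_to_disk(url, img_src, path)
  # Resolve relative srcs against the page; absolute srcs pass through as-is.
  image_url = URI.join("http://#{url}/", img_src).to_s
  name = File.basename(URI(image_url).path)
  File.open(File.join(path, name), "wb") do |file|
    file.write(open(image_url).read)
  end
end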
And is there any reason you preferred to do the invocation via Grab.runner rather than Grab.initialize? As it stands, you have to write (Grab.new).runner(url, path) instead of Grab.new(url, path).
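Something like this minimal sketch:

class Grab
  def initialize(url, path)
    runner(url, path)  # kick off the crawl straight from the constructor
  end
end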
But these are all minor. I wouldn't reject someone based on these. So I don't know.