I had a task for a programmer position which I failed. I am a newbie programmer and I accept that. The only problem is that the employer never told me what the actual problem with the code was. So maybe the community can give me a few hints on how to improve it.
The task was to write code that would parse a given webpage, fetch all the images, and save them to a given directory. The webpage address and the directory are command-line parameters. Performance was a critical issue for this task.
require 'open-uri'
require 'nokogiri'

class Grab
  def runner(url, path)
    threads = []
    doc = Nokogiri::HTML(open("http://#{url}"))
    img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
    img_srcs = rel_to_abs(url, img_srcs)
    img_srcs.each do |img_src|
      threads << Thread.new(img_src) do
        name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
        image = fetch img_src
        save(image, name, path)
      end
    end
    threads.each{ |thread| thread.join }
  end

  def fetch(img_src)
    puts "Fetching #{img_src}\n"
    image = open(img_src)
  end

  def save(image, name, path)
    File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
  end

  def rel_to_abs(url, img_srcs)
    img_srcs.each_with_index do |img_src, index|
      img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
    end
    img_srcs
  end
end
- Most "newbie programmers" wouldn't be able to write the code that you did. You probably don't want to work for these people anyway! (Unless the job was something other than entry-level...) – Pat, Jun 21, 2012
- Did the interviewer ask you to write code to accomplish the task, or simply get the task done? You could, for example, use wget to get it done with one line of shell code. – 200_success, Mar 30, 2014
- This is awesome, tell those people to shove it. – Bam, Jan 7, 2016
5 Answers
A few minor issues:

It would be cool if you supported a URL parameter that already includes the protocol (HTTP/HTTPS):

doc = Nokogiri::HTML(open("http://#{url}"))

For URL manipulations you can use standard tools - URI::HTTP/URI::HTTPS - instead of regexps:

name = img_src.match(/^http:\/\/.*\/(.*)$/)
img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
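For example, a minimal sketch of both points using only the standard library (absolute_image_url is a hypothetical helper, not from the original code):

require 'uri'

# Normalize a user-supplied address and resolve a (possibly relative)
# img src against it, with no regexps involved.
def absolute_image_url(page_url, img_src)
  page_url = "http://#{page_url}" unless page_url.start_with?("http://", "https://")
  URI.join(page_url, img_src).to_s
end

absolute_image_url("example.com", "/images/logo.png")
# => "http://example.com/images/logo.png"
absolute_image_url("example.com", "https://cdn.example.com/x.png")
# => "https://cdn.example.com/x.png"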
This is a slightly insecure operation; you should escape name:
File.open("#{path}/#{name}", "wb")
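One simple mitigation, as a sketch (File.basename is one option; proper escaping or sanitising may still be needed on top of it):

require 'uri'

# Reduce the name to its base name so a crafted src such as
# "../../etc/passwd" cannot escape the target directory.
name = File.basename(URI(img_src).path)
File.open(File.join(path, name), "wb") { |file| file.write(image.read) }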
You ignore links with the HTTPS protocol:
img_src.match(/http:\/\//)
This is not the Ruby way:

img_srcs.each_with_index do |img_src, index|
  img_srcs[index] = "http://#{url}/#{img_src}" unless img_src.match(/http:\/\//)
end
img_srcs

A little simpler:

img_srcs.map do |src|
  src.match(/http:\/\//) ? "http://#{url}/#{src}" : src
end
You load the page inefficiently:
image = open(img_src)
So a temp file is created for every image, which may not be efficient:

open("http://stackoverflow.com/questions/1878891/how-to-load-a-web-page-and-search-for-a-word-in-ruby").class
# => Tempfile
- I wouldn't call point 3 a minor security issue. It's an arbitrary file overwrite condition. – Polynomial, Feb 3, 2014
IMO it's just rude that you took the time to write code and got no feedback at all. Your code is not bad, but there's always room for improvement. Hints:
- make it more functional,
- use specific libraries for the task,
- return something useful for the main method (the saved file paths?),
- use the URI module to work with URLs.
Anyway, given enough time and access to documentation, this is what I would have written:
require 'nokogiri'
require 'typhoeus'

class Spider
  def self.download_images(url, destination_path, options = {})
    base_url = URI.join(url, "/").to_s
    body = Typhoeus::Request.get(url).body
    imgs = Nokogiri::HTML(body).css("img")
    image_srcs = imgs.map { |img| img["src"] }.compact.uniq

    hydra = Typhoeus::Hydra.new(:max_concurrency => options[:max_concurrency] || 50)
    image_paths = image_srcs.map do |image_src|
      image_url = URI.join(base_url, image_src).to_s
      path = File.join(destination_path, File.basename(image_url))
      request = Typhoeus::Request.new(image_url)
      request.on_complete { |response| File.write(path, response.body) }
      hydra.queue(request)
      path
    end
    hydra.run
    image_paths
  end
end
Note: images with the same file name will be overwritten; handling that has been left out for simplicity. You could just use group_by to add "-n" suffixes, as sketched below.
I'm not familiar with Ruby, so just a hint: if I'm right, the code creates one thread for every img tag, which could be too many. For example, the source of yahoo.com contains more than a hundred img tags, which means 100+ threads and 100+ connections to the same host. You should use a thread pool with a queue that limits the number of parallel threads and parallel connections.

Furthermore, you should use keep-alive connections to download more than one image over a single TCP connection. (I don't know whether Ruby supports this, nor whether the code above uses it.)
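As a rough illustration of both suggestions, a sketch in Ruby (plain-HTTP only; img_srcs and path are the variables from the original code, and the pool size is arbitrary):

require 'net/http'

POOL_SIZE = 8  # arbitrary; tune to the host and your bandwidth

queue = Queue.new
img_srcs.each { |src| queue << src }
POOL_SIZE.times { queue << :done }  # poison pills to stop the workers

workers = POOL_SIZE.times.map do
  Thread.new do
    connections = {}  # host => open (keep-alive) Net::HTTP connection
    while (src = queue.pop) != :done
      uri = URI(src)
      # Net::HTTP reuses the TCP connection for every request made on the
      # same started instance, which gives us keep-alive for free.
      http = connections[uri.host] ||= Net::HTTP.start(uri.host, uri.port)
      body = http.get(uri.request_uri).body
      File.binwrite(File.join(path, File.basename(uri.path)), body)
    end
    connections.each_value(&:finish)
  end
end
workers.each(&:join)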
- Just a quick note: the Ruby interpreter (at least the reference implementation) does not use system threads but so-called "green threads". I think it's not really an issue to create a hundred threads. Your point holds for the HTTP requests, though. – barjak, Oct 17, 2012
Bit rude of them to leave you hanging.
Anyway, my 2c would be that the code is actually quite alright. But I'd argue it's a basic web crawler, and they've outlined that performance is a key goal, so half of the problem to solve would be improving cache performance, i.e. detecting cache hits between runs (which also reduces the effort of resuming from a crashed run). This could be as simple as organising downloads correctly and checking for the existence of files (although you would need to be wary of only doing atomic writes for that to work).
Overall, there could be a plethora of reasons not to hire; it's not always the code, so don't take it too hard.
Either way, if you are mainly concerned with feedback on the code, the main issues that stand out to me are:
No error handling
I'd wager that if you failed based on the code and not on flaws in the solution, it was probably down to the lack of error handling, or (as below) not providing a comment about how x, y, z could fail. In this situation you're dealing with threading, network and file IO, each of which is potential dynamite and should be well error-handled if the program is to be performant. (E.g. if a thread throws an exception, you could lose all your work and have to start the crawl from the beginning again.) Of course, depending on the interview process, you may have been quizzed on such things, so it could be irrelevant.
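For instance, a sketch of what that could look like around the thread body (reusing fetch, save and path from the original code):

threads << Thread.new(img_src) do |src|
  begin
    name = src.match(/^http:\/\/.*\/(.*)$/)[1]
    save(fetch(src), name, path)
  rescue => e
    # A failed match, HTTP error or IO error kills only this download,
    # not the whole crawl; report it instead of dying silently.
    warn "Failed to fetch #{src}: #{e.message}"
  end
end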
Lack of any commenting
Depending on how harshly you think they were judging you and the time constraints of the exercise (written in the interview vs. done at home and sent in), they may have expected you to provide commentary in your code. From what I've seen, though, it's probably not the biggest issue in interviews.
- I've worked at places that considered comments a code smell. It was really painful to get comments past code review, so I stopped writing them. Unfortunately, this habit stuck :( – user341493, Mar 22, 2014
I am not a Ruby developer either, but here are a few questions I have after reading your code:
require 'open-uri'
require 'nokogiri'

class Grab
  def runner(url, path)
    threads = []
    doc = Nokogiri::HTML(open("http://#{url}"))
I assume you are getting an array here:
    img_srcs = doc.css('img').map{ |i| i['src'] }.uniq
The rel_to_abs method is odd. I think what you really need is a map (explained why at that method below):
    img_srcs = rel_to_abs(url, img_srcs)
Why did you choose to do rel_to_abs outside, but the extraction of the name inside the thread?
    img_srcs.each do |img_src|
      threads << Thread.new(img_src) do
What happens when the name extraction fails?
Also, you assume that the extracted name is a valid file name. Is that always the case? (What happens if it is a query?) And I would much prefer a smaller match like .*\/([^\/]+)$ over the one used here:
        name = img_src.match(/^http:\/\/.*\/(.*)$/)[1]
Your fetch implementation feels like it is only partly done. I would prefer fetch to either read the image itself, or to move the opening of the URI completely into save. Right now, the resource (an open connection) is acquired in one method and used in another, while you could have done both in one go and avoided that. (I suppose there is nothing wrong with it; it just feels wrong.)
        image = fetch img_src
        save(image, name, path)
If it were up to me, I would move the entire name processing and saving into a save_to_disk(url, img_src, path) and do all the processing there. It would also have the advantage of not having to change a relative mypath to the http://domain/mypath format and then extracting the mypath again. (A sketch of what I mean follows the code below.)
      end
    end
    threads.each{ |thread| thread.join }
  end
  def fetch(img_src)
    puts "Fetching #{img_src}\n"
Here, I would much rather see open(src).read so that the open connection is not exposed outside:
    image = open(img_src)
  end
  def save(image, name, path)
    File.open("#{path}/#{name}", "wb"){ |file| file.write(image.read) }
  end
This particular method feels rather odd. All you need is a map that checks whether img_src matches http:// and, if not, returns an updated version. Here you are mutating a data structure within an each loop (it works, I agree) when you have no reason to do that:
  def rel_to_abs(url, img_srcs)
    img_srcs.map do |img_src|
      case img_src
      when /http:\/\// then img_src
      else "http://#{url}/#{img_src}"
      end
    end
  end
end
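For what it's worth, a minimal sketch of the save_to_disk idea from above (the name and signature are my guess; it keeps open-uri, as in the original):

require 'open-uri'

def save_to_disk(url, img_src, path)
  # Resolve relative srcs against the page; absolute srcs pass through as-is.
  image_url = URI.join("http://#{url}/", img_src).to_s
  name = File.basename(URI(image_url).path)
  File.open(File.join(path, name), "wb") do |file|
    file.write(open(image_url).read)
  end
end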
And is there any reason you preferred to do the invocation via Grab.runner rather than Grab.initialize? As it stands, you have to write (Grab.new).runner(url, path) instead of Grab.new(url, path).
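Something like this minimal sketch:

class Grab
  def initialize(url, path)
    runner(url, path)  # kick off the crawl straight from the constructor
  end
end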
But these are all minor. I wouldn't reject someone based on these. So I don't know.