Refactoring a Crawler

Question 1

I've recently ported an old project and made it object-oriented. However, I've noticed that rubocop points out the following status: Assignment Branch Condition size for fetch_images is too high. [22.41/15]. Can I make this method be made more efficient, perhaps?

Snippit

class WebCrawler 
 attr_reader :agent, :tumblr, :images 
 def initialize 
 @agent = Mechanize.new 
 @agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample 
 @tumblr = URI('http://www.tumblr.com') 
 @images = [] 
 end 
 def fetch_images(blog, proxy) 
 fail Exceptions::InvalidBlog unless valid_blog?(blog) 
 fail Exceptions::InvalidProxy unless valid_ip?(proxy) 
 tokens = proxy.split(':') 
 proxy_addr = { ip: tokens[0], port: tokens[1].to_i } 
 @agent.set_proxy(proxy_addr[:ip], proxy_addr[:port]) 
 @tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}") 
 page = @agent.get(@tumblr) 
 page.search('img[src$="jpg"], img[src$="png"]').each do |img| 
 @images << img['src'] 
 end 
 @images.keep_if do |img| 
 img =~ %r{^https?://\d+|media} && img !~ /avatar/ 
 end 
 end
end

Question 2

In case you didn't know, there's a Ruby wrapper for the tumblr API here: github.com/tumblr/tumblr_client

Question 3

I didn't know that, but for what I'm doing I don't think I need a wrapper. It's much more fun to write it on my own. @raptortech97

Question 4

Assignment Branch Condition size for fetch_images is too high. [22.41/15]

What that warning is saying is that, according to rubocop's standard, you have too many assignments, branches, and conditionals (ABCs) in that method. ABCs are associated with code that is complex and harder to reason about. It can also indicate that a method is trying to do too much, as I would say it is telling you about this method.

Looking over the method in question, we can clearly see it has several responsibilities:

Guard clauses for validity
Tokenizing a proxy string into a hash data structure
Setting a proxy with the data from that data structure
Tearing apart and rebuilding a URL
Fetching a page over HTTP
Parsing markup for image tags, and shoving them on an array
Filtering that array of images
Returning that array

That's a lot of responsibilities for one method. It's dealing with several different types of objects/behaviors at several different levels of abstraction. We can, and should, break it up to make it easier to manage.

Before I do that, I'm going to look at the rest of the class (the initialize method). In the initializer, I see a lot of state assignment, and I see some work beyond just state assignment.

@agent = Mechanize.new

Here you are getting a new instance of Mechanize, but you are doing it directly in the constructor. I would recommend injecting this as a parameter instead. You could even use a default parameter to give you flexibility:

def initialize(agent = Mechanize.new)
 @agent = agent
 ...

This gives you the ability to change this dependency at runtime, and to mock or stub this dependency in your tests.

@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample

Here is some more of the extra work being done in the initialize method. Ideally, you should get a fully initialized object injected, and you should not need to to any extra configuration with it. In this case, you might want to create a MechanizeFactory class to encapsulate the construction and initialization of Mechanize objects so you don't have to worry about any of this in your constructor:

class MechanizeFactory
 def self.get
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 ...

@tumblr = URI('http://www.tumblr.com')

Here you are setting the URL to the Tumblr main site as an instance variable. The way you use it does not really make it appropriate to be an instance variable. I will discuss why and alternatives further down, but for now I am going to say that this line is unnecessary.

@images = []

You are creating an instance variable to hold the list of returned images. I don't see any reason that this object needs to hold this state; the list of images can simply be returned from the fetch_images method. This line is also unnecessary.

Now onto the fetch_images method.

fail Exceptions::InvalidBlog unless valid_blog?(blog)
fail Exceptions::InvalidProxy unless valid_ip?(proxy)

Here are your guard clauses. For now, I'm going to just leave them as they are.

tokens = proxy.split(':') 
proxy_addr = { ip: tokens[0], port: tokens[1].to_i } 
@agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])

Here is one place where a lot is happening. We are doing string tokenization, data structure juggling, and setting more properties on @agent. I would first have to ask if the proxy really changes with each request. I think it would probably be better to simply set the proxy in the MechanizeFactory.get method by passing it in by parameter. If it does need to change, there are other strategies to deal with it, but for now I'm going to go on my assumption. This means those three lines are effectively gone, with their functionality relocated. The proxy parameter and proxy validation guard clause are also gone.

@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")

Now we are statefully changing that URL instance variable, which I said we no longer need. We can use a local variable, but this still deals with a lower level of abstraction than the rest of the method. We should extract it to a private helper and call that from our code. We can also just combine it with the following line to avoid the need for a temporary local variable.

class MechanizeFactory
 def self.get(proxy = '') # Or some logical default for proxy
 fail Exceptions::InvalidProxy unless valid_ip?(proxy)
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 proxy_ip, proxy_port = proxy.split(':')
 mechanize.set_proxy(proxy_ip, proxy_port)
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 end
 def fetch_images(blog)
 page = @agent.get(get_blog_uri(blog))
 ...
 end
 private
 def get_blog_uri(blog)
 fail Exceptions::InvalidBlog unless valid_blog?(blog)
 URI("http://#{blog}.tumblr.com/")
 end
end

page.search('img[src$="jpg"], img[src$="png"]').each do |img|
 @images << img['src']
end
@images.keep_if do |img|
 img =~ %r{^https?://\d+|media} && img !~ /avatar/
end

Here you are doing an XPath search, shoveling the results onto an array, and then filtering the array. This is also at a different level of abstraction, and it can be optimized. We can move it into it's own private helper method. We don't have the @images instance variable, but page.search returns an Enumerable object that we can call map on, and then manipulate. The final product looks like:

class MechanizeFactory
 def self.get(proxy = '') # Or some logical default for proxy
 fail Exceptions::InvalidProxy unless valid_ip?(proxy)
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 proxy_ip, proxy_port = proxy.split(':')
 mechanize.set_proxy(proxy_ip, proxy_port)
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 end
 def fetch_images(blog)
 page = @agent.get(get_blog_uri(blog))
 extract_images_from_page(page)
 end
 private
 def get_blog_uri(blog)
 fail Exceptions::InvalidBlog unless valid_blog?(blog)
 URI("http://#{blog}.tumblr.com/")
 end
 def extract_images_from_page(page)
 page.search('img[src$="jpg"], img[src$="png"]').map do |img|
 img['src']
 end.keep_if do |img|
 img =~ %r{^https?://\d+|media} && img !~ /avatar/
 end
 end
end

There is more that can be done, I am sure, but this at least starts to break the code down into smaller, more focused pieces.

Question 5

Love the insight, I've never made a factory before so that sounds like something I'll look further into. Do you think this is better off procedural, rather than trying to port this into an oop format?

Question 6

@LucienLachance It depends on how and how much you intend to use it. If it's just a one-off script, procedural is probably fine, and not worth the effort of porting. If you are using it for some production function, or are planning to integrate it into a larger app, OOP is the way to go. OOP design allows it to be more easily unit tested and to be more easily modularized for reuse in other code.

Question 7

You're right. I went with sticking with your factory pattern and passing in a proxy in the initialize method and assigning @agent to the factory method to create a mechanize object based on the given proxy. So now I create the crawler object with a proxy and once created, it makes a request under that proxy without setting one every request which is exactly what I wanted. It's also abstracted all of the setup code as you pointed out which is another plus. I also moved valid_ip? into the factory class.

cbojar cbojar 2,7422 gold badges15 silver badges20 bronze badges · Accepted Answer · 2014-11-14 04:09:16Z

Assignment Branch Condition size for fetch_images is too high. [22.41/15]

What that warning is saying is that, according to rubocop's standard, you have too many assignments, branches, and conditionals (ABCs) in that method. ABCs are associated with code that is complex and harder to reason about. It can also indicate that a method is trying to do too much, as I would say it is telling you about this method.

Looking over the method in question, we can clearly see it has several responsibilities:

Guard clauses for validity
Tokenizing a proxy string into a hash data structure
Setting a proxy with the data from that data structure
Tearing apart and rebuilding a URL
Fetching a page over HTTP
Parsing markup for image tags, and shoving them on an array
Filtering that array of images
Returning that array

That's a lot of responsibilities for one method. It's dealing with several different types of objects/behaviors at several different levels of abstraction. We can, and should, break it up to make it easier to manage.

Before I do that, I'm going to look at the rest of the class (the initialize method). In the initializer, I see a lot of state assignment, and I see some work beyond just state assignment.

@agent = Mechanize.new

Here you are getting a new instance of Mechanize, but you are doing it directly in the constructor. I would recommend injecting this as a parameter instead. You could even use a default parameter to give you flexibility:

def initialize(agent = Mechanize.new)
 @agent = agent
 ...

This gives you the ability to change this dependency at runtime, and to mock or stub this dependency in your tests.

@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample

Here is some more of the extra work being done in the initialize method. Ideally, you should get a fully initialized object injected, and you should not need to to any extra configuration with it. In this case, you might want to create a MechanizeFactory class to encapsulate the construction and initialization of Mechanize objects so you don't have to worry about any of this in your constructor:

class MechanizeFactory
 def self.get
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 ...

@tumblr = URI('http://www.tumblr.com')

Here you are setting the URL to the Tumblr main site as an instance variable. The way you use it does not really make it appropriate to be an instance variable. I will discuss why and alternatives further down, but for now I am going to say that this line is unnecessary.

@images = []

You are creating an instance variable to hold the list of returned images. I don't see any reason that this object needs to hold this state; the list of images can simply be returned from the fetch_images method. This line is also unnecessary.

Now onto the fetch_images method.

fail Exceptions::InvalidBlog unless valid_blog?(blog)
fail Exceptions::InvalidProxy unless valid_ip?(proxy)

Here are your guard clauses. For now, I'm going to just leave them as they are.

tokens = proxy.split(':') 
proxy_addr = { ip: tokens[0], port: tokens[1].to_i } 
@agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])

Here is one place where a lot is happening. We are doing string tokenization, data structure juggling, and setting more properties on @agent. I would first have to ask if the proxy really changes with each request. I think it would probably be better to simply set the proxy in the MechanizeFactory.get method by passing it in by parameter. If it does need to change, there are other strategies to deal with it, but for now I'm going to go on my assumption. This means those three lines are effectively gone, with their functionality relocated. The proxy parameter and proxy validation guard clause are also gone.

@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")

Now we are statefully changing that URL instance variable, which I said we no longer need. We can use a local variable, but this still deals with a lower level of abstraction than the rest of the method. We should extract it to a private helper and call that from our code. We can also just combine it with the following line to avoid the need for a temporary local variable.

class MechanizeFactory
 def self.get(proxy = '') # Or some logical default for proxy
 fail Exceptions::InvalidProxy unless valid_ip?(proxy)
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 proxy_ip, proxy_port = proxy.split(':')
 mechanize.set_proxy(proxy_ip, proxy_port)
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 end
 def fetch_images(blog)
 page = @agent.get(get_blog_uri(blog))
 ...
 end
 private
 def get_blog_uri(blog)
 fail Exceptions::InvalidBlog unless valid_blog?(blog)
 URI("http://#{blog}.tumblr.com/")
 end
end

page.search('img[src$="jpg"], img[src$="png"]').each do |img|
 @images << img['src']
end
@images.keep_if do |img|
 img =~ %r{^https?://\d+|media} && img !~ /avatar/
end

Here you are doing an XPath search, shoveling the results onto an array, and then filtering the array. This is also at a different level of abstraction, and it can be optimized. We can move it into it's own private helper method. We don't have the @images instance variable, but page.search returns an Enumerable object that we can call map on, and then manipulate. The final product looks like:

class MechanizeFactory
 def self.get(proxy = '') # Or some logical default for proxy
 fail Exceptions::InvalidProxy unless valid_ip?(proxy)
 mechanize = Mechanize.new
 mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys 
 .reject { |agent| agent == 'Mechanize' }.sample
 proxy_ip, proxy_port = proxy.split(':')
 mechanize.set_proxy(proxy_ip, proxy_port)
 mechanize
 end
end
...
class WebCrawler
 def initialize(agent = MechanizeFactory.get)
 @agent = agent
 end
 def fetch_images(blog)
 page = @agent.get(get_blog_uri(blog))
 extract_images_from_page(page)
 end
 private
 def get_blog_uri(blog)
 fail Exceptions::InvalidBlog unless valid_blog?(blog)
 URI("http://#{blog}.tumblr.com/")
 end
 def extract_images_from_page(page)
 page.search('img[src$="jpg"], img[src$="png"]').map do |img|
 img['src']
 end.keep_if do |img|
 img =~ %r{^https?://\d+|media} && img !~ /avatar/
 end
 end
end

There is more that can be done, I am sure, but this at least starts to break the code down into smaller, more focused pieces.

Love the insight, I've never made a factory before so that sounds like something I'll look further into. Do you think this is better off procedural, rather than trying to port this into an oop format?
@LucienLachance It depends on how and how much you intend to use it. If it's just a one-off script, procedural is probably fine, and not worth the effort of porting. If you are using it for some production function, or are planning to integrate it into a larger app, OOP is the way to go. OOP design allows it to be more easily unit tested and to be more easily modularized for reuse in other code.
You're right. I went with sticking with your factory pattern and passing in a proxy in the initialize method and assigning @agent to the factory method to create a mechanize object based on the given proxy. So now I create the crawler object with a proxy and once created, it makes a request under that proxy without setting one every request which is exactly what I wanted. It's also abstracted all of the setup code as you pointed out which is another plus. I also moved valid_ip? into the factory class.

Stack Exchange Network

Refactoring a Crawler

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Refactoring a Crawler

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions