I've recently ported an old project and made it object-oriented. However, I've noticed that rubocop points out the following status: Assignment Branch Condition size for fetch_images is too high. [22.41/15]
. Can I make this method be made more efficient, perhaps?
Snippit
class WebCrawler
attr_reader :agent, :tumblr, :images
def initialize
@agent = Mechanize.new
@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample
@tumblr = URI('http://www.tumblr.com')
@images = []
end
def fetch_images(blog, proxy)
fail Exceptions::InvalidBlog unless valid_blog?(blog)
fail Exceptions::InvalidProxy unless valid_ip?(proxy)
tokens = proxy.split(':')
proxy_addr = { ip: tokens[0], port: tokens[1].to_i }
@agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])
@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")
page = @agent.get(@tumblr)
page.search('img[src$="jpg"], img[src$="png"]').each do |img|
@images << img['src']
end
@images.keep_if do |img|
img =~ %r{^https?://\d+|media} && img !~ /avatar/
end
end
end
-
1\$\begingroup\$ In case you didn't know, there's a Ruby wrapper for the tumblr API here: github.com/tumblr/tumblr_client \$\endgroup\$raptortech97– raptortech972014年11月14日 02:08:15 +00:00Commented Nov 14, 2014 at 2:08
-
\$\begingroup\$ I didn't know that, but for what I'm doing I don't think I need a wrapper. It's much more fun to write it on my own. @raptortech97 \$\endgroup\$user27606– user276062014年11月14日 02:12:09 +00:00Commented Nov 14, 2014 at 2:12
1 Answer 1
Assignment Branch Condition size for fetch_images is too high. [22.41/15]
What that warning is saying is that, according to rubocop's standard, you have too many assignments, branches, and conditionals (ABCs) in that method. ABCs are associated with code that is complex and harder to reason about. It can also indicate that a method is trying to do too much, as I would say it is telling you about this method.
Looking over the method in question, we can clearly see it has several responsibilities:
- Guard clauses for validity
- Tokenizing a proxy string into a hash data structure
- Setting a proxy with the data from that data structure
- Tearing apart and rebuilding a URL
- Fetching a page over HTTP
- Parsing markup for image tags, and shoving them on an array
- Filtering that array of images
- Returning that array
That's a lot of responsibilities for one method. It's dealing with several different types of objects/behaviors at several different levels of abstraction. We can, and should, break it up to make it easier to manage.
Before I do that, I'm going to look at the rest of the class (the initialize method). In the initializer, I see a lot of state assignment, and I see some work beyond just state assignment.
@agent = Mechanize.new
Here you are getting a new instance of Mechanize
, but you are doing it directly in the constructor. I would recommend injecting this as a parameter instead. You could even use a default parameter to give you flexibility:
def initialize(agent = Mechanize.new)
@agent = agent
...
This gives you the ability to change this dependency at runtime, and to mock or stub this dependency in your tests.
@agent.user_agent_alias = Mechanize::AGENT_ALIASES.keys .reject { |agent| agent == 'Mechanize' }.sample
Here is some more of the extra work being done in the initialize method. Ideally, you should get a fully initialized object injected, and you should not need to to any extra configuration with it. In this case, you might want to create a MechanizeFactory
class to encapsulate the construction and initialization of Mechanize
objects so you don't have to worry about any of this in your constructor:
class MechanizeFactory
def self.get
mechanize = Mechanize.new
mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample
mechanize
end
end
...
class WebCrawler
def initialize(agent = MechanizeFactory.get)
@agent = agent
...
@tumblr = URI('http://www.tumblr.com')
Here you are setting the URL to the Tumblr main site as an instance variable. The way you use it does not really make it appropriate to be an instance variable. I will discuss why and alternatives further down, but for now I am going to say that this line is unnecessary.
@images = []
You are creating an instance variable to hold the list of returned images. I don't see any reason that this object needs to hold this state; the list of images can simply be returned from the fetch_images
method. This line is also unnecessary.
Now onto the fetch_images method.
fail Exceptions::InvalidBlog unless valid_blog?(blog) fail Exceptions::InvalidProxy unless valid_ip?(proxy)
Here are your guard clauses. For now, I'm going to just leave them as they are.
tokens = proxy.split(':') proxy_addr = { ip: tokens[0], port: tokens[1].to_i } @agent.set_proxy(proxy_addr[:ip], proxy_addr[:port])
Here is one place where a lot is happening. We are doing string tokenization, data structure juggling, and setting more properties on @agent
. I would first have to ask if the proxy really changes with each request. I think it would probably be better to simply set the proxy in the MechanizeFactory.get
method by passing it in by parameter. If it does need to change, there are other strategies to deal with it, but for now I'm going to go on my assumption. This means those three lines are effectively gone, with their functionality relocated. The proxy parameter and proxy validation guard clause are also gone.
@tumblr.host.gsub!(/^w{3}/, '').insert(0, "#{blog}")
Now we are statefully changing that URL instance variable, which I said we no longer need. We can use a local variable, but this still deals with a lower level of abstraction than the rest of the method. We should extract it to a private helper and call that from our code. We can also just combine it with the following line to avoid the need for a temporary local variable.
class MechanizeFactory
def self.get(proxy = '') # Or some logical default for proxy
fail Exceptions::InvalidProxy unless valid_ip?(proxy)
mechanize = Mechanize.new
mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample
proxy_ip, proxy_port = proxy.split(':')
mechanize.set_proxy(proxy_ip, proxy_port)
mechanize
end
end
...
class WebCrawler
def initialize(agent = MechanizeFactory.get)
@agent = agent
end
def fetch_images(blog)
page = @agent.get(get_blog_uri(blog))
...
end
private
def get_blog_uri(blog)
fail Exceptions::InvalidBlog unless valid_blog?(blog)
URI("http://#{blog}.tumblr.com/")
end
end
page.search('img[src$="jpg"], img[src$="png"]').each do |img| @images << img['src'] end @images.keep_if do |img| img =~ %r{^https?://\d+|media} && img !~ /avatar/ end
Here you are doing an XPath search, shoveling the results onto an array, and then filtering the array. This is also at a different level of abstraction, and it can be optimized. We can move it into it's own private helper method. We don't have the @images
instance variable, but page.search
returns an Enumerable object that we can call map
on, and then manipulate. The final product looks like:
class MechanizeFactory
def self.get(proxy = '') # Or some logical default for proxy
fail Exceptions::InvalidProxy unless valid_ip?(proxy)
mechanize = Mechanize.new
mechanize.user_agent_alias = Mechanize::AGENT_ALIASES.keys
.reject { |agent| agent == 'Mechanize' }.sample
proxy_ip, proxy_port = proxy.split(':')
mechanize.set_proxy(proxy_ip, proxy_port)
mechanize
end
end
...
class WebCrawler
def initialize(agent = MechanizeFactory.get)
@agent = agent
end
def fetch_images(blog)
page = @agent.get(get_blog_uri(blog))
extract_images_from_page(page)
end
private
def get_blog_uri(blog)
fail Exceptions::InvalidBlog unless valid_blog?(blog)
URI("http://#{blog}.tumblr.com/")
end
def extract_images_from_page(page)
page.search('img[src$="jpg"], img[src$="png"]').map do |img|
img['src']
end.keep_if do |img|
img =~ %r{^https?://\d+|media} && img !~ /avatar/
end
end
end
There is more that can be done, I am sure, but this at least starts to break the code down into smaller, more focused pieces.
-
\$\begingroup\$ Love the insight, I've never made a factory before so that sounds like something I'll look further into. Do you think this is better off procedural, rather than trying to port this into an oop format? \$\endgroup\$user27606– user276062014年11月14日 04:24:47 +00:00Commented Nov 14, 2014 at 4:24
-
1\$\begingroup\$ @LucienLachance It depends on how and how much you intend to use it. If it's just a one-off script, procedural is probably fine, and not worth the effort of porting. If you are using it for some production function, or are planning to integrate it into a larger app, OOP is the way to go. OOP design allows it to be more easily unit tested and to be more easily modularized for reuse in other code. \$\endgroup\$cbojar– cbojar2014年11月14日 17:09:01 +00:00Commented Nov 14, 2014 at 17:09
-
\$\begingroup\$ You're right. I went with sticking with your factory pattern and passing in a proxy in the initialize method and assigning @agent to the factory method to create a mechanize object based on the given proxy. So now I create the crawler object with a proxy and once created, it makes a request under that proxy without setting one every request which is exactly what I wanted. It's also abstracted all of the setup code as you pointed out which is another plus. I also moved
valid_ip?
into the factory class. \$\endgroup\$user27606– user276062014年11月16日 09:16:00 +00:00Commented Nov 16, 2014 at 9:16