5
\$\begingroup\$

I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the below code if for no other reason than the level/depth of chaining that I'm comfortable with. However, I'm learning Ruby, and so I'd love to write beautiful Ruby code too. My simple first project is a Craigslist search-across-regions script.

The full code is on GitHub, but broken down into snippets with questions below.


def get_us_regions()
 # Accumulator for building up the returned object.
 results = {}
 # Important URLs
 sitelist = URI('http://www.craigslist.org/about/sites')
 geospatial = URI('http://www.craigslist.org/about/areas.json')
 # Get a collection of nodes for the US regions out of craigslist's site list.
 usregions = Nokogiri::HTML(Net::HTTP.get(sitelist)).search("a[name=US]").first().parent().next_element().search('a')
 # Parse out the information to build a usable representation.
 usregions.each { |usregion|
 hostname = usregion.attr('href').gsub('http://','').gsub('.craigslist.org','')
 results[hostname] = { name: usregion.content, state: usregion.parent().parent().previous_element().content }
 }
 # Merge that information with craigslist's geographic information.
 areas = JSON.parse(Net::HTTP.get(geospatial))
 areas.each { |area|
 if results[area["hostname"]]
 results[area["hostname"]][:stateabbrev] = area["region"]
 results[area["hostname"]][:latitude] = area["lat"]
 results[area["hostname"]][:longitude] = area["lon"]
 end
 }
 # This is a complete list of the US regions, keyed off of their hostname.
 return results
end

Bootstrapping

  • How should I get procedural information I need to get started?
  • Does this change if I were doing this on bootstrap for a long-running application and wanted to refresh, say, monthly?
  • Should I shove bootstrapping this into some really abstract class?

This Isn't JS

  • How should I chain calls to class methods?
  • Why hashes with string keys versus the weird named key thing?

Object-ions

  • Should I be creating an object for each region and feeding the pieces I parse out of the document into the constructor function?
  • If I did that, should that constructor function just take a DOM node and be clever with figuring out what I passed it?
  • And for reopening an object since I have to collate across two sources what's the "right" approach?

# Perform a search in a particular region.
def search_region(regionhostname, query)
 # In case there are multiple pages of results from a search
 pages = []
 pagecount = false
 # An accumulator for storing what we need to return.
 result = []
 # Make requests for every page.
 while (pages.length != pagecount)
 # End up with a start of "0" on the first time, 100 is craigslist's page length.
 page = pages.length * 100 
 # Here is the URL we'll be making the request of.
 url = URI("http://#{regionhostname}.craigslist.org/search/cto?query=#{query}&srchType=T&s=#{page}")
 # Get the response and parse it.
 pages << Nokogiri::HTML(Net::HTTP.get(url))
 # If this is the first time through
 if (pagecount == false)
 #check to make sure there are results.
 if pages.last().search('.resulttotal').length() != 0
 # There are results, and we need to see if additional requests are necessary.
 pagecount = (pages.last().search('.resulttotal').first().content().gsub(/[^0-9]/,'').to_i / 100.0).ceil
 else
 # There are no results, we're done here.
 return []
 end
 end
 end
 # Go through each of the pages of results and process the listings
 pages.each { |page|
 # Go through all of the listings on each page
 page.search('.row').each { |listing|
 # Skip listings from other regions in case there are any ("FEW LOCAL RESULTS FOUND").
 if listing.search('a[href^=http]').length() != 0
 next
 end
 # Parse information out of the listing.
 car = {}
 car["id"] = listing["data-pid"]
 car["date"] = listing.search(".date").length() == 1 ? Date.parse(listing.search(".date").first().content) : nil
 # When Craigslist wraps at the end of the year it doesn't add a year field.
 # Fortunately Craigslist has an approximately one month time limit that makes it easy to know which year is being referred to.
 # Overshooting by incrementing the month to make sure that timezone differences between this and CL servers don't result in weirdness
 if car["date"].month > Date.today.month + 1
 car["date"] = car["date"].prev_year
 end
 car["link"] = "http://#{regionhostname}.craigslist.org/cto/#{car['id']}.html"
 car["description"] = listing.search(".pl > a").length() == 1 ? listing.search(".pl > a").first().content : nil
 car["price"] = listing.search("span.price").length() == 1 ? listing.search("span.price").first().content : nil
 car["location"] = listing.search(".l2 small").length() == 1 ? listing.search(".l2 small").first().content.gsub(/[\(\)]/,'').strip : nil
 car["longitude"] = listing["data-longitude"]
 car["latitude"] = listing["data-latitude"]
 # Pull car model year from description
 # Can be wrong, but likely to be accurate.
 if /(?:\b19[0-9]{2}\b|\b20[0-9]{2}\b|\b[0-9]{2}\b)/.match(car["description"]) { |result|
 # Two digit year
 if result[0].length == 2
 # Not an arbitrary wrapping point like it is in MySQL, etc.
 # Cars have known manufacture dates and can't be too far in the future.
 if result[0].to_i <= Date.today.strftime("%y").to_i + 1
 car["year"] = "20#{result[0]}"
 else
 car["year"] = "19#{result[0]}"
 end
 # Four digit year is easy.
 elsif result[0].length == 4
 car["year"] = result[0]
 end
 }
 else
 car["year"] = nil
 end
 # Store the region lookup key.
 car["regionhostname"] = regionhostname
 result << car
 }
 }
 return result
end

Car vs. Listing

  • Now I have two possible "competing" objects if I were to throw this into a class. The listing is describing a car, but I care about capturing information from both. Should I store both and link them? A listing "has-one" car?

Results Pages

  • Should each page be an object? The first little bit I go through is to figure out how many pages I need to request.
  • How should I prevent this from running in serial? Should I bubble these functions back out by returning functions? Is that really possible to do cleanly in Ruby?

If the code has to look like this...

  • The if statements where I'm checking to see if something is there (and calling the method twice) are terrible. But it throws ugly errors if I try to access things that aren't present.
  • Ternary was the best I found, are there other tricks?

Misc.

  • Is "next" in good favor?
  • Are there idioms for pulling information out of Match objects?
  • What about building up strings? Am I doing that right?

def search(query)
 results = []
 # Get a copy of the regions we're going to search.
 regions = get_us_regions()
 # Divide the requests to each region across the "right" number of threads.
 iterations = 5
 count = (regions.length/iterations.to_f).ceil
 # Spin up the threads!
 (0..(iterations-1)).each { |iteration|
 threads = []
 # Protect against source exhaustion
 if iteration * count > regions.length()
 next
 end
 # Split the requests by region.
 regions.keys.slice(iteration*count,count).each { |regionhostname|
 threads << Thread.new(regionhostname) { |activeregionhostname|
 # New block for proper scoping of regionhostname
 results << search_region(activeregionhostname, query)
 }
 }
 # Wait until all threads are complete before kicking off the next set.
 threads.each { |thread| thread.join }
 }
 # From search_region we return an array, which means we need to flatten(1) to pull everything up to the top level.
 results = results.flatten(1)
 # Sort the search results by date, descending.
 results.sort! { |a,b|
 if a["date"] == b["date"]
 b["id"].to_i <=> a["id"].to_i
 else
 b["date"] <=> a["date"]
 end
 }
 return results
end
puts search("TDI").to_json

public static void main

  • Threading! Asynchronous code makes sense to me, but my (Ruby) threads blow up if I create too many at once. Is there an idiom for queuing activity for a set number of worker threads?
  • For statements? or (0..5).each { |index| }?
  • Globals for collections of objects? only in "main"?
  • Are there naming conventions I'm doing incredibly wrong?
  • Is there anything else I should have asked?

Conclusion

The code works, and you can use it to get results from every region for a car search on Craigslist. This is nice for rare/hard-to-find vehicles. I'd love for the threading to be better and include the multiple requests from pagination on different threads, but I'd need some sort of pool to handle that. Eventually I'm thinking of integrating Rack into this for a simplistic vehicle search API. Or maybe it gets smarter and stores the results in a DB to track pricing over time and create more educated sellers and consumers or to flag good deals.

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Jun 27, 2013 at 6:35
\$\endgroup\$
5
  • \$\begingroup\$ my goodness i had totally forgotten that ruby even had a for statement \$\endgroup\$ Commented Jun 28, 2013 at 23:33
  • \$\begingroup\$ So I think that answers the question about using "for" statements. :) \$\endgroup\$ Commented Jun 29, 2013 at 1:22
  • \$\begingroup\$ 5.times { |i| } is a little more expressive \$\endgroup\$ Commented Jun 29, 2013 at 4:02
  • \$\begingroup\$ also i'm sure you know this at this point, but when a block spans more than one line, you should almost always be using the do |i| end construct instead of the { |i| } one. \$\endgroup\$ Commented Jun 29, 2013 at 4:06
  • \$\begingroup\$ Check out this Ruby-Style-Guide. It's a good reference for all the little questions you may have. Good luck \$\endgroup\$ Commented Jul 8, 2013 at 21:23

1 Answer 1

2
\$\begingroup\$

Long, long question. I'll take your first snippet and leave others to tackle the rest. First some comments about your code:

  • def get_us_regions(): It's not idiomatic to puts those () on methods without arguments.

  • first(): It's not idiomatic to write them on calls without arguments either.

  • results = {}: I've already written a lot about this subject in CR, so I'll just give the link: functional programming with Ruby.

  • # Important URLs: Not sure if important enough to deserve a comment :-)

  • Nokogiri::HTML(...) ... long expression. Expressions can be chained without end, you must decide when to break and give meaningful names. I'd broke it down at least into two subexpressions.

  • gsub('http://','').gsub('.craigslist.org',''): Use module URI instead of performing manual manipulation of URLs.

  • results[area["hostname"]][:stateabbrev]: Again, a functional approach on this kind of expressions make them more concise and clear.

  • return results: Explicit returns are not idiomatic.

  • def get_us_regions. When a method is so trivial to make it configurable, do so, here give the country as argument -> def get_regions(country_code).

Now how I'd write the method. First I'd use Facets, an excellent library with lots of cool abstractions not present in the core:

require 'uri'
require 'nokogiri'
require 'net/http'
require 'json'
require 'facets'
module CraigsList
 SitelistUrl = 'http://www.craigslist.org/about/sites'
 GeospatialUrl = 'http://www.craigslist.org/about/areas.json'
 # Return hash of pairs (hostname, {:name, :state, :stateabbr, :latitude, :longitude})
 # for US regions in craigslist.
 def self.get_regions(country_code)
 doc = Nokogiri::HTML(Net::HTTP.get(URI(SitelistUrl)))
 usregions = doc.search("a[name=#{country_code}]").first.parent.next_element.search('a')
 state_info = usregions.mash do |usregion|
 hostname = URI.parse(usregion.attr('href')).host.split(".").first
 state = usregion.parent.parent.previous_element.content
 info = {name: usregion.content, state: state}
 [hostname, info] 
 end
 areas = JSON.parse(Net::HTTP.get(URI(GeospatialUrl)))
 geo_info = areas.slice(*state_info.keys).mash do |area|
 info = {stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"]}
 [area["hostname"], info] 
 end
 state_info.deep_merge(geo_info)
 end
end

You mention that you write Javascript code. The good thing about a functional approach is that code is (syntactic differences aside) identical in any language (if it has minimal functional capabilities). In JS (although FP style is more friendly with Coffeescript) and underscore (+ custom abstractions as mixins) you could write the same code.

answered Jun 27, 2013 at 20:55
\$\endgroup\$
5
  • \$\begingroup\$ Okay, it looks like I need to learn how Ruby blocks work in depth, as well as all of their syntax. They appear to be wonderfully powerful. \$\endgroup\$ Commented Jun 29, 2013 at 2:05
  • \$\begingroup\$ in one way, blocks are just syntactic sugar for javascript's anonymous functions :) \$\endgroup\$ Commented Jun 29, 2013 at 4:00
  • \$\begingroup\$ although i guess javascript rarely encourages functional programming to the extent that ruby does \$\endgroup\$ Commented Jun 29, 2013 at 4:08
  • \$\begingroup\$ They're incredibly similar to anonymous functions in JS, but I saw a few use cases where you could pass blocks around into things that look like ES6 generators with their use of yield. I'm reading as much as I can on it for now. :) And, hello lexical scoping! My how I've missed you old friend! \$\endgroup\$ Commented Jun 29, 2013 at 4:31
  • \$\begingroup\$ @JustinL. Personally I always use (try to use at least) JS with a purely functional approach except, of course, for external side-effects. But as I said I use Coffeescript because its light syntax makes it more bearable. \$\endgroup\$ Commented Jun 29, 2013 at 6:40

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.