I'm a JavaScript developer. I'm pretty sure that will be immediately apparent in the below code if for no other reason than the level/depth of chaining that I'm comfortable with. However, I'm learning Ruby, and so I'd love to write beautiful Ruby code too. My simple first project is a Craigslist search-across-regions script.
The full code is on GitHub, but broken down into snippets with questions below.
def get_us_regions()
# Accumulator for building up the returned object.
results = {}
# Important URLs
sitelist = URI('http://www.craigslist.org/about/sites')
geospatial = URI('http://www.craigslist.org/about/areas.json')
# Get a collection of nodes for the US regions out of craigslist's site list.
usregions = Nokogiri::HTML(Net::HTTP.get(sitelist)).search("a[name=US]").first().parent().next_element().search('a')
# Parse out the information to build a usable representation.
usregions.each { |usregion|
hostname = usregion.attr('href').gsub('http://','').gsub('.craigslist.org','')
results[hostname] = { name: usregion.content, state: usregion.parent().parent().previous_element().content }
}
# Merge that information with craigslist's geographic information.
areas = JSON.parse(Net::HTTP.get(geospatial))
areas.each { |area|
if results[area["hostname"]]
results[area["hostname"]][:stateabbrev] = area["region"]
results[area["hostname"]][:latitude] = area["lat"]
results[area["hostname"]][:longitude] = area["lon"]
end
}
# This is a complete list of the US regions, keyed off of their hostname.
return results
end
Bootstrapping
- How should I get procedural information I need to get started?
- Does this change if I were doing this on bootstrap for a long-running application and wanted to refresh, say, monthly?
- Should I shove bootstrapping this into some really abstract class?
This Isn't JS
- How should I chain calls to class methods?
- Why hashes with string keys versus the weird named key thing?
Object-ions
- Should I be creating an object for each region and feeding the pieces I parse out of the document into the constructor function?
- If I did that, should that constructor function just take a DOM node and be clever with figuring out what I passed it?
- And for reopening an object since I have to collate across two sources what's the "right" approach?
# Perform a search in a particular region.
def search_region(regionhostname, query)
# In case there are multiple pages of results from a search
pages = []
pagecount = false
# An accumulator for storing what we need to return.
result = []
# Make requests for every page.
while (pages.length != pagecount)
# End up with a start of "0" on the first time, 100 is craigslist's page length.
page = pages.length * 100
# Here is the URL we'll be making the request of.
url = URI("http://#{regionhostname}.craigslist.org/search/cto?query=#{query}&srchType=T&s=#{page}")
# Get the response and parse it.
pages << Nokogiri::HTML(Net::HTTP.get(url))
# If this is the first time through
if (pagecount == false)
#check to make sure there are results.
if pages.last().search('.resulttotal').length() != 0
# There are results, and we need to see if additional requests are necessary.
pagecount = (pages.last().search('.resulttotal').first().content().gsub(/[^0-9]/,'').to_i / 100.0).ceil
else
# There are no results, we're done here.
return []
end
end
end
# Go through each of the pages of results and process the listings
pages.each { |page|
# Go through all of the listings on each page
page.search('.row').each { |listing|
# Skip listings from other regions in case there are any ("FEW LOCAL RESULTS FOUND").
if listing.search('a[href^=http]').length() != 0
next
end
# Parse information out of the listing.
car = {}
car["id"] = listing["data-pid"]
car["date"] = listing.search(".date").length() == 1 ? Date.parse(listing.search(".date").first().content) : nil
# When Craigslist wraps at the end of the year it doesn't add a year field.
# Fortunately Craigslist has an approximately one month time limit that makes it easy to know which year is being referred to.
# Overshooting by incrementing the month to make sure that timezone differences between this and CL servers don't result in weirdness
if car["date"].month > Date.today.month + 1
car["date"] = car["date"].prev_year
end
car["link"] = "http://#{regionhostname}.craigslist.org/cto/#{car['id']}.html"
car["description"] = listing.search(".pl > a").length() == 1 ? listing.search(".pl > a").first().content : nil
car["price"] = listing.search("span.price").length() == 1 ? listing.search("span.price").first().content : nil
car["location"] = listing.search(".l2 small").length() == 1 ? listing.search(".l2 small").first().content.gsub(/[\(\)]/,'').strip : nil
car["longitude"] = listing["data-longitude"]
car["latitude"] = listing["data-latitude"]
# Pull car model year from description
# Can be wrong, but likely to be accurate.
if /(?:\b19[0-9]{2}\b|\b20[0-9]{2}\b|\b[0-9]{2}\b)/.match(car["description"]) { |result|
# Two digit year
if result[0].length == 2
# Not an arbitrary wrapping point like it is in MySQL, etc.
# Cars have known manufacture dates and can't be too far in the future.
if result[0].to_i <= Date.today.strftime("%y").to_i + 1
car["year"] = "20#{result[0]}"
else
car["year"] = "19#{result[0]}"
end
# Four digit year is easy.
elsif result[0].length == 4
car["year"] = result[0]
end
}
else
car["year"] = nil
end
# Store the region lookup key.
car["regionhostname"] = regionhostname
result << car
}
}
return result
end
Car vs. Listing
- Now I have two possible "competing" objects if I were to throw this into a class. The listing is describing a car, but I care about capturing information from both. Should I store both and link them? A listing "has-one" car?
Results Pages
- Should each page be an object? The first little bit I go through is to figure out how many pages I need to request.
- How should I prevent this from running in serial? Should I bubble these functions back out by returning functions? Is that really possible to do cleanly in Ruby?
If the code has to look like this...
- The if statements where I'm checking to see if something is there (and calling the method twice) are terrible. But it throws ugly errors if I try to access things that aren't present.
- Ternary was the best I found, are there other tricks?
Misc.
- Is "next" in good favor?
- Are there idioms for pulling information out of Match objects?
- What about building up strings? Am I doing that right?
def search(query)
results = []
# Get a copy of the regions we're going to search.
regions = get_us_regions()
# Divide the requests to each region across the "right" number of threads.
iterations = 5
count = (regions.length/iterations.to_f).ceil
# Spin up the threads!
(0..(iterations-1)).each { |iteration|
threads = []
# Protect against source exhaustion
if iteration * count > regions.length()
next
end
# Split the requests by region.
regions.keys.slice(iteration*count,count).each { |regionhostname|
threads << Thread.new(regionhostname) { |activeregionhostname|
# New block for proper scoping of regionhostname
results << search_region(activeregionhostname, query)
}
}
# Wait until all threads are complete before kicking off the next set.
threads.each { |thread| thread.join }
}
# From search_region we return an array, which means we need to flatten(1) to pull everything up to the top level.
results = results.flatten(1)
# Sort the search results by date, descending.
results.sort! { |a,b|
if a["date"] == b["date"]
b["id"].to_i <=> a["id"].to_i
else
b["date"] <=> a["date"]
end
}
return results
end
puts search("TDI").to_json
public static void main
- Threading! Asynchronous code makes sense to me, but my (Ruby) threads blow up if I create too many at once. Is there an idiom for queuing activity for a set number of worker threads?
- For statements? or (0..5).each { |index| }?
- Globals for collections of objects? only in "main"?
- Are there naming conventions I'm doing incredibly wrong?
- Is there anything else I should have asked?
Conclusion
The code works, and you can use it to get results from every region for a car search on Craigslist. This is nice for rare/hard-to-find vehicles. I'd love for the threading to be better and include the multiple requests from pagination on different threads, but I'd need some sort of pool to handle that. Eventually I'm thinking of integrating Rack into this for a simplistic vehicle search API. Or maybe it gets smarter and stores the results in a DB to track pricing over time and create more educated sellers and consumers or to flag good deals.
1 Answer 1
Long, long question. I'll take your first snippet and leave others to tackle the rest. First some comments about your code:
def get_us_regions()
: It's not idiomatic to puts those()
on methods without arguments.first()
: It's not idiomatic to write them on calls without arguments either.results = {}
: I've already written a lot about this subject in CR, so I'll just give the link: functional programming with Ruby.# Important URLs
: Not sure if important enough to deserve a comment :-)Nokogiri::HTML(...) ... long expression
. Expressions can be chained without end, you must decide when to break and give meaningful names. I'd broke it down at least into two subexpressions.gsub('http://','').gsub('.craigslist.org','')
: Use module URI instead of performing manual manipulation of URLs.results[area["hostname"]][:stateabbrev]
: Again, a functional approach on this kind of expressions make them more concise and clear.return results
: Explicitreturn
s are not idiomatic.def get_us_regions
. When a method is so trivial to make it configurable, do so, here give the country as argument ->def get_regions(country_code)
.
Now how I'd write the method. First I'd use Facets, an excellent library with lots of cool abstractions not present in the core:
require 'uri'
require 'nokogiri'
require 'net/http'
require 'json'
require 'facets'
module CraigsList
SitelistUrl = 'http://www.craigslist.org/about/sites'
GeospatialUrl = 'http://www.craigslist.org/about/areas.json'
# Return hash of pairs (hostname, {:name, :state, :stateabbr, :latitude, :longitude})
# for US regions in craigslist.
def self.get_regions(country_code)
doc = Nokogiri::HTML(Net::HTTP.get(URI(SitelistUrl)))
usregions = doc.search("a[name=#{country_code}]").first.parent.next_element.search('a')
state_info = usregions.mash do |usregion|
hostname = URI.parse(usregion.attr('href')).host.split(".").first
state = usregion.parent.parent.previous_element.content
info = {name: usregion.content, state: state}
[hostname, info]
end
areas = JSON.parse(Net::HTTP.get(URI(GeospatialUrl)))
geo_info = areas.slice(*state_info.keys).mash do |area|
info = {stateabbrev: area["region"], latitude: area["lat"], longitude: area["lon"]}
[area["hostname"], info]
end
state_info.deep_merge(geo_info)
end
end
You mention that you write Javascript code. The good thing about a functional approach is that code is (syntactic differences aside) identical in any language (if it has minimal functional capabilities). In JS (although FP style is more friendly with Coffeescript) and underscore (+ custom abstractions as mixins) you could write the same code.
-
\$\begingroup\$ Okay, it looks like I need to learn how Ruby blocks work in depth, as well as all of their syntax. They appear to be wonderfully powerful. \$\endgroup\$nathanhammond– nathanhammond2013年06月29日 02:05:29 +00:00Commented Jun 29, 2013 at 2:05
-
\$\begingroup\$ in one way, blocks are just syntactic sugar for javascript's anonymous functions :) \$\endgroup\$Justin L.– Justin L.2013年06月29日 04:00:50 +00:00Commented Jun 29, 2013 at 4:00
-
\$\begingroup\$ although i guess javascript rarely encourages functional programming to the extent that ruby does \$\endgroup\$Justin L.– Justin L.2013年06月29日 04:08:21 +00:00Commented Jun 29, 2013 at 4:08
-
\$\begingroup\$ They're incredibly similar to anonymous functions in JS, but I saw a few use cases where you could pass blocks around into things that look like ES6 generators with their use of yield. I'm reading as much as I can on it for now. :) And, hello lexical scoping! My how I've missed you old friend! \$\endgroup\$nathanhammond– nathanhammond2013年06月29日 04:31:05 +00:00Commented Jun 29, 2013 at 4:31
-
\$\begingroup\$ @JustinL. Personally I always use (try to use at least) JS with a purely functional approach except, of course, for external side-effects. But as I said I use Coffeescript because its light syntax makes it more bearable. \$\endgroup\$tokland– tokland2013年06月29日 06:40:43 +00:00Commented Jun 29, 2013 at 6:40
Explore related questions
See similar questions with these tags.
for
statement \$\endgroup\$5.times { |i| }
is a little more expressive \$\endgroup\$do |i| end
construct instead of the{ |i| }
one. \$\endgroup\$