I make an HTTP request that returns a string of JSON objects, one per line, which needs to be parsed.
Example response:
{"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246644200.21/warc/CC-MAIN-20150417045724-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9219", "mime": "text/html", "offset": "986953615", "digest": "DOGJXRGCHRUNDTKKJMLYW2UY2BSWCSHX"}
{"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246645538.5/warc/CC-MAIN-20150417045725-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9218", "mime": "text/html", "offset": "935932558", "digest": "LJKP47MYZ2KEEAYWZ4HICSVIHDG7CARQ"}
{"urlkey": "com,practicingruby)/articles/ant-colony-simulation?u=5c7a967f21", "timestamp": "20150421081357", "status": "200", "url": "https://practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246641054.14/warc/CC-MAIN-20150417045721-00029-ip-10-235-10-82.ec2.internal.warc.gz", "length": "10013", "mime": "text/html", "offset": "966385301", "digest": "AWIR7EJQJCGJYUBWCQBC5UFHCJ2ZNWPQ"}
My code:
require 'net/http'
require 'json'
result = Net::HTTP.get(URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index?url=#{url}&output=json")).split("}")
result.each do |res|
  break if res == "\n"
  # need to add back the brace because we used it to split the various JSON hashes from the HTTP response
  res << "}"
  to_crawl = JSON.parse(res)
  puts to_crawl
end
It works, but I'm sure there is a much better way to do it, or at least a better way to write the code.
2 Answers
This body.split("}") is doing you a disservice, as it destroys the structure of the response. Split it by lines instead:
body = Net::HTTP.get(...)
data = body.lines.map { |line| JSON.parse(line) }
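If the response can contain blank lines, `JSON.parse` will raise on them, so it may be worth filtering them out first. A minimal sketch of the line-based approach, assuming the body is newline-delimited JSON as in the example response; `parse_json_lines` is a hypothetical helper name, and the sample body is shortened from the question's data:

```ruby
require 'json'

# Hypothetical helper: parses a newline-delimited JSON body into an array of hashes.
def parse_json_lines(body)
  body.each_line
      .map(&:strip)       # drop the trailing newline and surrounding whitespace
      .reject(&:empty?)   # skip blank lines, which JSON.parse would reject
      .map { |line| JSON.parse(line) }
end

# Sample body shortened from the example response in the question,
# with a blank line added to show that it is skipped.
sample = <<~BODY
  {"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200"}

  {"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200"}
BODY

records = parse_json_lines(sample)
records.each { |record| puts record["timestamp"] }
```

Because each line is parsed as a whole, a `}` inside a string value no longer matters, which is not true of splitting on `"}"`.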
answered Jun 30, 2015 at 19:01
Use Faraday:
require 'faraday'
require 'json'
conn = Faraday.new("http://index.commoncrawl.org/") do |faraday|
  faraday.request :url_encoded # form-encode POST params
  faraday.adapter Faraday.default_adapter # make requests with Net::HTTP
end
response = conn.get("/CC-MAIN-2015-18-index?url=#{url}&output=json")
# The body holds one JSON object per line, so parse each line separately
parsed = response.body.lines.map { |line| JSON.parse(line) }
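Whichever HTTP client is used, interpolating `url` straight into the query string leaves it unescaped, which could be misread if the target itself contains characters such as `&` or `#`. A minimal sketch using the standard library's `URI.encode_www_form`, assuming the same index endpoint as in the question; `index_uri` is a hypothetical helper name:

```ruby
require 'uri'

# Hypothetical helper: builds the index query URI with escaped parameters.
def index_uri(url)
  uri = URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index")
  # encode_www_form percent-encodes reserved characters in each value
  uri.query = URI.encode_www_form(url: url, output: "json")
  uri
end

# Target URL taken from the example response in the question.
uri = index_uri("https://practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21")
puts uri
```

The resulting `uri` can be passed to `Net::HTTP.get` directly.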