2
\$\begingroup\$

An HTTP request is made, and the response body contains a series of JSON objects, one per line, each of which needs to be parsed.
Example response:

{"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246644200.21/warc/CC-MAIN-20150417045724-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9219", "mime": "text/html", "offset": "986953615", "digest": "DOGJXRGCHRUNDTKKJMLYW2UY2BSWCSHX"}
{"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200", "url": "https://practicingruby.com/", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246645538.5/warc/CC-MAIN-20150417045725-00242-ip-10-235-10-82.ec2.internal.warc.gz", "length": "9218", "mime": "text/html", "offset": "935932558", "digest": "LJKP47MYZ2KEEAYWZ4HICSVIHDG7CARQ"}
{"urlkey": "com,practicingruby)/articles/ant-colony-simulation?u=5c7a967f21", "timestamp": "20150421081357", "status": "200", "url": "https://practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21", "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1429246641054.14/warc/CC-MAIN-20150417045721-00029-ip-10-235-10-82.ec2.internal.warc.gz", "length": "10013", "mime": "text/html", "offset": "966385301", "digest": "AWIR7EJQJCGJYUBWCQBC5UFHCJ2ZNWPQ"}

My code:

result = Net::HTTP.get(URI("http://index.commoncrawl.org/CC-MAIN-2015-18-index?url=#{url}&output=json")).split("}")
result.each do |res|
 break if res == "\n"
 #need to add back braces because we used it to split the various json hashes from the http request
 res << "}"
 to_crawl = JSON.parse(res)
 puts to_crawl
end

It works, but I'm sure there is a much better way to do it, or at least a better way to write the code.

asked Jun 30, 2015 at 18:41
\$\endgroup\$

2 Answers

3
\$\begingroup\$

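Splitting the body on "}" is doing you a disservice, as it destroys the structure of the response and would break on any value that happens to contain a brace. The response has one JSON object per line, so split it by lines instead: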

body = Net::HTTP.get(...)
data = body.lines.map { |line| JSON.parse(line) }
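
To try the line-based approach without hitting the network, here is a minimal sketch that runs against a shortened copy of the sample response. The helper name parse_json_lines and the sample_body string are assumptions made for illustration; skipping blank lines is an extra precaution, not something the response format is known to require.

```ruby
require 'json'

# Hypothetical helper: parse a newline-delimited JSON body into an array of hashes.
# Blank lines (for example a trailing newline) are skipped rather than parsed.
def parse_json_lines(body)
  body.each_line
      .map(&:strip)
      .reject(&:empty?)
      .map { |line| JSON.parse(line) }
end

# Shortened sample modelled on the response shown in the question.
sample_body = <<~BODY
  {"urlkey": "com,practicingruby)/", "timestamp": "20150420004437", "status": "200"}
  {"urlkey": "com,practicingruby)/", "timestamp": "20150425001851", "status": "200"}

BODY

records = parse_json_lines(sample_body)
records.each { |record| puts record["timestamp"] }
```

Because each line is handed to JSON.parse whole, a "}" inside a string value no longer matters, and there is no need to glue braces back on.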
answered Jun 30, 2015 at 19:01
\$\endgroup\$
2
\$\begingroup\$

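Use Faraday. Note that the body still holds one JSON object per line, so each line has to be parsed separately; passing the whole body to JSON.parse raises a parse error.

require 'faraday'
require 'json'

conn = Faraday.new("http://index.commoncrawl.org/") do |faraday|
 faraday.request :url_encoded # form-encode POST params
 faraday.adapter Faraday.default_adapter # make requests with Net::HTTP
end
response = conn.get("/CC-MAIN-2015-18-index?url=#{url}&output=json")
# One JSON object per line, so parse line by line
parsed = response.body.each_line.map { |line| JSON.parse(line) }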
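
Whichever HTTP client is used, interpolating url straight into the query string can produce a malformed request when the target URL itself contains characters such as "?", "=" or "&". A small sketch of one way to handle that with the standard library follows; the helper name index_uri and its index keyword are assumptions for illustration, and the target URL is taken from the sample response in the question.

```ruby
require 'uri'

# Hypothetical helper: build the index query URL with the target URL percent-encoded.
def index_uri(url, index: "CC-MAIN-2015-18-index")
  # encode_www_form escapes reserved characters in each value
  query = URI.encode_www_form(url: url, output: "json")
  URI("http://index.commoncrawl.org/#{index}?#{query}")
end

uri = index_uri("practicingruby.com/articles/ant-colony-simulation?u=5c7a967f21")
puts uri
```

The resulting URI object can be passed directly to Net::HTTP.get, or its path and query handed to another client.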
answered Jul 10, 2015 at 7:12
\$\endgroup\$