
I'm scraping some comments from Reddit using the Reddit JSON API and R. Since the data does not have a flat structure, extracting it is a little tricky, but I've found a way.

To give you a flavour of what I'm having to do, here is a brief example:

x = "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json" # example url
rawdat = readLines(x,warn=F) # reading in the data
rawdat = fromJSON(rawdat) # formatting
dat_list = repl = rawdat[[2]][[2]][[2]] # this will be used later
sq = seq(dat_list)[-1]-1 # number of comments
txt = unlist(lapply(sq,function(x)dat_list[[x]][[2]][[14]])) # comments (not replies)
# loop time:
for(a in sq){
 repl = tryCatch(repl[[a]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting replies all replies to comment a
 if(length(repl)>0){ # in case there are no replies
 sq = seq(repl)[-1]-1 # number of replies
 txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) # this is what I want
 # next level down
 for(b in sq){
 repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to reply b of comment a
 if(length(repl)>0){
 sq = seq(repl)[-1]-1
 txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) 
 }
 }
 }
}

The above example gets all the comments, the first level of replies to each comment, and the second level of replies (i.e. replies to the replies), but the nesting can go much deeper, so I'm trying to figure out an efficient way of handling it. To achieve this manually, here is what I have to do:

  1. Copy the following code from the last loop:

     for (b in sq) {
       repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]], error = function(e) NULL)
       if (length(repl) > 0) {
         sq = seq(repl)[-1] - 1
         txt = c(txt, unlist(lapply(sq, function(x) repl[[x]][[2]][[14]])))
       }
     }

  2. Paste that code right after the line that starts with txt = ... and change b in the loop to c.

  3. Repeat this procedure roughly 20 times to make sure everything is captured, which, as you can imagine, creates a huge loop. I was hoping there would be a way to fold this loop somehow and make it more elegant (one possible fold is sketched after this list).
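
For reference, one way to fold the copy-pasted levels into a single loop is a queue-based (breadth-first) traversal. This is only a sketch, assuming the index paths used above ([[2]][[14]] for a comment body, [[2]][[5]][[2]][[2]] for its replies) hold at every depth:

txt <- character(0)
queue <- dat_list[seq(dat_list)[-1] - 1]  # start with the top-level comment nodes
while (length(queue) > 0) {
  node <- queue[[1]]                      # take the next node off the front
  queue <- queue[-1]
  body <- tryCatch(node[[2]][[14]], error = function(e) NULL)
  if (!is.null(body)) txt <- c(txt, body) # collect this comment's text
  repl <- tryCatch(node[[2]][[5]][[2]][[2]], error = function(e) NULL)
  if (length(repl) > 0)                   # enqueue replies, whatever the depth
    queue <- c(queue, repl[seq(repl)[-1] - 1])
}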

asked Aug 31, 2014 at 5:28

1 Answer


Here are my main recommendations:

  1. Use recursion.
  2. Use names instead of list indices: for example, node$data$replies$data$children reads much better than node[[2]][[5]][[2]][[2]], and it is also more robust to changes in the data.
  3. Use well-named variables so your code reads easily.

Now for the code:

url <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
rawdat <- fromJSON(readLines(url, warn = FALSE))
main.node <- rawdat[[2]]$data$children
get.comments <- function(node) {
 comment <- node$data$body
 replies <- node$data$replies
 reply.nodes <- if (is.list(replies)) replies$data$children else NULL
 return(list(comment, lapply(reply.nodes, get.comments)))
}
txt <- unlist(lapply(main.node, get.comments))
length(txt)
# [1] 199
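
If you wanted more than the comment text, the same recursion can return richer records. A minimal sketch, assuming the Reddit JSON exposes an author field alongside body (an assumption, not shown above):

# Hypothetical extension of get.comments: collect author + body records.
# The "author" field name is assumed from the Reddit JSON layout.
get.comment.records <- function(node) {
  rec <- if (is.null(node$data$body)) NULL else
    list(list(author = node$data$author, body = node$data$body))
  replies <- node$data$replies
  reply.nodes <- if (is.list(replies)) replies$data$children else NULL
  c(rec, unlist(lapply(reply.nodes, get.comment.records), recursive = FALSE))
}

records <- unlist(lapply(main.node, get.comment.records), recursive = FALSE)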
answered Aug 31, 2014 at 12:35
\$\endgroup\$
5
  • Is there a difference between unlist(lapply(items, fun)) and sapply(items, fun)? Commented Aug 31, 2014 at 18:34
  • Yes. If fun were to return a vector of the same length for each item, sapply would put all the output vectors into a matrix; otherwise it returns a list. lapply, on the other hand, always returns a list, and unlist(lapply(...)) unwraps that list into a vector (see the short illustration after these comments). Commented Aug 31, 2014 at 19:07
  • Great, thanks @flodel, that is exactly what I'm after! The only thing that concerns me is that there are over 400 responses in total, but the code only returns 199 observations. I thought the API was capped at 500 results. Would you happen to have an idea why this may be so? Because if this is all I'm going to get, I might as well scrape the HTML, which could give me up to 500 comments... Commented Sep 3, 2014 at 6:47
  • If I browse to the url and search for "body:", Chrome tells me there are 198 hits (not sure where the off-by-one comes from), so I'm leaning towards an API limitation. Commented Sep 3, 2014 at 11:23
  • Yes, you are right. I've found a way to capture up to 500 comments and I've added it to the question, but this still won't get all the comments. Anyway, that's a great start! Commented Sep 4, 2014 at 5:16
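
To make the sapply/lapply distinction from the comments concrete, here is a small self-contained illustration on toy data:

# When every result has the same length, sapply simplifies to a matrix:
sapply(1:3, function(i) c(i, i^2))
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    4    9

# When the lengths differ, sapply falls back to returning a list:
sapply(1:3, function(i) seq_len(i))

# lapply always returns a list; unlist() flattens it into one vector:
unlist(lapply(1:3, function(i) seq_len(i)))
# [1] 1 1 2 1 2 3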
