
I'm scraping some comments from Reddit using the Reddit JSON API and R. Since the data does not have a flat structure, extracting it is a little tricky, but I've found a way.

To give you a flavour of what I'm having to do, here is a brief example:

x = "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json" # example url
rawdat = readLines(x,warn=F) # reading in the data
rawdat = fromJSON(rawdat) # formatting
dat_list = repl = rawdat[[2]][[2]][[2]] # this will be used later
sq = seq(dat_list)[-1]-1 # number of comments
txt = unlist(lapply(sq,function(x)dat_list[[x]][[2]][[14]])) # comments (not replies)
# loop time:
for(a in sq){
 repl = tryCatch(repl[[a]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting replies all replies to comment a
 if(length(repl)>0){ # in case there are no replies
 sq = seq(repl)[-1]-1 # number of replies
 txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) # this is what I want
 # next level down
 for(b in sq){
 repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to reply b of comment a
 if(length(repl)>0){
 sq = seq(repl)[-1]-1
 txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) 
 }
 }
 }
}

The above example gets all the comments, the first level of replies to each comment, and the second level of replies (i.e. replies to the replies), but the nesting can go much deeper, so I'm trying to figure out an efficient way of handling it. To achieve this manually, here is what I have to do:

  1. Copy the following code from the last loop:

     for (b in sq) {
       repl = tryCatch(repl[[b]][[2]][[5]][[2]][[2]], error = function(e) NULL)
       if (length(repl) > 0) {
         sq = seq(repl)[-1] - 1
         txt = c(txt, unlist(lapply(sq, function(x) repl[[x]][[2]][[14]])))
       }
     }

  2. Paste that code right after the line that starts with txt = ... and change b in the loop to c.

  3. Repeat this procedure roughly 20 times to make sure everything is captured, which, as you can imagine, creates a huge loop. I was hoping there would be a way to fold this loop somehow and make it more elegant (one possible fold is sketched after this list).
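
For reference, one way to fold the copy-pasted levels into a single loop is a queue-based (breadth-first) traversal. This is only a sketch, assuming the index paths used above ([[2]][[14]] for a comment body, [[2]][[5]][[2]][[2]] for its replies) hold at every depth:

txt <- character(0)
queue <- dat_list[seq(dat_list)[-1] - 1]  # start with the top-level comment nodes
while (length(queue) > 0) {
  node <- queue[[1]]                      # take the next node off the front
  queue <- queue[-1]
  body <- tryCatch(node[[2]][[14]], error = function(e) NULL)
  if (!is.null(body)) txt <- c(txt, body) # collect this comment's text
  repl <- tryCatch(node[[2]][[5]][[2]][[2]], error = function(e) NULL)
  if (length(repl) > 0)                   # enqueue replies, whatever the depth
    queue <- c(queue, repl[seq(repl)[-1] - 1])
}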

asked Aug 31, 2014 at 5:28

1 Answer


Here are my main recommendations:

  1. Use recursion.
  2. Use names instead of list indices: for example, node$data$replies$data$children reads much better than node[[2]][[5]][[2]][[2]], and it is also more robust to changes in the data.
  3. Use well-named variables so your code reads easily.

Now for the code:

url <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
rawdat <- fromJSON(readLines(url, warn = FALSE))
main.node <- rawdat[[2]]$data$children
get.comments <- function(node) {
 comment <- node$data$body
 replies <- node$data$replies
 reply.nodes <- if (is.list(replies)) replies$data$children else NULL
 return(list(comment, lapply(reply.nodes, get.comments)))
}
txt <- unlist(lapply(main.node, get.comments))
length(txt)
# [1] 199
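
If you wanted more than the comment text, the same recursion can return richer records. A minimal sketch, assuming the Reddit JSON exposes an author field alongside body (an assumption, not shown above):

# Hypothetical extension of get.comments: collect author + body records.
# The "author" field name is assumed from the Reddit JSON layout.
get.comment.records <- function(node) {
  rec <- if (is.null(node$data$body)) NULL else
    list(list(author = node$data$author, body = node$data$body))
  replies <- node$data$replies
  reply.nodes <- if (is.list(replies)) replies$data$children else NULL
  c(rec, unlist(lapply(reply.nodes, get.comment.records), recursive = FALSE))
}

records <- unlist(lapply(main.node, get.comment.records), recursive = FALSE)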
answered Aug 31, 2014 at 12:35
\$\endgroup\$
5
  • Is there a difference between unlist(lapply(items, fun)) and sapply(items, fun)? Commented Aug 31, 2014 at 18:34
  • Yes. If fun were to return a vector of the same length for each item, sapply would put all the output vectors into a matrix; otherwise it returns a list. lapply, on the other hand, always returns a list, and unlist(lapply(...)) unwraps that list into a vector (see the short illustration after these comments). Commented Aug 31, 2014 at 19:07
  • Great, thanks @flodel, that is exactly what I'm after! The only thing that concerns me is that there are over 400 responses in total, but the code only returns 199 observations. I thought the API was capped at 500 results. Would you happen to have an idea why this may be so? Because if this is all I'm going to get, I might as well scrape the HTML, which could give me up to 500 comments... Commented Sep 3, 2014 at 6:47
  • If I browse to the url and search for "body:", Chrome tells me there are 198 hits (not sure where the off-by-one comes from), so I'm leaning towards an API limitation. Commented Sep 3, 2014 at 11:23
  • Yes, you are right. I've found a way to capture up to 500 comments and I've added it to the question, but this still won't get all the comments. Anyway, that's a great start! Commented Sep 4, 2014 at 5:16
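
To make the sapply/lapply distinction from the comments concrete, here is a small self-contained illustration on toy data:

# When every result has the same length, sapply simplifies to a matrix:
sapply(1:3, function(i) c(i, i^2))
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    4    9

# When the lengths differ, sapply falls back to returning a list:
sapply(1:3, function(i) seq_len(i))

# lapply always returns a list; unlist() flattens it into one vector:
unlist(lapply(1:3, function(i) seq_len(i)))
# [1] 1 1 2 1 2 3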
