YouTube comment scraper returns limited results
I was (for the most part) able to accomplish this by using the latest version of the YouTube Data API and the R package httr. The basic approach I took was to send multiple GET requests to the appropriate URL and grab the data in batches of 100 (the maximum the API allows) - i.e.
    base_url <- "https://www.googleapis.com/youtube/v3/commentThreads/"
    api_opts <- list(
      part = "snippet",
      maxResults = 100,
      textFormat = "plainText",
      videoId = "4H9pTgQY_mo",
      key = "my_google_developer_api_key",
      fields = "items,nextPageToken",
      orderBy = "published")

where key is your actual Google Developer key, of course.
The initial batch is retrieved like this:
    init_results <- httr::content(httr::GET(base_url, query = api_opts))
    ##
    R> names(init_results)
    #[1] "nextPageToken" "items"
    R> init_results$nextPageToken
    #[1] "Cg0Q-YjT3bmSxQIgACgBEhQIABDI3ZWQkbzEAhjVneqH75u4AhgCIGQ="
    R> class(init_results)
    #[1] "list"
The second element - items - is the actual result set from the first batch: it's a list of length 100, since we specified maxResults = 100 in the GET request. The first element - nextPageToken - is what we use to make sure each request returns the appropriate sequence of results. For example, we can get the next 100 results like this:
    api_opts$pageToken <- gsub("\\=", "", init_results$nextPageToken)
    next_results <- httr::content(
      httr::GET(base_url, query = api_opts))
    ##
    R> next_results$nextPageToken
    #[1] "ChYQ-YjT3bmSxQIYyN2VkJG8xAIgACgCEhQIABDI3ZWQkbzEAhiSsMv-ivu0AhgCIMgB"
where the current request's pageToken is the previous request's nextPageToken, and we are given a new nextPageToken for obtaining our next batch of results.
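The token hand-off described above can also be written as a plain loop. This is only a minimal sketch, not the code used in this answer: fetch_all_pages, make_fetcher and fetch_page are hypothetical names, the loop assumes the API leaves nextPageToken out of the last page, and it skips the gsub step that strips "=" from the token.

```r
# Follow nextPageToken from page to page until the API stops returning one.
# `fetch_page` is a hypothetical function(page_token) that returns a parsed
# response: a list with `items` and, except on the last page, `nextPageToken`.
fetch_all_pages <- function(fetch_page, max_pages = 100) {
  items <- list()
  token <- NULL
  for (i in seq_len(max_pages)) {
    res <- fetch_page(token)
    items <- c(items, res$items)
    token <- res$nextPageToken
    # Assumption: a missing or empty token means there are no more pages.
    if (is.null(token) || !nzchar(token)) break
  }
  items
}

# One possible fetch_page built on httr; it takes the same base_url and
# api_opts values that are defined earlier in this answer.
make_fetcher <- function(base_url, api_opts) {
  function(page_token) {
    opts <- api_opts
    if (!is.null(page_token)) opts$pageToken <- page_token
    httr::content(httr::GET(base_url, query = opts))
  }
}
```

It would be called as all_items <- fetch_all_pages(make_fetcher(base_url, api_opts)). The max_pages cap is there so that a server that keeps handing out tokens cannot make the loop run forever.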
This is pretty straightforward, but it would obviously be very tedious to keep updating pageToken from the latest nextPageToken by hand after each request we send. Instead I thought this would be a good use case for a simple Reference Class (created with setRefClass):
    yt_scraper <- setRefClass(
      "yt_scraper",
      fields = list(
        base_url = "character",
        api_opts = "list",
        nextPageToken = "character",
        data = "list",
        unique_count = "numeric",
        done = "logical",
        core_df = "data.frame"),

      methods = list(
        # Fetch one batch and append its items to `data`.
        scrape = function() {
          opts <- api_opts
          if (nextPageToken != "") {
            opts$pageToken <- nextPageToken
          }
          res <- httr::content(
            httr::GET(base_url, query = opts))
          # Defensive guard (an assumption about the API): the last page may
          # carry no nextPageToken, so fall back to "" instead of character(0).
          tok <- res$nextPageToken
          nextPageToken <<- if (is.null(tok)) "" else gsub("\\=", "", tok)
          data <<- c(data, res$items)
          unique_count <<- length(unique(data))
        },

        # Keep scraping until a batch adds no new unique items.
        scrape_all = function() {
          while (TRUE) {
            old_count <- unique_count
            scrape()
            if (unique_count == old_count) {
              done <<- TRUE
              nextPageToken <<- ""
              data <<- unique(data)
              break
            }
          }
        },

        initialize = function() {
          base_url <<- "https://www.googleapis.com/youtube/v3/commentThreads/"
          api_opts <<- list(
            part = "snippet",
            maxResults = 100,
            textFormat = "plainText",
            videoId = "4H9pTgQY_mo",
            key = "my_google_developer_api_key",
            fields = "items,nextPageToken",
            orderBy = "published")
          nextPageToken <<- ""
          data <<- list()
          unique_count <<- 0
          done <<- FALSE
          core_df <<- data.frame()
        },

        # Clear everything that has been scraped so far.
        reset = function() {
          data <<- list()
          nextPageToken <<- ""
          unique_count <<- 0
          done <<- FALSE
          core_df <<- data.frame()
        },

        # Flatten the raw items into a data.frame, one row per top-level comment.
        cache_core_data = function() {
          if (nrow(core_df) < unique_count) {
            sub_data <- lapply(data, function(x) {
              data.frame(
                Comment = x$snippet$topLevelComment$snippet$textDisplay,
                User = x$snippet$topLevelComment$snippet$authorDisplayName,
                ReplyCount = x$snippet$totalReplyCount,
                LikeCount = x$snippet$topLevelComment$snippet$likeCount,
                PublishTime = x$snippet$topLevelComment$snippet$publishedAt,
                CommentId = x$snippet$topLevelComment$id,
                stringsAsFactors = FALSE)
            })
            core_df <<- do.call("rbind", sub_data)
          } else {
            message("\n`core_df` is already up to date.\n")
          }
        }
      )
    )
which can be used like this:
    rObj <- yt_scraper()
    ##
    R> rObj$data
    #list()
    R> rObj$unique_count
    #[1] 0
    ##
    rObj$scrape_all()
    ##
    R> rObj$unique_count
    #[1] 1673
    R> length(rObj$data)
    #[1] 1673
    R> ##
    R> rObj$cache_core_data()  # populate core_df from the scraped items
    R> head(rObj$core_df)
                                                               Comment              User ReplyCount LikeCount              PublishTime
    1                    That Andorra player was really Ruud..<U+feff>         Cistrolat          0         6 2015-03-22T14:07:31.213Z
    2                          This just in; Karma is a bitch.<U+feff> Swagdalf The Obey          0         1 2015-03-21T20:00:26.044Z
    3                                          Legend! Haha B)<U+feff>  martyn baltussen          0         1 2015-01-26T15:33:00.311Z
    4 When did Van der sar ran up? He must have run real fast!<U+feff> Witsakorn Poomjan          0         0 2015-01-04T03:33:36.157Z
    5                           <U+003c>b<U+003e>LOL<U+003c>/b<U+003e>           F Hanif          5        19 2014-12-30T13:46:44.028Z
    6                                          Fucking Legend.<U+feff>        Heisenberg          0        12 2014-12-27T11:59:39.845Z
                                CommentId
    1   z123ybioxyqojdgka231tn5zbl20tdcvn
    2   z13hilaiftvus1cc1233trvrwzfjg1enm
    3 z13fidjhbsvih5hok04cfrkrnla2htjpxfk
    4   z12js3zpvm2hipgtf23oytbxqkyhcro12
    5 z12egtfq5ojifdapz04ceffqfrregdnrrbk
    6 z12fth0gemnwdtlnj22zg3vymlrogthwd04
As I alluded to earlier, this gets you almost everything - 1673 out of about 1790 total comments. For some reason, it does not seem to catch users' nested replies, and I'm not quite sure how to specify this within the API framework.
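One possible explanation for the gap, offered as a guess rather than something verified against this video: commentThreads with part = "snippet" returns one item per top-level comment, and replies are separate resources. The Data API v3 has a comments endpoint that accepts a parentId parameter for fetching the replies to a given top-level comment. The sketch below only prepares those requests; threads_with_replies and reply_query are hypothetical helper names, and the key is a placeholder.

```r
replies_url <- "https://www.googleapis.com/youtube/v3/comments"

# IDs of the top-level comments that report at least one reply.
# `items` has the same shape as rObj$data above.
threads_with_replies <- function(items) {
  has_reply <- vapply(
    items,
    function(x) isTRUE(x$snippet$totalReplyCount > 0),
    logical(1))
  vapply(
    items[has_reply],
    function(x) x$snippet$topLevelComment$id,
    character(1))
}

# Query parameters for one page of replies to a single top-level comment.
reply_query <- function(parent_id, key) {
  list(
    part = "snippet",
    parentId = parent_id,
    maxResults = 100,
    textFormat = "plainText",
    key = key)
}
```

Each ID would then be requested with httr::content(httr::GET(replies_url, query = reply_query(id, "my_google_developer_api_key"))), paging with nextPageToken in the same way as for the threads. If the guess is right, sum(rObj$core_df$ReplyCount) should roughly account for the comments missing from the 1673.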
I had previously set up a Google Developer account a while back for using the Google Analytics API, but if you haven't done that yet, it should be pretty straightforward. Here's an overview - you shouldn't need to set up OAuth or anything like that, just make a project and create a new Public API access key.
An alternative to the XML package is the rvest package. Using the URL that you've provided, scraping comments would look like this:
    library(rvest)
    x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
    x %>%
      read_html() %>%          # parse the feed (this function was html() in early rvest releases)
      html_nodes("content") %>%
      html_text()
Which returns a character vector of the comments:
    [1] "That Andorra player was really Ruud.."
    [2] "This just in; Karma is a bitch."
    [3] "Legend! Haha B)"
    [4] "When did Van der sar ran up? He must have run real fast!"
    [5] "What a beast Ruud was!"
    ...
More information on rvest can be found here.
Your issue lies with getting max results.
Solution Algorithm
First, request the URL https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo?v=2
The response includes the comment count for the video. Extract that number and use it as the upper bound for the iteration.
<gd:comments><gd:feedLink ..... countHint='1797'/></gd:comments>
After that, iterate over the comments URL with these 2 parameters: https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?max-results=50&start-index=1
On each iteration, advance start-index in steps of 50: 1, 51, 101, 151, ... I tested max-results and it is limited to 50.
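The iteration above can be sketched as follows. This only builds the list of URLs to request; it assumes the countHint attribute appears exactly as in the snippet shown, and comment_count and page_urls are hypothetical helper names.

```r
# Pull the comment count out of the <gd:feedLink ... countHint='N'/> element.
comment_count <- function(xml_text) {
  m <- regmatches(xml_text, regexpr("countHint='[0-9]+'", xml_text))
  as.integer(gsub("[^0-9]", "", m))
}

# One URL per batch of 50 comments: start-index = 1, 51, 101, ...
page_urls <- function(video_id, total, page_size = 50) {
  starts <- seq(1, total, by = page_size)
  sprintf(
    "https://gdata.youtube.com/feeds/api/videos/%s/comments?max-results=%d&start-index=%d",
    video_id, page_size, starts)
}
```

With countHint='1797', page_urls("4H9pTgQY_mo", 1797) yields 36 URLs, the last one starting at index 1751.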