Jun 122015
 

My general approach to extracting data from any API is to extract the data into a relational database and then write SQL queries on top of it to get the required information. Though this works, its often tedious and involves running multiple applications.

The R programming language works great for statistical computation and plotting graphics and I have been tinkering around with it for the last few weeks. While learning R, I thought of using R to extract data from the API as well. This would allow extracting the latest data from the API and compute stats with a single script. And though the XML package in R doesn’t make for the most intuitive parsing code, the vectorized operations reduces the need for frequent loops and keeps the code concise and readable.

And though this code is written for the Socialcast API, it can be easily tweaked to pull data from any social API like Facebook, Yammer etc. The first step is to pull the data from the API – the RCurl package gets us the data which can then be parsed using the XML package.

rm(list=ls())
library(RCurl)
library(XML)

continueLoading=TRUE
page = 1
finalDataFrame <- NULL

getInnerText <- function(inputData,parentNode,childNode) {
  test <- xpathSApply(inputData,parentNode,function(x){
    if(is.null(x[childNode][[childNode]])) {
      ""
    }else {
      xmlValue(x[childNode][[childNode]])
    }
  })
  return(test)
}

while(continueLoading) {

  messagesData <- getURL(paste("https://demo.socialcast.com/api/groups/acmecorpsoftballteam/messages.xml?per_page=20&page=",page,sep=""),
                         userpwd="emily@socialcast.com:demo", ssl.verifypeer = FALSE, httpauth = 1L)
  print(paste("LOADING PAGE:",page))
  data <- xmlParse(messagesData)
  totalMessages <- length(getNodeSet(data,"//messages/message"))

The totalMessages property is to check the number of messages returned by the API. When it’s zero, the while loop is exited, else the execution continues. The xmlParse function gives us a in memory structure of the document which can be iterated upon. we use the sapply function which applies a function to each element of a list and returns a vector. I’ll come to the getUserNodeValue function later

if (totalMessages == 0){
    continueLoading = FALSE
  }
  else {
    tempDataFrame <- data.frame(
      InteractionType = "Message",
      ID = sapply(getNodeSet(data, "//messages/message/id"),xmlValue),
      Author = sapply(getNodeSet(data,"//messages/message/user/name"),xmlValue),
      Body = sapply(getNodeSet(data,"//messages/message/body"),xmlValue),
      Url = sapply(getNodeSet(data,"//messages/message/permalink-url"),xmlValue),
      Type = sapply(getNodeSet(data,"//messages/message/message-type"),xmlValue),
      CreatedAt = sapply(getNodeSet(data,"//messages/message/created-at"),xmlValue),
      Location = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Location")}),
      Country = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Country")}),
      Sector = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Sector")}),
      Title = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Title")}),
      Department = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Department")})
    )

    if (is.null(finalDataFrame)) {
      finalDataFrame <- tempDataFrame
    }
    else{
      finalDataFrame <- rbind(finalDataFrame,tempDataFrame)
    }

Now we have a data frame with all the Messages from the API. However, we also need the comments and likes. This is the only place where I needed to use a for loop to iterate through each individual message node and select their comments. The xpathSApply function reduces our code further by being able to query each node of the NodeSet with the given XPath expression and applying a function on it. Furthermore it returns a vector which fits in nicely into our existing data frame.

   for( i in 1:length(getNodeSet(data,"//messages/message"))) {
      if(length(getNodeSet(data,paste("//messages/message[position()=",i,"]/comments/comment"))) > 0){

        allComments <- getNodeSet(data,paste("//messages/message[position()=",i,"]/comments"))[[1]]

        xpathSApply(allComments,"comment/id",xmlValue)

        commentFrame <-  data.frame(
          InteractionType = "Comment",
          ID = xpathSApply(allComments,"comment/id",xmlValue),
          Author = xpathSApply(allComments,"comment/user/name",xmlValue),
          Body = xpathSApply(allComments,"comment/text",xmlValue),
          Url = xpathSApply(allComments,"comment/permalink-url",xmlValue),
          Type = "",
          CreatedAt = xpathSApply(allComments,"comment/created-at",xmlValue),
          Location = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Location")}),
          Country = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Country")}),
          Sector = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Sector")}),
          Title = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Title")}),
          Department = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Department")})
        )

        finalDataFrame <- rbind(finalDataFrame,commentFrame)
      }

      if(length(getNodeSet(data,paste("//messages/message[position()=",i,"]/likes/like"))) > 0){

        allLikes <- getNodeSet(data,paste("//messages/message[position()=",i,"]/likes"))[[1]]

        likeFrame <-  data.frame(
          InteractionType = "Like",
          ID = xpathSApply(allLikes,"like/id",xmlValue),
          Author = xpathSApply(allLikes,"like/user/name",xmlValue),
          Body = "",
          Url = "",
          Type ="",
          CreatedAt = xpathSApply(allLikes,"like/created-at",xmlValue),
          Location = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Location")}),
          Country = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Country")}),
          Sector = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Sector")}),
          Title = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Title")}),
          Department = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Department")})
        )

        finalDataFrame <- rbind(finalDataFrame,likeFrame)
      }
    }

  }
  page <- page + 1

}

rm(list=c("commentFrame","likeFrame","tempDataFrame","users"))

Now we come to the getNodeUserValue function. This is simply a performance optimization since calling the API to get the user details each time becomes very time consuming. So I generally keep the user data in a database and use the id in the xml response to query the data frame and fetch the correct user record. This step however is purely optional and you could easily call the api to get each user’s response and parse it.

getUserNodeValue <- function(inputNode,queryNode){
  if (nrow(users[users$ID == xmlValue(inputNode),]) == 0)
    ""
  else
    users[users$ID == xmlValue(inputNode),][[queryNode]]
}

At this point we have all the API information parsed into a data frame (finalDataFrame). Now for the fun part! Though you can subset and count easily using the built in language functions, a package called dplyr makes this code more readable and intuitive. With dplyr you can perform multiple data manipulation operations like filter, select, order, group by etc and chain them together to get the final result

So to group the data frame by a column and count, the code is as simple as

#############Type of Activity in the Group#####################################

interactionType <- group_by(finalDataFrame,InteractionType) %>%
                   summarise(count = n())

print(interactionType)

activityType

#############Active Day of Week#############################

activeDay <- group_by(finalDataFrame,weekdays(as.Date(CreatedAt))) %>%
 summarise(count = n()) %>%
 select(DayOfWeek=1,count=2)

print(activeDay)

DayOfWeek
The top 5 users by total Activity

activeUsers <- group_by(finalDataFrame,Author) %>%
 summarize(TotalActivity=n()) %>%
 arrange(-TotalActivity) %>%
 top_n(5,TotalActivity)

print(activeUsers)

The Type of Messages being created

messageTypes <- filter(finalDataFrame,InteractionType == "Message") %>%
 group_by(Type) %>%
 summarize(count = n()) %>%
 arrange(-count)

print(messageTypes)

The stats shown here barely scratch the surface of what R is capable of computing.