Jun 12 2015

My general approach to extracting data from any API is to pull the data into a relational database and then write SQL queries on top of it to get the required information. Though this works, it's often tedious and involves running multiple applications.

The R programming language works great for statistical computation and plotting graphics, and I have been tinkering with it for the last few weeks. While learning R, I thought of using it to extract data from the API as well. This would allow extracting the latest data from the API and computing stats with a single script. And though the XML package in R doesn't make for the most intuitive parsing code, its vectorized operations reduce the need for frequent loops and keep the code concise and readable.
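For instance, extracting the text of every node in a node set is a single vectorized call rather than an explicit loop. A tiny illustrative sketch (with made-up XML, not the Socialcast schema):

library(XML)

# Toy document, purely for illustration
doc <- xmlParse("<users><user>amy</user><user>bob</user></users>")

# sapply applies xmlValue to every matched node in one call
sapply(getNodeSet(doc, "//user"), xmlValue)  # returns c("amy", "bob")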

Though this code is written for the Socialcast API, it can easily be tweaked to pull data from any social API such as Facebook or Yammer. The first step is to pull the data from the API – the RCurl package fetches the data, which can then be parsed using the XML package.

rm(list=ls())  # start with a clean workspace
library(RCurl)
library(XML)

continueLoading <- TRUE  # flips to FALSE once the API returns an empty page
page <- 1                # current results page
finalDataFrame <- NULL   # accumulates messages, comments and likes across pages

# Helper: for each node matched by parentNode, return the text of its
# childNode, or "" when the child is missing (xmlValue on a NULL node errors)
getInnerText <- function(inputData, parentNode, childNode) {
  xpathSApply(inputData, parentNode, function(x) {
    if (is.null(x[childNode][[childNode]])) {
      ""
    } else {
      xmlValue(x[childNode][[childNode]])
    }
  })
}
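As a quick usage sketch of this helper (the node names are made up, not from the Socialcast schema), it safely extracts an optional child from every matching parent, returning "" where the child is absent:

# The second <post> below has no <title> child
doc <- xmlParse("<posts><post><title>hi</title></post><post/></posts>")
getInnerText(doc, "//post", "title")  # returns c("hi", "")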

while(continueLoading) {

  # Fetch one page of the group's messages (20 per page) over basic auth
  messagesData <- getURL(paste("https://demo.socialcast.com/api/groups/acmecorpsoftballteam/messages.xml?per_page=20&page=",page,sep=""),
                         userpwd="emily@socialcast.com:demo", ssl.verifypeer = FALSE, httpauth = 1L)
  print(paste("LOADING PAGE:",page))
  data <- xmlParse(messagesData)
  totalMessages <- length(getNodeSet(data,"//messages/message"))

The totalMessages variable checks the number of messages returned by the API: when it's zero, the while loop is exited, else the execution continues. The xmlParse function gives us an in-memory structure of the document which can be iterated over. We use the sapply function, which applies a function to each element of a list and returns a vector. I'll come to the getUserNodeValue function later.

  if (totalMessages == 0) {
    continueLoading <- FALSE
  } else {
    # One row per message; stringsAsFactors = FALSE keeps the text columns as
    # plain characters so the pages rbind together cleanly
    tempDataFrame <- data.frame(
      InteractionType = "Message",
      ID = sapply(getNodeSet(data, "//messages/message/id"),xmlValue),
      Author = sapply(getNodeSet(data,"//messages/message/user/name"),xmlValue),
      Body = sapply(getNodeSet(data,"//messages/message/body"),xmlValue),
      Url = sapply(getNodeSet(data,"//messages/message/permalink-url"),xmlValue),
      Type = sapply(getNodeSet(data,"//messages/message/message-type"),xmlValue),
      CreatedAt = sapply(getNodeSet(data,"//messages/message/created-at"),xmlValue),
      Location = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Location")}),
      Country = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Country")}),
      Sector = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Sector")}),
      Title = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Title")}),
      Department = sapply(getNodeSet(data,"//messages/message/user/id"),function(x){getUserNodeValue(x,"Department")}),
      stringsAsFactors = FALSE
    )

    if (is.null(finalDataFrame)) {
      finalDataFrame <- tempDataFrame
    } else {
      finalDataFrame <- rbind(finalDataFrame,tempDataFrame)
    }

Now we have a data frame with all the messages from the API. However, we also need the comments and likes. This is the only place where I needed a for loop, to iterate through each individual message node and select its comments. The xpathSApply function reduces our code further by querying each node of the NodeSet with the given XPath expression and applying a function to it. Furthermore, it returns a vector which fits nicely into our existing data frame.

    for (i in 1:totalMessages) {
      if(length(getNodeSet(data,paste("//messages/message[position()=",i,"]/comments/comment"))) > 0){

        allComments <- getNodeSet(data,paste("//messages/message[position()=",i,"]/comments"))[[1]]

        # One row per comment on this message
        commentFrame <- data.frame(
          InteractionType = "Comment",
          ID = xpathSApply(allComments,"comment/id",xmlValue),
          Author = xpathSApply(allComments,"comment/user/name",xmlValue),
          Body = xpathSApply(allComments,"comment/text",xmlValue),
          Url = xpathSApply(allComments,"comment/permalink-url",xmlValue),
          Type = "",
          CreatedAt = xpathSApply(allComments,"comment/created-at",xmlValue),
          Location = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Location")}),
          Country = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Country")}),
          Sector = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Sector")}),
          Title = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Title")}),
          Department = xpathSApply(allComments,"comment/user/id",function(x){getUserNodeValue(x,"Department")}),
          stringsAsFactors = FALSE
        )

        finalDataFrame <- rbind(finalDataFrame,commentFrame)
      }

      if(length(getNodeSet(data,paste("//messages/message[position()=",i,"]/likes/like"))) > 0){

        allLikes <- getNodeSet(data,paste("//messages/message[position()=",i,"]/likes"))[[1]]

        # One row per like on this message
        likeFrame <- data.frame(
          InteractionType = "Like",
          ID = xpathSApply(allLikes,"like/id",xmlValue),
          Author = xpathSApply(allLikes,"like/user/name",xmlValue),
          Body = "",
          Url = "",
          Type = "",
          CreatedAt = xpathSApply(allLikes,"like/created-at",xmlValue),
          Location = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Location")}),
          Country = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Country")}),
          Sector = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Sector")}),
          Title = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Title")}),
          Department = xpathSApply(allLikes,"like/user/id",function(x){getUserNodeValue(x,"Department")}),
          stringsAsFactors = FALSE
        )

        finalDataFrame <- rbind(finalDataFrame,likeFrame)
      }
    }

  }
  page <- page + 1

}

# Clean up the intermediate objects, keeping only finalDataFrame
rm(list=c("commentFrame","likeFrame","tempDataFrame","users"))

Now we come to the getUserNodeValue function. This is simply a performance optimization, since calling the API to get the user details each time becomes very time-consuming. So I generally keep the user data in a database, load it into a users data frame, and use the id in the XML response to query that data frame and fetch the correct user record. (Note that both the users data frame and this function must exist before the while loop above runs.) This step is purely optional, however; you could just as easily call the API for each user's details and parse the response.

# Look up queryNode (e.g. "Location") for the user whose ID matches the given
# XML node, returning "" when the user is not in the lookup table
getUserNodeValue <- function(inputNode,queryNode){
  if (nrow(users[users$ID == xmlValue(inputNode),]) == 0)
    ""
  else
    users[users$ID == xmlValue(inputNode),][[queryNode]]
}
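The users data frame itself is assumed to be loaded before the script runs. A minimal sketch, assuming the user details were exported to a CSV file with the relevant columns:

# Hypothetical export of cached user details; the lookup above expects
# columns ID, Location, Country, Sector, Title and Department
users <- read.csv("users.csv", stringsAsFactors = FALSE)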

At this point we have all the API information parsed into a data frame (finalDataFrame). Now for the fun part! Though you can subset and count easily using the built-in language functions, a package called dplyr makes this code more readable and intuitive. With dplyr you can perform multiple data manipulation operations (filter, select, order, group by and so on) and chain them together to get the final result.

So to group the data frame by a column and count, the code is as simple as:

#############Type of Activity in the Group#####################################

library(dplyr)  # provides group_by, summarise and the %>% chaining operator

interactionType <- group_by(finalDataFrame,InteractionType) %>%
                   summarise(count = n())

print(interactionType)


#############Active Day of Week#############################

activeDay <- group_by(finalDataFrame,weekdays(as.Date(CreatedAt))) %>%
 summarise(count = n()) %>%
 select(DayOfWeek=1,count=2)

print(activeDay)

The top 5 users by total activity:

activeUsers <- group_by(finalDataFrame,Author) %>%
 summarize(TotalActivity=n()) %>%
 arrange(-TotalActivity) %>%
 top_n(5,TotalActivity)

print(activeUsers)

The type of messages being created:

messageTypes <- filter(finalDataFrame,InteractionType == "Message") %>%
 group_by(Type) %>%
 summarize(count = n()) %>%
 arrange(-count)

print(messageTypes)

The stats shown here barely scratch the surface of what R is capable of computing.
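And since plotting is half the reason to use R, any of these summaries can go straight into a chart. A minimal sketch, assuming the ggplot2 package is installed:

library(ggplot2)

# Bar chart of the activeDay summary computed above
ggplot(activeDay, aes(x = DayOfWeek, y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Group activity by day of week", x = "", y = "Interactions")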

Dec 21 2012

In my previous post, I blogged about how to access Socialcast community data without using the API. This is usually necessary when the API doesn't support a particular piece of functionality that the site provides.

This is true for the use case of updating a user's profile avatar. Though there is a way to update the user profile through the API, there is no obvious method of updating the user's avatar. I asked Socialcast on Twitter, but they didn't answer, so I went ahead with trying to use Mechanize to log in to the site.

I was finally able to update the profile avatar using the script below. Works like a charm.

require 'mechanize'

agent = Mechanize.new
agent.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # demo site, skip SSL verification

# Log in through the regular login form
agent.get("https://demo.socialcast.com/login")
form = agent.page.forms.first
puts "Please enter user email id"
form.email = gets.chomp
puts "Please enter password. caution: it is not masked"
form.password = gets.chomp
form.submit

# Open the profile edit page and find the form with the avatar file upload
puts "Please enter username"
username = gets.chomp
agent.get("https://demo.socialcast.com/users/#{username}/edit")
form = agent.page.forms.detect { |f| f.file_upload_with(:name => "profile_photo[data]") }

puts "Please enter file path of the image to replace"
form.file_uploads.first.file_name = gets.chomp
form.submit

Dec 21 2012

The Socialcast REST API provides programmatic access to the Socialcast community data with XML and JSON endpoints. The API provides most of the information one would want to extract from the site, but there are still gaps where the API is not up to date.

This made me look into the possibility of scraping the site directly using cURL and parsing the generated HTML. However, Socialcast is built on Rails and has a security feature which prevents cross-site request forgery, using an authenticity token: a random token generated and sent with every request, embedded in a hidden form field. When the form is posted back, this token is checked and an error is generated if it's not found. This makes direct scraping of the page difficult, and cURL fails. Googling gave me a few articles which described how to use cURL with sites protected by an authenticity token (Link1, Link2), but unfortunately none of them seemed to work.

Then I came across a suggestion to use Mechanize, a Ruby library to automate interaction with websites. Mechanize works like a charm with sites protected by an authenticity token. Here is the Ruby script to log in to the Socialcast demo site.

require 'mechanize'

agent = Mechanize.new
agent.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE  # demo site, skip SSL verification

# Fetch the login page, fill in the first form and submit it; Mechanize
# carries the hidden authenticity token through the round trip automatically
agent.get("https://demo.socialcast.com/login")
form = agent.page.forms.first
form.email = "emily@socialcast.com"
form.password = "demo"
form.submit

In Interactive Ruby, we can see that the authenticity token is returned when the first GET is called on the login page, and that when the form is submitted the token is posted back to the server and we are redirected to the home page.


From here on, we can automate any interaction with the site just as a normal user would, without worrying about the authenticity token restriction. In my next post, I will explain how to automatically update a user's avatar without relying on the API.