Dec 21, 2012
 

The Socialcast REST API provides programmatic access to Socialcast community data through XML and JSON endpoints. It covers most of the information one would want to extract from the site, but there are still gaps where the API lags behind the site itself.

This made me look into the possibility of scraping the site directly, using cURL and parsing the generated HTML. However, Socialcast is built on Rails, which has a security feature to prevent cross-site request forgery: an authenticity token, a random value generated by the server and embedded in a hidden form field on each page it serves. When the form is posted back, the server checks this token and raises an error if it is missing or does not match. This makes direct scraping difficult, and a plain cURL request fails. Googling turned up a few articles on using cURL with sites protected by the authenticity token (Link1, Link2), but unfortunately none of them worked for me.
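To make the token check concrete, here is a minimal sketch that simulates it in plain Ruby. It is not Rails' or Socialcast's actual code: the helpers `render_login_form` and `accept_post?`, the session hash, and the example credentials are all hypothetical stand-ins, chosen only to show why a POST that omits the hidden field is rejected while one that echoes it back is accepted.

```ruby
require 'securerandom'

# Hypothetical stand-in for the server: stores a fresh random token in the
# session and embeds the same value in a hidden form field.
def render_login_form(session)
  session[:csrf_token] = SecureRandom.hex(16)
  <<~HTML
    <form action="/login" method="post">
      <input type="hidden" name="authenticity_token" value="#{session[:csrf_token]}">
      <input type="text" name="email">
      <input type="password" name="password">
    </form>
  HTML
end

# Hypothetical check performed when the form is posted back: the submitted
# token must match the one stored in the session.
def accept_post?(session, params)
  !session[:csrf_token].nil? && params['authenticity_token'] == session[:csrf_token]
end

session = {}
html = render_login_form(session)

# A scraper that posts only the credentials is rejected.
naive_ok = accept_post?(session, 'email' => 'user@example.com', 'password' => 'secret')

# A client that first reads the hidden field and sends it back is accepted.
token = html[/name="authenticity_token" value="([^"]+)"/, 1]
echo_ok = accept_post?(session,
                       'email' => 'user@example.com',
                       'password' => 'secret',
                       'authenticity_token' => token)

puts "without token: #{naive_ok}"
puts "with token:    #{echo_ok}"
```

The point of the sketch is that the client has to fetch the form first, keep the session it was issued in, and return the token unchanged, which is the bookkeeping a scraper must reproduce.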

Then I came across a suggestion to use Mechanize, a Ruby library for automating interaction with websites. Mechanize works like a charm with sites protected by an authenticity token. Here is the Ruby script to log in to the Socialcast Demo site.

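require 'mechanize'

agent = Mechanize.new
# Skip SSL certificate verification (acceptable for the demo site only)
agent.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
# Fetch the login page; the response carries the hidden authenticity token
agent.get("https://demo.socialcast.com/login")
form = agent.page.forms.first
form.email = "emily@socialcast.com"
form.password = "demo"
# Mechanize posts the hidden token back along with the credentials
form.submit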

In Interactive Ruby, we can see that the authenticity token is returned when the first GET is made on the login page. When the form is submitted, the token is posted back to the server and we are redirected to the home page.
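For a rough picture of the bookkeeping Mechanize is doing in that round trip, here is a sketch of the same GET-then-POST sequence using only Ruby's standard library. It is an illustration under assumptions, not a replacement for the Mechanize script and not verified against Socialcast: the helper names `extract_token`, `cookie_header` and `login` are hypothetical, the form is assumed to post to `/login`, and the field names `email` and `password` are taken from the script above.

```ruby
require 'net/http'
require 'uri'

# Pull the hidden authenticity_token value out of the login page HTML.
# Assumes the name attribute appears before value; a real HTML parser
# would be more robust than this regex.
def extract_token(html)
  html[/name="authenticity_token"[^>]*value="([^"]*)"/, 1]
end

# Reduce Set-Cookie response headers to a single Cookie request header,
# keeping only the name=value part of each cookie.
def cookie_header(set_cookie_lines)
  Array(set_cookie_lines).map { |line| line.split(';', 2).first.strip }.join('; ')
end

# GET the login page, then POST the credentials together with the token
# and the session cookie issued by the first response.
# The '/login' form action is an assumption.
def login(base_url, email, password)
  uri = URI.join(base_url, '/login')
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    page = http.get(uri.path)
    post = Net::HTTP::Post.new(uri.path)
    post['Cookie'] = cookie_header(page.get_fields('Set-Cookie'))
    post.set_form_data(
      'authenticity_token' => extract_token(page.body),
      'email' => email,
      'password' => password
    )
    http.request(post)
  end
end
```

Calling `login("https://demo.socialcast.com", "emily@socialcast.com", "demo")` would perform the two requests and return the response to the POST. Mechanize does all of this, plus redirect handling and cookie storage, behind `form.submit`.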

[Screenshot: login]

From here on, we can automate any interaction with the site just as a normal user would, without worrying about the authenticity token restriction. In my next post, I will explain how to automatically update a user’s avatar without relying on the API.