Dec 24, 2011

I have been spending some of my free time trying to build a complete cricket statistics database by parsing records from Cricinfo. However, scraping HTML pages is an arduous task: there is no standard way of doing it, and it often becomes a struggle with regular expressions. A good solution to this problem is the Html Agility Pack, a library that standardizes parsing of HTML pages and converts them into an XML-style DOM object from which you can extract data. It also has a good number of options for error handling, which helps with HTML that is not XHTML-compliant.
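As a minimal sketch of what the library does (the markup here is made up for illustration), HtmlDocument accepts even malformed HTML with unclosed tags and still builds a DOM you can query:

```csharp
using System;
using HtmlAgilityPack;

class DomDemo
{
    static void Main()
    {
        // Note the unclosed <p> and <b> tags -- the parser copes anyway.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p class='score'>Sachin Tendulkar <b>148*");

        // Query the repaired DOM with XPath, just like XmlDocument.
        var node = doc.DocumentNode.SelectSingleNode("//p[@class='score']");
        Console.WriteLine(node.InnerText); // combined text of the <p> and its children
    }
}
```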

The API is very similar to the XmlDocument class in the System.Xml namespace, so there is hardly any learning curve. You can search for nodes using the XPath expression of the element you want. Getting the XPath right can be a bit tricky, so an easier way is to use a Chrome extension called XPath Helper. Once the extension is installed, press Ctrl+Shift+X to activate it, then hold Shift to see the XPath of whatever element the mouse is hovering over. The reported XPath can then be tailored to select the whole set of data we need to extract.
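To illustrate the tailoring step (with a made-up table standing in for a Cricinfo page): XPath Helper reports the path to one specific cell, and generalizing it, by dropping the positional index and keying on the row's class, selects every matching row at once:

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table><tbody>
            <tr class='data1'><td>Lara</td><td>400</td></tr>
            <tr class='data1'><td>Hayden</td><td>380</td></tr>
        </tbody></table>");

        // XPath Helper might report a single cell as
        //   /html/body/table/tbody/tr[1]/td[1]
        // Dropping tr[1] and filtering on the class selects all rows instead.
        var rows = doc.DocumentNode.SelectNodes("//tbody/tr[@class='data1']");
        foreach (var row in rows)
            Console.WriteLine(row.SelectSingleNode("td[1]").InnerText);
    }
}
```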

Now it's time to start scraping. Download the Html Agility Pack from Codeplex and add a reference to the DLL. The code is fairly simple: fetch the web page as a string, load it into the Html Agility Pack so it can build the DOM, then use an XPath query to get the list of rows in the table. Each row is translated into an innings object and added to a collection, which is finally written to a CSV file that can be opened as an Excel spreadsheet. The code is pretty rough and was written more as a trial. Building the complete database will be much harder, since it will involve parsing different kinds of pages and ensuring the integrity of the data.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        new ReadText().StartParsing();
    }
}

class ReadText
{
    public void StartParsing()
    {
        string TestUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        Console.WriteLine("Extracting Tests\n\n");
        ExtractInningsView(TestUrl, "..\\..\\AllTestInnings.csv", 404);
        Console.WriteLine("Extracting ODIs\n\n");
        string ODIUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        ExtractInningsView(ODIUrl, "..\\..\\AllODIInnings.csv", 404); // ODI page count is a placeholder
    }

    private void ExtractInningsView(string statUrl, string fileName, int pageCount)
    {
        List<InningsPlayed> AllInnings = new List<InningsPlayed>();
        for (int j = 1; j < pageCount; j++)
        {
            Console.WriteLine("Reading Page: " + j.ToString());
            string pageText = ReadWebPage(String.Format(statUrl, j));
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(pageText);

            // Each results page holds up to 200 rows of class "data1".
            for (int i = 1; i <= 200; i++)
            {
                string inningsXpath = "//tbody/tr[@class='data1'][{0}]/td";
                var nodeList = htmlDoc.DocumentNode.SelectNodes(String.Format(inningsXpath, i));

                if (nodeList != null)
                {
                    AllInnings.Add(new InningsPlayed()
                    {
                        Name = nodeList[0].InnerText,
                        Runs = nodeList[1].InnerText,
                        Minutes = nodeList[2].InnerText,
                        BallsFaced = nodeList[3].InnerText,
                        Fours = nodeList[4].InnerText,
                        Sixes = nodeList[5].InnerText,
                        StrikeRate = nodeList[6].InnerText,
                        Innings = nodeList[7].InnerText,
                        // nodeList[8] is an empty spacer column in the results table
                        Opposition = nodeList[9].InnerText,
                        Ground = nodeList[10].InnerText,
                        StartDate = nodeList[11].InnerText
                    });
                }
            }
        }
        DumpToFile(AllInnings, fileName);
    }

    private void DumpToFile(List<InningsPlayed> AllInnings, string fileName)
    {
        using (StreamWriter writer = new StreamWriter(fileName))
        {
            int iterations = 0;
            foreach (var inning in AllInnings)
            {
                writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}",
                    inning.Name, inning.Runs, inning.Minutes, inning.BallsFaced, inning.Fours,
                    inning.Sixes, inning.StrikeRate, inning.Innings, inning.Opposition,
                    inning.Ground, inning.StartDate));
                iterations++;
                if (iterations % 10 == 0)
                {
                    writer.Flush(); // push buffered rows to disk periodically
                }
            }
        }
    }

    private string ReadWebPage(string Url)
    {
        // Fetch the web page and return its content as a string.
        WebRequest request = WebRequest.Create(Url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

class InningsPlayed
{
    public string Name { get; set; }
    public string Runs { get; set; }
    public string BallsFaced { get; set; }
    public string Minutes { get; set; }
    public string Fours { get; set; }
    public string Sixes { get; set; }
    public string StrikeRate { get; set; }
    public string Innings { get; set; }
    public string Opposition { get; set; }
    public string Ground { get; set; }
    public string StartDate { get; set; }
}