Dec 24 2011

I have been spending some of my free time trying to build a complete cricket statistics database by parsing records from Cricinfo. However, scraping HTML pages is an arduous task. There is simply no standard way of doing it, and it often becomes a struggle with regular expressions. A good solution to this problem is the Html Agility Pack. It's a library that standardizes the parsing of HTML pages and converts them into an XML-style DOM object that you can extract data from. It also has a good number of options for error checking (for HTML which is not XHTML compliant).

The API is very similar to the XmlDocument class in the System.Xml namespace, so there is hardly any learning curve. You can search for nodes using the XPath expression of the element you want. Getting the XPath can be a bit tricky, so an easier way is to use a Chrome extension called XPath Helper. Once the extension is installed, press Ctrl+Shift+X to open it, then hold Shift while hovering the mouse over an element to see its XPath. The given XPath can easily be tailored to select the whole set of data we need to extract.
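As a rough illustration of the XPath lookup described above, here is a minimal sketch that loads a small, made-up HTML table into the Html Agility Pack and pulls out the cells of each row. The sample markup, the `data1` class on its rows, and the `XPathDemo` and `ReadRows` names are assumptions for illustration only, not taken from any real page. One thing worth knowing is that `SelectNodes` returns null, not an empty collection, when nothing matches.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical sample markup; a real stats page is far larger.
    public const string SampleHtml =
        "<table><tbody>" +
        "<tr class='data1'><td>Player A</td><td>100</td></tr>" +
        "<tr class='data1'><td>Player B</td><td>42</td></tr>" +
        "</tbody></table>";

    // Returns the trimmed cell text of every row matched by the XPath.
    public static List<string[]> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();
        var rows = doc.DocumentNode.SelectNodes("//tr[@class='data1']");
        if (rows == null)
        {
            return result; // SelectNodes yields null when nothing matches
        }

        foreach (HtmlNode row in rows)
        {
            // A relative XPath selects the td children of this row only.
            var cells = row.SelectNodes("td");
            if (cells == null)
            {
                continue;
            }
            result.Add(cells.Select(c => c.InnerText.Trim()).ToArray());
        }
        return result;
    }
}
```

Calling `XPathDemo.ReadRows(XPathDemo.SampleHtml)` gives one string array per table row. Selecting all the rows once and then reading the cells of each row avoids running a separate document-wide query for every row index.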

Now, it's time to start scraping. Download the Html Agility Pack from CodePlex and add a reference to the DLL. The code is pretty simple: get the web page as a string, load it into the Html Agility Pack and let it create the DOM structure. Then XPath is used to get the rows in the table; each row is translated into an innings object and added to a collection. At the end, the collection is written to a CSV file that can be opened as an Excel spreadsheet. The code is pretty rough and I did it more as a trial. Building the complete database will be much more difficult, since it will involve parsing different kinds of pages and ensuring the integrity of the data.

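using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        new ReadText().StartParsing();
    }
}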

class ReadText
{
    public void StartParsing()
    {
        string TestUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        Console.WriteLine("Extracting Tests\n\n");
        ExtractInningsView(TestUrl, "..\\..\\AllTestInnings.csv", 404);
        Console.WriteLine("Extracting ODIs\n\n");
        string ODIUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        // Call ExtractInningsView with ODIUrl in the same way, using its own output file and page count.
    }

    private void ExtractInningsView(string statUrl, string fileName, int pageCount)
    {
        List<InningsPlayed> AllInnings = new List<InningsPlayed>();
        for (int j = 1; j < pageCount; j++)
        {
            Console.WriteLine("Reading Page: " + j.ToString());
            string pageText = ReadWebPage(String.Format(statUrl, j));
            var htmlDoc = new HtmlDocument();
            // Let the Html Agility Pack build the DOM from the downloaded page.
            htmlDoc.LoadHtml(pageText);

            // XPath positions are 1-based and each page holds up to 200 rows (size=200).
            for (int i = 1; i <= 200; i++)
            {
                string inningsXpath = "//tbody/tr[@class='data1'][{0}]/td";
                var nodeList = htmlDoc.DocumentNode.SelectNodes(String.Format(inningsXpath, i));

                // SelectNodes returns null when no row matches; also skip rows with too few cells.
                if (nodeList != null && nodeList.Count > 11)
                {
                    AllInnings.Add(new InningsPlayed()
                    {
                        Name = nodeList[0].InnerText,
                        Runs = nodeList[1].InnerText,
                        Minutes = nodeList[2].InnerText,
                        BallsFaced = nodeList[3].InnerText,
                        Fours = nodeList[4].InnerText,
                        Sixes = nodeList[5].InnerText,
                        StrikeRate = nodeList[6].InnerText,
                        Innings = nodeList[7].InnerText,
                        // Column 8 is not used.
                        Opposition = nodeList[9].InnerText,
                        Ground = nodeList[10].InnerText,
                        StartDate = nodeList[11].InnerText
                    });
                }
            }
        }
        DumpToFile(AllInnings, fileName);
    }

    private void DumpToFile(List<InningsPlayed> AllInnings, string fileName)
    {
        // The using block closes the file even if a write fails.
        using (StreamWriter writer = new StreamWriter(fileName))
        {
            int iterations = 0;
            foreach (var inning in AllInnings)
            {
                writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}", inning.Name, inning.Runs, inning.Minutes, inning.BallsFaced, inning.Fours, inning.Sixes, inning.StrikeRate, inning.Innings, inning.Opposition, inning.Ground, inning.StartDate));
                iterations++;
                // Flush every ten rows so partial output reaches the disk.
                if (iterations % 10 == 0)
                {
                    writer.Flush();
                }
            }
        }
    }

    private string ReadWebPage(string Url)
    {
        // Request the page and return the response body as a string.
        WebRequest request = WebRequest.Create(Url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

// One row of the innings table; every value is kept as the raw cell text.
class InningsPlayed
{
    public string Name { get; set; }
    public string Runs { get; set; }
    public string BallsFaced { get; set; }
    public string Minutes { get; set; }
    public string Fours { get; set; }
    public string Sixes { get; set; }
    public string StrikeRate { get; set; }
    public string Innings { get; set; }
    public string Opposition { get; set; }
    public string Ground { get; set; }
    public string StartDate { get; set; }
}
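
One weak spot in writing the rows with a plain comma-separated format string is that any field containing a comma or a quote would shift the columns in the spreadsheet. Below is a minimal sketch of quoting fields along the lines of the common RFC 4180 convention. `CsvLine`, `Escape` and `Join` are hypothetical helper names for illustration, not part of the listing above.

```csharp
using System;
using System.Linq;

public static class CsvLine
{
    // Quote a field only when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null)
        {
            return "";
        }
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        // Embedded quotes are doubled, then the whole field is wrapped in quotes.
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Build one CSV line from any number of fields.
    public static string Join(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }
}
```

With something like this, the write step could pass `CsvLine.Join(inning.Name, inning.Runs, ...)` to `writer.WriteLine` in place of the long format string, and fields without special characters would come out unchanged.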
Sep 01 2011

A few days ago, there was a discussion on a cricket site about who is the best bowler among Waqar, Pollock and Donald. Though very good bowlers in their era, they failed to break into the all-time great list of Marshall, Akram, Ambrose, McGrath and Warne, among a few others.

Donald had sheer pace and menacing swing. Waqar was feared for his toe-crushing yorkers, and Pollock, though not express pace, had the amazing discipline with which he picked up his wickets. So I pulled up their statistics, and this is the country-wise distribution of their averages and strike rates.

Waqar’s average is great in Pakistan, Bangladesh, Zimbabwe and Sri Lanka. Maintaining that kind of average while bowling on flat wickets in Pakistan is quite laudable. In India the sample size is too small to draw any meaningful conclusions. His main bugbear seems to be bowling in South Africa, England and Australia. In England and South Africa, under bowler-friendly conditions, Waqar underperformed by quite a margin. In Australia he averages 40, which is even worse. Donald’s worst bowling performance is in Pakistan, where he averages 32, followed by 28 in Australia. Everywhere else he averages a commendable sub-25. Pollock has a similar problem in Australia, averaging 34 there. The West Indies and India, where he averages 28 and 27, are the other places where he did not perform at his best.

Let's look at their year-wise performance over their careers.

Waqar’s career: Waqar’s first few years were the cream of his career. From 1990 to 1994, his bowling performances were superlative. It was during this prime that he tormented the English batsmen with lethal toe-crushers and banana swing. However, he suffered an injury in 1995, after which his performance became wayward. He was in and out of the team over the next few years. Though he gave some inspiring performances later on, the Waqar of yore was not seen again.

Donald’s career: Donald’s prime was from 1996 to 2000. This was a relatively easier period to bat in than the early 90s, which still favored bowlers to an extent. Still, Donald’s stats don't give that away. His consistency during the prime years was remarkable, with his average going above 20 in just one year. I still remember his delivery to a UAE batsman who had the gall to face Donald without a helmet on. The ball was a lethal bouncer which hit the batsman on the head, and he remarkably escaped without any injuries.

Pollock’s career: Unlike the other two bowlers being discussed, Shaun Pollock was an all-rounder. He didn't have the express pace that Waqar or Donald had, but he was ever the workhorse, disciplined and bowling a consistent line and length with great swing. He took 421 wickets at an average of 23-odd, which in itself is quite remarkable. Add to that his batting ability and you have a match winner. Pollock’s prime was from 1998 to 2003, barring a slightly poor year in 2002.

It is very difficult to pick the best of these three bowlers. However, my personal choice would be Donald, followed by Waqar and then Pollock.