Dec 24, 2011

I have been spending some of my free time trying to build a complete cricket statistics database by parsing records from Cricinfo. However, scraping HTML pages is an arduous task. There is no standard way of doing it, and it often becomes a struggle with regular expressions. A good solution to this problem is the Html Agility Pack. It's a library that standardizes the parsing of HTML pages and converts them into an XML-style DOM object that you can extract data from. It also offers a good number of options for error checking (for HTML that is not XHTML compliant).

The API is very similar to the XmlDocument class in the System.Xml namespace, so there is hardly any learning curve. You can search for nodes using the XPath expression of the element you want. Getting the XPath can be a bit tricky, so an easier way is to use a Chrome extension called XPath Helper. Once the extension is installed, press Ctrl+Shift+X to activate it, then hold Shift to get the XPath of whichever element the mouse is hovering over. The XPath it gives can easily be tailored to select the whole set of data we need to extract.
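To illustrate the XPath lookup pattern, here is a minimal sketch that uses only System.Xml on a small, well-formed fragment. The markup, the player names and the `XPathSketch` class are made up for illustration. With the Html Agility Pack the same idea would go through `HtmlDocument.LoadHtml` and `DocumentNode.SelectNodes`, which also cope with markup that is not well formed.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathSketch
{
    // Returns "name: runs" for every <tr class='data1'> row in the markup.
    // Rows with any other class (headers, separators) are skipped by the XPath filter.
    public static List<string> ExtractRows(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup); // XmlDocument needs well-formed input

        var result = new List<string>();
        foreach (XmlNode row in doc.SelectNodes("//tbody/tr[@class='data1']"))
        {
            XmlNodeList cells = row.SelectNodes("td");
            result.Add(cells[0].InnerText + ": " + cells[1].InnerText);
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical sample data, not taken from any real page.
        string markup =
            "<table><tbody>" +
            "<tr class='data1'><td>Player A</td><td>101</td></tr>" +
            "<tr class='head'><td>ignored</td><td>0</td></tr>" +
            "<tr class='data1'><td>Player B</td><td>57</td></tr>" +
            "</tbody></table>";

        foreach (string line in ExtractRows(markup))
            Console.WriteLine(line);
    }
}
```

Selecting all matching rows in one query and then reading the cells of each row avoids building a separate XPath string per row.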

Now it's time to start scraping. Download the Html Agility Pack from Codeplex and add a reference to the DLL. The code is pretty simple: it fetches the web page as a string, loads it into the Html Agility Pack and lets it create the DOM structure. The XPath is then used to get the list of rows in the table, and each row is translated into an innings object and added to a collection. At the end, the collection is written to a CSV file that can be converted to an Excel spreadsheet. The code is pretty rough, and I wrote it more as a trial. Building the complete database will be much more difficult, since it will involve parsing different kinds of pages and ensuring the integrity of the data.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        new ReadText().StartParsing();
    }
}

class ReadText
{
    public void StartParsing()
    {
        // {0} in the URL is replaced with the page number.
        string TestUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        Console.WriteLine("Extracting Tests\n\n");
        ExtractInningsView(TestUrl, "..\\..\\AllTestInnings.csv", 404);
        Console.WriteLine("Extracting ODIs\n\n");
        string ODIUrl = ";filter=advanced;page={0};orderby=start;size=200;template=results;type=batting;view=innings;wrappertype=print";
        // The ODI extraction call is not shown here; it would pass ODIUrl
        // with its own output file name and page count.
    }

    private void ExtractInningsView(string statUrl, string fileName, int pageCount)
    {
        List<InningsPlayed> AllInnings = new List<InningsPlayed>();
        for (int j = 1; j < pageCount; j++)
        {
            Console.WriteLine("Reading Page: " + j.ToString());
            string pageText = ReadWebPage(String.Format(statUrl, j));
            var htmlDoc = new HtmlDocument();
            // Let the Html Agility Pack build the DOM from the raw page text.
            htmlDoc.LoadHtml(pageText);

            // Each page holds up to 200 rows (size=200); XPath positions are 1-based.
            for (int i = 1; i <= 200; i++)
            {
                string inningsXpath = "//tbody/tr[@class='data1'][{0}]/td";
                var nodeList = htmlDoc.DocumentNode.SelectNodes(String.Format(inningsXpath, i));

                // SelectNodes returns null when nothing matches; also skip short rows.
                if (nodeList != null && nodeList.Count > 11)
                {
                    // Cell index 8 is not used.
                    AllInnings.Add(new InningsPlayed()
                    {
                        Name = nodeList[0].InnerText,
                        Runs = nodeList[1].InnerText,
                        Minutes = nodeList[2].InnerText,
                        BallsFaced = nodeList[3].InnerText,
                        Fours = nodeList[4].InnerText,
                        Sixes = nodeList[5].InnerText,
                        StrikeRate = nodeList[6].InnerText,
                        Innings = nodeList[7].InnerText,
                        Opposition = nodeList[9].InnerText,
                        Ground = nodeList[10].InnerText,
                        StartDate = nodeList[11].InnerText
                    });
                }
            }
        }
        DumpToFile(AllInnings, fileName);
    }

    private void DumpToFile(List<InningsPlayed> AllInnings, string fileName)
    {
        // The using block closes (and flushes) the file even if a write fails.
        using (StreamWriter writer = new StreamWriter(fileName))
        {
            int iterations = 0;
            foreach (var inning in AllInnings)
            {
                writer.WriteLine(string.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10}", inning.Name, inning.Runs, inning.Minutes, inning.BallsFaced, inning.Fours, inning.Sixes, inning.StrikeRate, inning.Innings, inning.Opposition, inning.Ground, inning.StartDate));
                iterations++;
                // Flush every 10 rows so partial output survives a crash.
                if (iterations % 10 == 0)
                    writer.Flush();
            }
        }
    }

    private string ReadWebPage(string Url)
    {
        // Request the page and read the whole response body as a string.
        WebRequest request = WebRequest.Create(Url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

// One row of the innings table; every column is kept as raw text.
class InningsPlayed
{
    public string Name { get; set; }
    public string Runs { get; set; }
    public string BallsFaced { get; set; }
    public string Minutes { get; set; }
    public string Fours { get; set; }
    public string Sixes { get; set; }
    public string StrikeRate { get; set; }
    public string Innings { get; set; }
    public string Opposition { get; set; }
    public string Ground { get; set; }
    public string StartDate { get; set; }
}
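
One thing the listing above does not handle is a field that itself contains a comma or a quote, which would shift the columns in the CSV. A possible fix, sketched here under the assumption that RFC 4180 style quoting is acceptable to the spreadsheet, is a small helper; `CsvField`, `Escape` and `Join` are hypothetical names and the sample values are made up.

```csharp
using System;

static class CsvField
{
    // Wraps a field in double quotes when it contains a comma, a quote or a
    // line break, and doubles any embedded quotes. Other values pass through.
    public static string Escape(string value)
    {
        if (value == null) return "";
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes) return value;
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Escapes every field and joins them into a single CSV line.
    public static string Join(params string[] fields)
    {
        return string.Join(",", Array.ConvertAll<string, string>(fields, Escape));
    }

    static void Main()
    {
        // Hypothetical row: only the first field needs quoting.
        Console.WriteLine(Join("Player, A", "101", "Some Ground"));
    }
}
```

The `WriteLine` call in `DumpToFile` could then pass its fields through `Join` instead of a fixed format string.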
Oct 11, 2011

Selectors often face a dilemma when a budding talent plays a few matches but fails to justify his promise with his performances. To find out how much rope selectors gave players who ended up performing quite well, I pulled the figures of all batsmen who averaged above 40 for India (qualifier: 20 matches). 14 players made the cut overall. I then calculated their cumulative averages over their careers to see how they fared with each Test match.
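
A cumulative average of this kind can be sketched as follows, assuming the usual batting average definition of total runs divided by number of dismissals. The sketch works innings by innings rather than match by match, and `CareerAverage` and the sample scores are made up for illustration.

```csharp
using System;

static class CareerAverage
{
    // runs[i] is the score in innings i; dismissed[i] is false for a not-out.
    // Returns the career average after each innings, or NaN while the
    // batsman has not yet been dismissed (the average is undefined then).
    public static double[] Cumulative(int[] runs, bool[] dismissed)
    {
        if (runs.Length != dismissed.Length)
            throw new ArgumentException("runs and dismissed must have the same length");

        var averages = new double[runs.Length];
        int totalRuns = 0;
        int dismissals = 0;
        for (int i = 0; i < runs.Length; i++)
        {
            totalRuns += runs[i];
            if (dismissed[i]) dismissals++;
            averages[i] = dismissals == 0 ? double.NaN : (double)totalRuns / dismissals;
        }
        return averages;
    }

    static void Main()
    {
        // Hypothetical scores: 10, 50 not out, 40.
        double[] result = Cumulative(new[] { 10, 50, 40 }, new[] { true, false, true });
        foreach (double avg in result)
            Console.WriteLine(avg);
    }
}
```

Plotting the returned values against the innings number gives the kind of career curve discussed below.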

First the current crop of batsmen who average above. Sachin, Sehwag and Dravid come in this club. As we see from the pic, Sachin consistently lifted his game over his formative years and broke into the 50s in his 29th match and except for a brief period never went into the 40s again till now.

Now the next rung of current batsmen, who average above 40. Laxman’s case is the most interesting here. He underperformed for nearly his first 40 matches but then turned the tide after the 2001 Australia series. The repeated chances Laxman was given have paid off, and he is one of the most dependable players in the side.

Dada’s selection was an open and shut case following his classy performance in the debut series.

Here are the retired batsmen who averaged above 45. No surprises here: both Azhar and Gavaskar had great performances in their initial matches.

Then the retired batsmen who averaged above 40. The players here who improved considerably over time are Sidhu and Vengsarkar.

So it turns out that, with the exception of Laxman and Sidhu, most players who averaged above 40 did do well in their initial years.

Sep 01, 2011

A few days ago, there was a discussion on a cricket site about who is the best bowler among Waqar, Pollock and Donald. Though very good bowlers in their era, they failed to break into the all-time great list of Marshall, Akram, Ambrose, McGrath and Warne, among a few others.

Donald had sheer pace and menacing swing. Waqar was feared for his toe-crushing yorkers, and Pollock, though not express pace, still had the amazing discipline with which he picked up wickets. So I pulled up their statistics, and this is the country-by-country breakdown of their averages and strike rates.

Waqar’s average is great in Pakistan, Bangladesh, Zimbabwe and Sri Lanka. Maintaining that kind of average while bowling on flat wickets in Pakistan is quite laudable. In India the sample size is too small to draw any meaningful conclusions. His main bugbear seems to be bowling in South Africa, England and Oz. In England and South Africa, under bowler-friendly conditions, Waqar has underperformed by quite a margin. In Australia he averages 40, which is even worse. Donald’s worst bowling performance is in Pakistan, where he averages 32, followed by 28 in Oz. Everywhere else, he averages a commendable sub-25. Pollock has a similar problem in Australia, averaging 34 there. The West Indies and India, where he averages 28 and 27, are the other places where he has not performed at his best.

Let's look at their year-wise performance over their careers.

Waqar’s career: Waqar’s first few years were the cream of his career. From 1990 to 1994, his bowling performances were superlative. It was during this prime that he tormented the English batsmen with his lethal toe-crushers and banana swing. However, he suffered an injury in 1995, after which his performance became wayward. He was in and out of the team over the next few years. Though he gave some inspiring performances in the years that followed, the Waqar of yore was not seen again.

Donald’s career: Donald’s prime was from 1996 to 2000. This was a relatively easier period to bat in than the early 90s, which still favored bowlers to an extent. Still, Donald’s stats don't give that away. His consistency during the prime years was remarkable, with his average going above 20 in just one year. I still remember his delivery to a UAE batsman who had the gall to face Donald without a helmet on. The ball was a lethal bouncer that hit the batsman on the head, and he remarkably escaped without any injuries.

Pollock’s career: Unlike the other two bowlers being discussed, Shaun Pollock was an all-rounder. He didn't have the express pace that Waqar or Donald had, but he was ever the workhorse who was disciplined and bowled a consistent line and length with great swing. He took 421 wickets at an average of 23-odd, which in itself is quite remarkable. Add to that his batting ability and you have a match winner. Pollock’s prime was from 1998 to 2003, barring a slightly poor year in 2002.

It is very difficult to pick the best of these three bowlers. However, my personal choice would be Donald, followed by Waqar and Pollock.

Mar 13, 2011

India’s defeat against South Africa today was very painful, not because of the loss per se, but because we squandered two wonderful opportunities to win it. One in the 40th over where despite being 267 for 1, India collapsed to 296 all out. And the other in the death overs where tighter bowling could have still got us home.

More than the match loss, it was plain disappointing to see Sachin haters dishing out the usual ignorant tripe about how India loses every time he scores a century. After years of unsuccessfully searching for reasons to put down Sachin’s record, given his 21-year career at the pinnacle of cricket, his uncontroversial conduct and his humble image, the haters have settled for frivolous reasons like Sachin’s centuries being unlucky for India.

To firmly discredit such irrational theories, I dug up some stats to determine whether there is truly a correlation between Sachin’s centuries and India’s defeats. My stats don't contain the World Cup data, so they are behind by 3 matches.

Let's do the simplest math first:

India has won 33 times out of 45 (73.33%) when Sachin has scored a century. Two of the matches ended in no result.

Now for the more in-depth analysis. These are the 12 matches that India lost when Sachin scored a century. This is the query I used to analyze them.

select /*,m.scorecard_url,*/ pbs.runs_scored as Sachin_Score,
	   i.runs as Team_Score,
	   CASE when i.match_innings = 1
		    then "Batting_First"
			when i.Match_innings = 2
			then "Chasing"
	   End as "Team position",
	   ceil((pbs.runs_scored/pbs.balls_faced)*100) as Strike_Rate,
	   pbs.departure_score as Departure_Score
	/*   pbs.departure_wickets as Departure_Wickets, */
	   /*pbs.departure_overs as Departure_Overs,
	   round((pbs.departure_score - pbs.arrival_score)/(pbs.departure_overs),2) as RunRate_duringStay,
	   round((i.runs-pbs.departure_score)/(i.overs-pbs.departure_overs),2) as RunRate_afterStay,
	   round(i.overs-pbs.departure_overs,2) as Overs_afterStay,
	   round((i.runs/i.overs),2) as MatchRunRate
	  /* (pbs.runs_scored/i.runs)*100  as Percentage_of_teamScore,*/
	    /*t.Name as opposing_team,
	    Team_Scores.runs as Oppositon_Score
	    from Players p,
              PlayerBattingStats pbs,
			 Matches m,
			 Innings i,
			 Teams t,
			 (select matchid,runs,batting_teamid from Innings) Team_Scores,
		 (Select p1.cname,pbs1.matchid,runs_scored,i1.batting_teamid from Players p1,PlayerBattingStats pbs1,Innings i1
					where = pbs1.playerid and pbs1.matchid = i1.matchid) Oppn_Scores
		where = pbs.playerid and
			  pbs.matchid = and
			  i.matchid = and = Team_Scores.matchid and
		       Team_Scores.batting_teamid <> 6 and = Oppn_Scores.matchid and
			  Oppn_Scores.batting_teamid <> 6 and
			  m.winning_teamid = and
		      m.matchtype = 'ODI' and
			  (m.winning_teamid <> 6 or m.winning_teamid is null) and
			  i.batting_teamid = 6 and
			  pbs.runs_scored >= 100 and
			  p.cname = 'Sachin Tendulkar'
group by pbs.runs_scored ,
			--  order by Percentage_of_teamScore desc;

It is notable that Sachin scored more than 45% of the team’s total runs on 9 of the 12 occasions that India lost, so in effect he was carrying the batting alone. There were only 3 occasions when India was chasing and he was expected to get the team home. On the overwhelming majority, 9 times, India batted first and the bowlers were unable to defend targets ranging from 224 to 328.

The next reason touted is that Sachin bats too slowly on his way to a century. This is also not true. 10 times out of 12, Sachin scored at a strike rate of 80 or above. 8 times out of 12, the strike rate was 90+. When chasing, Sachin’s centuries have come at more than a run a ball.
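
For reference, the two figures used in this argument, strike rate and share of the team score, can be computed as in the sketch below. `InningsMetrics` is a hypothetical helper and the numbers in `Main` are made up, not taken from any of the 12 matches.

```csharp
using System;

static class InningsMetrics
{
    // Runs scored per 100 balls faced; 0 when no ball was faced.
    public static double StrikeRate(int runs, int ballsFaced)
    {
        return ballsFaced == 0 ? 0.0 : 100.0 * runs / ballsFaced;
    }

    // Percentage of the team total contributed by one batsman; 0 for an empty total.
    public static double ShareOfTeamScore(int runs, int teamRuns)
    {
        return teamRuns == 0 ? 0.0 : 100.0 * runs / teamRuns;
    }

    static void Main()
    {
        // Hypothetical innings: 120 runs off 100 balls in a team total of 250.
        Console.WriteLine("Strike rate: " + StrikeRate(120, 100));
        Console.WriteLine("Share of team score: " + ShareOfTeamScore(120, 250) + "%");
    }
}
```

Multiplying by 100.0 before dividing keeps the arithmetic in floating point, which avoids the truncation that integer division would cause.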

Next is the run rate. On 4 of the 12 occasions Sachin carried his bat through the innings and pretty much made sure the team reached a decent total. On the other 8 occasions, the run rate after Sachin lost his wicket was only 0.5 higher than the match run rate. Even this increase can be attributed to aggressive batting in the death overs. When chasing under pressure, there is actually a dip after Sachin loses his wicket, showing the inability of the following batsmen to handle pressure.

There are a lot of other stats that could be shown to reinforce this point, e.g. the Indian bowlers’ economy rate or the misfields and dropped catches by the fielders. At the end of the day, cricket is a team game and it takes more than one player to win a match for a country. A brilliant individual performance can only do so much to get a team near victory; it takes the rest of the team chipping in to cross the finish line. It's about time a few of us stopped blaming one individual for the mistakes of 10 others.