Correlation Between Recruit Rankings and Final Standings in Big Ten Football
December 31, 2013
Following up on my last post about screen scraping college football in F#, I took the next step and analyzed the data that I scraped. I am a big believer in domain-specific language, so ‘rankings’ means the rank that Rivals assigns to how well a school recruits players, and ‘standings’ means a school’s final position in the Big Ten after the games have been played. Rankings are for recruiting; standings are for actually playing the games.
Going back to the code, the first thing I did was separate the standings call from the search for a given school, so that the XmlDocument is loaded once and then searched several times instead of being loaded for each search. This improved performance dramatically:
    // Download the standings page once and return its table as an XmlDocument.
    static member getAnnualConferenceStandings(year:int) =
        let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/" + year.ToString() + "/big-ten-conference"
        let request = WebRequest.Create(Uri(url))
        use response = request.GetResponse()
        use stream = response.GetResponseStream()
        use reader = new IO.StreamReader(stream)
        let htmlString = reader.ReadToEnd()
        let divMarkerStartPosition = htmlString.IndexOf("my-teams-table")
        let tableStartPosition = htmlString.IndexOf("<table", divMarkerStartPosition)
        let tableEndPosition = htmlString.IndexOf("</table", tableStartPosition)
        // The +8 keeps the closing "</table>" tag in the extracted fragment.
        let data = htmlString.Substring(tableStartPosition, tableEndPosition - tableStartPosition + 8)
        let xmlDocument = new XmlDocument()
        xmlDocument.LoadXml(data)
        xmlDocument

    // Find the cell holding the school name and pair it with the cell that follows it.
    static member getSchoolStanding(xmlDocument: XmlDocument, school) =
        let keyNode =
            xmlDocument.GetElementsByTagName("td")
            |> Seq.cast<XmlNode>
            |> Seq.find (fun node -> node.InnerText = school)
        let valueNode = keyNode.NextSibling
        let returnValue = (keyNode.InnerText, valueNode.InnerText)
        returnValue

    // Load the document once, then look up every school in it.
    static member getConferenceStandings(year:int) =
        let xmlDocument = RankingProvider.getAnnualConferenceStandings(year)
        Seq.map(fun school -> RankingProvider.getSchoolStanding(xmlDocument, school)) RankingProvider.schools
        |> Seq.sortBy snd
        |> Seq.toList
        |> List.rev
        |> Seq.mapi(fun index (school, ranking) -> school, index + 1)
        |> Seq.sortBy fst
        |> Seq.toList
Thanks to Valera Kolupaev for showing me how to use mapi in getConferenceStandings() to create a tuple of each school and its position in the list.
I then went to the rankings call and added a way to pare the results down to only the schools I am interested in. That way I can compare individual schools, groups of schools, or the entire conference:
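To illustrate the mapi idea in isolation, here is a minimal sketch. The school names and win counts are made up for illustration and are not scraped data; the sketch also sorts numeric win counts rather than the text scraped from the page.

```fsharp
// Hypothetical sample data: (school, conference wins). Not real results.
let sample = [ "School A", 5; "School B", 8; "School C", 2 ]

// Sort best-first, number the positions starting at 1,
// then sort by name so the list lines up with other per-school lists.
let standings =
    sample
    |> List.sortByDescending snd
    |> List.mapi (fun index (school, _) -> school, index + 1)
    |> List.sortBy fst

printfn "%A" standings
```

The index handed to mapi is zero-based, which is why 1 is added to turn it into a standing.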
    static member getConferenceRankings(year) =
        RankingProvider.schools
        |> Seq.map(fun schoolName -> RankingProvider.getSchoolInSequence(year, schoolName))
        |> Seq.toList

    static member getSchoolInSequence(year, schoolName) =
        RankingProvider.getRecrutRankings(year)
        |> Seq.find(fun (school, rank) -> school = schoolName)
After these two refactorings, my unit tests still ran green so I was ready to do the analysis.
I went back to my correlation project from a couple of weeks ago and copied in the module. The Correlation function takes in two lists of doubles. The first list holds each school’s ranking and the second holds its standing:
    static member getCorrelationBetweenRankingsAndStandings(year, rankings, standings) =
        let ranks = Seq.map(fun (school, rank) -> rank) rankings
        let stands = Seq.map(fun (school, standing) -> standing) standings
        Calculations.Correlation(ranks, stands)

    static member getCorrelation(year:int) =
        let rankings =
            RankingProvider.getConferenceRankings year
            |> Seq.map(fun (school, rank) -> school, Convert.ToDouble(rank))
        let standings =
            RankingProvider.getConferenceStandings(year + RankingProvider.yearDifferenceBetwenRankingsAndStandings)
            |> Seq.map(fun (school, standing) -> school, Convert.ToDouble(standing))
        let correlation = RankingProvider.getCorrelationBetweenRankingsAndStandings(year, rankings, standings)
        (year, correlation)
A couple of things to note:
1) This function assumes that both the rankings and the standings are the same length and are in order by school name. A production application would check this as part of standard argument validation.
2) I used Convert.ToDouble() to convert the Int32 rankings to the Double values that the correlation function expects. Having these .NET assemblies available at key points in the application really moved things along.
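The argument validation mentioned in the first note could look something like the sketch below. It assumes the correlation in question is the Pearson coefficient; `pearson` and `correlateBySchool` are hypothetical names used for illustration and are not the Calculations module from the earlier post.

```fsharp
// Pearson correlation of two equal-length sequences (a sketch).
// Returns NaN when either input has zero variance.
let pearson (xs: seq<float>) (ys: seq<float>) =
    let x = Array.ofSeq xs
    let y = Array.ofSeq ys
    if x.Length <> y.Length then invalidArg "ys" "Both inputs must have the same length."
    if x.Length < 2 then invalidArg "xs" "At least two observations are required."
    let meanX = Array.average x
    let meanY = Array.average y
    let covariance =
        Array.fold2 (fun acc a b -> acc + (a - meanX) * (b - meanY)) 0.0 x y
    let sumSqX = x |> Array.sumBy (fun a -> (a - meanX) * (a - meanX))
    let sumSqY = y |> Array.sumBy (fun b -> (b - meanY) * (b - meanY))
    covariance / sqrt (sumSqX * sumSqY)

// Sort both lists by school and check that they cover the same schools
// before handing the numbers to the correlation function.
let correlateBySchool (rankings: (string * float) list) (standings: (string * float) list) =
    let r = rankings |> List.sortBy fst
    let s = standings |> List.sortBy fst
    if List.map fst r <> List.map fst s then
        invalidArg "standings" "Both lists must cover the same schools."
    pearson (List.map snd r) (List.map snd s)
```

Sorting inside the helper removes the dependency on callers passing both lists in school-name order.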
In any event, all that was left was to list the Big Ten schools to analyze, the number of years to analyze, and the year difference between the recruit rankings and the standings from the games they played in.
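One way to drive that kind of run is sketched below. Here `correlationFor` is a hypothetical stand-in for whatever computes a single correlation from a recruiting year and a year difference; it is passed in as a parameter so the sketch does not assume how RankingProvider stores the year difference.

```fsharp
// For each year difference, average the correlation over all recruiting years.
// `correlationFor` is an assumed function: rankingYear -> yearDifference -> correlation.
// `years` must not be empty, because List.averageBy throws on an empty list.
let averageByYearDifference (correlationFor: int -> int -> float) (years: int list) (differences: int list) =
    let averageFor difference =
        years |> List.averageBy (fun year -> correlationFor year difference)
    differences |> List.map (fun difference -> difference, averageFor difference)
```

The result is a list of (year difference, average correlation) pairs, one per requested difference.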
As a first step, I ran all of the original Big Ten schools with 7 years of recruiting and year differences of 1, 2, 3, and 4 (the 2002 ranking compared to the 2003, 2004, 2005, and 2006 standings, etc.):
The average correlations for the 1-, 2-, 3-, and 4-year differences are .3303, .2650, .5138, and .6065.
And so yeah, there is a really strong correlation between a recruiting ranking and the outcome on the field. Also, a class seems to have its greatest impact in its senior year, which makes sense. I don’t have a hypothesis for why the correlation drops in the sophomore year. Perhaps the ‘impact freshmen’ leave after one year?
Also of interest, the correlation is not uniform across schools. If you only look at the schools that have an emphasis on academics, the correlation drops significantly, all the way to a negative correlation!
For those schools, the average correlations for the 1-, 2-, 3-, and 4-year differences are .1485, -.1446, -.2817, and -.0381.
So here is another great reason to create the new Big Ten: sometimes a really good recruiting class does not do well on the field, and other times a poorly ranked recruiting class does well on the field. This kind of unpredictability is both exciting and probably much more likely to bring in the casual fans.
Based on this analysis, here is what is going to happen in the Big Ten next year:
- Michigan State and Ohio State will be the leaders
- Michigan and Penn State are in the best position to beat Michigan State and Ohio State
But you didn’t need a statistical analysis to tell you that. The key surprises from this analysis are:
- Nebraska will have a significant improvement in the standings in 2014
- Indiana will have a significant improvement in the standings in 2015 and 2016
As a final note, I got this after making a bunch of requests to Yahoo:
So I wonder if I hit the page too many times and my IP was flagged as a bot? I waited a day for the server to reset before finishing my analysis. Perhaps this is a case where I should get the data while the getting is good and save their pages locally?
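A minimal sketch of that local-copy idea follows. `getCached` is a hypothetical helper, and the download itself is passed in as a function, so nothing here assumes a particular site or request API. The key is assumed to be a file-system-safe name such as "rankings-2002".

```fsharp
open System.IO

// Return the saved copy of a page when one exists;
// otherwise run `download` once and save its result for next time.
let getCached (cacheDir: string) (key: string) (download: unit -> string) =
    Directory.CreateDirectory(cacheDir) |> ignore
    let path = Path.Combine(cacheDir, key + ".html")
    if File.Exists(path) then
        File.ReadAllText(path)
    else
        let html = download ()
        File.WriteAllText(path, html)
        html
```

With something like this in front of each request, repeated analysis runs read from disk and each page is requested from the remote server only once.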