I created a formula for picking winners and losers in the NCAA tournament, and I'd like your feedback. So far, it has outperformed the raw overall Ken Pomeroy rankings and several pundits (Eisenberg, Bilas, and "Chalk," even though that's not technically a pundit). It didn't do as well in the later rounds (4/8, 1/4, and 1/2 respectively), but I think that's because a) luck plays a much bigger role in those games, and b) the figures don't include the 2 or 3 games each team has played in the tournament up to that point (although including those games would probably have little impact).
In essence, all it does is use the season's box score statistics to calculate how many points a team should score per possession, on average, and how many points a team should allow per possession, on average.
Each possession can end one of three ways: a score, a turnover, or another possession. For my calculations, I counted a miss followed by a defensive rebound by the opponent as a turnover, while a miss followed by an offensive rebound results in another possession. That lets me build a formula from several advanced metrics to calculate what "should" happen on each possession. Using nothing but box score stats, I can calculate a team's True Shooting% (TS%), Offensive Rebounding Rate (ORB%), Defensive Rebounding Rate (DRB%), and Turnover Percentage (TOV%), which are all the values I need to model a given possession.
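For reference, here is a minimal Python sketch of how these four metrics are commonly computed from season totals. The formulas are the widely used textbook definitions, not necessarily the exact ones behind the figures in this post, and the 0.44 free-throw weight is a common convention (some college sources use 0.475). The function names and the season totals are made up purely for illustration.

```python
def true_shooting_pct(pts, fga, fta, ft_weight=0.44):
    """Points scored per two 'true' shooting attempts (FGA plus weighted FTA)."""
    return pts / (2 * (fga + ft_weight * fta))


def off_rebound_pct(orb, opp_drb):
    """Share of available offensive rebounds the team collected."""
    return orb / (orb + opp_drb)


def def_rebound_pct(drb, opp_orb):
    """Share of available defensive rebounds the team collected."""
    return drb / (drb + opp_orb)


def turnover_pct(tov, fga, fta, ft_weight=0.44):
    """Turnovers per play, where a play is a shot attempt or a turnover."""
    return tov / (fga + ft_weight * fta + tov)


# Hypothetical season totals, NOT real data for any team.
team = {"pts": 2800, "fga": 2000, "fta": 800, "orb": 450, "drb": 900, "tov": 450}
opp = {"orb": 350, "drb": 800}

ts = true_shooting_pct(team["pts"], team["fga"], team["fta"])
orb = off_rebound_pct(team["orb"], opp["drb"])
drb = def_rebound_pct(team["drb"], opp["orb"])
tov = turnover_pct(team["tov"], team["fga"], team["fta"])
print(f"TS% {ts:.3f}  ORB% {orb:.3f}  DRB% {drb:.3f}  TOV% {tov:.3f}")
```

Running the same functions on the opponents' totals would give the defensive versions of each number.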
For example, Indiana has a TOV% of 16.36%, so I expect that on each possession there is a 16.36% chance that Indiana turns it over without getting a shot off. If they keep the ball, I treat their TS% of 60.2% as the chance that they score. If they miss, their ORB% of 50.93% is the chance that they get the rebound, at which point they essentially start a new possession. I'll include the figures for each team separately, but you can basically derive my whole formula from this example.
I perform the same calculation on the opponents' box score numbers, which essentially gives me the team's defensive figures: TS% allowed, turnovers forced, and offensive rebounding rate allowed (Indiana allowed a 48% TS%, forced an 18% TOV%, and allowed a 49.17% ORB%). I subtract the points allowed per possession from the points scored per possession, which gives me a raw points per possession (PPP) differential that doesn't take pace or strength of schedule into account.
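One way to sketch the possession model in Python is shown below. It rests on two assumptions that are a reading of the description rather than a confirmed formula: TS% is treated as the probability that a shot attempt scores, and every score is worth 2 points. An offensive rebound restarts the possession, so the expected value E satisfies E = (1 - TOV) * (TS * 2 + (1 - TS) * ORB * E), which is solved for E in closed form. The function name `points_per_possession` is hypothetical.

```python
def points_per_possession(tov_pct, ts_pct, orb_pct, points_per_score=2.0):
    """Expected points on one possession under the simple three-outcome model.

    Assumption: ts_pct is used as a make probability and each score is
    worth points_per_score points.
    """
    keep = 1.0 - tov_pct                         # no turnover, a shot goes up
    score = keep * ts_pct                        # shot goes up and scores
    retry = keep * (1.0 - ts_pct) * orb_pct      # miss, offensive rebound, start over
    # Solve E = points * score + retry * E for E.
    return points_per_score * score / (1.0 - retry)


# Indiana's figures as quoted in the post.
offense = points_per_possession(tov_pct=0.1636, ts_pct=0.602, orb_pct=0.5093)
defense = points_per_possession(tov_pct=0.18, ts_pct=0.48, orb_pct=0.4917)
raw_ppp_diff = offense - defense
print(f"scored {offense:.3f}, allowed {defense:.3f}, diff {raw_ppp_diff:.3f}")
```

With the Indiana figures, this works out to roughly 1.21 points scored and 1.00 allowed per possession, a differential of about 0.22.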
The next step was to adjust this for strength of schedule (SOS). I didn't have time to calculate it on my own and, frankly, I'm not sure how I would. I used the strength of schedule figure from College Basketball Reference (CBR) for each team, but in the future I'd like to revisit this and try out a couple of strength of schedule calculations to see what makes the most sense. If anyone has ideas on how to approach this, I'd really like to see them too.
The CBR SOS "is denominated in points above/below average". Indiana has an SOS figure of 8.4, which implies that against a schedule of average opponents, Indiana would have scored 8.4 more points. I'm not sure if this value is per game, but that's what I assumed. Since my PPP calculation is per possession (by definition), I calculated the average number of possessions each team has per game, using the formula detailed in the CBR glossary.
In the end, I have a PPP differential, an SOS figure, and an average number of possessions per game. For example, Indiana has a raw PPP differential of .22, meaning that they score about .22 more points per offensive possession than they give up per defensive possession. They would have scored 8.4 more points per game against a schedule of average opponents (the SOS figure), and they average 66.67 possessions per game. These figures are combined into a PPP figure that's adjusted for both strength of schedule and pace. Indiana ends up with an adjusted PPP figure of about .35, meaning that against an average opponent, they would outscore them by 35 points per 100 possessions, or 35 * .6667 ≈ 23 points per game. This allows us to rank each team against a common baseline that takes pace and strength of schedule into account.
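The post doesn't spell out exactly how the three figures are combined. A minimal sketch, assuming the per-game SOS figure is simply divided by possessions per game and added to the raw differential, reproduces the Indiana number. The function name `adjusted_ppp` is hypothetical.

```python
def adjusted_ppp(raw_ppp_diff, sos_points_per_game, possessions_per_game):
    """Raw PPP differential plus the SOS figure converted to a per-possession basis.

    Assumption: SOS is in points per game, so dividing by possessions
    per game puts it on the same scale as the PPP differential.
    """
    return raw_ppp_diff + sos_points_per_game / possessions_per_game


# Indiana's figures as quoted in the post.
possessions = 66.67
adj = adjusted_ppp(raw_ppp_diff=0.22, sos_points_per_game=8.4,
                   possessions_per_game=possessions)
margin_per_100 = adj * 100           # points per 100 possessions
margin_per_game = adj * possessions  # points per game at Indiana's pace
print(f"adjusted PPP {adj:.3f}, per 100 {margin_per_100:.1f}, per game {margin_per_game:.1f}")
```

With Indiana's numbers, 8.4 / 66.67 adds about .13 to the raw .22, giving about .35. Because the SOS term is divided by possessions, a faster-paced team gets a smaller SOS bump for the same schedule.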
First, I'd like to point out a few weaknesses. The more possessions a team has, the lower the impact of its strength of schedule figure. I'm not sure whether this makes sense, but it's necessary since we are comparing teams at the possession level. Ideally, the SOS figure would already have a possessions component embedded in it, but that isn't guaranteed. Also, the strength of schedule rating heavily favors teams from the major conferences: Colorado St., Gonzaga, and VCU are the only non-major teams in the top 15 by this ranking. Colorado St. has a 63% ORB%, which is the highest in this pool, but since they never played against a Jeff Withey or Mason Plumlee, we can't exactly translate their production to the NCAA tournament. In a similar vein, given the exact same team box score, this method weights a win over Louisville the same as a win over Grambling St., except for the impact that playing these teams would have on the SOS figure.
With these caveats in mind, it has Louisville, Florida, Wisconsin, and Indiana as the top 4 teams, followed by Pittsburgh, Ohio State, Michigan, and Colorado St. These rankings have Gonzaga at #9, and Kansas at #20. Considering the teams in the Final Four, Syracuse is #12 and Wichita St. is #19.
We can compare these ranks to Ken Pomeroy's rankings (which also take into account SOS, pace, and offensive and defensive efficiency). He has Louisville, Florida, Indiana, and Gonzaga as his top 4 teams, followed by Michigan, Ohio St., Syracuse, and Duke. He has Kansas at #10, a full 10 spots higher than I do, Pittsburgh at #11, and Colorado St. all the way down at #30. The differences in our rankings only resulted in me making two more correct picks than him: Colorado St. over Missouri, and Arizona over New Mexico (Arizona's opponent actually turned out to be Harvard). My rankings also favored Michigan St. over Duke.
From here, I think the biggest improvement I can make is to the strength of schedule adjustment. Wichita St., for example, has a better raw PPP than Ohio St., but the schedule Ohio St. faced improved its ranking in this system (it's interesting to note that Wisconsin has a better ranking than either team, too). I need a way to weight good performances against strong opponents higher than great performances against average opponents, and maybe a way to weight recent games more heavily.
What do you think? Any suggestions?