AES Power Rankings 5th Anniversary
This year marks the 5-year anniversary of the Power Rankings. The first ranking files on my computer date back to the fall of 2011, and many iterations have taken place since then. However, the spirit of the rankings has changed very little. At their core, the rankings were developed to give the most accurate assessment of team strength in each age group from top to bottom … and not just to cater to the top teams. While assessing the strength of the top teams is essential to instilling confidence in what the rankings produce, the rankings were primarily intended to help tournament directors when seeding large tournaments. As a former coach, I always found it frustrating to be in a tournament where one or more day-one 4-seeds finished in the top quarter of the tournament. The rankings were created in hopes of mitigating the chances of this happening and improving the overall quality of the tournament experience for everyone.
In order to accomplish this task, I wanted to leverage the power of mathematical ranking models and machine learning algorithms, and for the most part, very little has changed in this philosophy over the past 5 years. I felt this methodology would be more robust than other methods that subjectively assess the strength of a tournament and then assign points based on tournament finishes. The true indication of a team's strength, in my opinion, is not just summary statistics of tournament finishes, but rather the team's entire body of work, i.e., every match played and its score. Without a strong seeding methodology, the "tournament finish" ranking method has a high propensity to perpetuate its own flaws in a vicious cycle.
At a very high level, the club volleyball ranking problem seems very similar to other ranking challenges, such as those seen in chess or collegiate sports. However, the rankings have faced some significant challenges over these last 5 years as they have attempted to mature, and these challenges deviate rather significantly from the examples mentioned above.
Challenge #1: Growth of teams being ranked and the breaking down of some of the ranking algorithms as a result
- At the end of last year, more than 20,000 teams were ranked utilizing data from nearly 500K matches.
- Since most teams play one another very infrequently, the "sparse" interaction between teams leads to challenges when trying to rank effectively.
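To put a rough number on that sparsity, here is a back-of-the-envelope calculation using only the figures above (20,000 teams, ~500K matches); it is an illustrative upper bound, not output from the actual ranking pipeline.

```python
# Back-of-the-envelope check on the sparsity claim: even if every one of the
# ~500K matches were between a distinct pair of teams, only a sliver of all
# possible team pairings would ever be observed.
n_teams, n_matches = 20_000, 500_000
possible_pairs = n_teams * (n_teams - 1) // 2   # ~200 million possible pairings
pair_coverage = n_matches / possible_pairs       # upper bound on pairs observed
print(f"at most {pair_coverage:.2%} of team pairs ever meet")
# prints "at most 0.25% of team pairs ever meet"
```

In practice the true coverage is even lower, since the same pairs of teams often meet more than once within a season.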
Challenge #2: Assessing strength of younger age group teams who play up one or more divisions
- The rankings for the younger age groups actually utilize the results from multiple age groups in order to accurately assess the strength of the top teams. However, without going into details, older teams are not necessarily penalized in their ranking if they happen to lose to teams in younger age divisions.
Challenge #3: Accounting for strength between regions (inter-region strength) early in the season
- In technical terms, this is considered a "cold start" problem and for the most part this is a problem that has not yet been solved within the current Power Ranking infrastructure. As a result, the Power Rankings are really only worth considering after several inter-region "connections" (matches) have been played in each age group.
Challenge #4: Accounting for intra-region strength in the face of power leagues or tiered competition within that region
- This is the condition where gold-level teams primarily play gold-level teams, silver-level teams play other silver-level teams, etc. In this situation, you may have a gold-level team with a losing record who is actually stronger than a top-level silver team. However, since "gold" teams don't really play "silver" teams, it is hard to truly gauge how good a team who consistently plays (and loses to) tougher competition really is. This year, a new algorithm for dealing with this situation will be introduced in the Regional rankings. While still not perfect, it has tested out well during backtests, and I look forward to seeing how it performs in production this year.
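The gold-vs-silver effect can be demonstrated with a toy example. The sketch below is not the actual Power Rankings algorithm (which is not published); it uses a simple Massey-style least-squares rating, where each match contributes one equation of the form rating[winner] - rating[loser] ≈ point margin, and all team names and results are made up. A "gold" team G goes 0-2 against strong opponents, a "silver" team S goes 2-0 against weak ones, and a single cross-tier match connects the two groups.

```python
import numpy as np

# Massey-style least-squares rating on a fictional two-tier schedule.
# A, B, G are "gold" teams; S, W1, W2 are "silver" teams.
A, B, G, S, W1, W2 = range(6)
matches = [  # (winner, loser, point margin)
    (A, G, 2), (B, G, 1), (A, B, 1),       # gold-tier results: G goes 0-2
    (S, W1, 5), (S, W2, 5), (W1, W2, 2),   # silver-tier results: S goes 2-0
    (B, W1, 10),                            # the lone cross-tier match
]
X = np.zeros((len(matches), 6))
y = np.zeros(len(matches))
for i, (w, l, m) in enumerate(matches):
    X[i, w], X[i, l], y[i] = 1.0, -1.0, m

# Small ridge term pins down the otherwise shift-invariant least-squares fit.
lam = 0.1
r = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
# Despite its losing record, G rates well above the undefeated S,
# because the single cross-tier result reveals how strong G's opponents are.
```

Note that without the cross-tier match, the two groups would be disconnected and no algorithm of this family could compare G and S at all, which is exactly why these tiered leagues are such a difficult case.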
Needless to say, when one searches the current ranking literature on how "experts" suggest solving some of the problems mentioned above, very little surfaces. College basketball and football are popular sports written about frequently in the ranking literature, but they do not suffer from these problems. For example, in both of these sports, teams play opponents outside their primary conferences at the beginning of the season. This allows many of the ranking algorithms to get a good sense of conference strength early in the season and then apply this "strength" throughout the season. As you well know, in club volleyball, many teams do not play teams outside their region until the qualifiers, which do not take place until halfway through the season. This is just one example of the ranking differences between club volleyball and other sports.
The challenges mentioned above are the ranking challenges. They are the "fun" challenges. However, the "real" challenges for this whole ranking framework have been making sure the ranking algorithms only receive accurate data. The "real" challenges have been data-related issues. For example:
Challenge #1: Two teams input manual results with different outcomes. What do you do?
If the results are the same but the scores are different, do you keep the results and ignore the scores?
- Answer: Yes … since the Power Rankings utilize different algorithms, this particular match will be used in the "match-result" ranking algorithms and will be ignored in the "match-scores" ranking algorithm
If the results are different (both teams say they win), what do you do?
- Answer: The match result cannot be trusted, so the manual result gets discarded from consideration
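The two rules above can be sketched as a small reconciliation function. This is an illustrative sketch, not the actual Power Rankings code, and the dictionary field names ("winner", "scores") are assumptions made for the example.

```python
def reconcile_manual_pair(report_a, report_b):
    """Reconcile two manually entered reports of the same match.

    Each report is a dict with assumed fields 'winner' and 'scores'.
    Returns (use_in_result_algos, use_in_score_algos, record):
    - different winners        -> contradictory, discard entirely
    - same winner, diff scores -> keep the result, ignore the scores
    - fully consistent         -> usable by both families of algorithms
    """
    if report_a["winner"] != report_b["winner"]:
        return (False, False, None)                       # cannot be trusted
    if report_a["scores"] != report_b["scores"]:
        return (True, False, {"winner": report_a["winner"]})  # result only
    return (True, True, report_a)                         # result and scores
```

The three-way return mirrors the fact that the Power Rankings run both "match-result" and "match-score" algorithms, so a partially consistent report can still contribute to one family while being excluded from the other.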
Challenge #2: Two teams play each other, the results seem to be the same, but the dates are different … what do you do?
- Honestly, dealing with match dates is one of the most frustrating and difficult aspects of the ranking framework. (As an aside, ask any data scientist, and they will probably say dealing with dates is one of the most challenging aspects of their job.) Bottom line: both reported matches have to be trusted, and both are included in the rankings, especially if only one team is reporting each match.
Challenge #3: What do you do with matches when the opposing team code is incorrect?
- Honestly, it gets discarded and doesn’t count toward the rankings. If it is not possible to accurately align reported results to valid teams, then the result has to be considered invalid.
Challenge #4: A team enters a manual result for a match that has already been entered through the AES system.
- This is easy … but only if the manual match result is on the same date as the AES match. The AES match will always be considered the pristine entry, so the manual match will be dropped. However, if the manual match is on a different date than the AES match, then the manual match will be preserved.
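The same-date rule above can be sketched as a simple de-duplication check. Again, this is an illustration of the stated rule rather than the actual pipeline code, and the field names ("teams", "date") are assumptions.

```python
def keep_manual_entry(manual, aes_matches):
    """Decide whether a manually entered match survives de-duplication.

    'manual' and each entry of 'aes_matches' are dicts with assumed fields:
    'teams' (a frozenset of the two team codes) and 'date' (ISO date string).
    """
    for aes in aes_matches:
        if aes["teams"] == manual["teams"] and aes["date"] == manual["date"]:
            return False  # AES entry is the pristine record; drop the duplicate
    return True           # different date (or no AES match at all): preserve it
```

Note the conservative asymmetry: a manual entry on a different date is preserved even if it looks like a re-entry of the AES match, which ties back to how difficult dates make this kind of reconciliation.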
I mention these challenges to remind everyone that the rankings will only be as good as the data they are given. We want the most data possible for the rankings framework, so we want to include the manual match results in the rankings. However, if these match results are deemed to be inaccurate and cannot be trusted, then only the matches present within the AES system will be used for rankings. Attention to detail when manually inputting match results will go a long, long way in improving the rankings. Furthermore, higher-quality manual match input means more time can be spent on the "fun" ranking challenges. Allow me to put it in coaching terms … the "fun" ranking challenges are like coaching in matches, while the "real" challenges are dealing with player/parent/team dynamics off the court.
In conclusion, most of the challenges mentioned above have been addressed within the Power Rankings using novel solutions through much experimentation, backtesting, and a fair amount of trial and error. I will be the first to admit the Power Rankings are far from perfect. They continue to be a work in progress (and improved upon only in my minimally-available spare time). For every new method implemented in the rankings, there are at least 5 that have been coded, tested, and failed to show improvement. Such is the world of data modeling. However, with the patience and feedback from the community, along with an aligned spirit of the intended purpose of the rankings, there is no doubt the rankings will continue to improve on a yearly basis. I would like to end by thanking everyone for their patience, support, and continued interest in the rankings. Best of luck to everyone in the upcoming club season!
All the best,