Home / Rating Background

What is Bayesian Rating?

A short tour of the statistical idea behind every RPI ranking — how we turn race results into probability distributions, not just numbers.

Origins in Probability

Bayesian statistics is named after Reverend Thomas Bayes, an 18th-century English minister whose posthumous 1763 essay laid out a simple but radical idea: probability isn't just a property of dice and coins — it's a way of describing what we believe about the world, and how those beliefs should change when we see new evidence. For two centuries the idea sat at the edge of statistics, until modern computing made it practical for everything from political forecasting to clinical trial design to the rating model running this site.

The Core Idea

Bayesian inference combines two things — a prior (what we believed before seeing the data) and a likelihood (how likely the data is under each possible explanation) — to produce a posterior (an updated belief that incorporates the evidence):

P(skill | races) ∝ P(races | skill) × P(skill)

Posterior ∝ Likelihood × Prior. The posterior is what we want to know; the likelihood is the model of how races are generated given a team's skill; the prior captures everything else.

The trick is that the result isn't a single “best” number — it's a full distribution over every plausible value of the unknown. We can read off a central estimate (the posterior mean) and how spread out the distribution is (the SD or a credible interval).
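For the special case of a normal prior and a single normally-noisy observation, the update even has a closed form. A minimal sketch, with all numbers invented for illustration:

```python
import math

# Prior belief about a crew's skill: Normal(mean, sd). Numbers are illustrative.
prior_mean, prior_sd = 1500.0, 100.0
# One observed race performance with known observation noise.
obs, obs_sd = 1560.0, 50.0

# Conjugate normal-normal update: combine precisions (1 / variance).
prior_prec = 1.0 / prior_sd**2
obs_prec = 1.0 / obs_sd**2
post_prec = prior_prec + obs_prec

post_mean = (prior_mean * prior_prec + obs * obs_prec) / post_prec
post_sd = math.sqrt(1.0 / post_prec)
# The evidence pulls the estimate toward the observation (here to 1548)
# and shrinks the spread (here from 100 to about 45).
```

The real model has no such closed form, which is why sampling is needed — but the direction of the update is the same: evidence pulls the belief toward the data, and uncertainty shrinks.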

From Numbers to Distributions

A traditional rating gives every crew one number (1500, 1700, …). A Bayesian rating gives every crew a probability distribution — usually a bell-shaped curve. The peak is where we think the crew's true skill most likely lies; the width tells us how confident we are.

On the leaderboard, this shows up as the credible interval next to each rating. A 90% credible interval is the range we're 90% confident covers the truth: tight bands for crews with lots of recent races, wide bands for new or sparsely-raced crews. This is the difference between “1500, who knows” and “1500, definitely between 1490 and 1510” — and the site is honest about which is which.
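Reading a central credible interval off posterior draws is just a percentile computation. A sketch with simulated draws (the ratings, spreads, and sample counts are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated posterior draws for two crews with the same central estimate:
# a well-raced crew with a tight posterior and a new crew with a wide one.
veteran = rng.normal(1500, 10, size=10_000)
newcomer = rng.normal(1500, 60, size=10_000)

def credible_interval(samples, mass=0.90):
    """Central credible interval: the middle `mass` of the posterior draws."""
    lo_pct = (1 - mass) / 2 * 100
    hi_pct = (1 + mass) / 2 * 100
    lo, hi = np.percentile(samples, [lo_pct, hi_pct])
    return lo, hi

v_lo, v_hi = credible_interval(veteran)
n_lo, n_hi = credible_interval(newcomer)
# Same point estimate, very different honesty about uncertainty:
# the newcomer's band is several times wider than the veteran's.
```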

Hierarchical Pooling

Bayesian models can stack. Each crew's skill draws from a league-level distribution, which itself draws from a global distribution:

global mean
  ↳ league mean (one per league)
    ↳ club mean (shared across boat classes)
      ↳ crew skill (per team × boat class)

This is called partial pooling. Crews with a long race history are estimated mostly from their own results. Crews with very little data lean more heavily on their league's average — borrowing strength from their peers without being replaced by them. And the league means themselves are parameters the model learns, so the data tells us which leagues are stronger overall, instead of us guessing with hand-tuned bonuses.
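The shrinkage behind partial pooling can be sketched with the standard normal-normal formula: a precision-weighted blend of a crew's own results and its league's average. This is a simplification — the real model learns the league means and variances jointly rather than taking them as fixed — and every number below is made up:

```python
def pooled_estimate(crew_mean, n_races, race_var, league_mean, league_var):
    """Partial pooling as precision-weighted shrinkage: more races means
    more weight on the crew's own average, fewer means more weight on
    the league mean it is drawn from."""
    data_prec = n_races / race_var       # precision of the crew's own average
    league_prec = 1.0 / league_var       # precision of the league-level prior
    w = data_prec / (data_prec + league_prec)
    return w * crew_mean + (1 - w) * league_mean

# Same raw average, very different race counts (illustrative numbers).
veteran = pooled_estimate(1600, n_races=30, race_var=2500,
                          league_mean=1500, league_var=2500)
rookie = pooled_estimate(1600, n_races=1, race_var=2500,
                         league_mean=1500, league_var=2500)
# The veteran keeps almost all of its 1600; the rookie is pulled
# exactly halfway back toward the league mean, to 1550.
```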

How We Actually Compute It

For all but the simplest models, the posterior can't be written down in closed form — there's no neat formula. Instead we use Markov Chain Monte Carlo (MCMC): an algorithm that walks through the space of possible parameter values and stays longer in regions that are more consistent with the data. After enough steps, the collection of visited points is the posterior — every crew's rating is a cloud of thousands of plausible values, weighted by how well they fit every race in the dataset.

The RPI uses the No-U-Turn Sampler (NUTS), a modern MCMC variant that adapts its own step size and trajectory length. A typical fit of one boat class — hundreds of teams, thousands of races — takes a couple of minutes on a single CPU.
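NUTS itself is intricate, but the core MCMC idea fits in a few lines of random-walk Metropolis — a much simpler sampler than the one the RPI uses, shown here on a toy one-parameter model with invented numbers:

```python
import math
import random

random.seed(42)

def log_post(skill):
    """Toy unnormalized log-posterior: Normal(1500, 50) prior times a
    single Normal(1560, 50) race likelihood (illustrative numbers)."""
    return -((skill - 1500) ** 2 + (skill - 1560) ** 2) / (2 * 50**2)

# Random-walk Metropolis: propose a nearby value, accept with probability
# min(1, posterior ratio). The chain spends more time where the posterior
# is high, so the visited points trace out the posterior itself.
samples, skill = [], 1500.0
for _ in range(20_000):
    proposal = skill + random.gauss(0, 20)
    if math.log(random.random()) < log_post(proposal) - log_post(skill):
        skill = proposal
    samples.append(skill)

burned = samples[2_000:]            # discard warm-up steps
est = sum(burned) / len(burned)     # lands near 1530, the true posterior mean
```

NUTS replaces the blind random-walk proposal with gradient-guided trajectories, which is what makes fitting hundreds of correlated team parameters feasible in minutes rather than hours.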

Why Bayesian Rating Fits Rowing

Scholastic and youth rowing has a few features that this kind of model handles particularly well:

  • Sparse cross-league racing

    Crews from different conferences rarely race directly. Hierarchical pooling lets the model learn each league's overall strength from the few bridge races that do happen — and use that to compare crews who have never met.

  • Wide variation in race count

    A blue-blood program might have 30 races on the books; a brand-new crew might have one. Bayesian credible intervals reflect that difference honestly — no false precision for the new crew, no artificial humility for the established one.

  • Time and rank both matter

    When finish times are available, we use the pace margin. When only finish positions are recorded, we fall back to a Plackett-Luce ranking factor that just enforces winner > loser. The model handles both without separate code paths.

  • Joint inference, not race-by-race

    Every result feeds back into every related team's posterior at once. Beating a strong crew lifts you and slightly lifts your league mean too — so the next crew their league races starts from a tougher baseline.
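The Plackett-Luce ranking factor mentioned above can be sketched as a stagewise likelihood: at each stage, the next finisher is drawn in proportion to its strength among the crews still racing. Crew names and strengths here are invented for illustration:

```python
import math

def plackett_luce_loglik(strengths, finish_order):
    """Log-likelihood of a finishing order under Plackett-Luce: the winner
    is chosen in proportion to strength among all crews, the runner-up in
    proportion to strength among the rest, and so on."""
    ll = 0.0
    for i, finisher in enumerate(finish_order):
        total = sum(strengths[c] for c in finish_order[i:])
        ll += math.log(strengths[finisher] / total)
    return ll

# Strengths as exp(skill), so higher skill means higher strength.
s = {"A": math.exp(2.0), "B": math.exp(1.0), "C": math.exp(0.5)}
likely = plackett_luce_loglik(s, ["A", "B", "C"])
upset = plackett_luce_loglik(s, ["C", "B", "A"])
# likely > upset: orderings consistent with skill get more probability,
# which is exactly the "winner > loser" signal the model extracts.
```

Because the factor only depends on finish order, it slots into the same joint posterior as the time-margin factor — no separate code path, just a different likelihood term per race.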

Beyond Rowing

The same toolbox shows up almost everywhere uncertainty matters. Election forecasters aggregate state polls into national probabilities using hierarchical Bayesian models. Drug regulators use Bayesian methods to adapt clinical trials as data comes in. Tech companies analyze A/B tests with Bayesian decision rules, and astronomers fit Bayesian models to gravitational-wave signals. What unites all of these is the same idea Bayes wrote down in 1763: be explicit about what you knew, be explicit about what the data tells you, and let the math do the rest.

What Came Before

From launch through April 2026 the RPI used a modified Elo system — the same family of ratings used in chess, FIFA football, and FiveThirtyEight's sports models. Elo is a beautifully simple algorithm and got the project a long way, but its single-number ratings couldn't express uncertainty, and its pairwise updates struggled with the sparse cross-league connections that define US scholastic rowing. The Bayesian rewrite addresses both issues directly. The full technical specification of the current model lives on the methodology page.

Read the technical methodology →
View the rankings →