Methodology

A transparent look at the hierarchical Bayesian rating model that powers every RPI ranking, prediction, and credible interval.

Overview

The RPI is a hierarchical Bayesian latent-strength model fitted by Markov chain Monte Carlo (NUTS) over every recorded race in a multi-season window. Each crew has a posterior distribution over its true skill, not a point estimate — so every published rating comes with a 90% credible interval reflecting how much data has actually constrained that crew.

Crucially, the model fits every team's skill jointly from the whole race graph. League and club strengths are themselves parameters the model learns, so beating a strong-league crew lifts you automatically — no hand-tuned bonuses required.

The Model

Every crew has a latent skill θ in units of seconds-per-500m of pace. For each race we observe pace residuals around a race-level intercept α_r:

pace_obs ~ Normal(α_r − θ_eff, σ_effective)

where θ_eff includes a per-team distance-specialization term, α_r is anchored to a per-distance-bucket baseline pace μ_pace[d] (so the model can't silently absorb cluster strength into the intercept), and σ_effective scales with race type, conditions, importance tier, and distance bucket.
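
In NumPyro terms the observation model reduces to a single sample statement. Here is a minimal sketch, assuming hypothetical index arrays (race_idx, crew_idx) and precomputed θ_eff and σ_effective vectors; the production site names may differ:

    import numpyro
    import numpyro.distributions as dist

    def pace_likelihood(alpha_r, theta_eff, sigma_effective,
                        race_idx, crew_idx, observed_pace):
        # Each observed pace is Normal around its race intercept minus the
        # crew's effective skill; sigma_effective is per-observation.
        numpyro.sample(
            "pace_obs",
            dist.Normal(alpha_r[race_idx] - theta_eff[crew_idx], sigma_effective),
            obs=observed_pace,
        )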

The display rating shown across the site is RPI = 1500 + 50 × E[θ], with the 90% credible interval drawn from the posterior's 5th and 95th percentiles.
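
Concretely, mapping one crew's posterior draws to the published number is a two-liner (a sketch; theta_samples stands for that crew's posterior skill samples):

    import numpy as np

    def display_rating(theta_samples):
        # Posterior mean maps to the headline RPI; the 5th/95th percentiles
        # map to the 90% credible interval (scale 50 per the formula above).
        rpi = 1500 + 50 * np.mean(theta_samples)
        lo, hi = 1500 + 50 * np.percentile(theta_samples, [5, 95])
        return rpi, (lo, hi)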

Hierarchical Priors

Skill is partially pooled at multiple levels — global → league → club → boat-class — so crews with sparse data borrow strength from their peers:

    τ_global ~ HalfNormal(0.7)          # cross-league spread
    μ_league ~ Normal(0, τ_global)      # one per league
    τ_league ~ HalfNormal(1.5)          # within-league team SD
    club_effect ~ Normal(0, τ_club)     # shared across boat classes per club
    boat_offset ~ Normal(0, τ_boat)     # per-(team, boat-class) deviation
    θ = μ_league + club_effect + boat_offset + τ_league × z

Teams with NULL league fall back to a country-scoped pseudo-league (e.g. __unknown_USA) so the pool isn't corrupted by mixing across continents.
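
In NumPyro this prior stack looks roughly as follows. The HalfNormal(0.5) scales on τ_club and τ_boat are illustrative assumptions (the spec above doesn't state them), and the index arrays mapping teams to leagues, clubs, and boat-class entries are hypothetical:

    import numpyro
    import numpyro.distributions as dist

    def skill_prior(league_idx, club_idx, boat_idx,
                    n_leagues, n_clubs, n_boat_entries, n_teams):
        tau_global = numpyro.sample("tau_global", dist.HalfNormal(0.7))
        mu_league = numpyro.sample(
            "mu_league", dist.Normal(0.0, tau_global).expand([n_leagues]))
        tau_league = numpyro.sample("tau_league", dist.HalfNormal(1.5))

        # Scales below are illustrative; only the distributions named above
        # come from the published spec.
        tau_club = numpyro.sample("tau_club", dist.HalfNormal(0.5))
        club_effect = numpyro.sample(
            "club_effect", dist.Normal(0.0, tau_club).expand([n_clubs]))
        tau_boat = numpyro.sample("tau_boat", dist.HalfNormal(0.5))
        boat_offset = numpyro.sample(  # one entry per (team, boat-class) pair
            "boat_offset", dist.Normal(0.0, tau_boat).expand([n_boat_entries]))

        # Non-centered parameterization: unit-normal z scaled by tau_league.
        z = numpyro.sample("z", dist.Normal(0.0, 1.0).expand([n_teams]))
        theta = (mu_league[league_idx] + club_effect[club_idx]
                 + boat_offset[boat_idx] + tau_league * z)
        return numpyro.deterministic("theta", theta)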

Likelihood Components

| Factor | When applied | Weight |
| --- | --- | --- |
| Gaussian pace | Every crew with a finish time in a race that had ≥2 timed crews | Full likelihood |
| Plackett-Luce ranking (timed) | Fully-timed races, applied alongside the pace likelihood | λ = 0.5 (auxiliary) |
| Plackett-Luce ranking (timed, head-to-head) | Dual races where the rank signal has no field-strength confound | λ = 1.0 |
| Plackett-Luce ranking (untimed) | Mixed or fully-untimed races | Full likelihood |

The Plackett-Luce term is invariant to α_r: it depends only on the order of θ values within a race, so it provides a gradient that always pushes θ_winner > θ_loser even when the pace likelihood's rank signal is corrupted by noisy conditions.
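
The invariance is easy to see from the sequential form of the Plackett-Luce log-likelihood: at each stage the current finisher beats everyone still behind them, and any shared offset cancels inside each term. A minimal SciPy sketch (the function is ours, not the production code):

    from scipy.special import logsumexp

    def plackett_luce_loglik(theta_ordered):
        # theta_ordered: latent skills sorted by observed finish, winner first.
        # Adding a race intercept alpha_r to every theta leaves each term
        # unchanged, so the ranking term cannot absorb field strength.
        ll = 0.0
        for i in range(len(theta_ordered) - 1):
            # Stage i: the i-th finisher "wins" against the remaining field.
            ll += theta_ordered[i] - logsumexp(theta_ordered[i:])
        return ll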

Distance & Importance-Tier Handling

Race distances are bucketed into four physical categories — short sprint (<1600m), standard sprint (1600–2200m), medium (2200–3500m), and head race (≥3500m). Each bucket has its own learned noise multiplier:

σ_dist[bucket] ~ LogNormal(0, 0.3)

Importance tiers (1–4) shrink or inflate the per-race noise so championships pull the posterior harder than scrimmages:

| Tier | Examples | σ multiplier |
| --- | --- | --- |
| 1 | Henley Royal, Stotesbury, Youth Nationals, SRAA | 0.70× |
| 2 | Head of the Charles, NEIRA, Crew Classic | 0.85× |
| 3 | Standard regular-season regattas | 1.00× |
| 4 | Scrimmages, time trials | 1.15× |

Tiers come from a structured events.tier column when curated, with a keyword fallback for events that haven't been tagged yet.
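
Assembled per observation, the noise scale then looks something like the sketch below. This assumes the factors combine multiplicatively, which the description above implies but doesn't spell out; sigma_base stands in for the learned race-type/conditions scale:

    # Tier multipliers from the table above; everything else is illustrative.
    TIER_MULT = {1: 0.70, 2: 0.85, 3: 1.00, 4: 1.15}

    def effective_sigma(sigma_base, sigma_dist_bucket, tier):
        # sigma_dist_bucket is the learned LogNormal(0, 0.3) multiplier for
        # the race's distance bucket; the tier multiplier is a fixed constant.
        return sigma_base * sigma_dist_bucket * TIER_MULT[tier]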

Cross-Season AR(1) Dynamics

Rather than a hard carry-forward at season boundaries, the model uses a learned autoregressive transition between seasons:

θ_s = α_AR × θ_{s-1} + (1 − α_AR) × μ_league + τ_season × z_s

α_AR ~ Beta(4, 2) has a prior mean around 0.67, but the posterior is data-driven — programs with high roster turnover learn a lower α, programs with stable crews learn a higher α. Single-season fits skip the AR block entirely.
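
As a NumPyro sketch (the HalfNormal(0.5) scale on τ_season is an assumption; only the Beta(4, 2) prior on α_AR is stated above):

    import numpyro
    import numpyro.distributions as dist

    def season_transition(theta_prev, mu_league_per_team, n_teams):
        # Beta(4, 2) has mean 4 / (4 + 2) ≈ 0.67: skill mostly persists.
        alpha_ar = numpyro.sample("alpha_ar", dist.Beta(4.0, 2.0))
        tau_season = numpyro.sample("tau_season", dist.HalfNormal(0.5))
        z_s = numpyro.sample("z_s", dist.Normal(0.0, 1.0).expand([n_teams]))
        # Shrink last season's skill toward the league mean, plus fresh noise.
        return (alpha_ar * theta_prev
                + (1.0 - alpha_ar) * mu_league_per_team
                + tau_season * z_s)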

Inference

The full posterior is sampled with the No-U-Turn Sampler (NUTS) via NumPyro/JAX, running 4 chains of 2,500 warmup + 1,000 samples each with a dense mass matrix on the scalar hyperparameter block. A typical men's V8 fit takes 1–3 minutes on CPU and converges to max R̂ < 1.05 with zero divergences.
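
The sampler configuration, roughly (model and race_data are placeholders for the actual model function and its inputs, and the dense_mass site names are illustrative):

    from jax import random
    from numpyro.infer import MCMC, NUTS

    # Dense mass matrix only over the small scalar-hyperparameter block;
    # the site names here are placeholders, not the production ones.
    kernel = NUTS(model, dense_mass=[("tau_global", "tau_league", "alpha_ar")])
    mcmc = MCMC(kernel, num_warmup=2500, num_samples=1000, num_chains=4)
    mcmc.run(random.PRNGKey(0), **race_data)
    mcmc.print_summary()  # check max R-hat < 1.05 and divergence count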

Posterior summaries (mean, SD, 5th and 95th percentiles) are persisted to team_rpi_posterior; the joint sample matrix lands in team_rpi_posterior_samples so the head-to-head matchup endpoint can draw correlated (θA, θB) pairs and report calibrated win probabilities and margin credible intervals.
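
For the matchup endpoint, a sketch of how aligned joint draws become a summary (ignoring race-day noise for simplicity; the real endpoint may fold in σ_effective):

    import numpy as np

    def matchup_summary(theta_a, theta_b):
        # theta_a, theta_b: aligned posterior draws for crews A and B from
        # the joint sample matrix, so shared league/club correlation is kept.
        margin = theta_a - theta_b  # skill units: seconds per 500m of pace
        p_win_a = float(np.mean(margin > 0))
        lo, hi = np.percentile(margin, [5, 95])  # 90% margin credible interval
        return {"p_win_a": p_win_a, "margin_90ci": (lo, hi)}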

Reading a Rating

Every leaderboard rating shows the posterior mean (the single best guess) plus a confidence chip — HIGH, MED, or LOW — driven by the posterior SD. Crews with fewer than 4 rated results in the current season are flagged as provisional: their posterior is mostly the league prior, not their own evidence, so we surface that uncertainty rather than dressing it up as a confident pick.

Backtest Accuracy

We score the live posterior's predictions on every pairwise outcome from the last ~3 months of real races. For each pair (winner, loser), the model's independent-Gaussian win probability is compared with what actually happened. The calibration table on the landing page reads directly off those buckets: when the model says ~70%, the favourite wins ~78% of the time — the model is mildly under-confident in close races, well-calibrated to slightly over-confident on blowouts.
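
The bucketing itself is simple; a sketch (bin count and names are ours):

    import numpy as np

    def calibration_buckets(p_favourite, favourite_won, n_bins=10):
        # p_favourite: model win probability for the favourite (>= 0.5);
        # favourite_won: 1 where the favourite actually won, else 0.
        bins = np.clip((p_favourite * n_bins).astype(int), 0, n_bins - 1)
        for b in np.unique(bins):
            mask = bins == b
            print(f"pred {p_favourite[mask].mean():.2f}  "
                  f"obs {favourite_won[mask].mean():.2f}  n={mask.sum()}")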

This is an in-sample evaluation — recent races contributed to the posterior they're scored against — so the headline number is a slight upper bound on a strict walk-forward evaluation (which we have running offline at ~68% on a 28-day single holdout). On a 3-year training window any single race contributes <1% of the total evidence, so the bias is small relative to the gain in practicality: measuring fresh predictions doesn't require an MCMC refit queue.

The current Bayes engine outperforms the prior Elo engine on every slice, especially direct head-to-heads where the model has accumulated rich evidence on the matchup. Live calibration numbers are shown on the landing page.

Known Limitations

  • Joint cross-boat-class fits with shared club_effect are infrastructure-ready but currently run per-class in production for compute economy. The joint mode needs ≥2,000/2,000 MCMC samples for convergence on the ~5,000-parameter joint posterior.
  • Roster changes within a season are not modeled. Mid-season swaps still appear as the same crew identity.
  • Lane bias and course-specific effects are not modeled — pace-margin within each race already cancels course mean, but lane-by-lane variation is folded into σ.
  • Within-season recency weighting was trialed and reverted: on this 3-year horizon, the AR(1) cross-season decay plus equal-weight within-season produced the best holdout accuracy.
  • Importance tier still has a keyword fallback for events without a curated events.tier value.