Methodology
A transparent look at the hierarchical Bayesian rating model that powers every RPI ranking, prediction, and credible interval.
Overview
The RPI is a hierarchical Bayesian latent-strength model fitted by Markov Chain Monte Carlo (NUTS) over every recorded race in a multi-season window. Each crew has a posterior distribution over its true skill, not a point estimate — so every published rating comes with a 90% credible interval reflecting how much data has actually constrained that crew.
Crucially, the model fits every team's skill jointly from the whole race graph. League and club strengths are themselves parameters the model learns, so beating a strong-league crew lifts you automatically — no hand-tuned bonuses required.
The Model
Every crew has a latent skill θ in units of seconds-per-500m of pace. For each race r we observe each crew's pace as a residual around a race-level intercept α_r:

pace_c,r ~ Normal(α_r − θ_eff,c, σ_effective)

where θ_eff includes a per-team distance-specialization term, α_r is anchored to a per-distance-bucket baseline pace μ_pace[d] (so the model can't silently absorb cluster strength into the intercept), and σ_effective scales with race type, conditions, importance tier, and distance bucket.
The display rating shown across the site is RPI = 1500 + 50 × E[θ], with the 90% credible interval drawn from the posterior's 5th and 95th percentiles.
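The mapping from posterior draws to the published number can be sketched as follows (the `display_rating` helper and the synthetic sample vector are illustrative, not the production code):

```python
import numpy as np

def display_rating(theta_samples):
    """Map posterior skill samples to the published RPI scale.

    RPI = 1500 + 50 * E[theta]; the 90% credible interval comes from
    the 5th and 95th percentiles of the posterior draws.
    """
    rpi = 1500.0 + 50.0 * np.mean(theta_samples)
    lo, hi = 1500.0 + 50.0 * np.percentile(theta_samples, [5, 95])
    return rpi, (lo, hi)

# Hypothetical crew whose posterior skill is centred at theta = 2.0
samples = np.random.default_rng(0).normal(loc=2.0, scale=0.3, size=4000)
rating, (lo, hi) = display_rating(samples)
```

A tight posterior (small SD of the draws) produces a narrow chip; a crew with sparse data produces a wide one.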
Hierarchical Priors
Skill is partially pooled at multiple levels — global → league → club → boat-class — so crews with sparse data borrow strength from their peers.
Teams with a `NULL` league fall back to a country-scoped pseudo-league (e.g. `__unknown_USA`) so the pool isn't corrupted by mixing across continents.
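The effect of partial pooling can be illustrated with a one-crew conjugate-normal sketch (a deliberate simplification of the joint MCMC fit; all names and numbers here are hypothetical):

```python
import numpy as np

def pooled_mean(obs, prior_mean, prior_sd, obs_sd):
    """Conjugate-normal posterior mean for one crew's skill.

    With few observations the posterior sits near the league/club prior;
    as evidence accumulates it moves toward the crew's own average.
    """
    prec_prior = 1.0 / prior_sd**2
    prec_data = len(obs) / obs_sd**2
    return (prec_prior * prior_mean + prec_data * np.mean(obs)) / (prec_prior + prec_data)

league_prior = 0.0  # hypothetical league-level mean skill
# Same per-race average, very different amounts of evidence:
sparse = pooled_mean([3.0], prior_mean=league_prior, prior_sd=1.0, obs_sd=2.0)
rich = pooled_mean([3.0] * 25, prior_mean=league_prior, prior_sd=1.0, obs_sd=2.0)
```

The sparse crew is shrunk most of the way back to the league prior, while the well-observed crew's estimate is dominated by its own results.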
Likelihood Components
| Factor | When applied | Weight |
|---|---|---|
| Gaussian pace | Every crew with a finish time in a race that had ≥2 timed crews | Full likelihood |
| Plackett-Luce ranking (timed) | Fully-timed races, applied alongside the pace likelihood | λ = 0.5 (auxiliary) |
| Plackett-Luce ranking (timed, h2h) | Dual races where the rank signal has no field-strength confound | λ = 1.0 |
| Plackett-Luce ranking (untimed) | Mixed or fully-untimed races | Full likelihood |
The Plackett-Luce term is invariant to α_r: it depends only on the order of θ values within a race, so it provides a gradient that always pushes θ_winner > θ_loser even when the pace likelihood's rank signal is corrupted by noisy conditions.
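A minimal sketch of the Plackett-Luce ranking term, showing the α_r-invariance the paragraph above relies on (illustrative code, not the production likelihood):

```python
import math

def plackett_luce_loglik(theta_in_finish_order):
    """Log-probability of an observed finishing order under Plackett-Luce.

    theta_in_finish_order: latent skills listed winner-first. Adding any
    constant to all skills leaves the value unchanged, which is why the
    ranking term cannot absorb the race intercept alpha_r.
    """
    ll = 0.0
    for i in range(len(theta_in_finish_order)):
        rest = theta_in_finish_order[i:]
        m = max(rest)  # log-sum-exp shift for numerical stability
        ll += theta_in_finish_order[i] - (m + math.log(sum(math.exp(t - m) for t in rest)))
    return ll

base = plackett_luce_loglik([2.0, 1.0, 0.0])
shifted = plackett_luce_loglik([12.0, 11.0, 10.0])  # same order, alpha_r shifted out
```

`base` and `shifted` are identical, and any ordering that ranks a stronger θ behind a weaker one scores strictly lower.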
Distance & Importance-Tier Handling
Race distances are bucketed into four physical categories — short sprint (<1600m), standard sprint (1600–2200m), medium (2200–3500m), and head race (≥3500m). Each bucket has its own learned noise multiplier.
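A sketch of the bucketing rule implied by those cut-offs (the bucket names and the handling of exact boundary values are assumptions):

```python
def distance_bucket(distance_m):
    """Map a race distance in metres to its noise-multiplier bucket."""
    if distance_m < 1600:
        return "short_sprint"
    if distance_m < 2200:
        return "standard_sprint"
    if distance_m < 3500:
        return "medium"
    return "head_race"
```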
Importance tiers (1–4) shrink or inflate the per-race noise so championships pull the posterior harder than scrimmages:
| Tier | Examples | σ multiplier |
|---|---|---|
| 1 | Henley Royal, Stotesbury, Youth Nationals, SRAA | 0.70× |
| 2 | Head of the Charles, NEIRA, Crew Classic | 0.85× |
| 3 | Standard regular-season regattas | 1.00× |
| 4 | Scrimmages, time trials | 1.15× |
Tiers come from a structured `events.tier` column when curated, with a keyword fallback for events that haven't been tagged yet.
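The tier resolution might look like the following sketch (the keyword map and function names are hypothetical; only the σ multipliers come from the table above):

```python
TIER_SIGMA = {1: 0.70, 2: 0.85, 3: 1.00, 4: 1.15}

# Hypothetical keyword fallback for events without a curated events.tier value.
KEYWORD_TIERS = [
    (("henley", "nationals", "stotesbury", "sraa"), 1),
    (("charles", "neira", "classic"), 2),
    (("scrimmage", "time trial"), 4),
]

def sigma_multiplier(event_name, curated_tier=None):
    """Resolve an event's importance-tier noise multiplier.

    Prefers the curated tier; otherwise falls back to keyword matching,
    defaulting to the standard regular-season tier (3).
    """
    if curated_tier in TIER_SIGMA:
        return TIER_SIGMA[curated_tier]
    name = event_name.lower()
    for keywords, tier in KEYWORD_TIERS:
        if any(k in name for k in keywords):
            return TIER_SIGMA[tier]
    return TIER_SIGMA[3]
```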
Cross-Season AR(1) Dynamics
Rather than a hard carry-forward at season boundaries, the model uses a learned autoregressive transition between seasons:

θ_s ~ Normal(α_AR · θ_{s−1}, σ_season)

α_AR ~ Beta(4, 2) has a prior mean around 0.67, but the posterior is data-driven — programs with high roster turnover learn a lower α_AR, programs with stable crews learn a higher α_AR. Single-season fits skip the AR block entirely.
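The AR(1) step can be sketched as follows (the helper and σ_innov value are illustrative; the Beta(4, 2) prior mean of 2/3 matches the ≈0.67 quoted above):

```python
import random

ALPHA_PRIOR_MEAN = 4 / (4 + 2)  # Beta(4, 2) mean = 2/3 ≈ 0.667

def season_transition(theta_prev, alpha, sigma_innov, rng):
    """One AR(1) step: next-season skill shrinks toward the pool mean
    (centred at 0 here) by factor alpha, plus fresh innovation noise."""
    return alpha * theta_prev + rng.gauss(0.0, sigma_innov)

rng = random.Random(7)
theta = 2.0
for _ in range(3):  # carry a crew's skill across three season boundaries
    theta = season_transition(theta, alpha=ALPHA_PRIOR_MEAN, sigma_innov=0.5, rng=rng)
```

With α near 1 a program's skill persists almost unchanged; with α near 0 each season starts nearly fresh from the pool prior.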
Inference
The full posterior is sampled with the No-U-Turn Sampler (NUTS) via NumPyro/JAX, running 4 chains of 2 500 warmup + 1 000 sampling iterations each, with a dense mass matrix on the scalar hyperparameter block. A typical men's-V8 fit takes 1–3 minutes on CPU and converges to max R̂ < 1.05 with zero divergences.
Posterior summaries (mean, SD, 5th and 95th percentiles) are persisted to `team_rpi_posterior`; the joint sample matrix lands in `team_rpi_posterior_samples` so the head-to-head matchup endpoint can draw correlated (θ_A, θ_B) pairs and report calibrated win probabilities and margin credible intervals.
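A sketch of how stored joint samples feed a matchup summary (the `matchup_summary` helper and the synthetic correlated draws are illustrative, not the endpoint code):

```python
import numpy as np

def matchup_summary(theta_a, theta_b):
    """Win probability and 90% margin credible interval from joint draws.

    theta_a and theta_b are aligned posterior sample vectors, so any
    correlation between the crews (e.g. a shared league effect) is
    preserved when we difference them.
    """
    diff = theta_a - theta_b
    p_a_wins = float(np.mean(diff > 0))
    lo, hi = np.percentile(diff, [5, 95])
    return p_a_wins, (lo, hi)

rng = np.random.default_rng(1)
shared = rng.normal(0, 0.5, 8000)              # common league effect
a = 1.0 + shared + rng.normal(0, 0.3, 8000)    # crew A draws
b = 0.4 + shared + rng.normal(0, 0.3, 8000)    # crew B draws
p, (lo, hi) = matchup_summary(a, b)
```

Differencing aligned draws cancels the shared component, which is why correlated sampling gives sharper matchup intervals than treating the two marginals as independent.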
Reading a Rating
Every leaderboard rating shows the posterior mean (the single best guess) plus a confidence chip — HIGH, MED, or LOW — driven by the posterior SD. Crews with fewer than 4 rated results in the current season are flagged as provisional: their posterior is mostly the league prior, not their own evidence, so we surface that uncertainty rather than dressing it up as a confident pick.
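A sketch of the chip logic (the SD thresholds here are invented for illustration; only the fewer-than-4-results provisional rule comes from the text above):

```python
def confidence_chip(posterior_sd, n_results, sd_high=0.25, sd_med=0.6):
    """Label a rating's reliability from its posterior SD.

    Crews with fewer than 4 rated results this season are provisional
    regardless of SD: their posterior is mostly the league prior.
    """
    if n_results < 4:
        return "PROVISIONAL"
    if posterior_sd < sd_high:
        return "HIGH"
    if posterior_sd < sd_med:
        return "MED"
    return "LOW"
```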
Backtest Accuracy
We score the live posterior's predictions on every pairwise outcome from the last ~3 months of real races. For each pair (winner, loser), the model's independent-Gaussian win probability is compared with what actually happened. The calibration table on the landing page reads directly off those buckets: when the model says ~70%, the favourite wins ~78% of the time — the model is mildly under-confident in close races, well-calibrated to slightly over-confident on blowouts.
This is an in-sample evaluation — recent races contributed to the posterior they're scored against — so the headline number is a slight upper bound on a strict walk-forward evaluation (which we have running offline at ~68% on a single 28-day holdout). On a 3-year training window any single race contributes <1% of the total evidence, so the bias is small relative to the gain in practicality: measuring fresh predictions doesn't require an MCMC refit queue.
The current Bayes engine outperforms the prior Elo engine on every slice, especially direct head-to-heads where the model has accumulated rich evidence on the matchup. Live calibration numbers are shown on the landing page.
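The backtest's per-pair probability and its calibration bucketing can be sketched as follows (helper names and bucket edges are illustrative):

```python
import math

def gaussian_win_prob(mu_w, sd_w, mu_l, sd_l):
    """P(crew W out-rates crew L) under independent Gaussian posteriors."""
    z = (mu_w - mu_l) / math.sqrt(sd_w**2 + sd_l**2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def calibration_buckets(predictions, outcomes, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Group favourite win probabilities into buckets and compute hit rates."""
    buckets = {}
    for p, won in zip(predictions, outcomes):
        for lo, hi in zip(edges, edges[1:]):
            if lo <= p < hi or (hi == 1.0 and p == 1.0):
                buckets.setdefault((lo, hi), []).append(won)
                break
    return {b: sum(w) / len(w) for b, w in buckets.items()}
```

Comparing each bucket's predicted probability with its realised hit rate is exactly the "says ~70%, wins ~78%" reading described above.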
Known Limitations
- Joint cross-boat-class fits with a shared `club_effect` are infrastructure-ready but currently run per-class in production for compute economy. The joint mode needs ≥2 000/2 000 MCMC samples for convergence on the ~5 000-parameter joint posterior.
- Roster changes within a season are not modeled. Mid-season swaps still appear as the same crew identity.
- Lane bias and course-specific effects are not modeled — pace-margin within each race already cancels course mean, but lane-by-lane variation is folded into σ.
- Within-season recency weighting was trialed and reverted: on this 3-year horizon, the AR(1) cross-season decay plus equal-weight within-season produced the best holdout accuracy.
- Importance tier still has a keyword fallback for events without a curated `events.tier` value.