Methodology

A transparent look at the hierarchical Bayesian rating model that powers every RPI ranking, prediction, and credible interval.

Overview

The RPI is a hierarchical Bayesian latent-strength model fitted by Markov chain Monte Carlo (NUTS) over every recorded race in a multi-season window. Each crew has a posterior distribution over its true skill, not a point estimate — so every published rating comes with a 90% credible interval reflecting how much data has actually constrained that crew.

Crucially, the model fits every team's skill jointly from the whole race graph. League and club strengths are themselves parameters the model learns, so beating a strong-league crew lifts you automatically — no hand-tuned bonuses required.

The Model

Every crew has a latent skill θ in units of seconds-per-500m of pace. For each race we observe pace residuals around a race-level intercept α_r:

pace_obs ~ Normal(α_r − θ_eff, σ_effective)

where θ_eff includes a per-team distance-specialization term, α_r is anchored to a per-distance-bucket baseline pace μ_pace[d] (so the model can't silently absorb cluster strength into the intercept), and σ_effective scales with race type, conditions, importance tier, and distance bucket.
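
In NumPyro terms the observation model reduces to a single sample statement. Here is a minimal sketch, assuming hypothetical index arrays (race_idx, crew_idx) and precomputed θ_eff and σ_effective vectors; the production site names may differ:

    import numpyro
    import numpyro.distributions as dist

    def pace_likelihood(alpha_r, theta_eff, sigma_effective,
                        race_idx, crew_idx, observed_pace):
        # Each observed pace is Normal around its race intercept minus the
        # crew's effective skill; sigma_effective is per-observation.
        numpyro.sample(
            "pace_obs",
            dist.Normal(alpha_r[race_idx] - theta_eff[crew_idx], sigma_effective),
            obs=observed_pace,
        )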

The display rating shown across the site is RPI = 1500 + 50 × E[θ], with the 90% credible interval drawn from the posterior's 5th and 95th percentiles.
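
Concretely, mapping one crew's posterior draws to the published number is a two-liner (a sketch; theta_samples stands for that crew's posterior skill samples):

    import numpy as np

    def display_rating(theta_samples):
        # Posterior mean maps to the headline RPI; the 5th/95th percentiles
        # map to the 90% credible interval (scale 50 per the formula above).
        rpi = 1500 + 50 * np.mean(theta_samples)
        lo, hi = 1500 + 50 * np.percentile(theta_samples, [5, 95])
        return rpi, (lo, hi)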

Hierarchical Priors

Skill is partially pooled at multiple levels — global → league → club → boat-class — so crews with sparse data borrow strength from their peers:

    τ_global ~ HalfNormal(0.7)          # cross-league spread
    μ_league ~ Normal(0, τ_global)      # one per league
    τ_league ~ HalfNormal(1.5)          # within-league team SD
    club_effect ~ Normal(0, τ_club)     # shared across boat classes per club
    boat_offset ~ Normal(0, τ_boat)     # per-(team, boat-class) deviation
    θ = μ_league + club_effect + boat_offset + τ_league × z

Teams with NULL league fall back to a country-scoped pseudo-league (e.g. __unknown_USA) so the pool isn't corrupted by mixing across continents.
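
In NumPyro this prior stack looks roughly as follows. The HalfNormal(0.5) scales on τ_club and τ_boat are illustrative assumptions (the spec above doesn't state them), and the index arrays mapping teams to leagues, clubs, and boat-class entries are hypothetical:

    import numpyro
    import numpyro.distributions as dist

    def skill_prior(league_idx, club_idx, boat_idx,
                    n_leagues, n_clubs, n_boat_entries, n_teams):
        tau_global = numpyro.sample("tau_global", dist.HalfNormal(0.7))
        mu_league = numpyro.sample(
            "mu_league", dist.Normal(0.0, tau_global).expand([n_leagues]))
        tau_league = numpyro.sample("tau_league", dist.HalfNormal(1.5))

        # Scales below are illustrative; only the distributions named above
        # come from the published spec.
        tau_club = numpyro.sample("tau_club", dist.HalfNormal(0.5))
        club_effect = numpyro.sample(
            "club_effect", dist.Normal(0.0, tau_club).expand([n_clubs]))
        tau_boat = numpyro.sample("tau_boat", dist.HalfNormal(0.5))
        boat_offset = numpyro.sample(  # one entry per (team, boat-class) pair
            "boat_offset", dist.Normal(0.0, tau_boat).expand([n_boat_entries]))

        # Non-centered parameterization: unit-normal z scaled by tau_league.
        z = numpyro.sample("z", dist.Normal(0.0, 1.0).expand([n_teams]))
        theta = (mu_league[league_idx] + club_effect[club_idx]
                 + boat_offset[boat_idx] + tau_league * z)
        return numpyro.deterministic("theta", theta)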

Likelihood Components

| Factor | When applied | Weight |
| --- | --- | --- |
| Gaussian pace | Every crew with a finish time in a race that had ≥2 timed crews | Full likelihood |
| Plackett-Luce ranking (timed) | Fully-timed races, applied alongside the pace likelihood | λ = 0.5 (auxiliary) |
| Plackett-Luce ranking (timed, head-to-head) | Dual races where the rank signal has no field-strength confound | λ = 1.0 |
| Plackett-Luce ranking (untimed) | Mixed or fully-untimed races | Full likelihood |

The Plackett-Luce term is invariant to α_r: it depends only on the order of θ values within a race, so it provides a gradient that always pushes θ_winner > θ_loser even when the pace likelihood's rank signal is corrupted by noisy conditions.
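
The invariance is easy to see from the sequential form of the Plackett-Luce log-likelihood: at each stage the current finisher beats everyone still behind them, and any shared offset cancels inside each term. A minimal SciPy sketch (the function is ours, not the production code):

    from scipy.special import logsumexp

    def plackett_luce_loglik(theta_ordered):
        # theta_ordered: latent skills sorted by observed finish, winner first.
        # Adding a race intercept alpha_r to every theta leaves each term
        # unchanged, so the ranking term cannot absorb field strength.
        ll = 0.0
        for i in range(len(theta_ordered) - 1):
            # Stage i: the i-th finisher "wins" against the remaining field.
            ll += theta_ordered[i] - logsumexp(theta_ordered[i:])
        return ll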

Distance & Importance-Tier Handling

Race distances are bucketed into four physical categories — short sprint (<1600m), standard sprint (1600–2200m), medium (2200–3500m), and head race (≥3500m). Each bucket has its own learned noise multiplier:

σ_dist[bucket] ~ LogNormal(0, 0.3)

Importance tiers (1–4) shrink or inflate the per-race noise so championships pull the posterior harder than scrimmages:

| Tier | Examples | σ multiplier |
| --- | --- | --- |
| 1 | Henley Royal, Stotesbury, Youth Nationals, SRAA | 0.70× |
| 2 | Head of the Charles, NEIRA, Crew Classic | 0.85× |
| 3 | Standard regular-season regattas | 1.00× |
| 4 | Scrimmages, time trials | 1.15× |

Tiers come from a structured events.tier column when curated, with a keyword fallback for events that haven't been tagged yet.
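
Assembled per observation, the noise scale then looks something like the sketch below. This assumes the factors combine multiplicatively, which the description above implies but doesn't spell out; sigma_base stands in for the learned race-type/conditions scale:

    # Tier multipliers from the table above; everything else is illustrative.
    TIER_MULT = {1: 0.70, 2: 0.85, 3: 1.00, 4: 1.15}

    def effective_sigma(sigma_base, sigma_dist_bucket, tier):
        # sigma_dist_bucket is the learned LogNormal(0, 0.3) multiplier for
        # the race's distance bucket; the tier multiplier is a fixed constant.
        return sigma_base * sigma_dist_bucket * TIER_MULT[tier]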

Cross-Season AR(1) Dynamics

Rather than a hard carry-forward at season boundaries, the model uses a learned autoregressive transition between seasons:

θ_s = α_AR × θ_{s-1} + (1 − α_AR) × μ_league + τ_season × z_s

α_AR ~ Beta(4, 2) has a prior mean around 0.67, but the posterior is data-driven — programs with high roster turnover learn a lower α, programs with stable crews learn a higher α. Single-season fits skip the AR block entirely.
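
As a NumPyro sketch (the HalfNormal(0.5) scale on τ_season is an assumption; only the Beta(4, 2) prior on α_AR is stated above):

    import numpyro
    import numpyro.distributions as dist

    def season_transition(theta_prev, mu_league_per_team, n_teams):
        # Beta(4, 2) has mean 4 / (4 + 2) ≈ 0.67: skill mostly persists.
        alpha_ar = numpyro.sample("alpha_ar", dist.Beta(4.0, 2.0))
        tau_season = numpyro.sample("tau_season", dist.HalfNormal(0.5))
        z_s = numpyro.sample("z_s", dist.Normal(0.0, 1.0).expand([n_teams]))
        # Shrink last season's skill toward the league mean, plus fresh noise.
        return (alpha_ar * theta_prev
                + (1.0 - alpha_ar) * mu_league_per_team
                + tau_season * z_s)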

Inference

The full posterior is sampled with the No-U-Turn Sampler (NUTS) via NumPyro/JAX, running 4 chains of 2,500 warmup + 1,000 samples each with a dense mass matrix on the scalar hyperparameter block. A typical men's V8 fit takes 1–3 minutes on CPU and converges to max R̂ < 1.05 with zero divergences.
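
The sampler configuration, roughly (model and race_data are placeholders for the actual model function and its inputs, and the dense_mass site names are illustrative):

    from jax import random
    from numpyro.infer import MCMC, NUTS

    # Dense mass matrix only over the small scalar-hyperparameter block;
    # the site names here are placeholders, not the production ones.
    kernel = NUTS(model, dense_mass=[("tau_global", "tau_league", "alpha_ar")])
    mcmc = MCMC(kernel, num_warmup=2500, num_samples=1000, num_chains=4)
    mcmc.run(random.PRNGKey(0), **race_data)
    mcmc.print_summary()  # check max R-hat < 1.05 and divergence count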

Posterior summaries (mean, SD, 5th and 95th percentiles) are persisted to team_rpi_posterior; the joint sample matrix lands in team_rpi_posterior_samples so the head-to-head matchup endpoint can draw correlated (θA, θB) pairs and report calibrated win probabilities and margin credible intervals.
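
For the matchup endpoint, a sketch of how aligned joint draws become a summary (ignoring race-day noise for simplicity; the real endpoint may fold in σ_effective):

    import numpy as np

    def matchup_summary(theta_a, theta_b):
        # theta_a, theta_b: aligned posterior draws for crews A and B from
        # the joint sample matrix, so shared league/club correlation is kept.
        margin = theta_a - theta_b  # skill units: seconds per 500m of pace
        p_win_a = float(np.mean(margin > 0))
        lo, hi = np.percentile(margin, [5, 95])  # 90% margin credible interval
        return {"p_win_a": p_win_a, "margin_90ci": (lo, hi)}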

Reading a Rating

Every leaderboard rating shows the posterior mean (the single best guess) plus a confidence chip — HIGH, MED, or LOW — driven by the posterior SD. Crews with fewer than 4 rated results in the current season are flagged as provisional: their posterior is mostly the league prior, not their own evidence, so we surface that uncertainty rather than dressing it up as a confident pick.

Backtest Accuracy

We score the live posterior's predictions on every pairwise outcome from the last ~3 months of real races. For each pair (winner, loser), the model's independent-Gaussian win probability is compared with what actually happened. The calibration table on the landing page reads directly off those buckets: when the model says ~70%, the favourite wins ~78% of the time — the model is mildly under-confident in close races, well-calibrated to slightly over-confident on blowouts.
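
The bucketing itself is simple; a sketch (bin count and names are ours):

    import numpy as np

    def calibration_buckets(p_favourite, favourite_won, n_bins=10):
        # p_favourite: model win probability for the favourite (>= 0.5);
        # favourite_won: 1 where the favourite actually won, else 0.
        bins = np.clip((p_favourite * n_bins).astype(int), 0, n_bins - 1)
        for b in np.unique(bins):
            mask = bins == b
            print(f"pred {p_favourite[mask].mean():.2f}  "
                  f"obs {favourite_won[mask].mean():.2f}  n={mask.sum()}")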

This is an in-sample evaluation — recent races contributed to the posterior they're scored against — so the headline number is a slight upper bound on a strict walk-forward evaluation (which we have running offline at ~68% on a 28-day single holdout). On a 3-year training window any single race contributes <1% of the total evidence, so the bias is small relative to the gain in practicality: measuring fresh predictions doesn't require an MCMC refit queue.

The current Bayes engine outperforms the prior Elo engine on every slice, especially direct head-to-heads where the model has accumulated rich evidence on the matchup. Live calibration numbers are shown on the landing page.

Known Limitations

  • Joint cross-boat-class fits with shared club_effect are infrastructure-ready but currently run per-class in production for compute economy. The joint mode needs ≥2,000/2,000 MCMC samples for convergence on the ~5,000-parameter joint posterior.
  • Roster changes within a season are not modeled. Mid-season swaps still appear as the same crew identity.
  • Lane bias and course-specific effects are not modeled — pace-margin within each race already cancels course mean, but lane-by-lane variation is folded into σ.
  • Within-season recency weighting was trialed and reverted: on this 3-year horizon, the AR(1) cross-season decay plus equal-weight within-season produced the best holdout accuracy.
  • Importance tier still has a keyword fallback for events without a curated events.tier value.