Alpine Skiing Weekly Predictions Model Documentation
Overview
The Alpine skiing weekly predictions system (weekly-picks2.R) generates both point predictions and position probabilities for weekend races. It combines ELO ratings and historical performance data in generalized additive models (GAMs), with linear-model fallbacks, to predict race outcomes.
Key Differences from Cross-Country
Unlike cross-country skiing, which has multiple points systems (World Cup, Stage, Tour de Ski) and complex optimization, Alpine skiing uses:
- Single points system: World Cup points only (100, 80, 60, 50, 45, 40, 36, 32, 29, 26…)
- Individual races only: No relay events
- Discipline-specific ELO ratings: Separate ratings for Downhill, Super G, Giant Slalom, Slalom, Combined, Tech, Speed
Main Components
1. Race Probability Calculation
Purpose: Determines the probability that each athlete will participate in each race based on historical participation patterns.
Method:
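library(dplyr)
get_race_probability <- function(chronos, participant, discipline) {
  # Calculate date from 5 years ago
  five_years_ago <- Sys.Date() - (5 * 365)
  # Participant's first recorded race (assumed here to be taken from chronos)
  participant_dates <- chronos$Date[chronos$Skier == participant]
  if (length(participant_dates) == 0) return(0)
  participant_first_race <- min(participant_dates)
  # Use participant's first race or 5 years ago, whichever is later
  start_date <- max(five_years_ago, participant_first_race)
  # All distinct races in this discipline since start_date
  all_races <- chronos %>%
    filter(Date >= start_date, Distance == discipline) %>%
    distinct(Date, City)
  # Participant's distinct races in this discipline
  participant_races <- chronos %>%
    filter(Date >= start_date, Skier == participant, Distance == discipline) %>%
    distinct(Date, City)
  total_races <- nrow(all_races)
  races_participated <- nrow(participant_races)
  # No races in the window: no evidence of participation
  if (total_races == 0) return(0)
  # Calculate probability (capped at 1.0)
  prob <- min(1, races_participated / total_races)
  return(prob)
}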
Key Points:
- Uses 5-year lookback window
- Discipline-specific (Downhill, Super G, Giant Slalom, Slalom)
- Based on actual participation history by default, not startlist presence
- Exception: for Race1, the FIS startlist is used if available (In_Startlist=TRUE → 1.0 probability)
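As a worked illustration of the participation ratio, the following base-R sketch applies it to a small invented results table. The column names follow the function above; the data and the helper name participation_rate are hypothetical, and the 5-year window is left out to keep the example short.

```r
# Toy results table; all values are invented for illustration only.
chronos <- data.frame(
  Date     = as.Date(c("2024-01-13", "2024-01-13", "2024-01-20",
                       "2024-02-03", "2024-02-17")),
  City     = c("Wengen", "Wengen", "Kitzbuehel", "Chamonix", "Kvitfjell"),
  Skier    = c("Athlete A", "Athlete B", "Athlete A", "Athlete B", "Athlete A"),
  Distance = "Downhill",
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of distinct (Date, City) races a skier started.
participation_rate <- function(chronos, participant, discipline) {
  in_disc   <- chronos[chronos$Distance == discipline, ]
  all_races <- unique(in_disc[, c("Date", "City")])
  own_races <- unique(in_disc[in_disc$Skier == participant, c("Date", "City")])
  if (nrow(all_races) == 0) return(0)
  min(1, nrow(own_races) / nrow(all_races))
}

rate_a <- participation_rate(chronos, "Athlete A", "Downhill")  # 3 of 4 races
rate_b <- participation_rate(chronos, "Athlete B", "Downhill")  # 2 of 4 races
```

Here Athlete A gets a race probability of 0.75 and Athlete B 0.5, before any startlist override.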
2. Points Prediction Models
Purpose: Predicts World Cup points each athlete will score in each race.
Model Architecture:
- Feature Selection: Uses regsubsets() with BIC criterion to select best variables
- Model Type: Generalized Additive Models (GAM) with smooth terms
- Fallback Strategy: Linear models if GAM fails
Variables by Discipline:
- Speed Events (Downhill, Super G): Prev_Points_Weighted, Downhill_Elo_Pct, Super.G_Elo_Pct, Giant.Slalom_Elo_Pct, Speed_Elo_Pct, Elo_Pct
- Technical Events (Slalom, Giant Slalom): Prev_Points_Weighted, Super.G_Elo_Pct, Slalom_Elo_Pct, Giant.Slalom_Elo_Pct, Tech_Elo_Pct, Elo_Pct
- Combined Events: Prev_Points_Weighted, Combined_Elo_Pct, Tech_Elo_Pct, Speed_Elo_Pct, Elo_Pct
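One way to keep these candidate sets in code is a named list with a small lookup. This is only a sketch: the variable names are copied from the lists above, while candidate_vars, get_candidate_vars, and the exact discipline strings are assumptions rather than names taken from weekly-picks2.R.

```r
# Candidate predictors per discipline group, copied from the lists above.
candidate_vars <- list(
  speed = c("Prev_Points_Weighted", "Downhill_Elo_Pct", "Super.G_Elo_Pct",
            "Giant.Slalom_Elo_Pct", "Speed_Elo_Pct", "Elo_Pct"),
  tech = c("Prev_Points_Weighted", "Super.G_Elo_Pct", "Slalom_Elo_Pct",
           "Giant.Slalom_Elo_Pct", "Tech_Elo_Pct", "Elo_Pct"),
  combined = c("Prev_Points_Weighted", "Combined_Elo_Pct", "Tech_Elo_Pct",
               "Speed_Elo_Pct", "Elo_Pct")
)

# Hypothetical helper mapping a discipline name to its candidate set.
# The discipline strings are assumed spellings.
get_candidate_vars <- function(discipline) {
  group <- if (discipline %in% c("Downhill", "Super G")) {
    "speed"
  } else if (discipline %in% c("Slalom", "Giant Slalom")) {
    "tech"
  } else if (discipline == "Combined") {
    "combined"
  } else {
    stop("Unknown discipline: ", discipline)
  }
  candidate_vars[[group]]
}

# Build the full formula that would be handed to feature selection.
full_formula <- reformulate(get_candidate_vars("Downhill"), response = "Points")
```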
Model Formula Example:
# Feature selection (regsubsets() from leaps, gam() from mgcv)
exhaustive_selection <- regsubsets(Points ~ Prev_Points_Weighted + Downhill_Elo_Pct + ...,
                                   data = race_df_75, method = "exhaustive")
summary_exhaustive <- summary(exhaustive_selection)
best_bic_vars <- names(coef(exhaustive_selection, which.min(summary_exhaustive$bic)))
# GAM model with smooth terms ([-1] drops the intercept)
smooth_terms <- paste("s(", best_bic_vars[-1], ")", collapse = " + ")
gam_formula <- as.formula(paste("Points ~", smooth_terms))
model <- gam(gam_formula, data = race_df_75)
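The fallback strategy listed under Model Architecture (linear model if the GAM fails) could be wrapped roughly as follows. This is a minimal sketch on simulated data, not the script's actual code: fit_points_model and the toy data frame are hypothetical, and only two of the candidate variables are used.

```r
library(mgcv)

# Hypothetical wrapper: try a GAM with smooth terms, fall back to lm() on error.
fit_points_model <- function(data, vars, response = "Points") {
  smooth_terms <- paste0("s(", vars, ")", collapse = " + ")
  gam_formula <- as.formula(paste(response, "~", smooth_terms))
  tryCatch(
    gam(gam_formula, data = data),
    error = function(e) {
      message("GAM failed (", conditionMessage(e), "); using linear model")
      lm(reformulate(vars, response = response), data = data)
    }
  )
}

# Simulated data for illustration only.
set.seed(1)
toy <- data.frame(
  Prev_Points_Weighted = runif(200, 0, 100),
  Downhill_Elo_Pct     = runif(200, 0.5, 1)
)
toy$Points <- 0.6 * toy$Prev_Points_Weighted +
  40 * toy$Downhill_Elo_Pct^2 + rnorm(200, sd = 5)

vars <- c("Prev_Points_Weighted", "Downhill_Elo_Pct")
model <- fit_points_model(toy, vars)

# With only five rows the default smooths cannot be built, so gam() is
# expected to raise an error and the linear fallback is returned instead.
small_model <- fit_points_model(toy[1:5, ], vars)
```

Catching the error at fit time keeps the prediction step identical for both cases, since predict() works on GAM and linear model objects alike.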
3. Position Probability Models
Purpose: Predicts the probability that each athlete finishes in top-1, top-3, top-5, top-10, and top-30 positions.
Model Architecture:
- Binary Classification: Separate GAM model for each threshold using binomial family
- Same Variables: Uses identical feature selection as points models
- Period Adjustments: Accounts for seasonal performance variations
Position Thresholds: [1, 3, 5, 10, 30]
Model Formula:
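# Create binary outcome
race_df$position_achieved <- race_df$Place <= threshold
# Feature selection (same as points model)
pos_formula <- as.formula(paste("position_achieved ~", paste(position_feature_vars, collapse = " + ")))
pos_selection <- regsubsets(pos_formula, data = race_df, method = "exhaustive")
# Build smooth terms from the BIC-selected variables, mirroring the points model
pos_summary <- summary(pos_selection)
pos_best_vars <- names(coef(pos_selection, which.min(pos_summary$bic)))[-1]
pos_smooth_terms <- paste("s(", pos_best_vars, ")", collapse = " + ")
# GAM with binomial family
pos_gam_formula <- as.formula(paste("position_achieved ~", pos_smooth_terms))
position_model <- gam(pos_gam_formula, data = race_df, family = binomial, method = "REML")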
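To show how one model per threshold fits together, here is a sketch that loops over the thresholds on simulated races and predicts probabilities for a single athlete. The simulation (30 races of 50 starters, a single Elo_Pct predictor) and all object names other than the prob_top prefix are assumptions made for illustration.

```r
library(mgcv)

# Simulate 30 races with 50 starters each; stronger athletes tend to place better.
set.seed(42)
races <- 30
field <- 50
toy <- data.frame(
  Race    = rep(seq_len(races), each = field),
  Elo_Pct = runif(races * field, 0.5, 1)
)
toy$score <- toy$Elo_Pct + rnorm(nrow(toy), sd = 0.15)
toy$Place <- ave(-toy$score, toy$Race, FUN = rank)  # rank 1 = best score

position_thresholds <- c(1, 3, 5, 10, 30)

# Fit one binomial GAM per threshold.
position_models <- lapply(position_thresholds, function(threshold) {
  d <- toy
  d$position_achieved <- as.integer(d$Place <= threshold)
  gam(position_achieved ~ s(Elo_Pct), data = d,
      family = binomial, method = "REML")
})
names(position_models) <- paste0("prob_top", position_thresholds)

# Predicted probabilities (on the 0-1 scale) for one hypothetical athlete.
new_athlete <- data.frame(Elo_Pct = 0.95)
probs <- vapply(
  position_models,
  function(m) as.numeric(predict(m, newdata = new_athlete, type = "response")),
  numeric(1)
)
```

Because each threshold is fitted independently, nothing forces the resulting probabilities to be mutually consistent across athletes, which is what the normalization step later addresses.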
4. Adjustment Mechanisms
Period Adjustments:
- Compares athlete’s recent performance in current period vs. other periods
- Uses t-test to determine if period effect is statistically significant (p < 0.05)
- Applies period-specific correction to both points and position predictions
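The significance check can be sketched as below. The source does not spell out how the correction is sized, so this example assumes it is the difference in mean prediction error between the current period and all other periods, applied only when the t-test gives p < 0.05. The helper name period_adjustment and the data are hypothetical.

```r
# Hypothetical helper: returns a points correction for one athlete.
# `errors` are actual-minus-predicted points; `periods` labels each race.
period_adjustment <- function(errors, periods, current_period, alpha = 0.05) {
  in_period  <- errors[periods == current_period]
  out_period <- errors[periods != current_period]
  # A t-test needs at least two observations per group.
  if (length(in_period) < 2 || length(out_period) < 2) return(0)
  # t.test() errors on constant data, so treat that as "no effect".
  test <- tryCatch(t.test(in_period, out_period), error = function(e) NULL)
  if (is.null(test) || is.na(test$p.value) || test$p.value >= alpha) return(0)
  # Assumed correction: difference in mean error between the two groups.
  mean(in_period) - mean(out_period)
}

periods <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)

# Athlete who clearly over-performs in period 1 (invented numbers).
strong_errors <- c(12, 15, 10, 14, 13, -1, 1, 0, -2, 2)
adj_strong <- period_adjustment(strong_errors, periods, current_period = 1)

# Athlete with no visible period effect (invented numbers).
flat_errors <- c(1, -2, 0, 3, -1, 2, -1, 0, 1, -3)
adj_none <- period_adjustment(flat_errors, periods, current_period = 1)
```

The first athlete receives a positive correction of 12.8 points, while the second receives none because the difference is not significant.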
Discipline Adjustments:
- Similar to period adjustments but for technical vs. speed events
- Uses Tech_Flag to categorize disciplines
Volatility Metrics:
- prediction_volatility: Standard deviation of prediction errors over last 10 races
- upside_potential: 90th percentile of prediction errors
- downside_risk: 10th percentile of prediction errors
- confidence_factor: Based on number of recent races (max 10)
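These metrics can be computed from a vector of prediction errors along the following lines. The sketch assumes errors are stored oldest first and that confidence_factor is the number of recent races divided by 10; the source only says it is based on the race count, so that formula and the helper name volatility_metrics are assumptions.

```r
# Hypothetical helper; `errors` are actual-minus-predicted points, oldest first.
volatility_metrics <- function(errors, window = 10) {
  recent <- tail(errors[!is.na(errors)], window)
  n <- length(recent)
  list(
    prediction_volatility = if (n >= 2) sd(recent) else NA_real_,
    upside_potential      = if (n >= 1) unname(quantile(recent, 0.90)) else NA_real_,
    downside_risk         = if (n >= 1) unname(quantile(recent, 0.10)) else NA_real_,
    # Assumed definition: share of the window that is filled (max 1).
    confidence_factor     = n / window
  )
}

# Invented errors for an athlete with 12 races and for one with only 3.
m_full  <- volatility_metrics(c(5, -3, 8, 0, -6, 2, 4, -1, 7, -4, 3, 1))
m_short <- volatility_metrics(c(2, -1, 4))
```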
5. Position Probability Normalization
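Problem: Because each threshold is modeled separately for each athlete, raw position probabilities do not sum to the totals the race structure requires (for example, exactly one winner per race).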
Solution: Custom normalization function that ensures:
- Top-1 probabilities sum to 100% across all participants
- Top-3 probabilities sum to 300% across all participants
- Top-5 probabilities sum to 500% across all participants
- In general, top-N probabilities sum to N × 100% across all participants
Process:
- Race Probability Adjustment: Multiply by race participation probability
- Scaling: Calculate scaling factor = target_sum / current_sum
- Capping: Ensure no individual probability exceeds 100%
- Redistribution: Redistribute excess probability proportionally
normalize_position_probabilities <- function(predictions, race_prob_col, position_thresholds) {
  # Work on a copy so the input predictions are left untouched
  normalized <- predictions
  for (threshold in position_thresholds) {
    prob_col <- paste0("prob_top", threshold)
    # Apply race probability adjustment
    normalized[[prob_col]] <- normalized[[prob_col]] * normalized[[race_prob_col]]
    # Calculate scaling factor
    current_sum <- sum(normalized[[prob_col]], na.rm = TRUE)
    target_sum <- 100 * threshold
    scaling_factor <- target_sum / current_sum
    # Apply scaling
    normalized[[prob_col]] <- normalized[[prob_col]] * scaling_factor
    # Cap at 100% and redistribute excess
    # [redistribution logic...]
  }
  return(normalized)
}
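The redistribution step is elided in the excerpt above. One possible way to scale, cap at 100%, and redistribute the excess proportionally is sketched here; it is an assumption about how such a step could work, not the logic used in weekly-picks2.R, and scale_and_cap is a hypothetical name.

```r
# Hypothetical helper: scale `p` (in percent) so it sums to `target`,
# capping each entry at `cap` and redistributing excess proportionally
# among the entries that are still below the cap.
scale_and_cap <- function(p, target, cap = 100, max_iter = 100, tol = 1e-8) {
  p[is.na(p)] <- 0
  if (sum(p) <= 0) return(p)
  # The target cannot exceed what the cap allows for the non-zero entries.
  target <- min(target, cap * sum(p > 0))
  capped <- rep(FALSE, length(p))
  for (i in seq_len(max_iter)) {
    free_sum  <- sum(p[!capped])
    remaining <- target - cap * sum(capped)
    if (free_sum <= 0) break
    # Rescale only the uncapped entries to absorb the remaining mass.
    p[!capped] <- p[!capped] * remaining / free_sum
    over <- p > cap + tol
    if (!any(over)) break
    p[over] <- cap
    capped <- capped | over
  }
  pmin(p, cap)
}

raw_top3  <- c(90, 60, 30, 10, 5, 5)          # invented raw percentages, sum 200
norm_top3 <- scale_and_cap(raw_top3, target = 300)
```

In this example the two strongest athletes end at the 100% cap and the remaining 100% is shared among the others in proportion to their raw values, giving 100, 100, 60, 20, 10, 10.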
6. Final Predictions Integration
Points Predictions:
- Base prediction from GAM model
- Plus period and discipline adjustments
- Multiplied by race participation probability
- Generates: Final_Prediction, Safe_Prediction, Upside_Prediction
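The combination of these pieces for the points side can be illustrated with a two-race weekend. The column names other than Final_Prediction, the numbers, and the clamping to the 0 to 100 World Cup range are assumptions for this sketch; Safe_Prediction and Upside_Prediction are left out because the source does not define how they are derived.

```r
# Invented per-race inputs for one athlete over a two-race weekend.
preds <- data.frame(
  Race                  = c("Race1", "Race2"),
  base_prediction       = c(42, 35),   # GAM output
  period_adjustment     = c(3, 3),
  discipline_adjustment = c(-1, 2),
  race_prob             = c(1.0, 0.6)  # participation probability
)

# Base prediction plus adjustments, clamped to the valid points range
# (the clamp is an assumed safeguard, not confirmed by the source).
adjusted <- preds$base_prediction + preds$period_adjustment +
  preds$discipline_adjustment
adjusted <- pmin(pmax(adjusted, 0), 100)

# Weight by the probability that the athlete actually starts.
preds$Final_Prediction <- adjusted * preds$race_prob

# Probability-weighted total for the weekend.
weekend_total <- sum(preds$Final_Prediction)
```

Race1 contributes 44 points and Race2 contributes 24 (40 points at a 0.6 start probability), for a weekend total of 68 expected points.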
Position Predictions:
- Separate probability for each threshold (1, 3, 5, 10, 30)
- Normalized to ensure mathematical consistency
- Expressed as percentages
Output Structure:
- Points Excel: Total expected points across all races with probability weighting
- Position Excel: Race-by-race position probabilities for each threshold
- Top Contenders: Summary of top 5 athletes for win/podium/top-5 in each race
Model Training Data
- Historical Scope: Last 10+ seasons of race results
- Training Set: Athletes with ELO > 75th percentile (top performers only)
- Cross-Validation: None explicitly implemented (uses recent historical data)
Key Strengths
- Discipline Specificity: Separate models and ELO ratings for each Alpine discipline
- Participation Modeling: Realistic race probability based on historical patterns
- Mathematical Consistency: Position probabilities sum correctly across all athletes
- Volatility Awareness: Accounts for athlete consistency/inconsistency patterns
- Seasonal Effects: Period and discipline adjustments for changing conditions
Current Limitations
- No Weather/Conditions: Models don’t account for snow, weather, or course conditions
- No Injury Modeling: Doesn’t predict injury risk or recovery patterns
- Static Course Difficulty: Doesn’t adjust for venue-specific difficulty
- Limited Cross-Validation: Model validation relies primarily on recent performance
File Dependencies
- Input Data: weekends.csv, {gender}_chrono.csv, startlist_weekend_{gender}.csv
- Output: Excel files in ~/blog/daehl-e/content/post/alpine/drafts/weekly-picks/{date}/
- Libraries: dplyr, mgcv, leaps, openxlsx, slider, purrr