Biathlon Big-Break Section Explanation

Overview

The big-break section is the application layer of the breakthrough prediction system in biathlon analysis. This section takes the trained models from feat-select-break and applies them to predict 2026 breakthrough candidates, generate historical comparisons, and export results to Excel files. It represents the culmination of the breakthrough analysis pipeline.

Main Functions and Components

1. Core Function: `predict_2026_breakthroughs()`

This is the central prediction function that applies trained models to identify 2026 breakthrough candidates.

Input Validation and Setup

if (!is.data.frame(current_data)) stop("current_data is not a data frame")
if (nrow(current_data) == 0) stop("current_data is empty")
if (is.null(breakthrough_model)) stop("breakthrough_model is NULL")
if (is.null(top_predictors) || length(top_predictors) == 0) stop("No top_predictors provided")

Biathlon-Specific Predictor Mapping

predictor_mapping <- c(
  "Prev_Pelo" = "Pelo",
  "Prev_Individual" = "Individual_Pelo", 
  "Prev_Sprint" = "Sprint_Pelo",
  "Prev_Pursuit" = "Pursuit_Pelo",
  "Prev_MassStart" = "MassStart_Pelo",
  "Prev_Pct_of_Max_Points" = "Pct_of_Max_Points",
  "Age" = "Age"
)

2. Career History Analysis

2025 Season Data Extraction

Treats 2025 as the most recent complete season
Keeps the most recent entry for each athlete (the Skier column) to avoid duplicates
Validates data availability and reports athlete counts
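
The extraction steps above could look roughly like the following sketch. The Season and Date column names and the toy data are assumptions made for illustration; only Skier and Pct_of_Max_Points appear in the original code.

```r
library(dplyr)

# Toy stand-in for current_data; Season and Date are assumed column names.
current_data <- data.frame(
  Skier = c("A", "A", "B", "C"),
  Season = c(2025, 2025, 2025, 2024),
  Date = as.Date(c("2025-01-10", "2025-03-15", "2025-02-01", "2024-03-01")),
  Pct_of_Max_Points = c(0.10, 0.12, 0.30, 0.05)
)

season_2025 <- current_data %>%
  filter(Season == 2025) %>%
  group_by(Skier) %>%
  slice_max(Date, n = 1, with_ties = FALSE) %>%  # most recent entry per athlete
  ungroup()

# Validate availability and report the athlete count
if (nrow(season_2025) == 0) stop("No 2025 season data available")
cat("Athletes with 2025 data:", nrow(season_2025), "\n")
```

Using with_ties = FALSE guarantees exactly one row per athlete even if two entries share the same date.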

Career Maximum Calculation

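library(dplyr)  # provides %>%, filter(), group_by(), summarise()

# Best Pct_of_Max_Points each athlete has ever recorded
career_maximums <- current_data %>%
  filter(!is.na(Pct_of_Max_Points)) %>%
  group_by(Skier) %>%
  summarise(
    Career_Max_Pct = max(Pct_of_Max_Points, na.rm = TRUE),
    .groups = "drop"
  )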

Breakthrough Candidate Identification

Inclusion Criteria:

Age ≥ 16 (junior age minimum)
Career_Max_Pct < 0.4 (has never reached the biathlon-specific 40% breakthrough threshold)
Pct_of_Max_Points > 0.01 (has competitive results in 2025)
Complete data for key variables
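
As a minimal sketch, the inclusion criteria above can be expressed as a single filter. The two toy data frames are invented for illustration, and treating Age and Pct_of_Max_Points as the "key variables" is an assumption; the source does not list them.

```r
library(dplyr)

# Toy 2025 season rows and career maximums (illustrative values only)
season_2025 <- data.frame(
  Skier = c("A", "B", "C", "D"),
  Age = c(22, 15, 30, 27),
  Pct_of_Max_Points = c(0.15, 0.20, 0.005, 0.25)
)
career_maximums <- data.frame(
  Skier = c("A", "B", "C", "D"),
  Career_Max_Pct = c(0.18, 0.20, 0.10, 0.55)
)

candidates <- season_2025 %>%
  inner_join(career_maximums, by = "Skier") %>%
  filter(
    !is.na(Age), !is.na(Pct_of_Max_Points),  # complete data (assumed key variables)
    Age >= 16,                               # minimum age, no upper limit
    Career_Max_Pct < 0.4,                    # never reached the 40% threshold
    Pct_of_Max_Points > 0.01                 # has competitive results in 2025
  )

print(candidates)
```

In this toy data only athlete A passes: B is under 16, C has no meaningful 2025 results, and D has already exceeded the threshold.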

Notable Features:

No upper age limit: Recognizes breakthrough can occur at any career stage
40% threshold: Biathlon-specific breakthrough definition
Career-based filtering: Excludes athletes who already achieved breakthrough

3. Debug Functionality

Athlete-Specific Debugging

if ("Campbell Wright" %in% prediction_data$Skier) {
  wright_idx <- which(prediction_data$Skier == "Campbell Wright")
  cat("\n=== DEBUG: Campbell Wright Breakthrough Model Input ===\n")
  # Display detailed input data for verification
}

if ("Jeanne Richard" %in% prediction_data$Skier) {
  # Similar debugging for ladies' representative athlete
}

Comprehensive Data Quality Checks

Missing value analysis for each predictor
Range validation and extreme value detection
Factor level compatibility with training data
Infinite value detection and replacement
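
The checks listed above might be implemented along these lines. This is a base R sketch under assumptions: the train frame, the Nation factor, and the toy values are hypothetical stand-ins, and the source does not say what infinite values are replaced with (NA is used here).

```r
# Toy training and prediction frames; columns and values are illustrative.
train <- data.frame(Prev_Pelo = c(1200, 1300, 1250),
                    Nation = factor(c("NOR", "FRA", "GER")))
prediction_data <- data.frame(Prev_Pelo = c(1280, NA, Inf),
                              Nation = factor(c("NOR", "USA", "FRA")))
predictors <- c("Prev_Pelo", "Nation")

# 1. Missing values per predictor
na_counts <- sapply(prediction_data[predictors], function(x) sum(is.na(x)))
print(na_counts)

# 2. Infinite values: replace with NA so later imputation can handle them
num_cols <- predictors[sapply(prediction_data[predictors], is.numeric)]
for (col in num_cols) {
  bad <- is.infinite(prediction_data[[col]])
  prediction_data[[col]][bad] <- NA
}

# 3. Range check on the cleaned numeric predictors
for (col in num_cols) {
  cat(col, "range:", range(prediction_data[[col]], na.rm = TRUE), "\n")
}

# 4. Factor levels present in prediction data but absent from training data
unseen_levels <- setdiff(levels(prediction_data$Nation), levels(train$Nation))
if (length(unseen_levels) > 0) {
  cat("Levels not seen in training:", unseen_levels, "\n")
}
```

Running the missing-value count before the infinite-value replacement keeps the two problems separately visible in the log.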

4. Model Application and Prediction

Feature Mapping Process

# Copy each current-season column into the Prev_ name the model was trained on
for (prev_feature in names(predictor_mapping)) {
  current_feature <- predictor_mapping[[prev_feature]]
  if (prev_feature %in% top_predictors && current_feature %in% names(prediction_data)) {
    prediction_data[[prev_feature]] <- prediction_data[[current_feature]]
    cat(sprintf("✓ Mapped %s -> %s\n", current_feature, prev_feature))
  }
}

Prediction Generation

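breakthrough_probs <- tryCatch({
  # type = "prob" requests class probabilities rather than class labels
  predict(breakthrough_model, newdata = prediction_clean, type = "prob")
}, error = function(e) {
  cat("Prediction failed, retrying with na.pass:", e$message, "\n")
  # Fallback: na.action = na.pass passes rows with missing predictors through instead of dropping them
  predict(breakthrough_model, newdata = prediction_clean, type = "prob", na.action = na.pass)
})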
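
One caveat worth illustrating: type = "prob" is accepted by caret train objects and randomForest classifiers, whereas a plain glm logistic regression expects type = "response". The source does not say which model class breakthrough_model is, so the helper below (predict_breakthrough_prob, a hypothetical name) is only a sketch of how both cases could be handled. The "Yes" class label is also an assumption.

```r
# Hypothetical helper: returns a numeric vector of positive-class probabilities.
predict_breakthrough_prob <- function(model, newdata, positive = "Yes") {
  if (inherits(model, "glm")) {
    # predict.glm has no type = "prob"; "response" gives P(outcome = 1)
    return(as.numeric(predict(model, newdata = newdata, type = "response")))
  }
  # caret and randomForest return one probability column per class
  probs <- predict(model, newdata = newdata, type = "prob")
  as.numeric(probs[, positive])
}

# Toy logistic regression to exercise the glm branch (simulated data)
set.seed(1)
train <- data.frame(Prev_Pelo = rnorm(200, 1300, 100))
train$Breakthrough <- rbinom(200, 1, plogis((train$Prev_Pelo - 1300) / 50))
fit <- glm(Breakthrough ~ Prev_Pelo, data = train, family = binomial)

p <- predict_breakthrough_prob(fit, data.frame(Prev_Pelo = c(1200, 1400)))
print(p)
```

Returning a plain numeric vector in both branches means downstream code such as the likelihood classification does not need to know which model type produced the probabilities.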

5. Results Processing and Classification

Probability-Based Classification

Likelihood = case_when(
  is.na(Breakthrough_Prob) ~ "Unknown",
  Breakthrough_Prob >= 0.6 ~ "Very High",
  Breakthrough_Prob >= 0.4 ~ "High", 
  Breakthrough_Prob >= 0.2 ~ "Moderate",
  Breakthrough_Prob >= 0.1 ~ "Low",
  TRUE ~ "Very Low"
)

Performance Metrics

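Points to Threshold: pmax(0, 0.4 - Pct_of_Max_Points, na.rm = TRUE), the gap to the 40% threshold floored at zero (with na.rm = TRUE, a missing Pct_of_Max_Points yields 0 rather than NA)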
Age-based filtering: Under-25 subset for young prospects
Probability distributions: Summary statistics and likelihood categories
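
A short base R sketch of these metrics follows. The results data frame and its values are invented for illustration, and "under-25" is read here as Age < 25.

```r
# Toy results table (illustrative values only)
results <- data.frame(
  Skier = c("A", "B", "C"),
  Age = c(21, 28, 24),
  Pct_of_Max_Points = c(0.35, 0.10, NA),
  Breakthrough_Prob = c(0.55, 0.15, 0.30)
)

# Gap to the 40% threshold, floored at zero.
# With na.rm = TRUE, pmax() ignores the NA and returns 0 for athlete C.
results$Points_to_Threshold <- pmax(0, 0.4 - results$Pct_of_Max_Points, na.rm = TRUE)

# Young-prospect subset (assumption: strictly under 25)
under_25 <- results[results$Age < 25, ]

# Probability distribution summary
print(summary(results$Breakthrough_Prob))
print(results)
```

If a zero gap for athletes with missing data would be misleading, dropping na.rm = TRUE leaves those rows as NA instead.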

6. Excel Export System

Breakthrough Candidates Files

library(openxlsx)  # provides write.xlsx()

# Men's breakthrough candidates
men_file <- file.path(output_dir, "mens_breakthrough_candidates_2026.xlsx")
write.xlsx(men_breakthrough_workbook, men_file, rowNames = FALSE)

# Ladies' breakthrough candidates
ladies_file <- file.path(output_dir, "ladies_breakthrough_candidates_2026.xlsx")
write.xlsx(ladies_breakthrough_workbook, ladies_file, rowNames = FALSE)

Historical Comparison Files

# Comparative analysis with historical breakthroughs
comparative_men_file <- file.path(output_dir, "mens_breakthrough_comparison_historical_vs_2026.xlsx")
comparative_ladies_file <- file.path(output_dir, "ladies_breakthrough_comparison_historical_vs_2026.xlsx")
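
For context on the export calls above: openxlsx's write.xlsx() accepts a named list of data frames and writes one sheet per element. The sketch below assumes that structure for men_breakthrough_workbook; the sheet names, toy contents, and the use of a temporary directory are all illustrative, since the source does not show how the workbook object is built.

```r
# Illustrative sheets; the real workbook contents are not shown in this section.
men_breakthrough_workbook <- list(
  All_Candidates = data.frame(Skier = c("A", "B"), Breakthrough_Prob = c(0.55, 0.15)),
  Under_25 = data.frame(Skier = "A", Breakthrough_Prob = 0.55)
)

output_dir <- tempdir()  # assumption: any writable directory works
men_file <- file.path(output_dir, "mens_breakthrough_candidates_2026.xlsx")

# Guard the export so the script still runs where openxlsx is not installed
if (requireNamespace("openxlsx", quietly = TRUE)) {
  openxlsx::write.xlsx(men_breakthrough_workbook, men_file, rowNames = FALSE)
  cat("Wrote", men_file, "\n")
} else {
  message("openxlsx not installed; skipping export")
}
```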

7. Historical Breakthrough Comparison

Function: `predict_historical_breakthrough()`

Applies 2026 models to historical breakthrough cases
Maps current performance to “previous” variables for model compatibility
Handles missing values with median imputation
Generates “what would the model have predicted” probabilities
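
The steps above can be sketched as follows. This is not the body of predict_historical_breakthrough(); it is a minimal base R illustration using a simulated glm, and it assumes medians are taken from the historical data itself (the source does not say whether training medians are used instead).

```r
set.seed(42)
# Toy training data using the Prev_ naming the model was fitted on
train <- data.frame(Prev_Pelo = rnorm(300, 1300, 100),
                    Age = sample(18:35, 300, replace = TRUE))
train$Breakthrough <- rbinom(300, 1, plogis((train$Prev_Pelo - 1350) / 60))
model <- glm(Breakthrough ~ Prev_Pelo + Age, data = train, family = binomial)

# Historical cases carry current-season column names and may have gaps
historical <- data.frame(Skier = c("H1", "H2", "H3"),
                         Pelo = c(1380, NA, 1290),
                         Age = c(23, 26, NA))

# Map the current column to the "previous" name the model expects
historical$Prev_Pelo <- historical$Pelo

# Median imputation for each predictor
for (col in c("Prev_Pelo", "Age")) {
  med <- median(historical[[col]], na.rm = TRUE)
  historical[[col]][is.na(historical[[col]])] <- med
}

# "What would the model have predicted" probability for each case
historical$Predicted_Prob <- predict(model, newdata = historical, type = "response")
print(historical)
```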

Comparative Analysis Structure

Excel Output Columns:

Name, Nation, Age
Pre-Breakthrough Pct (performance before breakthrough)
Breakthrough Result (actual breakthrough performance)
Predicted Prob (what model predicted)
Season, Type (“Historical Success” vs “2026 Prediction”)
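
A comparison table with these columns could be assembled as in the sketch below. The underscore column names and every value are hypothetical placeholders, not results from the pipeline.

```r
# Placeholder rows; names and numbers are invented for illustration.
historical <- data.frame(
  Name = "H1", Nation = "NOR", Age = 23,
  Pre_Breakthrough_Pct = 0.32, Breakthrough_Result = 0.47,
  Predicted_Prob = 0.41, Season = 2019
)
predictions_2026 <- data.frame(
  Name = "P1", Nation = "USA", Age = 22,
  Pre_Breakthrough_Pct = 0.28, Breakthrough_Result = NA_real_,  # not yet known
  Predicted_Prob = 0.36, Season = 2026
)

# Label each source, then stack and sort by predicted probability
historical$Type <- "Historical Success"
predictions_2026$Type <- "2026 Prediction"
comparison <- rbind(historical, predictions_2026)
comparison <- comparison[order(-comparison$Predicted_Prob), ]
print(comparison)
```

Keeping Breakthrough_Result as NA for 2026 rows lets both groups share one column layout in the exported sheet.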

8. Comprehensive Result Validation

Multi-Level Validation

Input validation: Model existence, predictor availability
Data quality: Missing values, infinite values, extreme outliers
Model compatibility: Factor levels, data structure alignment
Output validation: Probability ranges, result completeness
Export validation: File creation, data integrity
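
As one example of the output validation level, a check on probability ranges and result completeness might look like this. The function name validate_predictions and its return format are assumptions for illustration.

```r
# Hypothetical validator: returns a character vector of issues (empty if none).
validate_predictions <- function(results, expected_n) {
  p <- results$Breakthrough_Prob
  issues <- character(0)
  if (nrow(results) != expected_n) {
    issues <- c(issues, "row count mismatch")
  }
  if (anyNA(p)) {
    issues <- c(issues, sprintf("%d missing probabilities", sum(is.na(p))))
  }
  if (any(p < 0 | p > 1, na.rm = TRUE)) {
    issues <- c(issues, "probabilities outside [0, 1]")
  }
  issues
}

good <- data.frame(Breakthrough_Prob = c(0.2, 0.5, 0.9))
bad <- data.frame(Breakthrough_Prob = c(0.2, NA, 1.3))

good_issues <- validate_predictions(good, expected_n = 3)
bad_issues <- validate_predictions(bad, expected_n = 3)
print(bad_issues)
```

Collecting issues in a vector rather than stopping at the first one fits the graceful-degradation approach: the caller can log everything and decide whether to continue.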

Error Handling Strategy

tryCatch({
  # Main operation
}, error = function(e) {
  cat("Detailed error context:", e$message, "\n")
  # Fallback procedures or graceful failure
})

Key Technical Features

Sport-Specific Adaptations

Biathlon Breakthrough Definition

40% threshold: Represents significant competitive achievement
Career-based exclusion: Athletes who already achieved breakthrough
No upper age limit: Recognizes breakthrough can occur at various career stages (the minimum age of 16 still applies)

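Elo Rating Integration

Maps discipline-specific Elo ratings (Individual, Sprint, Pursuit)
Handles MassStart exclusions due to data quality issues; the mapping still lists Prev_MassStart, but it is applied only when that feature appears in top_predictors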
Maintains consistency with training data preparation

Advanced Data Processing

Missing Value Strategy

Uses quartile imputation to match training preparation
Preserves data distribution characteristics
Reports imputation statistics for transparency
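
The source does not say which quartile is used, so the sketch below takes the first quartile (probs = 0.25) purely as an assumption and exposes it as a parameter. The function name impute_quantile is hypothetical.

```r
# Hypothetical helper: fill NAs with a chosen quantile and record how many were filled.
impute_quantile <- function(x, probs = 0.25) {  # 0.25 is an assumed default
  fill <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  n_missing <- sum(is.na(x))
  x[is.na(x)] <- fill
  attr(x, "n_imputed") <- n_missing
  x
}

x <- c(1200, 1300, NA, 1400, 1500)
x_imputed <- impute_quantile(x)

# Report imputation statistics for transparency
cat(sprintf("Imputed %d of %d values with %.0f\n",
            attr(x_imputed, "n_imputed"), length(x), x_imputed[3]))
```

Whatever quantile is chosen, it should be computed the same way as in training preparation so that prediction inputs match what the model saw.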

Feature Engineering

Dynamic mapping between training and prediction features
Handles temporal data structure (Prev_ → current mapping)
Validates predictor availability and compatibility

Integration with Analysis Pipeline

Upstream Dependencies

Models: From feat-select-break section (logistic regression, random forest)
Predictors: Top-ranked features from importance analysis
Thresholds: Sport-specific breakthrough definitions
Data: Cleaned training datasets with career histories

Downstream Outputs

Excel files: Formatted for analysis and decision-making
Probability scores: Quantitative breakthrough likelihood
Historical validation: Model reliability assessment
Debug information: Comprehensive data lineage

Expected Outputs and Results

Console Reporting

Candidate identification statistics
Age distribution analysis
Probability distribution summaries
High-potential candidate identification
Export confirmation and file locations

Excel Files Generated

mens_breakthrough_candidates_2026.xlsx
ladies_breakthrough_candidates_2026.xlsx
mens_breakthrough_comparison_historical_vs_2026.xlsx
ladies_breakthrough_comparison_historical_vs_2026.xlsx
Debug files: Comprehensive predictor and prediction data

Data Objects Created

breakthrough_predictions_men: Complete men’s prediction results
breakthrough_predictions_ladies: Complete ladies’ prediction results
Age-stratified subsets (under-25 prospects)
Summary statistics and performance metrics

Analytical Insights

Quantitative Rankings: Probability-based candidate prioritization
Historical Validation: Model performance on past breakthrough cases
Age Patterns: Young prospect identification and analysis
Performance Gaps: Points needed to reach breakthrough threshold

Quality Assurance Features

Comprehensive Debugging

Athlete-specific tracking: Campbell Wright, Jeanne Richard verification
Data lineage: Input → processing → output validation
Statistical summaries: Distribution analysis and outlier detection
Model diagnostics: Prediction quality and reliability assessment

Error Recovery

Graceful degradation: Continues analysis when components fail
Fallback procedures: Alternative approaches for edge cases
Detailed logging: Error context and troubleshooting information
Data validation: Multi-stage quality checks throughout pipeline

This section is the practical application of the breakthrough prediction system: it turns the trained models into ranked lists of 2026 breakthrough candidates and checks the approach against historical breakthrough cases.