Biathlon Feat-Select-Break Section Explanation

Overview

The feat-select-break section is the core machine learning component of the breakthrough prediction system in biathlon analysis. This section builds and validates predictive models to identify athletes most likely to achieve breakthrough performances (>40% of maximum points) in the upcoming season. It combines feature selection, model training, and statistical validation with sport-specific logic.

Main Functions and Components

1. Library Loading and Setup

Required Libraries:

  • caret: Machine learning framework for model training and cross-validation
  • ranger: Random Forest implementation for variable importance
  • pROC: ROC curve analysis for model performance evaluation

2. Core Function: evaluate_breakthrough_predictors()

This is the central function that performs comprehensive breakthrough prediction analysis.

Input Validation and Data Quality

  • Validates input dataframe structure and non-emptiness
  • Checks predictor existence in the dataset
  • Analyzes missing data patterns and removes predictors with >50% missing values
  • Reports missing data statistics for transparency

Breakthrough Definition and Adaptive Thresholds

Will_Breakthrough = ifelse(is.na(Pct_of_Max_Points), NA, Pct_of_Max_Points >= 0.4)
  • Primary threshold: 40% of maximum points (biathlon-specific)
  • Fallback threshold: 20% if insufficient cases at 40%
  • Adaptive logic: Ensures adequate sample size for model training

Age Considerations

  • No age restrictions: Recognizes that breakthrough can occur at any career stage
  • Tracks age distribution of breakthrough performers for analysis
  • Provides age-related statistics but doesn’t filter based on age

3. Data Preparation and Imputation

Missing Data Handling

  • Uses quartile imputation strategy via replace_na_with_quartile() function
  • Maintains consistency with training data preparation methods
  • Reports before/after imputation statistics

Class Distribution Analysis

  • Monitors breakthrough vs. non-breakthrough case ratios
  • Detects severe class imbalance (>20:1 ratio)
  • Issues warnings for potential model instability
  • Provides example cases for validation

4. Cross-Validation Strategy (Adaptive)

Data Size-Based CV Selection:

  • <50 observations: Leave-One-Out Cross-Validation (LOOCV)
  • 50-200 observations: 5-fold Cross-Validation
  • >200 observations: 10-fold Cross-Validation

Configuration:

ctrl <- trainControl(method = cv_method, number = cv_number, 
                    classProbs = TRUE, summaryFunction = twoClassSummary)

5. Model Training and Comparison

Logistic Regression Model

  • Primary model for interpretability
  • Uses binomial family for binary classification
  • Optimizes ROC/AUC metric
  • Extracts coefficient importance scores

Random Forest Model (Secondary)

  • Provides variable importance validation
  • Uses ‘impurity’ importance measure
  • Fallback handling if training fails
  • Complementary to logistic regression insights

Model Performance Evaluation

model_comparison <- resamples(models_list)
performance_summary <- summary(model_comparison)

6. Feature Selection with Logical Validation

Coefficient Direction Validation

Expected Positive Predictors: (should increase breakthrough probability)

  • Prev_Pelo: Overall previous season ELO rating
  • Prev_Individual: Individual race previous ELO
  • Prev_Sprint: Sprint race previous ELO
  • Prev_Pursuit: Pursuit race previous ELO
  • Prev_Pct_of_Max_Points: Previous season performance percentage

Expected Negative Predictors: (should decrease breakthrough probability)

  • Age: Younger athletes more likely to breakthrough

Logical Filtering Process

if (var %in% expected_positive_vars) {
  if (coef_value > 0) {
    cat("✓", var, "has positive coefficient - VALID for breakthrough prediction\n")
    valid_predictors <- c(valid_predictors, var)
  } else {
    cat("✗", var, "has negative coefficient - EXCLUDED from model\n")
  }
}

7. Predictor Setup and Validation

Biathlon-Specific Predictor Identification

  • Extracts all Prev_ prefixed variables (previous season metrics)
  • Excludes MassStart: Due to data quality issues (only top 30 skiers compete)
  • Includes Age as additional predictor
  • Validates predictor availability in both men’s and ladies’ datasets

Data Availability Checks

  • Compares requested vs. available predictors
  • Reports missing predictors with warnings
  • Ensures sufficient predictors remain for modeling

8. Gender-Specific Analysis Execution

Men’s Breakthrough Analysis

breakthrough_analysis_men <- evaluate_breakthrough_predictors(train_men, breakthrough_predictors_men)

Ladies’ Breakthrough Analysis

breakthrough_analysis_ladies <- evaluate_breakthrough_predictors(train_ladies, breakthrough_predictors_ladies)

Result Validation

  • Checks for NULL returns
  • Validates required components (importance, top_predictors, model_comparison)
  • Reports missing components with warnings

9. Output Structure and Storage

Comprehensive Results Object

return(list(
  importance = importance_df,           # Variable importance rankings
  top_predictors = top_predictors,      # Selected top predictors
  full_model = logistic_model,          # Complete logistic regression model  
  reduced_model = reduced_model,        # Model with top predictors only
  rf_model = rf_model,                  # Random Forest model (if successful)
  model_comparison = model_comparison,   # Performance comparison
  performance = performance_summary,     # Detailed performance metrics
  data_summary = list(...)              # Data characteristics summary
))

Key Technical Features

Sport-Specific Adaptations

Biathlon Performance Metrics

  • ELO Ratings: Uses discipline-specific ELO ratings (Individual, Sprint, Pursuit)
  • Composite Performance: Pct_of_Max_Points reflects multi-discipline success
  • 40% Threshold: Represents significant competitive achievement in biathlon

Predictor Exclusions

  • MassStart exclusion: Recognition of limited participation (top 30 only)
  • Data quality focus: Prioritizes predictors with reliable data availability

Statistical Rigor

Coefficient Direction Validation

  • Ensures model logic aligns with sport knowledge
  • Excludes counterintuitive predictors to maintain interpretability
  • Balances statistical significance with logical consistency

Adaptive Model Parameters

  • Adjusts cross-validation strategy based on data availability
  • Handles class imbalance through monitoring and warnings
  • Provides fallback options for edge cases

Integration with Analysis Pipeline

Upstream Dependencies

  • Requires cleaned training data from preprocessing steps
  • Uses breakthrough cases identified in breakout-identifier section
  • Leverages existing missing data imputation functions

Downstream Outputs

  • Model objects: Used in big-break section for 2026 predictions
  • Feature rankings: Guide predictor selection for prediction functions
  • Performance metrics: Validate model reliability for decision-making

Error Handling and Robustness

Comprehensive Error Management

  • Try-catch blocks around all major operations
  • Detailed error reporting with context
  • Graceful degradation when components fail
  • Validation at each processing step

Quality Assurance

  • Statistical significance monitoring
  • Class distribution validation
  • Predictor availability confirmation
  • Model performance verification

Expected Outputs

Console Reporting

  • Missing data analysis for each predictor
  • Breakthrough case distribution statistics
  • Coefficient direction validation results
  • Model performance summaries
  • Top predictor rankings

Data Objects Created

  • breakthrough_analysis_men: Complete men’s analysis results
  • breakthrough_analysis_ladies: Complete ladies’ analysis results
  • Model objects for downstream prediction use
  • Performance metrics for model validation

Model Performance Metrics

  • ROC/AUC scores for classification performance
  • Cross-validation results for generalization assessment
  • Variable importance rankings for interpretability
  • Coefficient significance testing for statistical validity

This section establishes the predictive foundation for breakthrough identification by combining machine learning rigor with sport-specific domain knowledge, ensuring both statistical validity and practical interpretability.