SGD Classifier Layer
Stochastic Gradient Descent (SGD) Classifier - A versatile linear classification algorithm that learns incrementally using gradient descent optimization.
Mathematical form: $w_{t+1} = w_t - \eta_t \nabla_w L(w_t; x_i, y_i)$, where:
- $w_t$ is the model weights at step t
- $\eta_t$ is the learning rate at step t
- $L$ is the loss function
- $(x_i, y_i)$ is a single training example
Key characteristics:
- Efficient for large-scale learning
- Online/incremental learning capability
- Multiple loss function options
- Flexible regularization schemes
- Adaptive learning rates
Common applications:
- Text classification
- Large-scale document categorization
- Real-time classification tasks
- Stream data processing
- High-dimensional sparse data
Outputs:
- Predicted Table: Input data with prediction columns
- Validation Results: Cross-validation metrics
- Test Metric: Test set performance metrics
- ROC Curve Data: ROC analysis information
- Confusion Matrix: Detailed classification results
- Feature Importances: Feature coefficients and importance
Note: Particularly effective for large datasets where memory efficiency is crucial.
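As a concrete starting point, here is a minimal end-to-end sketch. It assumes scikit-learn's SGDClassifier as the underlying implementation and uses a synthetic dataset; both are illustrative.

```python
# Minimal SGD classification sketch (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# These arguments mirror the standard configuration described below:
# hinge loss, L2 penalty, optimal learning-rate schedule, shuffling on.
clf = SGDClassifier(loss="hinge", penalty="l2", learning_rate="optimal",
                    shuffle=True, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```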
SelectFeatures
[column, ...]Feature columns for SGD classification:
Selection guidelines:
- Numerical features preferred
- Standardized/scaled values recommended
- Handle missing values beforehand
- Consider feature interactions
Preprocessing tips:
- Scale to zero mean and unit variance
- Remove or encode categorical features
- Handle outliers appropriately
- Consider dimensionality reduction
If empty, uses all numeric columns except target.
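The scaling advice above is typically applied as a preprocessing pipeline; a sketch assuming scikit-learn, with synthetic data:

```python
# Zero-mean/unit-variance scaling before SGD (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

pipe = make_pipeline(
    StandardScaler(),              # scale each feature to zero mean, unit variance
    SGDClassifier(random_state=0), # SGD is sensitive to feature scale
)
pipe.fit(X, y)
```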
SelectTarget
columnTarget column for classification:
Requirements:
- Categorical labels
- No missing values
- At least two classes
- Properly encoded
Preprocessing:
- Encode categorical labels
- Check class balance
- Consider label noise
- Verify label consistency
Params
oneofStandard configuration optimized for general classification tasks:
Default settings:
- Hinge loss (SVM-like behavior)
- L2 regularization (prevent overfitting)
- Optimal learning rate schedule
- Balanced class weights
- Shuffled training data
Best suited for:
- Initial model exploration
- Medium-sized datasets
- When quick results needed
- Standard classification tasks
Note: Provides good baseline performance for most applications.
Fine-grained control over SGD Classifier parameters for model optimization:
Parameter categories:
- Model Architecture:
  - Loss function selection
  - Penalty type and strength
  - Intercept fitting
- Optimization Control:
  - Learning rate schedule
  - Convergence criteria
  - Early stopping settings
- Training Behavior:
  - Class weight adjustment
  - Shuffling options
  - Warm start capability
Use cases:
- Performance optimization
- Specific problem requirements
- Complex dataset handling
- Production deployment tuning
Loss
enumLoss function determining how the model penalizes misclassifications:
Selection guide:
- Hinge: For maximum-margin linear classification (SVM-like)
- LogLoss: For probabilistic predictions
- ModifiedHuber: For robust classification with outliers
- Perceptron: For simple binary classification
- Huber/Epsilon variants: For robust regression-like behavior
Impact on learning:
- Affects model's sensitivity to errors
- Determines probability estimation capability
- Influences convergence behavior
- Changes outlier handling
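One practical consequence: the loss decides whether probability estimates are available. A sketch assuming scikit-learn's loss names ("log_loss" in recent releases; older ones used "log"); the variants are detailed below:

```python
# Loss choice controls probability support (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

svm_like = SGDClassifier(loss="hinge").fit(X, y)         # margin classifier, no predict_proba
logistic = SGDClassifier(loss="log_loss").fit(X, y)      # calibrated probability estimates
robust = SGDClassifier(loss="modified_huber").fit(X, y)  # probabilities + outlier tolerance

proba = logistic.predict_proba(X[:5])  # only log_loss / modified_huber support this
```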
Maximum-margin classification loss:
Characteristics:
- Creates SVM-like classifier
- No probability estimates
- Sharp decision boundary
- Good generalization
Best for:
- Binary classification
- When clear separation desired
- Low-noise datasets
Squared hinge loss:
Characteristics:
- Stronger penalty for violations
- Differentiable everywhere
- More sensitive to outliers
Best for:
- Smoother optimization
- When stronger penalties needed
- Clean datasets
Smooth variant of hinge loss with better outlier tolerance:
Characteristics:
- Probability estimates
- Robust to outliers
- Smooth optimization
Best for:
- Noisy datasets
- When probabilities needed
- Robust classification
Logistic regression loss:
Characteristics:
- Natural probability estimates
- Smooth gradients
- Well-calibrated predictions
Best for:
- Probability estimation
- Risk assessment
- Multiclass problems
- When calibrated probabilities needed
Linear loss used by perceptron:
Characteristics:
- Simple update rule
- No hyperparameters
- Online learning
Best for:
- Simple binary problems
- Online learning
- Quick prototyping
Huber loss combines squared and absolute loss:
Characteristics:
- Robust to outliers
- Smooth transition
- Configurable sensitivity
Best for:
- Noisy data
- Robust classification
- When outliers present
Epsilon-insensitive loss:
Characteristics:
- Ignores small errors
- SVR-like behavior
- Sparse solution
Best for:
- Regression-like classification
- When small errors acceptable
- Feature selection
Squared epsilon-insensitive loss:
Characteristics:
- Smooth loss function
- Quadratic penalty beyond epsilon
- More sensitive to large errors
Best for:
- Smooth optimization
- When gradual error penalties needed
- Regression-like classification
Mean squared error loss:
Characteristics:
- Quadratic error penalty
- Differentiable everywhere
- Sensitive to outliers
Best for:
- Regression-like problems
- Clean datasets
- When smooth gradients needed
Penalty
enumThe penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.
L2 penalty (Ridge):
Characteristics:
- Squared magnitude penalty
- Shrinks all weights toward zero
- Handles correlated features well
- Produces dense solutions
Best for:
- Most classification tasks
- When all features potentially relevant
- Dealing with multicollinearity
- Stable solutions needed
L1 penalty (Lasso):
Characteristics:
- Absolute magnitude penalty
- Produces sparse solutions
- Feature selection capability
- Path algorithms possible
Best for:
- Feature selection
- High-dimensional data
- When sparse solutions desired
- Eliminating irrelevant features
ElasticNet penalty:
Characteristics:
- Combines L1 and L2 penalties
- Controls sparsity via ratio
- Group selection capability
- More stable than pure L1
Best for:
- Correlated features
- When both sparsity and stability needed
- Group feature selection
- Balanced regularization
No regularization penalty applied:
Characteristics:
- Uncontrolled model complexity
- Maximum flexibility
- Risk of overfitting
- Full parameter range
Best for:
- Very small datasets
- Theoretical analysis
- When bias undesirable
- Testing/debugging purposes
Warning: Use with caution as it may lead to overfitting
Alpha
f64Regularization strength multiplier:
Effects:
- Larger values: Stronger regularization
- Smaller values: More flexible model
Typical ranges:
- 1e-4: Default, good starting point
- 1e-5 to 1e-3: Common range
- >1e-3: Strong regularization
Note: Critical for preventing overfitting
L1Ratio
f64ElasticNet mixing parameter:
Behavior:
- 0.0: Pure L2 penalty
- 1.0: Pure L1 penalty
- Between: Mixed penalty
Guidelines:
- 0.15: Default, balanced mix
- <0.5: Favor stability
- >0.5: Favor sparsity
Only used with ElasticNet penalty
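A configuration sketch combining penalty, strength, and mixing ratio, using the documented defaults (scikit-learn parameter names assumed):

```python
# ElasticNet-regularized SGD with the documented default values
# (assumes scikit-learn; values are starting points, not tuned settings).
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    penalty="elasticnet",
    alpha=1e-4,    # overall regularization strength
    l1_ratio=0.15, # 0.0 = pure L2, 1.0 = pure L1
)
```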
FitIntercept
boolWhether to calculate the intercept term:
Impact:
- true: Model learns bias term (recommended)
- false: Assumes centered data
Set false only when:
- Data is pre-centered
- Zero-intercept desired
- Testing theoretical properties
MaxIter
u64Maximum number of training epochs:
Guidelines:
- 1000: Default, suitable for most cases
- <1000: Simple problems, quick results
- >1000: Complex problems, better convergence
Consider increasing if:
- Model not converging
- Complex decision boundaries needed
- High precision required
Tolerance
f64Convergence criterion threshold:
Stopping rule:
- Stops when loss > best_loss - tolerance for n_iter_no_change consecutive epochs
Typical values:
- 1e-3: Default balance
- 1e-4: Higher precision
- 1e-2: Faster convergence
Trade-off: Precision vs. Speed
LearningRate
enumLearning rate schedule controlling parameter update steps:
Schedule types:
- Constant: Fixed step size (simple but may need tuning)
- Optimal: Theoretical optimal rate for some loss functions
- Invscaling: Gradually decreasing rate
- Adaptive: Automatically adjusted based on training performance
Selection criteria:
- Dataset size and noise level
- Convergence stability needs
- Training time constraints
- Model performance requirements
Fixed learning rate:
Characteristics:
- Simplest schedule
- No decay over time
- Requires careful tuning
- Can be unstable
Best for:
- Simple problems
- Short training runs
- When behavior well understood
- Quick prototyping
Note: Initial rate (eta0) selection crucial
Theoretical optimal schedule:
Characteristics:
- Theoretically motivated
- Automatic scaling with regularization
- Proven convergence properties
- Robust performance
Best for:
- Convex optimization
- When theoretical guarantees needed
- Standard classification tasks
- Production systems
Note: t0 chosen heuristically based on data
Inverse scaling schedule:
Characteristics:
- Gradual rate decay
- Configurable decay speed
- Smooth learning transition
- Predictable behavior
Best for:
- General purpose learning
- Long training runs
- When gradual decay needed
- Fine-tuning models
Note: power_t parameter controls decay speed
Adaptive learning rate with dynamic adjustments:
Behavior:
- Starts with eta0
- Monitors training progress
- Reduces rate by factor of 5 when progress stalls
- Adapts to problem difficulty
Best for:
- Difficult optimization problems
- Unknown optimal learning rates
- Avoiding manual tuning
- Production systems
Note: Uses n_iter_no_change and tolerance parameters
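The four schedules map onto configuration as follows; a sketch assuming scikit-learn's parameter names:

```python
# One instance per learning-rate schedule (assumes scikit-learn).
from sklearn.linear_model import SGDClassifier

constant = SGDClassifier(learning_rate="constant", eta0=0.1)  # fixed step size
optimal = SGDClassifier(learning_rate="optimal")              # eta0 not used
invscaling = SGDClassifier(learning_rate="invscaling",
                           eta0=0.1, power_t=0.5)             # gradual decay
adaptive = SGDClassifier(learning_rate="adaptive", eta0=0.1,
                         n_iter_no_change=5, tol=1e-3)        # rate / 5 on stall
```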
Shuffle
boolWhether to shuffle training data after each epoch:
Benefits of true:
- Better convergence
- Prevents cyclical patterns
- Reduces variance
Set false when:
- Reproducibility critical
- Order meaningful
- Debugging needed
Epsilon
f64Epsilon parameter for epsilon-sensitive losses:
Affects:
- Huber loss
- Epsilon-insensitive losses
Impact:
- Controls error sensitivity
- Defines insensitive region
- Affects solution sparsity
Only relevant for specific losses
RandomState
u64Random number generator seed:
Controls randomness in:
- Data shuffling
- Weight initialization
- Sample selection
Fixed value ensures:
- Reproducible results
- Consistent behavior
- Debugging capability
Eta0
f64Initial learning rate:
Used in schedules:
- Constant: Fixed value
- Invscaling: Starting value
- Adaptive: Initial rate
Typical ranges:
- 0.1: Default starting point
- 0.01-1.0: Common range
- Adjust based on convergence
PowerT
f64Power of learning rate decay for invscaling:
Schedule: eta = eta0 / pow(t, power_t)
Common values:
- 0.5: Default, standard decay
- <0.5: Slower decay
- >0.5: Faster decay
Only used with invscaling schedule
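A worked decay under this schedule, with illustrative values eta0 = 0.1 and power_t = 0.5:

```python
# eta = eta0 / t**power_t with eta0 = 0.1, power_t = 0.5 (illustrative values).
eta0, power_t = 0.1, 0.5
for t in (1, 4, 100, 10_000):
    print(f"t={t}: eta={eta0 / t ** power_t:.4f}")
# t=1: 0.1000, t=4: 0.0500, t=100: 0.0100, t=10000: 0.0010
```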
EarlyStopping
boolWhether to use validation-based early stopping:
Benefits:
- Prevents overfitting
- Reduces training time
- Automatic stopping
Requires:
- Validation fraction
- Patience setting
- Tolerance threshold
Fraction of training data for validation:
Used when early_stopping=true
Typical values:
- 0.1: Default (10% validation)
- 0.2: More validation emphasis
- 0.3: Large validation set
Trade-off: Training size vs. validation reliability
Early stopping patience parameter:
Stops training if no improvement in:
- n_iter_no_change consecutive epochs
Values:
- 5: Default patience
- <5: Aggressive stopping
- >5: More optimization chances
Used with early_stopping=true
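Putting the three pieces together, a configuration sketch (scikit-learn names assumed):

```python
# Validation-based early stopping (assumes scikit-learn).
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    early_stopping=True,
    validation_fraction=0.1,  # 10% of training data held out
    n_iter_no_change=5,       # patience in epochs
    tol=1e-3,                 # minimum improvement threshold
)
```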
ClassWeights
enumClass weight adjustment strategy for handling imbalanced datasets:
Mathematical form:
- Balanced: $w_c = n_{\text{samples}} / (n_{\text{classes}} \cdot n_c)$, where $n_c$ is the number of samples in class $c$
- Uniform: $w_c = 1$ for all classes
Impact on model:
- Affects class importance during training
- Influences decision boundary placement
- Controls misclassification penalties
- Balances precision vs recall trade-off
Selection criteria:
- Class distribution in data
- Cost of different error types
- Business/domain requirements
- Performance metrics priorities
Equal weights for all classes:
Characteristics:
- No adjustment for class frequencies
- Natural class proportions preserved
- Faster training process
- Original data distribution maintained
Best for:
- Balanced datasets (similar class frequencies)
- When natural proportions matter
- Representative sampling
- When all errors equally costly
Warning: May underperform on imbalanced data
Weights inversely proportional to class frequencies:
Characteristics:
- Automatically adjusts for class imbalance
- Higher weights for minority classes
- Equalizes class importance
- Helps rare class detection
Best for:
- Imbalanced datasets
- Rare event detection
- Fraud detection
- Medical diagnosis
- Anomaly detection
Note: May increase variance in predictions
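A side-by-side sketch of the two strategies (scikit-learn's class_weight parameter assumed):

```python
# Uniform vs. balanced class weighting (assumes scikit-learn).
from sklearn.linear_model import SGDClassifier

uniform = SGDClassifier(class_weight=None)         # natural class proportions
balanced = SGDClassifier(class_weight="balanced")  # w_c = n_samples / (n_classes * n_c)
```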
WarmStart
boolWhen set to true, reuses the solution of the previous call to fit as initialization; otherwise, the previous solution is erased.
Effects when true:
- Reuses previous weights
- Continues training
- Faster convergence
Useful for:
- Incremental learning
- Transfer learning
- Parameter search
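A minimal warm-start sketch, assuming scikit-learn semantics, where a second fit call resumes rather than restarts:

```python
# Warm start: the second fit continues from the learned coefficients
# (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = SGDClassifier(warm_start=True, max_iter=5, tol=None, random_state=0)
clf.fit(X, y)  # initial training
clf.fit(X, y)  # resumes optimization from the current weights
```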
Exhaustive hyperparameter optimization through grid search:
Search process:
- Tests all parameter combinations
- Uses cross-validation for evaluation
- Selects best performing configuration
- Optimizes for specified metric
Key parameters to tune:
- Loss function and penalty type
- Learning rate schedule
- Regularization strength (alpha)
- Class weights for imbalanced data
Performance considerations:
- Computational cost grows exponentially with parameters
- Memory usage depends on data size
- Consider validation strategy carefully
- Balance search space vs. computation time
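A grid-search sketch over a few of the lists documented below; the grid values and scoring metric are illustrative, assuming scikit-learn's GridSearchCV:

```python
# Exhaustive grid search over SGD hyperparameters (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1_000, random_state=0)

param_grid = {
    "loss": ["hinge", "log_loss"],
    "penalty": ["l2", "elasticnet"],
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],  # log-scale spacing
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid,
                      cv=3, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note the combinatorics: this modest grid already trains 2 x 2 x 4 x 2 x 3 = 96 models.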
Loss
[enum, ...]Loss function determining how the model penalizes misclassifications:
Selection guide:
- Hinge: For maximum-margin linear classification (SVM-like)
- LogLoss: For probabilistic predictions
- ModifiedHuber: For robust classification with outliers
- Perceptron: For simple binary classification
- Huber/Epsilon variants: For robust regression-like behavior
Impact on learning:
- Affects model's sensitivity to errors
- Determines probability estimation capability
- Influences convergence behavior
- Changes outlier handling
Maximum-margin classification loss:
Characteristics:
- Creates SVM-like classifier
- No probability estimates
- Sharp decision boundary
- Good generalization
Best for:
- Binary classification
- When clear separation desired
- Low-noise datasets
Squared hinge loss:
Characteristics:
- Stronger penalty for violations
- Differentiable everywhere
- More sensitive to outliers
Best for:
- Smoother optimization
- When stronger penalties needed
- Clean datasets
Smooth variant of hinge loss with better outlier tolerance:
Characteristics:
- Probability estimates
- Robust to outliers
- Smooth optimization
Best for:
- Noisy datasets
- When probabilities needed
- Robust classification
Logistic regression loss:
Characteristics:
- Natural probability estimates
- Smooth gradients
- Well-calibrated predictions
Best for:
- Probability estimation
- Risk assessment
- Multiclass problems
- When calibrated probabilities needed
Linear loss used by perceptron:
Characteristics:
- Simple update rule
- No hyperparameters
- Online learning
Best for:
- Simple binary problems
- Online learning
- Quick prototyping
Huber loss combines squared and absolute loss:
Characteristics:
- Robust to outliers
- Smooth transition
- Configurable sensitivity
Best for:
- Noisy data
- Robust classification
- When outliers present
Epsilon-insensitive loss:
Characteristics:
- Ignores small errors
- SVR-like behavior
- Sparse solution
Best for:
- Regression-like classification
- When small errors acceptable
- Feature selection
Squared epsilon-insensitive loss:
Characteristics:
- Smooth loss function
- Quadratic penalty beyond epsilon
- More sensitive to large errors
Best for:
- Smooth optimization
- When gradual error penalties needed
- Regression-like classification
Mean squared error loss:
Characteristics:
- Quadratic error penalty
- Differentiable everywhere
- Sensitive to outliers
Best for:
- Regression-like problems
- Clean datasets
- When smooth gradients needed
Penalty
[enum, ...]The penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.
L2 penalty (Ridge):
Characteristics:
- Squared magnitude penalty
- Shrinks all weights toward zero
- Handles correlated features well
- Produces dense solutions
Best for:
- Most classification tasks
- When all features potentially relevant
- Dealing with multicollinearity
- Stable solutions needed
L1 penalty (Lasso):
Characteristics:
- Absolute magnitude penalty
- Produces sparse solutions
- Feature selection capability
- Path algorithms possible
Best for:
- Feature selection
- High-dimensional data
- When sparse solutions desired
- Eliminating irrelevant features
ElasticNet penalty:
Characteristics:
- Combines L1 and L2 penalties
- Controls sparsity via ratio
- Group selection capability
- More stable than pure L1
Best for:
- Correlated features
- When both sparsity and stability needed
- Group feature selection
- Balanced regularization
No regularization penalty applied:
Characteristics:
- Uncontrolled model complexity
- Maximum flexibility
- Risk of overfitting
- Full parameter range
Best for:
- Very small datasets
- Theoretical analysis
- When bias undesirable
- Testing/debugging purposes
Warning: Use with caution as it may lead to overfitting
Alpha
[f64, ...]Regularization strength values to test:
Recommended ranges:
- Coarse: [1e-4, 1e-3, 1e-2, 1e-1]
- Fine: [5e-5, 1e-4, 5e-4, 1e-3]
- Wide: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
Note: Log-scale spacing recommended
L1Ratio
[f64, ...]ElasticNet mixing parameter values:
Common ranges:
- Basic: [0.15, 0.5, 0.85]
- Detailed: [0.1, 0.3, 0.5, 0.7, 0.9]
- Complete: [0.0, 0.25, 0.5, 0.75, 1.0]
Note: Only used with ElasticNet penalty
FitIntercept
[bool, ...]Whether to fit intercept combinations:
Options:
- [true]: Standard modeling (default)
- [true, false]: Compare both options
Note: Usually keep [true] unless data centered
MaxIter
[u64, ...]Maximum iterations to test:
Typical ranges:
- Standard: [500, 1000, 2000]
- Extended: [1000, 2000, 5000]
- Quick: [100, 500, 1000]
Include larger values if convergence issues
Tolerance
[f64, ...]Convergence tolerance values:
Common ranges:
- Standard: [1e-3, 1e-4, 1e-5]
- Coarse: [1e-2, 1e-3, 1e-4]
- Fine: [5e-4, 1e-4, 5e-5]
Trade-off: Precision vs computation time
LearningRate
[enum, ...]Learning rate schedule controlling parameter update steps:
Schedule types:
- Constant: Fixed step size (simple but may need tuning)
- Optimal: Theoretical optimal rate for some loss functions
- Invscaling: Gradually decreasing rate
- Adaptive: Automatically adjusted based on training performance
Selection criteria:
- Dataset size and noise level
- Convergence stability needs
- Training time constraints
- Model performance requirements
Fixed learning rate:
Characteristics:
- Simplest schedule
- No decay over time
- Requires careful tuning
- Can be unstable
Best for:
- Simple problems
- Short training runs
- When behavior well understood
- Quick prototyping
Note: Initial rate (eta0) selection crucial
Theoretical optimal schedule:
Characteristics:
- Theoretically motivated
- Automatic scaling with regularization
- Proven convergence properties
- Robust performance
Best for:
- Convex optimization
- When theoretical guarantees needed
- Standard classification tasks
- Production systems
Note: t0 chosen heuristically based on data
Inverse scaling schedule:
Characteristics:
- Gradual rate decay
- Configurable decay speed
- Smooth learning transition
- Predictable behavior
Best for:
- General purpose learning
- Long training runs
- When gradual decay needed
- Fine-tuning models
Note: power_t parameter controls decay speed
Adaptive learning rate with dynamic adjustments:
Behavior:
- Starts with eta0
- Monitors training progress
- Reduces rate by factor of 5 when progress stalls
- Adapts to problem difficulty
Best for:
- Difficult optimization problems
- Unknown optimal learning rates
- Avoiding manual tuning
- Production systems
Note: Uses n_iter_no_change and tolerance parameters
Eta0
[f64, ...]Initial learning rates to test:
Typical ranges:
- Conservative: [0.01, 0.1, 0.5]
- Aggressive: [0.1, 0.5, 1.0]
- Wide: [0.01, 0.1, 0.5, 1.0]
Critical for Constant and Invscaling schedules
PowerT
[f64, ...]Learning rate decay powers:
Common values:
- Standard: [0.5] (default)
- Range: [0.3, 0.5, 0.7]
- Wide: [0.25, 0.5, 0.75]
Only relevant for Invscaling schedule
Epsilon
[f64, ...]Epsilon values for sensitive losses:
Typical ranges:
- Standard: [0.1, 0.2, 0.3]
- Fine: [0.05, 0.1, 0.15]
- Wide: [0.01, 0.1, 0.5]
Only for epsilon-sensitive loss functions
ClassWeights
[enum, ...]Class weight adjustment strategy for handling imbalanced datasets:
Mathematical form:
- Balanced: $w_c = n_{\text{samples}} / (n_{\text{classes}} \cdot n_c)$, where $n_c$ is the number of samples in class $c$
- Uniform: $w_c = 1$ for all classes
Impact on model:
- Affects class importance during training
- Influences decision boundary placement
- Controls misclassification penalties
- Balances precision vs recall trade-off
Selection criteria:
- Class distribution in data
- Cost of different error types
- Business/domain requirements
- Performance metrics priorities
Equal weights for all classes:
Characteristics:
- No adjustment for class frequencies
- Natural class proportions preserved
- Faster training process
- Original data distribution maintained
Best for:
- Balanced datasets (similar class frequencies)
- When natural proportions matter
- Representative sampling
- When all errors equally costly
Warning: May underperform on imbalanced data
Weights inversely proportional to class frequencies:
Characteristics:
- Automatically adjusts for class imbalance
- Higher weights for minority classes
- Equalizes class importance
- Helps rare class detection
Best for:
- Imbalanced datasets
- Rare event detection
- Fraud detection
- Medical diagnosis
- Anomaly detection
Note: May increase variance in predictions
Shuffle
boolData shuffling between epochs:
Effects:
- true: Better convergence (recommended)
- false: Deterministic order
Keep true unless order significant
RandomState
u64Random seed for reproducibility:
Controls:
- Cross-validation splits
- Data shuffling
- Model initialization
Fixed value ensures reproducible search
EarlyStopping
boolEarly stopping usage:
When true:
- Uses validation set
- Stops on plateau
- May speed up search
Consider true for large parameter spaces
Validation set size for early stopping:
Typical values:
- 0.1: Standard (10%)
- 0.2: Larger validation
- 0.15: Balanced split
Only used if early_stopping=true
Early stopping patience:
Values:
- 5: Default patience
- <5: Aggressive stopping
- >5: More chances to improve
Affects computation time significantly
RefitScore
enumMetric for evaluating model performance during training and validation:
Selection criteria:
- Default: Uses model's built-in scoring method
- Accuracy: For balanced datasets, overall correctness
- BalancedAccuracy: For imbalanced datasets, class-weighted
Use cases:
- Default: When standard metrics suffice
- Accuracy: When all classes equally important
- BalancedAccuracy: When minority classes critical
Impact on model selection:
- Guides hyperparameter optimization
- Affects final model choice
- Influences cross-validation results
Uses the estimator's built-in scoring method. Typically accuracy for classification tasks. Best when default metrics align with problem goals.
Ratio of correct predictions to total predictions. Best for:
- Balanced class distributions
- When all misclassifications equally costly
- Simple performance assessment
Average of recall obtained on each class. Preferred when:
- Classes are imbalanced
- Minority class detection important
- Equal class importance required
Split
oneofStandard train-test split configuration optimized for general classification tasks.
Configuration:
- Test size: 20% (0.2)
- Random seed: 98
- Shuffling: Enabled
- Stratification: Based on target distribution
Advantages:
- Preserves class distribution
- Provides reliable validation
- Suitable for most datasets
Best for:
- Medium to large datasets
- Independent observations
- Initial model evaluation
Splitting uses the ShuffleSplit or StratifiedShuffleSplit strategy, depending on the stratified field. Note: if shuffle is false, then stratified must also be false.
Configurable train-test split parameters for specialized requirements. Allows fine-tuning of data division strategy for specific use cases or constraints.
Use cases:
- Time series data
- Grouped observations
- Specific train/test ratios
- Custom validation schemes
RandomState
u64Random seed for reproducible splits. Ensures:
- Consistent train/test sets
- Reproducible experiments
- Comparable model evaluations
Same seed guarantees identical splits across runs.
Shuffle
boolData shuffling before splitting. Effects:
- true: Randomizes order, better for i.i.d. data
- false: Maintains order, important for time series
When to disable:
- Time dependent data
- Sequential patterns
- Grouped observations
TrainSize
f64Proportion of data for training. Considerations:
- Larger (e.g., 0.8-0.9): Better model learning
- Smaller (e.g., 0.5-0.7): Better validation
Common splits:
- 0.8: Standard (80/20 split)
- 0.7: More validation emphasis
- 0.9: More training emphasis
Stratified
boolMaintain class distribution in splits. Important when:
- Classes are imbalanced
- Small classes present
- Representative splits needed
Requirements:
- Classification tasks only
- Cannot use with shuffle=false
- Sufficient samples per class
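A split sketch mirroring the default configuration above (80/20, shuffled, stratified, seed 98), assuming scikit-learn's train_test_split:

```python
# Stratified 80/20 split (assumes scikit-learn; data is synthetic
# and deliberately imbalanced to show why stratification matters).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,   # 80/20 split
    shuffle=True,     # stratification requires shuffling
    stratify=y,       # preserve class proportions in both sets
    random_state=98,  # seed from the default configuration
)
```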
Cv
oneofStandard cross-validation configuration using stratified 3-fold splitting.
Configuration:
- Folds: 3
- Method: StratifiedKFold
- Stratification: Preserves class proportions
Advantages:
- Balanced evaluation
- Reasonable computation time
- Good for medium-sized datasets
Limitations:
- May be insufficient for small datasets
- Higher variance than larger fold counts
- May miss some data patterns
Configurable stratified k-fold cross-validation for specific validation requirements.
Features:
- Adjustable fold count, with NFolds determining the number of splits
- Stratified sampling
- Preserved class distributions
Use cases:
- Small datasets (more folds)
- Large datasets (fewer folds)
- Detailed model evaluation
- Robust performance estimation
NFolds
u32Number of cross-validation folds. Guidelines:
- 3-5: Large datasets, faster training
- 5-10: Standard choice, good balance
- 10+: Small datasets, thorough evaluation
Trade-offs:
- More folds: Better evaluation, slower training
- Fewer folds: Faster training, higher variance
Must be at least 2.
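A sketch of stratified k-fold scoring with the standard 5 folds, assuming scikit-learn:

```python
# Stratified 5-fold cross-validation (assumes scikit-learn; data synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SGDClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```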
K-fold cross-validation without stratification. Divides data into k consecutive folds for iterative validation.
Process:
- Splits data into k equal parts
- Each fold serves as validation once
- Remaining k-1 folds form training set
Use cases:
- Regression problems
- Large, balanced datasets
- When stratification unnecessary
- Continuous target variables
Limitations:
- May not preserve class distributions
- Less suitable for imbalanced data
- Can create biased splits with ordered data
NSplits
u32Number of folds for cross-validation. Recommended values:
- 5: Standard choice (default)
- 3: Large datasets/quick evaluation
- 10: Thorough evaluation/smaller datasets
Trade-offs:
- Higher values: More thorough, computationally expensive
- Lower values: Faster, potentially higher variance
Must be at least 2 for valid cross-validation.
RandomState
u64Random seed for fold generation when shuffling. Important for:
- Reproducible results
- Consistent fold assignments
- Benchmark comparisons
- Debugging and validation
Set specific value for reproducibility across runs.
Shuffle
boolWhether to shuffle data before splitting into folds. Effects:
- true: Randomized fold composition (recommended)
- false: Sequential splitting
Enable when:
- Data may have ordering
- Better fold independence needed
Disable for:
- Time series data
- Ordered observations
Stratified K-fold cross-validation maintaining class proportions across folds.
Key features:
- Preserves class distribution in each fold
- Handles imbalanced datasets
- Ensures representative splits
Best for:
- Classification problems
- Imbalanced class distributions
- When class proportions matter
Requirements:
- Classification tasks only
- Sufficient samples per class
- Categorical target variable
NSplits
u32Number of stratified folds. Typical values:
- 5: Standard for most cases
- 3: Quick evaluation/large datasets
- 10: Detailed evaluation/smaller datasets
Considerations:
- Must allow sufficient samples per class per fold
- Balance between stability and computation time
- Consider smallest class size when choosing
RandomState
u64Seed for reproducible stratified splits. Ensures:
- Consistent fold assignments
- Reproducible results
- Comparable experiments
- Systematic validation
Fixed seed guarantees identical stratified splits.
Shuffle
boolData shuffling before stratified splitting. Impact:
- true: Randomizes while maintaining stratification
- false: Maintains data order within strata
Use cases:
- true: Independent observations
- false: Grouped or sequential data
Class proportions maintained regardless of setting.
Random permutation cross-validator with independent sampling.
Characteristics:
- Random sampling for each split
- Independent train/test sets
- More flexible than K-fold
- Can have overlapping test sets
Advantages:
- Control over test size
- Fresh splits each iteration
- Good for large datasets
Limitations:
- Some samples might never be tested
- Others might be tested multiple times
- No guarantee of complete coverage
NSplits
u32Number of random splits to perform. Common values:
- 5: Standard evaluation
- 10: More thorough assessment
- 3: Quick estimates
Trade-offs:
- More splits: Better estimation, longer runtime
- Fewer splits: Faster, less stable estimates
Balance between computation and stability.
RandomState
u64Random seed for reproducible shuffling. Controls:
- Split randomization
- Sample selection
- Result reproducibility
Important for:
- Debugging
- Comparative studies
- Result verification
TestSize
f64Proportion of samples for test set. Common ratios:
- 0.2: Standard (80/20 split)
- 0.25: More validation emphasis
- 0.1: More training data
Considerations:
- Dataset size
- Model complexity
- Validation requirements
It must be between 0.0 and 1.0.
Stratified random permutation cross-validator combining shuffle-split with stratification.
Features:
- Maintains class proportions
- Random sampling within strata
- Independent splits
- Flexible test size
Ideal for:
- Imbalanced datasets
- Large-scale problems
- When class distributions matter
- Flexible validation schemes
NSplits
u32Number of stratified random splits. Recommended values:
- 5: Standard evaluation
- 10: Detailed analysis
- 3: Quick assessment
Consider:
- Sample size per class
- Computational resources
- Stability requirements
RandomState
u64Seed for reproducible stratified sampling. Ensures:
- Consistent class proportions
- Reproducible splits
- Comparable experiments
Critical for:
- Benchmarking
- Research studies
- Quality assurance
TestSize
f64Fraction of samples for stratified test set. Common splits:
- 0.2: Balanced evaluation
- 0.3: More thorough testing
- 0.15: Preserve training size
Consider:
- Minority class size
- Overall dataset size
- Validation objectives
It must be between 0.0 and 1.0.
Time Series cross-validator. Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. It is a variation of k-fold which returns the first k folds as the train set and the (k+1)-th fold as the test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. It also adds all surplus data to the first training partition, which is always used to train the model.
Key features:
- Maintains temporal dependence
- Expanding window approach
- Forward-chaining splits
- No future data leakage
Use cases:
- Sequential data
- Financial forecasting
- Temporal predictions
- Time-dependent patterns
Note: Training sets are supersets of previous iterations.
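A small sketch of the expanding-window behavior, assuming scikit-learn's TimeSeriesSplit; with 12 samples and 3 splits, each test fold gets 12 / (3 + 1) = 3 samples:

```python
# Expanding-window splits: each train set is a superset of the previous one
# (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3, gap=0)
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2]             test: [3 4 5]
# train: [0 1 2 3 4 5]       test: [6 7 8]
# train: [0 1 2 3 4 5 6 7 8] test: [ 9 10 11]
```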
NSplits
u32Number of temporal splits. Typical values:
- 5: Standard forward chaining
- 3: Limited historical data
- 10: Long time series
Impact:
- Affects training window growth
- Determines validation points
- Influences computational load
MaxTrainSize
u64Maximum size of training set. Should be strictly less than the number of samples. Applications:
- 0: Use all available past data
- >0: Rolling window of fixed size
Use cases:
- Limit historical relevance
- Control computational cost
- Handle concept drift
- Memory constraints
TestSize
u64Number of samples in each test set. When 0:
- Auto-calculated as n_samples/(n_splits+1)
- Ensures equal-sized test sets
Considerations:
- Forecast horizon
- Validation requirements
- Available future data
Gap
u64Number of samples to exclude from the end of each train set before the test set, i.e. the gap between train and test sets. Uses:
- Avoid data leakage
- Model forecast lag
- Buffer periods
Common scenarios:
- 0: Continuous prediction
- >0: Forward gap for realistic evaluation
- Match business forecasting needs