SGD Classifier Layer

Stochastic Gradient Descent (SGD) Classifier - A versatile linear classification algorithm that learns incrementally using gradient descent optimization.

Mathematical form:

  w_{t+1} = w_t - η_t ∇L(w_t; x_i, y_i)

where:

  • w_t is the model weights at step t
  • η_t is the learning rate at step t
  • L is the loss function
  • (x_i, y_i) is a single training example

Key characteristics:

  • Efficient for large-scale learning
  • Online/incremental learning capability
  • Multiple loss function options
  • Flexible regularization schemes
  • Adaptive learning rates

Common applications:

  • Text classification
  • Large-scale document categorization
  • Real-time classification tasks
  • Stream data processing
  • High-dimensional sparse data

Outputs:

  1. Predicted Table: Input data with prediction columns
  2. Validation Results: Cross-validation metrics
  3. Test Metric: Test set performance metrics
  4. ROC Curve Data: ROC analysis information
  5. Confusion Matrix: Detailed classification results
  6. Feature Importances: Feature coefficients and importance

Note: Particularly effective for large datasets where memory efficiency is crucial.
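For orientation, the sketch below shows roughly equivalent usage of scikit-learn's SGDClassifier, which this layer mirrors; the synthetic dataset and parameter values are illustrative assumptions, not the layer's internals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative synthetic data; any numeric feature matrix works.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Defaults shown explicitly: hinge loss + L2 penalty gives a linear-SVM-like model.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class labels
print(clf.score(X, y))     # mean accuracy on the training data
```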

Inputs:

  • 0: Table (input data)

Outputs:

  • 0: Predicted Table
  • 1: Validation Results
  • 2: Test Metric
  • 3: ROC Curve Data
  • 4: Confusion Matrix
  • 5: Feature Importances

SelectFeatures

[column, ...]

Feature columns for SGD classification:

Selection guidelines:

  • Numerical features preferred
  • Standardized/scaled values recommended
  • Handle missing values beforehand
  • Consider feature interactions

Preprocessing tips:

  • Scale to zero mean and unit variance
  • Remove or encode categorical features
  • Handle outliers appropriately
  • Consider dimensionality reduction

If empty, uses all numeric columns except target.
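Because SGD is sensitive to feature scale, scaling is worth wiring in before the classifier. A minimal sketch, assuming a scikit-learn pipeline stands in for this layer's preprocessing:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardizing to zero mean / unit variance typically helps SGD converge.
model = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))
model.fit(X, y)
```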

Target column for classification:

Requirements:

  • Categorical labels
  • No missing values
  • At least two classes
  • Properly encoded

Preprocessing:

  • Encode categorical labels
  • Check class balance
  • Consider label noise
  • Verify label consistency
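A quick pre-flight check of these requirements, sketched with pandas (the y series is a stand-in for your target column):

```python
import pandas as pd

y = pd.Series(["spam", "ham", "ham", "spam", "ham"])  # illustrative labels

assert y.notna().all(), "no missing labels allowed"
assert y.nunique() >= 2, "need at least two classes"
print(y.value_counts())  # inspect class balance before training
```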

Params

oneof
DefaultParams

Standard configuration optimized for general classification tasks:

Default settings:

  • Hinge loss (SVM-like behavior)
  • L2 regularization (prevent overfitting)
  • Optimal learning rate schedule
  • Balanced class weights
  • Shuffled training data

Best suited for:

  • Initial model exploration
  • Medium-sized datasets
  • When quick results needed
  • Standard classification tasks

Note: Provides good baseline performance for most applications.
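Assuming these defaults map onto scikit-learn's argument names (an assumption, not a guarantee about this layer's internals), the equivalent constructor call would look roughly like:

```python
from sklearn.linear_model import SGDClassifier

# Presumed sklearn equivalent of DefaultParams; names are sklearn's, not this layer's.
clf = SGDClassifier(
    loss="hinge",              # SVM-like behavior
    penalty="l2",              # standard regularizer, prevents overfitting
    learning_rate="optimal",   # theoretically motivated schedule
    class_weight="balanced",   # reweight classes by inverse frequency
    shuffle=True,              # reshuffle training data each epoch
)
```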

Fine-grained control over SGD Classifier parameters for model optimization:

Parameter categories:

  1. Model Architecture:

    • Loss function selection
    • Penalty type and strength
    • Intercept fitting
  2. Optimization Control:

    • Learning rate schedule
    • Convergence criteria
    • Early stopping settings
  3. Training Behavior:

    • Class weight adjustment
    • Shuffling options
    • Warm start capability

Use cases:

  • Performance optimization
  • Specific problem requirements
  • Complex dataset handling
  • Production deployment tuning

Loss

enum
Hinge

Loss function determining how the model penalizes misclassifications:

Selection guide:

  • Hinge: For maximum-margin linear classification (SVM-like)
  • LogLoss: For probabilistic predictions
  • ModifiedHuber: For robust classification with outliers
  • Perceptron: For simple binary classification
  • Huber/Epsilon variants: For robust regression-like behavior

Impact on learning:

  • Affects model's sensitivity to errors
  • Determines probability estimation capability
  • Influences convergence behavior
  • Changes outlier handling
Hinge ~

Maximum-margin classification loss:

Characteristics:

  • Creates SVM-like classifier
  • No probability estimates
  • Sharp decision boundary
  • Good generalization

Best for:

  • Binary classification
  • When clear separation desired
  • Low-noise datasets
SquaredHinge ~

Squared hinge loss:

Characteristics:

  • Stronger penalty for violations
  • Differentiable everywhere
  • More sensitive to outliers

Best for:

  • Smoother optimization
  • When stronger penalties needed
  • Clean datasets
ModifiedHuber ~

Smooth variant of hinge loss with better outlier tolerance:

Characteristics:

  • Probability estimates
  • Robust to outliers
  • Smooth optimization

Best for:

  • Noisy datasets
  • When probabilities needed
  • Robust classification
LogLoss ~

Logistic regression loss:

Characteristics:

  • Natural probability estimates
  • Smooth gradients
  • Well-calibrated predictions

Best for:

  • Probability estimation
  • Risk assessment
  • Multiclass problems
  • When calibrated probabilities needed
Perceptron ~

Linear loss used by perceptron:

Characteristics:

  • Simple update rule
  • No hyperparameters
  • Online learning

Best for:

  • Simple binary problems
  • Online learning
  • Quick prototyping
Huber ~

Huber loss combines squared and absolute loss:

Characteristics:

  • Robust to outliers
  • Smooth transition
  • Configurable sensitivity

Best for:

  • Noisy data
  • Robust classification
  • When outliers present
EpsilonInsensitive ~

Epsilon-insensitive loss:

Characteristics:

  • Ignores small errors
  • SVR-like behavior
  • Sparse solution

Best for:

  • Regression-like classification
  • When small errors acceptable
  • Feature selection
SquaredEpsilonInsensitive ~

Squared epsilon-insensitive loss:

Characteristics:

  • Smooth loss function
  • Quadratic penalty beyond epsilon
  • More sensitive to large errors

Best for:

  • Smooth optimization
  • When gradual error penalties needed
  • Regression-like classification
SquaredError ~

Mean squared error loss:

Characteristics:

  • Quadratic error penalty
  • Differentiable everywhere
  • Sensitive to outliers

Best for:

  • Regression-like problems
  • Clean datasets
  • When smooth gradients needed
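One practical consequence of the loss choice: only LogLoss and ModifiedHuber yield probability estimates. A sketch in scikit-learn terms (the spelling "log_loss" assumes scikit-learn 1.1+; older releases use "log"):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=300, random_state=0)

# log_loss supports predict_proba; hinge only provides decision_function.
proba_clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
print(proba_clf.predict_proba(X[:3]))

margin_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
print(margin_clf.decision_function(X[:3]))  # signed margins, not probabilities
# Calling margin_clf.predict_proba would raise AttributeError for hinge loss.
```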
Penalty

enum
L2

The penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.

L2 ~

L2 penalty (Ridge):

Characteristics:

  • Squared magnitude penalty
  • Shrinks all weights toward zero
  • Handles correlated features well
  • Produces dense solutions

Best for:

  • Most classification tasks
  • When all features potentially relevant
  • Dealing with multicollinearity
  • Stable solutions needed
L1 ~

L1 penalty (Lasso):

Characteristics:

  • Absolute magnitude penalty
  • Produces sparse solutions
  • Feature selection capability
  • Path algorithms possible

Best for:

  • Feature selection
  • High-dimensional data
  • When sparse solutions desired
  • Eliminating irrelevant features
ElasticNet ~

ElasticNet penalty:

Characteristics:

  • Combines L1 and L2 penalties
  • Controls sparsity via ratio
  • Group selection capability
  • More stable than pure L1

Best for:

  • Correlated features
  • When both sparsity and stability needed
  • Group feature selection
  • Balanced regularization
None ~

No regularization penalty applied:

Characteristics:

  • Uncontrolled model complexity
  • Maximum flexibility
  • Risk of overfitting
  • Full parameter range

Best for:

  • Very small datasets
  • Theoretical analysis
  • When bias undesirable
  • Testing/debugging purposes

Warning: Use with caution as it may lead to overfitting
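To observe the sparsity behavior described above, one can count zeroed coefficients under each penalty. A sketch (exact counts depend on the data and on alpha):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

for penalty in ("l2", "l1", "elasticnet"):
    clf = SGDClassifier(penalty=penalty, l1_ratio=0.15, alpha=1e-3,
                        random_state=0).fit(X, y)
    n_zero = np.sum(clf.coef_ == 0)
    print(penalty, "zero coefficients:", n_zero)  # l1/elasticnet tend to zero more
```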

Alpha

f64
0.0001

Regularization strength multiplier:

Effects:

  • Larger values: Stronger regularization
  • Smaller values: More flexible model

Typical ranges:

  • 1e-4: Default, good starting point
  • 1e-5 to 1e-3: Common range
  • >1e-3: Strong regularization

Note: Critical for preventing overfitting

L1Ratio

f64
0.15

ElasticNet mixing parameter:

Behavior:

  • 0.0: Pure L2 penalty
  • 1.0: Pure L1 penalty
  • Between: Mixed penalty

Guidelines:

  • 0.15: Default, balanced mix
  • <0.5: Favor stability
  • >0.5: Favor sparsity

Only used with ElasticNet penalty

FitIntercept

bool
true

Whether to calculate the intercept term:

Impact:

  • true: Model learns bias term (recommended)
  • false: Assumes centered data

Set false only when:

  • Data is pre-centered
  • Zero-intercept desired
  • Testing theoretical properties
MaxIter

u64
1000

Maximum number of training epochs:

Guidelines:

  • 1000: Default, suitable for most cases
  • <1000: Simple problems, quick results
  • >1000: Complex problems, better convergence

Consider increasing if:

  • Model not converging
  • Complex decision boundaries needed
  • High precision required
Tolerance

f64
0.001

Convergence criterion threshold:

Stopping rule:

  • Stops when loss > best_loss - tolerance for n_iter_no_change consecutive epochs

Typical values:

  • 1e-3: Default balance
  • 1e-4: Higher precision
  • 1e-2: Faster convergence

Trade-off: Precision vs. Speed

LearningRate

enum
Optimal

Learning rate schedule controlling parameter update steps:

Schedule types:

  • Constant: Fixed step size (simple but may need tuning)
  • Optimal: Theoretical optimal rate for some loss functions
  • Invscaling: Gradually decreasing rate
  • Adaptive: Automatically adjusted based on training performance

Selection criteria:

  • Dataset size and noise level
  • Convergence stability needs
  • Training time constraints
  • Model performance requirements
Constant ~

Fixed learning rate:

Characteristics:

  • Simplest schedule
  • No decay over time
  • Requires careful tuning
  • Can be unstable

Best for:

  • Simple problems
  • Short training runs
  • When behavior well understood
  • Quick prototyping

Note: Initial rate (eta0) selection crucial

Optimal ~

Theoretical optimal schedule:

Characteristics:

  • Theoretically motivated
  • Automatic scaling with regularization
  • Proven convergence properties
  • Robust performance

Best for:

  • Convex optimization
  • When theoretical guarantees needed
  • Standard classification tasks
  • Production systems

Note: t0 chosen heuristically based on data

Invscaling ~

Inverse scaling schedule:

Characteristics:

  • Gradual rate decay
  • Configurable decay speed
  • Smooth learning transition
  • Predictable behavior

Best for:

  • General purpose learning
  • Long training runs
  • When gradual decay needed
  • Fine-tuning models

Note: power_t parameter controls decay speed

Adaptive ~

Adaptive learning rate with dynamic adjustments:

Behavior:

  • Starts with eta0
  • Monitors training progress
  • Reduces rate by factor of 5 when progress stalls
  • Adapts to problem difficulty

Best for:

  • Difficult optimization problems
  • Unknown optimal learning rates
  • Avoiding manual tuning
  • Production systems

Note: Uses n_iter_no_change and tolerance parameters
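A sketch comparing the four schedules in scikit-learn terms; eta0 is required for the constant, invscaling, and adaptive schedules, and the values shown are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

schedules = {
    "optimal":    dict(learning_rate="optimal"),
    "constant":   dict(learning_rate="constant", eta0=0.1),
    "invscaling": dict(learning_rate="invscaling", eta0=0.1, power_t=0.5),
    "adaptive":   dict(learning_rate="adaptive", eta0=0.1),
}
for name, kwargs in schedules.items():
    clf = SGDClassifier(max_iter=1000, random_state=0, **kwargs).fit(X, y)
    print(f"{name}: accuracy={clf.score(X, y):.3f}, epochs={clf.n_iter_}")
```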

true

Whether to shuffle training data after each epoch:

Benefits of true:

  • Better convergence
  • Prevents cyclical patterns
  • Reduces variance

Set false when:

  • Reproducibility critical
  • Order meaningful
  • Debugging needed
Epsilon

f64
0.1

Epsilon parameter for epsilon-sensitive losses:

Affects:

  • Huber loss
  • Epsilon-insensitive losses

Impact:

  • Controls error sensitivity
  • Defines insensitive region
  • Affects solution sparsity

Only relevant for specific losses

Random number generator seed:

Controls randomness in:

  • Data shuffling
  • Weight initialization
  • Sample selection

Fixed value ensures:

  • Reproducible results
  • Consistent behavior
  • Debugging capability

Eta0

f64
0.1

Initial learning rate:

Used in schedules:

  • Constant: Fixed value
  • Invscaling: Starting value
  • Adaptive: Initial rate

Typical ranges:

  • 0.1: Default starting point
  • 0.01-1.0: Common range
  • Adjust based on convergence
PowerT

f64
0.5

Power of learning rate decay for invscaling:

Schedule: eta = eta0 / t^power_t

Common values:

  • 0.5: Default, standard decay
  • <0.5: Slower decay
  • >0.5: Faster decay

Only used with invscaling schedule

false

Whether to use validation-based early stopping:

Benefits:

  • Prevents overfitting
  • Reduces training time
  • Automatic stopping

Requires:

  • Validation fraction
  • Patience setting
  • Tolerance threshold

Fraction of training data for validation:

Used when early_stopping=true

Typical values:

  • 0.1: Default (10% validation)
  • 0.2: More validation emphasis
  • 0.3: Large validation set

Trade-off: Training size vs. validation reliability

Early stopping patience parameter:

Stops training if no improvement in:

  • n_iter_no_change consecutive epochs

Values:

  • 5: Default patience
  • <5: Aggressive stopping
  • >5: More optimization chances

Used with early_stopping=true
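Putting the three early-stopping settings together, assuming scikit-learn's argument names (early_stopping, validation_fraction, n_iter_no_change):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

clf = SGDClassifier(
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # 10% used as validation set
    n_iter_no_change=5,        # stop after 5 epochs without improvement
    tol=1e-3,                  # minimum improvement to count as progress
    random_state=0,
).fit(X, y)
print("stopped after", clf.n_iter_, "epochs")
```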

ClassWeights

enum
Balanced

Class weight adjustment strategy for handling imbalanced datasets:

Mathematical form:

  • Balanced: w_c = n_samples / (n_classes * n_c), where n_c is the sample count of class c
  • Uniform: w_c = 1 for all classes

Impact on model:

  • Affects class importance during training
  • Influences decision boundary placement
  • Controls misclassification penalties
  • Balances precision vs recall trade-off

Selection criteria:

  • Class distribution in data
  • Cost of different error types
  • Business/domain requirements
  • Performance metrics priorities
Uniform ~

Equal weights for all classes:

Characteristics:

  • No adjustment for class frequencies
  • Natural class proportions preserved
  • Faster training process
  • Original data distribution maintained

Best for:

  • Balanced datasets (similar class frequencies)
  • When natural proportions matter
  • Representative sampling
  • When all errors equally costly

Warning: May underperform on imbalanced data

Balanced ~

Weights inversely proportional to class frequencies:

Characteristics:

  • Automatically adjusts for class imbalance
  • Higher weights for minority classes
  • Equalizes class importance
  • Helps rare class detection

Best for:

  • Imbalanced datasets
  • Rare event detection
  • Fraud detection
  • Medical diagnosis
  • Anomaly detection

Note: May increase variance in predictions
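A sketch on deliberately imbalanced data showing how balanced weighting lifts minority-class recall (class_weight=None corresponds to Uniform above; numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score

# 95/5 imbalance between the two classes.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

for cw in (None, "balanced"):
    clf = SGDClassifier(class_weight=cw, random_state=0).fit(X, y)
    print(cw, "minority recall:", recall_score(y, clf.predict(X), pos_label=1))
```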

false

When set to true, the solution of the previous call to fit is reused as initialization; otherwise, the previous solution is erased.

Effects when true:

  • Reuses previous weights
  • Continues training
  • Faster convergence

Useful for:

  • Incremental learning
  • Transfer learning
  • Parameter search
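Two related continuation mechanisms, sketched in scikit-learn terms: warm_start resumes across fit calls, while partial_fit consumes data batch by batch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

# warm_start: a second fit() continues from the previous solution.
clf = SGDClassifier(warm_start=True, max_iter=5, tol=None, random_state=0)
clf.fit(X, y)
clf.fit(X, y)  # resumes instead of re-initializing weights

# partial_fit: true online learning over mini-batches.
online = SGDClassifier(random_state=0)
for batch in np.array_split(np.arange(len(X)), 10):
    online.partial_fit(X[batch], y[batch], classes=np.unique(y))
```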

Exhaustive hyperparameter optimization through grid search:

Search process:

  • Tests all parameter combinations
  • Uses cross-validation for evaluation
  • Selects best performing configuration
  • Optimizes for specified metric

Key parameters to tune:

  • Loss function and penalty type
  • Learning rate schedule
  • Regularization strength (alpha)
  • Class weights for imbalanced data

Performance considerations:

  • Computational cost grows exponentially with parameters
  • Memory usage depends on data size
  • Consider validation strategy carefully
  • Balance search space vs. computation time
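A minimal grid-search sketch over the parameters named above (the grid values are illustrative choices, not this node's defaults):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1_000, random_state=0)

param_grid = {
    "loss": ["hinge", "log_loss", "modified_huber"],  # "log_loss" needs sklearn 1.1+
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-5, 1e-4, 1e-3],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```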

Loss

[enum, ...]
Hinge

Loss function determining how the model penalizes misclassifications:

Selection guide:

  • Hinge: For maximum-margin linear classification (SVM-like)
  • LogLoss: For probabilistic predictions
  • ModifiedHuber: For robust classification with outliers
  • Perceptron: For simple binary classification
  • Huber/Epsilon variants: For robust regression-like behavior

Impact on learning:

  • Affects model's sensitivity to errors
  • Determines probability estimation capability
  • Influences convergence behavior
  • Changes outlier handling
Hinge ~

Maximum-margin classification loss:

Characteristics:

  • Creates SVM-like classifier
  • No probability estimates
  • Sharp decision boundary
  • Good generalization

Best for:

  • Binary classification
  • When clear separation desired
  • Low-noise datasets
SquaredHinge ~

Squared hinge loss:

Characteristics:

  • Stronger penalty for violations
  • Differentiable everywhere
  • More sensitive to outliers

Best for:

  • Smoother optimization
  • When stronger penalties needed
  • Clean datasets
ModifiedHuber ~

Smooth variant of hinge loss with better outlier tolerance:

Characteristics:

  • Probability estimates
  • Robust to outliers
  • Smooth optimization

Best for:

  • Noisy datasets
  • When probabilities needed
  • Robust classification
LogLoss ~

Logistic regression loss:

Characteristics:

  • Natural probability estimates
  • Smooth gradients
  • Well-calibrated predictions

Best for:

  • Probability estimation
  • Risk assessment
  • Multiclass problems
  • When calibrated probabilities needed
Perceptron ~

Linear loss used by perceptron:

Characteristics:

  • Simple update rule
  • No hyperparameters
  • Online learning

Best for:

  • Simple binary problems
  • Online learning
  • Quick prototyping
Huber ~

Huber loss combines squared and absolute loss:

Characteristics:

  • Robust to outliers
  • Smooth transition
  • Configurable sensitivity

Best for:

  • Noisy data
  • Robust classification
  • When outliers present
EpsilonInsensitive ~

Epsilon-insensitive loss:

Characteristics:

  • Ignores small errors
  • SVR-like behavior
  • Sparse solution

Best for:

  • Regression-like classification
  • When small errors acceptable
  • Feature selection
SquaredEpsilonInsensitive ~

Squared epsilon-insensitive loss:

Characteristics:

  • Smooth loss function
  • Quadratic penalty beyond epsilon
  • More sensitive to large errors

Best for:

  • Smooth optimization
  • When gradual error penalties needed
  • Regression-like classification
SquaredError ~

Mean squared error loss:

Characteristics:

  • Quadratic error penalty
  • Differentiable everywhere
  • Sensitive to outliers

Best for:

  • Regression-like problems
  • Clean datasets
  • When smooth gradients needed

Penalty

[enum, ...]
L2

The penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.

L2 ~

L2 penalty (Ridge):

Characteristics:

  • Squared magnitude penalty
  • Shrinks all weights toward zero
  • Handles correlated features well
  • Produces dense solutions

Best for:

  • Most classification tasks
  • When all features potentially relevant
  • Dealing with multicollinearity
  • Stable solutions needed
L1 ~

L1 penalty (Lasso):

Characteristics:

  • Absolute magnitude penalty
  • Produces sparse solutions
  • Feature selection capability
  • Path algorithms possible

Best for:

  • Feature selection
  • High-dimensional data
  • When sparse solutions desired
  • Eliminating irrelevant features
ElasticNet ~

ElasticNet penalty:

Characteristics:

  • Combines L1 and L2 penalties
  • Controls sparsity via ratio
  • Group selection capability
  • More stable than pure L1

Best for:

  • Correlated features
  • When both sparsity and stability needed
  • Group feature selection
  • Balanced regularization
None ~

No regularization penalty applied:

Characteristics:

  • Uncontrolled model complexity
  • Maximum flexibility
  • Risk of overfitting
  • Full parameter range

Best for:

  • Very small datasets
  • Theoretical analysis
  • When bias undesirable
  • Testing/debugging purposes

Warning: Use with caution as it may lead to overfitting

Alpha

[f64, ...]
0.0001

Regularization strength values to test:

Recommended ranges:

  • Coarse: [1e-4, 1e-3, 1e-2, 1e-1]
  • Fine: [5e-5, 1e-4, 5e-4, 1e-3]
  • Wide: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

Note: Log-scale spacing recommended
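One way to generate such log-spaced grids programmatically, as a small NumPy sketch:

```python
import numpy as np

coarse = np.logspace(-4, -1, num=4)  # [1e-4, 1e-3, 1e-2, 1e-1]
wide = np.logspace(-5, -1, num=5)    # [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
print(coarse, wide, sep="\n")
```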

L1Ratio

[f64, ...]
0.15

ElasticNet mixing parameter values:

Common ranges:

  • Basic: [0.15, 0.5, 0.85]
  • Detailed: [0.1, 0.3, 0.5, 0.7, 0.9]
  • Complete: [0.0, 0.25, 0.5, 0.75, 1.0]

Note: Only used with ElasticNet penalty

FitIntercept

[bool, ...]
true

Whether to fit intercept combinations:

Options:

  • [true]: Standard modeling (default)
  • [true, false]: Compare both options

Note: Usually keep [true] unless data centered

MaxIter

[u64, ...]
1000

Maximum iterations to test:

Typical ranges:

  • Standard: [500, 1000, 2000]
  • Extended: [1000, 2000, 5000]
  • Quick: [100, 500, 1000]

Include larger values if convergence issues

Tolerance

[f64, ...]
0.001

Convergence tolerance values:

Common ranges:

  • Standard: [1e-3, 1e-4, 1e-5]
  • Coarse: [1e-2, 1e-3, 1e-4]
  • Fine: [5e-4, 1e-4, 5e-5]

Trade-off: Precision vs computation time

LearningRate

[enum, ...]
Optimal

Learning rate schedule controlling parameter update steps:

Schedule types:

  • Constant: Fixed step size (simple but may need tuning)
  • Optimal: Theoretical optimal rate for some loss functions
  • Invscaling: Gradually decreasing rate
  • Adaptive: Automatically adjusted based on training performance

Selection criteria:

  • Dataset size and noise level
  • Convergence stability needs
  • Training time constraints
  • Model performance requirements
Constant ~

Fixed learning rate:

Characteristics:

  • Simplest schedule
  • No decay over time
  • Requires careful tuning
  • Can be unstable

Best for:

  • Simple problems
  • Short training runs
  • When behavior well understood
  • Quick prototyping

Note: Initial rate (eta0) selection crucial

Optimal ~

Theoretical optimal schedule:

Characteristics:

  • Theoretically motivated
  • Automatic scaling with regularization
  • Proven convergence properties
  • Robust performance

Best for:

  • Convex optimization
  • When theoretical guarantees needed
  • Standard classification tasks
  • Production systems

Note: t0 chosen heuristically based on data

Invscaling ~

Inverse scaling schedule:

Characteristics:

  • Gradual rate decay
  • Configurable decay speed
  • Smooth learning transition
  • Predictable behavior

Best for:

  • General purpose learning
  • Long training runs
  • When gradual decay needed
  • Fine-tuning models

Note: power_t parameter controls decay speed

Adaptive ~

Adaptive learning rate with dynamic adjustments:

Behavior:

  • Starts with eta0
  • Monitors training progress
  • Reduces rate by factor of 5 when progress stalls
  • Adapts to problem difficulty

Best for:

  • Difficult optimization problems
  • Unknown optimal learning rates
  • Avoiding manual tuning
  • Production systems

Note: Uses n_iter_no_change and tolerance parameters

Eta0

[f64, ...]
0.1

Initial learning rates to test:

Typical ranges:

  • Conservative: [0.01, 0.1, 0.5]
  • Aggressive: [0.1, 0.5, 1.0]
  • Wide: [0.01, 0.1, 0.5, 1.0]

Critical for Constant and Invscaling schedules

PowerT

[f64, ...]
0.5

Learning rate decay powers:

Common values:

  • Standard: [0.5] (default)
  • Range: [0.3, 0.5, 0.7]
  • Wide: [0.25, 0.5, 0.75]

Only relevant for Invscaling schedule

Epsilon

[f64, ...]
0.1

Epsilon values for sensitive losses:

Typical ranges:

  • Standard: [0.1, 0.2, 0.3]
  • Fine: [0.05, 0.1, 0.15]
  • Wide: [0.01, 0.1, 0.5]

Only for epsilon-sensitive loss functions

ClassWeights

[enum, ...]
Balanced

Class weight adjustment strategy for handling imbalanced datasets:

Mathematical form:

  • Balanced: w_c = n_samples / (n_classes * n_c), where n_c is the sample count of class c
  • Uniform: w_c = 1 for all classes

Impact on model:

  • Affects class importance during training
  • Influences decision boundary placement
  • Controls misclassification penalties
  • Balances precision vs recall trade-off

Selection criteria:

  • Class distribution in data
  • Cost of different error types
  • Business/domain requirements
  • Performance metrics priorities
Uniform ~

Equal weights for all classes:

Characteristics:

  • No adjustment for class frequencies
  • Natural class proportions preserved
  • Faster training process
  • Original data distribution maintained

Best for:

  • Balanced datasets (similar class frequencies)
  • When natural proportions matter
  • Representative sampling
  • When all errors equally costly

Warning: May underperform on imbalanced data

Balanced ~

Weights inversely proportional to class frequencies:

Characteristics:

  • Automatically adjusts for class imbalance
  • Higher weights for minority classes
  • Equalizes class importance
  • Helps rare class detection

Best for:

  • Imbalanced datasets
  • Rare event detection
  • Fraud detection
  • Medical diagnosis
  • Anomaly detection

Note: May increase variance in predictions

true

Data shuffling between epochs:

Effects:

  • true: Better convergence (recommended)
  • false: Deterministic order

Keep true unless order significant

Random seed for reproducibility:

Controls:

  • Cross-validation splits
  • Data shuffling
  • Model initialization

Fixed value ensures reproducible search

false

Early stopping usage:

When true:

  • Uses validation set
  • Stops on plateau
  • May speed up search

Consider true for large parameter spaces

Validation set size for early stopping:

Typical values:

  • 0.1: Standard (10%)
  • 0.2: Larger validation
  • 0.15: Balanced split

Only used if early_stopping=true

Early stopping patience:

Values:

  • 5: Default patience
  • <5: Aggressive stopping
  • >5: More chances to improve

Affects computation time significantly

Accuracy

Metric for evaluating model performance during training and validation:

Selection criteria:

  • Default: Uses model's built-in scoring method
  • Accuracy: For balanced datasets, overall correctness
  • BalancedAccuracy: For imbalanced datasets, class-weighted

Use cases:

  • Default: When standard metrics suffice
  • Accuracy: When all classes equally important
  • BalancedAccuracy: When minority classes critical

Impact on model selection:

  • Guides hyperparameter optimization
  • Affects final model choice
  • Influences cross-validation results
Default ~

Uses the estimator's built-in scoring method. Typically accuracy for classification tasks. Best when default metrics align with problem goals.

Accuracy ~

Ratio of correct predictions to total predictions.

Best for:

  • Balanced class distributions
  • When all misclassifications equally costly
  • Simple performance assessment
BalancedAccuracy ~

Average of recall obtained on each class.

Preferred when:

  • Classes are imbalanced
  • Minority class detection important
  • Equal class importance required
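In scikit-learn terms these options map onto the scoring argument; a sketch ("balanced_accuracy" is the standard sklearn scorer name):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
clf = SGDClassifier(random_state=0)

# scoring=None uses the estimator's default scorer (accuracy for classifiers).
print(cross_val_score(clf, X, y, cv=3).mean())
print(cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean())
print(cross_val_score(clf, X, y, cv=3, scoring="balanced_accuracy").mean())
```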

Split

oneof
DefaultSplit

Standard train-test split configuration optimized for general classification tasks.

Configuration:

  • Test size: 20% (0.2)
  • Random seed: 98
  • Shuffling: Enabled
  • Stratification: Based on target distribution

Advantages:

  • Preserves class distribution
  • Provides reliable validation
  • Suitable for most datasets

Best for:

  • Medium to large datasets
  • Independent observations
  • Initial model evaluation

Splitting uses the ShuffleSplit or StratifiedShuffleSplit strategy, depending on the stratified field. Note: if shuffle is false, then stratified must also be false.

Configurable train-test split parameters for specialized requirements. Allows fine-tuning of data division strategy for specific use cases or constraints.

Use cases:

  • Time series data
  • Grouped observations
  • Specific train/test ratios
  • Custom validation schemes

Random seed for reproducible splits. Ensures:

  • Consistent train/test sets
  • Reproducible experiments
  • Comparable model evaluations

Same seed guarantees identical splits across runs.

true

Data shuffling before splitting. Effects:

  • true: Randomizes order, better for i.i.d. data
  • false: Maintains order, important for time series

When to disable:

  • Time dependent data
  • Sequential patterns
  • Grouped observations
0.8

Proportion of data for training. Considerations:

  • Larger (e.g., 0.8-0.9): Better model learning
  • Smaller (e.g., 0.5-0.7): Better validation

Common splits:

  • 0.8: Standard (80/20 split)
  • 0.7: More validation emphasis
  • 0.9: More training emphasis
false

Maintain class distribution in splits. Important when:

  • Classes are imbalanced
  • Small classes present
  • Representative splits needed

Requirements:

  • Classification tasks only
  • Cannot use with shuffle=false
  • Sufficient samples per class
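The split described above corresponds roughly to the following scikit-learn call (a sketch; the values mirror the documented defaults):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,    # 80/20 split
    shuffle=True,
    stratify=y,        # preserve class proportions; requires shuffle=True
    random_state=98,   # the documented default seed
)
```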

Cv

oneof
DefaultCv

Standard cross-validation configuration using stratified 3-fold splitting.

Configuration:

  • Folds: 3
  • Method: StratifiedKFold
  • Stratification: Preserves class proportions

Advantages:

  • Balanced evaluation
  • Reasonable computation time
  • Good for medium-sized datasets

Limitations:

  • May be insufficient for small datasets
  • Higher variance than larger fold counts
  • May miss some data patterns

Configurable stratified k-fold cross-validation for specific validation requirements.

Features:

  • Adjustable fold count with NFolds determining the number of splits.
  • Stratified sampling
  • Preserved class distributions

Use cases:

  • Small datasets (more folds)
  • Large datasets (fewer folds)
  • Detailed model evaluation
  • Robust performance estimation
NFolds

u64
3

Number of cross-validation folds. Guidelines:

  • 3-5: Large datasets, faster training
  • 5-10: Standard choice, good balance
  • 10+: Small datasets, thorough evaluation

Trade-offs:

  • More folds: Better evaluation, slower training
  • Fewer folds: Faster training, higher variance

Must be at least 2.

K-fold cross-validation without stratification. Divides data into k consecutive folds for iterative validation.

Process:

  • Splits data into k equal parts
  • Each fold serves as validation once
  • Remaining k-1 folds form training set

Use cases:

  • Regression problems
  • Large, balanced datasets
  • When stratification unnecessary
  • Continuous target variables

Limitations:

  • May not preserve class distributions
  • Less suitable for imbalanced data
  • Can create biased splits with ordered data

Number of folds for cross-validation. Recommended values:

  • 5: Standard choice (default)
  • 3: Large datasets/quick evaluation
  • 10: Thorough evaluation/smaller datasets

Trade-offs:

  • Higher values: More thorough, computationally expensive
  • Lower values: Faster, potentially higher variance

Must be at least 2 for valid cross-validation.

Random seed for fold generation when shuffling. Important for:

  • Reproducible results
  • Consistent fold assignments
  • Benchmark comparisons
  • Debugging and validation

Set specific value for reproducibility across runs.

true

Whether to shuffle data before splitting into folds. Effects:

  • true: Randomized fold composition (recommended)
  • false: Sequential splitting

Enable when:

  • Data may have ordering
  • Better fold independence needed

Disable for:

  • Time series data
  • Ordered observations

Stratified K-fold cross-validation maintaining class proportions across folds.

Key features:

  • Preserves class distribution in each fold
  • Handles imbalanced datasets
  • Ensures representative splits

Best for:

  • Classification problems
  • Imbalanced class distributions
  • When class proportions matter

Requirements:

  • Classification tasks only
  • Sufficient samples per class
  • Categorical target variable

Number of stratified folds. Typical values:

  • 5: Standard for most cases
  • 3: Quick evaluation/large datasets
  • 10: Detailed evaluation/smaller datasets

Considerations:

  • Must allow sufficient samples per class per fold
  • Balance between stability and computation time
  • Consider smallest class size when choosing

Seed for reproducible stratified splits. Ensures:

  • Consistent fold assignments
  • Reproducible results
  • Comparable experiments
  • Systematic validation

Fixed seed guarantees identical stratified splits.

false

Data shuffling before stratified splitting. Impact:

  • true: Randomizes while maintaining stratification
  • false: Maintains data order within strata

Use cases:

  • true: Independent observations
  • false: Grouped or sequential data

Class proportions maintained regardless of setting.
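A sketch of stratified k-fold evaluation in scikit-learn terms:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.8, 0.2], random_state=0)

# shuffle=False keeps order within strata; use shuffle=True + random_state otherwise.
cv = StratifiedKFold(n_splits=5, shuffle=False)
scores = cross_val_score(SGDClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```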

Random permutation cross-validator with independent sampling.

Characteristics:

  • Random sampling for each split
  • Independent train/test sets
  • More flexible than K-fold
  • Can have overlapping test sets

Advantages:

  • Control over test size
  • Fresh splits each iteration
  • Good for large datasets

Limitations:

  • Some samples might never be tested
  • Others might be tested multiple times
  • No guarantee of complete coverage

Number of random splits to perform. Common values:

  • 5: Standard evaluation
  • 10: More thorough assessment
  • 3: Quick estimates

Trade-offs:

  • More splits: Better estimation, longer runtime
  • Fewer splits: Faster, less stable estimates

Balance between computation and stability.

Random seed for reproducible shuffling. Controls:

  • Split randomization
  • Sample selection
  • Result reproducibility

Important for:

  • Debugging
  • Comparative studies
  • Result verification
0.2

Proportion of samples for test set. Common ratios:

  • 0.2: Standard (80/20 split)
  • 0.25: More validation emphasis
  • 0.1: More training data

Considerations:

  • Dataset size
  • Model complexity
  • Validation requirements

It must be between 0.0 and 1.0.

Stratified random permutation cross-validator combining shuffle-split with stratification.

Features:

  • Maintains class proportions
  • Random sampling within strata
  • Independent splits
  • Flexible test size

Ideal for:

  • Imbalanced datasets
  • Large-scale problems
  • When class distributions matter
  • Flexible validation schemes

Number of stratified random splits. Recommended values:

  • 5: Standard evaluation
  • 10: Detailed analysis
  • 3: Quick assessment

Consider:

  • Sample size per class
  • Computational resources
  • Stability requirements

Seed for reproducible stratified sampling. Ensures:

  • Consistent class proportions
  • Reproducible splits
  • Comparable experiments

Critical for:

  • Benchmarking
  • Research studies
  • Quality assurance
0.2

Fraction of samples for stratified test set. Common splits:

  • 0.2: Balanced evaluation
  • 0.3: More thorough testing
  • 0.15: Preserve training size

Consider:

  • Minority class size
  • Overall dataset size
  • Validation objectives

It must be between 0.0 and 1.0.

Time Series cross-validator. Provides train/test indices to split time-series samples that are observed at fixed intervals. It is a variation of k-fold: the first k folds serve as the training set and the (k+1)-th fold as the test set. Unlike standard cross-validation methods, successive training sets are supersets of those that come before them; surplus data is added to the first training partition, which is always used to train the model. Key features:

  • Maintains temporal dependence
  • Expanding window approach
  • Forward-chaining splits
  • No future data leakage

Use cases:

  • Sequential data
  • Financial forecasting
  • Temporal predictions
  • Time-dependent patterns

Note: Training sets are supersets of previous iterations.

Number of temporal splits. Typical values:

  • 5: Standard forward chaining
  • 3: Limited historical data
  • 10: Long time series

Impact:

  • Affects training window growth
  • Determines validation points
  • Influences computational load

Maximum size of training set. Should be strictly less than the number of samples. Applications:

  • 0: Use all available past data
  • >0: Rolling window of fixed size

Use cases:

  • Limit historical relevance
  • Control computational cost
  • Handle concept drift
  • Memory constraints

Number of samples in each test set. When 0:

  • Auto-calculated as n_samples/(n_splits+1)
  • Ensures equal-sized test sets

Considerations:

  • Forecast horizon
  • Validation requirements
  • Available future data

Gap

u64
0

Number of samples to exclude from the end of each train set before the test set (the gap between train and test sets). Uses:

  • Avoid data leakage
  • Model forecast lag
  • Buffer periods

Common scenarios:

  • 0: Continuous prediction
  • >0: Forward gap for realistic evaluation
  • Match business forecasting needs
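A sketch of the corresponding scikit-learn splitter; the gap and test_size arguments assume scikit-learn 0.24 or newer:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3, gap=2)  # 2-sample buffer before each test set
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)  # expanding training windows
```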