SGD Classifier Layer

Stochastic Gradient Descent (SGD) Classifier - A versatile linear classification algorithm that learns incrementally using gradient descent optimization.

Mathematical form:

  w_{t+1} = w_t - η_t ∇L(w_t; x_i, y_i)

where:

  • w_t is the model weights at step t
  • η_t is the learning rate at step t
  • L is the loss function
  • (x_i, y_i) is a single training example

Key characteristics:

  • Efficient for large-scale learning
  • Online/incremental learning capability
  • Multiple loss function options
  • Flexible regularization schemes
  • Adaptive learning rates

Common applications:

  • Text classification
  • Large-scale document categorization
  • Real-time classification tasks
  • Stream data processing
  • High-dimensional sparse data

Outputs:

  1. Predicted Table: Input data with prediction columns
  2. Validation Results: Cross-validation metrics
  3. Test Metric: Test set performance metrics
  4. ROC Curve Data: ROC analysis information
  5. Confusion Matrix: Detailed classification results
  6. Feature Importances: Feature coefficients and importance

Note: Particularly effective for large datasets where memory efficiency is crucial.
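For orientation, the sketch below shows roughly equivalent usage of scikit-learn's SGDClassifier, which this layer mirrors; the synthetic dataset and parameter values are illustrative assumptions, not the layer's internals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative synthetic data; any numeric feature matrix works.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Defaults shown explicitly: hinge loss + L2 penalty gives a linear-SVM-like model.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class labels
print(clf.score(X, y))     # mean accuracy on the training data
```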

Inputs:

  • 0: Table (input data)

Outputs:

  • 0: Predicted Table
  • 1: Validation Results
  • 2: Test Metric
  • 3: ROC Curve Data
  • 4: Confusion Matrix
  • 5: Feature Importances

SelectFeatures

[column, ...]

Feature columns for SGD classification:

Selection guidelines:

  • Numerical features preferred
  • Standardized/scaled values recommended
  • Handle missing values beforehand
  • Consider feature interactions

Preprocessing tips:

  • Scale to zero mean and unit variance
  • Remove or encode categorical features
  • Handle outliers appropriately
  • Consider dimensionality reduction

If empty, uses all numeric columns except target.
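Because SGD is sensitive to feature scale, scaling is worth wiring in before the classifier. A minimal sketch, assuming a scikit-learn pipeline stands in for this layer's preprocessing:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardizing to zero mean / unit variance typically helps SGD converge.
model = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))
model.fit(X, y)
```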

Target column for classification:

Requirements:

  • Categorical labels
  • No missing values
  • At least two classes
  • Properly encoded

Preprocessing:

  • Encode categorical labels
  • Check class balance
  • Consider label noise
  • Verify label consistency
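A quick pre-flight check of these requirements, sketched with pandas (the y series is a stand-in for your target column):

```python
import pandas as pd

y = pd.Series(["spam", "ham", "ham", "spam", "ham"])  # illustrative labels

assert y.notna().all(), "no missing labels allowed"
assert y.nunique() >= 2, "need at least two classes"
print(y.value_counts())  # inspect class balance before training
```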

Params

oneof
DefaultParams

Standard configuration optimized for general classification tasks:

Default settings:

  • Hinge loss (SVM-like behavior)
  • L2 regularization (prevent overfitting)
  • Optimal learning rate schedule
  • Balanced class weights
  • Shuffled training data

Best suited for:

  • Initial model exploration
  • Medium-sized datasets
  • When quick results needed
  • Standard classification tasks

Note: Provides good baseline performance for most applications.
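Assuming these defaults map onto scikit-learn's argument names (an assumption, not a guarantee about this layer's internals), the equivalent constructor call would look roughly like:

```python
from sklearn.linear_model import SGDClassifier

# Presumed sklearn equivalent of DefaultParams; names are sklearn's, not this layer's.
clf = SGDClassifier(
    loss="hinge",              # SVM-like behavior
    penalty="l2",              # standard regularizer, prevents overfitting
    learning_rate="optimal",   # theoretically motivated schedule
    class_weight="balanced",   # reweight classes by inverse frequency
    shuffle=True,              # reshuffle training data each epoch
)
```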

Fine-grained control over SGD Classifier parameters for model optimization:

Parameter categories:

  1. Model Architecture:

    • Loss function selection
    • Penalty type and strength
    • Intercept fitting
  2. Optimization Control:

    • Learning rate schedule
    • Convergence criteria
    • Early stopping settings
  3. Training Behavior:

    • Class weight adjustment
    • Shuffling options
    • Warm start capability

Use cases:

  • Performance optimization
  • Specific problem requirements
  • Complex dataset handling
  • Production deployment tuning

Loss

enum
Hinge

Loss function determining how the model penalizes misclassifications:

Selection guide:

  • Hinge: For maximum-margin linear classification (SVM-like)
  • LogLoss: For probabilistic predictions
  • ModifiedHuber: For robust classification with outliers
  • Perceptron: For simple binary classification
  • Huber/Epsilon variants: For robust regression-like behavior

Impact on learning:

  • Affects model's sensitivity to errors
  • Determines probability estimation capability
  • Influences convergence behavior
  • Changes outlier handling
Hinge ~

Maximum-margin classification loss:

Characteristics:

  • Creates SVM-like classifier
  • No probability estimates
  • Sharp decision boundary
  • Good generalization

Best for:

  • Binary classification
  • When clear separation desired
  • Low-noise datasets
SquaredHinge ~

Squared hinge loss:

Characteristics:

  • Stronger penalty for violations
  • Differentiable everywhere
  • More sensitive to outliers

Best for:

  • Smoother optimization
  • When stronger penalties needed
  • Clean datasets
ModifiedHuber ~

Smooth variant of hinge loss with better outlier tolerance:

Characteristics:

  • Probability estimates
  • Robust to outliers
  • Smooth optimization

Best for:

  • Noisy datasets
  • When probabilities needed
  • Robust classification
LogLoss ~

Logistic regression loss:

Characteristics:

  • Natural probability estimates
  • Smooth gradients
  • Well-calibrated predictions

Best for:

  • Probability estimation
  • Risk assessment
  • Multiclass problems
  • When calibrated probabilities needed
Perceptron ~

Linear loss used by perceptron:

Characteristics:

  • Simple update rule
  • No hyperparameters
  • Online learning

Best for:

  • Simple binary problems
  • Online learning
  • Quick prototyping
Huber ~

Huber loss combines squared and absolute loss:

Characteristics:

  • Robust to outliers
  • Smooth transition
  • Configurable sensitivity

Best for:

  • Noisy data
  • Robust classification
  • When outliers present
EpsilonInsensitive ~

Epsilon-insensitive loss:

Characteristics:

  • Ignores small errors
  • SVR-like behavior
  • Sparse solution

Best for:

  • Regression-like classification
  • When small errors acceptable
  • Feature selection
SquaredEpsilonInsensitive ~

Squared epsilon-insensitive loss:

Characteristics:

  • Smooth loss function
  • Quadratic penalty beyond epsilon
  • More sensitive to large errors

Best for:

  • Smooth optimization
  • When gradual error penalties needed
  • Regression-like classification
SquaredError ~

Mean squared error loss:

Characteristics:

  • Quadratic error penalty
  • Differentiable everywhere
  • Sensitive to outliers

Best for:

  • Regression-like problems
  • Clean datasets
  • When smooth gradients needed
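One practical consequence of the loss choice: only LogLoss and ModifiedHuber yield probability estimates. A sketch in scikit-learn terms (the spelling "log_loss" assumes scikit-learn 1.1+; older releases use "log"):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=300, random_state=0)

# log_loss supports predict_proba; hinge only provides decision_function.
proba_clf = SGDClassifier(loss="log_loss", random_state=0).fit(X, y)
print(proba_clf.predict_proba(X[:3]))

margin_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
print(margin_clf.decision_function(X[:3]))  # signed margins, not probabilities
# Calling margin_clf.predict_proba would raise AttributeError for hinge loss.
```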
Penalty

enum
L2

The penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.

L2 ~

L2 penalty (Ridge):

Characteristics:

  • Squared magnitude penalty
  • Shrinks all weights toward zero
  • Handles correlated features well
  • Produces dense solutions

Best for:

  • Most classification tasks
  • When all features potentially relevant
  • Dealing with multicollinearity
  • Stable solutions needed
L1 ~

L1 penalty (Lasso):

Characteristics:

  • Absolute magnitude penalty
  • Produces sparse solutions
  • Feature selection capability
  • Path algorithms possible

Best for:

  • Feature selection
  • High-dimensional data
  • When sparse solutions desired
  • Eliminating irrelevant features
ElasticNet ~

ElasticNet penalty:

Characteristics:

  • Combines L1 and L2 penalties
  • Controls sparsity via ratio
  • Group selection capability
  • More stable than pure L1

Best for:

  • Correlated features
  • When both sparsity and stability needed
  • Group feature selection
  • Balanced regularization
None ~

No regularization penalty applied:

Characteristics:

  • Uncontrolled model complexity
  • Maximum flexibility
  • Risk of overfitting
  • Full parameter range

Best for:

  • Very small datasets
  • Theoretical analysis
  • When bias undesirable
  • Testing/debugging purposes

Warning: Use with caution as it may lead to overfitting
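To observe the sparsity behavior described above, one can count zeroed coefficients under each penalty. A sketch (exact counts depend on the data and on alpha):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

for penalty in ("l2", "l1", "elasticnet"):
    clf = SGDClassifier(penalty=penalty, l1_ratio=0.15, alpha=1e-3,
                        random_state=0).fit(X, y)
    n_zero = np.sum(clf.coef_ == 0)
    print(penalty, "zero coefficients:", n_zero)  # l1/elasticnet tend to zero more
```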

Alpha

f64
0.0001

Regularization strength multiplier:

Effects:

  • Larger values: Stronger regularization
  • Smaller values: More flexible model

Typical ranges:

  • 1e-4: Default, good starting point
  • 1e-5 to 1e-3: Common range
  • >1e-3: Strong regularization

Note: Critical for preventing overfitting

L1Ratio

f64
0.15

ElasticNet mixing parameter:

Behavior:

  • 0.0: Pure L2 penalty
  • 1.0: Pure L1 penalty
  • Between: Mixed penalty

Guidelines:

  • 0.15: Default, balanced mix
  • <0.5: Favor stability
  • >0.5: Favor sparsity

Only used with ElasticNet penalty

FitIntercept

bool
true

Whether to calculate the intercept term:

Impact:

  • true: Model learns bias term (recommended)
  • false: Assumes centered data

Set false only when:

  • Data is pre-centered
  • Zero-intercept desired
  • Testing theoretical properties
MaxIter

u64
1000

Maximum number of training epochs:

Guidelines:

  • 1000: Default, suitable for most cases
  • <1000: Simple problems, quick results
  • >1000: Complex problems, better convergence

Consider increasing if:

  • Model not converging
  • Complex decision boundaries needed
  • High precision required
Tolerance

f64
0.001

Convergence criterion threshold:

Stopping rule:

  • Stops when loss > best_loss - tolerance for n_iter_no_change consecutive epochs

Typical values:

  • 1e-3: Default balance
  • 1e-4: Higher precision
  • 1e-2: Faster convergence

Trade-off: Precision vs. Speed

LearningRate

enum
Optimal

Learning rate schedule controlling parameter update steps:

Schedule types:

  • Constant: Fixed step size (simple but may need tuning)
  • Optimal: Theoretical optimal rate for some loss functions
  • Invscaling: Gradually decreasing rate
  • Adaptive: Automatically adjusted based on training performance

Selection criteria:

  • Dataset size and noise level
  • Convergence stability needs
  • Training time constraints
  • Model performance requirements
Constant ~

Fixed learning rate:

Characteristics:

  • Simplest schedule
  • No decay over time
  • Requires careful tuning
  • Can be unstable

Best for:

  • Simple problems
  • Short training runs
  • When behavior well understood
  • Quick prototyping

Note: Initial rate (eta0) selection crucial

Optimal ~

Theoretical optimal schedule:

Characteristics:

  • Theoretically motivated
  • Automatic scaling with regularization
  • Proven convergence properties
  • Robust performance

Best for:

  • Convex optimization
  • When theoretical guarantees needed
  • Standard classification tasks
  • Production systems

Note: t0 chosen heuristically based on data

Invscaling ~

Inverse scaling schedule:

Characteristics:

  • Gradual rate decay
  • Configurable decay speed
  • Smooth learning transition
  • Predictable behavior

Best for:

  • General purpose learning
  • Long training runs
  • When gradual decay needed
  • Fine-tuning models

Note: power_t parameter controls decay speed

Adaptive ~

Adaptive learning rate with dynamic adjustments:

Behavior:

  • Starts with eta0
  • Monitors training progress
  • Reduces rate by factor of 5 when progress stalls
  • Adapts to problem difficulty

Best for:

  • Difficult optimization problems
  • Unknown optimal learning rates
  • Avoiding manual tuning
  • Production systems

Note: Uses n_iter_no_change and tolerance parameters
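A sketch comparing the four schedules in scikit-learn terms; eta0 is required for the constant, invscaling, and adaptive schedules, and the values shown are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, random_state=0)

schedules = {
    "optimal":    dict(learning_rate="optimal"),
    "constant":   dict(learning_rate="constant", eta0=0.1),
    "invscaling": dict(learning_rate="invscaling", eta0=0.1, power_t=0.5),
    "adaptive":   dict(learning_rate="adaptive", eta0=0.1),
}
for name, kwargs in schedules.items():
    clf = SGDClassifier(max_iter=1000, random_state=0, **kwargs).fit(X, y)
    print(f"{name}: accuracy={clf.score(X, y):.3f}, epochs={clf.n_iter_}")
```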

true

Whether to shuffle training data after each epoch:

Benefits of true:

  • Better convergence
  • Prevents cyclical patterns
  • Reduces variance

Set false when:

  • Reproducibility critical
  • Order meaningful
  • Debugging needed
Epsilon

f64
0.1

Epsilon parameter for epsilon-sensitive losses:

Affects:

  • Huber loss
  • Epsilon-insensitive losses

Impact:

  • Controls error sensitivity
  • Defines insensitive region
  • Affects solution sparsity

Only relevant for specific losses

Random number generator seed:

Controls randomness in:

  • Data shuffling
  • Weight initialization
  • Sample selection

Fixed value ensures:

  • Reproducible results
  • Consistent behavior
  • Debugging capability

Eta0

f64
0.1

Initial learning rate:

Used in schedules:

  • Constant: Fixed value
  • Invscaling: Starting value
  • Adaptive: Initial rate

Typical ranges:

  • 0.1: Default starting point
  • 0.01-1.0: Common range
  • Adjust based on convergence
PowerT

f64
0.5

Power of learning rate decay for invscaling:

Schedule: eta = eta0 / t^power_t

Common values:

  • 0.5: Default, standard decay
  • <0.5: Slower decay
  • >0.5: Faster decay

Only used with invscaling schedule

false

Whether to use validation-based early stopping:

Benefits:

  • Prevents overfitting
  • Reduces training time
  • Automatic stopping

Requires:

  • Validation fraction
  • Patience setting
  • Tolerance threshold

Fraction of training data for validation:

Used when early_stopping=true

Typical values:

  • 0.1: Default (10% validation)
  • 0.2: More validation emphasis
  • 0.3: Large validation set

Trade-off: Training size vs. validation reliability

Early stopping patience parameter:

Stops training if no improvement in:

  • n_iter_no_change consecutive epochs

Values:

  • 5: Default patience
  • <5: Aggressive stopping
  • >5: More optimization chances

Used with early_stopping=true
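Putting the three early-stopping settings together, assuming scikit-learn's argument names (early_stopping, validation_fraction, n_iter_no_change):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2_000, random_state=0)

clf = SGDClassifier(
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # 10% used as validation set
    n_iter_no_change=5,        # stop after 5 epochs without improvement
    tol=1e-3,                  # minimum improvement to count as progress
    random_state=0,
).fit(X, y)
print("stopped after", clf.n_iter_, "epochs")
```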

ClassWeights

enum
Balanced

Class weight adjustment strategy for handling imbalanced datasets:

Mathematical form:

  • Balanced: w_c = n_samples / (n_classes * n_c), where n_c is the sample count of class c
  • Uniform: w_c = 1 for all classes

Impact on model:

  • Affects class importance during training
  • Influences decision boundary placement
  • Controls misclassification penalties
  • Balances precision vs recall trade-off

Selection criteria:

  • Class distribution in data
  • Cost of different error types
  • Business/domain requirements
  • Performance metrics priorities
Uniform ~

Equal weights for all classes:

Characteristics:

  • No adjustment for class frequencies
  • Natural class proportions preserved
  • Faster training process
  • Original data distribution maintained

Best for:

  • Balanced datasets (similar class frequencies)
  • When natural proportions matter
  • Representative sampling
  • When all errors equally costly

Warning: May underperform on imbalanced data

Balanced ~

Weights inversely proportional to class frequencies:

Characteristics:

  • Automatically adjusts for class imbalance
  • Higher weights for minority classes
  • Equalizes class importance
  • Helps rare class detection

Best for:

  • Imbalanced datasets
  • Rare event detection
  • Fraud detection
  • Medical diagnosis
  • Anomaly detection

Note: May increase variance in predictions
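A sketch on deliberately imbalanced data showing how balanced weighting lifts minority-class recall (class_weight=None corresponds to Uniform above; numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score

# 95/5 imbalance between the two classes.
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

for cw in (None, "balanced"):
    clf = SGDClassifier(class_weight=cw, random_state=0).fit(X, y)
    print(cw, "minority recall:", recall_score(y, clf.predict(X), pos_label=1))
```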

false

When set to true, the solution of the previous call to fit is reused as initialization; otherwise, the previous solution is erased.

Effects when true:

  • Reuses previous weights
  • Continues training
  • Faster convergence

Useful for:

  • Incremental learning
  • Transfer learning
  • Parameter search
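Two related continuation mechanisms, sketched in scikit-learn terms: warm_start resumes across fit calls, while partial_fit consumes data batch by batch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1_000, random_state=0)

# warm_start: a second fit() continues from the previous solution.
clf = SGDClassifier(warm_start=True, max_iter=5, tol=None, random_state=0)
clf.fit(X, y)
clf.fit(X, y)  # resumes instead of re-initializing weights

# partial_fit: true online learning over mini-batches.
online = SGDClassifier(random_state=0)
for batch in np.array_split(np.arange(len(X)), 10):
    online.partial_fit(X[batch], y[batch], classes=np.unique(y))
```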

Exhaustive hyperparameter optimization through grid search:

Search process:

  • Tests all parameter combinations
  • Uses cross-validation for evaluation
  • Selects best performing configuration
  • Optimizes for specified metric

Key parameters to tune:

  • Loss function and penalty type
  • Learning rate schedule
  • Regularization strength (alpha)
  • Class weights for imbalanced data

Performance considerations:

  • Computational cost grows exponentially with parameters
  • Memory usage depends on data size
  • Consider validation strategy carefully
  • Balance search space vs. computation time
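A minimal grid-search sketch over the parameters named above (the grid values are illustrative choices, not this node's defaults):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1_000, random_state=0)

param_grid = {
    "loss": ["hinge", "log_loss", "modified_huber"],  # "log_loss" needs sklearn 1.1+
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": [1e-5, 1e-4, 1e-3],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```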

Loss

[enum, ...]
Hinge

Loss function determining how the model penalizes misclassifications:

Selection guide:

  • Hinge: For maximum-margin linear classification (SVM-like)
  • LogLoss: For probabilistic predictions
  • ModifiedHuber: For robust classification with outliers
  • Perceptron: For simple binary classification
  • Huber/Epsilon variants: For robust regression-like behavior

Impact on learning:

  • Affects model's sensitivity to errors
  • Determines probability estimation capability
  • Influences convergence behavior
  • Changes outlier handling
Hinge ~

Maximum-margin classification loss:

Characteristics:

  • Creates SVM-like classifier
  • No probability estimates
  • Sharp decision boundary
  • Good generalization

Best for:

  • Binary classification
  • When clear separation desired
  • Low-noise datasets
SquaredHinge ~

Squared hinge loss:

Characteristics:

  • Stronger penalty for violations
  • Differentiable everywhere
  • More sensitive to outliers

Best for:

  • Smoother optimization
  • When stronger penalties needed
  • Clean datasets
ModifiedHuber ~

Smooth variant of hinge loss with better outlier tolerance:

Characteristics:

  • Probability estimates
  • Robust to outliers
  • Smooth optimization

Best for:

  • Noisy datasets
  • When probabilities needed
  • Robust classification
LogLoss ~

Logistic regression loss:

Characteristics:

  • Natural probability estimates
  • Smooth gradients
  • Well-calibrated predictions

Best for:

  • Probability estimation
  • Risk assessment
  • Multiclass problems
  • When calibrated probabilities needed
Perceptron ~

Linear loss used by perceptron:

Characteristics:

  • Simple update rule
  • No hyperparameters
  • Online learning

Best for:

  • Simple binary problems
  • Online learning
  • Quick prototyping
Huber ~

Huber loss combines squared and absolute loss:

Characteristics:

  • Robust to outliers
  • Smooth transition
  • Configurable sensitivity

Best for:

  • Noisy data
  • Robust classification
  • When outliers present
EpsilonInsensitive ~

Epsilon-insensitive loss:

Characteristics:

  • Ignores small errors
  • SVR-like behavior
  • Sparse solution

Best for:

  • Regression-like classification
  • When small errors acceptable
  • Feature selection
SquaredEpsilonInsensitive ~

Squared epsilon-insensitive loss:

Characteristics:

  • Smooth loss function
  • Quadratic penalty beyond epsilon
  • More sensitive to large errors

Best for:

  • Smooth optimization
  • When gradual error penalties needed
  • Regression-like classification
SquaredError ~

Mean squared error loss:

Characteristics:

  • Quadratic error penalty
  • Differentiable everywhere
  • Sensitive to outliers

Best for:

  • Regression-like problems
  • Clean datasets
  • When smooth gradients needed

Penalty

[enum, ...]
L2

The penalty (aka regularization term) to be used. Defaults to L2, which is the standard regularizer for linear SVM models. L1 and ElasticNet might bring sparsity to the model (feature selection) not achievable with L2. No penalty is added when set to None.

L2 ~

L2 penalty (Ridge):

Characteristics:

  • Squared magnitude penalty
  • Shrinks all weights toward zero
  • Handles correlated features well
  • Produces dense solutions

Best for:

  • Most classification tasks
  • When all features potentially relevant
  • Dealing with multicollinearity
  • Stable solutions needed
L1 ~

L1 penalty (Lasso):

Characteristics:

  • Absolute magnitude penalty
  • Produces sparse solutions
  • Feature selection capability
  • Path algorithms possible

Best for:

  • Feature selection
  • High-dimensional data
  • When sparse solutions desired
  • Eliminating irrelevant features
ElasticNet ~

ElasticNet penalty:

Characteristics:

  • Combines L1 and L2 penalties
  • Controls sparsity via ratio
  • Group selection capability
  • More stable than pure L1

Best for:

  • Correlated features
  • When both sparsity and stability needed
  • Group feature selection
  • Balanced regularization
None ~

No regularization penalty applied:

Characteristics:

  • Uncontrolled model complexity
  • Maximum flexibility
  • Risk of overfitting
  • Full parameter range

Best for:

  • Very small datasets
  • Theoretical analysis
  • When bias undesirable
  • Testing/debugging purposes

Warning: Use with caution as it may lead to overfitting

Alpha

[f64, ...]
0.0001

Regularization strength values to test:

Recommended ranges:

  • Coarse: [1e-4, 1e-3, 1e-2, 1e-1]
  • Fine: [5e-5, 1e-4, 5e-4, 1e-3]
  • Wide: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

Note: Log-scale spacing recommended
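One way to generate such log-spaced grids programmatically, as a small NumPy sketch:

```python
import numpy as np

coarse = np.logspace(-4, -1, num=4)  # [1e-4, 1e-3, 1e-2, 1e-1]
wide = np.logspace(-5, -1, num=5)    # [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
print(coarse, wide, sep="\n")
```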

L1Ratio

[f64, ...]
0.15

ElasticNet mixing parameter values:

Common ranges:

  • Basic: [0.15, 0.5, 0.85]
  • Detailed: [0.1, 0.3, 0.5, 0.7, 0.9]
  • Complete: [0.0, 0.25, 0.5, 0.75, 1.0]

Note: Only used with ElasticNet penalty

FitIntercept

[bool, ...]
true

Whether to fit intercept combinations:

Options:

  • [true]: Standard modeling (default)
  • [true, false]: Compare both options

Note: Usually keep [true] unless data centered

MaxIter

[u64, ...]
1000

Maximum iterations to test:

Typical ranges:

  • Standard: [500, 1000, 2000]
  • Extended: [1000, 2000, 5000]
  • Quick: [100, 500, 1000]

Include larger values if convergence issues

Tolerance

[f64, ...]
0.001

Convergence tolerance values:

Common ranges:

  • Standard: [1e-3, 1e-4, 1e-5]
  • Coarse: [1e-2, 1e-3, 1e-4]
  • Fine: [5e-4, 1e-4, 5e-5]

Trade-off: Precision vs computation time

LearningRate

[enum, ...]
Optimal

Learning rate schedule controlling parameter update steps:

Schedule types:

  • Constant: Fixed step size (simple but may need tuning)
  • Optimal: Theoretical optimal rate for some loss functions
  • Invscaling: Gradually decreasing rate
  • Adaptive: Automatically adjusted based on training performance

Selection criteria:

  • Dataset size and noise level
  • Convergence stability needs
  • Training time constraints
  • Model performance requirements
Constant ~

Fixed learning rate:

Characteristics:

  • Simplest schedule
  • No decay over time
  • Requires careful tuning
  • Can be unstable

Best for:

  • Simple problems
  • Short training runs
  • When behavior well understood
  • Quick prototyping

Note: Initial rate (eta0) selection crucial

Optimal ~

Theoretical optimal schedule:

Characteristics:

  • Theoretically motivated
  • Automatic scaling with regularization
  • Proven convergence properties
  • Robust performance

Best for:

  • Convex optimization
  • When theoretical guarantees needed
  • Standard classification tasks
  • Production systems

Note: t0 chosen heuristically based on data

Invscaling ~

Inverse scaling schedule:

Characteristics:

  • Gradual rate decay
  • Configurable decay speed
  • Smooth learning transition
  • Predictable behavior

Best for:

  • General purpose learning
  • Long training runs
  • When gradual decay needed
  • Fine-tuning models

Note: power_t parameter controls decay speed

Adaptive ~

Adaptive learning rate with dynamic adjustments:

Behavior:

  • Starts with eta0
  • Monitors training progress
  • Reduces rate by factor of 5 when progress stalls
  • Adapts to problem difficulty

Best for:

  • Difficult optimization problems
  • Unknown optimal learning rates
  • Avoiding manual tuning
  • Production systems

Note: Uses n_iter_no_change and tolerance parameters

Eta0

[f64, ...]
0.1

Initial learning rates to test:

Typical ranges:

  • Conservative: [0.01, 0.1, 0.5]
  • Aggressive: [0.1, 0.5, 1.0]
  • Wide: [0.01, 0.1, 0.5, 1.0]

Critical for Constant and Invscaling schedules

PowerT

[f64, ...]
0.5

Learning rate decay powers:

Common values:

  • Standard: [0.5] (default)
  • Range: [0.3, 0.5, 0.7]
  • Wide: [0.25, 0.5, 0.75]

Only relevant for Invscaling schedule

Epsilon

[f64, ...]
0.1

Epsilon values for sensitive losses:

Typical ranges:

  • Standard: [0.1, 0.2, 0.3]
  • Fine: [0.05, 0.1, 0.15]
  • Wide: [0.01, 0.1, 0.5]

Only for epsilon-sensitive loss functions

ClassWeights

[enum, ...]
Balanced

Class weight adjustment strategy for handling imbalanced datasets:

Mathematical form:

  • Balanced: w_c = n_samples / (n_classes * n_c), where n_c is the sample count of class c
  • Uniform: w_c = 1 for all classes

Impact on model:

  • Affects class importance during training
  • Influences decision boundary placement
  • Controls misclassification penalties
  • Balances precision vs recall trade-off

Selection criteria:

  • Class distribution in data
  • Cost of different error types
  • Business/domain requirements
  • Performance metrics priorities
Uniform ~

Equal weights for all classes:

Characteristics:

  • No adjustment for class frequencies
  • Natural class proportions preserved
  • Faster training process
  • Original data distribution maintained

Best for:

  • Balanced datasets (similar class frequencies)
  • When natural proportions matter
  • Representative sampling
  • When all errors equally costly

Warning: May underperform on imbalanced data

Balanced ~

Weights inversely proportional to class frequencies:

Characteristics:

  • Automatically adjusts for class imbalance
  • Higher weights for minority classes
  • Equalizes class importance
  • Helps rare class detection

Best for:

  • Imbalanced datasets
  • Rare event detection
  • Fraud detection
  • Medical diagnosis
  • Anomaly detection

Note: May increase variance in predictions

true

Data shuffling between epochs:

Effects:

  • true: Better convergence (recommended)
  • false: Deterministic order

Keep true unless order significant

Random seed for reproducibility:

Controls:

  • Cross-validation splits
  • Data shuffling
  • Model initialization

Fixed value ensures reproducible search

false

Early stopping usage:

When true:

  • Uses validation set
  • Stops on plateau
  • May speed up search

Consider true for large parameter spaces

Validation set size for early stopping:

Typical values:

  • 0.1: Standard (10%)
  • 0.2: Larger validation
  • 0.15: Balanced split

Only used if early_stopping=true

Early stopping patience:

Values:

  • 5: Default patience
  • <5: Aggressive stopping
  • >5: More chances to improve

Affects computation time significantly

Accuracy

Metric for evaluating model performance during training and validation:

Selection criteria:

  • Default: Uses model's built-in scoring method
  • Accuracy: For balanced datasets, overall correctness
  • BalancedAccuracy: For imbalanced datasets, class-weighted

Use cases:

  • Default: When standard metrics suffice
  • Accuracy: When all classes equally important
  • BalancedAccuracy: When minority classes critical

Impact on model selection:

  • Guides hyperparameter optimization
  • Affects final model choice
  • Influences cross-validation results
Default ~

Uses the estimator's built-in scoring method. Typically accuracy for classification tasks. Best when default metrics align with problem goals.

Accuracy ~

Ratio of correct predictions to total predictions.

Best for:

  • Balanced class distributions
  • When all misclassifications equally costly
  • Simple performance assessment
BalancedAccuracy ~

Average of recall obtained on each class.

Preferred when:

  • Classes are imbalanced
  • Minority class detection important
  • Equal class importance required
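In scikit-learn terms these options map onto the scoring argument; a sketch ("balanced_accuracy" is the standard sklearn scorer name):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
clf = SGDClassifier(random_state=0)

# scoring=None uses the estimator's default scorer (accuracy for classifiers).
print(cross_val_score(clf, X, y, cv=3).mean())
print(cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean())
print(cross_val_score(clf, X, y, cv=3, scoring="balanced_accuracy").mean())
```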

Split

oneof
DefaultSplit

Standard train-test split configuration optimized for general classification tasks.

Configuration:

  • Test size: 20% (0.2)
  • Random seed: 98
  • Shuffling: Enabled
  • Stratification: Based on target distribution

Advantages:

  • Preserves class distribution
  • Provides reliable validation
  • Suitable for most datasets

Best for:

  • Medium to large datasets
  • Independent observations
  • Initial model evaluation

Splitting uses the ShuffleSplit or StratifiedShuffleSplit strategy, depending on the stratified field. Note: if shuffle is false, then stratified must also be false.

Configurable train-test split parameters for specialized requirements. Allows fine-tuning of data division strategy for specific use cases or constraints.

Use cases:

  • Time series data
  • Grouped observations
  • Specific train/test ratios
  • Custom validation schemes

Random seed for reproducible splits. Ensures:

  • Consistent train/test sets
  • Reproducible experiments
  • Comparable model evaluations

Same seed guarantees identical splits across runs.

true

Data shuffling before splitting. Effects:

  • true: Randomizes order, better for i.i.d. data
  • false: Maintains order, important for time series

When to disable:

  • Time dependent data
  • Sequential patterns
  • Grouped observations
0.8

Proportion of data for training. Considerations:

  • Larger (e.g., 0.8-0.9): Better model learning
  • Smaller (e.g., 0.5-0.7): Better validation

Common splits:

  • 0.8: Standard (80/20 split)
  • 0.7: More validation emphasis
  • 0.9: More training emphasis
false

Maintain class distribution in splits. Important when:

  • Classes are imbalanced
  • Small classes present
  • Representative splits needed

Requirements:

  • Classification tasks only
  • Cannot use with shuffle=false
  • Sufficient samples per class
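The split described above corresponds roughly to the following scikit-learn call (a sketch; the values mirror the documented defaults):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,    # 80/20 split
    shuffle=True,
    stratify=y,        # preserve class proportions; requires shuffle=True
    random_state=98,   # the documented default seed
)
```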

Cv

oneof
DefaultCv

Standard cross-validation configuration using stratified 3-fold splitting.

Configuration:

  • Folds: 3
  • Method: StratifiedKFold
  • Stratification: Preserves class proportions

Advantages:

  • Balanced evaluation
  • Reasonable computation time
  • Good for medium-sized datasets

Limitations:

  • May be insufficient for small datasets
  • Higher variance than larger fold counts
  • May miss some data patterns

Configurable stratified k-fold cross-validation for specific validation requirements.

Features:

  • Adjustable fold count with NFolds determining the number of splits.
  • Stratified sampling
  • Preserved class distributions

Use cases:

  • Small datasets (more folds)
  • Large datasets (fewer folds)
  • Detailed model evaluation
  • Robust performance estimation
NFolds

u64
3

Number of cross-validation folds. Guidelines:

  • 3-5: Large datasets, faster training
  • 5-10: Standard choice, good balance
  • 10+: Small datasets, thorough evaluation

Trade-offs:

  • More folds: Better evaluation, slower training
  • Fewer folds: Faster training, higher variance

Must be at least 2.

K-fold cross-validation without stratification. Divides data into k consecutive folds for iterative validation.

Process:

  • Splits data into k equal parts
  • Each fold serves as validation once
  • Remaining k-1 folds form training set

Use cases:

  • Regression problems
  • Large, balanced datasets
  • When stratification unnecessary
  • Continuous target variables

Limitations:

  • May not preserve class distributions
  • Less suitable for imbalanced data
  • Can create biased splits with ordered data

Number of folds for cross-validation. Recommended values:

  • 5: Standard choice (default)
  • 3: Large datasets/quick evaluation
  • 10: Thorough evaluation/smaller datasets

Trade-offs:

  • Higher values: More thorough, computationally expensive
  • Lower values: Faster, potentially higher variance

Must be at least 2 for valid cross-validation.

Random seed for fold generation when shuffling. Important for:

  • Reproducible results
  • Consistent fold assignments
  • Benchmark comparisons
  • Debugging and validation

Set specific value for reproducibility across runs.

true

Whether to shuffle data before splitting into folds. Effects:

  • true: Randomized fold composition (recommended)
  • false: Sequential splitting

Enable when:

  • Data may have ordering
  • Better fold independence needed

Disable for:

  • Time series data
  • Ordered observations

Stratified K-fold cross-validation maintaining class proportions across folds.

Key features:

  • Preserves class distribution in each fold
  • Handles imbalanced datasets
  • Ensures representative splits

Best for:

  • Classification problems
  • Imbalanced class distributions
  • When class proportions matter

Requirements:

  • Classification tasks only
  • Sufficient samples per class
  • Categorical target variable

Number of stratified folds. Typical values:

  • 5: Standard for most cases
  • 3: Quick evaluation/large datasets
  • 10: Detailed evaluation/smaller datasets

Considerations:

  • Must allow sufficient samples per class per fold
  • Balance between stability and computation time
  • Consider smallest class size when choosing

Seed for reproducible stratified splits. Ensures:

  • Consistent fold assignments
  • Reproducible results
  • Comparable experiments
  • Systematic validation

Fixed seed guarantees identical stratified splits.

false

Data shuffling before stratified splitting. Impact:

  • true: Randomizes while maintaining stratification
  • false: Maintains data order within strata

Use cases:

  • true: Independent observations
  • false: Grouped or sequential data

Class proportions maintained regardless of setting.
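A sketch of stratified k-fold evaluation in scikit-learn terms:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1_000, weights=[0.8, 0.2], random_state=0)

# shuffle=False keeps order within strata; use shuffle=True + random_state otherwise.
cv = StratifiedKFold(n_splits=5, shuffle=False)
scores = cross_val_score(SGDClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```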

Random permutation cross-validator with independent sampling.

Characteristics:

  • Random sampling for each split
  • Independent train/test sets
  • More flexible than K-fold
  • Can have overlapping test sets

Advantages:

  • Control over test size
  • Fresh splits each iteration
  • Good for large datasets

Limitations:

  • Some samples might never be tested
  • Others might be tested multiple times
  • No guarantee of complete coverage

Number of random splits to perform. Common values:

  • 5: Standard evaluation
  • 10: More thorough assessment
  • 3: Quick estimates

Trade-offs:

  • More splits: Better estimation, longer runtime
  • Fewer splits: Faster, less stable estimates

Balance between computation and stability.

Random seed for reproducible shuffling. Controls:

  • Split randomization
  • Sample selection
  • Result reproducibility

Important for:

  • Debugging
  • Comparative studies
  • Result verification
0.2

Proportion of samples for test set. Common ratios:

  • 0.2: Standard (80/20 split)
  • 0.25: More validation emphasis
  • 0.1: More training data

Considerations:

  • Dataset size
  • Model complexity
  • Validation requirements

It must be between 0.0 and 1.0.

Stratified random permutation cross-validator combining shuffle-split with stratification.

Features:

  • Maintains class proportions
  • Random sampling within strata
  • Independent splits
  • Flexible test size

Ideal for:

  • Imbalanced datasets
  • Large-scale problems
  • When class distributions matter
  • Flexible validation schemes

Number of stratified random splits. Recommended values:

  • 5: Standard evaluation
  • 10: Detailed analysis
  • 3: Quick assessment

Consider:

  • Sample size per class
  • Computational resources
  • Stability requirements

Seed for reproducible stratified sampling. Ensures:

  • Consistent class proportions
  • Reproducible splits
  • Comparable experiments

Critical for:

  • Benchmarking
  • Research studies
  • Quality assurance
0.2

Fraction of samples for stratified test set. Common splits:

  • 0.2: Balanced evaluation
  • 0.3: More thorough testing
  • 0.15: Preserve training size

Consider:

  • Minority class size
  • Overall dataset size
  • Validation objectives

It must be between 0.0 and 1.0.

Time Series cross-validator. Provides train/test indices to split time-series samples that are observed at fixed intervals. It is a variation of k-fold: the first k folds serve as the training set and the (k+1)-th fold as the test set. Unlike standard cross-validation methods, successive training sets are supersets of those that come before them; surplus data is added to the first training partition, which is always used to train the model. Key features:

  • Maintains temporal dependence
  • Expanding window approach
  • Forward-chaining splits
  • No future data leakage

Use cases:

  • Sequential data
  • Financial forecasting
  • Temporal predictions
  • Time-dependent patterns

Note: Training sets are supersets of previous iterations.

Number of temporal splits. Typical values:

  • 5: Standard forward chaining
  • 3: Limited historical data
  • 10: Long time series

Impact:

  • Affects training window growth
  • Determines validation points
  • Influences computational load

Maximum size of training set. Should be strictly less than the number of samples. Applications:

  • 0: Use all available past data
  • >0: Rolling window of fixed size

Use cases:

  • Limit historical relevance
  • Control computational cost
  • Handle concept drift
  • Memory constraints

Number of samples in each test set. When 0:

  • Auto-calculated as n_samples/(n_splits+1)
  • Ensures equal-sized test sets

Considerations:

  • Forecast horizon
  • Validation requirements
  • Available future data

Gap

u64
0

Number of samples to exclude from the end of each train set before the test set (the gap between train and test sets). Uses:

  • Avoid data leakage
  • Model forecast lag
  • Buffer periods

Common scenarios:

  • 0: Continuous prediction
  • >0: Forward gap for realistic evaluation
  • Match business forecasting needs
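A sketch of the corresponding scikit-learn splitter; the gap and test_size arguments assume scikit-learn 0.24 or newer:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3, gap=2)  # 2-sample buffer before each test set
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)  # expanding training windows
```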