MultiLayerPerceptron / Classifier Layer

Multi-Layer Perceptron Classifier - A neural network architecture for classification tasks. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

Mathematical form (one hidden layer): $f(x) = W_2\,\sigma(W_1 x + b_1) + b_2$, where:

  • $\sigma$ is the activation function
  • $W_1, W_2$ are weight matrices
  • $b_1, b_2$ are bias vectors
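
As a rough illustration of how the activation function, weight matrices, and bias vectors combine, the following sketch computes the forward pass of a network with one hidden layer. The ReLU hidden activation, the softmax output, the helper names, and the toy weights are all assumptions chosen for the example; they are not taken from this component's implementation.

```python
import math

def relu(z):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in z]

def affine(W, x, b):
    """Compute W @ x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    """Turn raw scores into class probabilities (shifted for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, W1, b1, W2, b2):
    """Hidden layer with ReLU, followed by a softmax output layer."""
    hidden = relu(affine(W1, x, b1))
    return softmax(affine(W2, hidden, b2))

# Toy network: 2 inputs -> 3 hidden units -> 2 classes (made-up weights)
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.6, 0.2], [-0.1, 0.5, 0.7]]
b2 = [0.05, -0.05]

probs = mlp_forward([1.0, 2.0], W1, b1, W2, b2)
print(probs)
```

The returned list holds one probability per class and sums to 1; the predicted label is the index of the largest entry.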

Key characteristics:

  • Deep learning architecture
  • Non-linear function approximation
  • Gradient-based optimization
  • Automatic feature learning
  • Universal function approximator

Common applications:

  • Image classification
  • Pattern recognition
  • Financial prediction
  • Medical diagnosis
  • Speech recognition

Outputs:

  1. Predicted Table: Input data with predictions
  2. Validation Results: Cross-validation metrics
  3. Test Metric: Test set performance
  4. ROC Curve Data: ROC analysis information
  5. Confusion Matrix: Classification breakdown
  6. Feature Importances: Neural network weights

Note: Performance depends heavily on architecture and hyperparameter choices

Input ports:

  0. Table

Output ports:

  0. Predicted Table
  1. Validation Results
  2. Test Metric
  3. ROC Curve Data
  4. Confusion Matrix
  5. Feature Importances

SelectFeatures

[column, ...]

Feature columns for neural network classification:

Requirements:

  1. Data preparation:

    • Numerical features
    • Standardized/normalized values
    • No missing values
    • Finite numbers only
  2. Scaling recommendations:

    • StandardScaler: Zero mean, unit variance
    • MinMaxScaler: [0,1] or [-1,1] range
    • RobustScaler: For outlier presence
    • Normalize: For uniform feature scales
  3. Feature engineering:

    • Polynomial features for non-linearity
    • Interaction terms
    • Domain-specific transformations
    • Dimensionality reduction if needed
  4. Categorical handling:

    • One-hot encoding
    • Label encoding (ordinal)
    • Embedding representations
    • Feature hashing
  5. Quality checks:

    • Feature correlations
    • Distribution analysis
    • Outlier detection
    • Missing value handling
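
The zero-mean, unit-variance scaling mentioned above can be sketched in a few lines. This is a minimal standard-library illustration, not the scaler used by this component; `standardize_columns` is a hypothetical helper name, and it uses the population standard deviation.

```python
import math

def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    # A constant column has std 0; only centre it instead of dividing by zero
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two features on very different scales
data = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = standardize_columns(data)
print(scaled)
```

After scaling, both columns carry the same values, so neither feature dominates the weighted sums in the first layer purely because of its units.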

Best practices:

  • Scale features to similar ranges
  • Remove highly correlated features
  • Handle outliers appropriately
  • Consider feature importance

Note: If empty, uses all numeric columns except target

Target column for neural network classification:

Requirements:

  1. Data characteristics:

    • Categorical labels
    • No missing values
    • At least two classes
    • Properly encoded values
  2. Encoding formats:

    • Integer encoding (0 to n_classes-1)
    • One-hot encoding for multi-class
    • Binary encoding for two classes
    • Label encoding for ordinal classes
  3. Class distribution:

    • Monitor class balance
    • Consider class weights
    • Handle imbalanced cases
    • Stratification needs
  4. Quality considerations:

    • Label consistency
    • Class definitions
    • Annotation quality
    • Error patterns

Model outputs:

  • Probability distributions per class
  • Predicted class labels
  • Confidence scores
  • Classification metrics

Best practices:

  • Verify label quality
  • Check class distributions
  • Consider resampling if imbalanced
  • Use appropriate evaluation metrics

Note: Critical for defining the classification task

Params

oneof
DefaultParams

Optimized default configuration for Multi-Layer Perceptron:

Architecture:

  • Single hidden layer with 100 neurons
  • ReLU activation function
  • Adam optimizer

Training parameters:

  • L2 regularization (alpha): 0.0001
  • Learning rate: 0.001
  • Max iterations: 200
  • Batch size: auto

Best suited for:

  • Medium-sized datasets
  • General classification tasks
  • Initial modeling exploration
  • Balanced class problems

Fine-tuned configuration for Multi-Layer Perceptron classifier:

Parameter categories:

  1. Architecture:

    • Network structure
    • Activation functions
    • Layer sizes
  2. Optimization:

    • Solver selection
    • Learning rates
    • Momentum settings
  3. Regularization:

    • L2 penalty
    • Early stopping
    • Validation control
  4. Training control:

    • Batch processing
    • Iteration limits
    • Convergence criteria

Note: Parameter interactions significantly impact model performance

Number of neurons in hidden layers:

Architecture impact:

  • Determines model capacity
  • Affects feature extraction
  • Controls learning complexity

Selection guide:

  • Small (10-50): Simple problems, fast training
  • Medium (50-200): Balanced complexity
  • Large (200+): Complex patterns, more data

Considerations:

  • Data size vs layer size
  • Overfitting risk
  • Computational cost
  • Memory requirements
Relu

Activation function for hidden layers:

Mathematical role:

  • Introduces non-linearity in the network
  • Transforms weighted sums at each layer
  • Maps inputs to bounded/unbounded ranges
  • Enables complex pattern learning

Selection criteria:

  1. Problem characteristics:

    • Data distribution
    • Output range needs
    • Learning complexity
  2. Training considerations:

    • Gradient behavior
    • Convergence speed
    • Vanishing/exploding gradients
  3. Computational efficiency:

    • Calculation speed
    • Memory requirements
    • Hardware optimization

Note: Choice significantly impacts model performance and training dynamics

Identity ~

Linear activation function:

Properties:

  • No non-linearity
  • Gradient always 1.0
  • Unbounded output
  • Linear transformation

Best for:

  • Linear relationships
  • Final layers
  • Feature passing
  • Linear bottlenecks

Limitations:

  • Cannot learn non-linear patterns
  • Limited representation power
  • May cause model underfitting
Logistic ~

Sigmoid activation function:

Properties:

  • Output range: (0, 1)
  • Smooth, continuous
  • Saturating gradients
  • Probabilistic interpretation

Best for:

  • Binary classification
  • Probability outputs
  • Bounded predictions
  • Final layer binary tasks

Limitations:

  • Vanishing gradients
  • Saturating outputs
  • Not zero-centered
Tanh ~

Hyperbolic tangent function:

Properties:

  • Output range: (-1, 1)
  • Zero-centered
  • Smooth, continuous
  • Stronger gradients than sigmoid

Best for:

  • Hidden layers
  • Normalized inputs
  • When zero-centering matters
  • Complex pattern recognition

Limitations:

  • Still has vanishing gradients
  • Saturating activations
  • Computationally more expensive
Relu ~

Rectified Linear Unit:

Properties:

  • Output range: [0, ∞)
  • Non-saturating
  • Computationally efficient
  • Sparse activation

Best for:

  • Deep networks
  • Hidden layers
  • Fast training
  • Most modern networks

Advantages:

  • Reduces vanishing gradients
  • Biological plausibility
  • Fast convergence
  • Sparse representations

Limitations:

  • Dying ReLU problem
  • Not zero-centered
  • Unbounded output
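
For reference, the four activation options can be written as scalar functions. This is a minimal sketch using only the standard library; the lowercase dictionary keys are illustrative names, not identifiers taken from this component.

```python
import math

def identity(z):
    return z

def logistic(z):
    # Fine for moderate |z|; very negative inputs would overflow math.exp
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

ACTIVATIONS = {"identity": identity, "logistic": logistic,
               "tanh": tanh, "relu": relu}

# Evaluate each activation at a negative, zero, and positive input
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)])
```

Printing the values side by side makes the differences in output range and zero-centering easy to see.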

Solver

enum
Adam

Optimization algorithm for neural network training:

Mathematical role:

  • Minimizes loss function
  • Updates network weights
  • Finds optimal parameters
  • Controls convergence behavior

Selection criteria:

  1. Dataset characteristics:

    • Size (samples, features)
    • Memory constraints
    • Training time budget
  2. Optimization needs:

    • Convergence speed
    • Solution quality
    • Local minima handling
  3. Resource considerations:

    • Memory usage
    • Computational cost
    • Hardware acceleration

Note: Choice significantly impacts training efficiency and model performance

Lbfgs ~

Limited-memory BFGS (quasi-Newton method):

Algorithm properties:

  • Second-order optimization
  • Uses curvature information
  • Full batch updates
  • Memory-efficient approximation

Best for:

  • Small datasets (<1000 samples)
  • High-quality solutions
  • Memory-constrained settings
  • Smooth optimization problems

Advantages:

  • Fast convergence
  • No learning rate tuning
  • Better local minima
  • Handles ill-conditioning

Limitations:

  • Not suitable for large datasets
  • Higher memory per update
  • No online/mini-batch learning
  • May be slower on simple problems
Sgd ~

Stochastic Gradient Descent with momentum:

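Algorithm: $v_{t+1} = \mu v_t - \eta \nabla L(w_t)$, $w_{t+1} = w_t + v_{t+1}$, where:

  • $\eta$ is the learning rate
  • $\mu$ is the momentum
  • $v_t$ is the velocity
  • $\nabla L(w_t)$ is the gradient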

Best for:

  • Large datasets
  • Online learning
  • Simple optimization
  • Limited memory settings

Advantages:

  • Low memory usage
  • Easy to implement
  • Works with any batch size
  • Escapes local minima

Limitations:

  • Requires learning rate tuning
  • Sensitive to scaling
  • May converge slowly
  • Needs momentum tuning
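
A minimal sketch of a momentum update, applied to the one-dimensional loss L(w) = w², may help make the roles of learning rate, momentum, velocity, and gradient concrete. The function name, the starting point, and the hyperparameter values are assumptions for illustration only.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One classical-momentum update: v <- momentum*v - lr*grad, then w <- w + v."""
    v_new = momentum * v - lr * grad
    return w + v_new, v_new

# Minimise L(w) = w**2, whose gradient is 2*w
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=2.0 * w)

print(w)  # close to the minimiser w = 0
```

With these settings the iterate oscillates around zero while the velocity decays, which is the behavior the momentum term is meant to produce.
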
Adam ~

Adaptive Moment Estimation optimizer:

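Algorithm: Combines RMSprop and momentum. With gradient $g_t$, learning rate $\eta$, and decay rates $\beta_1$, $\beta_2$:

  • $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  • $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
  • $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
  • $w_t = w_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$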

Best for:

  • Most deep learning tasks
  • Large datasets
  • Non-stationary objectives
  • Sparse gradients

Advantages:

  • Adaptive learning rates
  • Robust to hyperparameters
  • Handles sparse gradients
  • Fast convergence

Limitations:

  • Memory overhead
  • May not generalize as well as SGD
  • Can be sensitive to beta parameters
  • Higher computational cost

Note: Recommended default choice for most problems
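
The following sketch shows one way to write a single Adam update for a scalar parameter, using the standard formulation with bias-corrected moment estimates. The function name and the quadratic test problem are assumptions for illustration, and the sketch is not meant to reproduce this component's implementation exactly.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad          # first moment (mean of gradients)
    v = beta_2 * v + (1.0 - beta_2) * grad * grad   # second moment (mean of squares)
    m_hat = m / (1.0 - beta_1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

# Minimise L(w) = w**2, whose gradient is 2*w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, grad=2.0 * w, t=t, lr=0.05)

print(w)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.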

0.0001

L2 regularization strength:

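Effect: adds a penalty proportional to $\alpha \lVert W \rVert_2^2$ to the loss, where $\alpha$ is this strength and $W$ the network weights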

Impact:

  • Controls overfitting
  • Reduces weight magnitude
  • Improves generalization

Typical ranges:

  • Weak: 1e-5 to 1e-4
  • Medium: 1e-4 to 1e-3
  • Strong: 1e-3 to 1e-2

Note: Higher values mean stronger regularization

200

Mini-batch size for gradient updates:

Impact on training:

  • Memory usage
  • Training speed
  • Gradient noise
  • Convergence behavior

Selection guide:

  • Small (16-32): More noise, better generalization
  • Medium (32-256): Balanced performance
  • Large (256+): Faster training, stable gradients

Note: Ignored when solver is 'lbfgs'
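
To illustrate how the mini-batch size partitions one pass over the data, here is a small sketch. `iter_minibatches` is a hypothetical helper, and the shuffling and seeding details are assumptions rather than a description of this component's internals.

```python
import random

def iter_minibatches(n_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per mini-batch."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the order reproducible between runs
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iter_minibatches(n_samples=10, batch_size=4))
print([len(b) for b in batches])  # → [4, 4, 2]
```

Each sample appears in exactly one batch per epoch, and the last batch is smaller when the batch size does not divide the number of samples.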

Constant

Learning rate schedule for gradient-based optimization:

Mathematical role:

  • Controls parameter update step size
  • Influences convergence speed
  • Balances exploration vs exploitation
  • Affects training stability

Selection criteria:

  1. Training dynamics:

    • Convergence behavior
    • Loss landscape
    • Training stability
  2. Problem characteristics:

    • Dataset size
    • Model complexity
    • Optimization difficulty
  3. Computational needs:

    • Training time budget
    • Resource constraints
    • Performance requirements

Note: Critical parameter affecting both training speed and model performance

Constant ~

Fixed learning rate throughout training:

Behavior: $\eta_t = \eta_0$, where:

  • $\eta_t$ is the learning rate at time t
  • $\eta_0$ is the initial learning rate

Best for:

  • Simple problems
  • Well-behaved loss surfaces
  • Short training runs
  • Initial experimentation

Advantages:

  • Simple to implement
  • Predictable behavior
  • No additional parameters
  • Easy to debug

Limitations:

  • May converge slowly
  • Can miss fine optimization
  • Requires careful rate selection
  • Not adaptive to progress
Invscaling ~

Inverse scaling learning rate schedule:

Formula: $\eta_t = \eta_0 / t^{p}$, where:

  • $\eta_t$ is the learning rate at time t
  • $\eta_0$ is the initial learning rate
  • $p$ is the power_t parameter
  • $t$ is the current iteration

Best for:

  • Long training runs
  • Gradual convergence needs
  • Non-stationary problems
  • When final precision matters

Advantages:

  • Theoretical guarantees
  • Automatic rate reduction
  • Better final convergence
  • Handles non-stationarity

Limitations:

  • May decay too quickly
  • Sensitive to power parameter
  • Fixed decay schedule
  • Not adaptive to progress
Adaptive ~

Adaptive learning rate based on training loss:

Strategy:

  • Keeps initial rate while loss decreases
  • Decreases rate when loss plateaus
  • Divides rate by 5 after stagnation
  • Continues until convergence or max_iter

Best for:

  • Complex problems
  • Unknown loss landscapes
  • When optimal rate unknown
  • Production training

Advantages:

  • Automatic rate adjustment
  • Responds to training progress
  • Handles varying difficulty
  • More robust training

Limitations:

  • More complex implementation
  • May need early stopping
  • Additional computation overhead
  • Requires patience parameter

Note: Recommended for most practical applications
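
The adaptive strategy can be sketched as follows. This is a simplified illustration: the `patience` argument (number of stalled epochs before a reduction) and the exact stagnation test are assumptions, since the text above only states that the rate is divided by 5 after stagnation.

```python
def adaptive_schedule(losses, learning_rate_init=0.1, tol=1e-4, patience=2, factor=5.0):
    """Return the learning rate in effect after each epoch's loss is observed."""
    rate = learning_rate_init
    best = float("inf")
    stalled = 0
    rates = []
    for loss in losses:
        if loss < best - tol:
            stalled = 0          # loss improved by more than tol
        else:
            stalled += 1         # no meaningful improvement this epoch
        best = min(best, loss)
        if stalled >= patience:
            rate /= factor       # reduce the rate and start counting again
            stalled = 0
        rates.append(rate)
    return rates

rates = adaptive_schedule([1.0, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6])
print(rates)
```

In this run the rate stays at 0.1 while the loss falls, drops to about 0.02 after two stalled epochs, and drops again to about 0.004 after the second plateau.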

Initial learning rate:

Impact:

  • Controls step size
  • Affects convergence speed
  • Influences stability

Typical ranges:

  • Conservative: 1e-4 to 1e-3
  • Standard: 1e-3 to 1e-2
  • Aggressive: 1e-2 to 1e-1

Note: Only used with 'sgd' or 'adam' solvers

0.5

Power for inverse scaling learning rate:

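Formula: $\eta_t = \eta_0 / t^{p}$, where $p$ is this power, $\eta_0$ is the initial learning rate, and $t$ is the current iteration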

Typical values:

  • 0.5: Standard decay (default)
  • < 0.5: Slower decay
  • > 0.5: Faster decay

Note: Only used when learning_rate is 'invscaling'
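
A short sketch of the inverse scaling schedule shows how the exponent controls the decay. The function name is illustrative, and the step counter is assumed to start at 1.

```python
def invscaling_rate(learning_rate_init, t, power_t=0.5):
    """Effective learning rate at step t >= 1 under inverse scaling."""
    return learning_rate_init / (t ** power_t)

# Rates at steps 1, 4, and 100 with the default exponent 0.5
schedule = [invscaling_rate(0.1, t) for t in (1, 4, 100)]
print(schedule)

# A larger exponent decays faster at the same step
print(invscaling_rate(0.1, 100, power_t=1.0))
```

With an exponent of 0.5 the rate falls to roughly half its initial value by step 4 and to a tenth by step 100; an exponent of 1.0 reaches a hundredth by step 100.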

200

Maximum number of training iterations:

Guidelines:

  • Small (100-300): Simple problems
  • Medium (300-1000): Standard tasks
  • Large (1000+): Complex problems

Considerations:

  • Training time
  • Convergence needs
  • Early stopping use
  • Problem complexity
true

Whether to shuffle samples in each iteration.

Effects:

  • Prevents order bias
  • Improves convergence
  • Reduces overfitting
  • Better generalization

Note: Only affects 'sgd' and 'adam' solvers

Determines random number generation for weights and bias initialization, train-test split if early stopping is used, and batch sampling when solver='sgd' or 'adam'.

Controls randomness in:

  • Weight initialization
  • Data shuffling
  • Batch sampling

Important for:

  • Reproducibility
  • Result comparison
  • Debugging
  • Validation

Tol

f64
0.0001

Optimization tolerance:

Convergence criterion:

  • Stops when loss improvement < tol
  • Affects training duration
  • Controls precision

Typical ranges:

  • Strict: 1e-5 to 1e-4
  • Standard: 1e-4 to 1e-3
  • Loose: 1e-3 to 1e-2
true

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

Applications:

  • Transfer learning
  • Incremental learning
  • Parameter searching
  • Model refinement

Benefits:

  • Faster convergence
  • Better solutions
  • Parameter reuse
  • Continuous learning
0.9

Momentum for gradient descent:

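Formula: $v_{t+1} = \mu v_t - \eta \nabla L(w_t)$, $w_{t+1} = w_t + v_{t+1}$, where $\mu$ is the momentum, $\eta$ is the learning rate, and $v_t$ is the velocity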

Effects:

  • Accelerates convergence
  • Reduces oscillation
  • Escapes local minima

Typical values:

  • Low (0.1-0.5): More stable
  • Medium (0.5-0.9): Standard choice
  • High (0.9-0.999): Faster convergence

Note: Only used with 'sgd' solver

Whether to use Nesterov's momentum:

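Mathematical form:

  • Classical: $v_{t+1} = \mu v_t - \eta \nabla L(w_t)$
  • Nesterov: $v_{t+1} = \mu v_t - \eta \nabla L(w_t + \mu v_t)$

In both cases $w_{t+1} = w_t + v_{t+1}$, with momentum $\mu$ and learning rate $\eta$.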

Advantages:

  • Faster convergence
  • Better theoretical guarantees
  • Improved acceleration
  • More stable updates

Used when:

  • momentum > 0
  • 'sgd' solver

Best when:

  • Long training runs
  • Smooth loss surface

Note: Often improves standard momentum, especially for deep networks
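
The difference between the two variants is where the gradient is evaluated. The sketch below uses the common look-ahead formulation of Nesterov momentum on the loss L(w) = w²; the function names and settings are assumptions for illustration, and an actual implementation may use an algebraically rearranged form.

```python
def grad(w):
    return 2.0 * w  # gradient of L(w) = w**2

def classical_step(w, v, lr, momentum):
    """Gradient is evaluated at the current position."""
    v = momentum * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, lr, momentum):
    """Gradient is evaluated at the look-ahead position w + momentum*v."""
    v = momentum * v - lr * grad(w + momentum * v)
    return w + v, v

def run(step, n=50, lr=0.1, momentum=0.9):
    w, v = 5.0, 0.0
    for _ in range(n):
        w, v = step(w, v, lr, momentum)
    return w

print("classical:", run(classical_step))
print("nesterov: ", run(nesterov_step))
```

Running both with identical settings gives a quick way to compare how far each variant is from the minimum after the same number of updates.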

false

Enable early stopping based on validation score:

Stopping criterion:

  • Monitors validation score
  • Stops when no improvement for n_iter_no_change epochs
  • Improvement threshold defined by 'tol'

Benefits:

  • Prevents overfitting
  • Reduces training time
  • Automatic model selection
  • Optimal epoch determination

Requirements:

  • Validation data (split from training)
  • Patience parameter (n_iter_no_change)
  • Improvement threshold (tol)
  • Monitoring metric

Note: Uses validation_fraction of training data for monitoring
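
The stopping criterion can be sketched as a patience counter over per-epoch validation scores. This is a simplified illustration under stated assumptions: `early_stopping_epoch` is a hypothetical helper, and the exact comparison used by this component (for example, strict versus non-strict inequality on the counter) is not specified above.

```python
def early_stopping_epoch(val_scores, tol=1e-4, n_iter_no_change=10):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    no_change = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            no_change = 0        # improvement larger than tol resets the counter
        else:
            no_change += 1
        best = max(best, score)
        if no_change >= n_iter_no_change:
            return epoch
    return None

scores = [0.70, 0.75, 0.78, 0.78, 0.779, 0.78, 0.781]
stop = early_stopping_epoch(scores, tol=1e-3, n_iter_no_change=3)
print(stop)  # → 6
```

Here the score peaks at epoch 3 and the next three epochs fail to beat it by more than `tol`, so training stops at epoch 6; with the default patience of 10 the same scores would not trigger a stop.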

Fraction of training data for early stopping:

Usage:

  • Splits training data into train/validation
  • Validation set monitors convergence
  • Affects early stopping decisions

Selection guide:

  • Small (0.1): More training data, less reliable stopping
  • Medium (0.2): Balanced split
  • Large (0.3): More reliable stopping, less training data

Considerations:

  • Dataset size
  • Problem complexity
  • Validation stability
  • Training data needs

Note: Only used when early_stopping is True

0.9

Exponential decay rate for first moment in Adam:

Formula: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, where $g_t$ is the gradient at step t

Properties:

  • Controls moving average of gradient
  • Affects momentum behavior
  • Influences update step size

Typical values:

  • 0.9: Default, works well generally
  • < 0.9: Less momentum, more responsive
  • > 0.9: More momentum, smoother updates

Note: Only used with 'adam' solver

0.999

Exponential decay rate for second moment in Adam:

Formula: $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, where $g_t$ is the gradient at step t

Properties:

  • Controls moving average of squared gradient
  • Affects learning rate adaptation
  • Influences update scale

Typical values:

  • 0.999: Default, stable convergence
  • < 0.999: Faster adaptation, more variance
  • > 0.999: Slower adaptation, more stability

Note: Only used with 'adam' solver

0.00000001

Numerical stability constant for Adam:

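Usage in update: $w_t = w_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates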

Purpose:

  • Prevents division by zero
  • Improves numerical stability
  • Controls minimum step size

Typical values:

  • 1e-8: Default, works well generally
  • 1e-7 to 1e-5: More stability, larger minimum steps
  • 1e-10 to 1e-8: More precision, risk of instability

Note: Only used with 'adam' solver

Maximum iterations without improvement:

Early stopping behavior:

  • Monitors score improvement
  • Stops if no improvement > tol
  • Counts consecutive iterations

Selection guide:

  • Small (5-10): Aggressive stopping
  • Medium (10-20): Balanced patience
  • Large (20+): Conservative stopping

Considerations:

  • Training stability
  • Convergence patterns
  • Time constraints
  • Loss landscape
0

Maximum number of loss function evaluations:

Purpose:

  • Limits computational budget
  • Controls optimization duration
  • Prevents excessive iterations

Typical ranges:

  • Small (5000): Quick optimization
  • Medium (15000): Standard problems
  • Large (50000+): Complex optimization

Note: Only applies to 'lbfgs' solver

Warning: Small values may prevent convergence

Hyperparameter optimization through grid search for Multi-Layer Perceptron:

Search process:

  1. Architecture optimization:

    • Network size
    • Activation functions
    • Layer configuration
  2. Training dynamics:

    • Optimization algorithms
    • Learning rates
    • Momentum parameters
  3. Regularization tuning:

    • L2 penalties
    • Early stopping
    • Validation settings

Computational complexity:

  • Total configurations = Product of parameter options
  • Time complexity = O(configurations × epochs × samples)
  • Memory needs = O(parameters × configurations)

Best practices:

  • Start with coarse grid
  • Refine promising regions
  • Monitor computational costs
  • Use domain knowledge

Note: Consider computational budget when defining parameter spaces
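
The "product of parameter options" rule can be checked before launching a search. The sketch below enumerates a small grid with the standard library; the parameter names, the candidate values, and the fold count of 5 are assumptions for illustration.

```python
from itertools import product

# Hypothetical grid: keys and values are examples only
param_grid = {
    "hidden_layer_sizes": [50, 100, 150],
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "learning_rate_init": [3e-4, 1e-3, 3e-3],
    "solver": ["sgd", "adam"],
}

# One dict per configuration, covering every combination of options
configurations = [dict(zip(param_grid, values))
                  for values in product(*param_grid.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(configurations) * n_folds

print(len(configurations))  # → 72
print(total_fits)           # → 360
```

Adding one more three-valued parameter would triple both numbers, which is why starting from a coarse grid is usually the cheaper option.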

100

Grid of hidden layer sizes to evaluate:

Search strategies:

  1. Linear scale:

    • Small: [50, 100, 150]
    • Medium: [100, 200, 300]
    • Large: [200, 400, 600]
  2. Logarithmic scale:

    • Wide range: [10, 100, 1000]
    • Balanced: [32, 64, 128, 256]

Selection criteria:

  • Data complexity
  • Feature dimensionality
  • Sample size
  • Memory constraints

Note: Larger networks require more data and computation

Activation

[enum, ...]
Relu

Activation function for hidden layers:

Mathematical role:

  • Introduces non-linearity in the network
  • Transforms weighted sums at each layer
  • Maps inputs to bounded/unbounded ranges
  • Enables complex pattern learning

Selection criteria:

  1. Problem characteristics:

    • Data distribution
    • Output range needs
    • Learning complexity
  2. Training considerations:

    • Gradient behavior
    • Convergence speed
    • Vanishing/exploding gradients
  3. Computational efficiency:

    • Calculation speed
    • Memory requirements
    • Hardware optimization

Note: Choice significantly impacts model performance and training dynamics

Identity ~

Linear activation function:

Properties:

  • No non-linearity
  • Gradient always 1.0
  • Unbounded output
  • Linear transformation

Best for:

  • Linear relationships
  • Final layers
  • Feature passing
  • Linear bottlenecks

Limitations:

  • Cannot learn non-linear patterns
  • Limited representation power
  • May cause model underfitting
Logistic ~

Sigmoid activation function:

Properties:

  • Output range: (0, 1)
  • Smooth, continuous
  • Saturating gradients
  • Probabilistic interpretation

Best for:

  • Binary classification
  • Probability outputs
  • Bounded predictions
  • Final layer binary tasks

Limitations:

  • Vanishing gradients
  • Saturating outputs
  • Not zero-centered
Tanh ~

Hyperbolic tangent function:

Properties:

  • Output range: (-1, 1)
  • Zero-centered
  • Smooth, continuous
  • Stronger gradients than sigmoid

Best for:

  • Hidden layers
  • Normalized inputs
  • When zero-centering matters
  • Complex pattern recognition

Limitations:

  • Still has vanishing gradients
  • Saturating activations
  • Computationally more expensive
Relu ~

Rectified Linear Unit:

Properties:

  • Output range: [0, ∞)
  • Non-saturating
  • Computationally efficient
  • Sparse activation

Best for:

  • Deep networks
  • Hidden layers
  • Fast training
  • Most modern networks

Advantages:

  • Reduces vanishing gradients
  • Biological plausibility
  • Fast convergence
  • Sparse representations

Limitations:

  • Dying ReLU problem
  • Not zero-centered
  • Unbounded output

Solver

[enum, ...]
Adam

Optimization algorithm for neural network training:

Mathematical role:

  • Minimizes loss function
  • Updates network weights
  • Finds optimal parameters
  • Controls convergence behavior

Selection criteria:

  1. Dataset characteristics:

    • Size (samples, features)
    • Memory constraints
    • Training time budget
  2. Optimization needs:

    • Convergence speed
    • Solution quality
    • Local minima handling
  3. Resource considerations:

    • Memory usage
    • Computational cost
    • Hardware acceleration

Note: Choice significantly impacts training efficiency and model performance

Lbfgs ~

Limited-memory BFGS (quasi-Newton method):

Algorithm properties:

  • Second-order optimization
  • Uses curvature information
  • Full batch updates
  • Memory-efficient approximation

Best for:

  • Small datasets (<1000 samples)
  • High-quality solutions
  • Memory-constrained settings
  • Smooth optimization problems

Advantages:

  • Fast convergence
  • No learning rate tuning
  • Better local minima
  • Handles ill-conditioning

Limitations:

  • Not suitable for large datasets
  • Higher memory per update
  • No online/mini-batch learning
  • May be slower on simple problems
Sgd ~

Stochastic Gradient Descent with momentum:

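Algorithm: $v_{t+1} = \mu v_t - \eta \nabla L(w_t)$, $w_{t+1} = w_t + v_{t+1}$, where:

  • $\eta$ is the learning rate
  • $\mu$ is the momentum
  • $v_t$ is the velocity
  • $\nabla L(w_t)$ is the gradient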

Best for:

  • Large datasets
  • Online learning
  • Simple optimization
  • Limited memory settings

Advantages:

  • Low memory usage
  • Easy to implement
  • Works with any batch size
  • Escapes local minima

Limitations:

  • Requires learning rate tuning
  • Sensitive to scaling
  • May converge slowly
  • Needs momentum tuning
Adam ~

Adaptive Moment Estimation optimizer:

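Algorithm: Combines RMSprop and momentum. With gradient $g_t$, learning rate $\eta$, and decay rates $\beta_1$, $\beta_2$:

  • $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
  • $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
  • $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
  • $w_t = w_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$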

Best for:

  • Most deep learning tasks
  • Large datasets
  • Non-stationary objectives
  • Sparse gradients

Advantages:

  • Adaptive learning rates
  • Robust to hyperparameters
  • Handles sparse gradients
  • Fast convergence

Limitations:

  • Memory overhead
  • May not generalize as well as SGD
  • Can be sensitive to beta parameters
  • Higher computational cost

Note: Recommended default choice for most problems

Alpha

[f64, ...]
0.0001

L2 regularization strengths to evaluate:

Search spaces:

  1. Linear scale:

    • Fine: [0.0001, 0.0005, 0.001]
    • Medium: [0.001, 0.01, 0.1]
  2. Log scale (recommended):

    • Wide: [1e-5, 1e-4, 1e-3, 1e-2]
    • Focused: [1e-4, 3e-4, 1e-3]

Selection strategy:

  • Start with log-spaced values
  • Monitor validation curves
  • Check for overfitting
  • Consider model size

Note: Higher values = stronger regularization

BatchSize

[u64, ...]
200

Mini-batch sizes to evaluate:

Search ranges:

  1. Power of 2 (common):

    • Standard: [32, 64, 128, 256]
    • Extended: [16, 32, 64, 128, 256, 512]
  2. Linear scale:

    • Small: [50, 100, 200]
    • Large: [200, 400, 800]

Considerations:

  • Memory constraints
  • Training stability
  • Parallelization
  • Hardware optimization

Note: Only relevant for 'sgd' and 'adam' solvers

LearningRate

[enum, ...]
Constant

Learning rate schedule for gradient-based optimization:

Mathematical role:

  • Controls parameter update step size
  • Influences convergence speed
  • Balances exploration vs exploitation
  • Affects training stability

Selection criteria:

  1. Training dynamics:

    • Convergence behavior
    • Loss landscape
    • Training stability
  2. Problem characteristics:

    • Dataset size
    • Model complexity
    • Optimization difficulty
  3. Computational needs:

    • Training time budget
    • Resource constraints
    • Performance requirements

Note: Critical parameter affecting both training speed and model performance

Constant ~

Fixed learning rate throughout training:

Behavior: $\eta_t = \eta_0$, where:

  • $\eta_t$ is the learning rate at time t
  • $\eta_0$ is the initial learning rate

Best for:

  • Simple problems
  • Well-behaved loss surfaces
  • Short training runs
  • Initial experimentation

Advantages:

  • Simple to implement
  • Predictable behavior
  • No additional parameters
  • Easy to debug

Limitations:

  • May converge slowly
  • Can miss fine optimization
  • Requires careful rate selection
  • Not adaptive to progress
Invscaling ~

Inverse scaling learning rate schedule:

Formula: $\eta_t = \eta_0 / t^{p}$, where:

  • $\eta_t$ is the learning rate at time t
  • $\eta_0$ is the initial learning rate
  • $p$ is the power_t parameter
  • $t$ is the current iteration

Best for:

  • Long training runs
  • Gradual convergence needs
  • Non-stationary problems
  • When final precision matters

Advantages:

  • Theoretical guarantees
  • Automatic rate reduction
  • Better final convergence
  • Handles non-stationarity

Limitations:

  • May decay too quickly
  • Sensitive to power parameter
  • Fixed decay schedule
  • Not adaptive to progress
Adaptive ~

Adaptive learning rate based on training loss:

Strategy:

  • Keeps initial rate while loss decreases
  • Decreases rate when loss plateaus
  • Divides rate by 5 after stagnation
  • Continues until convergence or max_iter

Best for:

  • Complex problems
  • Unknown loss landscapes
  • When optimal rate unknown
  • Production training

Advantages:

  • Automatic rate adjustment
  • Responds to training progress
  • Handles varying difficulty
  • More robust training

Limitations:

  • More complex implementation
  • May need early stopping
  • Additional computation overhead
  • Requires patience parameter

Note: Recommended for most practical applications

0.001

Initial learning rates to evaluate:

Search spaces:

  1. Log scale (recommended):

    • Wide: [1e-4, 1e-3, 1e-2, 1e-1]
    • Focused: [3e-4, 1e-3, 3e-3]
  2. Solver-specific:

    • SGD: Higher rates (1e-3 to 1e-1)
    • Adam: Lower rates (1e-4 to 1e-2)

Considerations:

  • Network architecture
  • Activation functions
  • Batch size
  • Optimizer choice

Note: Critical parameter for training success

PowerT

[f64, ...]
0.5

Inverse scaling exponents to evaluate:

Search ranges:

  1. Standard range:

    • [0.3, 0.5, 0.7]: Around default
    • [0.2, 0.4, 0.6, 0.8]: Wider search
  2. Theoretical values:

    • 0.5: Standard SGD theory
    • 1.0: Faster decay
    • 0.25: Slower decay

Impact:

  • Learning rate decay speed
  • Convergence behavior
  • Training stability

Note: Only used with 'invscaling' learning rate

MaxIter

[u64, ...]
200

Maximum iteration counts to evaluate:

Search ranges:

  1. Standard scale:

    • Basic: [100, 200, 300]
    • Extended: [200, 400, 600, 800]
  2. Problem-based:

    • Simple: [50, 100, 200]
    • Complex: [500, 1000, 2000]

Selection criteria:

  • Convergence patterns
  • Problem complexity
  • Computational budget
  • Early stopping usage

Note: Higher values needed for larger/complex networks

Shuffle

[bool, ...]
true

Data shuffling options to evaluate:

Values:

  • [true]: Standard shuffling

  • [false]: Ordered processing

  • [true, false]: Test both options

Impact analysis:

  • Training stability
  • Convergence speed
  • Generalization
  • Order sensitivity

Note: Only affects 'sgd' and 'adam' solvers

Random seed for reproducibility:

Controls randomness in:

  • Weight initialization
  • Data shuffling
  • Mini-batch selection
  • Cross-validation splits

Importance:

  • Reproducible results
  • Fair comparisons
  • Debugging
  • Parameter studies

Tol

[f64, ...]
0.0001

Optimization tolerances to evaluate:

Search spaces:

  • Strict: [1e-5, 1e-4, 1e-3]
  • Relaxed: [1e-4, 1e-3, 1e-2]

Values (by precision):

  • High: ≤ 1e-4
  • Medium: 1e-4 to 1e-3
  • Low: ≥ 1e-3

Trade-offs:

  • Convergence precision
  • Training time
  • Solution quality
  • Computational cost

WarmStart

[bool, ...]
true

When set to True, reuse the solution of the previous call to fit as initialization; otherwise, erase the previous solution and start from a fresh initialization.

Values:

  • [true]: Reuse previous solutions

  • [false]: Fresh initialization

  • [true, false]: Test both strategies

Use cases:

  • Transfer learning
  • Incremental training
  • Model refinement
  • Parameter searching

Momentum

[f64, ...]
0.9

Momentum values to evaluate:

Example search ranges:

  • [0.5, 0.7, 0.9]

  • [0.9, 0.95, 0.99]

  • [0.0, 0.5, 0.9, 0.95, 0.99]

  • [0.8, 0.9, 0.95, 0.98]

Impact on training:

  • Convergence speed
  • Oscillation damping
  • Local minima escape
  • Training stability

Note: Only used with 'sgd' solver

true

Nesterov momentum options to evaluate:

Search options:

  • [true]: Use Nesterov's acceleration

  • [false]: Classic momentum

  • [true, false]: Compare both methods

Performance impact:

  • Convergence speed
  • Training stability
  • Optimization accuracy
  • Memory usage

Note: Only relevant when momentum > 0

false

Early stopping configuration:

Control parameters:

  • Validation split size
  • Patience threshold
  • Score monitoring
  • Stopping criteria

Benefits:

  • Prevents overfitting
  • Saves computation
  • Automatic epoch selection
  • Optimal model selection

Note: Requires validation_fraction of training data

Validation set size for early stopping:

Typical ranges:

  • Small: 0.1 (10% validation)
  • Medium: 0.2 (20% validation)
  • Large: 0.3 (30% validation)

Selection criteria:

  • Dataset size
  • Model complexity
  • Validation stability
  • Training needs

Beta1

[f64, ...]
0.9

First moment decay rates for Adam:

Example search spaces:

  • [0.9]: Default value

  • [0.8, 0.9, 0.95]: Around default

  • [0.7, 0.8, 0.9, 0.95]

  • [0.85, 0.9, 0.93, 0.97]

Impact:

  • Gradient averaging
  • Update smoothing
  • Training stability

Note: Only used with 'adam' solver

Beta2

[f64, ...]
0.999

Second moment decay rates for Adam:

Example search spaces:

  • [0.999]: Default value

  • [0.995, 0.999, 0.9999]

  • [0.99, 0.999, 0.9999]

  • [0.995, 0.997, 0.999, 0.9995]

Effects:

  • Learning rate adaptation
  • Update scale control
  • Training robustness

Note: Only used with 'adam' solver

Epsilon

[f64, ...]
0.00000001

Numerical stability constants for Adam:

Search ranges:

  • [1e-8]: Default value

  • [1e-9, 1e-8, 1e-7]

  • [1e-10, 1e-8, 1e-6]

  • [1e-8, 1e-7, 1e-6, 1e-5]

Purpose:

  • Prevents division by zero
  • Controls update magnitude
  • Ensures stability

Note: Only used with 'adam' solver

Early stopping patience parameter:

Common values:

  • Small (5-10): Quick stopping
  • Medium (10-20): Standard patience
  • Large (20+): Extended training

Considerations:

  • Convergence patterns
  • Training stability
  • Time constraints
  • Model complexity

MaxFun

[u32, ...]
15000

Maximum function evaluation limits:

Example search ranges:

  • [5000, 10000, 15000]

  • [10000, 15000, 20000]

  • [15000, 25000, 50000]

  • [10000, 20000, 40000, 80000]

Trade-offs:

  • Solution quality
  • Computation time
  • Convergence guarantee
  • Resource usage

Note: Only applies to 'lbfgs' solver

Accuracy

Performance evaluation metrics for neural network classification:

Selection criteria:

  1. Problem characteristics:

    • Class distribution
    • Error cost structure
    • Business objectives
    • Domain requirements
  2. Model evaluation needs:

    • Model comparison
    • Performance monitoring
    • Validation strategy
    • Threshold tuning
  3. Optimization goals:

    • Early stopping decisions
    • Model selection
    • Hyperparameter tuning
    • Cross-validation

Note: Choice of metric significantly impacts model selection and optimization

Default ~

Uses model's built-in scoring method:

For neural networks:

  • Classification: Accuracy score
  • Probability: Log-loss
  • Multi-class: Weighted average

Best for:

  • Initial evaluation
  • Quick prototyping
  • Balanced datasets
  • Standard problems

Advantages:

  • Computationally efficient
  • Well-understood metric
  • Standard benchmarking
  • Built-in optimization

Limitations:

  • May not reflect business goals
  • Sensitive to class imbalance
  • Doesn't consider prediction confidence
  • May oversimplify performance
Accuracy ~

Standard classification accuracy score:

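Formula: $\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$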

Properties:

  • Range: [0.0, 1.0]
  • Perfect score: 1.0
  • Chance level: 1/n_classes
  • Intuitive interpretation

Best for:

  • Balanced classes
  • Equal error costs
  • Clear right/wrong scenarios
  • Simple evaluation needs

Limitations:

  • Misleading with imbalanced data
  • Ignores prediction confidence
  • Sensitive to class distribution
  • May hide poor minority performance
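
A small worked example makes the imbalance limitation concrete. This is a plain-Python sketch with made-up labels; the function name `accuracy` is illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up imbalanced data: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
score = accuracy(y_true, y_pred)
```

The score is 0.9 even though the single positive case is never detected.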
BalancedAccuracy ~

Class-weighted accuracy score:

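Formula: $\text{BalancedAccuracy} = \frac{1}{M}\sum_{j=1}^{M}\text{Recall}_j$, the mean of per-class recall over the $M$ classes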

Properties:

  • Range: [0.0, 1.0]
  • Perfect score: 1.0
  • Chance level: 1/n_classes (0.5 for binary problems)
  • Class-normalized metric

Best for:

  • Imbalanced datasets
  • When all classes matter equally
  • Minority class importance
  • Fair evaluation needs

Advantages:

  • Handles class imbalance
  • Better minority class evaluation
  • Robust to class distribution
  • Fairer comparison basis

Limitations:

  • May not reflect business priorities
  • Ignores prediction confidence
  • Potentially oversensitive to rare classes
  • Can be unstable with very rare classes
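
Balanced accuracy is the average of per-class recall, which the following plain-Python sketch computes directly. The function name and the data are illustrative assumptions.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Made-up imbalanced data: a model that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
binary_score = balanced_accuracy(y_true, y_pred)

# Three classes, constant prediction: recall is 1 for one class and 0 for the others.
three_class_score = balanced_accuracy([0, 1, 2], [0, 0, 0])
```

The constant predictor scores 0.5 on the binary data (against 0.9 plain accuracy) and 1/3 on the three-class data, which matches the 1/n_classes chance level.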
LogLoss ~

Logarithmic loss (cross-entropy):

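Formula: $\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$, where:

  • $N$ is number of samples
  • $M$ is number of classes
  • $y_{ij}$ is binary indicator of class j for instance i
  • $p_{ij}$ is predicted probability of class j for instance i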

Properties:

  • Range: [0, ∞)
  • Perfect score: 0.0
  • Probability-sensitive
  • Information theoretic basis

Best for:

  • Probability calibration
  • Confidence assessment
  • Risk-sensitive applications
  • Multi-class problems

Advantages:

  • Evaluates probability quality
  • Sensitive to uncertainty
  • Natural for neural networks
  • Differentiable metric

Limitations:

  • Scale depends on number of classes
  • Less intuitive interpretation
  • Sensitive to outliers
  • Requires probability calibration
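
Since the class indicator is 1 only for the true class, log loss reduces to the average negative log of the probability assigned to the true class. The sketch below is plain Python with made-up probabilities; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math

def log_loss(y_true, probabilities, eps=1e-15):
    """y_true holds true class indices; probabilities[i][j] is the predicted
    probability of class j for instance i."""
    total = 0.0
    for true_class, row in zip(y_true, probabilities):
        p = min(max(row[true_class], eps), 1 - eps)  # clip to keep log finite
        total -= math.log(p)
    return total / len(y_true)

confident_right = log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])   # about 0.164
confident_wrong = log_loss([0], [[0.01, 0.99]])                # about 4.61
uniform_3_class = log_loss([0, 1, 2], [[1 / 3] * 3] * 3)       # ln(3), about 1.10
```

A single confident mistake costs far more than a hesitant one, and the score of a uniform guess is ln(n_classes), which is why values are not comparable across problems with different class counts.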
RocAuc ~

Area Under the Receiver Operating Characteristic Curve:

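Definition: $\text{AUC} = \int_0^1 \text{TPR}\; d(\text{FPR})$, the area under the curve of true positive rate against false positive rate across all decision thresholds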

Properties:

  • Range: [0.0, 1.0]
  • Perfect score: 1.0
  • Random baseline: 0.5
  • Threshold-independent

Best for:

  • Binary classification
  • Ranking evaluation
  • Threshold optimization
  • Imbalanced datasets

Advantages:

  • Independent of class distribution
  • Evaluates ranking quality
  • Robust to imbalance
  • Comprehensive evaluation

Limitations:

  • Only for binary/one-vs-rest
  • Computationally intensive
  • Scale sensitivity
  • May average over irrelevant regions

Note: For multi-class, computes weighted average of one-vs-rest AUCs
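
For the binary case, the AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The plain-Python sketch below uses that pairwise view; it is quadratic in the sample count and meant only to illustrate the metric, not to reproduce how this node computes it.

```python
def roc_auc(y_true, scores):
    """Binary ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. Only the ordering of the scores matters, so any monotonic rescaling of the predicted probabilities leaves the value unchanged.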

Split

oneof
DefaultSplit

Standard train-test split configuration optimized for general classification tasks.

Configuration:

  • Test size: 20% (0.2)
  • Random seed: 98
  • Shuffling: Enabled
  • Stratification: Based on target distribution

Advantages:

  • Preserves class distribution
  • Provides reliable validation
  • Suitable for most datasets

Best for:

  • Medium to large datasets
  • Independent observations
  • Initial model evaluation

Splitting uses the ShuffleSplit strategy, or the StratifiedShuffleSplit strategy when the stratified field is true. Note: If shuffle is false then stratified must be false.

Configurable train-test split parameters for specialized requirements. Allows fine-tuning of data division strategy for specific use cases or constraints.

Use cases:

  • Time series data
  • Grouped observations
  • Specific train/test ratios
  • Custom validation schemes

Random seed for reproducible splits. Ensures:

  • Consistent train/test sets
  • Reproducible experiments
  • Comparable model evaluations

Same seed guarantees identical splits across runs.

true

Data shuffling before splitting. Effects:

  • true: Randomizes order, better for i.i.d. data
  • false: Maintains order, important for time series

When to disable:

  • Time dependent data
  • Sequential patterns
  • Grouped observations
0.8

Proportion of data for training. Considerations:

  • Larger (e.g., 0.8-0.9): Better model learning
  • Smaller (e.g., 0.5-0.7): Better validation

Common splits:

  • 0.8: Standard (80/20 split)
  • 0.7: More validation emphasis
  • 0.9: More training emphasis
false

Maintain class distribution in splits. Important when:

  • Classes are imbalanced
  • Small classes present
  • Representative splits needed

Requirements:

  • Classification tasks only
  • Cannot use with shuffle=false
  • Sufficient samples per class
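
The interaction of the seed, shuffle, train size and stratified settings can be sketched in plain Python. The function `train_test_indices` and its rounding rule are illustrative assumptions; the node itself is described as using ShuffleSplit or StratifiedShuffleSplit, whose exact allocation may differ.

```python
import random
from collections import defaultdict

def train_test_indices(labels, train_size=0.8, shuffle=True, stratified=False, seed=98):
    """Return (train, test) index lists for one split."""
    if stratified and not shuffle:
        raise ValueError("stratified=True requires shuffle=True")
    n = len(labels)
    rng = random.Random(seed)  # same seed, same split
    if not stratified:
        indices = list(range(n))
        if shuffle:
            rng.shuffle(indices)
        cut = round(n * train_size)
        return indices[:cut], indices[cut:]
    # Stratified: split each class separately so proportions are preserved.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_size)
        train += members[:cut]
        test += members[cut:]
    return train, test

# Made-up data: 90 samples of class 0 followed by 10 samples of class 1.
labels = [0] * 90 + [1] * 10
train, test = train_test_indices(labels, stratified=True)
ordered_train, ordered_test = train_test_indices(labels, shuffle=False)
```

The stratified split puts 2 of the 10 minority samples in the test set, while the unshuffled split on this ordered data puts all 10 there and none in training.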

Cv

oneof
DefaultCv

Standard cross-validation configuration using stratified 3-fold splitting.

Configuration:

  • Folds: 3
  • Method: StratifiedKFold
  • Stratification: Preserves class proportions

Advantages:

  • Balanced evaluation
  • Reasonable computation time
  • Good for medium-sized datasets

Limitations:

  • May be insufficient for small datasets
  • Higher variance than larger fold counts
  • May miss some data patterns

Configurable stratified k-fold cross-validation for specific validation requirements.

Features:

  • Adjustable fold count with NFolds determining the number of splits.
  • Stratified sampling
  • Preserved class distributions

Use cases:

  • Small datasets (more folds)
  • Large datasets (fewer folds)
  • Detailed model evaluation
  • Robust performance estimation
3

Number of cross-validation folds. Guidelines:

  • 3-5: Large datasets, faster training
  • 5-10: Standard choice, good balance
  • 10+: Small datasets, thorough evaluation

Trade-offs:

  • More folds: Better evaluation, slower training
  • Fewer folds: Faster training, higher variance

Must be at least 2.

K-fold cross-validation without stratification. Divides data into k consecutive folds for iterative validation.

Process:

  • Splits data into k parts of (nearly) equal size
  • Each fold serves as validation once
  • Remaining k-1 folds form training set

Use cases:

  • Regression problems
  • Large, balanced datasets
  • When stratification unnecessary
  • Continuous target variables

Limitations:

  • May not preserve class distributions
  • Less suitable for imbalanced data
  • Can create biased splits with ordered data
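
The process above can be sketched as a generator of consecutive folds. This plain-Python version covers only the unshuffled case, and `k_fold_indices` is an illustrative name rather than this node's API.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    # The first n_samples % n_folds folds get one extra sample.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, n_folds=3))

# Made-up ordered labels: the first test fold contains a single class.
labels = [0] * 5 + [1] * 5
first_test_labels = [labels[i] for i in folds[0][1]]
```

With 10 samples and 3 folds the test folds have sizes 4, 3 and 3, and on label-sorted data the first fold tests only class 0, which is the bias that shuffling or stratification removes.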

Number of folds for cross-validation. Recommended values:

  • 5: Standard choice (default)
  • 3: Large datasets/quick evaluation
  • 10: Thorough evaluation/smaller datasets

Trade-offs:

  • Higher values: More thorough, computationally expensive
  • Lower values: Faster, potentially higher variance

Must be at least 2 for valid cross-validation.

Random seed for fold generation when shuffling. Important for:

  • Reproducible results
  • Consistent fold assignments
  • Benchmark comparisons
  • Debugging and validation

Set specific value for reproducibility across runs.

true

Whether to shuffle data before splitting into folds. Effects:

  • true: Randomized fold composition (recommended)
  • false: Sequential splitting

Enable when:

  • Data may have ordering
  • Better fold independence needed

Disable for:

  • Time series data
  • Ordered observations

Stratified K-fold cross-validation maintaining class proportions across folds.

Key features:

  • Preserves class distribution in each fold
  • Handles imbalanced datasets
  • Ensures representative splits

Best for:

  • Classification problems
  • Imbalanced class distributions
  • When class proportions matter

Requirements:

  • Classification tasks only
  • Sufficient samples per class
  • Categorical target variable
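
One simple way to keep class proportions in every fold is to deal each class's samples out to the folds in turn. The plain-Python sketch below does that; it is an illustration of the idea under that assumption, not the allocation StratifiedKFold actually uses.

```python
from collections import Counter, defaultdict

def stratified_k_fold(labels, n_folds=3):
    """Yield (train, test) index lists with each class spread evenly over the folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)  # round-robin within the class
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

# Made-up data: 12 samples of class 0 and 3 of class 1.
labels = [0] * 12 + [1] * 3
fold_class_counts = [Counter(labels[i] for i in test)
                     for _, test in stratified_k_fold(labels)]
```

Every test fold receives 4 samples of class 0 and 1 of class 1. If the minority class had only 2 samples, one of the 3 folds would contain none, which is why sufficient samples per class are required.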

Number of stratified folds. Typical values:

  • 5: Standard for most cases
  • 3: Quick evaluation/large datasets
  • 10: Detailed evaluation/smaller datasets

Considerations:

  • Must allow sufficient samples per class per fold
  • Balance between stability and computation time
  • Consider smallest class size when choosing

Seed for reproducible stratified splits. Ensures:

  • Consistent fold assignments
  • Reproducible results
  • Comparable experiments
  • Systematic validation

Fixed seed guarantees identical stratified splits.

false

Data shuffling before stratified splitting. Impact:

  • true: Randomizes while maintaining stratification
  • false: Maintains data order within strata

Use cases:

  • true: Independent observations
  • false: Grouped or sequential data

Class proportions maintained regardless of setting.

Random permutation cross-validator with independent sampling.

Characteristics:

  • Random sampling for each split
  • Independent train/test sets
  • More flexible than K-fold
  • Can have overlapping test sets

Advantages:

  • Control over test size
  • Fresh splits each iteration
  • Good for large datasets

Limitations:

  • Some samples might never be tested
  • Others might be tested multiple times
  • No guarantee of complete coverage
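
The coverage limitation follows from drawing every split independently. The plain-Python sketch below counts how often each sample lands in a test set; `shuffle_split` and its rounding of the test size are illustrative assumptions.

```python
import random
from collections import Counter

def shuffle_split(n_samples, n_splits=5, test_size=0.2, seed=0):
    """Yield (train, test) index lists, reshuffling all samples for every split."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)
    if not 0 < n_test < n_samples:
        raise ValueError("test_size leaves an empty train or test set")
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])

splits = list(shuffle_split(20, n_splits=5, test_size=0.2, seed=0))
times_tested = Counter(i for _, test in splits for i in test)
never_tested = [i for i in range(20) if times_tested[i] == 0]
```

Five splits of 4 test samples give 20 test slots for 20 samples, but because the draws are independent, some samples usually appear more than once in `times_tested` while others end up in `never_tested`.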

Number of random splits to perform. Common values:

  • 5: Standard evaluation
  • 10: More thorough assessment
  • 3: Quick estimates

Trade-offs:

  • More splits: Better estimation, longer runtime
  • Fewer splits: Faster, less stable estimates

Balance between computation and stability.

Random seed for reproducible shuffling. Controls:

  • Split randomization
  • Sample selection
  • Result reproducibility

Important for:

  • Debugging
  • Comparative studies
  • Result verification
0.2

Proportion of samples for test set. Common ratios:

  • 0.2: Standard (80/20 split)
  • 0.25: More validation emphasis
  • 0.1: More training data

Considerations:

  • Dataset size
  • Model complexity
  • Validation requirements

It must be between 0.0 and 1.0.

Stratified random permutation cross-validator combining shuffle-split with stratification.

Features:

  • Maintains class proportions
  • Random sampling within strata
  • Independent splits
  • Flexible test size

Ideal for:

  • Imbalanced datasets
  • Large-scale problems
  • When class distributions matter
  • Flexible validation schemes

Number of stratified random splits. Recommended values:

  • 5: Standard evaluation
  • 10: Detailed analysis
  • 3: Quick assessment

Consider:

  • Sample size per class
  • Computational resources
  • Stability requirements

Seed for reproducible stratified sampling. Ensures:

  • Consistent class proportions
  • Reproducible splits
  • Comparable experiments

Critical for:

  • Benchmarking
  • Research studies
  • Quality assurance
0.2

Fraction of samples for stratified test set. Common splits:

  • 0.2: Balanced evaluation
  • 0.3: More thorough testing
  • 0.15: Preserve training size

Consider:

  • Minority class size
  • Overall dataset size
  • Validation objectives

It must be between 0.0 and 1.0.

Time series cross-validator. Provides train/test indices to split time series samples, observed at fixed time intervals, into train and test sets. It is a variation of k-fold: the k-th split returns the first k folds as the train set and the (k+1)-th fold as the test set. Unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Any surplus data is added to the first training partition, which is always used to train the model.

Key features:

  • Maintains temporal dependence
  • Expanding window approach
  • Forward-chaining splits
  • No future data leakage

Use cases:

  • Sequential data
  • Financial forecasting
  • Temporal predictions
  • Time-dependent patterns

Note: Training sets are supersets of previous iterations.

Number of temporal splits. Typical values:

  • 5: Standard forward chaining
  • 3: Limited historical data
  • 10: Long time series

Impact:

  • Affects training window growth
  • Determines validation points
  • Influences computational load

Maximum size of training set. Should be strictly less than the number of samples. Applications:

  • 0: Use all available past data
  • >0: Rolling window of fixed size

Use cases:

  • Limit historical relevance
  • Control computational cost
  • Handle concept drift
  • Memory constraints

Number of samples in each test set. When 0:

  • Auto-calculated as n_samples/(n_splits+1), rounded down
  • Ensures equal-sized test sets

Considerations:

  • Forecast horizon
  • Validation requirements
  • Available future data

Gap

u64
0

Number of samples to exclude from the end of each train set before the test set, leaving a gap between train and test sets. Uses:

  • Avoid data leakage
  • Model forecast lag
  • Buffer periods

Common scenarios:

  • 0: Continuous prediction
  • >0: Forward gap for realistic evaluation
  • Match business forecasting needs
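
The way the number of splits, maximum train size, test size and gap combine can be sketched as follows. This plain-Python version follows the description on this page (0 means "auto" or "no limit"); the function name and argument names are illustrative assumptions.

```python
def time_series_split(n_samples, n_splits=5, max_train_size=0, test_size=0, gap=0):
    """Yield (train, test) index lists for forward-chaining splits."""
    if test_size == 0:
        test_size = n_samples // (n_splits + 1)   # auto: equal-sized test sets
    first_test_start = n_samples - n_splits * test_size
    if test_size < 1 or first_test_start - gap <= 0:
        raise ValueError("not enough samples for this many splits")
    for test_start in range(first_test_start, n_samples, test_size):
        train_end = test_start - gap              # drop `gap` samples before the test set
        train_start = 0
        if 0 < max_train_size < train_end:
            train_start = train_end - max_train_size   # rolling window
        yield (list(range(train_start, train_end)),
               list(range(test_start, test_start + test_size)))

# Expanding window: 13 samples, 3 splits; the surplus sample joins the first train set.
expanding = list(time_series_split(13, n_splits=3))

# Rolling window of at most 4 samples with a 1-sample gap before each test set.
rolling = list(time_series_split(12, n_splits=3, max_train_size=4, gap=1))
```

In the expanding case the train sets cover indices 0-3, 0-6 and 0-9, each a superset of the previous one. In the rolling case the train sets are [0, 1], [1, 2, 3, 4] and [4, 5, 6, 7], and the sample immediately before each test set is never used for training.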