MultiLayerPerceptron / Classifier Layer
Multi-Layer Perceptron Classifier - A neural network architecture for classification tasks. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Mathematical form: ŷ = f(W_L · f(⋯ f(W_1·x + b_1) ⋯) + b_L), where:
- f is the activation function
- W_i are the weight matrices
- b_i are the bias vectors
Key characteristics:
- Deep learning architecture
- Non-linear function approximation
- Gradient-based optimization
- Automatic feature learning
- Universal function approximator
Common applications:
- Image classification
- Pattern recognition
- Financial prediction
- Medical diagnosis
- Speech recognition
Outputs:
- Predicted Table: Input data with predictions
- Validation Results: Cross-validation metrics
- Test Metric: Test set performance
- ROC Curve Data: ROC analysis information
- Confusion Matrix: Classification breakdown
- Feature Importances: Neural network weights
Note: Performance depends heavily on architecture and hyperparameter choices
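As a rough, minimal sketch of the equivalent estimator (assuming the scikit-learn MLPClassifier that these parameters mirror; the dataset here is synthetic):

```python
# Minimal sketch: fitting an MLP classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```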
SelectFeatures
[column, ...] Feature columns for neural network classification:
Requirements:
Data preparation:
- Numerical features
- Standardized/normalized values
- No missing values
- Finite numbers only
Scaling recommendations:
- StandardScaler: Zero mean, unit variance
- MinMaxScaler: [0,1] or [-1,1] range
- RobustScaler: For outlier presence
- Normalize: For uniform feature scales
Feature engineering:
- Polynomial features for non-linearity
- Interaction terms
- Domain-specific transformations
- Dimensionality reduction if needed
Categorical handling:
- One-hot encoding
- Label encoding (ordinal)
- Embedding representations
- Feature hashing
Quality checks:
- Feature correlations
- Distribution analysis
- Outlier detection
- Missing value handling
Best practices:
- Scale features to similar ranges
- Remove highly correlated features
- Handle outliers appropriately
- Consider feature importance
Note: If empty, uses all numeric columns except target
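The scaling recommendations above can be sketched as a pipeline (assuming scikit-learn; the fit call is left commented because the train split comes from a later step):

```python
# Sketch: standardize features before the MLP so all inputs share a similar scale.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(100,), random_state=0))
# model.fit(X_train, y_train)  # scaler is fit on training data only, avoiding leakage
```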
SelectTarget
column Target column for neural network classification:
Requirements:
Data characteristics:
- Categorical labels
- No missing values
- At least two classes
- Properly encoded values
Encoding formats:
- Integer encoding (0 to n_classes-1)
- One-hot encoding for multi-class
- Binary encoding for two classes
- Label encoding for ordinal classes
Class distribution:
- Monitor class balance
- Consider class weights
- Handle imbalanced cases
- Stratification needs
Quality considerations:
- Label consistency
- Class definitions
- Annotation quality
- Error patterns
Model outputs:
- Probability distributions per class
- Predicted class labels
- Confidence scores
- Classification metrics
Best practices:
- Verify label quality
- Check class distributions
- Consider resampling if imbalanced
- Use appropriate evaluation metrics
Note: Critical for defining the classification task
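A small sketch of the target-preparation steps above (the column name "label" is purely illustrative):

```python
# Sketch: encode string class labels as integers and inspect class balance.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"label": ["cat", "dog", "cat", "cat", "bird"]})
y = LabelEncoder().fit_transform(df["label"])        # integers 0..n_classes-1
print(pd.Series(y).value_counts(normalize=True))     # check for class imbalance
```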
Params
oneof Optimized default configuration for Multi-Layer Perceptron:
Architecture:
- Single hidden layer with 100 neurons
- ReLU activation function
- Adam optimizer
Training parameters:
- L2 regularization (alpha): 0.0001
- Learning rate: 0.001
- Max iterations: 200
- Batch size: auto
Best suited for:
- Medium-sized datasets
- General classification tasks
- Initial modeling exploration
- Balanced class problems
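Spelled out as explicit constructor arguments, the defaults listed above correspond roughly to this sketch (assuming the scikit-learn backend):

```python
# Sketch: the optimized defaults listed above, written out explicitly.
from sklearn.neural_network import MLPClassifier

default_mlp = MLPClassifier(
    hidden_layer_sizes=(100,),  # single hidden layer, 100 neurons
    activation="relu",
    solver="adam",
    alpha=1e-4,                 # L2 regularization strength
    learning_rate_init=1e-3,
    max_iter=200,
    batch_size="auto",
)
```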
Fine-tuned configuration for Multi-Layer Perceptron classifier:
Parameter categories:
Architecture:
- Network structure
- Activation functions
- Layer sizes
Optimization:
- Solver selection
- Learning rates
- Momentum settings
Regularization:
- L2 penalty
- Early stopping
- Validation control
Training control:
- Batch processing
- Iteration limits
- Convergence criteria
Note: Parameter interactions significantly impact model performance
HiddenLayerSizes
Number of neurons in hidden layers:
Architecture impact:
- Determines model capacity
- Affects feature extraction
- Controls learning complexity
Selection guide:
- Small (10-50): Simple problems, fast training
- Medium (50-200): Balanced complexity
- Large (200+): Complex patterns, more data
Considerations:
- Data size vs layer size
- Overfitting risk
- Computational cost
- Memory requirements
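For illustration, the size bands above map onto hidden_layer_sizes like this sketch (in scikit-learn a tuple gives one entry per hidden layer; values are illustrative):

```python
# Sketch: hidden layer sizing choices.
from sklearn.neural_network import MLPClassifier

small  = MLPClassifier(hidden_layer_sizes=(50,))           # simple problems, fast training
medium = MLPClassifier(hidden_layer_sizes=(100, 100))      # balanced capacity
large  = MLPClassifier(hidden_layer_sizes=(256, 128, 64))  # complex patterns, more data
```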
Activation
enum Activation function for hidden layers:
Mathematical role:
- Introduces non-linearity in the network
- Transforms weighted sums at each layer
- Maps inputs to bounded/unbounded ranges
- Enables complex pattern learning
Selection criteria:
Problem characteristics:
- Data distribution
- Output range needs
- Learning complexity
Training considerations:
- Gradient behavior
- Convergence speed
- Vanishing/exploding gradients
Computational efficiency:
- Calculation speed
- Memory requirements
- Hardware optimization
Note: Choice significantly impacts model performance and training dynamics
Linear activation function:
Properties:
- No non-linearity
- Gradient always 1.0
- Unbounded output
- Linear transformation
Best for:
- Linear relationships
- Final layers
- Feature passing
- Linear bottlenecks
Limitations:
- Cannot learn non-linear patterns
- Limited representation power
- May cause model underfitting
Sigmoid activation function:
Properties:
- Output range: [0, 1]
- Smooth, continuous
- Saturating gradients
- Probabilistic interpretation
Best for:
- Binary classification
- Probability outputs
- Bounded predictions
- Final layer binary tasks
Limitations:
- Vanishing gradients
- Saturating outputs
- Not zero-centered
Hyperbolic tangent function:
Properties:
- Output range: [-1, 1]
- Zero-centered
- Smooth, continuous
- Stronger gradients than sigmoid
Best for:
- Hidden layers
- Normalized inputs
- When zero-centering matters
- Complex pattern recognition
Limitations:
- Still has vanishing gradients
- Saturating activations
- Computationally more expensive
Rectified Linear Unit:
Properties:
- Output range: [0, ∞)
- Non-saturating
- Computationally efficient
- Sparse activation
Best for:
- Deep networks
- Hidden layers
- Fast training
- Most modern networks
Advantages:
- Reduces vanishing gradients
- Biological plausibility
- Fast convergence
- Sparse representations
Limitations:
- Dying ReLU problem
- Not zero-centered
- Unbounded output
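The four activation options can be compared numerically with a quick sketch (NumPy; input values are illustrative):

```python
# Sketch: identity, logistic, tanh and relu evaluated on the same inputs.
import numpy as np

x = np.linspace(-3, 3, 7)
identity = x
logistic = 1.0 / (1.0 + np.exp(-x))   # saturates toward 0 and 1
tanh     = np.tanh(x)                 # zero-centered, saturates toward -1 and 1
relu     = np.maximum(0.0, x)         # sparse: negative inputs map to exactly 0
print(np.round(relu, 2))
```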
Solver
enum Optimization algorithm for neural network training:
Mathematical role:
- Minimizes loss function
- Updates network weights
- Finds optimal parameters
- Controls convergence behavior
Selection criteria:
Dataset characteristics:
- Size (samples, features)
- Memory constraints
- Training time budget
Optimization needs:
- Convergence speed
- Solution quality
- Local minima handling
Resource considerations:
- Memory usage
- Computational cost
- Hardware acceleration
Note: Choice significantly impacts training efficiency and model performance
Limited-memory BFGS (quasi-Newton method):
Algorithm properties:
- Second-order optimization
- Uses curvature information
- Full batch updates
- Memory-efficient approximation
Best for:
- Small datasets (<1000 samples)
- High-quality solutions
- Memory-constrained settings
- Smooth optimization problems
Advantages:
- Fast convergence
- No learning rate tuning
- Better local minima
- Handles ill-conditioning
Limitations:
- Not suitable for large datasets
- Higher memory per update
- No online/mini-batch learning
- May be slower on simple problems
Stochastic Gradient Descent with momentum:
Algorithm: v ← μ·v − η·∇L(w), w ← w + v, where:
- η is the learning rate
- μ is the momentum coefficient
- v is the velocity
- ∇L(w) is the gradient of the loss
Best for:
- Large datasets
- Online learning
- Simple optimization
- Limited memory settings
Advantages:
- Low memory usage
- Easy to implement
- Works with any batch size
- Escapes local minima
Limitations:
- Requires learning rate tuning
- Sensitive to scaling
- May converge slowly
- Needs momentum tuning
Adaptive Moment Estimation optimizer:
Algorithm: combines RMSprop and momentum:
m_t = β₁·m_{t−1} + (1−β₁)·g_t, v_t = β₂·v_{t−1} + (1−β₂)·g_t², w_t = w_{t−1} − η·m̂_t / (√v̂_t + ε)
Best for:
- Most deep learning tasks
- Large datasets
- Non-stationary objectives
- Sparse gradients
Advantages:
- Adaptive learning rates
- Robust to hyperparameters
- Handles sparse gradients
- Fast convergence
Limitations:
- Memory overhead
- May not generalize as well as SGD
- Can be sensitive to beta parameters
- Higher computational cost
Note: Recommended default choice for most problems
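A simple, hedged heuristic for the solver choice discussed above (the 1000-sample threshold is the rule of thumb from this section, not a hard rule):

```python
# Sketch: choose lbfgs for small datasets, adam otherwise.
def pick_solver(n_samples: int) -> str:
    return "lbfgs" if n_samples < 1000 else "adam"

print(pick_solver(500))     # -> "lbfgs"
print(pick_solver(50_000))  # -> "adam"
```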
Alpha
f64 L2 regularization strength:
Effect: adds an L2 penalty proportional to α·‖W‖² to the loss function
Impact:
- Controls overfitting
- Reduces weight magnitude
- Improves generalization
Typical ranges:
- Weak: 1e-5 to 1e-4
- Medium: 1e-4 to 1e-3
- Strong: 1e-3 to 1e-2
Note: Higher values mean stronger regularization
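One way to probe the ranges above is a small cross-validated sweep, sketched here with synthetic data (expect convergence warnings at modest max_iter):

```python
# Sketch: compare a few L2 strengths by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
for alpha in (1e-5, 1e-4, 1e-3, 1e-2):
    clf = MLPClassifier(alpha=alpha, max_iter=300, random_state=0)
    print(f"alpha={alpha:g}: CV accuracy={cross_val_score(clf, X, y, cv=3).mean():.3f}")
```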
BatchSize
u64 Mini-batch size for gradient updates:
Impact on training:
- Memory usage
- Training speed
- Gradient noise
- Convergence behavior
Selection guide:
- Small (16-32): More noise, better generalization
- Medium (32-256): Balanced performance
- Large (256+): Faster training, stable gradients
Note: Ignored when solver is 'lbfgs'
LearningRate
enum Learning rate schedule for gradient-based optimization:
Mathematical role:
- Controls parameter update step size
- Influences convergence speed
- Balances exploration vs exploitation
- Affects training stability
Selection criteria:
Training dynamics:
- Convergence behavior
- Loss landscape
- Training stability
Problem characteristics:
- Dataset size
- Model complexity
- Optimization difficulty
Computational needs:
- Training time budget
- Resource constraints
- Performance requirements
Note: Critical parameter affecting both training speed and model performance
Fixed learning rate throughout training:
Behavior: η_t = η₀ for all iterations t, where:
- η_t is the learning rate at time t
- η₀ is the initial learning rate (learning_rate_init)
Best for:
- Simple problems
- Well-behaved loss surfaces
- Short training runs
- Initial experimentation
Advantages:
- Simple to implement
- Predictable behavior
- No additional parameters
- Easy to debug
Limitations:
- May converge slowly
- Can miss fine optimization
- Requires careful rate selection
- Not adaptive to progress
Inverse scaling learning rate schedule:
Formula: η_t = η₀ / t^power_t, where:
- η_t is the learning rate at time t
- η₀ is the initial learning rate
- power_t is the decay exponent
- t is the current iteration
Best for:
- Long training runs
- Gradual convergence needs
- Non-stationary problems
- When final precision matters
Advantages:
- Theoretical guarantees
- Automatic rate reduction
- Better final convergence
- Handles non-stationarity
Limitations:
- May decay too quickly
- Sensitive to power parameter
- Fixed decay schedule
- Not adaptive to progress
Adaptive learning rate based on training loss:
Strategy:
- Keeps initial rate while loss decreases
- Decreases rate when loss plateaus
- Divides rate by 5 after stagnation
- Continues until convergence or max_iter
Best for:
- Complex problems
- Unknown loss landscapes
- When optimal rate unknown
- Production training
Advantages:
- Automatic rate adjustment
- Responds to training progress
- Handles varying difficulty
- More robust training
Limitations:
- More complex implementation
- May need early stopping
- Additional computation overhead
- Requires patience parameter
Note: Recommended for most practical applications
LearningRateInit
f64 Initial learning rate:
Impact:
- Controls step size
- Affects convergence speed
- Influences stability
Typical ranges:
- Conservative: 1e-4 to 1e-3
- Standard: 1e-3 to 1e-2
- Aggressive: 1e-2 to 1e-1
Note: Only used with 'sgd' or 'adam' solvers
PowerT
f64 Power for inverse scaling learning rate:
Formula: η_t = η₀ / t^power_t
Typical values:
- 0.5: Standard decay (default)
- < 0.5: Slower decay
- > 0.5: Faster decay
Note: Only used when learning_rate is 'invscaling'
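A quick sketch of how the inverse-scaling schedule decays the step size, following the formula above (eta0 and power_t values are illustrative):

```python
# Sketch: eta_t = eta0 / t**power_t
eta0, power_t = 0.01, 0.5
for t in (1, 10, 100, 1000):
    print(f"iteration {t}: learning rate = {eta0 / t**power_t:.5f}")
```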
MaxIter
u64 Maximum number of training iterations:
Guidelines:
- Small (100-300): Simple problems
- Medium (300-1000): Standard tasks
- Large (1000+): Complex problems
Considerations:
- Training time
- Convergence needs
- Early stopping use
- Problem complexity
Shuffle
bool Whether to shuffle samples in each iteration.
Effects:
- Prevents order bias
- Improves convergence
- Reduces overfitting
- Better generalization
Note: Only affects 'sgd' and 'adam' solvers
RandomState
u64 Determines random number generation for weights and bias initialization, train-test split if early stopping is used, and batch sampling when solver='sgd' or 'adam'.
Controls randomness in:
- Weight initialization
- Data shuffling
- Batch sampling
Important for:
- Reproducibility
- Result comparison
- Debugging
- Validation
Tol
f64 Optimization tolerance:
Convergence criterion:
- Stops when loss improvement < tol
- Affects training duration
- Controls precision
Typical ranges:
- Strict: 1e-5 to 1e-4
- Standard: 1e-4 to 1e-3
- Loose: 1e-3 to 1e-2
WarmStart
bool When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
Applications:
- Transfer learning
- Incremental learning
- Parameter searching
- Model refinement
Benefits:
- Faster convergence
- Better solutions
- Parameter reuse
- Continuous learning
Momentum
f64 Momentum for gradient descent:
Formula: v ← β·v − η·∇L(w), where β is the momentum coefficient
Effects:
- Accelerates convergence
- Reduces oscillation
- Escapes local minima
Typical values:
- Low (0.1-0.5): More stable
- Medium (0.5-0.9): Standard choice
- High (0.9-0.999): Faster convergence
Note: Only used with 'sgd' solver
NesterovsMomentum
bool Whether to use Nesterov's momentum:
Mathematical form:
- Classical: v ← μ·v − η·∇L(w)
- Nesterov: v ← μ·v − η·∇L(w + μ·v)
Advantages:
- Faster convergence
- Better theoretical guarantees
- Improved acceleration
- More stable updates
Used when:
- momentum > 0
- 'sgd' solver
Best when:
- Long training runs
- Smooth loss surface
Note: Often improves standard momentum, especially for deep networks
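A sketch of an SGD configuration using the momentum options described above (hyperparameter values are illustrative):

```python
# Sketch: SGD with Nesterov momentum and an adaptive schedule.
from sklearn.neural_network import MLPClassifier

sgd_mlp = MLPClassifier(solver="sgd", learning_rate="adaptive",
                        learning_rate_init=0.01, momentum=0.9,
                        nesterovs_momentum=True, max_iter=500, random_state=0)
```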
EarlyStopping
bool Enable early stopping based on validation score:
Stopping criterion:
- Monitors validation score
- Stops when no improvement for n_iter_no_change epochs
- Improvement threshold defined by 'tol'
Benefits:
- Prevents overfitting
- Reduces training time
- Automatic model selection
- Optimal epoch determination
Requirements:
- Validation data (split from training)
- Patience parameter (n_iter_no_change)
- Improvement threshold (tol)
- Monitoring metric
Note: Uses validation_fraction of training data for monitoring
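A sketch of the early-stopping setup described above (parameter values are illustrative):

```python
# Sketch: hold out 20% of the training data and stop after 10 stagnant epochs.
from sklearn.neural_network import MLPClassifier

es_mlp = MLPClassifier(early_stopping=True, validation_fraction=0.2,
                       n_iter_no_change=10, tol=1e-4, max_iter=1000)
```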
ValidationFraction
f64 Fraction of training data for early stopping:
Usage:
- Splits training data into train/validation
- Validation set monitors convergence
- Affects early stopping decisions
Selection guide:
- Small (0.1): More training data, less reliable stopping
- Medium (0.2): Balanced split
- Large (0.3): More reliable stopping, less training data
Considerations:
- Dataset size
- Problem complexity
- Validation stability
- Training data needs
Note: Only used when early_stopping is True
Beta1
f64 Exponential decay rate for the first moment estimate in Adam:
Formula: m_t = β₁·m_{t−1} + (1−β₁)·g_t
Properties:
- Controls moving average of gradient
- Affects momentum behavior
- Influences update step size
Typical values:
- 0.9: Default, works well generally
- < 0.9: Less momentum, more responsive
- > 0.9: More momentum, smoother updates
Note: Only used with 'adam' solver
Beta2
f64 Exponential decay rate for the second moment estimate in Adam:
Formula: v_t = β₂·v_{t−1} + (1−β₂)·g_t²
Properties:
- Controls moving average of squared gradient
- Affects learning rate adaptation
- Influences update scale
Typical values:
- 0.999: Default, stable convergence
- < 0.999: Faster adaptation, more variance
- > 0.999: Slower adaptation, more stability
Note: Only used with 'adam' solver
Epsilon
f64 Numerical stability constant for Adam:
Usage in update: w_t = w_{t−1} − η·m̂_t / (√v̂_t + ε)
Purpose:
- Prevents division by zero
- Improves numerical stability
- Controls minimum step size
Typical values:
- 1e-8: Default, works well generally
- 1e-7 to 1e-5: More stability, larger minimum steps
- 1e-10 to 1e-8: More precision, risk of instability
Note: Only used with 'adam' solver
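The beta1/beta2/epsilon formulas above combine into a single Adam step; a NumPy sketch (gradient values are made up):

```python
# Sketch: one Adam parameter update.
import numpy as np

beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3
w, m, v, t = np.zeros(3), np.zeros(3), np.zeros(3), 1
g = np.array([0.1, -0.2, 0.05])                # gradient at step t

m = beta1 * m + (1 - beta1) * g                # first moment estimate
v = beta2 * v + (1 - beta2) * g**2             # second moment estimate
m_hat = m / (1 - beta1**t)                     # bias correction
v_hat = v / (1 - beta2**t)
w -= lr * m_hat / (np.sqrt(v_hat) + eps)       # parameter update
print(w)
```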
NIterNoChange
u64 Maximum iterations without improvement:
Early stopping behavior:
- Monitors score improvement
- Stops if no improvement > tol
- Counts consecutive iterations
Selection guide:
- Small (5-10): Aggressive stopping
- Medium (10-20): Balanced patience
- Large (20+): Conservative stopping
Considerations:
- Training stability
- Convergence patterns
- Time constraints
- Loss landscape
MaxFun
u32 Maximum number of loss function evaluations:
Purpose:
- Limits computational budget
- Controls optimization duration
- Prevents excessive iterations
Typical ranges:
- Small (5000): Quick optimization
- Medium (15000): Standard problems
- Large (50000+): Complex optimization
Note: Only applies to 'lbfgs' solver
Warning: Small values may prevent convergence
Hyperparameter optimization through grid search for Multi-Layer Perceptron:
Search process:
Architecture optimization:
- Network size
- Activation functions
- Layer configuration
Training dynamics:
- Optimization algorithms
- Learning rates
- Momentum parameters
Regularization tuning:
- L2 penalties
- Early stopping
- Validation settings
Computational complexity:
- Total configurations = Product of parameter options
- Time complexity = O(configurations × epochs × samples)
- Memory needs = O(parameters × configurations)
Best practices:
- Start with coarse grid
- Refine promising regions
- Monitor computational costs
- Use domain knowledge
Note: Consider computational budget when defining parameter spaces
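A coarse grid over the parameter groups above might look like this sketch (grid values are illustrative; the fit call is left commented because it can be expensive):

```python
# Sketch: grid search over architecture, regularization and learning rate.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 100)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0),
                      param_grid, cv=3, scoring="accuracy")
# search.fit(X_train, y_train); print(search.best_params_)
```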
HiddenLayerSizes
[u64, ...] Grid of hidden layer sizes to evaluate:
Search strategies:
Linear scale:
- Small: [50, 100, 150]
- Medium: [100, 200, 300]
- Large: [200, 400, 600]
Logarithmic scale:
- Wide range: [10, 100, 1000]
- Balanced: [32, 64, 128, 256]
Selection criteria:
- Data complexity
- Feature dimensionality
- Sample size
- Memory constraints
Note: Larger networks require more data and computation
Activation
[enum, ...] Activation function for hidden layers:
Mathematical role:
- Introduces non-linearity in the network
- Transforms weighted sums at each layer
- Maps inputs to bounded/unbounded ranges
- Enables complex pattern learning
Selection criteria:
Problem characteristics:
- Data distribution
- Output range needs
- Learning complexity
Training considerations:
- Gradient behavior
- Convergence speed
- Vanishing/exploding gradients
Computational efficiency:
- Calculation speed
- Memory requirements
- Hardware optimization
Note: Choice significantly impacts model performance and training dynamics
Linear activation function:
Properties:
- No non-linearity
- Gradient always 1.0
- Unbounded output
- Linear transformation
Best for:
- Linear relationships
- Final layers
- Feature passing
- Linear bottlenecks
Limitations:
- Cannot learn non-linear patterns
- Limited representation power
- May cause model underfitting
Sigmoid activation function:
Properties:
- Output range: [0, 1]
- Smooth, continuous
- Saturating gradients
- Probabilistic interpretation
Best for:
- Binary classification
- Probability outputs
- Bounded predictions
- Final layer binary tasks
Limitations:
- Vanishing gradients
- Saturating outputs
- Not zero-centered
Hyperbolic tangent function:
Properties:
- Output range: [-1, 1]
- Zero-centered
- Smooth, continuous
- Stronger gradients than sigmoid
Best for:
- Hidden layers
- Normalized inputs
- When zero-centering matters
- Complex pattern recognition
Limitations:
- Still has vanishing gradients
- Saturating activations
- Computationally more expensive
Rectified Linear Unit:
Properties:
- Output range: [0, ∞)
- Non-saturating
- Computationally efficient
- Sparse activation
Best for:
- Deep networks
- Hidden layers
- Fast training
- Most modern networks
Advantages:
- Reduces vanishing gradients
- Biological plausibility
- Fast convergence
- Sparse representations
Limitations:
- Dying ReLU problem
- Not zero-centered
- Unbounded output
Solver
[enum, ...] Optimization algorithm for neural network training:
Mathematical role:
- Minimizes loss function
- Updates network weights
- Finds optimal parameters
- Controls convergence behavior
Selection criteria:
Dataset characteristics:
- Size (samples, features)
- Memory constraints
- Training time budget
Optimization needs:
- Convergence speed
- Solution quality
- Local minima handling
Resource considerations:
- Memory usage
- Computational cost
- Hardware acceleration
Note: Choice significantly impacts training efficiency and model performance
Limited-memory BFGS (quasi-Newton method):
Algorithm properties:
- Second-order optimization
- Uses curvature information
- Full batch updates
- Memory-efficient approximation
Best for:
- Small datasets (<1000 samples)
- High-quality solutions
- Memory-constrained settings
- Smooth optimization problems
Advantages:
- Fast convergence
- No learning rate tuning
- Better local minima
- Handles ill-conditioning
Limitations:
- Not suitable for large datasets
- Higher memory per update
- No online/mini-batch learning
- May be slower on simple problems
Stochastic Gradient Descent with momentum:
Algorithm: v ← μ·v − η·∇L(w), w ← w + v, where:
- η is the learning rate
- μ is the momentum coefficient
- v is the velocity
- ∇L(w) is the gradient of the loss
Best for:
- Large datasets
- Online learning
- Simple optimization
- Limited memory settings
Advantages:
- Low memory usage
- Easy to implement
- Works with any batch size
- Escapes local minima
Limitations:
- Requires learning rate tuning
- Sensitive to scaling
- May converge slowly
- Needs momentum tuning
Adaptive Moment Estimation optimizer:
Algorithm: combines RMSprop and momentum:
m_t = β₁·m_{t−1} + (1−β₁)·g_t, v_t = β₂·v_{t−1} + (1−β₂)·g_t², w_t = w_{t−1} − η·m̂_t / (√v̂_t + ε)
Best for:
- Most deep learning tasks
- Large datasets
- Non-stationary objectives
- Sparse gradients
Advantages:
- Adaptive learning rates
- Robust to hyperparameters
- Handles sparse gradients
- Fast convergence
Limitations:
- Memory overhead
- May not generalize as well as SGD
- Can be sensitive to beta parameters
- Higher computational cost
Note: Recommended default choice for most problems
Alpha
[f64, ...] L2 regularization strengths to evaluate:
Search spaces:
Linear scale:
- Fine: [0.0001, 0.0005, 0.001]
- Medium: [0.001, 0.01, 0.1]
Log scale (recommended):
- Wide: [1e-5, 1e-4, 1e-3, 1e-2]
- Focused: [1e-4, 3e-4, 1e-3]
Selection strategy:
- Start with log-spaced values
- Monitor validation curves
- Check for overfitting
- Consider model size
Note: Higher values = stronger regularization
BatchSize
[u64, ...] Mini-batch sizes to evaluate:
Search ranges:
Power of 2 (common):
- Standard: [32, 64, 128, 256]
- Extended: [16, 32, 64, 128, 256, 512]
Linear scale:
- Small: [50, 100, 200]
- Large: [200, 400, 800]
Considerations:
- Memory constraints
- Training stability
- Parallelization
- Hardware optimization
Note: Only relevant for 'sgd' and 'adam' solvers
LearningRate
[enum, ...] Learning rate schedule for gradient-based optimization:
Mathematical role:
- Controls parameter update step size
- Influences convergence speed
- Balances exploration vs exploitation
- Affects training stability
Selection criteria:
Training dynamics:
- Convergence behavior
- Loss landscape
- Training stability
Problem characteristics:
- Dataset size
- Model complexity
- Optimization difficulty
Computational needs:
- Training time budget
- Resource constraints
- Performance requirements
Note: Critical parameter affecting both training speed and model performance
Fixed learning rate throughout training:
Behavior: η_t = η₀ for all iterations t, where:
- η_t is the learning rate at time t
- η₀ is the initial learning rate (learning_rate_init)
Best for:
- Simple problems
- Well-behaved loss surfaces
- Short training runs
- Initial experimentation
Advantages:
- Simple to implement
- Predictable behavior
- No additional parameters
- Easy to debug
Limitations:
- May converge slowly
- Can miss fine optimization
- Requires careful rate selection
- Not adaptive to progress
Inverse scaling learning rate schedule:
Formula: η_t = η₀ / t^power_t, where:
- η_t is the learning rate at time t
- η₀ is the initial learning rate
- power_t is the decay exponent
- t is the current iteration
Best for:
- Long training runs
- Gradual convergence needs
- Non-stationary problems
- When final precision matters
Advantages:
- Theoretical guarantees
- Automatic rate reduction
- Better final convergence
- Handles non-stationarity
Limitations:
- May decay too quickly
- Sensitive to power parameter
- Fixed decay schedule
- Not adaptive to progress
Adaptive learning rate based on training loss:
Strategy:
- Keeps initial rate while loss decreases
- Decreases rate when loss plateaus
- Divides rate by 5 after stagnation
- Continues until convergence or max_iter
Best for:
- Complex problems
- Unknown loss landscapes
- When optimal rate unknown
- Production training
Advantages:
- Automatic rate adjustment
- Responds to training progress
- Handles varying difficulty
- More robust training
Limitations:
- More complex implementation
- May need early stopping
- Additional computation overhead
- Requires patience parameter
Note: Recommended for most practical applications
LearningRateInit
[f64, ...] Initial learning rates to evaluate:
Search spaces:
Log scale (recommended):
- Wide: [1e-4, 1e-3, 1e-2, 1e-1]
- Focused: [3e-4, 1e-3, 3e-3]
Solver-specific:
- SGD: Higher rates (1e-3 to 1e-1)
- Adam: Lower rates (1e-4 to 1e-2)
Considerations:
- Network architecture
- Activation functions
- Batch size
- Optimizer choice
Note: Critical parameter for training success
PowerT
[f64, ...] Inverse scaling exponents to evaluate:
Search ranges:
Standard range:
- [0.3, 0.5, 0.7]: Around default
- [0.2, 0.4, 0.6, 0.8]: Wider search
Theoretical values:
- 0.5: Standard SGD theory
- 1.0: Faster decay
- 0.25: Slower decay
Impact:
- Learning rate decay speed
- Convergence behavior
- Training stability
Note: Only used with 'invscaling' learning rate
MaxIter
[u64, ...] Maximum iteration counts to evaluate:
Search ranges:
Standard scale:
- Basic: [100, 200, 300]
- Extended: [200, 400, 600, 800]
Problem-based:
- Simple: [50, 100, 200]
- Complex: [500, 1000, 2000]
Selection criteria:
- Convergence patterns
- Problem complexity
- Computational budget
- Early stopping usage
Note: Higher values needed for larger/complex networks
Shuffle
[bool, ...] Data shuffling options to evaluate:
Values:
- [true]: Standard shuffling
- [false]: Ordered processing
- [true, false]: Test both options
Impact analysis:
- Training stability
- Convergence speed
- Generalization
- Order sensitivity
Note: Only affects 'sgd' and 'adam' solvers
RandomState
u64 Random seed for reproducibility:
Controls randomness in:
- Weight initialization
- Data shuffling
- Mini-batch selection
- Cross-validation splits
Importance:
- Reproducible results
- Fair comparisons
- Debugging
- Parameter studies
Tol
[f64, ...] Optimization tolerances to evaluate:
Search spaces:
- Strict: [1e-5, 1e-4, 1e-3]
- Relaxed: [1e-4, 1e-3, 1e-2]
Precision levels:
- High: ≤ 1e-4
- Medium: 1e-4 to 1e-3
- Low: ≥ 1e-3
Trade-offs:
- Convergence precision
- Training time
- Solution quality
- Computational cost
WarmStart
[bool, ...] When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
Values:
- [true]: Reuse previous solutions
- [false]: Fresh initialization
- [true, false]: Test both strategies
Use cases:
- Transfer learning
- Incremental training
- Model refinement
- Parameter searching
Momentum
[f64, ...] Momentum values to evaluate:
Search ranges like:
- [0.5, 0.7, 0.9]
- [0.9, 0.95, 0.99]
- [0.0, 0.5, 0.9, 0.95, 0.99]
- [0.8, 0.9, 0.95, 0.98]
Impact on training:
- Convergence speed
- Oscillation damping
- Local minima escape
- Training stability
Note: Only used with 'sgd' solver
NesterovsMomentum
[bool, ...] Nesterov momentum options to evaluate:
Search options:
- [true]: Use Nesterov's acceleration
- [false]: Classic momentum
- [true, false]: Compare both methods
Performance impact:
- Convergence speed
- Training stability
- Optimization accuracy
- Memory usage
Note: Only relevant when momentum > 0
EarlyStopping
bool Early stopping configuration:
Control parameters:
- Validation split size
- Patience threshold
- Score monitoring
- Stopping criteria
Benefits:
- Prevents overfitting
- Saves computation
- Automatic epoch selection
- Optimal model selection
Note: Requires validation_fraction of training data
ValidationFraction
f64 Validation set size for early stopping:
Typical ranges:
- Small: 0.1 (10% validation)
- Medium: 0.2 (20% validation)
- Large: 0.3 (30% validation)
Selection criteria:
- Dataset size
- Model complexity
- Validation stability
- Training needs
Beta1
[f64, ...] First moment decay rates for Adam:
Search spaces like:
- [0.9]: Default value
- [0.8, 0.9, 0.95]: Around default
- [0.7, 0.8, 0.9, 0.95]
- [0.85, 0.9, 0.93, 0.97]
Impact:
- Gradient averaging
- Update smoothing
- Training stability
Note: Only used with 'adam' solver
Beta2
[f64, ...] Second moment decay rates for Adam:
Search spaces like:
- [0.999]: Default value
- [0.995, 0.999, 0.9999]
- [0.99, 0.999, 0.9999]
- [0.995, 0.997, 0.999, 0.9995]
Effects:
- Learning rate adaptation
- Update scale control
- Training robustness
Note: Only used with 'adam' solver
Epsilon
[f64, ...] Numerical stability constants for Adam:
Search ranges:
- [1e-8]: Default value
- [1e-9, 1e-8, 1e-7]
- [1e-10, 1e-8, 1e-6]
- [1e-8, 1e-7, 1e-6, 1e-5]
Purpose:
- Prevents division by zero
- Controls update magnitude
- Ensures stability
Note: Only used with 'adam' solver
NIterNoChange
u64 Early stopping patience parameter:
Common values:
- Small (5-10): Quick stopping
- Medium (10-20): Standard patience
- Large (20+): Extended training
Considerations:
- Convergence patterns
- Training stability
- Time constraints
- Model complexity
MaxFun
[u32, ...] Maximum function evaluation limits:
Search ranges like:
- [5000, 10000, 15000]
- [10000, 15000, 20000]
- [15000, 25000, 50000]
- [10000, 20000, 40000, 80000]
Trade-offs:
- Solution quality
- Computation time
- Convergence guarantee
- Resource usage
Note: Only applies to 'lbfgs' solver
RefitScore
enum Performance evaluation metrics for neural network classification:
Selection criteria:
Problem characteristics:
- Class distribution
- Error cost structure
- Business objectives
- Domain requirements
Model evaluation needs:
- Model comparison
- Performance monitoring
- Validation strategy
- Threshold tuning
Optimization goals:
- Early stopping decisions
- Model selection
- Hyperparameter tuning
- Cross-validation
Note: Choice of metric significantly impacts model selection and optimization
Uses model's built-in scoring method:
For neural networks:
- Classification: Accuracy score
- Probability: Log-loss
- Multi-class: Weighted average
Best for:
- Initial evaluation
- Quick prototyping
- Balanced datasets
- Standard problems
Advantages:
- Computationally efficient
- Well-understood metric
- Standard benchmarking
- Built-in optimization
Limitations:
- May not reflect business goals
- Sensitive to class imbalance
- Doesn't consider prediction confidence
- May oversimplify performance
Standard classification accuracy score:
Formula: accuracy = (number of correct predictions) / (total number of predictions)
Properties:
- Range: [0.0, 1.0]
- Perfect score: 1.0
- Chance level: 1/n_classes
- Intuitive interpretation
Best for:
- Balanced classes
- Equal error costs
- Clear right/wrong scenarios
- Simple evaluation needs
Limitations:
- Misleading with imbalanced data
- Ignores prediction confidence
- Sensitive to class distribution
- May hide poor minority performance
Class-weighted accuracy score:
Formula: balanced accuracy = mean of per-class recall (recall averaged over all classes)
Properties:
- Range: [0.0, 1.0]
- Perfect score: 1.0
- Chance level: 1/n_classes (0.5 for binary)
- Class-normalized metric
Best for:
- Imbalanced datasets
- When all classes matter equally
- Minority class importance
- Fair evaluation needs
Advantages:
- Handles class imbalance
- Better minority class evaluation
- Robust to class distribution
- Fairer comparison basis
Limitations:
- May not reflect business priorities
- Ignores prediction confidence
- Potentially oversensitive to rare classes
- Can be unstable with very rare classes
Logarithmic loss (cross-entropy):
Formula: L = −(1/N) · Σᵢ Σⱼ yᵢⱼ · log(pᵢⱼ), where:
- N is the number of samples
- M is the number of classes
- yᵢⱼ is the binary indicator of class j for instance i
- pᵢⱼ is the predicted probability of class j for instance i
Properties:
- Range: [0, ∞)
- Perfect score: 0.0
- Probability-sensitive
- Information theoretic basis
Best for:
- Probability calibration
- Confidence assessment
- Risk-sensitive applications
- Multi-class problems
Advantages:
- Evaluates probability quality
- Sensitive to uncertainty
- Natural for neural networks
- Differentiable metric
Limitations:
- Scale depends on number of classes
- Less intuitive interpretation
- Sensitive to outliers
- Requires probability calibration
Area Under the Receiver Operating Characteristic Curve:
Definition: area under the curve of true positive rate vs. false positive rate across all decision thresholds
Properties:
- Range: [0.0, 1.0]
- Perfect score: 1.0
- Random baseline: 0.5
- Threshold-independent
Best for:
- Binary classification
- Ranking evaluation
- Threshold optimization
- Imbalanced datasets
Advantages:
- Independent of class distribution
- Evaluates ranking quality
- Robust to imbalance
- Comprehensive evaluation
Limitations:
- Only for binary/one-vs-rest
- Computationally intensive
- Scale sensitivity
- May average over irrelevant regions
Note: For multi-class, computes weighted average of one-vs-rest AUCs
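A small sketch of the metrics above computed on toy predictions (the values are illustrative):

```python
# Sketch: the refit-score options computed with scikit-learn metrics.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             log_loss, roc_auc_score)

y_true  = [0, 0, 1, 1]
y_pred  = [0, 1, 1, 1]
y_proba = [0.1, 0.6, 0.8, 0.9]                    # predicted P(class 1)

print(accuracy_score(y_true, y_pred))             # 0.75
print(balanced_accuracy_score(y_true, y_pred))    # 0.75
print(log_loss(y_true, y_proba))                  # cross-entropy of the probabilities
print(roc_auc_score(y_true, y_proba))             # 1.0 (perfect ranking here)
```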
Split
oneof Standard train-test split configuration optimized for general classification tasks.
Configuration:
- Test size: 20% (0.2)
- Random seed: 98
- Shuffling: Enabled
- Stratification: Based on target distribution
Advantages:
- Preserves class distribution
- Provides reliable validation
- Suitable for most datasets
Best for:
- Medium to large datasets
- Independent observations
- Initial model evaluation
Splitting uses the ShuffleSplit strategy or the StratifiedShuffleSplit strategy, depending on the stratified field.
Note: If shuffle is false then stratified must be false.
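The default split above corresponds roughly to this sketch (synthetic data stands in for the node's input table):

```python
# Sketch: stratified 80/20 shuffle split with the seed listed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=98)
```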
Configurable train-test split parameters for specialized requirements. Allows fine-tuning of data division strategy for specific use cases or constraints.
Use cases:
- Time series data
- Grouped observations
- Specific train/test ratios
- Custom validation schemes
RandomState
u64 Random seed for reproducible splits. Ensures:
- Consistent train/test sets
- Reproducible experiments
- Comparable model evaluations
Same seed guarantees identical splits across runs.
Shuffle
bool Data shuffling before splitting. Effects:
- true: Randomizes order, better for i.i.d. data
- false: Maintains order, important for time series
When to disable:
- Time dependent data
- Sequential patterns
- Grouped observations
TrainSize
f64 Proportion of data for training. Considerations:
- Larger (e.g., 0.8-0.9): Better model learning
- Smaller (e.g., 0.5-0.7): Better validation
Common splits:
- 0.8: Standard (80/20 split)
- 0.7: More validation emphasis
- 0.9: More training emphasis
Stratified
bool Maintain class distribution in splits. Important when:
- Classes are imbalanced
- Small classes present
- Representative splits needed
Requirements:
- Classification tasks only
- Cannot use with shuffle=false
- Sufficient samples per class
Cv
oneof Standard cross-validation configuration using stratified 3-fold splitting.
Configuration:
- Folds: 3
- Method: StratifiedKFold
- Stratification: Preserves class proportions
Advantages:
- Balanced evaluation
- Reasonable computation time
- Good for medium-sized datasets
Limitations:
- May be insufficient for small datasets
- Higher variance than larger fold counts
- May miss some data patterns
Configurable stratified k-fold cross-validation for specific validation requirements.
Features:
- Adjustable fold count, with NFolds determining the number of splits
- Stratified sampling
- Preserved class distributions
Use cases:
- Small datasets (more folds)
- Large datasets (fewer folds)
- Detailed model evaluation
- Robust performance estimation
NFolds
u32 Number of cross-validation folds. Guidelines:
- 3-5: Large datasets, faster training
- 5-10: Standard choice, good balance
- 10+: Small datasets, thorough evaluation
Trade-offs:
- More folds: Better evaluation, slower training
- Fewer folds: Faster training, higher variance
Must be at least 2.
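A sketch of stratified k-fold evaluation as configured above (synthetic data; the fold count is illustrative):

```python
# Sketch: cross-validated accuracy with stratified folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(MLPClassifier(max_iter=300, random_state=0), X, y, cv=cv)
print(scores.mean())
```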
K-fold cross-validation without stratification. Divides data into k consecutive folds for iterative validation.
Process:
- Splits data into k equal parts
- Each fold serves as validation once
- Remaining k-1 folds form training set
Use cases:
- Regression problems
- Large, balanced datasets
- When stratification unnecessary
- Continuous target variables
Limitations:
- May not preserve class distributions
- Less suitable for imbalanced data
- Can create biased splits with ordered data
NSplits
u32 Number of folds for cross-validation. Recommended values:
- 5: Standard choice (default)
- 3: Large datasets/quick evaluation
- 10: Thorough evaluation/smaller datasets
Trade-offs:
- Higher values: More thorough, computationally expensive
- Lower values: Faster, potentially higher variance
Must be at least 2 for valid cross-validation.
RandomState
u64 Random seed for fold generation when shuffling. Important for:
- Reproducible results
- Consistent fold assignments
- Benchmark comparisons
- Debugging and validation
Set specific value for reproducibility across runs.
Shuffle
bool Whether to shuffle data before splitting into folds. Effects:
- true: Randomized fold composition (recommended)
- false: Sequential splitting
Enable when:
- Data may have ordering
- Better fold independence needed
Disable for:
- Time series data
- Ordered observations
Stratified K-fold cross-validation maintaining class proportions across folds.
Key features:
- Preserves class distribution in each fold
- Handles imbalanced datasets
- Ensures representative splits
Best for:
- Classification problems
- Imbalanced class distributions
- When class proportions matter
Requirements:
- Classification tasks only
- Sufficient samples per class
- Categorical target variable
NSplits
u32 Number of stratified folds. Typical values:
- 5: Standard for most cases
- 3: Quick evaluation/large datasets
- 10: Detailed evaluation/smaller datasets
Considerations:
- Must allow sufficient samples per class per fold
- Balance between stability and computation time
- Consider smallest class size when choosing
RandomState
u64 Seed for reproducible stratified splits. Ensures:
- Consistent fold assignments
- Reproducible results
- Comparable experiments
- Systematic validation
Fixed seed guarantees identical stratified splits.
Shuffle
bool Data shuffling before stratified splitting. Impact:
- true: Randomizes while maintaining stratification
- false: Maintains data order within strata
Use cases:
- true: Independent observations
- false: Grouped or sequential data
Class proportions maintained regardless of setting.
Random permutation cross-validator with independent sampling.
Characteristics:
- Random sampling for each split
- Independent train/test sets
- More flexible than K-fold
- Can have overlapping test sets
Advantages:
- Control over test size
- Fresh splits each iteration
- Good for large datasets
Limitations:
- Some samples might never be tested
- Others might be tested multiple times
- No guarantee of complete coverage
NSplits
u32 Number of random splits to perform. Common values:
- 5: Standard evaluation
- 10: More thorough assessment
- 3: Quick estimates
Trade-offs:
- More splits: Better estimation, longer runtime
- Fewer splits: Faster, less stable estimates
Balance between computation and stability.
RandomState
u64 Random seed for reproducible shuffling. Controls:
- Split randomization
- Sample selection
- Result reproducibility
Important for:
- Debugging
- Comparative studies
- Result verification
TestSize
f64 Proportion of samples for test set. Common ratios:
- 0.2: Standard (80/20 split)
- 0.25: More validation emphasis
- 0.1: More training data
Considerations:
- Dataset size
- Model complexity
- Validation requirements
It must be between 0.0 and 1.0.
Stratified random permutation cross-validator combining shuffle-split with stratification.
Features:
- Maintains class proportions
- Random sampling within strata
- Independent splits
- Flexible test size
Ideal for:
- Imbalanced datasets
- Large-scale problems
- When class distributions matter
- Flexible validation schemes
NSplits
u32 Number of stratified random splits. Recommended values:
- 5: Standard evaluation
- 10: Detailed analysis
- 3: Quick assessment
Consider:
- Sample size per class
- Computational resources
- Stability requirements
RandomState
u64 Seed for reproducible stratified sampling. Ensures:
- Consistent class proportions
- Reproducible splits
- Comparable experiments
Critical for:
- Benchmarking
- Research studies
- Quality assurance
TestSize
f64 Fraction of samples for stratified test set. Common splits:
- 0.2: Balanced evaluation
- 0.3: More thorough testing
- 0.15: Preserve training size
Consider:
- Minority class size
- Overall dataset size
- Validation objectives
It must be between 0.0 and 1.0.
Time Series cross-validator. Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. It is a variation of k-fold which returns the first k folds as the train set and the (k+1)-th fold as the test set. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.
Key features:
- Maintains temporal dependence
- Expanding window approach
- Forward-chaining splits
- No future data leakage
Use cases:
- Sequential data
- Financial forecasting
- Temporal predictions
- Time-dependent patterns
Note: Training sets are supersets of previous iterations.
NSplits
u32 Number of temporal splits. Typical values:
- 5: Standard forward chaining
- 3: Limited historical data
- 10: Long time series
Impact:
- Affects training window growth
- Determines validation points
- Influences computational load
MaxTrainSize
u64 Maximum size of training set. Should be strictly less than the number of samples. Applications:
- 0: Use all available past data
- >0: Rolling window of fixed size
Use cases:
- Limit historical relevance
- Control computational cost
- Handle concept drift
- Memory constraints
TestSize
u64 Number of samples in each test set. When 0:
- Auto-calculated as n_samples/(n_splits+1)
- Ensures equal-sized test sets
Considerations:
- Forecast horizon
- Validation requirements
- Available future data
Gap
u64 Number of samples to exclude from the end of each train set before the test set (the gap between train and test sets). Uses:
- Avoid data leakage
- Model forecast lag
- Buffer periods
Common scenarios:
- 0: Continuous prediction
- >0: Forward gap for realistic evaluation
- Match business forecasting needs
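A sketch of forward-chaining splits with a gap, as described above (20 time-ordered samples; split sizes are illustrative):

```python
# Sketch: TimeSeriesSplit with a one-sample gap between train and test.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)                 # 20 samples in time order
tscv = TimeSeriesSplit(n_splits=3, test_size=4, gap=1)
for train_idx, test_idx in tscv.split(X):
    print("train ends at", train_idx[-1], "| test:", test_idx)
```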