Using preprocessing to make it easier for ML models to consume inputs
Example: Imagine we have a dataset with two features: "Age" and "Income" for predicting credit risk.
Before standardization:

| Age (years) | Income ($) |
| --- | --- |
| 25 | 30,000 |
| 40 | 60,000 |
| 55 | 90,000 |

After standardization:

| Age (standardized) | Income (standardized) |
| --- | --- |
| -1.22 | -1.22 |
| 0.00 | 0.00 |
| 1.22 | 1.22 |
This standardization helps prevent the larger scale of "Income" from dominating the smaller scale of "Age" in the model's calculations.
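The table above can be reproduced with a plain z-score calculation (subtract each column's mean, divide by its standard deviation); a minimal NumPy sketch using the example data:

```python
import numpy as np

# The example dataset: two features with very different scales.
X = np.array([[25, 30_000],
              [40, 60_000],
              [55, 90_000]], dtype=float)

# Standardization (z-score): subtract each column's mean,
# then divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(X_std, 2))
# → [[-1.22 -1.22]
#    [ 0.    0.  ]
#    [ 1.22  1.22]]
```

After standardization both columns have mean 0 and unit variance, so neither feature dominates purely because of its scale.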
Suggestion: If the data contains many outliers, plain standardization is distorted, because the mean and standard deviation it relies on are themselves pulled around by those outliers; a robust alternative such as RobustScaler usually works better in that case.
Common Standardization Classes (from scikit-learn's `sklearn.preprocessing` module):
- StandardScaler
  - Standardizes features by removing the mean and scaling to unit variance
  - Useful when features have different scales and a roughly normal distribution is assumed
- MinMaxScaler
  - Scales features to a fixed range, usually between 0 and 1
  - Useful when you need bounded values or are dealing with sparse data
- RobustScaler
  - Uses statistics that are robust to outliers (the median and interquartile range)
  - Helpful when your data contains many outliers
- Normalizer
  - Scales individual samples (rows) to have unit norm, rather than scaling columns
  - Useful for text classification or clustering
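All four classes share the same `fit_transform` interface; a small sketch applying each one to the Age/Income example data to compare their outputs:

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, Normalizer,
)

X = np.array([[25, 30_000],
              [40, 60_000],
              [55, 90_000]], dtype=float)

# StandardScaler: each column gets zero mean and unit variance.
print(StandardScaler().fit_transform(X))

# MinMaxScaler: each column is rescaled to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# RobustScaler: centers on the median and scales by the
# interquartile range, so extreme values have less influence.
print(RobustScaler().fit_transform(X))

# Normalizer: scales each ROW to unit norm -- note this is
# per-sample, unlike the per-column scalers above.
print(Normalizer().fit_transform(X))
```

On this tiny dataset the first three give similar-looking column-wise results, but they diverge once the data is skewed or contains outliers; Normalizer behaves quite differently because it operates on rows, not columns.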