Using preprocessing to make it easier for ML models to consume inputs
Example: Imagine we have a dataset with two features: "Age" and "Income" for predicting credit risk.
Before standardization:

| Age (years) | Income ($) |
| --- | --- |
| 25 | 30,000 |
| 40 | 60,000 |
| 55 | 90,000 |

After standardization:

| Age (standardized) | Income (standardized) |
| --- | --- |
| -1.22 | -1.22 |
| 0.00 | 0.00 |
| 1.22 | 1.22 |
This standardization helps prevent the larger scale of "Income" from dominating the smaller scale of "Age" in the model's calculations.
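The table above can be reproduced with a plain z-score calculation (subtract each column's mean, divide by its standard deviation); a minimal NumPy sketch using the example data:

```python
import numpy as np

# The example dataset: two features with very different scales.
X = np.array([[25, 30_000],
              [40, 60_000],
              [55, 90_000]], dtype=float)

# Standardization (z-score): subtract each column's mean,
# then divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(X_std, 2))
# → [[-1.22 -1.22]
#    [ 0.    0.  ]
#    [ 1.22  1.22]]
```

After standardization both columns have mean 0 and unit variance, so neither feature dominates purely because of its scale.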
Suggestion: If the data contains many outliers, plain standardization is distorted, because the mean and standard deviation it relies on are themselves pulled around by those outliers; a robust alternative such as RobustScaler usually works better in that case.
Common Standardization Classes (from scikit-learn's `sklearn.preprocessing` module):
- StandardScaler
  - Standardizes features by removing the mean and scaling to unit variance
  - Useful when features have different scales and a roughly normal distribution is assumed
- MinMaxScaler
  - Scales features to a fixed range, usually between 0 and 1
  - Useful when you need bounded values or are dealing with sparse data
- RobustScaler
  - Uses statistics that are robust to outliers (the median and interquartile range)
  - Helpful when your data contains many outliers
- Normalizer
  - Scales individual samples (rows) to have unit norm, rather than scaling columns
  - Useful for text classification or clustering
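All four classes share the same `fit_transform` interface; a small sketch applying each one to the Age/Income example data to compare their outputs:

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, Normalizer,
)

X = np.array([[25, 30_000],
              [40, 60_000],
              [55, 90_000]], dtype=float)

# StandardScaler: each column gets zero mean and unit variance.
print(StandardScaler().fit_transform(X))

# MinMaxScaler: each column is rescaled to the [0, 1] range.
print(MinMaxScaler().fit_transform(X))

# RobustScaler: centers on the median and scales by the
# interquartile range, so extreme values have less influence.
print(RobustScaler().fit_transform(X))

# Normalizer: scales each ROW to unit norm -- note this is
# per-sample, unlike the per-column scalers above.
print(Normalizer().fit_transform(X))
```

On this tiny dataset the first three give similar-looking column-wise results, but they diverge once the data is skewed or contains outliers; Normalizer behaves quite differently because it operates on rows, not columns.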