lecture_05 – DataSci-223

Classification: Putting a Label on Things

hw05 #FIXME

References

Books

Python Data Science Handbook - Jake VanderPlas (free online)
Hands-On Machine Learning - Aurélien Géron
Introduction to Statistical Learning - James, Witten, Hastie, Tibshirani (free PDF)
Interpretable Machine Learning - Christoph Molnar (free online)

Tutorials & Articles

A Visual Introduction to Machine Learning
ROC Curves Explained - Google ML Crash Course
How to Evaluate Classification Models - edlitera
ML Models for Classification - tour of methods
Logistic Regression using Gradient Descent - Kaggle
XGBoost vs Random Forest - Geek Culture
Interpretable ML with XGBoost - Towards Data Science
Unsupervised Learning: Algorithms and Examples - AltexSoft
Environment and Distribution Shift - Dive into Deep Learning

Health Data Examples

Cancer Classification (EDA, PCA, Random Forest) - Kaggle
XGBoost, Random Forest, and Nomograph for Disease Severity Prediction - Frontiers
Prediction Method for Hypertension - Diagnostics Journal

Crash Course in Classification

The building blocks of ML are algorithms for regression and classification:

Regression: predicting continuous quantities
Classification: predicting discrete class labels (categories)

Classification Methods

Classification algorithms learn to assign labels to data points based on their features. In health data, this might mean predicting whether a patient has a disease based on lab results.

Reference Card: Common Classification Methods

Method	Description	Strengths
Logistic Regression	Linear model for binary outcomes	Easy to interpret, fast
Decision Trees	Tree structure, splits data by feature values	Intuitive but can overfit
Random Forest	Many decision trees combined	More robust, less overfitting
Support Vector Machines (SVM)	Finds the best boundary between classes	Good for complex data
Naive Bayes	Probabilistic, assumes features are independent	Fast and simple
Neural Networks	Layers of nodes, can model complex patterns	Powerful but less interpretable

Model Evaluation

There are many more classification approaches than data scientists, so choosing the best one for your application can be daunting. Thankfully, all of them output predicted classes for each data point. We can use this similarity to define objective performance criteria based on how often the predicted class matches the underlying truth.

I get in trouble with the data science police if I don’t include something about confusion matrices:

Here TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives.

Precision (Positive Predictive Value) = \(\frac{TP}{TP + FP}\)
- Of the cases predicted positive, the fraction that are actually positive
Recall (Sensitivity, True Positive Rate) = \(\frac{TP}{TP + FN}\)
- Of the actual positives, the fraction the model catches
Accuracy = \(\frac{TP + TN}{TP + FP + FN + TN}\)
- Fraction of all predictions that are correct; can mislead when classes are imbalanced
F1 score = \(2 \times \frac{Precision \times Recall}{Precision + Recall}\)
- Harmonic mean of precision and recall, a single balanced score
Specificity (Selectivity, True Negative Rate) = \(\frac{TN}{TN + FP}\)
- The counterpart of recall for negatives: of the actual negatives, the fraction correctly identified
Miss Rate (False Negative Rate) = \(\frac{FN}{TP + FN}\)
- Fraction of actual positives the model misses (1 - recall); a good measure when missing a positive has a high cost
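
As a quick sanity check, the formulas above can be computed by hand from the four confusion-matrix counts and compared with scikit-learn's built-in scorers. This is a minimal sketch; the labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() unpacks the matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
miss_rate = fn / (tp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
print(f"f1={f1:.2f} specificity={specificity:.2f} miss_rate={miss_rate:.2f}")

# sklearn's scorers should agree with the hand-computed values
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Note that miss rate and recall always sum to 1, since they split the actual positives between them.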

Reference Card: `confusion_matrix`

Component	Details
Function	`sklearn.metrics.confusion_matrix()`
Purpose	Compute confusion matrix to evaluate classification accuracy
Key Parameters	• `y_true`: Ground truth target values • `y_pred`: Estimated targets from classifier • `labels`: List of labels to index the matrix • `normalize`: Normalize over ‘true’, ‘pred’, ‘all’, or None
Returns	Array where rows are actual, columns are predicted

Code Snippet: Confusion Matrix

from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0]  # Actual labels
y_pred = [1, 0, 0, 1, 1]  # Predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)
## Output interpretation (for binary case):
## [[TN, FP],
##  [FN, TP]]

Classification Report

The classification_report function provides a comprehensive summary of precision, recall, and F1-score for each class in a single call—essential for evaluating multi-class models.

Reference Card: `classification_report`

Component	Details
Function	`sklearn.metrics.classification_report()`
Purpose	Build a text report showing main classification metrics per class
Key Parameters	• `y_true`: Ground truth target values • `y_pred`: Estimated targets from classifier • `target_names`: Display names for classes • `output_dict`: Return dict instead of string
Returns	String (or dict) with precision, recall, F1-score, support per class

Code Snippet: Classification Report

from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

print(classification_report(y_true, y_pred, target_names=['No Disease', 'Mild', 'Severe']))
##               precision    recall  f1-score   support
## 
##   No Disease       0.50      0.50      0.50         2
##         Mild       0.67      0.67      0.67         3
##       Severe       1.00      1.00      1.00         2
## 
##     accuracy                           0.71         7
##    macro avg       0.72      0.72      0.72         7
## weighted avg       0.71      0.71      0.71         7

ROC Curve and AUC

An ROC curve (Receiver Operating Characteristic curve) shows how a classifier’s performance changes as you vary the threshold for predicting “positive.” It plots:

True Positive Rate (TPR) (a.k.a. recall): How many actual positives did we catch?
- TPR (Recall): \(TPR = \frac{TP}{TP + FN}\)
False Positive Rate (FPR): How many actual negatives did we incorrectly call positive?
- FPR: \(FPR = \frac{FP}{FP + TN}\)

Reference Card: `roc_curve` and `roc_auc_score`

Component	Details
`roc_curve()`	Returns FPR, TPR, and thresholds for plotting
`roc_auc_score()`	Returns the Area Under the ROC Curve
Key Parameters	• `y_true`: True binary labels • `y_score`: Probability estimates (use `predict_proba()[:, 1]`)
Interpretation	AUC = 0.5 is random guessing; AUC = 1.0 is perfect

Code Snippet: ROC and AUC

from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # Model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc_score = roc_auc_score(y_true, y_scores)
print(f"AUC Score: {auc_score}")  # Output: AUC Score: 0.75

AUC is desirable for two reasons:

AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats:

Scale invariance is not always desirable. Sometimes we need well calibrated probability outputs, and AUC won’t tell us about that.
Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, AUC isn’t the right metric.
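
The scale-invariance caveat can be made concrete with a small sketch. It reuses the scores from the ROC snippet, shrinks them all by an arbitrary factor of 10 (an assumption chosen only for illustration), and compares AUC with the Brier score, one common measure of how close predicted probabilities are to the outcomes.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # same scores as the ROC snippet

# Shrink every score by 10x: the ranking is unchanged, the probabilities are not
y_shrunk = y_scores / 10

auc_original = roc_auc_score(y_true, y_scores)
auc_shrunk = roc_auc_score(y_true, y_shrunk)

# Brier score = mean squared gap between predicted probability and outcome (lower is better)
brier_original = brier_score_loss(y_true, y_scores)
brier_shrunk = brier_score_loss(y_true, y_shrunk)

print(f"AUC:   original={auc_original:.2f}  shrunk={auc_shrunk:.2f}")
print(f"Brier: original={brier_original:.3f}  shrunk={brier_shrunk:.3f}")
```

AUC is identical for both sets of scores because only the ordering matters, while the Brier score gets worse for the shrunken scores. If you need trustworthy probabilities, check a calibration-sensitive metric alongside AUC.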

Supervised vs. Unsupervised

There are two(-ish) overarching categories of classification algorithms: supervised and unsupervised. There are many possible approaches in each category, and some that work well in both (deep learning, for example).

Supervised - uses labeled datasets with known classes for the data points
Unsupervised - uses unlabeled data to uncover organizational patterns
Semi-supervised - a small labeled subset is used to learn relevant features, while the larger unlabeled remainder helps amplify that signal; e.g., medical images (x-ray, CT)

Supervised Models

To fairly evaluate each model, we must test its performance on different data than it was trained on. So we split our dataset into three partitions: train, validation, and test:

Train - data used to fit the model(s)
Validation - data used to tune hyperparameters, compare models, and select the best one
Test - final holdout data, touched only once to evaluate the selected model

A note on terminology: The standard convention is train/validation/test, where validation is used for model development and test is the final holdout. However, sklearn’s function is called train_test_split() and cross-validation literature often refers to “test folds”—so you’ll see some inconsistency in practice. We’ll follow the standard: validation for tuning/selection, test for final evaluation.

Reference Card: `train_test_split`

Component	Details
Function	`sklearn.model_selection.train_test_split()`
Purpose	Split arrays into random train and test subsets
Key Parameters	• `test_size`: Proportion for test (default 0.25) • `random_state`: Seed for reproducibility • `stratify`: Preserve class ratios in splits (pass `y` for classification)
Returns	`X_train, X_test, y_train, y_test`

Tip: Always use stratify=y for classification tasks—especially with imbalanced classes—to ensure both train and test sets have similar class distributions.

Code Snippet: Train/Test Split

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
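
The snippet above makes a single two-way split. One way to get the three partitions described earlier is to call train_test_split twice: first carve off the test set, then split the remainder into train and validation. This is a sketch on synthetic data; the 60/20/20 proportions are an assumption, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 rows, 20% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# Step 1: hold out 20% as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split the remaining 80% into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positives={labels.sum()}")
```

Because both calls use stratify, each partition keeps roughly the original 20% positive rate.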

Cross-Validation for Model Comparison

A single train/test split can be misleading if the split happens to be particularly easy or hard. Cross-validation provides more robust performance estimates by training and testing on multiple different splits of the data.

K-Fold Cross-Validation:

Split data into k equal parts (folds)
Train on k-1 folds, test on the remaining fold
Repeat k times, each fold serving as test once
Average the results for final performance estimate

Reference Card: `cross_val_score`

Component	Details
Function	`sklearn.model_selection.cross_val_score()`
Purpose	Evaluate model with cross-validation, returning scores for each fold
Key Parameters	• `estimator`: Model to evaluate • `X, y`: Features and labels • `cv`: Number of folds or CV splitter • `scoring`: Metric to use (‘accuracy’, ‘f1’, ‘roc_auc’)
Returns	Array of scores, one per fold

Code Snippet: Cross-Validation

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)

model = RandomForestClassifier(n_estimators=50, random_state=42)

## Simple 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1 scores per fold: {scores}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Reference Card: `StratifiedKFold`

Component	Details
Function	`sklearn.model_selection.StratifiedKFold()`
Purpose	K-fold iterator that preserves class distribution in each fold
Key Parameters	• `n_splits`: Number of folds (default 5) • `shuffle`: Shuffle data before splitting • `random_state`: Seed for reproducibility
Use With	Pass to `cross_val_score(cv=...)` or iterate manually

Code Snippet: StratifiedKFold

from sklearn.model_selection import StratifiedKFold

## Create a stratified k-fold splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

## Iterate over folds manually (useful for custom training loops)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y[train_idx], y[val_idx]
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}")

Model Comparison Workflow

Once you have evaluation metrics and cross-validation set up, the typical workflow for comparing models is:

Define candidate models - pick \(N\) (2-4?) algorithms appropriate for your data
Compare on the same metric - F1, AUC, or whatever matters for your problem. Choose a clear criterion for establishing a winner ahead of time, ideally a single “champion” metric, but it could also be a more complex ruleset (e.g., best on metric A with metric B > minimum)
Use the same data splits - every model sees identical folds, so score differences reflect the models rather than the luck of the split
For each fold (split n of k):
1. (optional) Hyperparameter tuning - vary model settings (e.g., tree depth)
2. Train each model on identical training sets
3. Evaluate each model on identical validation sets
4. Record performance metrics for this fold
Compare models and select best - usually best mean score for champion metric, but watch out for models with high variance across folds
Retrain selected model on full train+validation data
Final evaluation - evaluate the chosen model on the held-out test set (touched only once)

flowchart TB
    A[Setup: Define models, metric, hold out test, split into k folds]

    A --> F1 & F2 & Fdots & Fk

    subgraph FoldRow[" "]
        direction LR
        F1[Fold 1]
        F2[Fold 2]
        Fdots[...]
        Fk[Fold k]
    end

    F1 --> LR & RF & XGB

    subgraph Fold1Box[" "]
        LR[Logistic Regression] --> LR1[Train] --> LR2[Eval] --> LR3[Record]
        RF[Random Forest] --> RF1[Train] --> RF2[Eval] --> RF3[Record]
        XGB[XGBoost] --> XGB1[Train] --> XGB2[Eval] --> XGB3[Record]
    end

    F2 -.-> R2[...]
    Fdots -.-> Rdots[...]
    Fk -.-> Rk[...]

    LR3 & RF3 & XGB3 --> Avg[Average scores per model]
    R2 & Rdots & Rk -.-> Avg
    Avg --> Select[Select best model]
    Select --> Retrain[Retrain best on ALL train+val]
    Retrain --> Final[Evaluate on test set]

    style FoldRow fill:none,stroke:none

Code Snippet: Comparing Multiple Models

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

## Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(n_estimators=100, verbosity=0)
}

## Use same CV splitter for fair comparison
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

## Compare models
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
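
The comparison loop above covers the cross-validation step. The remaining workflow steps (hold out a test set first, retrain the winner, evaluate once) might look like the sketch below. It assumes synthetic data from make_classification and uses only two scikit-learn models to stay self-contained; the cross-validation folds play the role of the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption: replace with your own features and labels)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out the test set BEFORE any model comparison
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare models on identical folds of the development (train+validation) data
mean_f1 = {
    name: cross_val_score(model, X_dev, y_dev, cv=cv, scoring="f1").mean()
    for name, model in models.items()
}
best_name = max(mean_f1, key=mean_f1.get)

# Retrain the winner on all development data, then touch the test set once
best_model = models[best_name].fit(X_dev, y_dev)
test_f1 = f1_score(y_test, best_model.predict(X_test))
print(f"Selected: {best_name} (CV F1 = {mean_f1[best_name]:.3f}), test F1 = {test_f1:.3f}")
```

The test score is reported, not used for further tuning. Going back to change the model after seeing it turns the test set into another validation set.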

LIVE DEMO

Quick Supervised Model Overview

Let’s look at a few tools that you should get a lot of use out of:

Logistic Regression shouldn’t be overlooked! It’s not as new as some other models, but it’s simple and works.
Random Forest is an ensemble model that makes many decision trees using bagging, then takes a simple vote across them to assign a class
XGBoost is another ensemble and arguably the most widely used (and useful) algorithm in tabular ML (it can do regression, classification, and julienne fries!)
Deep Learning uses artificial neural networks with multiple layers to learn complex patterns from data. These models have performed well in a variety of tasks: image recognition, speech recognition, and natural language processing.

Deep Learning models may also be used in unsupervised settings

Logistic Regression

Logistic regression works similarly to linear regression, but it passes the linear combination of features through a sigmoid function, which squeezes our straight line into an S-curve bounded between 0 and 1. The output can be read as the predicted probability of the positive class. Note that the model is still “fit” by training the linear terms; they just sit inside the sigmoid.

Additionally, it uses log loss (also called cross-entropy loss: a cost that measures how far predicted probabilities are from the true labels and penalizes confident wrong predictions more than uncertain ones) in place of our usual mean-squared error cost function. Log loss is convex in the model weights, so gradient descent can find the best weights without getting stuck in local minima.
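
To make the sigmoid and log loss concrete, here is a small NumPy sketch. The weights, intercept, and input are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_single(y, p):
    """Log loss for one example with true label y (0 or 1) and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical weights and intercept (not fitted to any data)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b   # the linear part, as in linear regression
p = sigmoid(z)  # the probability of the positive class
print(f"linear score z = {z:.2f}, predicted probability p = {p:.3f}")

# For a true label of 1, compare three predictions
loss_confident_wrong = log_loss_single(1, 0.01)
loss_uncertain_wrong = log_loss_single(1, 0.40)
loss_confident_right = log_loss_single(1, 0.90)
print(f"p=0.01: {loss_confident_wrong:.2f}  p=0.40: {loss_uncertain_wrong:.2f}  p=0.90: {loss_confident_right:.2f}")
```

The confidently wrong prediction costs several times more than the uncertain one, which is the behavior described above.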

Reference Card: `LogisticRegression`

Component	Details
Function	`sklearn.linear_model.LogisticRegression()`
Purpose	Linear model for classification (binary or multinomial)
Key Parameters	• `penalty`: Regularization (‘l2’, ‘l1’, ‘elasticnet’) • `C`: Inverse regularization strength (smaller = stronger) • `solver`: Optimization algorithm • `max_iter`: Max iterations for convergence
Key Methods	`.fit()`, `.predict()`, `.predict_proba()`

Code Snippet: Logistic Regression

from sklearn.linear_model import LogisticRegression
X = [[1, 2], [2, 3], [3, 4]]
y = [0, 1, 0]
model = LogisticRegression().fit(X, y)
print(model.predict([[2, 2]]))  # Predicts class label

Random Forest

Each of the steps can be tweaked, but the general flow goes:

Bagging (Bootstrap AGGregating) - create k random samples (with replacement) from the dataset
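Grow trees - an individual decision tree is grown on each sample by choosing the best feature and cutpoint at each split to separate the classes (each split considers only a random subset of the features, see `max_features`)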
Classify - instances are run through all trees and assigned a class by majority vote
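
The three steps above can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how RandomForestClassifier is implemented internally; the tree count (25) and the dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bagging: draw a bootstrap sample (same size as the data, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Grow a tree; max_features="sqrt" limits the features considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Classify: every tree votes, and the majority class wins
votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)

train_acc = (majority == y).mean()
print(f"Majority-vote accuracy on the training data: {train_acc:.3f}")
```

Accuracy here is measured on the training data only to show the vote mechanics; it says nothing about generalization.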

Reference Card: `RandomForestClassifier`

Component	Details
Function	`sklearn.ensemble.RandomForestClassifier()`
Purpose	Ensemble of decision trees for classification
Key Parameters	• `n_estimators`: Number of trees (default 100) • `max_depth`: Maximum tree depth • `min_samples_split`: Min samples to split a node • `max_features`: Features to consider per split
Attributes	`.feature_importances_` - importance of each feature

Code Snippet: Random Forest

from sklearn.ensemble import RandomForestClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10).fit(X, y)
print(model.predict([[2, 2]]))

XGBoost

XGBoost stands for Extreme Gradient Boosting. Like other tree algorithms, XGBoost considers each instance with a series of if statements, resulting in a leaf with associated class assignment scores. Where XGBoost differs is that it uses gradient boosting to focus on weak-performing areas of the previous tree.

Boosting - sequentially choosing models by minimizing errors from previous models while increasing the influence of high-performing models; i.e., each model tries to improve where the last was wrong
Gradient boosting - a stagewise additive algorithm sequentially adding trees to improve performance measured by a loss function until some threshold is met. It’s a greedy algorithm prone to overfitting but often proves useful when focused on poor-performing areas
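
The “each tree improves where the last was wrong” idea can be sketched in a few lines. This is a deliberately simplified gradient boosting loop for binary log loss, assuming synthetic data, a fixed learning rate of 0.5, and 20 rounds. It is not XGBoost's actual algorithm, which adds regularization and second-order information among other things.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic data for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

learning_rate = 0.5
F = np.zeros(len(y))  # running raw score (log-odds) for every training example
losses = [log_loss(y, sigmoid(F))]

for _ in range(20):
    # Negative gradient of log loss with respect to F: where is the model still wrong?
    residual = y - sigmoid(F)
    # Fit a small tree to those residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a shrunken copy of the tree's correction to the running score
    F += learning_rate * tree.predict(X)
    losses.append(log_loss(y, sigmoid(F)))

print(f"Training log loss: start={losses[0]:.3f}, after 20 rounds={losses[-1]:.3f}")
```

Training loss keeps falling as trees are added, which is exactly why boosting can overfit: libraries such as XGBoost rely on shrinkage, depth limits, and early stopping to keep that in check.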

Reference Card: `XGBClassifier`

Component	Details
Function	`xgboost.XGBClassifier()`
Purpose	Gradient boosting implementation for classification
Key Parameters	• `n_estimators`: Number of boosting rounds • `learning_rate`: Step size shrinkage (eta) • `max_depth`: Maximum tree depth • `subsample`: Training instance subsample ratio
Install	`pip install xgboost`

Code Snippet: XGBoost

import xgboost as xgb

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = xgb.XGBClassifier(n_estimators=10).fit(X, y)
print(model.predict([[2, 2]]))

LightGBM: A Faster Alternative

LightGBM (Light Gradient Boosting Machine) is Microsoft’s gradient boosting framework, optimized for speed and memory efficiency. It’s often faster than XGBoost, especially on large datasets.

Key differences from XGBoost:

Leaf-wise tree growth - grows trees by choosing the leaf with max delta loss, rather than level-wise. Faster but can overfit on small datasets
Histogram-based splitting - bins continuous features into discrete bins for faster training
Native categorical support - handles categorical features directly without one-hot encoding

Reference Card: `LGBMClassifier`

Component	Details
Function	`lightgbm.LGBMClassifier()`
Purpose	Fast gradient boosting for classification
Key Parameters	• `n_estimators`: Number of boosting rounds • `learning_rate`: Step size shrinkage • `num_leaves`: Max number of leaves per tree • `max_depth`: Limit tree depth (-1 for no limit)
Install	`pip install lightgbm`

Code Snippet: LightGBM

from lightgbm import LGBMClassifier

X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
model = LGBMClassifier(n_estimators=10, verbosity=-1).fit(X, y)
print(model.predict([[2, 2]]))

When to use which?

XGBoost - well-established, extensive documentation, Kaggle competitions
LightGBM - faster training, lower memory, good for large datasets
Both are excellent choices for tabular data classification

Deep Learning

Deep learning is a subfield of machine learning that uses artificial neural networks with multiple layers to learn complex patterns from data. These models use back-propagation to adjust the weights in each layer during training, allowing them to model very large and complex datasets.

Deep learning models are especially useful for handling large datasets with high dimensionality, and they can be used for both supervised and unsupervised learning tasks. However, they often require a large amount of data and computation power to train effectively.

Artificial neural networks - a computational model inspired by biological neural networks
Deep neural networks - an ANN with more than one hidden layer
Convolutional neural networks - designed for image and video recognition
Recurrent neural networks - designed for sequence data

More on neural networks later in the course…

Reference Card: Keras Sequential Model

Component	Details
`Sequential()`	Linear stack of neural network layers
`Dense(units, activation)`	Fully connected layer
`model.compile()`	Configure for training (optimizer, loss, metrics)
Common Activations	‘relu’, ‘sigmoid’, ‘softmax’, ‘tanh’

Code Snippet: Simple Neural Network

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(2,)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=10)

Unsupervised Models (Further Reading)

Unsupervised models are used when you don’t have labeled data. While this course focuses on supervised classification, it’s worth knowing what’s available:

Clustering: grouping points based on similarities/differences; e.g., K-means, hierarchical clustering
Association: reveals relationships between variables; e.g., Apriori, F-P Growth
Dimensionality reduction: reduces the inputs to a smaller size; e.g., PCA, t-SNE, autoencoders

LIVE DEMO

How Models Fail

Even the best models can fail if the data is messy, the problem is hard, or the world changes. Common failure modes include:

Bad or inconsistent labels (garbage in, garbage out!)
Underfitting (model too simple) or overfitting (model too complex)
Dataset shift (data changes between training and real-world use)
Hidden confounders (Simpson’s paradox)
Imbalanced or “troublesome” classes

Labeling

Oh, labeling…

Labeling issues can arise when the data is not labeled correctly or consistently, which can lead to biased or inaccurate models. Examples of labeling issues include:

Mislabeling: Labels that are assigned to data points are incorrect.
Ambiguous labeling: Labels that are assigned to data points are not clear or specific.
Inconsistent labeling: Labels that are assigned to similar data points are not the same

Fit

A model may fail to fit the data in one of two ways: under-fitting or over-fitting:

Under-fitting: The model fails to capture the differences between the classes. The model may be too simple, lack the necessary features, or the classes may not easily divide based on existing data.
Over-fitting: The model fits the training data too closely, leading to poor generalization. This can be the case when the model is overly complex or the data may have “too many features”.

Note: With enough variables you can build a perfect predictor for anything (at least in the training set). That doesn’t mean the model will perform well in the wild
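
The note above is easy to demonstrate. In this sketch both the features and the labels are pure random noise (so there is nothing real to learn), yet an unrestricted decision tree still scores perfectly on the data it was trained on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# 50 noise features and coin-flip labels: no true signal at all
X_train = rng.normal(size=(200, 50))
y_train = rng.integers(0, 2, size=200)
X_val = rng.normal(size=(200, 50))
y_val = rng.integers(0, 2, size=200)

# No depth limit, so the tree can keep splitting until it memorizes every row
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"train accuracy = {train_acc:.2f}, validation accuracy = {val_acc:.2f}")
```

Training accuracy is perfect while validation accuracy hovers around chance, which is the signature of over-fitting.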

Dataset Shift

Dataset shift occurs when the distribution of the data changes between training and the data the model is later evaluated or deployed on. Dataset shift can be divided into three types:

Covariate Shift: A change in the distribution of the independent variables between the training and test sets.
Prior Probability Shift: A change in the distribution of the target variable between the training and test sets.
Concept Shift: A change in the relationship between the independent and target variables between the training and test sets.

Training data (Q2 2017): “Fidget spinners are the future!”

Production data (2018+): “…nevermind”
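
Covariate shift can be simulated in a few lines. In this sketch (synthetic data, hypothetical labeling rule) the relationship between x and the label never changes; only the range of x seen in production differs from the range seen in training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def label(x):
    # The true rule is the same everywhere: positive when |x| > 1
    return (np.abs(x) > 1).astype(int).ravel()

# Training data only covers x in [0, 2], where "x > 1" looks like the whole story
x_train = rng.uniform(0, 2, size=(500, 1))
x_same = rng.uniform(0, 2, size=(500, 1))      # new data from the same distribution
x_shifted = rng.uniform(-2, 0, size=(500, 1))  # "production" data from a shifted distribution

model = LogisticRegression().fit(x_train, label(x_train))

acc_same = model.score(x_same, label(x_same))
acc_shifted = model.score(x_shifted, label(x_shifted))
print(f"accuracy on same distribution = {acc_same:.2f}")
print(f"accuracy after covariate shift = {acc_shifted:.2f}")
```

The model looks excellent on data like its training set and drops to roughly coin-flip accuracy on the shifted inputs, even though the underlying rule is unchanged.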

Simpson’s Paradox

Simpson’s paradox occurs when a trend appears in several different groups of data, but disappears or reverses when these groups are combined. It is a common problem in statistics and machine learning that can occur when there are confounding variables that affect the relationship between the independent and dependent variables.
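
A small worked example shows the reversal. The counts below are illustrative, patterned on the frequently cited kidney-stone treatment example, where stone size acts as the confounder: it affects both which treatment a patient gets and how likely the treatment is to succeed.

```python
import pandas as pd

# Illustrative counts: successes and patients per treatment and stone size
df = pd.DataFrame({
    "treatment": ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes": [81, 192, 234, 55],
    "patients": [87, 263, 270, 80],
})

# Success rate within each stone-size group
df["success_rate"] = df["successes"] / df["patients"]
by_group = df.pivot(index="stone_size", columns="treatment", values="success_rate")
print(by_group.round(3))

# Success rate after pooling the groups together
totals = df.groupby("treatment")[["successes", "patients"]].sum()
overall = totals["successes"] / totals["patients"]
print(overall.round(3))
```

Treatment A has the higher success rate within both stone-size groups, yet treatment B looks better once the groups are pooled, because A was used far more often on the harder (large-stone) cases.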

Troublesome Classes

Certain classes or categories in a dataset may be more difficult to classify accurately than others. This can be due to imbalanced class distribution, noisy data, or other factors. Identifying and addressing troublesome classes is an important step in building effective classification models.

Model Interpretation

Understanding why a model makes its predictions is crucial in health data science—especially when decisions impact patient care.

SHAP Values for Feature Importance

SHAP (SHapley Additive exPlanations) assigns each feature an importance value for a particular prediction, based on cooperative game theory. SHAP provides more rigorous explanations than simpler methods but takes longer to compute.

Reference Card: SHAP

Component	Details
`TreeExplainer(model)`	Explainer optimized for tree-based models
`explainer.shap_values(X)`	Compute SHAP values for data
`summary_plot()`	Visualize global feature importance
`dependence_plot()`	Show feature interactions
Install	`pip install shap`

Code Snippet: SHAP

import shap
import xgboost as xgb
import pandas as pd

X = pd.DataFrame({'age': [50, 60], 'bp': [120, 140]})
y = [0, 1]
model = xgb.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")

Example SHAP summary plot: Each dot shows a feature’s impact on a prediction. Color indicates feature value (red=high, blue=low).

Example SHAP dependence plot: Shows how the effect of one feature depends on the value of another feature.

eli5 for Model Inspection

eli5 is a Python library that helps demystify machine learning models by showing feature weights and decision paths.

Reference Card: eli5

Component	Details
`show_weights()`	Display feature importances (for notebooks/HTML)
`explain_weights()`	Return explanation object for programmatic use (e.g. `format_as_html()`)
`show_prediction()`	Explain a single prediction
Key Parameters	• `estimator`: Trained model • `feature_names`: List of feature names • `top`: Number of top features to show
Install	`pip install eli5`

Code Snippet: eli5

import eli5
from sklearn.ensemble import RandomForestClassifier

X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
model = RandomForestClassifier().fit(X, y)

eli5.show_weights(model, feature_names=['feature1', 'feature2'])

Practical Data Preparation

Preparing your data is just as important as choosing the right model. Good data prep can make or break your results—especially with real-world health data, which is often messy, imbalanced, and full of categorical variables.

Feature Scaling with StandardScaler

Many algorithms (especially logistic regression, SVMs, and neural networks) perform better when features are on similar scales. StandardScaler transforms features to have mean=0 and standard deviation=1.

Reference Card: `StandardScaler`

Component	Details
Function	`sklearn.preprocessing.StandardScaler()`
Purpose	Standardize features by removing the mean and scaling to unit variance
Key Methods	• `fit(X)`: Compute mean and std from training data • `transform(X)`: Apply standardization • `fit_transform(X)`: Fit and transform in one step • `inverse_transform(X)`: Reverse the transformation
When to Use	Logistic regression, SVM, neural networks, K-means, PCA

Code Snippet: StandardScaler

from sklearn.preprocessing import StandardScaler
import numpy as np

## Sample data with different scales
X = np.array([[50, 180000], [65, 220000], [35, 150000]])  # Age, Income

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Original:\n", X)
print("Scaled:\n", X_scaled)
## Each column now has mean≈0, std≈1

LabelEncoder for Target Variables

Some classifiers (like XGBoost) require target labels to be consecutive integers starting from 0. LabelEncoder transforms arbitrary labels like [5, 7, 9] into [0, 1, 2].

Reference Card: `LabelEncoder`

Component	Details
Function	`sklearn.preprocessing.LabelEncoder()`
Purpose	Encode target labels as integers 0, 1, 2, …
Key Methods	• `fit(y)`: Learn label mapping • `transform(y)`: Apply encoding • `fit_transform(y)`: Fit and transform in one step • `inverse_transform(y)`: Convert back to original labels
When to Use	XGBoost with multi-class labels, or when algorithms require 0-indexed labels

Code Snippet: LabelEncoder

from sklearn.preprocessing import LabelEncoder

## Original labels aren't 0-indexed
y = [5, 7, 9, 5, 7, 9]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)  # [0 1 2 0 1 2]

## To decode back:
y_original = encoder.inverse_transform(y_encoded)
print(y_original)  # [5 7 9 5 7 9]

OneHotEncoder for Categorical Variables

Many machine learning models require all input features to be numeric. One-hot encoding transforms categorical variables (like “smoker” or “blood type”) into a set of binary columns.

Reference Card: `OneHotEncoder`

Component	Details
Function	`sklearn.preprocessing.OneHotEncoder()`
Purpose	Encode categorical features as binary columns
Key Parameters	• `categories`: Categories per feature (‘auto’) • `drop`: Drop category to avoid multicollinearity • `sparse_output`: Return sparse matrix or array • `handle_unknown`: How to handle unknown categories

Code Snippet: OneHotEncoder

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'smoker': ['yes', 'no', 'no', 'yes']})
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['smoker']])
print(encoded)

Handling Imbalanced Classes with Evaluation Metrics

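With imbalanced classes, accuracy can look excellent for a model that simply predicts the majority class every time. Consider metrics that focus on how the positive (usually minority) class is handled, like: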

F1-score: Harmonic mean of precision and recall (good out of the box, but can be weighted for class imbalance)
Precision: Proportion of positive identifications that were actually correct
Recall: Proportion of actual positives that were identified correctly
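
To see why these metrics matter under imbalance, consider a “model” that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier on made-up labels with a 5% positive rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 95 negatives, 5 positives; features are irrelevant to this baseline
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

# Baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)
f1 = f1_score(y, y_pred, zero_division=0)  # no positive predictions, so define F1 as 0
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}, F1 = {f1:.2f}")
```

Accuracy alone would rate this useless baseline at 95%, while recall and F1 correctly flag that it never finds a single positive case.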

Handling Imbalanced Data with SMOTE

In health data, one class (like “disease present”) is often much rarer than the other. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples of the minority class to balance the dataset.

Reference Card: `SMOTE`

Component	Details
Function	`imblearn.over_sampling.SMOTE()`
Purpose	Oversample minority class with synthetic samples
Key Parameters	• `sampling_strategy`: Target class distribution • `k_neighbors`: Neighbors used for interpolation • `random_state`: For reproducibility
Install	`pip install imbalanced-learn`

Code Snippet: SMOTE

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import collections

X, y = make_classification(n_samples=100, weights=[0.9, 0.1], random_state=42)
print("Original:", collections.Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("Resampled:", collections.Counter(y_resampled))

Feature Engineering

Good features often matter more than the choice of algorithm. Feature engineering combines domain knowledge with data transformation to create inputs that help models learn.

Domain-Specific Feature Derivations

The best features often come from domain knowledge—knowing what matters in health data:

Calculating BMI from height and weight
Deriving heart rate variability from RR intervals
Creating a “polypharmacy” flag for patients on multiple medications
Age buckets for risk stratification
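
A few of these derivations can be sketched in pandas. The column names, the polypharmacy threshold (5 or more medications), and the age cutoffs are all assumptions for illustration; use the definitions appropriate to your own study.

```python
import pandas as pd

# Hypothetical patient table
df = pd.DataFrame({
    "height_cm": [170, 160, 180],
    "weight_kg": [70, 80, 95],
    "n_medications": [2, 7, 5],
    "age": [34, 67, 81],
})

# BMI = weight (kg) / height (m) squared
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Polypharmacy flag (assumed threshold: 5 or more concurrent medications)
df["polypharmacy"] = df["n_medications"] >= 5

# Age buckets (assumed cutoffs); right=False makes the bins [0, 40), [40, 65), [65, 120)
df["age_group"] = pd.cut(
    df["age"], bins=[0, 40, 65, 120], labels=["<40", "40-64", "65+"], right=False
)

print(df[["bmi", "polypharmacy", "age_group"]].round(1))
```

Each derived column encodes something a clinician already knows to be meaningful, which saves the model from having to rediscover it from raw inputs.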

Automated Feature Engineering (Further Reading)

For complex relational datasets, libraries like featuretools can automatically generate features using Deep Feature Synthesis. This is especially useful for time series and multi-table data.

When and How to Combine Techniques

Often, you’ll need to use several data prep techniques together. The order is crucial to prevent data leakage—when information from the test set accidentally influences training, leading to overly optimistic performance estimates.

Recommended Order (opinionated):

1. Split data first: Hold out the test set before any preprocessing. Use stratify=y to preserve class ratios.
2. Encode categorical variables: Fit the encoder on training data only, then transform train, validation, and test.
3. Engineer additional features:
- Always use subject matter expertise to transform or combine conceptually related columns into features that are more predictive or better suited to your model (e.g., closer to linear for a linear model)
- Use dimension reduction techniques like Principal Component Analysis (PCA) to combine columns and reduce multicollinearity. Fit PCA on training data only; it is sensitive to feature scale, so standardize those columns first.
- Treat automated feature engineering as an exploratory step. It is a kind of modeling and, like any model, it can overfit. You will also have to explain whatever you did and the relationships it produced in your analysis.
4. Scale/normalize features (if needed): Fit the scaler on training data only, then transform train, validation, and test.
5. Balance classes (optional, e.g., SMOTE): Apply to the training set only, never to validation or test.

Why split first? If you fit your scaler or encoder on the full dataset before splitting, your model “sees” test data statistics during training. This leaks information and makes your model appear better than it really is.

If Not Balancing: Use evaluation metrics that account for imbalance:

Confusion Matrix: Performance per class
Precision, Recall, F1-score: Especially for minority class
ROC AUC or Precision-Recall AUC: PR AUC is usually more informative when the positive class is rare
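
A small sketch of these three views on the same predictions. The labels and scores are made up, the 0.5 decision threshold is an assumption, and `average_precision_score` is used here as a common summary of the precision-recall curve.

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    roc_auc_score,
)

# Toy data (made up): 8 negatives, 2 positives, with predicted probabilities
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.60, 0.40, 0.90]

# Hard predictions at an assumed 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # 7, 1, 1, 1

# Threshold-free metrics computed from the scores
roc_auc = roc_auc_score(y_true, y_score)           # 15/16 = 0.9375
pr_auc = average_precision_score(y_true, y_score)  # 0.5*1 + 0.5*(2/3) = 0.8333...

print(cm)
print(f"ROC AUC={roc_auc:.4f}  PR AUC={pr_auc:.4f}")
```

On this toy example ROC AUC looks stronger than PR AUC, because ROC AUC is buoyed by the many easy negatives while PR AUC tracks only how the positives are ranked.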
