
Time Series & Regression: Predicting the Future 📈⌛


"The best thing about the future is that it comes one data point at a time." β€” Abraham Lincoln (if he was a data scientist)

Figure: XKCD 605, "Extrapolating" (a cautionary tale about extrapolation). Source: https://xkcd.com/605/

Lecture Overview

This lecture is divided into four main parts:

  1. Conceptual Overview (10 minutes)

    • Introduction to time series in healthcare
    • Key applications and importance
    • Visual introduction to patterns in time series

  2. Time Series Basics (30 minutes)

    • Types of healthcare time series
    • Components and visualization
    • Common challenges
    • Basic analysis techniques
    • Demo: Exploring heart rate patterns during meditation

  3. ARIMA Models (25 minutes)

    • Introduction to forecasting
    • Understanding ARIMA components
    • Model selection and evaluation
    • Healthcare applications
    • Demo: Sleep quality prediction

  4. Sensor Data Analysis (25 minutes)

    • Characteristics of dense physiological data
    • Signal processing techniques
    • Feature extraction
    • Pattern recognition in health monitoring
    • Demo: Advanced sensor data analysis

Let's begin with a conceptual overview of time series analysis!

1. Conceptual Overview: Why Time Series Matter in Healthcare 🏥

The Three Laws of Time Series 🤖

  1. A time series may not harm a patient, or through inaction, allow a patient to come to harm

    • Monitoring systems must be reliable
    • Alerts should minimize false alarms
    • Models must be interpretable
  2. A time series must be clean, except where such cleanliness conflicts with the First Law

    • Data quality is crucial
    • But don't discard "messy" data that might be clinically relevant
    • Document all preprocessing steps
  3. A time series must protect its integrity as long as such protection does not conflict with the First or Second Law

    • No data leakage from the future (see the sketch after this list)
    • Respect temporal ordering
    • Handle missing values appropriately
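
To make the Third Law concrete, here is a minimal sketch of a chronological train/test split (the series is a stand-in for real monitoring data):

# Sketch: respect temporal ordering by splitting chronologically, never randomly
import pandas as pd

dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts = pd.Series(range(100), index=dates)  # stand-in for a real vital-sign series

split = int(len(ts) * 0.8)
train, test = ts.iloc[:split], ts.iloc[split:]  # the future never leaks into training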

Check: What's your favorite example of "garbage in, garbage out" with time series data?

What is a Time Series?

A time series is a sequence of observations indexed by time, typically recorded at regular intervals. In healthcare, time series data is ubiquitous:

  • Patient monitoring: Vital signs, continuous glucose readings, ECG
  • Treatment tracking: Medication effects, therapy outcomes
  • Disease progression: Biomarker changes, symptom severity
  • Healthcare operations: Hospital admissions, resource utilization

Patterns in Time Series

Time series can exhibit various patterns that provide valuable insights:

graph TD
    A[Time Series Patterns] --> B[Trend]
    A --> C[Seasonality]
    A --> D[Cyclical]
    A --> E[Irregular]
    B --> F[Upward/Downward]
    B --> G[Changing Direction]
    C --> H[Daily/Weekly]
    C --> I[Monthly/Yearly]
    D --> J[Business Cycles]
    D --> K[Epidemic Waves]
    E --> L[Random Variation]

Let's look at some examples of different patterns in healthcare time series (a synthetic sketch follows the list):

  • Trend only: Disease progression (e.g., gradual decline in lung function)
  • Seasonality only: Seasonal allergies or influenza cases
  • Trend and seasonality: Hospital admissions over multiple years
  • Cyclical patterns: Epidemic waves that don't follow fixed calendar patterns
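
The sketch below builds a synthetic series combining trend, seasonality, and noise (all values are illustrative):

# Sketch: synthetic admissions series with trend + yearly seasonality + noise
import numpy as np
import pandas as pd

dates = pd.date_range('2022-01-01', periods=730, freq='D')
trend = np.linspace(100, 130, len(dates))                        # gradual upward trend
seasonality = 10 * np.sin(2 * np.pi * dates.dayofyear / 365.25)  # yearly cycle
noise = np.random.normal(0, 2, len(dates))                       # irregular component
admissions = pd.Series(trend + seasonality + noise, index=dates)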

2. Time Series Basics

Types of Healthcare Time Series

1. Regular vs. Irregular Time Series

graph LR
    A[Time Series Types] --> B[Regular]
    A --> C[Irregular]
    B --> D[Fixed intervals<br>e.g., hourly vitals]
    C --> E[Variable intervals<br>e.g., patient visits]
Regular Time Series
  • Fixed time intervals between observations
  • Examples: Hourly vital signs, daily lab values, continuous monitoring
  • Data from: Heart Rate During Meditation
Irregular Time Series
  • Variable time intervals between observations
  • Examples: Patient visits, symptom reports, medication changes
  • Data from: MMASH Dataset

2. Univariate vs. Multivariate Time Series

Univariate Time Series
  • Single variable measured over time
  • Example: Blood glucose readings

Conceptual: Univariate time series represent a single variable tracked over time. They're the simplest form of time series and allow us to focus on patterns in a single measurement.

Reference:

  • pandas.Series: Core data structure for univariate time series
  • index: DatetimeIndex that holds the timestamps
  • Common methods: plot(), rolling(), resample()

# Example of univariate time series
import pandas as pd
dates = pd.date_range(start='2024-01-01', periods=10, freq='D')
glucose = pd.Series([95, 100, 92, 98, 105, 110, 102, 95, 99, 97], index=dates)
Multivariate Time Series
  • Multiple variables measured over time
  • Example: Vital signs (heart rate, blood pressure, temperature)

Conceptual: Multivariate time series track multiple variables simultaneously. They allow us to analyze relationships between different measurements and how they evolve together over time.

Reference:

  • pandas.DataFrame: Core data structure for multivariate time series
  • Each column represents a different variable
  • Shared DatetimeIndex across all variables

# Example of multivariate time series
vitals = pd.DataFrame({
    'heart_rate': [72, 75, 71, 74, 77, 80, 76, 73, 75, 74],
    'systolic_bp': [120, 122, 119, 121, 125, 128, 124, 120, 122, 121],
    'temperature': [98.6, 98.7, 98.5, 98.6, 98.8, 99.0, 98.7, 98.6, 98.7, 98.6]
}, index=dates)

3. Dense vs. Sparse Time Series

Dense Time Series
  • High-frequency data collection
  • Examples: ECG (250+ Hz), accelerometer data, continuous glucose monitoring
  • Challenges: Storage, processing, feature extraction
Sparse Time Series
  • Infrequent or irregular observations
  • Examples: Annual check-ups, episodic symptoms
  • Challenges: Interpolation, handling missing data

Time Series Components 📊

1. Trend

  • Long-term progression in the data
  • Can be upward, downward, or changing direction
  • Examples in healthcare: Disease progression, recovery trajectory

2. Seasonality

  • Regular, predictable patterns that repeat
  • Can be daily, weekly, monthly, yearly
  • Examples in healthcare: Circadian rhythms, seasonal illnesses

3. Cyclical Patterns

  • Irregular fluctuations without fixed frequency
  • Usually longer than seasonal patterns
  • Examples in healthcare: Epidemic cycles, treatment response cycles

4. Noise/Residuals

  • Random variation that can't be explained by other components
  • Can be due to measurement error or natural variability
  • Examples in healthcare: Biological variability, measurement artifacts

Visualizing Components

Conceptual: Time series decomposition breaks a time series into its constituent parts: trend, seasonality, and residuals. This helps us understand the underlying patterns and can improve forecasting.

Reference:

  • statsmodels.tsa.seasonal.seasonal_decompose: Decomposes a time series into components
  • model: 'additive' (components add) or 'multiplicative' (components multiply)
  • period: Length of the seasonal cycle (e.g., 7 for weekly, 12 for monthly)

# Example of time series decomposition (ts: a pandas Series with a DatetimeIndex)
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='additive', period=30)  # period ~= one monthly cycle in daily data

Common Challenges in Healthcare Time Series 🚧

1. Missing Values

  • Causes: Equipment failures, patient non-compliance, data entry errors
  • Approaches:
    • Deletion (if minimal)
    • Interpolation (linear, spline)
    • Forward/backward fill
    • Model-based imputation

Conceptual: Missing values are common in healthcare time series and can significantly impact analysis. Different imputation methods have different assumptions and effects on the resulting data.

Reference:

  • pandas.Series.fillna: Fill missing values in a Series
  • pandas.Series.interpolate: Interpolate missing values

# Example of handling missing values
ts_ffill = ts_with_gaps.ffill()  # Forward fill (fillna(method='ffill') is deprecated)
ts_interp = ts_with_gaps.interpolate(method='linear')  # Linear interpolation

2. Irregular Sampling

  • Challenge: Observations not taken at regular intervals
  • Approaches:
    • Resampling to regular intervals
    • Special methods for irregular time series
    • Continuous-time models

Conceptual: Irregular sampling occurs when observations aren't taken at fixed intervals. Resampling converts irregular time series to regular ones, making them easier to analyze with standard methods.

Reference: - pandas.Series.resample: Resample time series to regular frequency

# Example of resampling irregular data
regular_ts = irregular_ts.resample('D').interpolate(method='linear')

3. Outliers

  • Challenge: Extreme values that may be errors or important signals
  • Approaches:
    • Statistical detection (z-score, IQR)
    • Contextual outlier detection
    • Robust methods that resist outlier influence

Conceptual: Outliers are extreme values that may represent errors or important clinical events. Z-scores measure how many standard deviations a point is from the mean, helping identify outliers.

Reference: - scipy.stats.zscore: Calculate z-scores for a dataset

# Example of outlier detection
import numpy as np
from scipy import stats
z_scores = stats.zscore(ts_with_outliers)
outliers = np.abs(z_scores) > 3  # Threshold at 3 standard deviations

4. Changing Variance

  • Challenge: Variance of the time series changes over time
  • Approaches:
    • Transformation (log, Box-Cox)
    • GARCH models
    • Robust scaling methods

Conceptual: Changing variance (heteroscedasticity) can affect model performance. Log transformation can stabilize variance by compressing larger values more than smaller ones.

Reference: - numpy.log1p: Natural logarithm of (1 + x)

# Example of variance stabilization
ts_log = np.log1p(ts_with_outliers)  # log(1+x) to handle zeros

Basic Time Series Analysis Techniques 🔍

1. Descriptive Statistics

  • Basic measures: mean, median, min, max, standard deviation
  • Rolling statistics: moving averages, rolling standard deviation
  • Autocorrelation: correlation of a series with its own lagged values

Conceptual: Rolling statistics calculate metrics over a sliding window, helping identify changing patterns over time. They smooth out short-term fluctuations while preserving longer-term trends.

Reference: - pandas.Series.rolling: Create a rolling window object

# Example of rolling statistics
rolling_mean = ts.rolling(window=30).mean()
rolling_std = ts.rolling(window=30).std()

2. Correlation Analysis

  • Autocorrelation Function (ACF): correlation with lagged values
  • Partial Autocorrelation Function (PACF): correlation with lagged values, controlling for intermediate lags
  • Cross-correlation: correlation between two different time series

Conceptual: Autocorrelation measures how a time series correlates with lagged versions of itself. It helps identify seasonality and determine appropriate parameters for time series models.

Reference:

  • statsmodels.graphics.tsaplots.plot_acf: Plot autocorrelation function
  • statsmodels.graphics.tsaplots.plot_pacf: Plot partial autocorrelation function

# Example of autocorrelation analysis
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(ts, lags=40)
plot_pacf(ts, lags=40)

3. Resampling and Aggregation

  • Upsampling: Increasing frequency (e.g., daily to hourly)
  • Downsampling: Decreasing frequency (e.g., hourly to daily)
  • Aggregation methods: mean, median, sum, min, max

Conceptual: Resampling changes the frequency of a time series. Downsampling aggregates data to a lower frequency, while upsampling increases frequency (requiring interpolation).

Reference: - pandas.Series.resample: Resample time series to different frequency

# Example of resampling
weekly_mean = ts.resample('W').mean()            # Downsample from daily to weekly
hourly = ts.resample('h').interpolate('linear')  # Upsample from daily to hourly (new points are interpolated)

4. Visualization Techniques

  • Line plots with enhancements: confidence intervals, annotations
  • Multiple time series visualization: subplots, overlay, faceting
  • Seasonal plots: values by season
  • Heatmaps: correlation between multiple time series

Conceptual: Effective visualization is crucial for understanding time series data. Different visualization techniques highlight different aspects of the data.

Reference:

  • matplotlib.pyplot: Basic plotting library
  • seaborn: Advanced statistical visualizations

# Example of enhanced line plot
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(ts, label='Time Series')
plt.axhline(y=ts.mean(), color='r', linestyle='--', label='Mean')
plt.legend()

5. Survival Analysis Methods

Cox Proportional Hazards Model


Conceptual: The Cox Proportional Hazards Model is a semi-parametric model used in survival analysis to assess the relationship between covariates and survival time. It estimates the hazard (or risk) of an event occurring at a certain time, considering the impact of various factors.

Key features:

  • Handles censored data (when the event hasn't occurred by the end of observation)
  • Doesn't require specifying a particular probability distribution for survival times
  • Assumes proportional hazards (the effect of covariates is constant over time)
  • Widely used in clinical trials and epidemiological studies

Reference:

  • lifelines.CoxPHFitter: Implementation of the Cox Proportional Hazards model
  • duration_col: Column containing the duration until the event or censoring
  • event_col: Column indicating if the event of interest occurred
  • Methods: fit(), predict_survival_function(), plot()

# Example of Cox Proportional Hazards model
from lifelines import CoxPHFitter

# Assuming df has 'duration', 'event', and covariate columns
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event')

# Print summary of the model
print(cph.summary)
Kaplan-Meier Survival Analysis


Conceptual: Kaplan-Meier analysis is a non-parametric method for estimating the survival function from time-to-event data. It creates a step function that shows the probability of an event occurring at different time points, accounting for censored observations.

Key features:

  • Provides a visual representation of survival over time
  • Handles right-censored data effectively
  • Makes no assumptions about the underlying distribution
  • Can compare survival curves between different groups
  • Used in medicine, biology, engineering, and economics

Reference:

  • lifelines.KaplanMeierFitter: Implementation of the Kaplan-Meier estimator
  • durations: Array of durations until event or censoring
  • event_observed: Array indicating whether the event was observed (1) or censored (0)
  • Methods: fit(), plot_survival_function(), median_survival_time_

# Example of Kaplan-Meier analysis
from lifelines import KaplanMeierFitter

# Initialize the Kaplan-Meier model
kmf = KaplanMeierFitter()

# Fit the model
kmf.fit(durations=df['duration'], event_observed=df['event'])

# Plot the survival function
kmf.plot_survival_function()

DEMO BREAK: Exploring Heart Rate Patterns During Meditation

See: demo1-synthetic-timeseries

3. ARIMA Models

Introduction to Forecasting

Conceptual: Forecasting time series involves predicting future values based on patterns observed in historical data. The goal is to capture the underlying structure while accounting for randomness.

Additive vs. Multiplicative Models

Time series can be modeled as either:

  • Additive: Value = Base Level + Trend + Seasonality + Error
  • Multiplicative: Value = Base Level × Trend × Seasonality × Error

Additive models are appropriate when the seasonal variation is constant over time, while multiplicative models are better when the seasonal variation increases with the level of the series.
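
A small numeric sketch of the two forms on a synthetic monthly series (all coefficients are illustrative):

# Sketch: additive vs. multiplicative structure
import numpy as np

t = np.arange(120)                   # 10 years of monthly points
level_trend = 50 + 0.5 * t           # base level plus linear trend
season = np.sin(2 * np.pi * t / 12)  # 12-month seasonal cycle

additive = level_trend + 5 * season                # seasonal swing stays constant
multiplicative = level_trend * (1 + 0.1 * season)  # swing grows with the level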

Stationarity and Transformations

Conceptual: A stationary time series has statistical properties that don't change over time (constant mean, variance, and autocorrelation). Most forecasting methods require stationary data.

Testing for Stationarity

  • Visual inspection: Plot the series and check if properties change over time
  • Statistical tests: Augmented Dickey-Fuller (ADF), KPSS

Reference: - statsmodels.tsa.stattools.adfuller: Augmented Dickey-Fuller test

# Example of ADF test (series: a 1-D array or pandas Series)
from statsmodels.tsa.stattools import adfuller
result = adfuller(series)
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')

Making a Series Stationary

  • Differencing: Subtract consecutive observations (first difference, second difference)
  • Transformation: Apply mathematical functions (log, square root, Box-Cox)
  • Detrending: Remove trend component
  • Deseasonalizing: Remove seasonal component
import numpy as np

# Example of differencing
diff_series = series.diff().dropna()  # First difference

# Example of log transformation
log_series = np.log1p(series)  # log(1+x) to handle zeros

Understanding ARIMA Components

AR (AutoRegressive) Component

Conceptual: The AR component (p) captures the relationship between an observation and its previous values. It's like saying "today's value depends on the last few days' values."

Key points:

  • Order p: Number of lag observations to include
  • Higher p: More complex relationships with past values
  • Too high p: Risk of overfitting

Reference:

# Simple AR(1) process simulation
import numpy as np

def simulate_ar1(n_points=1000, phi=0.7):
    """Simulate AR(1) process: y[t] = phi*y[t-1] + e[t]"""
    e = np.random.normal(0, 1, n_points)
    y = np.zeros(n_points)
    for t in range(1, n_points):
        y[t] = phi * y[t-1] + e[t]
    return y

I (Integrated) Component

Conceptual: The I component (d) handles non-stationarity through differencing. It's like "leveling the playing field" by removing trends and seasonality.

Types of differencing:

  1. First difference (d=1): Remove trend
     y'[t] = y[t] - y[t-1]
  2. Second difference (d=2): Remove quadratic trend
     y''[t] = (y[t] - y[t-1]) - (y[t-1] - y[t-2])
  3. Seasonal difference: Remove seasonal patterns
     y'[t] = y[t] - y[t-s], where s is the seasonal period

Reference:

import pandas as pd

def difference_series(series, d=1):
    """Apply d-th order differencing (diff applied d times, not a single lag-d difference)"""
    s = pd.Series(series)
    for _ in range(d):
        s = s.diff()
    return s.dropna()

def inverse_difference(diff_series, original_first_value, d=1):
    """Reverse differencing; exact for d=1 (higher orders need the first value at each level)"""
    series = diff_series.copy()
    for _ in range(d):
        series = series.cumsum() + original_first_value
    return series

MA (Moving Average) Component

Conceptual: The MA component (q) uses past forecast errors to improve current predictions. It's like "learning from your mistakes."

Key points:

  • Order q: Number of past errors to consider
  • Captures short-term adjustments
  • Useful for handling irregular fluctuations

Reference:

# Simple MA(1) process simulation
import numpy as np

def simulate_ma1(n_points=1000, theta=0.5):
    """Simulate MA(1) process: y[t] = e[t] + theta*e[t-1]"""
    e = np.random.normal(0, 1, n_points)
    y = np.zeros(n_points)
    y[0] = e[0]
    for t in range(1, n_points):
        y[t] = e[t] + theta * e[t-1]
    return y

Practical ARIMA Implementation

Step 1: Check Stationarity

Conceptual: Before applying ARIMA, check if your data is stationary. Non-stationary data needs differencing.

Reference:

from statsmodels.tsa.stattools import adfuller

def check_stationarity(series):
    """Check stationarity using the Augmented Dickey-Fuller test"""
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]:.3f}')
    print(f'p-value: {result[1]:.3f}')
    print('Critical values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value:.3f}')
    return result[1] < 0.05  # True if stationary at the 5% level

Step 2: Parameter Selection

Conceptual: Choosing p, d, q parameters is crucial for model performance. Use these guidelines:

  1. Choose d:

    • d=0 if already stationary
    • d=1 for most non-stationary series
    • d=2 rarely needed

  2. Choose p:

    • Examine the PACF plot
    • Count significant lags
    • Usually p ≤ 3

  3. Choose q:

    • Examine the ACF plot
    • Count significant lags
    • Usually q ≤ 3

Reference:

import numpy as np
from statsmodels.tsa.stattools import acf, pacf

def suggest_pdq(series):
    """Suggest ARIMA parameters based on data characteristics"""
    # Difference until stationary (cap at d=2)
    d = 0
    while not check_stationarity(series):
        series = difference_series(series)
        d += 1
        if d >= 2:
            break

    # Get ACF and PACF values
    acf_values = acf(series, nlags=10)
    pacf_values = pacf(series, nlags=10)

    # Suggest p and q from lags outside the ~95% significance band
    p = sum(np.abs(pacf_values[1:]) > 1.96/np.sqrt(len(series)))
    q = sum(np.abs(acf_values[1:]) > 1.96/np.sqrt(len(series)))

    return min(p, 3), d, min(q, 3)

Step 3: Model Fitting and Diagnostics

Conceptual: After selecting parameters, fit the model and check its performance.

Reference:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

def fit_and_evaluate_arima(train, test, order=(1,1,1)):
    """Fit ARIMA model and evaluate performance"""
    # Fit model
    model = ARIMA(train, order=order)
    model_fit = model.fit()

    # Make predictions
    predictions = model_fit.forecast(steps=len(test))

    # Calculate metrics
    mae = mean_absolute_error(test, predictions)
    rmse = np.sqrt(mean_squared_error(test, predictions))

    # Model diagnostics
    residuals = model_fit.resid

    return {
        'model': model_fit,
        'predictions': predictions,
        'mae': mae,
        'rmse': rmse,
        'aic': model_fit.aic,
        'bic': model_fit.bic,
        'residuals': residuals
    }

Step 4: Model Validation

Conceptual: Validate your model using these checks:

  1. Residual Analysis:

    • Should be normally distributed
    • Should show no autocorrelation
    • Should have constant variance

  2. Cross-Validation:

    • Use rolling forecasts
    • Compare multiple parameter sets
    • Check prediction intervals

Reference:

from scipy import stats
from statsmodels.tsa.stattools import acf

def validate_arima_model(model_results):
    """Perform model validation checks"""
    residuals = model_results['residuals']

    # Normality test
    _, norm_p_value = stats.normaltest(residuals)

    # Autocorrelation in residuals
    residual_acf = acf(residuals, nlags=10)

    # Heteroscedasticity check
    _, hetero_p_value = stats.levene(
        residuals[:len(residuals)//2], 
        residuals[len(residuals)//2:]
    )

    return {
        'normality_p_value': norm_p_value,
        'residual_acf': residual_acf,
        'heteroscedasticity_p_value': hetero_p_value
    }

DEMO BREAK: Sleep Quality Prediction

See: demo2-hrv-forecasting

4. Sensor Data Analysis

Characteristics of Dense Physiological Data

Conceptual: Dense physiological data represents a fundamental shift in healthcare monitoring, moving from sparse, episodic measurements to continuous, high-resolution observations of physiological processes. This data is characterized by high sampling rates, continuous monitoring, and multiple variables, creating both opportunities and challenges for analysis.

The richness of sensor data allows us to capture subtle patterns and transient events that would be missed by traditional monitoring approaches. However, this same richness creates computational challenges and requires specialized techniques to separate meaningful signals from noise. The temporal nature of sensor data also preserves the sequence and timing of physiological events, which is often as important as the measurements themselves.

Key Characteristics

  • High sampling rates: ECG (250+ Hz), accelerometer (20-100 Hz)
  • Continuous monitoring: Wearables, implantable devices
  • Multivariate signals: Multiple sensors, multiple dimensions
  • Noise and artifacts: Movement artifacts, sensor errors, baseline drift

Signal Processing Methods

Conceptual: Signal processing prepares raw sensor data for analysis by removing noise and artifacts.

Common Methods:

  • Filtering: Low-pass, high-pass, band-pass, and notch filters
  • Smoothing: Moving average, exponential smoothing, Savitzky-Golay
  • Resampling: Adjusting sampling rates for consistent analysis
  • Artifact removal: Detecting and handling motion artifacts
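
As an illustrative sketch of the smoothing step, here is Savitzky-Golay filtering with scipy (the window length and polynomial order are assumptions to tune per signal):

# Sketch: smoothing a noisy signal with a Savitzky-Golay filter
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(0, 10, 1000)  # ~100 Hz for 10 s (illustrative)
raw = np.sin(2 * np.pi * 0.5 * t) + np.random.normal(0, 0.3, t.size)

# window_length must be odd; larger windows smooth more aggressively
smoothed = savgol_filter(raw, window_length=51, polyorder=3)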

Frequency Analysis

Conceptual: Frequency analysis reveals periodic patterns in sensor data that are not visible in the time domain.

Fast Fourier Transform (FFT)

  • Decomposes a signal into its frequency components
  • Reveals dominant frequencies and periodic patterns
  • Essential for analyzing oscillatory physiological signals
# FFT implementation
from scipy.fft import fft, fftfreq
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
sampling_rate = 100  # Hz
duration = 5  # seconds
t = np.linspace(0, duration, int(sampling_rate * duration), endpoint=False)
# Signal with 2 Hz and 10 Hz components
signal = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)

# Compute FFT
yf = fft(signal)
xf = fftfreq(len(t), 1/sampling_rate)

# Plot only positive frequencies (negative frequencies are redundant for real signals)
positive_freq_idx = np.where(xf > 0)
plt.figure(figsize=(10, 4))
plt.plot(xf[positive_freq_idx], 2.0/len(t) * np.abs(yf[positive_freq_idx]))
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude')
plt.title('Frequency Spectrum')

# Practical tips for FFT analysis:
# 1. Ensure your sampling rate is at least 2x the highest frequency you want to detect (Nyquist theorem)
# 2. Use windowing functions (e.g., Hanning, Hamming) to reduce spectral leakage
# 3. For better frequency resolution, use longer signal durations
# 4. The magnitude should be normalized by dividing by N (signal length) or N/2 for one-sided spectrum

Power Spectral Density (PSD)

  • Measures power distribution across frequencies
  • Useful for identifying dominant rhythms in physiological signals
  • Common in heart rate variability and EEG analysis
# Power spectral density estimation
from scipy.signal import welch  # imported directly so the 'signal' array isn't shadowed

# Compute PSD using Welch's method (signal: the sampled array from the FFT example)
frequencies, psd = welch(signal, fs=sampling_rate, nperseg=256)

# Plot PSD
plt.figure(figsize=(10, 4))
plt.semilogy(frequencies, psd)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Power spectral density')
plt.title('Power Spectral Density (Welch)')

Wavelet Transforms

  • Provides time-frequency representation
  • Better for non-stationary signals than FFT
  • Preserves both time and frequency information
  • Useful for detecting transient events in physiological signals

Two main types of wavelet transforms are used in physiological signal analysis:

  1. Continuous Wavelet Transform (CWT):

    • Analyzes signals with wavelets at all scales and positions
    • Provides high resolution in both time and frequency domains
    • Commonly used for ECG and EEG analysis

  2. Discrete Wavelet Transform (DWT):

    • Decomposes signals into approximation and detail coefficients
    • More computationally efficient than CWT
    • Used for denoising and feature extraction
# Example of wavelet transform
import numpy as np
import matplotlib.pyplot as plt
from pywt import cwt

# Create sample signal
sampling_rate = 100  # Hz
t = np.linspace(0, 2, 2 * sampling_rate)
# 2 Hz component throughout, with a 10 Hz component added in the second half
sig = np.sin(2 * np.pi * 2 * t)
sig[t > 1] += np.sin(2 * np.pi * 10 * t[t > 1])

# Perform continuous wavelet transform with a Morlet wavelet
scales = np.arange(1, 128)
coefficients, frequencies = cwt(sig, scales, 'morl', sampling_period=1/sampling_rate)

# Plot the scalogram
plt.figure(figsize=(10, 6))
plt.imshow(abs(coefficients), aspect='auto', cmap='jet')
plt.title('Wavelet Transform Scalogram')
plt.ylabel('Scale')
plt.xlabel('Time (samples)')
plt.colorbar(label='Magnitude')

Feature Extraction for Classification

Conceptual: Feature extraction transforms raw sensor data into meaningful metrics that can be used for classification and pattern recognition.

Time Domain Features

  • Statistical measures: Mean, median, standard deviation, skewness, kurtosis
  • Peak characteristics: Count, amplitude, width, prominence
  • Signal complexity: Approximate entropy, sample entropy, Lyapunov exponent

Frequency Domain Features

  • Spectral power in bands: Power in physiologically relevant frequency bands
  • Spectral moments: Mean frequency, spectral centroid, bandwidth
  • Spectral entropy: Measure of regularity/predictability in frequency domain

Time-Frequency Features

  • Wavelet coefficients: Capturing time-localized frequency content
  • Spectrogram statistics: Features from time-frequency representations
# Example of extracting multiple features
import numpy as np
from scipy import stats
from scipy.signal import find_peaks  # imported directly so the 'signal' argument isn't shadowed
from scipy.fft import fft, fftfreq

def extract_features(signal, sampling_rate):
    """Extract common features from a physiological signal.

    Parameters:
    -----------
    signal : array-like
        The input signal time series
    sampling_rate : float
        The sampling frequency in Hz

    Returns:
    --------
    features : dict
        Dictionary containing extracted features

    Notes:
    ------
    This function extracts three types of features:
    1. Time domain features (statistical properties)
    2. Peak-based features (using peak detection)
    3. Frequency domain features (using FFT)

    For physiological signals, typical frequency bands are:
    - Delta (0.5-4 Hz): Deep sleep, relaxation
    - Theta (4-8 Hz): Drowsiness, meditation
    - Alpha (8-13 Hz): Relaxed alertness
    - Beta (13-30 Hz): Active thinking, focus
    """
    features = {}

    # Time domain features
    features['mean'] = np.mean(signal)
    features['std'] = np.std(signal)
    features['skewness'] = stats.skew(signal)
    features['kurtosis'] = stats.kurtosis(signal)

    # Add more robust statistical features
    features['median'] = np.median(signal)
    features['iqr'] = np.percentile(signal, 75) - np.percentile(signal, 25)
    features['min'] = np.min(signal)
    features['max'] = np.max(signal)

    # Peak detection
    peaks, peak_properties = find_peaks(
        signal,
        height=0,                     # Minimum height of peaks
        distance=sampling_rate//10,   # Minimum samples between peaks
        prominence=0.1,               # Minimum prominence of peaks
        width=None                    # Minimum width of peaks
    )

    # Peak-based features
    features['peak_count'] = len(peaks)
    if len(peaks) > 0:
        features['mean_peak_height'] = np.mean(peak_properties['peak_heights'])
        features['peak_prominence'] = np.mean(peak_properties['prominences'])
    else:
        features['mean_peak_height'] = 0
        features['peak_prominence'] = 0

    # Frequency domain features
    yf = fft(signal)
    xf = fftfreq(len(signal), 1/sampling_rate)
    positive_freq_idx = np.where(xf > 0)

    # Power in different frequency bands (common in EEG/ECG analysis)
    delta_band = (0.5, 4)    # Hz - Delta waves
    theta_band = (4, 8)      # Hz - Theta waves
    alpha_band = (8, 13)     # Hz - Alpha waves
    beta_band = (13, 30)     # Hz - Beta waves

    # Calculate power in each band
    power_spectrum = 2.0/len(signal) * np.abs(yf)

    # Extract band powers
    for band_name, (low, high) in [
        ('delta', delta_band),
        ('theta', theta_band),
        ('alpha', alpha_band),
        ('beta', beta_band)
    ]:
        band_idx = np.where((xf >= low) & (xf <= high))
        features[f'{band_name}_power'] = np.sum(power_spectrum[band_idx])

    # Calculate relative band powers (normalized)
    total_power = np.sum(power_spectrum[positive_freq_idx])
    if total_power > 0:
        for band in ['delta', 'theta', 'alpha', 'beta']:
            features[f'relative_{band}_power'] = features[f'{band}_power'] / total_power

    return features

Preparing for Classification

Conceptual: The extracted features form the input to classification algorithms in the next stage of analysis.

Key Considerations:

  • Feature selection: Identifying the most informative features
  • Feature normalization: Scaling features to comparable ranges
  • Dimensionality reduction: PCA, t-SNE for visualizing high-dimensional feature spaces
  • Time-aware validation: Special considerations for time series data
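
A hedged sketch of two of these considerations, feature scaling plus time-aware cross-validation with scikit-learn (the feature matrix is a random stand-in):

# Sketch: scale features and validate with time-ordered splits
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(100, 8)        # stand-in feature matrix, rows ordered in time
y = np.random.randint(0, 2, 100)  # stand-in binary labels

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    scaler = StandardScaler().fit(X[train_idx])  # fit scaling on the past only
    X_train = scaler.transform(X[train_idx])
    X_test = scaler.transform(X[test_idx])       # future folds never influence scaling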

DEMO BREAK: Advanced Sensor Data Analysis

See: demo3-hrv-feature-extraction

Signal Preprocessing Techniques 🔧

Outlier and Artifact Removal

Conceptual: Physiological signals often contain outliers and artifacts that can distort analysis results. Common sources include:

  • Movement artifacts
  • Equipment malfunctions
  • Ectopic beats (in ECG/RR data)
  • Environmental interference

Reference:

import numpy as np

def remove_outliers(signal, n_std=3):
    """Remove values more than n_std standard deviations from the mean"""
    mean = np.mean(signal)
    std = np.std(signal)
    return signal[(signal > mean - n_std*std) &
                  (signal < mean + n_std*std)]

Interpolation Methods

Conceptual: Many analysis techniques require evenly sampled data, but physiological signals (like RR intervals) are often irregularly sampled.

Common interpolation methods:

  • Linear interpolation
  • Cubic spline interpolation
  • Polynomial interpolation

Reference:

import numpy as np
from scipy.interpolate import interp1d

# rr_intervals: RR intervals in seconds; signal: values at each beat;
# desired_length: number of evenly spaced samples to produce
time = np.cumsum(rr_intervals)
uniform_time = np.linspace(time[0], time[-1], desired_length)

# Interpolate the signal onto the uniform time grid
interp_func = interp1d(time, signal, kind='linear')
uniform_signal = interp_func(uniform_time)

Advanced Signal Processing 📊

Filtering Techniques

Butterworth Filters

Conceptual: Butterworth filters provide a smooth frequency response with minimal ripple. Common types:

  • Lowpass: Remove high frequencies
  • Highpass: Remove low frequencies
  • Bandpass: Keep a specific frequency range
  • Bandstop: Remove a specific frequency range

Reference:

from scipy.signal import butter, filtfilt

def butter_bandpass(lowcut, highcut, fs, order=4):
    nyquist = 0.5 * fs
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = butter(order, [low, high], btype='band')
    return b, a

def apply_bandpass(data, lowcut, highcut, fs, order=4):
    b, a = butter_bandpass(lowcut, highcut, fs, order)
    return filtfilt(b, a, data)

Frequency Analysis Methods

Fast Fourier Transform (FFT)

Conceptual: FFT decomposes a signal into its frequency components, revealing periodic patterns.

Key concepts:

  • Frequency spectrum
  • Power spectral density
  • Nyquist frequency
  • Spectral leakage

Reference:

import numpy as np
from scipy.fft import fft, fftfreq

# Compute FFT (signal: a 1-D array sampled at sampling_rate Hz)
yf = fft(signal)
xf = fftfreq(len(signal), 1/sampling_rate)

# Power spectrum
power_spectrum = np.abs(yf)**2
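
To illustrate reducing spectral leakage, here is a minimal sketch that tapers the signal with a Hann window before the FFT (reusing signal from the snippet above):

# Sketch: a Hann window reduces leakage from the signal's abrupt edges
window = np.hanning(len(signal))
yf_windowed = fft(signal * window)
power_spectrum_windowed = np.abs(yf_windowed)**2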
Frequency Bands in HRV Analysis

Conceptual: Heart Rate Variability (HRV) analysis often focuses on specific frequency bands:

  • VLF (Very Low Frequency): 0.003-0.04 Hz

    • Long-term regulation mechanisms
    • Thermoregulation

  • LF (Low Frequency): 0.04-0.15 Hz

    • Sympathetic and parasympathetic activity
    • Baroreceptor activity

  • HF (High Frequency): 0.15-0.4 Hz

    • Parasympathetic activity
    • Respiratory sinus arrhythmia

Reference:

def get_frequency_band_power(power_spectrum, freqs, band):
    """Calculate power in specific frequency band"""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.sum(power_spectrum[mask])

Feature Engineering for Time Series 🛠️

Statistical Features

Conceptual: Basic statistical measures that capture different aspects of the signal:

  • Central tendency: mean, median, mode
  • Dispersion: standard deviation, variance, IQR
  • Shape: skewness, kurtosis
  • Extremes: min, max, range

Reference:

import numpy as np
from scipy import stats

def extract_statistical_features(signal):
    return {
        'mean': np.mean(signal),
        'std': np.std(signal),
        'skew': stats.skew(signal),
        'kurtosis': stats.kurtosis(signal),
        'iqr': np.percentile(signal, 75) -
               np.percentile(signal, 25)
    }

Rolling Window Features

Conceptual: Features calculated over sliding windows capture local patterns:

  • Rolling mean
  • Rolling standard deviation
  • Rolling entropy
  • Rolling correlation

Reference:

import pandas as pd

def rolling_features(signal, window_size):
    return {
        'rolling_mean': pd.Series(signal).rolling(window_size).mean(),
        'rolling_std': pd.Series(signal).rolling(window_size).std()
    }

Frequency Domain Features

Conceptual: Features derived from frequency analysis:

  • Band powers (VLF, LF, HF)
  • Peak frequencies
  • Spectral entropy
  • Power ratios (LF/HF)

Reference:

def frequency_domain_features(power_spectrum, freqs):
    vlf = get_frequency_band_power(power_spectrum, freqs, (0.003, 0.04))
    lf = get_frequency_band_power(power_spectrum, freqs, (0.04, 0.15))
    hf = get_frequency_band_power(power_spectrum, freqs, (0.15, 0.4))
    return {
        'vlf_power': vlf,
        'lf_power': lf,
        'hf_power': hf,
        'lf_hf_ratio': lf/hf if hf > 0 else 0
    }
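
Spectral entropy is listed above but not implemented; here is a minimal sketch that assumes a one-sided power spectrum as input:

import numpy as np

def spectral_entropy(power_spectrum):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    p = power_spectrum / np.sum(power_spectrum)  # normalize to a probability distribution
    p = p[p > 0]                                 # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p)) / np.log2(len(p))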

Summary and Key Takeaways

  • Time series data is fundamental in healthcare for monitoring, forecasting, and understanding patterns over time
  • Basic analysis techniques like decomposition, correlation analysis, and visualization provide insights into underlying patterns
  • ARIMA models offer a powerful framework for forecasting time series data in healthcare
  • Sensor data analysis requires specialized techniques for processing dense physiological signals
  • Practical applications include patient monitoring, resource planning, and early warning systems

Further Resources

Next Steps

  • Explore the demo notebooks to gain hands-on experience
  • Apply these techniques to your own healthcare data
  • Consider how time series analysis could improve healthcare delivery and outcomes