Time Series & Regression: Predicting the Future
"The best thing about the future is that it comes one data point at a time." β Abraham Lincoln (if he was a data scientist)
Source: XKCD 605 - A cautionary tale about extrapolation
Lecture Overview
This lecture is divided into four main parts:
1. Conceptual Overview (10 minutes)
   - Introduction to time series in healthcare
   - Key applications and importance
   - Visual introduction to patterns in time series
2. Time Series Basics (30 minutes)
   - Types of healthcare time series
   - Components and visualization
   - Common challenges
   - Basic analysis techniques
   - Demo: Exploring heart rate patterns during meditation
3. ARIMA Models (25 minutes)
   - Introduction to forecasting
   - Understanding ARIMA components
   - Model selection and evaluation
   - Healthcare applications
   - Demo: Sleep quality prediction
4. Sensor Data Analysis (25 minutes)
   - Characteristics of dense physiological data
   - Signal processing techniques
   - Feature extraction
   - Pattern recognition in health monitoring
   - Demo: Advanced sensor data analysis
Let's begin with a conceptual overview of time series analysis!
1. Conceptual Overview: Why Time Series Matter in Healthcare
The Three Laws of Time Series
1. A time series may not harm a patient, or through inaction, allow a patient to come to harm
   - Monitoring systems must be reliable
   - Alerts should minimize false alarms
   - Models must be interpretable
2. A time series must be clean, except where such cleanliness conflicts with the First Law
   - Data quality is crucial
   - But don't discard "messy" data that might be clinically relevant
   - Document all preprocessing steps
3. A time series must protect its integrity as long as such protection does not conflict with the First or Second Law
   - No data leakage from the future (see the split sketch below)
   - Respect temporal ordering
   - Handle missing values appropriately
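To make the Third Law concrete, here is a minimal sketch of a leakage-free temporal split; the series ts below is synthetic and purely illustrative:
# A leakage-free temporal split: training data strictly precedes test data
import pandas as pd
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
ts = pd.Series(range(100), index=dates)  # synthetic stand-in data
split_point = int(len(ts) * 0.8)
train, test = ts.iloc[:split_point], ts.iloc[split_point:]
# A random shuffle here would leak future information into training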
Check: What's your favorite example of "garbage in, garbage out" with time series data?
What is a Time Series?
A time series is a sequence of observations recorded at regular time intervals. In healthcare, time series data is ubiquitous:
- Patient monitoring: Vital signs, continuous glucose readings, ECG
- Treatment tracking: Medication effects, therapy outcomes
- Disease progression: Biomarker changes, symptom severity
- Healthcare operations: Hospital admissions, resource utilization
Patterns in Time Series
Time series can exhibit various patterns that provide valuable insights:
graph TD
    A[Time Series Patterns] --> B[Trend]
    A --> C[Seasonality]
    A --> D[Cyclical]
    A --> E[Irregular]
    B --> F[Upward/Downward]
    B --> G[Changing Direction]
    C --> H[Daily/Weekly]
    C --> I[Monthly/Yearly]
    D --> J[Business Cycles]
    D --> K[Epidemic Waves]
    E --> L[Random Variation]

Let's look at some examples of different patterns in healthcare time series (a synthetic sketch follows this list):
- Trend only: Disease progression (e.g., gradual decline in lung function)
- Seasonality only: Seasonal allergies or influenza cases
- Trend and seasonality: Hospital admissions over multiple years
- Cyclical patterns: Epidemic waves that don't follow fixed calendar patterns
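To make these patterns concrete, here is a sketch that constructs a synthetic daily admissions series combining an upward trend, a weekly cycle, and noise (all numbers invented for illustration):
# Synthetic example: daily admissions = trend + weekly seasonality + noise
import numpy as np
import pandas as pd
dates = pd.date_range(start='2023-01-01', periods=365, freq='D')
trend = np.linspace(100, 150, 365)  # gradual increase over the year
seasonality = 10 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly cycle
noise = np.random.normal(0, 5, 365)  # irregular component
admissions = pd.Series(trend + seasonality + noise, index=dates)
admissions.plot(title='Synthetic admissions: trend + weekly seasonality')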
2. Time Series Basics
Types of Healthcare Time Series
1. Regular vs. Irregular Time Series
graph LR
    A[Time Series Types] --> B[Regular]
    A --> C[Irregular]
    B --> D[Fixed intervals<br>e.g., hourly vitals]
    C --> E[Variable intervals<br>e.g., patient visits]

Regular Time Series
- Fixed time intervals between observations
- Examples: Hourly vital signs, daily lab values, continuous monitoring
- Data from: Heart Rate During Meditation
Irregular Time Series
- Variable time intervals between observations
- Examples: Patient visits, symptom reports, medication changes
- Data from: MMASH Dataset
2. Univariate vs. Multivariate Time Series
Univariate Time Series
- Single variable measured over time
- Example: Blood glucose readings
Conceptual: Univariate time series represent a single variable tracked over time. They're the simplest form of time series and allow us to focus on patterns in a single measurement.
Reference:
- pandas.Series: Core data structure for univariate time series
- index: DatetimeIndex that holds the timestamps
- Common methods: plot(), rolling(), resample()
# Example of univariate time series
import pandas as pd
dates = pd.date_range(start='2024-01-01', periods=10, freq='D')
glucose = pd.Series([95, 100, 92, 98, 105, 110, 102, 95, 99, 97], index=dates)
Multivariate Time Series
- Multiple variables measured over time
- Example: Vital signs (heart rate, blood pressure, temperature)
Conceptual: Multivariate time series track multiple variables simultaneously. They allow us to analyze relationships between different measurements and how they evolve together over time.
Reference:
- pandas.DataFrame: Core data structure for multivariate time series
- Each column represents a different variable
- Shared DatetimeIndex across all variables
# Example of multivariate time series
vitals = pd.DataFrame({
    'heart_rate': [72, 75, 71, 74, 77, 80, 76, 73, 75, 74],
    'systolic_bp': [120, 122, 119, 121, 125, 128, 124, 120, 122, 121],
    'temperature': [98.6, 98.7, 98.5, 98.6, 98.8, 99.0, 98.7, 98.6, 98.7, 98.6]
}, index=dates)
3. Dense vs. Sparse Time Series
Dense Time Series
- High-frequency data collection
- Examples: ECG (250+ Hz), accelerometer data, continuous glucose monitoring
- Challenges: Storage, processing, feature extraction
Sparse Time Series
- Infrequent or irregular observations
- Examples: Annual check-ups, episodic symptoms
- Challenges: Interpolation, handling missing data
Time Series Components
1. Trend
- Long-term progression in the data
- Can be upward, downward, or changing direction
- Examples in healthcare: Disease progression, recovery trajectory
2. Seasonality
- Regular, predictable patterns that repeat
- Can be daily, weekly, monthly, yearly
- Examples in healthcare: Circadian rhythms, seasonal illnesses
3. Cyclical Patterns
- Irregular fluctuations without fixed frequency
- Usually longer than seasonal patterns
- Examples in healthcare: Epidemic cycles, treatment response cycles
4. Noise/Residuals
- Random variation that can't be explained by other components
- Can be due to measurement error or natural variability
- Examples in healthcare: Biological variability, measurement artifacts
Visualizing Components
Conceptual: Time series decomposition breaks a time series into its constituent parts: trend, seasonality, and residuals. This helps us understand the underlying patterns and can improve forecasting.
Reference:
- statsmodels.tsa.seasonal.seasonal_decompose: Decomposes a time series into components
- model: 'additive' (components add) or 'multiplicative' (components multiply)
- period: Length of the seasonal cycle (e.g., 7 for weekly, 12 for monthly)
# Example of time series decomposition
# (ts: a pandas Series with a DatetimeIndex)
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='additive', period=30)
result.plot()  # panels for trend, seasonal, and residual components
Common Challenges in Healthcare Time Series
1. Missing Values
- Causes: Equipment failures, patient non-compliance, data entry errors
- Approaches:
  - Deletion (if minimal)
  - Interpolation (linear, spline)
  - Forward/backward fill
  - Model-based imputation
Conceptual: Missing values are common in healthcare time series and can significantly impact analysis. Different imputation methods have different assumptions and effects on the resulting data.
Reference:
- pandas.Series.fillna: Fill missing values in a Series
- pandas.Series.interpolate: Interpolate missing values
# Example of handling missing values
ts_ffill = ts_with_gaps.ffill()  # Forward fill (fillna(method='ffill') is deprecated)
ts_interp = ts_with_gaps.interpolate(method='linear')  # Linear interpolation
2. Irregular Sampling
- Challenge: Observations not taken at regular intervals
- Approaches:
  - Resampling to regular intervals
  - Special methods for irregular time series
  - Continuous-time models
Conceptual: Irregular sampling occurs when observations aren't taken at fixed intervals. Resampling converts irregular time series to regular ones, making them easier to analyze with standard methods.
Reference:
- pandas.Series.resample: Resample time series to a regular frequency
# Example of resampling irregular data
regular_ts = irregular_ts.resample('D').interpolate(method='linear')
3. Outliers
- Challenge: Extreme values that may be errors or important signals
- Approaches:
  - Statistical detection (z-score, IQR)
  - Contextual outlier detection
  - Robust methods that resist outlier influence
Conceptual: Outliers are extreme values that may represent errors or important clinical events. Z-scores measure how many standard deviations a point is from the mean, helping identify outliers.
Reference:
- scipy.stats.zscore: Calculate z-scores for a dataset
# Example of outlier detection
import numpy as np
from scipy import stats
z_scores = stats.zscore(ts_with_outliers)
outliers = np.abs(z_scores) > 3  # Threshold at 3 standard deviations
4. Changing Variance
- Challenge: Variance of the time series changes over time
- Approaches:
  - Transformation (log, Box-Cox)
  - GARCH models
  - Robust scaling methods
Conceptual: Changing variance (heteroscedasticity) can affect model performance. Log transformation can stabilize variance by compressing larger values more than smaller ones.
Reference:
- numpy.log1p: Natural logarithm of (1 + x)
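A minimal example of the log transform mentioned above, assuming ts is a non-negative pandas Series:
# Stabilize variance with a log transform (ts assumed non-negative)
import numpy as np
log_ts = np.log1p(ts)  # log(1 + x) compresses large values and handles zeros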
Basic Time Series Analysis Techniques
1. Descriptive Statistics
- Basic measures: mean, median, min, max, standard deviation
- Rolling statistics: moving averages, rolling standard deviation
- Autocorrelation: correlation of a series with its own lagged values
Conceptual: Rolling statistics calculate metrics over a sliding window, helping identify changing patterns over time. They smooth out short-term fluctuations while preserving longer-term trends.
Reference:
- pandas.Series.rolling: Create a rolling window object
# Example of rolling statistics
rolling_mean = ts.rolling(window=30).mean()
rolling_std = ts.rolling(window=30).std()
2. Correlation Analysis
- Autocorrelation Function (ACF): correlation with lagged values
- Partial Autocorrelation Function (PACF): correlation with lagged values, controlling for intermediate lags
- Cross-correlation: correlation between two different time series
Conceptual: Autocorrelation measures how a time series correlates with lagged versions of itself. It helps identify seasonality and determine appropriate parameters for time series models.
Reference:
- statsmodels.graphics.tsaplots.plot_acf: Plot the autocorrelation function
- statsmodels.graphics.tsaplots.plot_pacf: Plot the partial autocorrelation function
# Example of autocorrelation analysis
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(ts, lags=40)
plot_pacf(ts, lags=40)
3. Resampling and Aggregation
- Upsampling: Increasing frequency (e.g., daily to hourly)
- Downsampling: Decreasing frequency (e.g., hourly to daily)
- Aggregation methods: mean, median, sum, min, max
Conceptual: Resampling changes the frequency of a time series. Downsampling aggregates data to a lower frequency, while upsampling increases frequency (requiring interpolation).
Reference:
- pandas.Series.resample: Resample time series to a different frequency
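A short sketch of both directions, assuming hourly_ts is a pandas Series with an hourly DatetimeIndex:
# Downsampling: hourly -> daily, aggregating with the mean
daily_mean = hourly_ts.resample('D').mean()
# Upsampling: daily -> hourly, filling the new points by linear interpolation
hourly_interp = daily_mean.resample('h').interpolate(method='linear')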
4. Visualization Techniques
- Line plots with enhancements: confidence intervals, annotations
- Multiple time series visualization: subplots, overlay, faceting
- Seasonal plots: values by season
- Heatmaps: correlation between multiple time series
Conceptual: Effective visualization is crucial for understanding time series data. Different visualization techniques highlight different aspects of the data.
Reference:
- matplotlib.pyplot: Basic plotting library
- seaborn: Advanced statistical visualizations
# Example of enhanced line plot
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(ts, label='Time Series')
plt.axhline(y=ts.mean(), color='r', linestyle='--', label='Mean')
plt.legend()
5. Survival Analysis Methods
Cox Proportional Hazards Model

Conceptual: The Cox Proportional Hazards Model is a semi-parametric model used in survival analysis to assess the relationship between covariates and survival time. It estimates the hazard (or risk) of an event occurring at a certain time, considering the impact of various factors.
Key features:
- Handles censored data (when the event hasn't occurred by the end of observation)
- Doesn't require specifying a particular probability distribution for survival times
- Assumes proportional hazards (the effect of covariates is constant over time)
- Widely used in clinical trials and epidemiological studies
Reference:
- lifelines.CoxPHFitter: Implementation of the Cox Proportional Hazards model
- duration_col: Column containing the duration until the event or censoring
- event_col: Column indicating if the event of interest occurred
- Methods: fit(), predict_survival_function(), plot()
# Example of Cox Proportional Hazards model
from lifelines import CoxPHFitter
# Assuming df has 'duration', 'event', and covariate columns
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event')
# Print summary of the model
print(cph.summary)
Kaplan-Meier Survival Analysis

Conceptual: Kaplan-Meier analysis is a non-parametric method for estimating the survival function from time-to-event data. It creates a step function that shows the probability of an event occurring at different time points, accounting for censored observations.
Key features:
- Provides a visual representation of survival over time
- Handles right-censored data effectively
- Makes no assumptions about the underlying distribution
- Can compare survival curves between different groups
- Used in medicine, biology, engineering, and economics
Reference:
- lifelines.KaplanMeierFitter: Implementation of the Kaplan-Meier estimator
- durations: Array of durations until event or censoring
- event_observed: Array indicating whether the event was observed (1) or censored (0)
- Methods: fit(), plot_survival_function(), median_survival_time_
# Example of Kaplan-Meier analysis
from lifelines import KaplanMeierFitter
# Initialize the Kaplan-Meier model
kmf = KaplanMeierFitter()
# Fit the model
kmf.fit(durations=df['duration'], event_observed=df['event'])
# Plot the survival function
kmf.plot_survival_function()
DEMO BREAK: Exploring Heart Rate Patterns During Meditation
See: demo1-synthetic-timeseries
3. ARIMA Models
Introduction to Forecasting
Conceptual: Forecasting time series involves predicting future values based on patterns observed in historical data. The goal is to capture the underlying structure while accounting for randomness.
Additive vs. Multiplicative Models
Time series can be modeled as either:
- Additive: Value = Base Level + Trend + Seasonality + Error
- Multiplicative: Value = Base Level × Trend × Seasonality × Error
Additive models are appropriate when the seasonal variation is constant over time, while multiplicative models are better when the seasonal variation increases with the level of the series.
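To see the difference, here is a small synthetic sketch (all values invented for illustration):
# Synthetic comparison of additive vs. multiplicative seasonality
import numpy as np
t = np.arange(120)
trend = 100 + 0.5 * t
season = np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, len(t))
additive = trend + 10 * season + noise  # seasonal swing stays roughly constant
multiplicative = trend * (1 + 0.1 * season) * (1 + 0.01 * noise)  # swing grows with the level
In the multiplicative series the seasonal swings grow as the trend rises, which is the usual visual cue for preferring a multiplicative model.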
Stationarity and Transformations
Conceptual: A stationary time series has statistical properties that don't change over time (constant mean, variance, and autocorrelation). Most forecasting methods require stationary data.
Testing for Stationarity
- Visual inspection: Plot the series and check if properties change over time
- Statistical tests: Augmented Dickey-Fuller (ADF), KPSS
Reference:
- statsmodels.tsa.stattools.adfuller: Augmented Dickey-Fuller test
# Example of ADF test
from statsmodels.tsa.stattools import adfuller
result = adfuller(series)
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
# p-value < 0.05: reject the unit-root null, i.e., the series looks stationary
Making a Series Stationary
- Differencing: Subtract consecutive observations (first difference, second difference)
- Transformation: Apply mathematical functions (log, square root, Box-Cox)
- Detrending: Remove trend component
- Deseasonalizing: Remove seasonal component
# Example of differencing
import numpy as np
diff_series = series.diff().dropna()  # First difference
# Example of log transformation
log_series = np.log1p(series)  # log(1+x) to handle zeros
Understanding ARIMA Components
AR (AutoRegressive) Component
Conceptual: The AR component (p) captures the relationship between an observation and its previous values. It's like saying "today's value depends on the last few days' values."
Key points:
- Order p: Number of lag observations to include
- Higher p: More complex relationships with past values
- Too high p: Risk of overfitting
Reference:
# Simple AR(1) process simulation
import numpy as np

def simulate_ar1(n_points=1000, phi=0.7):
    """Simulate AR(1) process: y[t] = phi*y[t-1] + e[t]"""
    e = np.random.normal(0, 1, n_points)
    y = np.zeros(n_points)
    for t in range(1, n_points):
        y[t] = phi * y[t-1] + e[t]
    return y
I (Integrated) Component
Conceptual: The I component (d) handles non-stationarity through differencing. It's like "leveling the playing field" by removing trends and seasonality.
Types of differencing:
1. First difference (d=1): Remove trend
   y'[t] = y[t] - y[t-1]
2. Second difference (d=2): Remove quadratic trend
   y''[t] = (y[t] - y[t-1]) - (y[t-1] - y[t-2])
3. Seasonal difference: Remove seasonal patterns
   y'[t] = y[t] - y[t-s], where s is the seasonal period
Reference:
import pandas as pd

def difference_series(series, d=1):
    """Apply d-th order differencing (diff applied d times)"""
    s = pd.Series(series)
    for _ in range(d):
        s = s.diff()
    return s.dropna()

def inverse_difference(diff_series, original_first_value, d=1):
    """Reverse differencing transformation (exact for d=1; higher
    orders need the first value of each intermediate series)"""
    series = diff_series.copy()
    for _ in range(d):
        series = series.cumsum() + original_first_value
    return series
MA (Moving Average) Component
Conceptual: The MA component (q) uses past forecast errors to improve current predictions. It's like "learning from your mistakes."
Key points:
- Order q: Number of past errors to consider
- Captures short-term adjustments
- Useful for handling irregular fluctuations
Reference:
# Simple MA(1) process simulation
import numpy as np

def simulate_ma1(n_points=1000, theta=0.5):
    """Simulate MA(1) process: y[t] = e[t] + theta*e[t-1]"""
    e = np.random.normal(0, 1, n_points)
    y = np.zeros(n_points)
    y[0] = e[0]  # no prior error at t=0
    for t in range(1, n_points):
        y[t] = e[t] + theta * e[t-1]
    return y
Practical ARIMA Implementation
Step 1: Check Stationarity
Conceptual: Before applying ARIMA, check if your data is stationary. Non-stationary data needs differencing.
Reference:
from statsmodels.tsa.stattools import adfuller

def check_stationarity(series):
    """Check stationarity using the Augmented Dickey-Fuller test"""
    result = adfuller(series)
    print(f'ADF Statistic: {result[0]:.3f}')
    print(f'p-value: {result[1]:.3f}')
    print('Critical values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value:.3f}')
    return result[1] < 0.05  # True = stationary at the 5% level
Step 2: Parameter Selection
Conceptual: Choosing p, d, q parameters is crucial for model performance. Use these guidelines:
- Choose d:
  - d=0 if already stationary
  - d=1 for most non-stationary series
  - d=2 rarely needed
- Choose p:
  - Examine the PACF plot
  - Count significant lags
  - Usually p ≤ 3
- Choose q:
  - Examine the ACF plot
  - Count significant lags
  - Usually q ≤ 3
Reference:
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

def suggest_pdq(series):
    """Suggest ARIMA parameters based on data characteristics"""
    # Difference until stationary (capped at d=2)
    d = 0
    while not check_stationarity(series):
        series = difference_series(series)
        d += 1
        if d >= 2:
            break
    # Get ACF and PACF values
    acf_values = acf(series, nlags=10)
    pacf_values = pacf(series, nlags=10)
    # Suggest p and q from lags outside the ~95% significance band
    p = sum(np.abs(pacf_values[1:]) > 1.96 / np.sqrt(len(series)))
    q = sum(np.abs(acf_values[1:]) > 1.96 / np.sqrt(len(series)))
    return min(p, 3), d, min(q, 3)
Step 3: Model Fitting and Diagnostics
Conceptual: After selecting parameters, fit the model and check its performance.
Reference:
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

def fit_and_evaluate_arima(train, test, order=(1, 1, 1)):
    """Fit ARIMA model and evaluate performance"""
    # Fit model
    model = ARIMA(train, order=order)
    model_fit = model.fit()
    # Make predictions
    predictions = model_fit.forecast(steps=len(test))
    # Calculate metrics
    mae = mean_absolute_error(test, predictions)
    rmse = np.sqrt(mean_squared_error(test, predictions))
    # Model diagnostics
    residuals = model_fit.resid
    return {
        'model': model_fit,
        'predictions': predictions,
        'mae': mae,
        'rmse': rmse,
        'aic': model_fit.aic,
        'bic': model_fit.bic,
        'residuals': residuals
    }
Step 4: Model Validation
Conceptual: Validate your model using these checks:
- Residual Analysis:
  - Should be normally distributed
  - Should show no autocorrelation
  - Should have constant variance
- Cross-Validation:
  - Use rolling forecasts
  - Compare multiple parameter sets
  - Check prediction intervals
Reference:
from scipy import stats
from statsmodels.tsa.stattools import acf

def validate_arima_model(model_results):
    """Perform model validation checks"""
    residuals = model_results['residuals']
    # Normality test
    _, norm_p_value = stats.normaltest(residuals)
    # Autocorrelation in residuals
    residual_acf = acf(residuals, nlags=10)
    # Heteroscedasticity check: compare variance of the two halves
    _, hetero_p_value = stats.levene(
        residuals[:len(residuals) // 2],
        residuals[len(residuals) // 2:]
    )
    return {
        'normality_p_value': norm_p_value,
        'residual_acf': residual_acf,
        'heteroscedasticity_p_value': hetero_p_value
    }
DEMO BREAK: Sleep Quality Prediction
4. Sensor Data Analysis
Characteristics of Dense Physiological Data
Conceptual: Dense physiological data represents a fundamental shift in healthcare monitoring, moving from sparse, episodic measurements to continuous, high-resolution observations of physiological processes. This data is characterized by high sampling rates, continuous monitoring, and multiple variables, creating both opportunities and challenges for analysis.
The richness of sensor data allows us to capture subtle patterns and transient events that would be missed by traditional monitoring approaches. However, this same richness creates computational challenges and requires specialized techniques to separate meaningful signals from noise. The temporal nature of sensor data also preserves the sequence and timing of physiological events, which is often as important as the measurements themselves.
Key Characteristics
- High sampling rates: ECG (250+ Hz), accelerometer (20-100 Hz)
- Continuous monitoring: Wearables, implantable devices
- Multivariate signals: Multiple sensors, multiple dimensions
- Noise and artifacts: Movement artifacts, sensor errors, baseline drift
Signal Processing Methods
Conceptual: Signal processing prepares raw sensor data for analysis by removing noise and artifacts.
Common Methods:
- Filtering: Low-pass, high-pass, band-pass, and notch filters
- Smoothing: Moving average, exponential smoothing, Savitzky-Golay (see the sketch below)
- Resampling: Adjusting sampling rates for consistent analysis
- Artifact removal: Detecting and handling motion artifacts
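As one example from this toolbox, here is a minimal Savitzky-Golay smoothing sketch on an invented noisy signal:
# Savitzky-Golay smoothing of a noisy synthetic signal
import numpy as np
from scipy.signal import savgol_filter
t = np.linspace(0, 10, 500)
noisy = np.sin(t) + np.random.normal(0, 0.2, t.size)
smoothed = savgol_filter(noisy, window_length=31, polyorder=3)  # window must be odd, polyorder < window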
Frequency Analysis
Conceptual: Frequency analysis reveals periodic patterns in sensor data that are not visible in the time domain.
Fast Fourier Transform (FFT)
- Decomposes a signal into its frequency components
- Reveals dominant frequencies and periodic patterns
- Essential for analyzing oscillatory physiological signals
# FFT implementation
from scipy.fft import fft, fftfreq
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
sampling_rate = 100  # Hz
duration = 5  # seconds
t = np.linspace(0, duration, int(sampling_rate * duration), endpoint=False)
# Signal with 2 Hz and 10 Hz components
# (named `sig` so it doesn't shadow the scipy.signal module used below)
sig = np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)
# Compute FFT
yf = fft(sig)
xf = fftfreq(len(t), 1 / sampling_rate)
# Plot only positive frequencies (negative frequencies are redundant for real signals)
positive_freq_idx = np.where(xf > 0)
plt.figure(figsize=(10, 4))
plt.plot(xf[positive_freq_idx], 2.0 / len(t) * np.abs(yf[positive_freq_idx]))
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude')
plt.title('Frequency Spectrum')
# Practical tips for FFT analysis:
# 1. Ensure your sampling rate is at least 2x the highest frequency you want to detect (Nyquist theorem)
# 2. Use windowing functions (e.g., Hanning, Hamming) to reduce spectral leakage
# 3. For better frequency resolution, use longer signal durations
# 4. Normalize the magnitude by N (signal length), or N/2 for a one-sided spectrum
Power Spectral Density (PSD)
- Measures power distribution across frequencies
- Useful for identifying dominant rhythms in physiological signals
- Common in heart rate variability and EEG analysis
# Power spectral density estimation
from scipy.signal import welch
# Compute PSD of `sig` (from the FFT example) using Welch's method
frequencies, psd = welch(sig, fs=sampling_rate, nperseg=256)
# Plot PSD
plt.figure(figsize=(10, 4))
plt.semilogy(frequencies, psd)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Power spectral density')
plt.title('Power Spectral Density')
Wavelet Transforms
- Provides time-frequency representation
- Better for non-stationary signals than FFT
- Preserves both time and frequency information
- Useful for detecting transient events in physiological signals
Two main types of wavelet transforms used in physiological signal analysis:
- Continuous Wavelet Transform (CWT):
  - Analyzes signals with wavelets at all scales and positions
  - Provides high resolution in both time and frequency domains
  - Commonly used for ECG and EEG analysis
- Discrete Wavelet Transform (DWT):
  - Decomposes signals into approximation and detail coefficients
  - More computationally efficient than CWT
  - Used for denoising and feature extraction
# Example of continuous wavelet transform
from pywt import cwt
import numpy as np
import matplotlib.pyplot as plt
# Create sample signal
sampling_rate = 100  # Hz
t = np.linspace(0, 2, 2 * sampling_rate)
# Signal whose frequency content changes at t = 1 s
# (adding the 10 Hz component only where t > 1 avoids a shape mismatch)
sig = np.sin(2 * np.pi * 2 * t)
sig[t > 1] += np.sin(2 * np.pi * 10 * t[t > 1])
# Perform continuous wavelet transform with the Morlet wavelet
scales = np.arange(1, 128)
coefficients, frequencies = cwt(sig, scales, 'morl')
# Plot the scalogram
plt.figure(figsize=(10, 6))
plt.imshow(np.abs(coefficients), aspect='auto', cmap='jet')
plt.title('Wavelet Transform Scalogram')
plt.ylabel('Scale')
plt.xlabel('Time')
plt.colorbar(label='Magnitude')
Feature Extraction for Classification
Conceptual: Feature extraction transforms raw sensor data into meaningful metrics that can be used for classification and pattern recognition.
Time Domain Features
- Statistical measures: Mean, median, standard deviation, skewness, kurtosis
- Peak characteristics: Count, amplitude, width, prominence
- Signal complexity: Approximate entropy, sample entropy, Lyapunov exponent
Frequency Domain Features
- Spectral power in bands: Power in physiologically relevant frequency bands
- Spectral moments: Mean frequency, spectral centroid, bandwidth
- Spectral entropy: Measure of regularity/predictability in frequency domain
Time-Frequency Features
- Wavelet coefficients: Capturing time-localized frequency content
- Spectrogram statistics: Features from time-frequency representations
# Example of extracting multiple features
import numpy as np
from scipy import stats
from scipy.fft import fft, fftfreq
from scipy.signal import find_peaks

def extract_features(sig, sampling_rate):
    """Extract common features from a physiological signal.
    Parameters
    ----------
    sig : array-like
        The input signal time series (named `sig` so it doesn't shadow scipy.signal)
    sampling_rate : float
        The sampling frequency in Hz
    Returns
    -------
    features : dict
        Dictionary containing extracted features
    Notes
    -----
    This function extracts three types of features:
    1. Time domain features (statistical properties)
    2. Peak-based features (using peak detection)
    3. Frequency domain features (using FFT)
    For physiological signals, typical frequency bands are:
    - Delta (0.5-4 Hz): Deep sleep, relaxation
    - Theta (4-8 Hz): Drowsiness, meditation
    - Alpha (8-13 Hz): Relaxed alertness
    - Beta (13-30 Hz): Active thinking, focus
    """
    sig = np.asarray(sig)
    features = {}
    # Time domain features
    features['mean'] = np.mean(sig)
    features['std'] = np.std(sig)
    features['skewness'] = stats.skew(sig)
    features['kurtosis'] = stats.kurtosis(sig)
    # More robust statistical features
    features['median'] = np.median(sig)
    features['iqr'] = np.percentile(sig, 75) - np.percentile(sig, 25)
    features['min'] = np.min(sig)
    features['max'] = np.max(sig)
    # Peak detection
    peaks, peak_properties = find_peaks(
        sig,
        height=0,  # Minimum height of peaks
        distance=max(1, sampling_rate // 10),  # Minimum samples between peaks
        prominence=0.1,  # Minimum prominence of peaks
        width=None  # Minimum width of peaks
    )
    # Peak-based features
    features['peak_count'] = len(peaks)
    if len(peaks) > 0:
        features['mean_peak_height'] = np.mean(peak_properties['peak_heights'])
        features['peak_prominence'] = np.mean(peak_properties['prominences'])
    else:
        features['mean_peak_height'] = 0
        features['peak_prominence'] = 0
    # Frequency domain features
    yf = fft(sig)
    xf = fftfreq(len(sig), 1 / sampling_rate)
    positive_freq_idx = np.where(xf > 0)
    # Power in different frequency bands (common in EEG/ECG analysis)
    delta_band = (0.5, 4)  # Hz - Delta waves
    theta_band = (4, 8)  # Hz - Theta waves
    alpha_band = (8, 13)  # Hz - Alpha waves
    beta_band = (13, 30)  # Hz - Beta waves
    # Calculate power in each band
    power_spectrum = 2.0 / len(sig) * np.abs(yf)
    for band_name, (low, high) in [
        ('delta', delta_band),
        ('theta', theta_band),
        ('alpha', alpha_band),
        ('beta', beta_band)
    ]:
        band_idx = np.where((xf >= low) & (xf <= high))
        features[f'{band_name}_power'] = np.sum(power_spectrum[band_idx])
    # Relative band powers (normalized by total power)
    total_power = np.sum(power_spectrum[positive_freq_idx])
    if total_power > 0:
        for band in ['delta', 'theta', 'alpha', 'beta']:
            features[f'relative_{band}_power'] = features[f'{band}_power'] / total_power
    return features
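A hypothetical usage sketch (the input signal is synthetic, purely to show the call):
# Hypothetical usage on a synthetic 10 Hz sine with added noise
import numpy as np
fs = 100  # Hz
t = np.arange(0, 10, 1 / fs)
demo_sig = np.sin(2 * np.pi * 10 * t) + np.random.normal(0, 0.1, t.size)
feats = extract_features(demo_sig, sampling_rate=fs)
print(feats['peak_count'], feats['relative_alpha_power'])  # 10 Hz falls in the alpha band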
Preparing for Classification
Conceptual: The extracted features form the input to classification algorithms in the next stage of analysis.
Key Considerations:
- Feature selection: Identifying the most informative features
- Feature normalization: Scaling features to comparable ranges
- Dimensionality reduction: PCA, t-SNE for visualizing high-dimensional feature spaces
- Time-aware validation: Special considerations for time series data (see the sketch below)
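A minimal sketch of two of these steps with scikit-learn; the feature matrix X and labels y are hypothetical stand-ins:
# Feature normalization and time-aware cross-validation (X, y are stand-ins)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
X = np.random.rand(100, 8)  # 100 windows x 8 extracted features
y = np.random.randint(0, 2, 100)
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
# (in practice, fit the scaler on each training fold only, to avoid leakage)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_scaled):
    pass  # fit and evaluate a classifier here; training folds precede test folds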
DEMO BREAK: Advanced Sensor Data Analysis
See: demo3-hrv-feature-extraction
Signal Preprocessing Techniques
Outlier and Artifact Removal
Conceptual: Physiological signals often contain outliers and artifacts that can distort analysis results. Common sources include:
- Movement artifacts
- Equipment malfunctions
- Ectopic beats (in ECG/RR data)
- Environmental interference
Reference:
import numpy as np

def remove_outliers(signal, n_std=3):
    """Remove values more than n standard deviations from the mean"""
    signal = np.asarray(signal)
    mean = np.mean(signal)
    std = np.std(signal)
    return signal[(signal > mean - n_std * std) &
                  (signal < mean + n_std * std)]
Interpolation Methods
Conceptual: Many analysis techniques require evenly sampled data, but physiological signals (like RR intervals) are often irregularly sampled.
Common interpolation methods:
- Linear interpolation
- Cubic spline interpolation
- Polynomial interpolation
Reference:
from scipy.interpolate import interp1d
import numpy as np
# rr_intervals: array of RR intervals (assumed available); each value's
# timestamp is the cumulative sum of the preceding intervals
time = np.cumsum(rr_intervals)
uniform_time = np.linspace(time[0], time[-1], desired_length)
# Interpolate the irregularly sampled values onto the uniform grid
interp_func = interp1d(time, rr_intervals, kind='linear')
uniform_signal = interp_func(uniform_time)
Advanced Signal Processing
Filtering Techniques
Butterworth Filters
Conceptual: Butterworth filters provide a smooth frequency response with minimal ripple. Common types:
- Lowpass: Remove high frequencies
- Highpass: Remove low frequencies
- Bandpass: Keep a specific frequency range
- Bandstop: Remove a specific frequency range
Reference:
from scipy.signal import butter, filtfilt

def butter_bandpass(lowcut, highcut, fs, order=4):
    # Normalize cutoffs by the Nyquist frequency
    nyquist = 0.5 * fs
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = butter(order, [low, high], btype='band')
    return b, a

def apply_bandpass(data, lowcut, highcut, fs, order=4):
    # filtfilt applies the filter forward and backward (zero phase shift)
    b, a = butter_bandpass(lowcut, highcut, fs, order)
    return filtfilt(b, a, data)
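A quick usage sketch on an invented two-tone signal, keeping only the 0.5-5 Hz content:
# Hypothetical usage: band-limit a synthetic 1 Hz + 20 Hz signal
import numpy as np
fs = 100  # Hz
t = np.arange(0, 10, 1 / fs)
raw = np.sin(2 * np.pi * 1 * t) + np.sin(2 * np.pi * 20 * t)
filtered = apply_bandpass(raw, lowcut=0.5, highcut=5, fs=fs)  # 20 Hz component removed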
Frequency Analysis Methods
Fast Fourier Transform (FFT)
Conceptual: FFT decomposes a signal into its frequency components, revealing periodic patterns.
Key concepts:
- Frequency spectrum
- Power spectral density
- Nyquist frequency
- Spectral leakage
Reference:
from scipy.fft import fft, fftfreq
import numpy as np
# `sig` and `sampling_rate` are assumed from the earlier examples
yf = fft(sig)
xf = fftfreq(len(sig), 1 / sampling_rate)
# Power spectrum
power_spectrum = np.abs(yf) ** 2
Frequency Bands in HRV Analysis
Conceptual: Heart Rate Variability (HRV) analysis often focuses on specific frequency bands:
- VLF (Very Low Frequency): 0.003-0.04 Hz
  - Long-term regulation mechanisms
  - Thermoregulation
- LF (Low Frequency): 0.04-0.15 Hz
  - Sympathetic and parasympathetic activity
  - Baroreceptor activity
- HF (High Frequency): 0.15-0.4 Hz
  - Parasympathetic activity
  - Respiratory sinus arrhythmia
Reference:
def get_frequency_band_power(power_spectrum, freqs, band):
    """Calculate power in a specific frequency band"""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return np.sum(power_spectrum[mask])
Feature Engineering for Time Series
Statistical Features
Conceptual: Basic statistical measures that capture different aspects of the signal:
- Central tendency: mean, median, mode
- Dispersion: standard deviation, variance, IQR
- Shape: skewness, kurtosis
- Extremes: min, max, range
Reference:
import numpy as np
from scipy import stats

def extract_statistical_features(signal):
    return {
        'mean': np.mean(signal),
        'std': np.std(signal),
        'skew': stats.skew(signal),
        'kurtosis': stats.kurtosis(signal),
        'iqr': np.percentile(signal, 75) - np.percentile(signal, 25)
    }
Rolling Window Features
Conceptual: Features calculated over sliding windows capture local patterns:
- Rolling mean
- Rolling standard deviation
- Rolling entropy
- Rolling correlation
Reference:
import pandas as pd

def rolling_features(signal, window_size):
    s = pd.Series(signal)
    return {
        'rolling_mean': s.rolling(window_size).mean(),
        'rolling_std': s.rolling(window_size).std()
    }
Frequency Domain Features
Conceptual: Features derived from frequency analysis:
- Band powers (VLF, LF, HF)
- Peak frequencies
- Spectral entropy
- Power ratios (LF/HF)
Reference:
def frequency_domain_features(power_spectrum, freqs):
    vlf = get_frequency_band_power(power_spectrum, freqs, (0.003, 0.04))
    lf = get_frequency_band_power(power_spectrum, freqs, (0.04, 0.15))
    hf = get_frequency_band_power(power_spectrum, freqs, (0.15, 0.4))
    return {
        'vlf_power': vlf,
        'lf_power': lf,
        'hf_power': hf,
        'lf_hf_ratio': lf / hf if hf > 0 else 0
    }
Summary and Key Takeaways
- Time series data is fundamental in healthcare for monitoring, forecasting, and understanding patterns over time
- Basic analysis techniques like decomposition, correlation analysis, and visualization provide insights into underlying patterns
- ARIMA models offer a powerful framework for forecasting time series data in healthcare
- Sensor data analysis requires specialized techniques for processing dense physiological signals
- Practical applications include patient monitoring, resource planning, and early warning systems
Further Resources
- Time Series Analysis in Python - Comprehensive guide to forecasting
- PhysioNet - Repository of physiological data and software
- statsmodels Documentation - Time series analysis in Python
- Practical Time Series Analysis - O'Reilly book
Next Steps
- Explore the demo notebooks to gain hands-on experience
- Apply these techniques to your own healthcare data
- Consider how time series analysis could improve healthcare delivery and outcomes