# Demo: Analyzing Large Health Data with polars and pandas
## Goal
Compare polars and pandas when analyzing a large health dataset, and demonstrate the benefits of the Parquet format for efficient data storage and processing.
## Setup
**Big picture**

First, check total system memory:

```bash
free -h
```

Then, in a separate terminal, start monitoring Python processes:

```bash
watch -n 1 'ps -o pid,ppid,rss,vsize,pmem,pcpu,comm -C python'
```

Now run the demo steps:

```bash
# Change format between csv/parquet and backend between polars/pandas
time python3 demo4-analyze_large_health_data.py --backend polars --format parquet
```

All together:

```bash
python demo4-generate_large_health_data.py   # optionally: --size 100000000
free -h
# Run the watch command in a separate terminal, since it does not exit on its own:
watch -n 1 'ps -o pid,ppid,rss,vsize,pmem,pcpu,comm -C python'
time python3 demo4-analyze_large_health_data.py --backend polars --format parquet
```
In outline, the demo steps are:

- Download the source data
- Generate the large dataset (optional size parameter)
- Analyze with different backends and formats
## Tasks
- Download the source data:

  This will download the diabetes dataset and save it as `demo4-diabetes.csv`.
- Generate the large dataset:

  This will create both CSV and Parquet files and show a comparison of their sizes and processing times. The script also adds a memory-intensive hash column to ensure pandas will run out of memory when using the CSV format.
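  Using the generator command from Setup (the `--size` flag is optional):

  ```bash
  python demo4-generate_large_health_data.py --size 100000000
  ```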
- Analyze with polars using Parquet (recommended):

  ```bash
  # Use the memory monitor to track memory usage
  python demo4-analyze_large_health_data.py --backend polars --format parquet
  ```
- Analyze with polars using CSV:
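  ```bash
  python demo4-analyze_large_health_data.py --backend polars --format csv
  ```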
- Analyze with pandas using Parquet:
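  ```bash
  python demo4-analyze_large_health_data.py --backend pandas --format parquet
  ```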
- Analyze with pandas using CSV (will likely fail):
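  ```bash
  # Expected to exhaust memory because of the hash column
  python demo4-analyze_large_health_data.py --backend pandas --format csv
  ```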
- Observe:
  - Parquet files are significantly smaller than CSV files
  - polars should succeed quickly with either format
  - pandas will likely crash with CSV due to the memory-intensive hash column
  - Processing times are faster with Parquet
- Memory usage is tracked and displayed for each operation
- Inspect the output file `summary.csv`
## Expected Outcomes
- Students see how polars handles big data efficiently
- Students understand pandas' memory limitations
- Students learn about the benefits of Parquet format for large datasets
- Students learn to choose the right tool and format for big data
- Students see how memory usage differs between libraries and formats
## Why Parquet?
Parquet is a columnar storage format that offers several advantages for health data:
- Smaller file size: 2-4x smaller than CSV
- Faster queries: Only reads needed columns
- Schema enforcement: Ensures data consistency
- Predicate pushdown: Filters data before loading
- Better compression: Efficient for healthcare data patterns
- Column pruning: Can read only needed columns, reducing memory usage
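Column pruning and predicate pushdown are easiest to see with polars' lazy API. A minimal sketch, assuming a Parquet file name and column names that may differ from the demo's actual output:

```python
import polars as pl

# Lazily scan the Parquet file; nothing is read from disk yet.
lazy = pl.scan_parquet("demo4-large_health_data.parquet")  # assumed file name

# The filter is pushed down into the Parquet reader (predicate pushdown),
# and only the two referenced columns are decoded (column pruning).
result = (
    lazy.filter(pl.col("glucose") > 180)    # "glucose" is an assumed column
    .select(["patient_id", "glucose"])      # "patient_id" is an assumed column
    .collect()
)
print(result.head())
```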
## Memory Usage Comparison
This demo includes a memory-intensive hash column to demonstrate the differences in memory usage:
| Library | Format | Memory Usage | Performance |
|---|---|---|---|
| polars | Parquet | Low | Very Fast |
| polars | CSV | Medium | Fast |
| pandas | Parquet | Medium | Medium |
| pandas | CSV | Very High | Likely Fails |
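For intuition, the hash column might be generated along these lines; this is a sketch, not the demo script's actual code:

```python
import hashlib
import pandas as pd

# Build a frame and attach one SHA-256 hex digest per row.
df = pd.DataFrame({"patient_id": range(1_000_000)})
df["row_hash"] = [
    hashlib.sha256(str(pid).encode()).hexdigest() for pid in df["patient_id"]
]

# Each digest is a 64-character Python string object; when pandas reads the
# CSV back, it must materialize every one of them in memory at once, which
# is what drives the "Very High" row in the table above.
```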
## Memory Monitoring
### Using the `watch` and `free` Commands
You can monitor memory usage in real time using the `watch` and `free` commands:
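```bash
free -h                                                          # system-wide memory, human-readable
watch -n 1 'ps -o pid,ppid,rss,vsize,pmem,pcpu,comm -C python'   # refresh Python process stats every second
```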
This will show you the memory usage of your system and the Python process running the analysis script.
### Using System Monitoring Tools
#### On Linux
```bash
# Open a terminal and run the analysis script
python demo4-analyze_large_health_data.py --backend pandas --format csv

# In a separate terminal, monitor the process
ps -eo pid,ppid,%cpu,%mem,rss,command | grep python
```
Or use top in a separate terminal:
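One way to focus `top` on Python processes (assumes Linux procps `top` and that a Python process is already running):

```bash
top -p "$(pgrep -d, -f python)"
```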
#### On macOS

Use Activity Monitor (in Applications > Utilities) to watch the Python process, or run `top` in a terminal.
#### On Windows
Open Task Manager (Ctrl+Shift+Esc) and go to the "Processes" tab to monitor the Python process.
These system tools provide direct monitoring without requiring any additional scripts or dependencies.
## Notes
This demo highlights the practical benefits of:

1. polars' lazy and streaming execution for large datasets
   - Uses the new streaming engine for efficient memory usage
   - Processes data in chunks without loading everything into memory
2. Parquet's efficient storage and processing capabilities
3. Choosing the right tool and format for healthcare data analysis
4. Memory usage tracking to understand resource requirements
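The lazy/streaming pattern from point 1 looks roughly like this; the file and column names are assumptions, and the exact `collect` keyword depends on your polars version:

```python
import polars as pl

# Build a lazy query; nothing executes until collect() is called.
query = (
    pl.scan_csv("demo4-large_health_data.csv")     # assumed file name
    .group_by("diagnosis")                         # assumed column name
    .agg(pl.col("bmi").mean().alias("mean_bmi"))   # assumed column name
)

# The streaming engine processes the input in chunks instead of loading
# the whole dataset into memory at once.
df = query.collect(engine="streaming")  # on older polars: collect(streaming=True)
print(df)
```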