Lecture 1: git init - Getting Started with Git, Python, and Markdown¶
Welcome to the first lecture of Applied Data Science with Python! Today we'll be covering the essential tools and concepts that will form the foundation of your data science journey.
Table of Contents¶
- Tools: `python` and `git`
- Getting set up locally
- Cloud options (GitHub Codespaces, Colab, Binder, Paperspace)
- Command Line Basics
- Terminal access on different platforms
- Basic navigation and file operations
- Chaining commands with pipes and redirection
- Markdown
- Syntax summaries
- Readme.md - make one for every repo
- git and GitHub
- Starting or cloning a repository
- Git push/pull/sync
- Branches & Conflicts
- Python
- Syntax basics
- Running python and jupyter
- Variables and control flow
- Runtime Environments
- Virtual environments
- Jupyter Notebooks
- Cloud options: Google Colab & GitHub Codespaces (no PHI in this course)
- Data Security and Ethics
- Working securely with health data
- When NOT to use cloud tools
Why Python?¶
Check: What tools have you used before? What would you like to cover in this course?

Installing tools¶
For most roles, data science happens in python and R; in this course we will focus on python.
It doesn't matter which tools you use; python and R (and other specialized tools) are quite capable. Since python and R are the most commonly used tools, knowing one or both of them will make it easier to play well with others. Don't try to be an expert in everything! Figure out which you prefer and learn to be "fluent" (able to code a solution from start to finish) in one, then you can get by being "conversational" (able to read and edit others' code) in the other.
Additionally, collaboration usually happens in git and documentation will use markdown. Luckily, those are "easy" to pick up.
Quickstart¶
Note: this is also included in the week's assignment
These are the standard options that I'll be using to demonstrate going forward. They will also give us a common base to work from, so we can focus on the work rather than tweaking/fixing our development environment.
- Sign up for an account on GitHub
- Apply for GitHub Education to get extra free hours on Codespaces and other benefits
- Install Python 3 (instructions)
- In VS Code, most commands are accessed using the "Command Palette":
  - Shift + Command + P (Mac)
  - Ctrl + Shift + P (Windows/Linux)
  - F1 (All)
- VS Code extensions:
  - Python + Jupyter (use notebooks within VS Code)
  - GitHub Repositories + Remote Repositories (manage git in VS Code instead of the terminal)
Note: If you don't want to install software locally, you can use GitHub Codespaces (recommended) or Google Colab, but never use PHI with public-facing tools.
GitHub Codespaces¶
Cloud-based development environment with VS Code in your browser:
- Benefits: No setup, consistent environment, works on any device
- Student perks: Extra free hours with GitHub Education
- Getting started: Repository → Code button → Codespaces tab → Create
- Persistence: Codespaces last weeks but not forever; commit/push often
- Fun fact: I write these lectures on my iPad using Codespaces and VS Code tunnels
GitHub Classroom¶
How we'll manage assignments in this course:
- Benefits: Automated distribution, testing, and grading; private repos
- Process: Get link → Accept assignment → Clone repo → Make changes → Push to submit
- Grading: Automated tests run on submission; feedback via issues/comments
Command Line Basics¶
Recommended Resources:¶
- LinuxCommand.org - Learning the shell
- The Missing Semester - MIT course on developer tools
- regex101.com - Regular expression testing tool
Essential commands for navigating and working with files:¶
- Navigation: `pwd` (where am I?), `ls` (what's here?), `cd` (change directory)
- Special directories: `~` (home), `.` (current), `..` (parent)
- File operations: `mkdir`, `touch`, `cp`, `mv`, `rm` (careful - no undo!)
- Viewing content: `cat`, `head`, `tail`
- Text tools: `grep` (search), `nano` (edit)
- Chaining: `|` (pipe output), `>` (redirect to file), `>>` (append to file)
Access via Terminal (Mac), WSL (Windows, recommended), or Terminal (Linux)
Health Data Science Applications¶
- Organizing patient data files: `mkdir -p patient_cohorts/{control,treatment}` (`-p` also creates the parent `patient_cohorts` directory if it does not exist yet)
- Searching clinical notes: `grep "diabetes" patient_notes.txt`
- Extracting first 10 rows of data: `head -n 10 lab_results.csv`
- Counting records by type: `grep "diagnosis" records.csv | wc -l`
- Combining data processing steps: `cat vitals.csv | grep "elevated" | sort > high_risk_patients.csv`
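The last pipeline above (filter lines, sort them, write them to a file) can be sketched in Python as well. This is a minimal sketch, not the course's reference solution: the function names `filter_and_sort` and `write_matches` and the sample rows are made up for illustration, and Python's default sort order may differ slightly from the shell's locale-aware `sort`.

```python
from pathlib import Path


def filter_and_sort(lines, keyword):
    """Keep lines containing keyword (like grep), then sort them (like sort)."""
    return sorted(line for line in lines if keyword in line)


def write_matches(source, destination, keyword):
    """Read source, filter and sort its lines, and overwrite destination (like >)."""
    lines = Path(source).read_text().splitlines()
    matches = filter_and_sort(lines, keyword)
    Path(destination).write_text("".join(f"{line}\n" for line in matches))
    return len(matches)


# Made-up rows for illustration only; not real patient data.
sample = [
    "p003,bp,elevated",
    "p001,bp,normal",
    "p002,glucose,elevated",
]
print(filter_and_sort(sample, "elevated"))
```

Calling `write_matches("vitals.csv", "high_risk_patients.csv", "elevated")` would mirror the shell one-liner, assuming a `vitals.csv` file exists in the current directory.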
LIVE DEMO!¶
Windows Subsystem for Linux (WSL)¶
For Windows users, WSL provides a Linux environment directly in Windows:
- Why use it: Consistent Unix environment, better compatibility with data science tools
- Quick install: In PowerShell (as Admin): `wsl --install`, then restart
- VS Code integration: Install the WSL extension in VS Code to work from Unix
- File access: Windows files at `/mnt/c/...`, WSL files at `\\wsl$\Ubuntu\...`
- Best terminal: Windows Terminal or VS Code's integrated terminal
Local setup¶
MacOS:
Windows:
iOS (if you're a weirdo and want to turn your iPad into a fully-fledged development environment):
- git: Working Copy
- Terminal: blink.sh
- VS Code: vscode.dev
- Jupyter: Juno (and Juno Connect to use cloud processing and GPUs)
Tools you'll need:¶
- git
  - `brew install git` (Mac)
  - WSL has git installed by default
  - GitHub Desktop has a GUI (excellent for beginners, but plenty of devs use it, too!)
  - VS Code 👇 can also manage git repositories!
- Python 3 - Data Science with Python Tutorial
  - We'll install and explore throughout the course 👇
Cloud options¶
You can run Python in lots of places, many for free:
- GitHub Codespaces (free extra hours for students with GitHub Education, can work with private repos)
- Google Colab (free for public notebooks, paid for private or higher-powered machines)
- Paperspace (free for public notebooks, paid for private or higher-powered machines)
- Binder (free, always public)
Markdown¶
Lightweight markup language for documentation, used in GitHub, Notion, and more:
Recommended Resources:
- Markdown Guide - Comprehensive reference
- Interactive Tutorial - Hands-on learning
- CommonMark tutorial - Standard Markdown tutorial
Markdown Tip: In Markdown, only use one H1 (`#`) heading per document. This helps maintain a clear document structure and improves readability. The first H1 heading typically serves as the document's title or main heading.
Key Syntax¶
- Paragraphs: Separate with blank lines
- Headers: `# H1`, `## H2`, `### H3`
- Formatting: `**bold**`, `_italic_`, `` `code` ``
- Lists:
  - Unordered: `* item` or `- item`
  - Ordered: `1. item` (numbers don't matter)
  - Checklists: `- [ ]` and `- [x]`
- Code blocks: Triple backticks (`` ``` ``)
- Links: `[text](url)`
- Blockquotes: `> quoted text`
Every repo should have a README.md to explain what it is and how to use it.
git and GitHub¶
Version control system for tracking changes and collaborating on code:

Recommended Resources:
- GitHub Foundations - THE tutorial for GitHub
- Atlassian Git Tutorial (focus on Getting Started and Collaborating)
- Markdown Guide - Markdown syntax reference
Essential Git Commands¶
- Setup: `git config --global user.name "Your Name"` and `git config --global user.email "email@example.com"`
- Starting: `git init` (new repo) or `git clone URL` (copy existing repo)
- Basic workflow:
  - `git status` (check what's changed)
  - `git add filename` (stage changes)
  - `git commit -m "Message"` (save snapshot)
  - `git push` (upload to remote) / `git pull` (download from remote)
git config --global user.email "NOT YOUR ACTUAL EMAIL"¶
GitHub (thankfully) will do its best to keep you from posting your email on the internet. They provide an anonymous remailing service with an email alias. Add that alias to your git config.

Collaboration Features¶
- Branches: Create separate workspaces with `git branch` and `git checkout`
- Pull Requests: Request code review before merging changes
- Forks: Make your own copy of someone else's repository
Important Notes¶
- Never commit sensitive info: No passwords, PHI, or PII
- Handling conflicts: Use `git restore`, `git rebase`, or `git stash` when things get messy
- GitHub alternatives: GitLab, Bitbucket, or UCSF's internal GitHub (for PHI)

LIVE DEMO!¶
Python¶
The most popular language for data science and machine learning:
Recommended Resources:
- A Whirlwind Tour of Python (free online)
- Think Python - Free book by Allen Downey
- Python for Data Analysis - For data science applications
- Python Data Science Handbook - Comprehensive guide
Quick Setup¶
- Mac: `brew install python`
- Windows: In WSL: `sudo apt install python3 python3-pip python3-venv`
Key Packages¶
- Data analysis: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine learning: scikit-learn, PyTorch, TensorFlow/Keras
- Health-specific: BioPython, Nilearn, MedPy, PyDicom
Python in Health Data Science¶
# Example: Working with patient data
patient_name = "Jane Doe" # String (text) - always anonymized for teaching
patient_age = 65 # Integer (whole number)
blood_glucose = 140.5 # Float (decimal number)
has_diabetes = True # Boolean (True/False)
# Simple analysis
if blood_glucose > 126.0:
    print(f"Patient {patient_name} has elevated blood glucose")
# List of blood pressure readings
bp_readings = [120, 122, 118, 125]
average_bp = sum(bp_readings) / len(bp_readings)
print(f"Average systolic BP: {average_bp}")
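The same variables-and-conditionals pattern extends naturally to loops and functions. The sketch below reuses the 126.0 threshold from the example above; the function name `flag_elevated` and the readings are made-up teaching values, not part of the course materials.

```python
# Threshold reused from the example above; the readings are made-up teaching values.
GLUCOSE_THRESHOLD = 126.0


def flag_elevated(readings, threshold=GLUCOSE_THRESHOLD):
    """Return the readings that are strictly above the threshold."""
    elevated = []
    for value in readings:
        if value > threshold:
            elevated.append(value)
    return elevated


glucose_readings = [98.0, 140.5, 126.0, 131.2]
elevated = flag_elevated(glucose_readings)

if elevated:
    print(f"{len(elevated)} of {len(glucose_readings)} readings are elevated: {elevated}")
else:
    print("No elevated readings")
```

Note that a reading of exactly 126.0 is not flagged, because the comparison is `>` rather than `>=`, matching the original example.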
Virtual Environments¶
Recommended Resources:
- Python Virtual Environments Primer - Detailed guide
- Python venv documentation - Official documentation
Isolated Python environments for different projects:
- Why: Avoid dependency conflicts between projects, essential for reproducible health research
- How:
  - Create: `python3 -m venv env_folder`
  - Activate: `source env_folder/bin/activate` (Mac/Linux) or `env_folder\Scripts\activate` (Windows)
  - Install: `pip install -r requirements.txt`
  - Deactivate: `deactivate`
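The create step can also be driven from Python itself through the standard library's `venv` module, and a script can check whether it is running inside an environment. This is a small sketch under stated assumptions: the helper names `in_virtual_env` and `create_env` are invented here, and the environment is created in a temporary directory so nothing is left behind.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env():
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(env_dir):
    """Create a virtual environment at env_dir and return the path to its config file."""
    # with_pip=False keeps this sketch fast; the command-line default also installs pip.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    config = create_env(Path(tmp) / "env_folder")
    print("Environment created:", config.exists())

print("Running inside a virtual environment:", in_virtual_env())
```

For day-to-day work the shell commands are usually simpler; the `in_virtual_env` check is mainly handy for confirming that activation worked before installing packages.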
Jupyter Notebooks¶
Interactive Python environment combining code, output, and documentation:
- Best practice: Clear outputs before committing to git
- Why: Prevents large file sizes and merge conflicts
- Health applications: Ideal for exploratory analysis of health data, creating shareable research, and documenting clinical data pipelines
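To make "clear outputs" concrete: a notebook file is JSON, and each code cell stores its results in an `outputs` list alongside an `execution_count`. The sketch below shows one way to blank those fields, assuming the nbformat 4 layout; the function name `clear_outputs` and the tiny hand-written notebook are illustrative only.

```python
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counts removed."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook in the nbformat 4 layout, for illustration only.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(42)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_outputs(notebook)
print(json.dumps(cleaned["cells"][1], indent=2))
```

In practice you would rarely write this yourself: the notebook interface has a clear-outputs command, and `jupyter nbconvert --clear-output --inplace notebook.ipynb` does the same from the terminal.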

LIVE DEMO!¶
Data Security and Ethics in Health Data Science¶
Recommended Resources:
- UCSF Information Commons Tools - For working with EHR data
Key Principles¶
- PHI (Protected Health Information): Any identifiable health information
- De-identification: Removing identifiers from health data
- HIPAA compliance: Legal requirements for handling health data
- Informed consent: Ensuring proper permissions for data use
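As a toy illustration of the de-identification idea, the sketch below drops a few direct identifier fields from a record. The field names and the `drop_identifiers` helper are hypothetical, and removing named fields is only a first step: dates, free-text notes, and rare combinations of values can still identify someone, so treat this as a teaching sketch rather than a compliant procedure.

```python
# Hypothetical field names chosen for this example; a real project would follow
# its institution's de-identification policy rather than this short list.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "mrn"}


def drop_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of record without the listed identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Made-up record for teaching; not real patient data.
record = {
    "name": "Jane Doe",
    "mrn": "000000",
    "age": 65,
    "blood_glucose": 140.5,
}

print(drop_identifiers(record))
```

The function returns a new dictionary and leaves the original untouched, which makes it harder to overwrite source data by accident.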
Tool Considerations¶
- Local vs. Cloud: When to keep data on local, secured systems
- Public tools: Never use Google Colab, GitHub, etc. with PHI
- Secure alternatives: UCSF's secure computing environments, private instances
- Data minimization: Only use the data you need for your specific purpose
Best Practices¶
- Always encrypt sensitive data and minimize "data surface"
- Use secure authentication (MFA where possible)
- Document your data handling procedures
- Consult with privacy experts when in doubt
- Consider ethical implications beyond legal requirements
Assignment¶
GitHub Classroom Overview¶
- What: Platform for distributing, submitting, and grading assignments
- How: Accept assignment link → Get private repo → Make changes → Push to submit
- Benefits: Automated testing, private repos, direct feedback
Assignment Tasks¶
- Create README.md with:
  - Brief introduction (first name only)
  - What you hope to get from the course
  - Music recommendation with link
- Write Python script that:
  - Takes email address as command line argument
  - Hashes it using specified algorithm
  - Outputs to 'hash.email' file
- Submit via git push (auto-graded)
Check: How's my driving? What's still confusing?
It came from the Internet¶
Thanks this week to Data Science Weekly Newsletter
Data teams¶
Should You Measure the Value of a Data Team?
Data teams are sometimes asked to prove their ROI to senior leadership to justify a budget for new hires, tools, projects, or process changes.
https://medium.com/the-prefect-blog/should-you-measure-the-value-of-a-data-team-95c447f28d4a
Data scientists work alone and that's bad | Ethan Rosenthal
In Need of a Good Editor Growing up, I had always considered myself a decent writer based on my decent grades in English class.
https://www.ethanrosenthal.com/2023/01/10/data-scientists-alone/
Tooling updates¶
Beyond Pandas - working with big(ger) data more efficiently using Polars and Parquet
As data scientists/engineers, we often deal with large datasets that can be challenging to work with.
https://medium.com/data-analytics-at-nesta/beyond-pandas-working-with-big-ger-data-more-efficiently-using-polars-and-parquet-fd980353cc2
SQL should be your default choice for data engineering pipelines
Originally posted: 2023-01-30.
https://www.robinlinacre.com/recommend_sql/
Data science in practice¶
I Used Computer Vision To Destroy My Childhood High Score in a DS Game
I train an object detection model to control my computer to play a minigame running in a DS emulator endlessly.
https://betterprogramming.pub/using-computer-vision-to-destroy-my-childhood-high-score-in-a-ds-game-38ebd53a1d64
Data Cleaning Plan
Data cleaning or data wrangling is the process of organizing and transforming raw data into a dataset that can be easily accessed and analyzed.
https://cghlewis.github.io/mpsi-data-training/training_4.html