Course Information

  • Instructor: Christopher Seaman
  • EAs: Marlene Lin & Nya Campbell
  • Dates: April 3rd - June 4th, 2025 (10 class meetings)
  • Lecture: Wednesday, 9:00 - 11:00 AM, Mission Hall 1400
  • Lab: Wednesday, 11:00 AM - 12:30 PM, Mission Hall 1400

Overview

At the conclusion of this course, students will be able to:

  • Develop Projects with Python and Data Science Tools: Proficiently use essential data science technologies and processes common in industry, including Python, Git, SQL, Pandas, and data visualization libraries, as a foundation for conducting data analysis in real-world scenarios.
  • Apply Machine Learning to Solve Real-World Problems: Apply machine learning techniques, including data cleaning, classification, and time series analysis, to real-world data science challenges, with emphasis on practical application in fields like health data science.
  • Implement Advanced Data Science Concepts: Implement advanced data science solutions, including generative AI and Large Language Model (LLM) development, with a focus on their relevance to healthcare and health data analysis.
  • Communicate Data Effectively: Communicate data insights through visualization and storytelling, a crucial skill for conveying findings in health data science contexts.

Prerequisites

Familiarity with programming concepts, including loops, variables, and functions. Ideally, hands-on experience writing and running scripts in Python, R, Bash, or another programming language.

Students would also benefit from familiarity with:

  • Markdown
  • Python
  • Git + GitHub
  • Jupyter notebooks
  • Visual Studio Code

Instruction and materials will be provided to aid students in getting to a common baseline during the first lecture.

Format

Each week will focus on a different area of applied data science. See the list of lectures by week for more details.

Lectures will provide an overview of the new concepts and tools introduced that week. Each lecture closes with a hands-on exercise, with solutions due at the following lecture.

Labs will not introduce new material; instead, they provide a forum for collaboration between students and staff to help each other with the current material.

Students are encouraged to collaborate in small groups, but may also work independently. The class culminates with a multi-week project applying advanced techniques or combining applications from multiple focus areas.

Grading

Final grades will be based on the submitted exercises (60%) and final project (40%).

Materials

This course will utilize freely available and open-source materials.

UCSF Course Materials

  • CLE Dashboard - Syllabus, discussion, and lecture recordings
  • GitHub - Exercises and code, where assignments will be submitted

Lectures

Included topics may vary depending on student interests.

Getting Started with Git, Markdown, and Python

  • Version Control using Git: Fundamentals of version control using Git, emphasizing its role in collaborative data science projects; how version control facilitates teamwork, tracks changes, and ensures project integrity
  • Markdown Basics: Explore the basics of Markdown, a lightweight markup language, for creating clear and concise documentation
  • Python Environments: Setting up a Python environment, installation of libraries, and creating virtual environments

Pandas for Data Manipulation and Analysis

  • Loading and Handling Data: Load data using different formats
  • Handling Large Datasets: Common challenges that arise when working with big data (larger than will fit in memory)
  • Exploratory Data Analysis: Exploratory analysis to understand data distributions, patterns, and outliers; extract meaningful insights
  • Data Munging and Cleaning: Handling missing values, outliers, and ensuring data quality
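
The loading and cleaning steps above can be sketched in a few lines of Pandas. The patient dataset here is hypothetical, invented purely for illustration; filling with the median and flagging (rather than silently dropping) implausible values are two common cleaning choices:

```python
import io
import pandas as pd

# Hypothetical CSV with a missing value and an implausible reading
csv_text = """patient_id,age,systolic_bp
1,34,120
2,51,
3,29,118
4,47,400
"""

df = pd.read_csv(io.StringIO(csv_text))

# Fill the missing blood pressure with the column median
df["systolic_bp"] = df["systolic_bp"].fillna(df["systolic_bp"].median())

# Flag implausible readings as outliers rather than silently dropping them
df["bp_outlier"] = df["systolic_bp"] > 250

print(df)
```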

SQL Fundamentals: Querying and Manipulating Data

  • Loading and Inspecting Datasets: Introduction to SQL for loading and inspecting datasets
  • Structure of SQL Statements: Break down statements including WHERE, ORDER BY, and GROUP BY, to efficiently query and organize data.
  • Joins: Techniques for integrating information from multiple tables
  • Data Exploration: Generating summary statistics and visualizations
  • Advanced Concepts: Common table expressions, stored procedures, and window functions
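
The joining and grouping concepts above can be sketched with Python's built-in sqlite3 module. The two tables and their contents are hypothetical; the query joins them, then groups by patient to summarize visit costs:

```python
import sqlite3

# In-memory database with two hypothetical tables
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visits (patient_id INTEGER, cost REAL);
INSERT INTO patients VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO visits VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# JOIN the tables, then GROUP BY patient to summarize visit costs
rows = con.execute("""
    SELECT p.name, COUNT(*) AS n_visits, SUM(v.cost) AS total_cost
    FROM patients p
    JOIN visits v ON v.patient_id = p.id
    GROUP BY p.name
    ORDER BY total_cost DESC
""").fetchall()

print(rows)  # [('Ada', 2, 150.0), ('Grace', 1, 75.0)]
```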

Classification Models

  • Learning Paradigms: Overview of supervised, unsupervised, and reinforcement learning
  • Supervised Learning: Widely-used supervised classification algorithms, including logistic regression, decision trees, and random forests
  • Unsupervised Learning: Examination of unsupervised learning, including clustering and dimensionality reduction methods
  • Evaluating Model Performance: Methods for assessing model performance through cross-validation and appropriate metrics for classification tasks
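
A minimal sketch of the supervised workflow above, shown here with scikit-learn's bundled breast cancer dataset as a stand-in for real course data (any labeled dataset works): scale the features, fit a logistic regression classifier, and score it with 5-fold cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```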

Generative AI with Images

  • GANs (Generative Adversarial Networks): GANs generate realistic data by training a generator and a discriminator against each other in an adversarial setting.
  • Diffusion Models: Diffusion models, such as those based on denoising score matching, offer an alternative approach to generative modeling: they model the gradual evolution of a probability distribution over time.
  • VAEs (Variational Autoencoders): VAEs learn a probabilistic latent representation of data, in contrast with GANs and diffusion models, which do not explicitly model a latent space. VAEs are useful when you want a structured latent space and need to generate data with specific characteristics.

Large Language Models

  • Advanced Textual Comprehension: Large Language Models (e.g., GPT and BERT) excel in comprehending and generating human-like text, showcasing semantic grasp and nuanced language interpretation
  • Architecture: The transformative role of transformers and attention mechanisms in constructing contextual language understanding, with a brief detour through information geometry and manifolds
  • Training: Process for building an LLM from scratch
  • Fine-tuning: Techniques for refining massive general models to suit subject-specific tasks, drawing parallels with approaches used in diffusion models and VAEs
  • Prompt Engineering: Strategies for crafting prompts and context to effectively extend LLMs beyond their training scope (e.g., one-shot and zero-shot learning)
  • Model Chains: Creating complex agents by chaining LLMs in sequence, including analysts and personal assistants

Experimentation & A/B Testing

  • Metrics and Product Analytics: Key metrics and product analytics relevant to experimentation and A/B testing, including their role in assessing the performance and impact of changes.
  • Research Design: Developing a robust research design for A/B testing, including the identification of variables, definition of control and experimental groups, and consideration of randomization.
  • Analysis and Results: Techniques for analyzing A/B test results, interpreting statistical significance, and drawing meaningful conclusions. Presentation of results to stakeholders and decision-makers.
  • Practical Considerations and Pitfalls: Practical aspects and common challenges in experimentation and A/B testing: potential pitfalls, ethical considerations, and factors that may impact the reliability of results.
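
The analysis step above can be sketched as a two-proportion z-test, implemented here from scratch with only the standard library. The conversion counts are hypothetical, and the normal CDF is computed via the error function:

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test comparing the conversion rates of groups A and B."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via the error function; p-value is two-sided
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/1000 conversions in control, 150/1000 in variant
z, p = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these numbers the difference is just significant at the conventional 0.05 level, which is exactly the kind of borderline result the "pitfalls" bullet above warns about (e.g., stopping an experiment early because p dipped below the threshold).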

Data Visualization and Communication

  • Principles of Effective Data Visualization: Foundational principles of impactful data visualization, emphasizing clarity, interpretability, and compelling storytelling
  • Introduction to Data Visualization Libraries: Popular data visualization libraries such as Matplotlib, Seaborn, and Plotly
  • Interactive Visualizations: Techniques for creating interactive visualizations on the web, in BI tools, and in shared Jupyter notebooks

Time Series Analysis and Forecasting

  • Understanding Time Series Data: Unique characteristics of time series data, focusing on temporal patterns and dependencies.
  • Exploratory Analysis: Extracting meaningful patterns and trends from time series data
  • Forecasting: Forecasting techniques, including ARIMA and exponential smoothing
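
Simple exponential smoothing, the simplest of the forecasting techniques above, can be sketched in a few lines: each smoothed value blends the newest observation with the previous smoothed value, weighted by a parameter alpha. The weekly case counts here are hypothetical:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each point blends the newest
    observation with the previous smoothed value, weighted by alpha."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical weekly case counts
cases = [10, 12, 13, 12, 15, 16, 18]
smooth = exponential_smoothing(cases, alpha=0.5)
print([round(s, 2) for s in smooth])
```

A larger alpha tracks recent observations more closely; a smaller alpha produces a smoother, slower-moving series.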

Current Data Science Landscape

  • Recent Trends: What’s currently happening in the field and how to stay up-to-date.
  • People, Process, and Tools: Job roles and responsibilities in data science, how systematic and iterative workflows work, and popular tools/technologies used.
  • Deploying to Production: The crucial steps involved in deploying a model to the cloud, including considerations for multi-model pipelines.
  • The Dreaded Technical Interview: Get familiar with what’s usually covered, common challenges, and ways to practice.