Syllabus

General Information

Prerequisites

We assume students have taken or are taking a probability and statistics course and have basic programming skills.

Students not matriculated in an HSPH Biostatistics graduate program (HDS SM60, BIO SM80 / SM60 / SM1, and CBQG SM80) must score at least 90% on a basic math and programming diagnostic test to enroll in the course. If you are in an HSPH Biostatistics graduate program and score less than 90%, we will contact you to offer supplementary resources to help you prepare for the course.

Textbooks

Course Description

This course introduces the following:

  • UNIX/Linux shell
  • Reproducible document preparation with RStudio, knitr, and markdown
  • Version control with git and GitHub
  • R programming
  • Data wrangling with dplyr and data.table
  • Data visualization with ggplot2

We also demonstrate how the following concepts are applied in data analysis:

  • Probability theory
  • Statistical inference and modeling
  • High-dimensional data techniques
  • Machine learning

We do not cover the theory and details of these methods as they are covered in other courses.

Throughout the course, we use motivating case studies and data analysis problem sets based on challenges similar to those you encounter in scientific research.

Weekly Course Structure

  • Monday lectures: We describe the concepts, methods, and skills needed for problem sets.
  • Wednesday labs: We work together on problem sets.
  • Friday: Problem sets due (see Key Dates and Problem Sets).

Please ensure that you read the chapters listed in the syllabus before each Monday. The lectures are designed with the assumption that you have completed the readings, enabling us to dive deeper into the nuances of data analysis and coding.

Lectures will not be recorded.

We will have a Slack workspace for you to ask questions during and after class.

Grade Distribution

Component            Weight
10 Problem Sets      30%
4 Oral Evaluations   20%
Midterm 1            10%
Midterm 2            20%
Final Project        20%
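
As an illustration of how these weights combine, here is a minimal sketch in Python. The weights are copied from the table above; the function name `course_grade`, the 0-100 scale for each component, and the example scores are assumptions made for illustration, not an official grading script.

```python
# Weights (percent of the final grade), copied from the Grade Distribution table.
WEIGHTS = {
    "problem_sets": 30,
    "oral_evaluations": 20,
    "midterm_1": 10,
    "midterm_2": 20,
    "final_project": 20,
}


def course_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` is a hypothetical dict with the same keys as WEIGHTS,
    each value being that component's score on a 0-100 scale.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must contain exactly the components in WEIGHTS")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up scores, for illustration only.
example = {
    "problem_sets": 90,
    "oral_evaluations": 80,
    "midterm_1": 70,
    "midterm_2": 85,
    "final_project": 95,
}
print(course_grade(example))  # prints 86.0
```

Because the weights sum to 100, a student who scores 100 on every component receives a course grade of 100.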

Problem Sets

Problem sets will be due every week or every other week, depending on difficulty. Each is due at 11:59 PM on the date listed on the Problem Sets page.

Some problem sets include open-ended questions that will be difficult to answer on your own. We will work on these together during Wednesday labs. We also offer office hours where you can get help with unanswered questions.

Problem sets must be submitted via GitHub. Students are required to have a GitHub account and create a repository for the course. We will be providing further instructions during the first lab.

10% of the total points for the problem sets will be deducted for every late day. Each student has a total of 4 late days to use without penalty during the entire semester, with no need to provide a written excuse. Providing an excuse does not give you more days unless an accommodation is requested and approved by the Office of Student Affairs (this includes COVID).
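
To make the late-day arithmetic concrete, here is a minimal sketch in Python. It rests on assumptions the policy above does not spell out: that the 4 penalty-free days are used up in submission order, that the 10% deduction applies to the points of the problem set that was late, and that the deduction is capped at 100%. The function names are hypothetical, and the teaching staff's actual bookkeeping may differ.

```python
FREE_LATE_DAYS = 4            # from the syllabus: penalty-free late days per semester
PENALTY_PERCENT_PER_DAY = 10  # from the syllabus: 10% deducted per late day


def penalized_late_days(late_days_per_pset):
    """Return, per problem set, the late days not covered by free days.

    Assumption: free late days are consumed in submission order.
    """
    remaining = FREE_LATE_DAYS
    penalized = []
    for days in late_days_per_pset:
        covered = min(days, remaining)  # free days applied to this problem set
        remaining -= covered
        penalized.append(days - covered)
    return penalized


def penalty_percent(penalized_days):
    """Percent of a problem set's points deducted (assumption: capped at 100)."""
    return min(100, PENALTY_PERCENT_PER_DAY * penalized_days)


# Made-up example: late by 1, 0, 2, 3, and 0 days on five problem sets.
late = [1, 0, 2, 3, 0]
print(penalized_late_days(late))  # prints [0, 0, 0, 2, 0]
print([penalty_percent(d) for d in penalized_late_days(late)])  # prints [0, 0, 0, 20, 0]
```

In this example the first three late days are free, one of the three late days on the fourth problem set uses the last free day, and the remaining two days cost 20% of that problem set's points.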

Problem set submissions must be completely reproducible Quarto documents. If your Quarto file does not render, it will count as a late day; you will be notified and will need to resubmit a Quarto file that does render. You will be charged a further late day for every day it takes you to turn in a Quarto file that renders. You are required to check emails that come through the Canvas system, as this is the only way we will communicate problems with your problem sets.

Oral Evaluations

Oral evaluations account for 20% of your final grade, divided into four separate evaluations (5% each) throughout the semester. Each student must schedule a 15-minute Zoom meeting with their assigned Teaching Fellow (TF) to discuss their problem set solutions and demonstrate understanding of their code and approach.

Key Details:

  • Four evaluations covering 2 or 3 problem sets each
  • Students are assigned to TFs based on the first letter of their last name
  • You can only schedule a meeting after completing the problem sets included in that evaluation period
  • Meetings are conducted via Zoom and scheduled through dedicated Slack channels

During each 15-minute session, your TF may ask you to:

  • Explain specific sections of your code
  • Walk through your problem-solving approach
  • Justify your choice of methods or functions
  • Discuss challenges you encountered and how you solved them

Grading Scale: 0-5 points. Scores from 1/5 (no understanding) to 5/5 (excellent understanding with minimal gaps) reflect your demonstrated understanding. Not scheduling a meeting results in 0/5.

See the Oral Evaluations page for complete details including scheduling instructions, TF assignments, and full grading criteria.

Midterm Policy

Both midterms are closed book, no internet, and in-class. You are expected to complete them in 1 hour.

They will be very similar to the problem sets and the examples given in class.

Please make sure you can come to class on the midterm dates provided in the Key Dates table below. If you miss an exam, you will need approval from the Office of Student Affairs to take a make-up exam. All make-up exams will be completely different from the in-class ones.

Final Project

You will be assigned to a team of 4 or 5 students and a Teaching Fellow (TF) who will supervise your project throughout the semester.

Grading: The final project accounts for 20% of your course grade, divided as follows:

  • Final Project Report: 10% of course grade
  • Oral Presentation: 10% of course grade

Project Overview

Your team will select from a list of available data science projects or propose a custom research topic (subject to TF approval). You will prepare a comprehensive 2,500-3,000 word academic-style report with sections including Abstract, Introduction, Methods, Results, and Discussion.

Key Requirements

  • Team Collaboration: Work in assigned teams of 4-5 students
  • Reproducible Analysis: Submit via GitHub repository with all code and data
  • Oral Presentation: 20-minute team presentation scheduled between December 16 and 19
  • Individual Accountability: All team members must participate and demonstrate understanding of their contributions

AI Policy

Students may use AI tools as complementary aids but must clearly document where AI was used in their analysis or writing.

Important: You should start working on your project after the first midterm. Do not wait until the last week. Teaching staff will be available during office hours to provide guidance.

For complete details including team assignments, project options, submission guidelines, grading rubrics, and formatting requirements, see the Final Project page.

ChatGPT Policy

You can use ChatGPT however you want. Remember that you will not be able to use it during the midterms.

Key Dates

Date     Event
Oct 13   No class: Indigenous Peoples’ Day
Oct 15   Midterm 1: covers material from Sep 03-Oct 08
Oct 22   Start final project
Nov 10   No class: Veterans’ Day
Nov 24   Midterm 2: covers material from Sep 03-Nov 19
Nov 26   No class: Thanksgiving Recess Begins
Dec 19   Final Project due