Course description

This course introduces the UNIX/Linux shell, version control with git and GitHub, R programming, data wrangling with dplyr and data.table, data visualization with ggplot2 and shiny, and reproducible document preparation with RStudio, knitr and markdown. We briefly introduce Monte Carlo simulations, statistical modeling, high-dimensional data techniques, and machine learning, and show how these are applied to real data. Throughout the course, we use motivating case studies and data analysis problem sets based on challenges similar to those you encounter in scientific research.

Lectures will be mostly live coding. We will go over exercises and challenges together but will pause 1-4 times per lecture so students can complete exercises on their own. The midterm questions will be selected from the exercises presented in class. Some time will be dedicated to answering problem set questions. Lectures will not be recorded.

Students are required to have a GitHub account and create a repository for the course.

Problem sets are mostly composed of open-ended questions. Submissions should take the form of a scientific report and need to be completely reproducible. Specifically, students are expected to upload a Quarto document to their GitHub class repository that graders can compile into a readable report.