Introduction to Wrangling

2024-09-30

Data Wrangling

  • The two datasets used in the problem sets are already tidy data frames.

  • However, very rarely in a data science project is data already available in this form.

  • Much more typical is for the data to be in a file, a database, or extracted from a document, including web pages, tweets, or PDFs.

  • As a result, data might be unstructured in complex ways.

Data Wrangling

  • Data wrangling is what we call the process of structuring data from it’s original state into a form that permits us to focus on analysis.

  • Tidy data is an example of a form that permits us to focus on analysis.

  • We focus on tidy data as a target, but we can have other forms as targets, such as matrices.

Data Wrangling

Data wrangling can involve several complicated steps such as:

  • extracting data from a file,
  • converting nested key-value pairs into a data frame,
  • integrating data from different source, and
  • constructing requests for data bases.

Data Wrangling

Today we briefly discuss five concepts/tools considered essential for data wrangling:

  • Importing data from files

  • RESTful APIs

  • Joining tables

  • html parsing

  • working with dates and times

  • locales

Data Wrangling

  • We barely scratch the surface on these topics.

  • Rarely are all these relevant for a single analysis.

  • You will likely face them all at some point.

  • Lecture goal is to make you aware of challenges, tools to tackle them, and help you learn how to learn more.

Further learning

  • SQL widely used in data-intensive industries to manage and manipulate large databases.

  • In R, dplyr functions like filter, select, and the joins we will learn this week, mirror SQL operations.

  • By learning dplyr, you’ve already covered many key SQL concepts and makes the transition to SQL easier.

  • Recommended SQL Resources: W3Schools, Codecademy, Khan Academy, and SQLZoo