2024-09-23
The tidyverse is not a package but a group of packages developed to work with each other.
The tidyverse makes data analysis simpler and code easier to read by sacrificing some flexibility.
One way code is simplified is by ensuring all functions take and return tidy data.
Stored in a data frame.
Each observation is exactly one row.
Variables are stored in columns.
Not all data can be represented this way, but a very large subset of data analysis challenges are based on tidy data.
Assuming data is tidy simplifies coding and frees up our minds for statistical thinking.
      country year fertility
1     Germany 1960      2.41
2 South Korea 1960      6.16
3     Germany 1961      2.44
4 South Korea 1961      5.99
5     Germany 1962      2.47
6 South Korea 1962      5.79

      country 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
1     Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17 2.04
2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62 4.53
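The second table holds the same numbers as the first but is not tidy: each row contains several observations, and the year variable is stored in the column names. A minimal sketch of the reshaping, assuming tidyr's pivot_longer and retyping only the first three years from the table above (the object names wide and tidy are made up for illustration):

```r
library(tidyr)

# Wide table: one row per country, one column per year (values from the notes)
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep "1960" etc. as column names
)

# Reshape so that each row is one country-year observation
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "fertility")
tidy$year <- as.integer(tidy$year)  # column names arrive as character
tidy
```

The rows come out grouped by country rather than by year, but the content matches the tidy table.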
tibble - improves data frame class.
readr - import data.
dplyr - used to modify data frames.
ggplot2 - simplifies plotting.
tidyr - helps convert data into tidy format.
stringr - string processing.
forcats - utilities for categorical data.
purrr - tidy version of apply functions.
In this lecture we focus mainly on dplyr.
In particular the following functions:
mutate
select
across
filter
group_by
summarize
mutate
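A minimal sketch of the kind of mutate call this slide discusses. The data frame murders_small is a hypothetical six-row subset retyped from the tables later in these notes, and the per-100,000 scaling is inferred from the rate column shown there:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
)

# total and population are looked up as columns of murders_small
murders_small <- mutate(murders_small, rate = total / population * 100000)
murders_small
```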
Notice that here we used total and population inside the function, which are objects that are not defined in our workspace.
This is known as non-standard evaluation, where the context is used to determine what the variable names mean.
The tidyverse uses non-standard evaluation extensively.
This can create confusion but it certainly simplifies code.
filter
select
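A short sketch of both functions: filter keeps rows that satisfy a condition, and select keeps columns by name. It uses a hypothetical murders_small subset retyped from the tables in these notes; the cutoff 0.71 is an arbitrary value chosen for illustration:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
)
murders_small <- mutate(murders_small, rate = total / population * 100000)

# Rows: keep states whose rate is at most 0.71 per 100,000
low_rate <- filter(murders_small, rate <= 0.71)

# Columns: keep only the three named variables
three_cols <- select(murders_small, state, region, rate)
```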
The function mutate can also be used to transform variables.
For example, the following code takes the log transformation of the population variable:
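A sketch of such a transformation on a hypothetical three-row data frame (values from the tables in these notes). Base 10 is an assumption; the notes do not say which logarithm is meant:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii"),
  population = c(625741, 1316470, 1360301),
  total = c(2, 5, 7)
)

# Overwrite population with its base-10 logarithm
logged <- mutate(murders_small, population = log10(population))
logged
```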
Often, we need to apply the same transformation to several variables.
The function across facilitates the operation.
For example, if we want to log transform both population and total murders we can use:
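A sketch of across applied to two named columns, again on a hypothetical three-row data frame and with log10 as an assumed choice of logarithm:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii"),
  population = c(625741, 1316470, 1360301),
  total = c(2, 5, 7)
)

# Apply log10 to population and total in a single mutate call
logged <- mutate(murders_small, across(c(population, total), log10))
logged
```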
The helper functions come in handy when using across.
An example is if we want to apply the same transformation to all numeric variables:
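One such helper is where, which selects columns by a predicate. A sketch on the same kind of hypothetical data frame, assuming log10 as the transformation:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii"),
  population = c(625741, 1316470, 1360301),
  total = c(2, 5, 7)
)

# where(is.numeric) picks every numeric column; state is left untouched
logged <- mutate(murders_small, across(where(is.numeric), log10))
logged
```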
|> or %>%
We use the pipe to chain a series of operations.
For example if we want to select columns and then filter rows we chain like this:
\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]
          state        region      rate
1        Hawaii          West 0.5145920
2          Iowa North Central 0.6893484
3 New Hampshire     Northeast 0.3798036
4  North Dakota North Central 0.5947151
5       Vermont     Northeast 0.3196211
The object on the left of the pipe is used as the first argument for the function on the right.
The second argument becomes the first, the third the second, and so on…
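A sketch of how the pipe rewrites calls, using the native pipe (R 4.1 or later) and a hypothetical murders_small subset retyped from the tables in these notes; the cutoff 0.71 is chosen only for illustration:

```r
library(dplyr)

# x |> f() is the same call as f(x)
piped  <- 16 |> sqrt() |> log2()
nested <- log2(sqrt(16))

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
)

# original data -> mutate -> select -> filter
result <- murders_small |>
  mutate(rate = total / population * 100000) |>
  select(state, region, rate) |>
  filter(rate <= 0.71)

# The data frame is supplied by the pipe, so select() only lists the columns
same <- identical(murders_small |> select(state, region),
                  select(murders_small, state, region))
```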
We use the dplyr summarize function, not to be confused with summary from R base.
Here is an example of how it works:
No, the rate is NOT the average of rates.
It is the total murders divided by total population:
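A sketch contrasting the two computations on a hypothetical six-row subset (values from the tables in these notes). The column names avg and rate are chosen for illustration:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
)

# Average of the state rates: every state counts equally, whatever its size
avg_of_rates <- murders_small |>
  mutate(rate = total / population * 100000) |>
  summarize(avg = mean(rate))

# Overall rate: total murders divided by total population
overall <- murders_small |>
  summarize(rate = sum(total) / sum(population) * 100000)
```

The two numbers differ because the average of rates ignores how many people live in each state.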
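Several summaries can be computed in one call, here the median, minimum, and maximum population:
   median    min      max
1 4339367 563626 37253956
Warning: Using a function that returns more than one number within summarize will soon be deprecated. For multiple values per group, use reframe.
To get the min, median, and max in separate columns, we have to define a function that returns a data frame and then call it within summarize.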
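A sketch of both routes on a hypothetical six-row subset (values from the tables in these notes). The body of median_min_max is a guess at what such a function could look like, and reframe assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582)
)

# Returns a one-row data frame, so summarize gets one column per summary
median_min_max <- function(x) {
  data.frame(median = median(x), min = min(x), max = max(x))
}

one_row <- murders_small |> summarize(median_min_max(population))

# reframe allows several rows per group: here one row per requested quantile
three_rows <- murders_small |>
  reframe(quantiles = quantile(population, c(0.5, 0, 1)))
```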
Let’s compute murder rate by region.
Take a close look at this output:
# A tibble: 4 × 6
# Groups:   region [2]
  state    abb   region population total  rate
  <chr>    <chr> <fct>       <dbl> <dbl> <dbl>
1 Alabama  AL    South     4779736   135  2.82
2 Alaska   AK    West       710231    19  2.68
3 Arizona  AZ    West      6392017   232  3.63
4 Arkansas AR    South     2915918    93  3.19
Note the Groups: region [2] at the top; only two of the four regions appear in the rows shown.
This is a special data frame called a grouped data frame.
summarize will behave differently when acting on this object.
# A tibble: 4 × 2
  region         rate
  <fct>         <dbl>
1 Northeast      2.66
2 South          3.63
3 North Central  2.73
4 West           2.66
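A sketch of the kind of grouped summary that yields a table like the one above, run here on a hypothetical six-row subset (values from the tables in these notes), so only three regions appear and the numbers differ from the full data:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
)

# group_by only marks the groups; the rows themselves are unchanged
grouped <- murders_small |> group_by(region)

# summarize then returns one row per group
by_region <- grouped |>
  summarize(rate = sum(total) / sum(population) * 100000)
by_region
```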
The summarize function applies the summarization to each group separately.
The same works with the median_min_max function previously defined.
You can also summarize a variable but not collapse the dataset.
We use mutate instead of summarize.
Here is an example where we add a column with the population in each region and the number of states in the region, shown for each state.
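A sketch of a grouped mutate on a hypothetical six-row subset (values from the tables in these notes). The new column names region_pop and n_states are made up for illustration:

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582)
)

# Each state keeps its row; the group-level values are repeated within a region
with_region <- murders_small |>
  group_by(region) |>
  mutate(region_pop = sum(population), n_states = n()) |>
  ungroup()
with_region
```

Calling ungroup at the end returns an ordinary tibble, so later operations are no longer applied per region.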
pull
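dplyr functions return data frames, even when the result is a single column or a single number. A sketch of how pull extracts a column as a plain vector, using a hypothetical three-row data frame (values from the tables in these notes):

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii"),
  population = c(625741, 1316470, 1360301),
  total = c(2, 5, 7)
)

# summarize returns a one-row data frame
rate_df <- murders_small |>
  summarize(rate = sum(total) / sum(population) * 100000)

# pull returns the column itself as a numeric vector
rate_value <- rate_df |> pull(rate)
```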
          state abb        region population total      rate
1       Vermont  VT     Northeast     625741     2 0.3196211
2 New Hampshire  NH     Northeast    1316470     5 0.3798036
3        Hawaii  HI          West    1360301     7 0.5145920
4  North Dakota  ND North Central     672591     4 0.5947151
5          Iowa  IA North Central    3046355    21 0.6893484
6         Idaho  ID          West    1567582    12 0.7655102
desc:
                 state abb        region population total      rate
1 District of Columbia  DC         South     601723    99 16.452753
2            Louisiana  LA         South    4533372   351  7.742581
3             Missouri  MO North Central    5988927   321  5.359892
4             Maryland  MD         South    5773552   293  5.074866
5       South Carolina  SC         South    4625364   207  4.475323
6             Delaware  DE         South     897934    38  4.231937
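The tables above look like sorted output, but the calls that produced them are not shown. A sketch assuming dplyr's arrange together with desc, on a hypothetical six-row subset (values from the tables in these notes):

```r
library(dplyr)

# Hypothetical subset; values copied from the tables in these notes
murders_small <- data.frame(
  state = c("Vermont", "New Hampshire", "Hawaii", "North Dakota", "Iowa", "Idaho"),
  region = c("Northeast", "Northeast", "West", "North Central", "North Central", "West"),
  population = c(625741, 1316470, 1360301, 672591, 3046355, 1567582),
  total = c(2, 5, 7, 4, 21, 12)
) |>
  mutate(rate = total / population * 100000)

# Ascending by rate (the default)
ascending <- murders_small |> arrange(rate)

# Descending by rate
descending <- murders_small |> arrange(desc(rate))

# Nested sort: by region first, then by descending rate within each region
nested <- murders_small |> arrange(region, desc(rate))
```

Here region is a character column, so regions sort alphabetically; with a factor, the level order is used instead.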
                  state abb    region population total       rate
 1         Pennsylvania  PA Northeast   12702379   457  3.5977513
 2           New Jersey  NJ Northeast    8791894   246  2.7980319
 3          Connecticut  CT Northeast    3574097    97  2.7139722
 4             New York  NY Northeast   19378102   517  2.6679599
 5        Massachusetts  MA Northeast    6547629   118  1.8021791
 6         Rhode Island  RI Northeast    1052567    16  1.5200933
 7                Maine  ME Northeast    1328361    11  0.8280881
 8        New Hampshire  NH Northeast    1316470     5  0.3798036
 9              Vermont  VT Northeast     625741     2  0.3196211
10 District of Columbia  DC     South     601723    99 16.4527532
11            Louisiana  LA     South    4533372   351  7.7425810