2024-09-16
Use install.packages to install the dslabs package.
Try out the following functions: sessionInfo, installed.packages.
Much of what we do in R is based on prebuilt functions.
Many are included in automatically loaded packages: stats, graphics, grDevices, utils, datasets, methods.
This subset of the R universe is referred to as R base.
Very popular packages not included in R base: ggplot2, dplyr, tidyr, and data.table.
Important
For problem set 2 you can only use R base.
Examples of prebuilt functions that we will use today: ls, rm, library, search, factor, list, exists, str, typeof, and class.
You can see the raw code for a function by typing it without the parentheses: type ls on your console to see an example.
In R you can use ? or help to learn more about functions.
For example, you can learn about the function ls using help("ls") or ?ls.
Define a variable and use ls to see if it’s there. Also take a look at the Environment tab in RStudio.
Use rm to remove the variable you defined.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces.
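A minimal sketch of the workspace functions mentioned above: it defines a variable, checks that it is there, and removes it again. The name murder_rate_example and its value are purely illustrative; the name simply follows the naming convention just described.

```r
# Define a variable with a descriptive, lower-case name that uses underscores
murder_rate_example <- 2.5

# ls() lists the objects in the current environment; exists() checks for one by name
"murder_rate_example" %in% ls()
exists("murder_rate_example")

# rm() removes the variable again
rm(murder_rate_example)
exists("murder_rate_example")
```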
For more we recommend this guide.
The main data types in R are:
One-dimensional vectors: numeric, integer, logical, complex, character.
Factors
Lists: this includes data frames.
Arrays: Matrices are the most widely used.
Date and time
tibble
S4 objects
Many errors in R come from confusing data types.
str stands for structure and gives us information about an object.
typeof gives you the basic data type of the object. It reveals the lower-level, more fundamental type of an object in R’s memory.
class returns the class attribute of an object. The class of an object is essentially its type at a higher, often user-facing level.
Let’s see some examples. Applying typeof, class, and str to the murders data frame from dslabs gives:
[1] "list"
[1] "data.frame"
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
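The same three functions can be tried on any object. Here is a minimal sketch using a small made-up data frame (the object toy and its values are illustrative, not real data):

```r
# A small made-up data frame; toy and its values are illustrative only
toy <- data.frame(
  name = c("a", "b", "c"),
  group = factor(c("x", "y", "x")),
  value = c(1.5, 2.5, 3.5)
)

typeof(toy)        # "list": a data frame is stored as a list of columns
class(toy)         # "data.frame"
str(toy)           # compact summary of every column

typeof(toy$group)  # "integer": factors are stored as integer codes
class(toy$group)   # "factor"
```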
Data frames are the most common class used in data analysis. They are like spreadsheets.
Usually, rows represent observations and columns represent variables.
Each variable can be a different data type.
You can see part of the content with head:
state abb region population total pop_rank
1 Alabama AL South 4779736 135 29
2 Alaska AK West 710231 19 5
3 Arizona AZ West 6392017 232 36
4 Arkansas AR South 2915918 93 20
5 California CA West 37253956 1257 51
6 Colorado CO West 5029196 65 30
Note that we used $.
This is called the accessor because it lets us access columns. For example, murders$population returns:
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
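As a sketch of how the accessor works for both reading and adding columns, the following uses a made-up three-row data frame (toy and its values are illustrative). The added rank column is in the spirit of the pop_rank column shown earlier.

```r
# Made-up data for illustration
toy <- data.frame(
  state = c("A", "B", "C"),
  population = c(500, 100, 300)
)

# $ extracts a column as a vector
toy$population

# $ on the left-hand side of an assignment adds (or replaces) a column
toy$pop_rank <- rank(toy$population)
toy$pop_rank    # 3 1 2
```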
One way R confuses beginners is by having multiple ways of doing the same thing.
For example you can access the 4th column in the following five different ways:
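One possible set of five equivalent ways, sketched on a made-up data frame whose fourth column is named population (toy and its values are illustrative):

```r
# Made-up data for illustration; population is the 4th column
toy <- data.frame(
  state = c("A", "B", "C"),
  abb = c("a", "b", "c"),
  region = factor(c("x", "y", "x")),
  population = c(500, 100, 300)
)

way_1 <- toy$population         # accessor
way_2 <- toy[, "population"]    # matrix-style indexing by column name
way_3 <- toy[["population"]]    # list-style indexing by name
way_4 <- toy[, 4]               # matrix-style indexing by position
way_5 <- toy[[4]]               # list-style indexing by position
```

All five objects hold the same numeric vector.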
with lets us use the column names as objects.
This is convenient to avoid typing the data frame name over and over:
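A minimal sketch of the difference, using a made-up data frame (toy, its columns, and its values are illustrative):

```r
# Made-up data for illustration
toy <- data.frame(population = c(500, 100, 300), total = c(5, 2, 6))

# Without with(): the data frame name is typed for every column
rate_1 <- toy$total / toy$population * 100

# With with(): column names are used directly as objects
rate_2 <- with(toy, total / population * 100)

identical(rate_1, rate_2)   # TRUE
```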
Often we have to create vectors.
The concatenate function c is the most basic way to create vectors:
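A short sketch with arbitrary example values:

```r
x <- c(2, 3, 5)               # numeric vector
codes <- c("a", "b", "c")     # character vector
y <- c(x, 7, 11)              # c() also combines existing vectors with new values

length(y)                     # 5
```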
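The : operator creates sequences of consecutive integers.
A useful function to quickly generate the sequence 1:length(x) is seq_along.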
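A short sketch of both approaches; the vector x is an arbitrary example. One reason seq_along is often preferred over 1:length(x) is that it behaves sensibly when x is empty.

```r
x <- c("a", "b", "c")

1:3              # consecutive integers with the : operator
1:length(x)      # 1 2 3
seq_along(x)     # 1 2 3

# With an empty vector the two differ
empty <- character(0)
1:length(empty)      # 1 0, because 1:0 counts down
seq_along(empty)     # integer(0)
```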
One key distinction between data types you need to understand is the difference between factors and characters.
The murders dataset has examples of both.
Factors store levels and the label of each level.
This is useful for categorical data.
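A minimal sketch of what a factor stores, using a made-up character vector:

```r
regions_chr <- c("South", "West", "West", "South", "Northeast")
regions_fct <- factor(regions_chr)

levels(regions_fct)       # "Northeast" "South" "West" (sorted by default)
as.integer(regions_fct)   # 2 3 3 2 1: the integer code stored for each entry
table(regions_fct)        # counts per level
```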
In data analysis we often have to stratify continuous variables into categories.
The function cut helps us do this:
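# Ages to stratify
age <- c(5, 93, 18, 102, 14, 22, 45, 65, 67, 25, 30, 16, 21)
# Break points define intervals that are open on the left and closed on the right
cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf),
    labels = c("Alpha", "Zoomer", "Millennial", "X", "Boomer", "Silent", "Greatest"))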
[1] Alpha Silent Zoomer Greatest Zoomer Zoomer
[7] X Boomer Boomer Zoomer Millennial Zoomer
[13] Zoomer
Levels: Alpha Zoomer Millennial X Boomer Silent Greatest
The order of the levels can be changed with the levels argument of factor.
A common reason we need to change levels is to ensure R knows which stratum is the reference.
This is important for linear models because the first level is assumed to be the reference.
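A minimal sketch of setting the level order, using made-up categories. The use of relevel at the end is an addition here: it is a base R helper that moves a single level to the front.

```r
x <- c("low", "high", "medium", "low")

f_default <- factor(x)
levels(f_default)    # "high" "low" "medium": sorted, so "high" is the reference

f_ordered <- factor(x, levels = c("low", "medium", "high"))
levels(f_ordered)    # "low" "medium" "high": "low" is now the reference

# relevel() moves one level to the first position
f_ref <- relevel(f_default, ref = "medium")
levels(f_ref)        # "medium" "high" "low"
```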
We often want to order strata based on a summary statistic.
This is common in data visualization.
We can use reorder for this:
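A minimal sketch with made-up groups and values. By default reorder uses the mean as the summary statistic.

```r
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(10, 12, 1, 3, 5, 7)

levels(group)             # "a" "b" "c"

# Reorder the levels of group by the mean of value within each group
group_by_mean <- reorder(group, value)
levels(group_by_mean)     # "b" "c" "a" (group means are 2, 6, and 11)
```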
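Factors can also use less memory than character vectors. For a character vector with 10^7 entries and for the same values stored as a factor, object.size reports:
80000232 bytes
40000648 bytes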
Exercise: How can we make this go much faster?
To remove levels that no longer appear in the data, use droplevels.
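A minimal sketch showing that subsetting keeps unused levels until droplevels is applied (the factor f is made up):

```r
f <- factor(c("a", "b", "c", "a"))

f_sub <- f[f != "c"]
levels(f_sub)                # "a" "b" "c": the unused level "c" is kept

levels(droplevels(f_sub))    # "a" "b"
```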
NA stands for not available.
Data analysts have to deal with NAs often.
The is.na function is key for dealing with NAs.
A related constant is NaN.
Unlike NA, which is a logical, NaN is a number.
It is a numeric that is Not a Number.
Here are some examples:
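A short sketch of NA and NaN in base R, using an arbitrary vector x:

```r
x <- c(1, NA, 3)
is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA
mean(x, na.rm = TRUE)    # 2

class(NA)        # "logical"
class(NaN)       # "numeric"
0 / 0            # NaN
is.nan(0 / 0)    # TRUE
is.na(NaN)       # TRUE: is.na also flags NaN
is.nan(NA)       # FALSE
```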
When you do something inconsistent with data types, R tries to figure out what you mean and change it accordingly.
We call this coercing.
R does not return an error and in some cases does not return a warning either.
This can cause confusion and unnoticed errors.
So it’s important to understand how and when it happens.
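A minimal sketch of silent coercion, with arbitrary values. Neither line produces an error or a warning.

```r
# Mixing numbers and strings silently turns everything into character
x <- c(1, "b", 3)
typeof(x)       # "character"
x               # "1" "b" "3"

# Logicals are silently treated as 0 and 1 in arithmetic
n_true <- TRUE + TRUE
n_true          # 2
```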
NAs in arithmetical operations usually return an NA.
You want to avoid automatic coercion and instead do it explicitly.
Most coercion functions start with as., for example as.numeric and as.character.
Here is an example.
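A minimal sketch of explicit coercion with arbitrary values. Wrapping the call in suppressWarnings is a choice made for this sketch; without it, as.numeric warns that NAs were introduced by coercion.

```r
x <- c("1", "b", "3")

# "b" cannot be converted to a number, so it becomes NA
y <- suppressWarnings(as.numeric(x))
y                       # 1 NA 3

# NAs propagate through arithmetic unless they are removed
sum(y)                  # NA
sum(y, na.rm = TRUE)    # 4

as.character(1:3)       # "1" "2" "3"
```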
Data frames are a type of list.
Lists permit components of different types and, unlike data frames, different lengths:
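A minimal sketch of a list whose components differ in type and length (record and its contents are made up for illustration):

```r
# Made-up record; the components have different types and lengths
record <- list(
  name = "John Doe",
  student_id = 1234,
  grades = c(95, 82, 91),
  final_grade = "A"
)

record$grades                   # access a component by name with $
record[["name"]]                # or with [[ ]]
lengths(record)                 # 1 1 3 1

is.list(data.frame(a = 1:2))    # TRUE: a data frame is a list
```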
Matrices are another widely used data type.
They are similar to data frames except all entries need to be of the same type.
We will learn more about matrices in the High Dimensional Data Analysis part of the class.
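A minimal sketch of creating and indexing a matrix, and of what happens when types are mixed (all values are arbitrary):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(m)        # 2 3
m[2, 3]       # 6: row 2, column 3
m[, 1]        # 1 2: the first column

# Mixing types coerces every entry to a common type
m_chr <- cbind(1:2, c("a", "b"))
typeof(m_chr)    # "character"
```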
Note that R searches the Global Environment first.
Use search to see other environments R searches.
Note many prebuilt functions are in stats.
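A sketch of the search order and of the :: operator, which asks for an object from a specific package. Redefining pi is done here only to illustrate masking; moving_avg is an illustrative name.

```r
search()[1]      # ".GlobalEnv": the Global Environment is searched first

# A variable defined in the workspace masks the constant with the same name in base
pi <- 3
pi               # 3
base::pi         # 3.141593: the version that lives in base
rm(pi)           # remove our definition so pi refers to base again

# The same syntax picks a function from a specific package
moving_avg <- stats::filter(1:5, rep(1/3, 3))   # moving average of width 3
moving_avg
```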
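When two loaded packages define a function with the same name, such as filter in stats and dplyr, you can explicitly say which filter you want using namespaces, for example stats::filter.
R uses object-oriented programming (OOP).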
It uses two approaches referred to as S3 and S4, respectively.
S3, the original approach, is more common.
The S4 approach is more similar to the conventions used by modern OOP languages.
Note that co2 is not numeric: its class is ts.
plot behaves differently with different classes.
For co2, plot actually calls the function plot.ts.
Notice all the plot functions that start with plot by typing plot. and then tab.
The function plot will call different functions depending on the class of its arguments.
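The dispatch mechanism can be sketched with a small generic of our own. The function describe and its methods are hypothetical names defined only for this illustration; they are not part of R.

```r
class(co2)    # "ts": co2 comes with the datasets package

# A generic function hands the work to a method chosen by the class of its argument
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "some other object"
describe.ts <- function(x, ...) "a time series"
describe.numeric <- function(x, ...) "a numeric vector"

describe(co2)                # "a time series"
describe(as.numeric(co2))    # "a numeric vector"
describe("text")             # "some other object"
```

This is the same pattern plot follows when it hands a ts object to plot.ts.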
Soon we will learn how to use the ggplot2 package to make plots.
R base does have functions for plotting though.
Some you should know about are:
plot - mainly for making scatterplots.
lines - adds lines/curves to an existing plot.
hist - makes a histogram.
boxplot - makes boxplots.
image - uses color to represent entries in a matrix.
Although, in general, we recommend using ggplot2, R base plots are often better for quick exploratory plots.
For example, to make a histogram of the values in x simply type hist(x).
To make a scatterplot of y versus x and then interpolate, we type plot(x, y) followed by lines(x, y).
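A minimal sketch of these quick plots with made-up x and y. The call to pdf(NULL) is an assumption made so the sketch also runs without a display; in RStudio you would leave it and dev.off() out and see the plots in the Plots tab.

```r
# Made-up data for illustration
x <- seq(0, 10, by = 0.5)
y <- x^2

pdf(NULL)             # assumption: draw to a null device so no window or file is needed

h <- hist(x)          # histogram of the values in x
plot(x, y)            # scatterplot of y versus x
lines(x, y)           # connect the points with straight line segments

invisible(dev.off())  # close the device
```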