2025-09-15
R has a base installation and then tens of thousands of add-on software packages that can be obtained from CRAN
Use install.packages to install the dslabs package from CRAN
Try out the following functions: sessionInfo, installed.packages
Much of what we do in R uses functions from either base R or installed packages.
Many are included in automatically loaded packages: stats, graphics, grDevices, utils, datasets, methods.
This subset of the R universe is referred to as R base.
Very popular packages not included in R base: ggplot2, dplyr, tidyr, and data.table.
It is easy to write your own functions and packages
Important
For problem set 2 you can only use R base.
Example functions that we will use today: ls, rm, library, search, factor, list, exists, str, typeof, and class.
You can see the raw code for a function by typing its name without the parentheses.
Type ls on your console to see an example.
In R you can use ? or help to learn more about functions.
For example, you can learn about the function ls using help("ls") or ?ls.
Define a variable, then use ls to see if it’s there. Also take a look at the Environment tab in RStudio.
Use rm to remove the variable you defined.
each time you start R you will get a new workspace that does not have any variables or libraries loaded
when you quit R you will be asked if you want to save the workspace
if you do save a workspace it will be saved as a hidden file (Unix lecture) and whenever R is started in a directory with a saved workspace then that workspace will be re-instantiated and used
mostly it is much easier to just use a markdown document to detail your steps and rerun them on a clean workspace
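A minimal sketch of inspecting and cleaning the workspace with ls, exists, and rm. The variable name my_value is made up for illustration.

```r
# Define a variable (hypothetical name) and confirm it is in the workspace
my_value <- 42
defined_before <- exists("my_value")   # TRUE
listed <- "my_value" %in% ls()         # TRUE: ls() returns the names of defined objects

# Remove it and check again
rm(my_value)
defined_after <- exists("my_value")    # FALSE, assuming nothing else is called my_value
```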
A nice convention to follow is to use meaningful words that describe what is stored, only lower case, and underscores as a substitute for spaces.
R and RStudio both provide autocomplete capabilities so you don’t need to type the whole name
It is highly recommended that you not use the period . in variable names; there are situations where R treats it differently and those can cause unintended actions
For more we recommend this guide.
The main data types in R are:
One dimensional vectors: double, integer, logical, complex, characters.
integer and double are both numeric
Factors
Lists: this includes data frames.
Arrays: Matrices are the most widely used.
Date and time
tibble
S4 objects
Many errors in R come from confusing data types.
str stands for structure; it gives us information about an object.
typeof gives you the basic data type of the object. It reveals the lower-level, more fundamental type of an object in R’s memory.
class returns the class attribute of an object. The class of an object is essentially its type at a higher, often user-facing level.
Let’s see some examples. For the murders data frame, typeof, class, str, and dim return:
[1] "list"
[1] "data.frame"
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
[1] 51 5
Data frames are the most common class used in data analysis.
Data frames are like a matrix, but where the columns can have different types.
Usually, rows represent observations and columns variables.
You can index them like you would a matrix: x[i, j] refers to the element in row i, column j.
Use View to open a spreadsheet-like interface to see the entire data.frame.
You can see part of the content with head:
       state abb region population total pop_rank
1    Alabama  AL  South    4779736   135       29
2     Alaska  AK   West     710231    19        5
3    Arizona  AZ   West    6392017   232       36
4   Arkansas  AR  South    2915918    93       20
5 California  CA   West   37253956  1257       51
6   Colorado  CO   West    5029196    65       30
Note that we used the $. This is called an accessor because it lets us access columns.
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
$ can be used to access named components of a list.
One way R confuses beginners is by having multiple ways of doing the same thing.
For example you can access the 4th column in the following five different ways:
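As a sketch of what those five ways can look like, here is a small stand-in data frame (dat is a hypothetical name, with values copied from the first rows shown earlier) whose 4th column is extracted with $, with matrix-style indexing by name and by position, and with [[ ]] by name and by position.

```r
# Stand-in data frame; dat is a made-up name, not the murders object itself
dat <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  abb = c("AL", "AK", "AZ"),
  region = factor(c("South", "West", "West")),
  population = c(4779736, 710231, 6392017)
)

way1 <- dat$population          # accessor
way2 <- dat[, "population"]     # matrix-style indexing by column name
way3 <- dat[["population"]]     # list-style extraction by name
way4 <- dat[, 4]                # matrix-style indexing by position
way5 <- dat[[4]]                # list-style extraction by position

# All five return the same numeric vector
all_same <- identical(way1, way2) && identical(way1, way3) &&
  identical(way1, way4) && identical(way1, way5)
```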
with lets us use the column names as symbols to access the data.
This is convenient to avoid typing the data frame name over and over:
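A small sketch of the idea, using a made-up data frame dat with two of the columns seen earlier. The rate per 100,000 is only an illustrative calculation.

```r
# Hypothetical two-column data frame
dat <- data.frame(
  population = c(4779736, 710231, 6392017),
  total = c(135, 19, 232)
)

# Without with: the data frame name is repeated for every column
rate_long <- dat$total / dat$population * 100000

# With with: column names are used directly as symbols
rate_with <- with(dat, total / population * 100000)

same_result <- identical(rate_long, rate_with)   # TRUE
```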
The columns of data frames are an example of one dimensional (atomic) vectors.
An atomic vector is a vector where every element must be the same type.
Often we have to create vectors.
The concatenate function c is the most basic way to create vectors:
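For instance, a sketch with made-up values that creates a numeric and a character vector with c, names the entries, and then subsets with square brackets:

```r
# Create vectors with c (the values are only illustrative)
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")

# Entries can also be named
names(codes) <- country

second <- codes[2]          # the entry named canada
first_third <- codes[c(1, 3)]
by_name <- codes["egypt"]
without_first <- codes[-1]  # a negative index drops that entry
```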
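We access elements of a vector with [].
A safe way to generate the indices of x for a loop is seq_along(x) and NOT 1:length(x)
If the length of x is zero, then using 1:length(x) does not work for this loop
One dimensional vectors: double, integer, logical, complex, characters and numeric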
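To see why seq_along is the safer choice for loop indices, here is a sketch with an empty vector. The helper count_iterations is a hypothetical function written only for this illustration.

```r
x <- numeric(0)              # a vector of length zero

bad_index <- 1:length(x)     # 1:0 counts down, giving c(1, 0)
good_index <- seq_along(x)   # integer(0), so a loop body never runs

# Hypothetical helper: counts how many times a for loop would iterate
count_iterations <- function(index) {
  n <- 0
  for (i in index) n <- n + 1
  n
}

bad_count <- count_iterations(bad_index)    # 2 iterations over an empty vector
good_count <- count_iterations(good_index)  # 0 iterations
```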
Each basic type has its own version of NA (a missing value)
testing for types: is.TYPE
coercing: as.TYPE, will result in NA if it is not possible
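A brief sketch of these three points: the typed NA constants, a few of the is.TYPE tests, and an as.TYPE coercion that cannot succeed.

```r
# Each basic type has its own NA
na_types <- c(typeof(NA), typeof(NA_integer_), typeof(NA_real_), typeof(NA_character_))
# "logical" "integer" "double" "character"

# Testing for types
int_is_numeric <- is.numeric(1L)     # TRUE: integers are numeric
chr_is_numeric <- is.numeric("1")    # FALSE

# Coercing: the impossible conversion becomes NA (R also issues a warning)
converted <- suppressWarnings(as.integer("two"))
```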
When you do something inconsistent with data types, R tries to figure out what you mean and change it accordingly.
We call this coercing.
R does not return an error and in some cases does not return a warning either.
This can cause confusion and unnoticed errors.
So it’s important to understand how and when it happens.
coercion is automatically performed when it is possible
TRUE coerces to 1 and FALSE to 0
but any non-zero integer coerces to TRUE, only 0 coerces to FALSE
as.logical converts 0 and only 0 to FALSE, and every other non-missing number to TRUE
the character string “NA” is not a missing value
NAs in arithmetical operations usually return an NA.
You can coerce explicitly
Most coercion functions start with as.
Here is an example.
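The following sketch, with made-up values, shows both the automatic coercion described above and explicit coercion with as. functions.

```r
# Automatic coercion: mixing types in c gives the most general type
mixed <- c(1, "b", TRUE)                 # all entries become character
n_true <- sum(c(TRUE, FALSE, TRUE))      # logicals are treated as 1 and 0, so 2
as_lgl <- as.logical(c(0, 1, -3))        # FALSE TRUE TRUE
string_na <- is.na("NA")                 # FALSE: the string "NA" is not missing
na_math <- NA + 1                        # NA

# Explicit coercion
x <- 1:3
y <- as.character(x)                     # "1" "2" "3"
back <- as.numeric(y)                    # 1 2 3
partial <- suppressWarnings(as.numeric(c("1", "b", "3")))  # "b" becomes NA
```

Note that the failed conversion of "b" produces a warning rather than an error, which is why it is easy to miss.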
The readr package provides some tools for trying to parse character strings into other types.
One key distinction between data types you need to understand is the difference between factors and characters.
The murders dataset has examples of both.
A factor is a good representation for a variable that has a fixed set of non-numeric values, such as Male and Female
It is usually not a good representation for variables that have lots of levels (like state names in the murders dataset)
Internally a factor is stored as the unique set of labels (called levels) and an integer vector with values in 1 to length(levels)
If the ith entry is k, then that corresponds to the kth element of the levels
In statistics we refer to this as categorical data, where all of the individuals are mapped into a relatively small number of categories.
Usually order does not matter, but if it does you can also have ordered factors.
you can set up the levels as you would like, when creating a factor
if you do not set them up, then they will be created in lexicographic order (in the locale you are using)
# Draw a random sample of labels (the output below will vary from run to run)
x <- sample(c("Male", "Female"), 50, replace = TRUE)
# Same data, two different level orders
y1 <- factor(x, levels = c("Male", "Female"))
y2 <- factor(x, levels = c("Female", "Male"))
y1[1:10]
[1] Female Male Female Male Female Female Female Female Male Female
Levels: Male Female
y2[1:10]
[1] Female Male Female Male Female Female Female Female Male Female
Levels: Female Male
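The internal storage described above (levels plus integer codes) can be inspected directly. This is a small sketch with a fixed, made-up vector so the result does not depend on random sampling.

```r
f <- factor(c("Male", "Female", "Female", "Male"), levels = c("Male", "Female"))

lev <- levels(f)            # "Male" "Female"
codes <- as.integer(f)      # 1 2 2 1: positions in the levels
storage <- typeof(f)        # "integer", even though class(f) is "factor"

# The kth level is the label for an entry with code k
labels_back <- lev[codes]   # "Male" "Female" "Female" "Male"
```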
In data analysis we often want to stratify continuous variables into categories.
The function cut helps us do this:
In this case there may be a reason to think of using ordered factors.
age <- c(5, 93, 18, 102, 14, 22, 45, 65, 67, 25, 30, 16, 21)
cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf),
labels = c("Alpha", "Zoomer", "Millennial", "X", "Boomer", "Silent", "Greatest"))
[1] Alpha Silent Zoomer Greatest Zoomer Zoomer
[7] X Boomer Boomer Zoomer Millennial Zoomer
[13] Zoomer
Levels: Alpha Zoomer Millennial X Boomer Silent Greatest
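If an ordered factor is wanted here, one option is the ordered_result argument of cut. This sketch reuses the breaks and labels from above on a shorter age vector.

```r
age <- c(5, 93, 18, 102)
generation <- cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf),
                  labels = c("Alpha", "Zoomer", "Millennial", "X",
                             "Boomer", "Silent", "Greatest"),
                  ordered_result = TRUE)

is_ord <- is.ordered(generation)               # TRUE
# Ordered factors support comparisons that follow the level order
alpha_before_silent <- generation[1] < generation[2]   # Alpha < Silent
```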
This is often needed for ordinal data because R defaults to alphabetical order
Or as noted you may want to make use of ordered factors
You can change the order of the levels with the levels argument of factor.
A common reason we want to change levels is to ensure R is aware of which stratum is the reference.
This is important for linear models because the first level is assumed to be the reference.
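Two ways to set the reference level, sketched on a made-up region vector: relevel moves one level to the front, while factor with the levels argument sets the whole order.

```r
region <- factor(c("South", "West", "West", "Northeast", "South"))
default_levels <- levels(region)          # sorted: "Northeast" "South" "West"

# Make West the reference (first) level
region_west <- relevel(region, ref = "West")
west_levels <- levels(region_west)        # "West" "Northeast" "South"

# Or specify the complete order explicitly
region_manual <- factor(region, levels = c("West", "South", "Northeast"))

# Only the level order changes, not the data
same_values <- identical(as.character(region), as.character(region_manual))
```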
We often want to order strata based on a summary statistic.
This is common in data visualization.
We can use reorder for this:
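A sketch with made-up numbers: the levels of region are reordered by the median of rate within each level.

```r
region <- factor(c("South", "West", "West", "Northeast", "South", "Northeast"))
rate <- c(3, 1, 2, 5, 4, 6)
# Medians by level: South 3.5, West 1.5, Northeast 5.5

by_median <- reorder(region, rate, FUN = median)
new_levels <- levels(by_median)   # increasing median: "West" "South" "Northeast"
```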
80000232 bytes
40000648 bytes
Unused levels can be removed with droplevels.
NA stands for not available and represents data that are missing.
Data analysts have to deal with NAs often.
In R there is a different kind of NA for each of the basic vector data types.
There is also the concept of NULL, which represents a zero length list and is often returned by functions or expressions that do not have a specified return value
The is.na function is key for dealing with NAs
The logical operators and (&) and or (|) coerce their arguments when needed and possible
A related constant is NaN, which stands for Not a Number
NaN is a double, coercing it to integer yields an NA
Inf and -Inf represent values of infinity and minus infinity (RStudio makes using these really annoying)
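The points above about NA, NULL, NaN, and Inf can be checked with a few one-liners. The vector x is made up for illustration.

```r
x <- c(2, NA, 5)
missing_flags <- is.na(x)            # FALSE TRUE FALSE
n_missing <- sum(is.na(x))           # 1
mean_with_na <- mean(x)              # NA
mean_no_na <- mean(x, na.rm = TRUE)  # 3.5

# Logical operators coerce numbers, and NA propagates only when it matters
coerced <- 1 & 0                     # FALSE
true_and_na <- TRUE & NA             # NA
false_and_na <- FALSE & NA           # FALSE

# NULL has length zero
null_length <- length(NULL)          # 0

# NaN and infinities
not_a_number <- 0 / 0                # NaN
nan_type <- typeof(NaN)              # "double"
nan_as_int <- suppressWarnings(as.integer(NaN))   # NA
pos_inf <- 1 / 0                     # Inf
neg_inf <- -1 / 0                    # -Inf
```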
Data frames are a type of list.
Lists permit components of different types and, unlike data frames, different lengths:
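A minimal sketch of such a list. The name record and its contents are made up for illustration.

```r
record <- list(
  name = "Alabama",
  population = 4779736,
  neighbors = c("FL", "GA", "MS", "TN")
)

component_lengths <- lengths(record)      # 1, 1, and 4: lengths can differ
state_name <- record$name                 # $ accesses a named component
second_neighbor <- record[["neighbors"]][2]

# Single brackets keep the list structure, double brackets extract the component
still_list <- is.list(record["name"])     # TRUE
df_is_list <- is.list(data.frame(a = 1:2))  # TRUE: data frames are lists
```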
Matrices are another widely used data type.
They are similar to data frames except all entries need to be of the same type.
We will learn more about matrices in the High Dimensional Data Analysis part of the class.
the values you pass to the function are called the arguments and they can have default values (e.g. with z = 0 in the definition, z is 0 unless provided)
arguments are matched by either name (which takes precedence) or position
the value returned is either the one passed to return or the value of the last statement evaluated
Here is an example of a function that sums \(1,2,\dots,n\)
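One possible version is sketched below, together with a hypothetical function f that illustrates default values and argument matching. The names f and sum_to_n are assumptions for this illustration.

```r
# Hypothetical function: z defaults to 0 unless provided
f <- function(x, y, z = 0) {
  x + 2 * y + 3 * z
}

by_position <- f(1, 2)          # x = 1, y = 2, z = 0 gives 5
by_name <- f(y = 2, x = 1)      # names take precedence over position, also 5
with_z <- f(1, 2, z = 1)        # 8

# Sum of 1, 2, ..., n; the last evaluated statement is the return value
sum_to_n <- function(n) {
  x <- seq_len(n)
  sum(x)
}

total <- sum_to_n(100)          # 5050, matching n * (n + 1) / 2
```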
within the body of a function the arguments are referred to by their symbols and they take the value supplied at the time of invocation
any symbol found in the body of the function that does not match an argument has to be matched to a value by a process called scoping
You can get help on operators using help or using ? (but for the latter you must quote the argument)
& and | perform element-wise comparisons (vectorized)
&& and || work on single values, move left to right and return when the result is determined (since R 4.3.0, passing them a vector longer than one is an error)
To check which of two operators, such as ^ or +, is applied first, ?Syntax will get you the manual page for operator precedence
in R functions are first class objects - this means they can be passed as arguments, assigned to symbols, stored in other data structures
in particular they can be passed as arguments to a function and returned as values
in some languages (e.g. C, or Java before lambdas were added) functions are not first class objects and they cannot be passed directly as arguments
Python uses a fairly similar strategy for functions to the one used in R (as do many other languages)
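A short sketch of what first class means in practice. The helper names apply_twice and make_power are made up for this illustration.

```r
# Passing a function as an argument
apply_twice <- function(f, x) f(f(x))
twice <- apply_twice(sqrt, 16)            # sqrt(sqrt(16)) is 2

# Storing functions in a list and calling each one
summaries <- list(avg = mean, med = median)
results <- sapply(summaries, function(f) f(c(1, 2, 10)))

# Returning a function as a value
make_power <- function(p) function(x) x^p
square <- make_power(2)
nine <- square(3)                         # 9
```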
most of what computer languages do is map symbols (syntax) to values (semantics) and then create an executable program
when a computer comes upon an expression it parses it and that identifies the symbols that will need to be looked up
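for example, parsing the expression a + b * c identifies the symbols a, +, *, b and c
to evaluate it we need values for a, b and c and functions for + and *
Scope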
in fun1(4) we probably all agree that the value that should be used for x is 4, and we probably expect that + is a system function
but what about y - where should its value come from?
lexical scope says that for any function with unbound symbols you should first look for bindings in the environment where the function was created; in R (and Python) after that you look in the global environment (your workspace) and then in attached packages and system functions.
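A sketch of this lookup. The definition of fun1 used here (a function of x that adds an unbound y) is an assumption, and make_fun is a hypothetical helper.

```r
y <- 10

# Assumed form of fun1: y is not an argument, so it must be found by scoping
fun1 <- function(x) x + y
from_workspace <- fun1(4)        # y is found in the workspace: 14

# A function created inside another function sees that function's y first
make_fun <- function() {
  y <- 100
  function(x) x + y
}
fun2 <- make_fun()
from_creation_env <- fun2(4)     # uses y = 100 from where fun2 was created: 104

# Changing the workspace y affects fun1 but not fun2
y <- 20
after_change_fun1 <- fun1(4)     # 24
after_change_fun2 <- fun2(4)     # still 104
```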
why is this useful?
we can write some interesting functions - like a function that evaluates the log likelihood for any given set of data
here we use rexp to generate values from an Exponential distribution
notice that a function is returned and that R notes that it is a closure
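One way such a function could look is sketched here. The name make_loglik, the sample size, and the true rate of 3 are assumptions for this illustration. For exponential data the log likelihood is \(n \log(\lambda) - \lambda \sum x_i\), and it is maximized at \(1/\bar{x}\).

```r
set.seed(2025)

# Returns a function of rate that remembers the data it was created with
make_loglik <- function(data) {
  n <- length(data)
  total <- sum(data)
  function(rate) n * log(rate) - rate * total
}

obs <- rexp(200, rate = 3)         # simulated exponential data
loglik <- make_loglik(obs)
kind <- typeof(loglik)             # "closure"

# Maximize numerically and compare with the closed-form estimate 1 / mean
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)
numeric_mle <- fit$maximum
closed_form_mle <- 1 / mean(obs)
```

The returned function finds data-dependent quantities (n and total) in the environment where it was created, which is exactly what lexical scope provides.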
what happens when authors of two different packages choose the same name for their functions?
Look at how the function associated with the symbol filter changes in the following code segment:
by calling library(dplyr) a new package has been put on the search list. You can call search() before and after the call to library
one implication of this observation is that users could inadvertently alter computations - and we will want to protect against that
when evaluating a function R first establishes an evaluation environment that contains the formal arguments matched to the supplied arguments
if the function is in a package, then the Namespace directives of the package are used to augment this evaluation environment
any symbol not found in that evaluation environment will be searched for in the Global Environment (your workspace), and after that in the search path (search()) in order.
The evaluator will take the first match it finds and try to use that (sort of - it does know when it is looking for a function)
when authoring a package you will want to use Namespaces - the details will not be discussed here
If a package uses a Namespace then you can explicitly say which filter you want using the :: operator, for example stats::filter or dplyr::filter.
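A sketch that only relies on the stats package, which ships with R, so it runs whether or not dplyr is installed. The input vector and the three-point moving average are made up for illustration.

```r
# Ask which package the function behind stats::filter belongs to
owner <- environmentName(environment(stats::filter))   # "stats"

# Being explicit guarantees the linear-filter version is used,
# even if another attached package also defines filter
x <- c(1, 2, 3, 4, 5)
smoothed <- stats::filter(x, rep(1 / 3, 3))   # centered moving average: NA 2 3 4 NA

# The search list shows which packages are currently attached
stats_attached <- "package:stats" %in% search()
```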
R uses object oriented programming (OOP).
Base R uses two approaches referred to as S3 and S4, respectively.
S3, the original approach, is more common, but has some severe limitations
The S4 approach is more similar to the conventions used by the Lisp family of languages.
In S4 there are classes that are used to describe data structures and generic functions, that have methods associated with them
The class of co2 is not numeric: it is ts.
plot behaves differently with different classes.
For co2, plot actually calls the function plot.ts.
Notice all the plot functions that start with plot by typing plot. and then tab.
The function plot will call different functions depending on the class of the arguments.
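The same dispatch mechanism can be sketched with a made-up generic. The function describe and its methods are hypothetical and written only for this illustration; co2 is the dataset that ships with R.

```r
co2_class <- class(co2)      # "ts"
# A plot method for class ts exists, which is what plot(co2) ends up calling
has_plot_ts <- !is.null(getS3method("plot", "ts", optional = TRUE))

# Hypothetical generic with methods for different classes
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something else"
describe.numeric <- function(x, ...) "a numeric vector"
describe.ts <- function(x, ...) "a time series"

d_ts <- describe(co2)        # dispatches on class ts
d_num <- describe(c(1, 2))   # dispatches to the numeric method
d_chr <- describe("a")       # no character method, falls back to default
```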
Soon we will learn how to use the ggplot2 package to make plots.
R base does have functions for plotting though
Some you should know about are:
plot - mainly for making scatterplots.
lines - add lines/curves to an existing plot.
hist - to make a histogram.
boxplot - makes boxplots.
image - uses color to represent entries in a matrix.
Although, in general, we recommend using ggplot2, R base plots are often better for quick exploratory plots.
For example, a histogram of the values in x takes a single call to hist.
To make a scatterplot of y versus x and then interpolate, we call plot followed by lines.
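A sketch of both quick plots with simulated data. The variable names and values are made up for illustration, and when run non-interactively R writes the plots to a file instead of a window.

```r
set.seed(1)

# Histogram of simulated values; hist also returns the bin counts invisibly
values <- rnorm(100)
h <- hist(values)

# Scatterplot of y versus x, then connect the points with line segments
x <- 1:10
y <- x^2
plot(x, y)
lines(x, y)
```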