04 February 2020
R is an open-source programming language that is extremely versatile and increasingly used in academia and industry for data analysis. Together with RMarkdown, RStudio can be used to build personal websites, blog posts, and even slides (like these!).
In these weeks, you will learn:
Additional info can be found on the RStudio IDE Cheat Sheet
Once R and RStudio are installed, the main program you will work with directly is RStudio, which runs R in the background. The interface is composed of four panels:
To perform certain types of operations and analyses, you will need to download and install packages. A package is a bundle of files (code, data, documentation, and tests) written by someone to perform a specific type of operation and uploaded to CRAN (the Comprehensive R Archive Network).
To install a package, type install.packages("name.of.the.package") in the console. Packages are installed on your computer, which means that you only have to install them once (unless there are updates to download). To use the functions contained in a package, you have to load it into your R session with the command library(name.of.the.package). When you open a new session, the package needs to be reloaded with the same command.
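The install-once, load-every-session pattern can be sketched as follows. This is only an illustration: the variable `pkg` is a name made up for the example, and "tools" is used because it ships with R, so the sketch runs without an internet connection. Replace it with the CRAN package you actually need.

```r
# "tools" ships with every R installation, so the install step is skipped here;
# swap in the name of the CRAN package you want (this choice is an assumption)
pkg <- "tools"

# Install only if the package is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package into the current session (repeat this in every new session)
library(pkg, character.only = TRUE)
```

In everyday use you would usually write the name directly, as in library(tools); character.only = TRUE is needed here only because the name is stored in a variable.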
A brief note about good coding practices. It’s important to learn how to code properly, so that your code is clear and easy to follow when others read it. Good companions while you find your way through writing code are Hadley Wickham’s Style Guide and R-bloggers’ Best Practices.
The official “encyclopedia” of R terminology can be accessed on CRAN
R is a language with different types of objects - such as vector, matrix, data frame, and list - and different types of data - such as numeric, character, logical, and factors.
An object can be thought of as a data structure holding values, which can be assigned to an identifier (a name). The <- symbol is used to assign an object to an identifier.
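A minimal sketch of assignment, using identifiers (`answer`, `doubled`) invented for this example:

```r
# Assign the value 42 to the identifier `answer`
answer <- 42

# Typing the identifier on its own prints the stored object
answer

# The identifier can be reused in later computations
doubled <- answer * 2
doubled          # 84
class(answer)    # "numeric"
```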
A vector is a sequence of contiguous cells containing data, such as c(1, 2, 3, 4). The function c stands for concatenate.
c(1, 2, 3, 4) # numeric (stored as doubles; write 1L, 2L, ... for true integers)
[1] 1 2 3 4
c("PrincessLeia", "Yoda", "HanSolo") #characters
[1] "PrincessLeia" "Yoda" "HanSolo"
IMPORTANT: every element of the vector has to be of the same type!
Examples of vectors:
age <- c(23, 24, 21, 57, 35, 18, 29)
animals <- c("platypus", "octopus", "axolotl")
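What happens if the same-type rule is ignored? R does not raise an error; it silently converts (coerces) all elements to one common type. A small sketch, with vector names (`mixed`, `nums`) chosen only for illustration:

```r
# Mixing a number, a character string, and a logical value:
# everything is coerced to character
mixed <- c(1, "two", TRUE)
mixed            # "1" "two" "TRUE"
class(mixed)     # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
nums <- c(1, TRUE, FALSE)
nums             # 1 1 0
class(nums)      # "numeric"
```

Checking class() after building a vector is a quick way to spot unintended coercion.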
matrix(1:9, nrow = 3, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
Matrices are two-dimensional collections of elements of the same type (most often numeric), arranged into a fixed number of rows and columns.
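As the printed output of matrix(1:9, nrow = 3, ncol = 3) suggests, R fills a matrix column by column by default. The sketch below shows this, together with indexing by row and column; the names `m` and `m_by_row` are made up for the example.

```r
# Default: values are filled column by column
m <- matrix(1:9, nrow = 3, ncol = 3)
dim(m)          # 3 3 (rows, columns)
m[2, 3]         # element in row 2, column 3: 8
m[, 1]          # the whole first column: 1 2 3

# byrow = TRUE fills row by row instead
m_by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
m_by_row[2, 3]  # 6
```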
data.frame(name = c("bojack", "todd", "mr", "diane"), age = c(52, 30, 52, 41), human = c(FALSE, TRUE, FALSE, TRUE))
##     name age human
## 1 bojack  52 FALSE
## 2   todd  30  TRUE
## 3     mr  52 FALSE
## 4  diane  41  TRUE
Data frames are collections of vectors of equal length, placed side by side as columns. Different columns can hold different types, but every element within a column has to be of the same type.
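A short sketch of how the columns of a data frame can be inspected and used. The object names (`leads`, `mean_age`, `humans`) are chosen for illustration, and stringsAsFactors = FALSE is set explicitly as an assumption so that the result is the same in every R version (before R 4.0.0, character columns were converted to factors by default).

```r
leads <- data.frame(
  name  = c("bojack", "todd", "mr", "diane"),
  age   = c(52, 30, 52, 41),
  human = c(FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep `name` as character in all R versions
)

# One type per column: character, numeric, logical
sapply(leads, class)

# Extract a column as a vector with $
mean_age <- mean(leads$age)        # 43.75

# Keep only the rows where `human` is TRUE
humans <- leads[leads$human, ]
humans$name                        # "todd" "diane"
```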
A list is an object in which other objects are stored. It can be thought of as a catalogue indexing the books present in a library.
Any type of object can be stored in a list, and the stored objects can differ in type and length. As before, every element within each vector has to be of the same type.
list(series = "BoJack Horseman", seasons = c(1, 2, 3, 4, 5), leads = data.frame(name = c("bojack", "todd", "mr", "diane"), age = c(52, 30, 52, 41), human = c(FALSE, TRUE, FALSE, TRUE)))
## $series
## [1] "BoJack Horseman"
##
## $seasons
## [1] 1 2 3 4 5
##
## $leads
##     name age human
## 1 bojack  52 FALSE
## 2   todd  30  TRUE
## 3     mr  52 FALSE
## 4  diane  41  TRUE
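Once a list is stored under an identifier, its elements can be pulled out by name. The sketch below uses a shortened version of the list above; the name `tv` is hypothetical, and stringsAsFactors = FALSE is an assumption added to keep the behaviour identical across R versions.

```r
tv <- list(
  series  = "BoJack Horseman",
  seasons = c(1, 2, 3, 4, 5),
  leads   = data.frame(name = c("bojack", "todd"),
                       age  = c(52, 30),
                       stringsAsFactors = FALSE)
)

names(tv)             # "series" "seasons" "leads"
tv$series             # $ extracts one element by name
tv[["seasons"]][2]    # [[ ]] does the same; then take its 2nd value: 2
tv$leads$age          # elements can themselves be data frames: 52 30

# Single brackets return a (smaller) list rather than the element itself
class(tv["series"])   # "list"
```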
So far, we have seen:
numeric - 1, 2, 3, 4
character - "bojack", "todd", "mr", "diane"
logical - TRUE, FALSE
More types:
numeric - 1, 2, 3, 4
character - "bojack", "todd", "mr", "diane"
logical - TRUE, FALSE
factor - species
species <- factor(c("Human", "Non-Human", "Non-Human", "Non-Human"), levels = c("Human", "Non-Human"))
Under the hood, a factor is a vector of integers which identify the levels.
as.numeric(species)
## [1] 1 2 2 2
table(species)
## species
##     Human Non-Human
##         1         3
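The relation between a factor's levels and its underlying integer codes can be explored with a few base R functions. A minimal sketch; the object `unknown` and the value "Robot" are invented here to show what happens to a value that is not among the declared levels.

```r
species <- factor(c("Human", "Non-Human", "Non-Human", "Non-Human"),
                  levels = c("Human", "Non-Human"))

levels(species)        # the allowed categories: "Human" "Non-Human"
nlevels(species)       # 2
as.integer(species)    # integer codes pointing at the levels: 1 2 2 2
as.character(species)  # back to the original labels

# A value that is not one of the declared levels becomes NA
unknown <- factor(c("Human", "Robot"), levels = c("Human", "Non-Human"))
is.na(unknown)         # FALSE TRUE
```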
Even more types:
numeric - 1, 2, 3, 4
character - "bojack", "todd", "mr", "diane"
logical - TRUE, FALSE
factor - species
missing values - NA
data.frame(name = c("bojack", "todd", "mr", "diane"), age = c(52, NA, 52, 41), human = c(FALSE, TRUE, FALSE, TRUE))
##     name age human
## 1 bojack  52 FALSE
## 2   todd  NA  TRUE
## 3     mr  52 FALSE
## 4  diane  41  TRUE
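Missing values propagate through most computations, so it helps to know how to detect and skip them. A small sketch using the age column from the data frame above, stored here in a standalone vector named `age` for illustration:

```r
age <- c(52, NA, 52, 41)

is.na(age)                # which elements are missing: FALSE TRUE FALSE FALSE
n_missing <- sum(is.na(age))   # number of missing values: 1

mean(age)                 # NA, because one value is unknown
mean_known <- mean(age, na.rm = TRUE)   # ignore NAs: (52 + 52 + 41) / 3
```

Many summary functions (mean, sum, max, and others) accept na.rm = TRUE; whether dropping the missing values is appropriate depends on the analysis.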