<- 1:10 |> mean()
result result
[1] 5.5
One of the simplest thing that you can do with R is to use it as a calculator. Here are some common arithmetic operations:
+
-
*
/
^
%%
The ^
operator raises the number to its left to the power of the number to its right. For example, if you enter 2^3
, you will get the answer of 2 to the power 3, which is 8.
The %%
operator returns the remainder when the number on the left is divided by the number on the right. For example, 7 %% 3
results in 1 because 7 divided by 3 leaves a remainder of 1.
Note that you can add comment to your code using the #
symbol. This is helpful when you want to take notes as you go, so that you can follow your R code when you come back to revise in a few weeks time!
A basic concept in programming is called a variable. A variable allows you to store a value (e.g. 8) or an object (e.g. a piece of string) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.
We can assign the result of our arithmetic operations (which is a value), for instance, 2^3
to a variable named my_result
using the following command: my_result <- 2^3
Now, when you evaluate the value of the variable my_result
by running it, you will get the value of 8.
Now it’s your turn to try variable assignment and some simple arithmetic operations in R!
HINT 💡: Let’s say you would like to assign the result of summing from 1 to 5 to a variable named sum_to_five
. You can do: sum_to_five <- 1+2+3+4+5
Now if you would like to divide sum_to_five
by 5 and assign the result to average
, you can do: average <- sum_to_five/5
Pipes allow you to sequence a series of operations in a clear and readable way. There are two types of pipe operators in R. %>%
, which is part of the magrittr
package and native pipe operator |>
introduced in R 4.1.0. The pipe operator takes the output of the operation on its left and feeds it as the first argument to the function on its right. For example:
<- 1:10 |> mean()
result result
[1] 5.5
Here the pipe takes the sequence from 1 to 10 and passes it to the mean()
function, which calculates the mean.
For further information on the pipe operators and their differences, refer here.
Try using the pipe operator to calculate the sum of the sequence from 1 to 10 and assign the result to the variable result
.
HINT 💡: You can use the |>
operator to pass the sequence from 1 to 10 to the sum()
function and assign it to result
with <-
.
There are numerous data types in R. Here are some of the basic ones:
TRUE
or FALSE
)Note that we use quotation marks to indicate if a value is a string.
You can check the data type of a variable with the class()
function. This is a useful function because matching data types are often necessary when performing operations in R. For example, you will get an error message if you try to evaluate 5 + "6"
since the quotation marks would make 6 a character.
The two most common numeric classes are “integer” and “double” (for double precision floating point numbers). “Integers” are whole numbers like 7
. “Double” are decimal values like 5.217
. R uses double precision numeric values by default.
<- 7
db_var class(db_var)
[1] "numeric"
To create integer values, you can add L
after the number.
<- 7L
int_var class(int_var)
[1] "integer"
Play around with different variable types!
HINT 💡: Replace the values in the R code with values that are provided in the instructions (the line with the comment #
). For example, weather <- "sunny"
assigns the string “sunny” to the variable weather
.
There are several special values in R, including NA
, NaN
, and Inf
.
NA
NA
(short for ‘Not Available’), is a logical constant of length 1 which contains a missing value indicator. We use NA
to replace an entry of a vector when such value is unknown or missing. There are different types of NA
for different class, including NA_integer_
, NA_real_
, NA_complex_
and NA_character_
. NA_integer
is of the integer class and so on.
Most operations on an NA
becomes an NA
. For instance,
<- c(1, 2, NA, 4)
x sum(x)
[1] NA
However, there is an exception - when we use the paste()
function to concatenate NA
with other strings. The operation will be performed with the NA
.
<- c("Apple","Banana", NA, "Orange")
some_string paste(some_string, " is good for you.")
[1] "Apple is good for you." "Banana is good for you."
[3] "NA is good for you." "Orange is good for you."
We can use is.na()
to test if a values is NA
. The function is.na(x) returns a boolean vector of the same size as x
with value TRUE if the corresponding element in x
is NA
.
is.na(x)
[1] FALSE FALSE TRUE FALSE
NaN
NaN
(short for ‘Not A Number’) is for arithmetic purposes. NaN
usually comes from arithmetic operations that create undefined values such as 0/0, hence NaN
is numeric. You can also use is.na()
to check if a value is NaN
.
<- c(4, 0/0, 3,5)
y is.na(y)
[1] FALSE TRUE FALSE FALSE
Inf
Inf
(short for Infinite
), like NaN
, also stems from numerical operations and thus is of numeric class. It usually comes operations like 1/0 where the result is a very very large number (larger than other numeric). Note that Inf
is not a type of NA
. You can use is.infinite()
to check if a value is Inf
.
is.infinite(23/0)
[1] TRUE
A vector is an ordered finite list of numbers. Vectors (and matrices) are used a lot in equations, models and mathematical optimisation problems. Learning how to read and use them allows you to:
A vector x
can be written in R as follows:
<- c(1, 2, 3, 4, 5)
x print(x)
[1] 1 2 3 4 5
Where the combine function c()
is used to create the vector and each element in the vector is separated by a comma.
If a
and b
are vectors of the same size, a+b
and a-b
give their sum and difference, respectively.
<- c(1,2,3)
a <- c(100,200,300)
b +b a
[1] 101 202 303
-b a
[1] -99 -198 -297
Try adding and subtracting vectors!
HINT 💡: Make sure you’re assigning a variable with the elements you want in your vector. For example, if you want to create a vector y
with elements 1, 2, 3, and 4, you can do y <- c(1, 2, 3, 4)
and likewise with the vector you’re adding or subtracting by.
If c
is a number and a
is a vector, you can express the scalar-vector product either as c*a
or a*c
. For example, if you do a 3*a
and a*3
:
<- c(1,2,3)
a 3*a
[1] 3 6 9
*3 a
[1] 3 6 9
In R, you can add a scalar to a vector. This means that the scalar is added to each element of the vector. For example, if you do 3+a
:
<- c(1,2,3)
a 3+a
[1] 4 5 6
Note this is NOT a standard mathematical notation.
A specific element \(x_i\) where \(i\) is the index.
Say we would like to obtain the 5th entry from the following vector \(x\):
= c(5,6,3,4,5,8,23,4,6,4,3,23,7,5,4,23,7,90)
x # Extracting the 5th entry from the vector
5] x[
[1] 5
\(x_{r:s}\) denotes the slice of the vector from index \(i\) to \(s\). For instance, x[1:4]
selects the element from index 1 to 4.
1:4] x[
[1] 5 6 3 4
Lets try slicing and extracting elements from a vector!
HINT 💡: If you would like to extract the 3rd element from the vector y
, you can do y[3]
. If you would like to extract the 2nd to 4th element from the vector y
, you can do y[2:4]
.
Nothing here (yet)!
A matrix is a rectangular array of numbers written between rectagular brackets (or large parentheses).
\[T = \begin{bmatrix}2&2\\4&3\end{bmatrix}\] In R, you can construct a matrix using the matrix()
function.
<- matrix(c(4,3,2,7,4,2,6,3,4,3,0,0,2,-2,-3), byrow=TRUE, nrow=5)
A A
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 7 4 2
[3,] 6 3 4
[4,] 3 0 0
[5,] 2 -2 -3
In this R code:
c(4,3,2,7,4,2,6,3,4,3,0,0,2,-2,-3)
is the data that we want to put in the matrix.
byrow=TRUE
means that the matrix is filled by rows. If byrow=FALSE
, the matrix is filled by columns.
nrow=5
means that the matrix has 5 rows.
Similar to what you have learned with vectors, the standard operators like +
, -
, *
, /
can be used to perform element-wise operations on matrices. For example:
2 * A
[,1] [,2] [,3]
[1,] 8 6 4
[2,] 14 8 4
[3,] 12 6 8
[4,] 6 0 0
[5,] 4 -4 -6
which multiplies each element of the matrix A
by 2.
Consider another matrix B
defined as
<- matrix(c(1,1,1,2,2,2,0,0,0,-1,-1,-1,0,0,0), byrow=TRUE, nrow=5) B
A*B
creates a matrix where each element is the product of the corresponding elements in matrix A
and matrix b
.
*B A
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 14 8 4
[3,] 0 0 0
[4,] -3 0 0
[5,] 0 0 0
Remember A*B
is element-wise multiplication, not matrix multiplication. If you want to perform matrix multiplication, you can use the %*%
operator. For example, A %*% B
performs matrix multiplication.
Sometimes we may want to add new data or information to a matrix with more rows or columns.
You can use cbind()
function, which merges matrices and/or vectors by columns, to add a column to a matrix. For example:
<- matrix(c(4,3,2,7,4,2,6,3,4,3,0,0,2,-2,-3), byrow=TRUE, nrow=5)
A <- matrix(c(1,1,1,2,2,2,0,0,0,-1,-1,-1,0,0,0), byrow=TRUE, nrow=5)
B <- cbind(A, B)
big_matrix # Matrix A
A
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 7 4 2
[3,] 6 3 4
[4,] 3 0 0
[5,] 2 -2 -3
# Matrix B
B
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 2 2
[3,] 0 0 0
[4,] -1 -1 -1
[5,] 0 0 0
# Binding the two matrices
big_matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 4 3 2 1 1 1
[2,] 7 4 2 2 2 2
[3,] 6 3 4 0 0 0
[4,] 3 0 0 -1 -1 -1
[5,] 2 -2 -3 0 0 0
You can see the resulting matrix of merging A
and B
has 5 rows and 6 columns.
Similarly, you can use rbind()
function to add a row to a matrix. For example:
<- rbind(A,B)
big_matrix2 big_matrix2
[,1] [,2] [,3]
[1,] 4 3 2
[2,] 7 4 2
[3,] 6 3 4
[4,] 3 0 0
[5,] 2 -2 -3
[6,] 1 1 1
[7,] 2 2 2
[8,] 0 0 0
[9,] -1 -1 -1
[10,] 0 0 0
Where big_matrix2 <- rbind(A,B)
returns a bigger matrix by putting matrix A
on top of matrix B
. The resulting matrix has 10 rows and 3 columns.
You can use square brackets ([]) and a comma (,) to select one or multiple elements from a matrix. While vectors have one dimension, matrices have two. Therefore, you need to specify both the row and column indices, separated by a comma, to select an element.
For example, A[1:4, 3:5]
returns a matrix with the data on the rows 1,2,3,4 and columns 3,4,5. If you want to select all elements of a row or a column, you can do the following:
A[,1]
selects all elements of the first column of matrix A.Consider the following matrices S and T:
\[S = \begin{bmatrix}1&0\\0&1\end{bmatrix}, T = \begin{bmatrix}2&2\\4&3\end{bmatrix}\]
Construct a matrix M by stacking S on top of T. Assign your result to the variable M_mat
.
Select a sub-matrix of M with the data on the rows 1, 2, 3 and columns 2. Assign the result to the variable part_of_M
.
HINT 💡: To stack one matrix on top of another, we can use the rbind()
command. HINT 2 💡: For example, B[1,2] selects the element at the first row and second column of matrix B. B[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.
A data frame has the variables of a dataset as columns and the observations as rows.
Let’s look at an example. Just run the code and you will see what a data frame looks like in R.
HINT 💡: Simply run the code!
The iris data set gives the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Often it is useful to show only a small part of the entire dataset, especially when we are working with very big data set where showing the entire dataset is not easy. Here are a few useful commands that helps you understand the data frame very quickly.
head()
shows the first few observations of a data frame.tail()
shows the last few observations of a data frame.str()
shows the structure of the dataset.Investigate the structure of the example dataframe. Have a look at the first and last few observations!
HINT 💡: Try head(df)
.
You can construct your own data frame using data that you have with the data.frame()
function. You can include different vectors as argument in the function and they will become different columns of your data frame. Note that you should make sure the vectors that you pass have same length (i.e. same number of observations for each column).
Create a data frame using vectors symbol
, element
, chemical_group
and atomic_no
, following that order. Assign the result data frame to first_ten_elements
.
HINT 💡: Use the data.frame()
function on the vectors.
Similar to vectors and matrices, you select elements from a data frame using square brackets [ ]
.
my_df
, you can do my_df[2:4, 3:5]
.my_df[,1]
Note that when you select a single column (e.g. my_df[,1]
), it turns into a vector. If you would like it to remain as a data frame, you can use my_df[, 1, drop = FALSE]
.
Alternatively, if you would like to select a certain variable from the data frame, you can use the column name in the square bracket directly or use $
instead: my_df[ , "column_name"]
or my_df$column_name
.
Sometimes we may want to sort the data according to a certain variable in the dataset. In R, we can use the order()
function which gives the ranked position of each element when it is applied on a variable. For example,
<- c(100, -200,300)
x order(x)
[1] 2 1 3
The output above is the ranked positions and we can use that to sort the vector x
.
order(x)] x[
[1] -200 100 300
The above R code gives an ordered version of x
.
Now that we know how to use the order()
, let’s sort the first_ten_element data frame in ascending order of atomic_no
, assign the sorted dataframe to sorted_df
.
Complete the code below to complete the tasks indicated in the comments.
HINT 💡: Use order(first_ten_elements$atomic_no)
to create positions. Then, use ‘positions’ with square brackets: first_ten_elements[...]
; can you fill in the …? Do not forget that ‘positions’ indicates the order of the rows in the data frame.
Consider the following syntax:
if (x < 50) {
if (x < 20) {
result <- "Low to None"
} else {
result <- "Low"
}
} else if (x == 100) {
result <- "Full"
} else {
result <- "High"
}
print(result)
Take a look at the following statements:
If x
is set to 75, “High” gets printed to the console.
If x
is set to 18, “Low” gets printed to the console.
If x
is set to 100, R prints out “Full”
If x
is set to 21, “Low” gets printed to the console.
Select the option that lists all the true statements.
In data analytics, often we need to perform the same operation repeatedly on the data and loops can come in handy on these occasions. There are 2 types of loops: - The ‘for loop’ is designed to iterate over all elements in a sequence - The ‘while loop’ is designed to repeat the operations until certain condition is satisfied
A simple for loop in R looks like this:
<- c(1,3,5)
sequence for (i in sequence) {
print(i)
}
[1] 1
[1] 3
[1] 5
Here, we first defined a vector called sequence
, then for every \(i\) in sequence, we print the value of \(i\). From the output, you can see from the output that each element i
in the sequence was printed in each iteration.
The same loop can be written in another way:
<- c(1,3,5)
sequence for (i in 1:length(sequence)) {
print(sequence[i])
}
[1] 1
[1] 3
[1] 5
In the R code above, we use the length()
function to measure the length/size of sequence
, i.e. 3. Then we construct a for loop where \(i\) iterates from 1 to 3, inside the for loop i
represents the index of the sequence and putting the index inside the square brackets allows us to select the \(i\)th element from sequence
.
Write a for loop that takes the nominal GDP (in trillion) for each country in 2017 and divided it by the population (in million) in the same year. Assign the result to the variable my_result
.
HINT 💡: To extract the i-th element of nominalGDP_trilion_2017
, you can use nominalGDP_trilion_2017[i]
. You can extract the i-th element of population_million_2017
using square bracket as well.
We can do a lot more than when we use for loop together with the control flow statements (if, else if, else) we learnt before. For example,
<- c(2,3,1,5,5,5)
customer_rating
for (rating in customer_rating) {
if (rating >= 4) {
print('Happy Customer!')
else if (rating <= 2) {
} print('Angry Customer!')
else {
} print('Neutral Customer!')
} }
[1] "Angry Customer!"
[1] "Neutral Customer!"
[1] "Angry Customer!"
[1] "Happy Customer!"
[1] "Happy Customer!"
[1] "Happy Customer!"
The R code above prints different messages depending on the values in customer_rating.
Check your understanding
Consider the following syntax:
for (result in student_result) {
if (result >= 79 & result < 99) {
print('Superb!')
} else if (result < 50) {
print('Try again!')
} else if (result >= 99) {
print('Perfect!')
} else {
print('Well done!')
}
}
Here is the output from the above syntax after we have defined a vector variable student_result
.
[1] "Superb!"
[1] "Well done!"
[1] "Try again!"
[1] "Perfect!"
[1] "Well done!"
[1] "Superb!"
Based on the output from the for-loop, which of the following is student_result
?
cat(mc_opts("<code>student_result <- c(99,77,35,97,67,85)</code>" ="Try again! Copy the code and try it with some numbers in R!",
"<code>student_result <- c(77,99,35,97,67,85)</code>" ="Try again! Copy the code and try it with some numbers in R!",
"<code>student_result <- c(97,77,35,99,67,85)</code>" = "That's correct!",
"<code>student_result <- c(100,77,35,99,67,85)</code>"="Try again! Copy the code and try it with some numbers in R!",
correct = 3))
A while loop in R has the following structure:
while (condition) {
do_something
}
In the while loop, R will keep running the code between the brackets { }
repeated until the condition become FALSE
at some point during the execution. If the condition is never changed, the while loop will go on indefinitely.
For example, we create a while loop that will go on subtracting 28.5 (as weekly_spending) to the variable bank_balance
until bank_balance
is less than weekly_spending
. If you execute the R code, you will notice the loop stopped at the 7th iteration.
<- 200
bank_balance <- 28.5
weekly_spending while (bank_balance >= weekly_spending) {
= bank_balance - weekly_spending
bank_balance print(bank_balance)
}
[1] 171.5
[1] 143
[1] 114.5
[1] 86
[1] 57.5
[1] 29
[1] 0.5
There are occasions where breaking the loop during execution is a good idea. The break statement can be used in for loops and while loops.
For example, if we are to set up a early warning system when bank_balance
is less than or equal to 4 times weekly spending
, we can do the following:
<- 200
bank_balance <- 28.5
weekly_spending while (bank_balance>= weekly_spending) {
<- bank_balance - weekly_spending
bank_balance print(bank_balance)
if (bank_balance <= 4*weekly_spending){
print("Find a job and cut weekly spending!")
break
} }
[1] 171.5
[1] 143
[1] 114.5
[1] 86
[1] "Find a job and cut weekly spending!"
Functions are an extremely important concept in almost every programming language, including R! You can think of functions as a black box: you give some values as an input, the function processes this input and generates an output. In R, the function arguments are matched by position or by name. You can set a default value for some function arguments, and you can overwrite the default value when you need to.
Here is a simple skeleton of a function in R:
my_fun_name <- function(arg1, arg2) {
body
}
The keyword function
tells R you are defining a function. The arguments in the parentheses ( )
are the input of your function, the inputs can be data and parameters. The operations that you want to perform (the body
) should be written inside the curly brackets { }
.
Finally, you assign the function defining statement and body to the variable my_fun_name
, which is the name of your function. To call (use) the function, you will need to enter my_fun_name(arg1, arg2)
in your code after you defined the function.
Create a function power_three()
: it takes one argument and returns that number cubed (that number times itself and times itself again). Call this newly defined function with 12 as input.
It is also possible to run a function without an input argument.
In the following example, the function will print the statement “Hello World!” when the function is called without any inputs.
<- function(){
hello_world print('Hello World!')
}
This can also be useful sometimes when we would like to have random outcomes from the function. The sample()
function comes in handy when we would like to draw a number from a particular range of values.
For example:
<- function(){
flip_a_coin <- sample(1:2, size = 1)
outcome if (outcome == 1) {
print("Head")} else {
print("Tail") }
}
#Flip the coin the first time
flip_a_coin()
[1] "Head"
#Flip the coin second time
flip_a_coin()
[1] "Head"
#Flip the coin once again
flip_a_coin()
[1] "Tail"
You will get a different outcome (“Head” or “Tail”) every time when you call the function flip_a_coin
.
Write a function that takes no input argument and return a random outcome choosing from 1 to 6, just like rolling a die in R. Name that function roll_a_die
.
HINT 💡: You can obtain a random integer from 1 to 6 using sample(1:6, size = 1)
. Have you try putting that in the function body?
Function Scoping
Note that variables that are defined inside a function are not accessible outside that function. Variables we use inside a function are considered as local variables, which are different from variables that are designed outside of the function (global variables).
Congratulations! This is the end of R Programming Basics! Here is a useful cheat sheet from the Institute for Quantitative Social Science at Harvard University that summarize introductory R commands nicely: https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf