print("Hello ABC")
x <- 1:10
y <- x^2
plot(x, y, type="b", col="blue")
[1] "Hello ABC"
Dimitrios Maniatis
June 13, 2024
Introductory slides about the ABC, and who is behind the Health Data Science Sandbox and the Core Bioinformatics Facility
A short guide on how to install a specific version of R on your computer.
We show screenshots for Windows, but the procedure is analogous for other systems.
Go to the CRAN website
Click on the Download R link
Click on “base” to download the base R. This is R with its basic functions.
Run the downloaded installer file and follow the default instructions.
Now R is installed on your system. On Windows it looks like this:
You can see that, because I had previously installed another version of R, there are two installations in separate folders. It is a good idea to stick with one version for the duration of a project. Always check the R version when you launch R; this can be done easily with the command R.Version().
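A minimal sketch of checking the version from the console. Both `R.version.string` and `R.Version()` are part of base R; the variable names `info` and `versionNumber` are only illustrative.

```r
# Print a one-line description of the running R version
print(R.version.string)

# R.Version() returns a list with more detail;
# the elements major and minor hold the version number
info <- R.Version()
versionNumber <- paste(info$major, info$minor, sep = ".")
print(versionNumber)
```

Writing the version number down at the start of a project makes it easier to notice later if you accidentally launch a different installation.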
A good way of making sure that you don’t mix versions of the software you are using and their packages is to use a package manager (for R there is one available as the package renv, but others like conda are often used), but that is a conversation for another day. For now we continue with the installation of RStudio.
Try to launch R. It should be installed like any other software on your computer. You will see that it opens a simple command line and tells you which version you are using.
When you install R alone, you can use it only through a rather primitive-looking command line. This is why we often use software like RStudio, which provides a nicer interface to R, including things such as a viewer for the data, an installer for the packages, a project manager, and a text editor. You don’t have to use all the many functions of RStudio, but it is very handy to know the basics.
RStudio is technically called an IDE (Integrated Development Environment) for the R language. An IDE is a software application that helps programmers develop code efficiently. It increases developer productivity by combining capabilities such as editing, building, testing, and packaging in one easy-to-use application.
Go to the RStudio website. (Don’t be confused by the name Posit; the company behind RStudio simply changed its name.)
Download RStudio and start the installer.
Open RStudio and go to File > New File > R Script to open a new script editor.
Type a few lines of code in the script editor, for example the short script at the top of this page, and run them (select the lines and click the Run button). Two things should happen:
The message is printed in the console and the plot appears in the plotting window.
The variables x and y are saved in your environment and can be seen in the variable explorer (top right). These variables can be used again since they exist in your computer’s memory.
Every time you start working in R, you are working inside a working directory. This directory is the reference point for everything you do. For example, if you want to open a file, you need to know where it is in relation to your working directory, so that you can write its path correctly. Write the command getwd() in the console and press Enter to see your current working directory.
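As a small sketch of how the working directory relates to file paths, the lines below build the full path of a file located in the working directory. The file name `data.csv` is hypothetical and only used for illustration.

```r
# Show the current working directory
wd <- getwd()
print(wd)

# Build the full path of a file inside the working directory
# ("data.csv" is a hypothetical file name used only for illustration)
filePath <- file.path(wd, "data.csv")
print(filePath)

# Check whether that file actually exists before trying to open it
print(file.exists(filePath))
```

If the file sits in the working directory, writing just `"data.csv"` is enough, because R resolves relative paths starting from there.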
In R you can perform basic math operations by using the appropriate symbol. For example
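A few operations you could type in the console, as a minimal sketch (the numbers are arbitrary):

```r
3 + 4      # addition, gives 7
10 - 2.5   # subtraction, gives 7.5
6 * 7      # multiplication, gives 42
9 / 2      # division, gives 4.5
2^3        # exponentiation, gives 8
7 %% 3     # remainder of the division, gives 1
```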
You can assign variables (“objects”) using the symbol <-. For example
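A minimal sketch of assignment, with illustrative object names `a`, `b` and `total`:

```r
# Assign values to two objects
a <- 5
b <- 2

# Use the objects in a calculation and store the result in a new object
total <- a * b + 1
print(total)   # prints 11
```

After running these lines, `a`, `b` and `total` should all appear in the variable explorer.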
Create a new script file or use the console to test some exercises below.
We will create a simple data frame. A data frame is nothing more than a table, where both rows and columns have labels, and can be easily accessed and manipulated. To create a small data frame, we can define its columns. We define each column through a vector with the function c(), where we can write values inside separated by a comma. Then we provide all vectors to the function data.frame, where we assign column names (Gene, Control, Treatment1, Treatment2).
Make sure your vectors are all of the same length! Also, each vector usually contains values of the same type (for example only numbers or only text)
geneExpr <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Control = c(10, 20, 30),
  Treatment1 = c(15, 25, 35),
  Treatment2 = c(100, 0, 250)
)
geneExpr
Gene | Control | Treatment1 | Treatment2 |
---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> |
GeneA | 10 | 15 | 100 |
GeneB | 20 | 25 | 0 |
GeneC | 30 | 35 | 250 |
You should be able to see the small data frame printed in output and also shown in the variable explorer.
Now let’s try to plot Treatment 1 versus Treatment 2. The basic function for plotting in R is called plot. It takes as arguments the values for the x axis and the y axis. It has other options which are not mandatory, such as the style and color of the plot. Notice how we access the values in the columns using the $ sign.
plot(x = geneExpr$Treatment1,
     y = geneExpr$Treatment2,
     main = "Expression per Treatment group",
     xlab = "Treatment 1",
     ylab = "Treatment 2",
     col = "blue",
     type = "b")
You should be able to see a plot in the plotting window. In the command above we used many options beyond x and y. Can you see what they match in the plot?
There are lots of summary statistics already implemented in R. Below we calculate mean, median and standard deviation for the column Treatment1 of the data frame and then we print them.
x <- geneExpr$Treatment1
meanTr1 <- mean(x)
medianTr1 <- median(x)
sdTr1 <- sd(x)
print("mean, median and sd:")
print(c(meanTr1, medianTr1, sdTr1))
[1] "mean, median and sd:"
[1] 25 25 10
This was neat! Can you try to calculate the cumulative sum of the difference between Treatment 1 and Treatment 2?
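One possible approach, sketched under the assumption that "difference" means Treatment 1 minus Treatment 2 for each gene. It uses the base R function `cumsum()`; the names `diffTr` and `cumDiff` are illustrative, and the data frame is redefined so the lines run on their own.

```r
# Same data frame as defined earlier, repeated so this sketch is self-contained
geneExpr <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Control = c(10, 20, 30),
  Treatment1 = c(15, 25, 35),
  Treatment2 = c(100, 0, 250)
)

# Difference between the two treatments, gene by gene
diffTr <- geneExpr$Treatment1 - geneExpr$Treatment2

# Running total of those differences
cumDiff <- cumsum(diffTr)
print(cumDiff)   # prints -85 -60 -275
```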
Although R and its packages have almost everything you will need, sometimes you might need to define your own function. The syntax is simple: you assign a function to a name, which you can then use like any other function. Below, there is a function taking an argument (arg1) and multiplying it by 5. The function commands need to be between the curly brackets, and the output we want needs to be made explicit through the return() function.
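A minimal sketch of such a function. The name `myFunction` is an assumption made for illustration; only the argument name `arg1` comes from the text.

```r
# Define a function that multiplies its argument by 5
# (the name myFunction is illustrative)
myFunction <- function(arg1) {
  result <- arg1 * 5
  return(result)
}

# Call the function and store its output
out <- myFunction(3)
print(out)   # prints 15
```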
Such a function works if the argument is a number, but also if it is a vector!
[1] "with a number only"
[1] "with a vector"
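The two printed labels above can be reproduced with calls like the following. This is a sketch: the function name `myFunction` and the input values are assumptions chosen for illustration. Because multiplication in R is applied element by element, the same function handles both cases.

```r
# Function multiplying its argument by 5 (the name myFunction is illustrative)
myFunction <- function(arg1) {
  return(arg1 * 5)
}

print("with a number only")
single <- myFunction(2)
print(single)   # prints 10

print("with a vector")
vec <- myFunction(c(1, 2, 3))
print(vec)      # prints 5 10 15
```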
Try to make a function that takes three vectors, plots the first against the sum of the second and third, and returns the sum of all three vectors. Use the plot command we applied previously for help.
Now you can try this on vectors of the same length. We can use the ones in our data frame!
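One way the exercise could be solved, as a sketch. It assumes "the sum of all three vectors" means the element-by-element sum, and the function name `plotAndSum` is made up for this example. The data frame is redefined so the lines run on their own.

```r
# Same data frame as defined earlier, repeated so this sketch is self-contained
geneExpr <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Control = c(10, 20, 30),
  Treatment1 = c(15, 25, 35),
  Treatment2 = c(100, 0, 250)
)

# Plot the first vector against the sum of the other two,
# then return the element-by-element sum of all three
plotAndSum <- function(v1, v2, v3) {
  plot(x = v1,
       y = v2 + v3,
       xlab = "First vector",
       ylab = "Second + third vector",
       col = "blue",
       type = "b")
  return(v1 + v2 + v3)
}

# Use the numeric columns of the data frame as the three vectors
total <- plotAndSum(geneExpr$Control, geneExpr$Treatment1, geneExpr$Treatment2)
print(total)   # prints 125 45 315
```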
Many times we want to read files from Excel or other formats. R has many ways to do this, and where it does not, there is usually a package that can read the format you have. For reading an Excel file a great package is readxl. To read a csv file there is already the R function read.csv().
But wait, what are packages? Each package consists of a set of R and other scripts that meet specific needs of R users. For example, openxlsx reads from Excel files, which R cannot do on its own. There are thousands of packages out there, spanning all fields of science, and some have become very popular.
Try to install the package openxlsx. You can use the command install.packages("openxlsx") in your RStudio console. Otherwise, go to the bottom right panel and click Packages and then Install, like this
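To check whether the installation worked, a sketch like the following can help. It uses the base R function `requireNamespace()`, which only tests for the package and does not install anything; the variable names `pkg` and `isInstalled` are illustrative.

```r
# Name of the package we want to use
pkg <- "openxlsx"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
isInstalled <- requireNamespace(pkg, quietly = TRUE)

if (isInstalled) {
  message("Package ", pkg, " is already installed")
} else {
  # In this case, run install.packages("openxlsx") in the console
  message("Package ", pkg, " is missing; install it with install.packages()")
}
```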
Now you are ready to import an Excel file. To use the package, we can load it with library(openxlsx). Otherwise we need to write the package name before the command we want to use from it (as done below). You can read a file stored locally on your computer or from a URL, as done in this example.
df <- openxlsx::read.xlsx("https://github.com/AU-ABC/AU-ABC.github.io/raw/main/documentation/2024-06-13-instR/data/data.xlsx", sheet=1)
df
 | x | y | z |
---|---|---|---|
 | <dbl> | <dbl> | <dbl> |
1 | 10 | 30 | -1 |
2 | 20 | 20 | 0 |
3 | 30 | 10 | 1 |
Once you have a data frame, you can always save it. Remember, the path of the saved file is relative to your current working directory! To save your data frame as a csv file, you can use a function such as write.table(), where we remove the labels for the rows and use the tab separator instead of the comma. To read the file again, simply use read.csv() with the same separator.
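A sketch of saving and reading back the data frame, assuming `write.table()` for writing and `read.csv()` for reading (both are base R functions; the original commands may have differed). The file name `geneExpr.csv` is hypothetical, and the file is written to a temporary folder here so that nothing is left behind. Use a plain file name instead to save relative to your working directory.

```r
# Same data frame as defined earlier, repeated so this sketch is self-contained
geneExpr <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Control = c(10, 20, 30),
  Treatment1 = c(15, 25, 35),
  Treatment2 = c(100, 0, 250)
)

# Hypothetical output file in a temporary folder
outFile <- file.path(tempdir(), "geneExpr.csv")

# Save without row labels, using the tab character as separator
write.table(geneExpr, file = outFile, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file again, telling read.csv() that columns are separated by tabs
geneExprRead <- read.csv(outFile, sep = "\t")
geneExprRead
```

Because the values are read back from text, whole numbers come back as integers, which is why the column types in the printed table differ from those of the original data frame.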
Gene | Control | Treatment1 | Treatment2 |
---|---|---|---|
<chr> | <int> | <int> | <int> |
GeneA | 10 | 15 | 100 |
GeneB | 20 | 25 | 0 |
GeneC | 30 | 35 | 250 |
While typing a file path, you can press the Tab key on your keyboard to see a list of the available paths.