R is an open-source software package for mathematical and statistical computing.
Download the binaries for the base package
Direct link to a USA mirror for R for Windows: http://ftp.ussg.iu.edu/CRAN/
Console Window
Script Window
Some people dislike the native R interface. RStudio is a popular, free alternative.
1 + 2 * 3
1 - (4/2)
2^4
R has built-in functions that you can use
The general structure of a function is
functionname(arg1, arg2, ...)
where the arguments or parameters can be further specified
You can learn more about these using ?functionname
You can also create your own functions. We'll discuss that later.
Some simple examples
exp(5)
sqrt(4)
round(pi, digits=4)
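A small sketch of how arguments work, using only base R. Arguments can be matched by position or by name, and args() lists what a function accepts. The number 3.14159 is just an illustrative value.

```r
# Arguments can be given by position or by name
round(3.14159, 2)            # positional: the second argument is digits
round(3.14159, digits = 2)   # named: same result, 3.14

# List the arguments a function accepts
args(round)

# Open the help page for a function (uncomment to try it)
# ?round
```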
Vectors are a one-dimensional set of values that are all of the same type (e.g., number, string). Part of the power of R is the way it is able to perform vector operations efficiently.
c(1,2,3,4,5) # concatenate
1:5
Simple operations on vectors
sum(1:5)
length(1:5)
mean(1:5)
sd(1:5)
var(1:5)
min(1:5)
max(1:5)
diff(1:5)
There are different classes of vectors (numeric, integers, logical, character, datetime, factors). We will deal with these later.
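As a quick preview, class() reports the class of a vector. The values below are made up for illustration.

```r
# class() reports the class of a vector
class(c(1.5, 2, 3))            # "numeric"
class(1:5)                     # "integer"
class(c(TRUE, FALSE))          # "logical"
class(c("a", "b"))             # "character"
class(factor(c("lo", "hi")))   # "factor"
class(Sys.Date())              # "Date"

# Mixing types in c() coerces everything to one common type
class(c(1, "a", TRUE))         # "character"
```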
R stores variables, scalars, matrices, etc. as objects. They have properties and can be manipulated. They exist in the R environment--a workspace in which you can create objects and manipulate them.
Note: Assignment is via = or <-
x = 3
x = 1:5
mean(x)
m = mean(x)
Can see 'contents' of object by typing the name or print(x)
R is 'case sensitive'.
x is not the same object as X.
print() is not the same command as Print()
Common data modes are numeric and string/text/character
numeric
x = 4
character (text, string)
h = "Hello"
You can find the data mode of an object using mode(), or you can use str() to find the type/structure/content of an object
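A short sketch of mode() and str() applied to the two objects defined above:

```r
x = 4
h = "Hello"

mode(x)    # "numeric"
mode(h)    # "character"

str(x)     # num 4
str(h)     # chr "Hello"
str(1:5)   # int [1:5] 1 2 3 4 5
```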
The R Console is for doing one thing at a time. It is old school.
As an alternative, you can write out all commands/code in a Script window.
Easier to save and modify and share. File > New Script
You can run an entire script or parts of it by selecting the code and pressing Ctrl+R
Can use # to comment your code
All objects we have created exist in our workspace
See contents of workspace via ls()
Remove an object with rm()
Can Save and Load your workspace. Doing so makes it easy to package objects and results together.
Note: Can also save (or load) your History: The commands you've used in a session.
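The workspace commands can be sketched as follows. The object names a and b and the temporary file ws_file are hypothetical, made up for this example. save() stores selected objects; save.image() stores the whole workspace.

```r
a = 1:5
b = "Hello"

ls()           # lists the objects in the workspace, including "a" and "b"
rm(b)          # remove b
"b" %in% ls()  # FALSE

# Save selected objects to a file and load them back.
# ws_file is a hypothetical temporary path used only for this sketch.
ws_file = tempfile(fileext = ".RData")
save(a, file = ws_file)
rm(a)
load(ws_file)  # a is restored
a
```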
Vectors are the primary way in which we store data for variables--whether those are empirical observations or simulated ones. Thus, it is necessary to be familiar with the way vectors work.
The power of vectorization
x = 1:5
x + 5
Other vector creation tricks
seq(from = -3, to = 3, by = .05)
seq(from = -3, to = 3, length = 10)
rep(0, times = 5)
rep(1:3, times = 5)
rep(1:3, each = 5)
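A brief sketch of what these calls return, with shorter vectors so the output is easy to read. length.out is the full name of the argument that the examples above abbreviate as length.

```r
length(seq(from = -3, to = 3, by = .05))   # 121 values
seq(from = 0, to = 1, length.out = 5)      # 0.00 0.25 0.50 0.75 1.00

rep(1:3, times = 2)   # 1 2 3 1 2 3  (whole vector repeated)
rep(1:3, each = 2)    # 1 1 2 2 3 3  (each element repeated)
```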
Indexing in R
x[10] # returns NA here because x has only 5 elements
x[2:4]
x[c(1,4,5)]
Change values
x[2] = NA
Remove
x[-2] # drops the 2nd element (won't save without assignment)
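A minimal sketch of the difference between showing a result and saving it:

```r
x = 1:5
x[-2]       # 1 3 4 5, but x itself is unchanged
x           # 1 2 3 4 5

x = x[-2]   # assign the result to keep the change
x           # 1 3 4 5
```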
Logical comparisons return TRUE if the condition holds; FALSE otherwise.
x == y # is equal to
x != y # is not equal to
x > y # greater than
x < y # less than
x >= y # equal to or greater than
x & y # and
x | y # or
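These operators work element by element on vectors. A small sketch with made-up values:

```r
x = c(1, 5, 10)
y = c(2, 5, 3)

x == y           # FALSE  TRUE FALSE
x != y           #  TRUE FALSE  TRUE
x > y            # FALSE FALSE  TRUE

# & and | combine two logical vectors
x > 4 & y < 4    # FALSE FALSE  TRUE
x > 4 | y < 4    #  TRUE  TRUE  TRUE
```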
Which values of x are less than .05? Returns a vector of logicals
x = seq(0,.10, length=10)
x < .05
Which values of x are less than .05? Returns a vector of index values. Super useful.
which(x < .05)
Useful summary operations
sum(x < .05) # how many elements tested true?
any(x < .05) # did any of them test true?
all(x < .05) # did all of them test true?
Many functions in R will return NA or stop with an error if there are missing values. You have to know in advance how to deal with this problem. And, for better or worse, different functions handle missing data differently.
NA (not available)
x = NA
is.na(x)
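A short sketch of how missing values affect common functions, using made-up vectors. na.rm and use are real arguments of mean() and cor(); other functions may use different ones, so check the help page.

```r
x = c(1, 2, NA, 4)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 2.333333 (mean of 1, 2, and 4)
sum(is.na(x))            # 1 missing value
x[!is.na(x)]             # 1 2 4

# cor() uses a different argument for the same purpose
y = c(2, 4, 6, NA)
cor(x, y, use = "complete.obs")   # uses only pairs where both values are present
```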
A matrix is a data structure of common data type (e.g., numbers) with rows and columns. Matrices are commonly used in simulation work.
x = matrix(1:6, nrow=3, ncol=2, byrow=TRUE)
Show properties or dimensions
str(x) # structure
dim(x) # dimensions rows by cols
length(x) # total entries
Reference entries
x[2,2] # What is the value of the 2nd row, 2nd col?
x[2,2] = 55 # change that value
x[ ,2] # show all rows of second col
x[c(1,3), ] # show 1st and third row of all cols
x[1, ] = c(55,55) # replace 1st row with new vector
Show or create diagonal of square matrix
diag(x)
diag(x) = 1
Add rows or cols via row bind and column bind functions
x = rbind(x, c(44,44))
x = cbind(x, 1) # the 1 repeats here
Give rows and cols names
colnames(x) = c("Anxiety","Avoidance","Depression")
This allows one, if desired, to reference a variable by name rather than number. This is helpful for large data sets where you want to know something about a variable but don't know the column number without looking it up.
x[,"Anxiety"]
mean(x[,"Anxiety"])
x = matrix(1:4, 2, 2)
x + 5
t(x) # transpose matrix
x%*%t(x) # matrix multiplication
solve(x) # find inverse of a square matrix
diag(x) # find the diagonal of a square matrix
svd(x) # singular value decomposition of matrix
eigen(x) # compute eigenvalues/vectors for matrix
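One way to check an inverse is to multiply it by the original matrix, which should give the identity matrix. A minimal sketch; xinv is a name made up for this example.

```r
x = matrix(1:4, 2, 2)

x * x      # element-by-element multiplication
x %*% x    # matrix multiplication

# A matrix times its inverse gives the identity matrix
# (up to rounding error)
xinv = solve(x)
round(x %*% xinv, 10)
```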
A data frame is like a matrix, but can hold multiple types of data (e.g., numeric, characters). You will mostly work with data frames when using empirical data. They are the most natural analog to a spreadsheet in Excel or SPSS.
Can convert a matrix into a data frame
x.df = as.data.frame(x)
x.df
Note: Variable labels are added by default
dimnames(x.df)
These names can be changed if you wish
colnames(x.df) = c("X1","X2") # one name per column; x.df has two columns here
You can reference a single variable from a data frame easily with names
x.df$X1
Add variables to a dataframe
x.df$X3 = rep(4, nrow(x.df)) # the new variable needs one value per row
Some people find referencing a variable by first denoting the data structure in which it is contained cumbersome. (I like it, but typically use short names for my dataframes, such as "x" or "data".) An alternative is to "attach" a data frame so R treats it as the environment in which operations are performed.
attach(x.df)
mean(X1)
Now you can reference X1 directly by name rather than x.df$X1
If there are variables with names in the data frame that overlap with those in the global environment, the global variables have precedence. R will warn you of this.
You must "detach" the data frame when you're done or you'll create chaos.
detach(x.df)
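If attaching feels risky, with() is a base R alternative that evaluates one expression inside a data frame and needs no detach. A minimal sketch; the data frame d and its values are made up for this example.

```r
# d is a hypothetical data frame used only for this sketch
d = data.frame(X1 = c(1, 2, 3), X2 = c(4, 5, 6))

with(d, mean(X1))       # 2
with(d, cor(X1, X2))    # correlation between X1 and X2
```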
R can import data from a variety of sources. It is simplest, in my opinion, to create a comma-delimited file (csv) from any source (e.g., Excel) and import that into R.
But, you can also use libraries/packages to import data directly from SPSS or Excel files too. Read more here: http://www.r-tutor.com/r-introduction/data-frame/data-import
In this example, the data file has variable names in the first row (header=TRUE) and the entries are separated by commas (sep=",")
mydata <- read.table("http://yourpersonality.net/R Workshop/example.csv",
header=TRUE, sep=",")
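Note the doubled backslashes: inside an R string a single backslash starts an escape sequence, so each backslash in a Windows path is written twice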
mydata = read.csv("C:\\Users\\rcfraley.UOFI\\Dropbox\\mydata.csv")
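To try importing without a data file on hand, a small csv can be written to a temporary file and read back. This is only a sketch: csv_file, demo, and the values are made up for illustration.

```r
# csv_file is a hypothetical temporary file created only for this sketch
csv_file = tempfile(fileext = ".csv")
writeLines(c("ID,condition,y",
             "1,0,3.2",
             "2,1,4.8",
             "3,0,2.9"), csv_file)

demo = read.csv(csv_file)   # header=TRUE and sep="," are the defaults
str(demo)                   # 3 obs. of 3 variables
head(demo)                  # show the first rows
```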
To read in a file from SPSS, you have to first load the "foreign" library (install it first if it is not already on your system); Excel files require a different package. We will discuss libraries in more depth later; this is a placeholder.
library("foreign")
data<-read.spss("C:\\Users\\rcfraley.UOFI\\Dropbox\\someSPSSfile.sav")
data<-data.frame(data)
See data in a spreadsheet-like way (the V must be capitalized)
View(mydata)
See and edit data in a spreadsheet-like way
fix(mydata)
Summary Statistics (mean, med, max, min)
summary(mydata)
Correlation
cor(mydata$x, mydata$y)
Correlation matrix for selected variables
cor(mydata[,c(4,5,6,7)])
t-test
t.test(mydata$y ~ mydata$condition)
or
t.test(y ~ condition, data=mydata)
more here: http://www.statmethods.net/stats/ttest.html
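The result of t.test() can be stored and its pieces pulled out by name. A sketch on simulated data; the data frame sim, the seed, and the group means are made up for illustration.

```r
set.seed(1)   # arbitrary seed so the simulated numbers repeat
sim = data.frame(condition = rep(c(0, 1), each = 20),
                 y = c(rnorm(20, mean = 0), rnorm(20, mean = 1)))

# By default t.test() does not assume equal variances (Welch test);
# add var.equal = TRUE for the classic version
result = t.test(y ~ condition, data = sim)

result             # full printout
result$statistic   # t value
result$p.value     # p value
result$conf.int    # confidence interval for the mean difference
```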
ANOVA (simple one-way)
more here: http://www.statmethods.net/stats/anova.html
summary(aov(mydata$y ~ mydata$condition))
or
summary(aov(y ~ condition, data=mydata))
Regression
lm(y ~ x, data=mydata)
summary(lm(y ~ x, data=mydata))
Multiple Regression
lm(y ~ x1 + x2, data=mydata)
lm(y ~ x1 + x2 + x1*x2, data=mydata)
summary(lm(y ~ x1 + x2 + x1*x2, data=mydata))
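In an R formula, x1*x2 already expands to both main effects plus the interaction, so the shorter formula below fits the same model. A sketch on simulated data; sim, fit_a, fit_b, and the coefficients used to generate y are made up for illustration.

```r
set.seed(2)   # arbitrary seed so the simulated numbers repeat
sim = data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim$y = 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(50)

fit_a = lm(y ~ x1 + x2 + x1*x2, data = sim)
fit_b = lm(y ~ x1 * x2, data = sim)   # same model, shorter formula

coef(fit_a)                           # intercept, x1, x2, x1:x2
all.equal(coef(fit_a), coef(fit_b))   # TRUE
```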
Standardize Variables
scale(mydata$x)
If you want to save the standardized results, save the results as a new object.
mydata$zx = scale(mydata$x)
Standardize multiple variables quickly using the apply function, which applies a function to the rows (1) or cols (2) of a matrix/data frame
apply(mydata[4:7],2,scale)
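A quick check of what scale() does, using a made-up vector v: the result has a mean of 0 and a standard deviation of 1, and it comes back as a one-column matrix.

```r
v = c(2, 4, 6, 8, 10)   # made-up values for illustration
z = scale(v)            # returns a one-column matrix

mean(z)   # 0 (up to rounding error)
sd(z)     # 1

# Same thing by hand
(v - mean(v)) / sd(v)

# as.numeric() drops the matrix structure if a plain vector is wanted
as.numeric(z)
```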
Histogram
hist(mydata$y)
Scatterplot
plot(mydata$x, mydata$y)
Adjust various plotting parameters: http://www.statmethods.net/advgraphs/parameters.html
Add some labels
plot(mydata$x, mydata$y, ylab="Y axis label", xlab="X axis label",
main="Main graph label", pch=15)
Save a high-resolution graph for publication purposes (journals often want a tiff image file submitted separately and not embedded in your manuscript)
tiff("figure_1_example.tiff", width = 10000,
height = 10000, res = 1000)
plot(mydata$x, mydata$y, ylab="Subjective Well-being",
xlab="Coffee Consumption", main=" ", pch=15, cex.lab=1.3)
dev.off()
Subset of data for which condition is 0
mydata[which(mydata$condition==0), ]
Find mean of y (third variable) for this subset
mean(mydata[which(mydata$condition==0),3])
Create a new object to make it easier
z = mydata[which(mydata$condition==0),]
z$y
mean(z$y)
Can have multiple conditions using logicals
mydata[which(mydata$condition==0 & mydata$ID > 2), ]
Less ugly
# select cases where condition = 0
newData = subset(mydata, condition==0)
mean(newData$y)
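The two approaches select the same rows. The sketch below also shows one reason to wrap the condition in which(): without it, a missing value in the condition produces a row of NAs. The data frame sim is made up for illustration.

```r
# sim is a made-up data frame used only for this sketch
sim = data.frame(ID = 1:6,
                 condition = c(0, 1, 0, 1, 0, NA),
                 y = c(3, 5, 4, 6, 2, 7))

a = sim[which(sim$condition == 0), ]
b = subset(sim, condition == 0)
all.equal(a, b)   # TRUE

# Without which(), the row with a missing condition comes back as a row of NAs
sim[sim$condition == 0, ]

# Multiple conditions
subset(sim, condition == 0 & ID > 2)   # rows with ID 3 and 5
mean(b$y)                              # 3
```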