STAT 385 Statistics Programming Methods

Week 4 Notes

Created by Christopher Kinson

Things Covered in this Week’s Notes

An Introduction to R Chapter 5 Arrays and matrices
An Introduction to R Chapter 8 Probability distributions

ICYMI: The Git Golden Rule is to always pull your repo before you push.

ICYMI: An Introduction to R by Venables, Smith and the R Core Team is one of the reference textbooks in the syllabus.

An Introduction to R Chapter 5 Arrays and matrices

Arrays are general atomic objects of any dimension. Values fill in the array in column major order. The dim() function is used to see the length of each dimension of an object.

a <- array(1:8, dim=c(2,3,2))
dim(a)

## [1] 2 3 2

Subsetting through the use of the index such that locations can be selected with subscripts (discussed in Week 2 Notes)

a[2,1,1]

## [1] 2

a[2,1,]

## [1] 2 8

a[1,,]

##      [,1] [,2]
## [1,]    1    7
## [2,]    3    1
## [3,]    5    3

a[,1,]

##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8

a[,,1]

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

or as index matrix.

im <- array(c(1,1,2,2,2,3,2,3),dim=c(4,2))
im

##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    2    2
## [4,]    2    3

a[im]

## [1] 1 1 2 2 2 3 2 3

A matrix is a two-dimensional array. It’s atomic and has two subscripts (row,column). We could use the array() function to create a matrix. However, there are other ways to create a matrix.

We could use the matrix() function.

mat <- matrix(1:4,nrow=2,ncol=2)
mat

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Or the diag() function, which makes a square matrix with different values along the diagonal.

mat <- diag(1:2,2)
mat

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    2

Or the cbind() function, which binds columns together to form a matrix.

mat <- cbind(rep(1,3), rep(2,3))
mat

##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2
## [3,]    1    2

Or the rbind() function, which binds rows together to form a matrix.

mat <- rbind(rep(1,3), rep(2,3))
mat

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2

Both cbind and rbind allow for recycled vector elements if one vector is not the same length as the remaining vectors. Recycling is the idea that when vectors are of different length, the vector with shorter length will have repeated values (cycling from first to last) to attain the same length as the longer vector. See the example below.

v1 <- letters[1:10]
v2 <- LETTERS[1:20]
cbind(v1, v2)

##       v1  v2 
##  [1,] "a" "A"
##  [2,] "b" "B"
##  [3,] "c" "C"
##  [4,] "d" "D"
##  [5,] "e" "E"
##  [6,] "f" "F"
##  [7,] "g" "G"
##  [8,] "h" "H"
##  [9,] "i" "I"
## [10,] "j" "J"
## [11,] "a" "K"
## [12,] "b" "L"
## [13,] "c" "M"
## [14,] "d" "N"
## [15,] "e" "O"
## [16,] "f" "P"
## [17,] "g" "Q"
## [18,] "h" "R"
## [19,] "i" "S"
## [20,] "j" "T"

In addition, we can use the head() or tail() functions to show the first few or last few observations of an object in R.

head(a)

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    1    3
## [2,]    8    2    4

head(letters)

## [1] "a" "b" "c" "d" "e" "f"

tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

We can also perform operations on matrices and arrays. Numeric matrices can be multiplied (or vectors multiplied by matrices) using the %*% operator.

A <- matrix(1:4, ncol = 2)
B <- rbind(rep(1,2), rep(2,2))
A %*% B #result is 2 by 2

##      [,1] [,2]
## [1,]    7    7
## [2,]   10   10

c(10,5) %*% A #result is 1 by 2

##      [,1] [,2]
## [1,]   20   50

Additionally, we may need to transpose arrays and matrices for other calculations using the t() function. Or to invert an invertible matrix using solve().

t(A)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

solve(A)

##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5

A special class of arrays is the table class in R. These contingency tables are useful for counting the frequency of factors or other categorical variables. Cross-classification tables are made when there are two or more factors present.

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

cols1 <- c(rep("red",75), rep("green", 25), rep("blue", 50))
table(cols1)

## cols1
##  blue green   red 
##    50    25    75

table(cols1, iris$Species)

##        
## cols1   setosa versicolor virginica
##   blue       0          0        50
##   green      0         25         0
##   red       50         25         0

An Introduction to R Chapter 8 Probability distributions

R has almost all of the typical probability distributions that you have seen in STAT 100, 200, or 400. Thus we can use R to determine probabilities from these statistical tables. If “prob” represents a particular probability distribution, then we can compute the cumulative distribution function (dprob), the probability density function (pprob), the quantile function (qprob) and a realization or random deviate from the distribution (rprob).

In this course, we’ll mostly focus on generating random deviates from a given probability distribution.

runif(10) #min=0, max=1

##  [1] 0.15217478 0.93981130 0.37485750 0.09582579 0.17796041 0.93321958
##  [7] 0.93092780 0.90333615 0.94461630 0.62488620

rnorm(10) #mean=0, sd=1

##  [1]  0.6930700 -0.7614426 -0.1812283  0.8262827  0.8168122  0.8329949
##  [7]  1.7474677  0.8595739  0.4011501 -0.6481190

rt(10, 2) #df=2

##  [1] -1.39214736  2.16688225  1.56301263 -2.43153267 -1.18864700 -0.45655035
##  [7] -0.51597692 -0.13460450 -0.75399228 -0.05064909

Notice that these are continuous distributions and will almost never produce integers as realizations. If we wanted integers that followed known probability distributions, we could round these real numbers to be integers with the round() function.

round(runif(10),0)

##  [1] 0 0 1 1 0 1 1 0 0 0

round(rnorm(10),0)

##  [1]  0 -2 -1  2  0 -1 -2  0  1  2

round(rt(10, 2),0)

##  [1]  0  0  1  0  2 -1  0  0  1  0

Better yet, discrete distribution random deviates might be the way to go.

rbinom(10, 5, 0.7)

##  [1] 3 1 3 5 3 4 5 3 4 3

rgeom(10, 0.7)

##  [1] 0 0 1 1 0 1 1 0 0 1

rpois(10, 2)

##  [1] 0 2 2 2 5 2 0 2 1 2

Because randomization will occur in future discussions, I want to make you aware of the idea of seeds and sampling.

It’s extremely important that you make sure your version of R is the most up to date from this point on.

Think of seeds as values our machine requires to produce random numbers. Every computer (and R session) has a starting seed number that is almost unique to your machine at any time. But when we might do more advanced computation (e.g., parallel programming or reproducible simulations), it will benefit us to set the seed number to be the same for all computers so that everyone can achieve exactly the same result. We can choose a value for the starting seed of a random number generator or let the starting seed value be chosen randomly (not always advised for repeated experiments).

set.seed(runif(1))
runif(5) #this result will change

## [1] 0.8966972 0.2655087 0.3721239 0.5728534 0.9082078

set.seed(385)
runif(5) #this result won't ever change

## [1] 0.8125824 0.4330019 0.8724149 0.9960230 0.3508466

set.seed(runif(1))
runif(5) #this result will change and will not match any of the results above

## [1] 0.8966972 0.2655087 0.3721239 0.5728534 0.9082078

These seed numbers affect randomization completely and will alter the random deviates produced for any probability distribution. Sometimes, though, we don’t want randomization to be based on a probability distribution. Sometimes, we simply want a random set of values given a set of values. Sampling in this way with or without replacement can also serve as a permutation of known values to change the arrangement. We can take a sample of values in R using the sample() function.

sample(1:10, 10, replace=FALSE)

##  [1]  2  7  3  6 10  8  5  1  4  9

sample(1:10, 10, replace=TRUE)

##  [1] 10  6 10  7  9  5  5  9  9  5

sample(letters)

##  [1] "e" "b" "j" "l" "o" "a" "t" "c" "f" "x" "q" "r" "d" "w" "n" "p" "i" "g" "v"
## [20] "u" "m" "k" "s" "y" "h" "z"

set.seed(385)
vec0 <- runif(10)
sample(vec0)

##  [1] 0.8125824 0.4832165 0.3508466 0.7209431 0.4330019 0.2125395 0.2741956
##  [8] 0.8724149 0.9661465 0.9960230

STAT 385 Statistics Programming Methods - Fall 2020