Things Covered in this Week’s Notes


Discriminant Analysis

This is a classical supervised method for classification that works well when the response has multiple classes, the predictors are continuous, and the classes are approximately normally distributed. Discriminant analysis is loosely related to support vector machines, in that both construct boundaries that separate classes. It can also be viewed as an extension of logistic regression, because we assume the response has known classes or levels into which we want to classify observations. To do the classifying, discriminant functions are estimated based on posterior probabilities. The discriminant functions create dividers that aim to produce well-separated classes, and the form of these dividers depends on the assumptions made about the data. The two most common types of discriminant functions are linear and quadratic.
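To make the idea concrete, here is a minimal sketch (with invented toy data, not the inspection data used below) of the linear discriminant score for a single predictor and two classes with a shared variance; a new point is classified to whichever class gives the larger score:

```r
# Toy sketch of a two-class linear discriminant rule (invented data): with one
# predictor and a pooled variance s2, the linear discriminant score is
# delta_k(x) = x*mu_k/s2 - mu_k^2/(2*s2) + log(pi_k),
# and we classify x to the class with the larger score.
x_a <- c(1.0, 1.2, 0.8, 1.1)   # class A observations
x_b <- c(3.0, 2.8, 3.2, 3.1)   # class B observations
mu_a <- mean(x_a); mu_b <- mean(x_b)
s2 <- (sum((x_a - mu_a)^2) + sum((x_b - mu_b)^2)) /
      (length(x_a) + length(x_b) - 2)       # pooled variance estimate
prior_a <- prior_b <- 0.5                   # equal prior probabilities

delta <- function(x, mu, s2, prior) x * mu / s2 - mu^2 / (2 * s2) + log(prior)

new_x <- 1.4
scores <- c(A = delta(new_x, mu_a, s2, prior_a),
            B = delta(new_x, mu_b, s2, prior_b))
names(which.max(scores))  # classifies the new point to class "A"
```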

Some Terminology

  • Discriminant function: effectively defines the dividing boundary between groups

  • Prior probabilities: the relative sizes of the known groups in the general population

  • Posterior probabilities: the probabilities that an observation with specific explanatory-variable values belongs to each of the known groups

  • Misclassification rate: the estimated percentage of observations classified into the wrong group (estimation methods include re-substitution, cross-validation, or separate training and testing sets)

  • Training set: the set of observations the discrimination is based on

  • Testing set: observations with known group, not included in the training set, used to estimate how well the classifier will work on new observations
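As a quick illustration of the misclassification rate (using made-up labels, not the course data):

```r
# Sketch: the misclassification rate is the share of off-diagonal entries in
# the confusion matrix, i.e. the proportion of observations assigned to the
# wrong group (toy predicted/actual labels invented here).
actual    <- factor(c("Pass", "Pass", "Fail", "Pass", "Fail", "Pass"))
predicted <- factor(c("Pass", "Fail", "Fail", "Pass", "Pass", "Pass"))

conf <- table(predicted, actual)       # confusion matrix
misclass <- mean(predicted != actual)  # equivalently 1 - sum(diag(conf))/sum(conf)
misclass                               # 2 wrong out of 6
```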

A strategy for discriminant analysis

  • Remove any collinearity and extreme observations.

  • Re-scale or standardize, if necessary.

  • Check prior probabilities for the classes.

  • Partition the data into training and testing sets.

  • Perform appropriate discriminant analysis.

  • Evaluate the performance of the classifier.

For these examples, we will work with the Chicago Food Inspections Data and subset it for simplicity.

Remove any collinearity and extreme observations.

Here we can import the data, remove duplicate facilities, and create a new variable representing the total number of violations a facility has received.

rm(list=ls())
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
food <- read_csv("https://uofi.box.com/shared/static/5637axblfhajotail80yw7j2s4r27hxd.csv", 
    col_types = cols(Address = col_skip(), 
        `Census Tracts` = col_skip(), City = col_skip(), 
        `Community Areas` = col_skip(), `Historical Wards 2003-2015` = col_skip(), 
        `Inspection Date` = col_date(format = "%m/%d/%Y"), 
        Location = col_skip(), State = col_skip(), 
        Wards = col_skip(), `Zip Codes` = col_skip()))

dim(food)
## [1] 187787     13
colnames(food) <- tolower(colnames(food))

food <- filter(food, `license #` > 0)

food <- arrange(food, desc(`inspection date`)) #sorts data by newest observations first
head(food$`inspection date`);tail(food$`inspection date`)
## [1] "2019-05-31" "2019-05-31" "2019-05-31" "2019-05-31" "2019-05-31"
## [6] "2019-05-31"
## [1] "2010-01-04" "2010-01-04" "2010-01-04" "2010-01-04" "2010-01-04"
## [6] "2010-01-04"
food2 <- distinct(food, `license #`, .keep_all=TRUE) # keeps the distinct (unique) businesses (where only first entries are kept)
dim(food2)
## [1] 36442    13
food2$totalviolations <- str_count(food2$violations, "\\|") +1 # we could create a new variable that counts the number of violations
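The pipe-counting trick works because violations are recorded as a single "|"-separated string; a small sketch with a made-up violations string (not an actual record from the data):

```r
library(stringr)
# Entries in the violations field are separated by "|", so the number of
# violations equals the number of pipe delimiters plus one.
v <- "32. FOOD PROPERLY LABELED | 34. FLOORS CLEAN | 41. PREMISES MAINTAINED"
str_count(v, "\\|") + 1  # 3 violations
```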

Next, we simplify the data to 3 possible classes for the result of the food inspection and remove missing values for the total violations predictor.

food3 <- filter(food2, results == "Pass" | results == "Fail" | results == "Pass w/ Conditions")
table(food3$results)
## 
##               Fail               Pass Pass w/ Conditions 
##                688              13328               6868
sum(is.na(food3$totalviolations))
## [1] 6541
sum(is.na(food3$results))
## [1] 0
miss <- which(is.na(food3$totalviolations))

length(miss)
## [1] 6541
food4 <- select(food3[-miss,], results, totalviolations)
sum(is.na(food4))
## [1] 0
set.seed(448)
rs <- sample(nrow(food4),800)
food5 <- food4[rs,]

#rm(food,food2,food3,food4)

Re-scale or standardize, if necessary.

Standardize the predictors.

qqnorm(food5$totalviolations)
qqline(food5$totalviolations)

food6 <- food5 %>%
  mutate(newresults = factor(results),
         st_violations = scale(totalviolations)) %>%
  select(-results, -totalviolations)

Check prior probabilities for the classes.

table(food6$newresults)/nrow(food6) #just the proportions
## 
##               Fail               Pass Pass w/ Conditions 
##             0.0175             0.4725             0.5100
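By default, `MASS::lda()` estimates priors from the class frequencies in the training data; if the sample proportions are thought unrepresentative of the population, priors can be supplied explicitly through its `prior` argument. A small sketch with simulated data (the data frame `d` and its values are invented for illustration):

```r
library(MASS)
# Sketch: supplying explicit prior probabilities to lda(). The values must
# sum to 1 and follow the order of the factor levels of the response.
set.seed(1)
d <- data.frame(y = factor(rep(c("A", "B", "C"), each = 30)),
                x = c(rnorm(30, 0), rnorm(30, 2), rnorm(30, 4)))
fit_default <- lda(y ~ x, data = d)                        # priors from data: 1/3 each
fit_custom  <- lda(y ~ x, data = d, prior = c(.5, .3, .2)) # user-specified priors
fit_custom$prior
```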

Partition the data into training and testing sets.

# R code (borrowed from Dalpiaz's textbook)

# partitioning the data - 75% training, 25% testing
set.seed(448)
ids<-sample(nrow(food6),floor(0.75*nrow(food6)))
trainingData <- food6[ids,]
testingData <- food6[-ids,]

colnames(food6)
## [1] "newresults"    "st_violations"
#assuming the response is results of the inspection
trainingData_response <- trainingData$newresults
testingData_response <- testingData$newresults

trainingData_predictors <- trainingData[,-1]
testingData_predictors <- testingData[,-1]

Perform appropriate discriminant analysis.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
ldarez <- lda(newresults ~ st_violations, data=trainingData)
qdarez <-qda(newresults ~ st_violations, data=trainingData)

predlda <-predict(ldarez,trainingData)$class
predqda <-predict(qdarez,trainingData)$class

Evaluate the performance of the classifier.

For the training dataset, we have the LDA and QDA performance metrics.

table(predlda,trainingData_response) #LDA confusion matrix
##                     trainingData_response
## predlda              Fail Pass Pass w/ Conditions
##   Fail                  0    0                  0
##   Pass                  7  238                121
##   Pass w/ Conditions    5   39                190
mean(predlda!=trainingData_response) #LDA misclassification rate
## [1] 0.2866667
mean(predlda==trainingData_response) #LDA classification rate
## [1] 0.7133333
table(predqda,trainingData_response) #QDA confusion matrix
##                     trainingData_response
## predqda              Fail Pass Pass w/ Conditions
##   Fail                  1    0                  1
##   Pass                  7  238                121
##   Pass w/ Conditions    4   39                189
mean(predqda!=trainingData_response) #QDA misclassification rate
## [1] 0.2866667
mean(predqda==trainingData_response) #QDA classification rate
## [1] 0.7133333

For the testing dataset, we have these performance metrics.

predzlda <- predict(ldarez,testingData)$class
table(predzlda,testingData_response) #LDA 
##                     testingData_response
## predzlda             Fail Pass Pass w/ Conditions
##   Fail                  0    0                  0
##   Pass                  1   83                 44
##   Pass w/ Conditions    1   18                 53
mean(predzlda!=testingData_response) #misclassification rate
## [1] 0.32
mean(predzlda==testingData_response) #classification rate
## [1] 0.68
detach("package:MASS")

kNN Classification

k-nearest neighbors (kNN) is a non-parametric classification method. Instead of assuming a probability distribution for the response \(Y\), we only specify \(k\), the number of neighboring points used to estimate the class of an observation; the predicted class is decided by a majority vote of those neighbors. Distances among the predictors are computed behind the scenes, so the predictors must be numeric and the data may need to be standardized (or re-scaled). The response needs to be a factor with at least 2 levels. Like discriminant analysis, kNN can handle a response with more than two classes.
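A hand-rolled sketch of the kNN vote (toy data invented here, not the natality data): for a new point, find the \(k\) closest training points and take the majority class.

```r
# Manual kNN for a single new point with one numeric predictor.
train_x <- c(1.0, 1.1, 0.9, 3.0, 3.2, 2.9)
train_y <- factor(c("A", "A", "A", "B", "B", "B"))
new_x   <- 1.2
k       <- 3

d     <- abs(train_x - new_x)   # distances to the new point
nn    <- order(d)[1:k]          # indices of the k nearest neighbors
votes <- table(train_y[nn])     # tally the neighbors' classes
names(which.max(votes))         # predicted class: "A"
```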

How to choose \(k\)?

For the data examples, we will use only the first three parts out of 20 total parts of the 2015 US Natality Data. To do a valid data analysis, we would need to import all 20 parts, combine them into a single data frame, and subset from that entire dataset.

kNN Classification with a Binary response

We begin by importing the US Natality Data.

rm(list = ls())

library(tidyverse)
birth001 <- read_csv("https://uofi.box.com/shared/static/gb07nz7jafspk65wzxq6awlu02kypiq7.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   MAGE_IMPFLG = col_logical(),
##   MAGE_REPFLG = col_logical(),
##   MAR_P = col_character(),
##   WIC = col_character(),
##   CIG_REC = col_character(),
##   RF_PDIAB = col_character(),
##   RF_GDIAB = col_character(),
##   RF_PHYPE = col_character(),
##   RF_GHYPE = col_character(),
##   RF_EHYPE = col_character(),
##   RF_PPTERM = col_character(),
##   RF_INFTR = col_character(),
##   RF_REDRG = col_character(),
##   RF_ARTEC = col_character(),
##   RF_CESAR = col_character(),
##   IP_GON = col_character(),
##   IP_SYPH = col_character(),
##   IP_CHLAM = col_character(),
##   IP_HEPB = col_character(),
##   IP_HEPC = col_character()
##   # ... with 40 more columns
## )
## See spec(...) for full column specifications.
birth002 <- read_csv("https://uofi.box.com/shared/static/kr2s5wp3jpxdlpcjmq2qs15oo4797u4y.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   MAGE_IMPFLG = col_logical(),
##   MAGE_REPFLG = col_logical(),
##   MAR_P = col_character(),
##   MAR_IMP = col_logical(),
##   FAGERPT_FLG = col_logical(),
##   WIC = col_character(),
##   CIG_REC = col_character(),
##   RF_PDIAB = col_character(),
##   RF_GDIAB = col_character(),
##   RF_PHYPE = col_character(),
##   RF_GHYPE = col_character(),
##   RF_EHYPE = col_character(),
##   RF_PPTERM = col_character(),
##   RF_INFTR = col_character(),
##   RF_REDRG = col_character(),
##   RF_ARTEC = col_character(),
##   RF_CESAR = col_character(),
##   IP_GON = col_character(),
##   IP_SYPH = col_character(),
##   IP_CHLAM = col_character()
##   # ... with 42 more columns
## )
## See spec(...) for full column specifications.
birth003 <- read_csv("https://uofi.box.com/shared/static/hwsoobqngb8zfyjbwxvmrejjz1lg3yeh.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   MAGE_IMPFLG = col_logical(),
##   MAGE_REPFLG = col_logical(),
##   MAR_P = col_character(),
##   MAR_IMP = col_logical(),
##   FAGERPT_FLG = col_logical(),
##   WIC = col_character(),
##   CIG_REC = col_character(),
##   RF_PDIAB = col_character(),
##   RF_GDIAB = col_character(),
##   RF_PHYPE = col_character(),
##   RF_GHYPE = col_character(),
##   RF_EHYPE = col_character(),
##   RF_PPTERM = col_character(),
##   RF_INFTR = col_character(),
##   RF_REDRG = col_character(),
##   RF_ARTEC = col_character(),
##   RF_CESAR = col_character(),
##   IP_GON = col_character(),
##   IP_SYPH = col_character(),
##   IP_CHLAM = col_character()
##   # ... with 42 more columns
## )
## See spec(...) for full column specifications.
birth1 <- rbind(birth001, birth002, birth003)
rm(birth001,birth002,birth003)
colnames(birth1) <- tolower(colnames(birth1))
dim(birth1)
## [1] 600000    241
#removing some unknown or NAs based on data key
rs1 <- which(birth1$dbwt==9999); sum(birth1$dbwt==9999)
## [1] 217
rs1 <- c(rs1,which(birth1$apgar5==99)); sum(birth1$apgar5==99)
## [1] 2926
rs1 <- c(rs1,which(birth1$precare==99)); sum(birth1$precare==99)
## [1] 13559
rs1 <- c(rs1,which(birth1$cig_0==99)); sum(birth1$cig_0==99)
## [1] 3617
rs1 <- c(rs1,which(birth1$wtgain==99)); sum(birth1$wtgain==99)
## [1] 22883
rs1 <- c(rs1,which(birth1$dmeth_rec==9)); sum(birth1$dmeth_rec==9)
## [1] 9
rs2 <- unique(rs1)
birth11 <- birth1[-rs2,]

set.seed(448)
rs <- sample(nrow(birth11),667)
birth01 <- birth11[rs,]
rm(birth1,birth11)

#Suppose we know these variables to be collinear: precare, previs, cig_0, cig_1, cig_2, cig_3, pwgt_r, dwgt_r, wtgain, apgar5 , apgar10


head(colnames(birth01))
## [1] "dob_yy"     "dob_mm"     "dob_tt"     "dob_wk"     "bfacil"    
## [6] "f_facility"

Let’s standardize the predictors by centering at the mean and scaling by the standard deviation.

#Suppose we use the following variables as predictors: cigb4preg (an indicator for smoking before pregnancy, built from cig0_r), apgar5, wtgain, dbwt
birth01$cigb4preg <- ifelse(birth01$cig0_r==0,0,1)

birth02 <- mutate(birth01, newdmeth_rec=factor(dmeth_rec)) %>% select(newdmeth_rec, cigb4preg, apgar5, wtgain, dbwt)

colnames(birth02)
## [1] "newdmeth_rec" "cigb4preg"    "apgar5"       "wtgain"      
## [5] "dbwt"
birth2s <- cbind(birth02[,1:2], scale(birth02[,-(1:2)]) )
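As a quick check of what `scale()` does (toy vector, not the natality data): it centers each column at its mean and divides by its standard deviation, so the result has mean 0 and standard deviation 1.

```r
# Standardizing a small made-up vector with scale().
v <- c(10, 20, 30, 40)
s <- scale(v)
round(mean(s), 10)  # 0
sd(s)               # 1
```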

Partition the data into training and testing sets.

# partitioning the data - 70% training, 30% testing
set.seed(448)
ids<-sample(nrow(birth2s),floor(0.70*nrow(birth2s)))

trainingData <- birth2s[ids,]
testingData <- birth2s[-ids,]

#the response here is the delivery method (dmeth_rec)
trainingData_response <- trainingData$newdmeth_rec
testingData_response <- testingData$newdmeth_rec

trainingData_predictors <- trainingData[,-1]
testingData_predictors <- testingData[,-1]

rm(birth01,birth02,birth2s)

Check the results on the scaled training dataset.

library(class)
knntrainingresults <- knn(train=trainingData_predictors, test=trainingData_predictors, cl=trainingData_response, k=3)
table(trainingData_response,knntrainingresults) #classification table (for the training Data)
##                      knntrainingresults
## trainingData_response   1   2
##                     1 278  35
##                     2  67  86
mean(trainingData_response!=knntrainingresults) #classification error (for the training Data)
## [1] 0.2188841

Check the results on the scaled testing dataset.

knntestingresults <- knn(train=trainingData_predictors, test=testingData_predictors, cl=trainingData_response, k=3)
table(testingData_response,knntestingresults) #classification table (for the testing Data)
##                     knntestingresults
## testingData_response   1   2
##                    1 106  33
##                    2  44  18
mean(testingData_response!=knntestingresults) #classification error (for the testing Data)
## [1] 0.3830846

Instead of arbitrarily picking the \(k\), we can try looping over different values of \(k\) (See Dalpiaz textbook “12.1 Binary Data Example”).

set.seed(448)
ks = 1:(min(length(trainingData_response),100))
errors = rep(x = 0, times = length(ks))

#loop over candidate k values, recording the test error rate for each
for (i in seq_along(ks)) {
  results = knn(train = trainingData_predictors, 
             test  = testingData_predictors, 
             cl    = trainingData_response, 
             k     = ks[i])
  errors[i] = mean(testingData_response != results)
}
errors #test error rates
##   [1] 0.4527363 0.4079602 0.3930348 0.3830846 0.3781095 0.3781095 0.3681592
##   [8] 0.3781095 0.3681592 0.3482587 0.3482587 0.3233831 0.3432836 0.3432836
##  [15] 0.3233831 0.3283582 0.3134328 0.2985075 0.3184080 0.3233831 0.3333333
##  [22] 0.3283582 0.3333333 0.3333333 0.3333333 0.3283582 0.3383085 0.3283582
##  [29] 0.3233831 0.3283582 0.3233831 0.3084577 0.3134328 0.3084577 0.3134328
##  [36] 0.3084577 0.3134328 0.3084577 0.3084577 0.3034826 0.3084577 0.3134328
##  [43] 0.3084577 0.3084577 0.3134328 0.3134328 0.3134328 0.3134328 0.3134328
##  [50] 0.3134328 0.3134328 0.3134328 0.3134328 0.3134328 0.3134328 0.3134328
##  [57] 0.3134328 0.3134328 0.3134328 0.3134328 0.3084577 0.3084577 0.3084577
##  [64] 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577
##  [71] 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577
##  [78] 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577
##  [85] 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577
##  [92] 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577 0.3084577
##  [99] 0.3084577 0.3084577
mean(testingData_response==1) #proportion of responses in class 1 (baseline classification rate)
## [1] 0.6915423
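After computing a test error for each candidate \(k\), one way to sketch the selection step (using a short made-up error vector here, standing in for the `errors` vector from the loop): when several \(k\) tie for the minimum error, preferring the largest tied \(k\) gives the least flexible model among the best performers.

```r
# Selecting k from an error curve (toy error values for illustration).
errors <- c(0.45, 0.38, 0.32, 0.30, 0.30, 0.31, 0.33)
ks     <- seq_along(errors)

best_ks <- ks[errors == min(errors)]  # every k achieving the minimum error
max(best_ks)                          # prefer the largest tied k -> 5
```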

Principal Components Analysis (PCA)

For the data examples, we will use only the first part out of 20 total parts of the 2015 US Natality Data. To do a valid data analysis, we would need to import all 20 parts, combine them into a single data frame, and subset from that entire dataset.

rm(list = ls())
library(tidyverse)
birth1 <- read_csv("https://uofi.box.com/shared/static/gb07nz7jafspk65wzxq6awlu02kypiq7.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   MAGE_IMPFLG = col_logical(),
##   MAGE_REPFLG = col_logical(),
##   MAR_P = col_character(),
##   WIC = col_character(),
##   CIG_REC = col_character(),
##   RF_PDIAB = col_character(),
##   RF_GDIAB = col_character(),
##   RF_PHYPE = col_character(),
##   RF_GHYPE = col_character(),
##   RF_EHYPE = col_character(),
##   RF_PPTERM = col_character(),
##   RF_INFTR = col_character(),
##   RF_REDRG = col_character(),
##   RF_ARTEC = col_character(),
##   RF_CESAR = col_character(),
##   IP_GON = col_character(),
##   IP_SYPH = col_character(),
##   IP_CHLAM = col_character(),
##   IP_HEPB = col_character(),
##   IP_HEPC = col_character()
##   # ... with 40 more columns
## )
## See spec(...) for full column specifications.
colnames(birth1) <- tolower(colnames(birth1))
dim(birth1)
## [1] 200000    241
set.seed(448)
rs <- sample(nrow(birth1),200)
birth01 <- birth1[rs,]
rm(birth1)


cor(select(birth01, precare, previs, cig_0, cig_1, cig_2, cig_3, pwgt_r, dwgt_r, wtgain, apgar5 , apgar10))
##              precare       previs        cig_0       cig_1        cig_2
## precare  1.000000000  0.700075473 -0.032628471 -0.02793144 -0.025838567
## previs   0.700075473  1.000000000 -0.044284753 -0.02726075 -0.027657189
## cig_0   -0.032628471 -0.044284753  1.000000000  0.73276053  0.717474591
## cig_1   -0.027931437 -0.027260751  0.732760531  1.00000000  0.978449455
## cig_2   -0.025838567 -0.027657189  0.717474591  0.97844945  1.000000000
## cig_3   -0.025191035 -0.019186638  0.706515300  0.98512231  0.986580866
## pwgt_r  -0.002593690 -0.014011632 -0.056780416 -0.04363812 -0.041909503
## dwgt_r   0.290107647  0.305614608 -0.067659651 -0.05556439 -0.056386244
## wtgain   0.081863221 -0.003647720 -0.007662838  0.01369831 -0.006127750
## apgar5  -0.089007908 -0.006579596  0.037072805  0.02107618  0.003993934
## apgar10  0.009233443  0.023918142  0.025907652  0.02169362  0.019333771
##                 cig_3      pwgt_r       dwgt_r        wtgain       apgar5
## precare -0.0251910353 -0.00259369  0.290107647  0.0818632206 -0.089007908
## previs  -0.0191866376 -0.01401163  0.305614608 -0.0036477199 -0.006579596
## cig_0    0.7065152996 -0.05678042 -0.067659651 -0.0076628378  0.037072805
## cig_1    0.9851223128 -0.04363812 -0.055564390  0.0136983108  0.021076183
## cig_2    0.9865808659 -0.04190950 -0.056386244 -0.0061277495  0.003993934
## cig_3    1.0000000000 -0.03697427 -0.049156468  0.0008737034  0.011812992
## pwgt_r  -0.0369742722  1.00000000  0.432845057  0.4328778326  0.064175847
## dwgt_r  -0.0491564678  0.43284506  1.000000000  0.3495112494 -0.006115387
## wtgain   0.0008737034  0.43287783  0.349511249  1.0000000000  0.005385297
## apgar5   0.0118129920  0.06417585 -0.006115387  0.0053852968  1.000000000
## apgar10  0.0182618964  0.03991818  0.050581477  0.0506135586  0.714020970
##             apgar10
## precare 0.009233443
## previs  0.023918142
## cig_0   0.025907652
## cig_1   0.021693617
## cig_2   0.019333771
## cig_3   0.018261896
## pwgt_r  0.039918182
## dwgt_r  0.050581477
## wtgain  0.050613559
## apgar5  0.714020970
## apgar10 1.000000000

Objectives are:

  1. Detect interesting features and underlying patterns such as grouping behaviors among variables.

  2. Capture the variation in the data, thereby summarizing it so that we don’t have to deal with the entire data set.

How does it work?

  • For a multivariate data set \(\mathbf{X}\) with \(J\) variables, we have its sample covariance matrix \(\mathbf{V}\) and sample correlation matrix \(\mathbf{R}\).

  • Typically, we assume \(\mathbf{X}\) has a mean of \(\mathbf{0}\) i.e. it is centered.

  • The correlation matrix essentially re-scales the variances.

  • We can decompose \(\mathbf{R}\) as \(\mathbf{R}=\sum_{j=1}^{J} \alpha_j \mathbf{e}_j\mathbf{e}_j^{\top}\).

    • \(\alpha_j\) are eigenvalues which measure variance of the principal components

    • \(\mathbf{e}_j\) are eigenvectors which measure loadings or directions among the components

    • \(\mathbf{e}_j\) must be mutually orthogonal (\(\mathbf{e}_j^{\top} \mathbf{e}_k = 0\) for \(j \neq k\)) and have unit length (\(\mathbf{e}_j^{\top} \mathbf{e}_j = 1\)); together these conditions make them orthonormal.

pcad1 <- prcomp(~ precare+ previs+ cig_0+ cig_1+ cig_2+ cig_3+ pwgt_r+ dwgt_r+ wtgain+ apgar5+ apgar10, data = birth01, scale = TRUE) #correlation-based PCA
  • \(\mathbf{Z}=\mathbf{X} \mathbf{E}\) are principal component scores (the transformed \(\mathbf{X}\))

  • The principal components \(z_1,z_2,\ldots,z_J\) have the same total variation as the \(x_1,x_2,\ldots,x_J\)

  • The variation is ordered from largest to smallest i.e. \(\alpha_1 \geq \alpha_2 \geq\cdots \geq \alpha_J\)

evals<-pcad1$sdev^2 #eigenvalues
  • The proportion of variation for a particular component \(z_j\) is \(\frac{\alpha_j}{\sum_{j=1}^{J} \alpha_j}\)

  • We usually select a subset of these components which capture enough of that variation.
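The properties above can be verified numerically; a small sketch on simulated data (invented here, not the natality data) showing that `prcomp` with `scale = TRUE` reproduces the eigen-decomposition of the correlation matrix, that the loadings are orthonormal, and that the component variances sum to \(J\):

```r
# Verify: PC variances = eigenvalues of the correlation matrix,
# loadings are orthonormal (E^T E = I), and total variance = J.
set.seed(448)
X <- matrix(rnorm(300), ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]  # induce some correlation

p  <- prcomp(X, scale = TRUE)    # correlation-based PCA
ev <- eigen(cor(X))

all.equal(p$sdev^2, ev$values)   # eigenvalues match the PC variances
crossprod(p$rotation)            # approximately the identity matrix
sum(p$sdev^2)                    # equals J = 3 for a correlation-based PCA
```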

How to reduce the dimension? i.e. How to decide on the number of components to keep?

A. Retain enough components to reach a particular cumulative proportion of variation, usually a high percentage such as 85% or more (depending on the size of the data)

evals/sum(evals)
##  [1] 0.3261188416 0.1865234689 0.1588025848 0.1380154871 0.0562087719
##  [6] 0.0439696622 0.0361113806 0.0285721045 0.0227959292 0.0018854932
## [11] 0.0009962759
cumsum(evals/sum(evals))
##  [1] 0.3261188 0.5126423 0.6714449 0.8094604 0.8656692 0.9096388 0.9457502
##  [8] 0.9743223 0.9971182 0.9990037 1.0000000

or

B. Keep components whose eigenvalues exceed the average eigenvalue. For a correlation-based PCA, the average eigenvalue equals 1; for a covariance-based PCA, it equals \(\frac{1}{J} \sum_{j=1}^{J} \alpha_j\).

evals[evals>mean(evals)]
## [1] 3.587307 2.051758 1.746828 1.518170

or

C. Draw a scree plot and select the number of eigenvalues above the scree, i.e., before the elbow

# R code
plot(1:length(pcad1$sdev),pcad1$sdev^2/sum(pcad1$sdev^2), type="l", axes=FALSE,xlab="Number of PCs", ylab="Variance", main="Scree Plot")
axis(1, seq(1,length(pcad1$sdev),1))
axis(2, seq(0,round(max(evals/sum(evals)),1),0.1) )

How to interpret the components that you kept?

Usually, the components can be interpreted through the eigenvectors (using a table or a loadings plot) based on a typical behavior.

  • If all (or almost all) eigenvector values are in the same direction (sign) and not near 0, then we can interpret that component as an averaging or overall behavior.

  • If some eigenvector values are in one direction (positive/negative) while others are in the opposite direction (negative/positive), and none are near 0, then we can interpret that component as a grouping or contrasting behavior. For such a component, the goal is to describe the attributes that distinguish the two groups.
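A tiny sketch of reading a loading pattern (made-up eigenvector values, not the fitted loadings below): same-sign loadings suggest an overall/averaging component, while mixed signs suggest a contrast between groups of variables.

```r
# Classify a component's behavior from the signs of its loadings
# (illustrative values only; assumes no loadings are near 0).
e1 <- c(0.45, 0.52, 0.48, 0.55)    # all same sign: overall/average behavior
e2 <- c(0.60, 0.55, -0.40, -0.42)  # mixed signs: contrast between two groups

sign_pattern <- function(e) if (all(sign(e) == sign(e[1]))) "overall" else "contrast"
sign_pattern(e1)  # "overall"
sign_pattern(e2)  # "contrast"
```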

## eigenvectors as a table
evecs <- pcad1$rotation
evecs
##                 PC1        PC2          PC3          PC4          PC5
## precare  0.04017186 0.46984758 -0.263510407 -0.398669452  0.207949105
## previs   0.04008986 0.45805439 -0.229060230 -0.453742274 -0.019152468
## cig_0   -0.43497405 0.02129640  0.001194833 -0.002599076  0.021492996
## cig_1   -0.51792145 0.05240035 -0.016250137  0.013944550 -0.005583816
## cig_2   -0.51622817 0.04857069 -0.025216071  0.011576432 -0.030787894
## cig_3   -0.51557294 0.05485291 -0.022839798  0.012369114 -0.031411504
## pwgt_r   0.04620393 0.36166744  0.182714689  0.490209548 -0.354123153
## dwgt_r   0.06243013 0.52892980  0.013446902  0.191663968 -0.467780264
## wtgain   0.01656664 0.36570621  0.131829864  0.447864329  0.776797707
## apgar5  -0.01922180 0.05859345  0.653113407 -0.266326539 -0.039924275
## apgar10 -0.01872797 0.11453316  0.631798197 -0.293495558  0.068792732
##                  PC6         PC7          PC8         PC9          PC10
## precare -0.181039887 -0.04864149  0.518623953 -0.44928330  0.0028384862
## previs  -0.173813575  0.02435449 -0.551671572  0.44115507 -0.0034799688
## cig_0    0.031826216 -0.89431296  0.005189552  0.08522006 -0.0246496338
## cig_1   -0.004130208  0.21428815 -0.015613172 -0.03705820  0.7283836757
## cig_2   -0.012479726  0.25165532  0.021147950 -0.01315110 -0.6831878890
## cig_3   -0.013108447  0.27988335 -0.008358710 -0.02495598 -0.0262250513
## pwgt_r  -0.671652194 -0.04076774  0.101681786  0.06873689  0.0103023968
## dwgt_r   0.671724288 -0.01484495  0.008262954 -0.09531716  0.0009675305
## wtgain   0.127361857  0.02502762 -0.163180173  0.02627153 -0.0205180240
## apgar5  -0.083841418 -0.04636877 -0.423010981 -0.55534176 -0.0242064991
## apgar10  0.100375515  0.07072396  0.458177198  0.52075945  0.0167924081
##                 PC11
## precare -0.005982211
## previs   0.009262080
## cig_0   -0.034181036
## cig_1    0.387847321
## cig_2    0.445622692
## cig_3   -0.805994684
## pwgt_r   0.004453441
## dwgt_r   0.003607341
## wtgain  -0.004678941
## apgar5   0.002876029
## apgar10 -0.003809849
## eigenvectors as a plot
plot(evecs[,1],evecs[,2], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[1], ylab=colnames(evecs)[2], main="Loadings Plot" )
text(evecs[,1],evecs[,2], labels = rownames(evecs))

plot(evecs[,1],evecs[,3], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[1], ylab=colnames(evecs)[3], main="Loadings Plot" )
text(evecs[,1],evecs[,3], labels = rownames(evecs))

plot(evecs[,2],evecs[,3], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[2], ylab=colnames(evecs)[3], main="Loadings Plot" )
text(evecs[,2],evecs[,3], labels = rownames(evecs))

plot(evecs[,1],evecs[,4], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[1], ylab=colnames(evecs)[4], main="Loadings Plot" )
text(evecs[,1],evecs[,4], labels = rownames(evecs))

plot(evecs[,2],evecs[,4], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[2], ylab=colnames(evecs)[4], main="Loadings Plot" )
text(evecs[,2],evecs[,4], labels = rownames(evecs))

plot(evecs[,3],evecs[,4], type="n", xlim=c(-1,1), ylim=c(-1,1), xlab=colnames(evecs)[3], ylab=colnames(evecs)[4], main="Loadings Plot" )
text(evecs[,3],evecs[,4], labels = rownames(evecs))

  • We can also show the rotated data and look for interesting points in that plot.
newx <- pcad1$x

plot(newx[,1], newx[,2], xlim=c(min(newx[,1]),max(newx[,1])), ylim=c(min(newx[,2]),max(newx[,2])), type="n", xlab=colnames(newx)[1], ylab=colnames(newx)[2], main="Rotated Data (Scores) Plot" )
text(newx[,1], newx[,2])

plot(newx[,1], newx[,3],xlim=c(min(newx[,1]),max(newx[,1])), ylim=c(min(newx[,3]),max(newx[,3])), type="n", xlab=colnames(newx)[1], ylab=colnames(newx)[3], main="Rotated Data (Scores) Plot" )
text(newx[,1], newx[,3])

plot(newx[,2], newx[,3], type="n", xlim=c(min(newx[,2]),max(newx[,2])), ylim=c(min(newx[,3]),max(newx[,3])), xlab=colnames(newx)[2], ylab=colnames(newx)[3], main="Rotated Data (Scores) Plot" )
text(newx[,2], newx[,3])

plot(newx[,1], newx[,4], type="n", xlim=c(min(newx[,1]),max(newx[,1])), ylim=c(min(newx[,4]),max(newx[,4])), xlab=colnames(newx)[1], ylab=colnames(newx)[4], main="Rotated Data (Scores) Plot" )
text(newx[,1], newx[,4])

plot(newx[,2], newx[,4], type="n", xlim=c(min(newx[,2]),max(newx[,2])), ylim=c(min(newx[,4]),max(newx[,4])), xlab=colnames(newx)[2], ylab=colnames(newx)[4], main="Rotated Data (Scores) Plot" )
text(newx[,2], newx[,4])

plot(newx[,3], newx[,4], type="n", xlim=c(min(newx[,3]),max(newx[,3])), ylim=c(min(newx[,4]),max(newx[,4])), xlab=colnames(newx)[3], ylab=colnames(newx)[4], main="Rotated Data (Scores) Plot" )
text(newx[,3], newx[,4])

Downsides to PCA

  • It can be unclear what the principal components represent

  • By selecting a subset of PCs, we discard some percentage of the information that was present in the original variables

  • If there is little or no correlation in the original dataset, then PCA won’t help draw out any features at all