Clustering
Hierarchical Clustering
k-Means Clustering
- Idea: observations within the same cluster are more similar to each other than to observations in different clusters
- Goal: to identify clusters of observations based on a measure of distance
- Typically there is no known "true" clustering of the data; clustering is unsupervised
- Typically clustering requires quantitative variables
- Ideally, the final clustering produces well-separated groups
- Distances between observations are calculated and stored in a distance matrix; Euclidean distance is the most common choice (see the formula below)
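For observations \(x = (x_1, \ldots, x_p)\) and \(y = (y_1, \ldots, y_p)\) measured on \(p\) quantitative variables, the Euclidean distance is \(d(x, y) = \sqrt{\sum_{j=1}^{p}(x_j - y_j)^2}\).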
- Linkages, or distances between clusters, determine how observations are merged into clusters (the sketch after this list illustrates all three):
- single linkage: distance between the closest points in two different clusters
- complete linkage: distance between the furthest-apart points in two different clusters
- average linkage: average of all pairwise distances between points in one cluster and points in the other cluster
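To make the three linkages concrete, here is a minimal sketch on two toy clusters; the points in A and B are illustrative and not part of the CCSO data.

# two toy clusters of two points each
A <- matrix(c(0, 0,
              1, 0), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 3,
              5, 5), ncol = 2, byrow = TRUE)
# all pairwise Euclidean distances between a point in A and a point in B
pairwise <- outer(seq_len(nrow(A)), seq_len(nrow(B)),
                  Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))
min(pairwise)  # single linkage: closest pair
max(pairwise)  # complete linkage: furthest-apart pair
mean(pairwise) # average linkage: mean of all pairwise distances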
- There are metrics for selecting a "good" number of clusters, such as the cubic clustering criterion (CCC), pseudo \(F\), and pseudo \(t^2\).
- Higher CCC and pseudo \(F\) values indicate better clustering.
- Lower pseudo \(t^2\) values indicate better clustering.
- We could also look at the dendrogram and decide on an optimal distance: cut the tree at that height and retain the corresponding number of clusters (see the sketch after this list).
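These indices are not computed later in these notes, but the NbClust package provides them (including "ccc" and "pseudot2"); the sketch below uses toy data and illustrative values throughout, computing pseudo \(F\) (the Calinski-Harabasz index, index = "ch") and cutting a dendrogram at a chosen height with cutree().

set.seed(1)
X <- scale(matrix(rnorm(100), ncol = 2)) # toy standardized data, 50 observations
library(NbClust) # assumed installed
nb <- NbClust(data = X, distance = "euclidean",
              min.nc = 2, max.nc = 5,
              method = "average", index = "ch") # "ch" = pseudo F
nb$Best.nc # number of clusters with the best index value
fit <- hclust(dist(X), method = "average")
cutree(fit, h = 2) # cut at a chosen height instead of a chosen number of clusters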
A typical cluster analysis proceeds as follows:
- Remove highly correlated variables and observations that are extreme outliers.
- Check whether re-scaling or standardizing the data is necessary.
- Perform the clustering and determine the number of clusters to use in the final cluster analysis.
- Discuss the resulting clusters and their attributes.
In these examples, we'll work with the CCSO Bookings Data. To begin, let's import it, lowercase the column names, remove the last column, and take a random sample of 200 observations. We work with this smaller subset for speed of computation and ease of plotting.
rm(list=ls())
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ccso <- read_csv("https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv",
col_types = cols(`BOOKING DATE` = col_date(format = "%m/%d/%Y"),
`BOOKING TIME` = col_time(format = "%H:%M:%S"),
`Booking Date/Time` = col_datetime(format = "%m/%d/%y %H:%M:%S"),
`RELEASED DATE` = col_date(format = "%m/%d/%Y"),
`RELEASED TIME` = col_time(format = "%H:%M:%S"),
`Release Date/Time` = col_datetime(format = "%m/%d/%y %H:%M:%S")))
## Warning: Missing column names filled in: 'X35' [35]
## Warning: 315 parsing failures.
## row col expected actual file
## 4756 ZIP CODE a double P61880 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 4757 ZIP CODE a double P61880 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 6207 ZIP CODE no trailing characters -2857 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 8489 ZIP CODE no trailing characters -4409 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 8490 ZIP CODE no trailing characters -4409 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## .... ........ ...................... ...... .........................................................................
## See problems(...) for more details.
colnames(ccso) <- tolower(colnames(ccso))
ccso$daysin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="days")
ccso$minsin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="mins")
ccso$secsin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="secs")
sort(table(ccso$`jacket number`),decreasing = TRUE)[1:10] #recycled jacket numbers
##
## 57267 500001 1017622 44836 33235 1007110 1006848 987949 62750
## 305 106 84 56 45 44 42 40 38
## 519849
## 38
sort(table(ccso$`booking number`),decreasing = TRUE)[1:10] #unique identifier
##
## 201100007209 201200005490 201400002843 201200004589 201300001304
## 12 11 11 10 10
## 201300004226 201400003473 201500005048 201500005830 201600000784
## 10 10 10 10 10
ccsosplit <- split(ccso, ccso$`booking number`)
bn <- sapply(ccsosplit, nrow) # number of rows (charges) per booking
# keep one row per booking number; sorting ensures the order matches names(bn)
ccso1 <- distinct(arrange(ccso, `booking number`), `booking number`, .keep_all = TRUE)
dim(ccso1)
## [1] 40470 38
ccso2 <- mutate(ccso1, crimecount = bn)
all.equal(as.character(ccso2$`booking number`),names(bn))
## [1] TRUE
set.seed(448) # for reproducibility of the random sample
ids <- sample(nrow(ccso2), 200) # 200 random row indices
ccso3 <- select(mutate(ccso2, mins = as.vector(minsin)),
                `age at arrest`, `days in jail`, mins, crimecount)
# indices of rows with a missing value in any of the four variables
miss <- unique(c(which(is.na(ccso3$mins)), which(is.na(ccso3$`age at arrest`)),
                 which(is.na(ccso3$crimecount)), which(is.na(ccso3$`days in jail`))))
ccso3 <- ccso3[-miss, ] # drop rows with missing values
ccso3 <- ccso3[ids, ] # keep the 200 sampled rows (note: ids were drawn before the rows above were removed)
rm(ccso, ccso1)
ccso4 <- select(ccso3, -`days in jail`) # drop days in jail from the clustering variables
ccso4 <- scale(ccso4) # standardize each variable to mean 0, sd 1
sum(is.na(ccso4)) # check for NA or missing values
## [1] 0
# could use na.omit(ccso3) to work with a complete dataset (no missing values)
We begin by creating a distance matrix.
# Code Borrowed from DataCamp's Quick-R
d <- dist(ccso4, method = "euclidean") # distance matrix
length(d) # n(n-1)/2 = 200*199/2 pairwise distances
## [1] 19900
Clustering with single linkage.
fits <- hclust(d, method="single") # "method" specifies the linkage
plot(fits) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fits) # display the dendrogram
cls <- cutree(fits, k=3) # cut tree into k=3 clusters
rect.hclust(fits, k=3, border="red") # draw red borders around the 3 clusters
Clustering with complete linkage.
fitc <- hclust(d, method="complete") # "method" specifies the linkage
plot(fitc) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fitc) # display the dendrogram
clc <- cutree(fitc, k=3) # cut tree into k=3 clusters
rect.hclust(fitc, k=3, border="red") # draw red borders around the 3 clusters
Clustering with average linkage.
fita <- hclust(d, method="average") # "method" specifies the linkage
plot(fita) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fita) # display the dendrogram
cla <- cutree(fita, k=3) # cut tree into k=3 clusters
rect.hclust(fita, k=3, border="red") # draw red borders around the 3 clusters
Some differences between hierarchical clustering and k-means clustering are:
- k-means requires knowing the number of clusters ahead of time
- k-means uses centroids of the points rather than linkages
- k-means iteratively reassigns observations to clusters, starting from a random assignment, until an optimum is reached, where the objective is minimizing variation within clusters
To do the k-means clustering, simply use the kmeans function.
clk <- kmeans(ccso4,centers=3)$cluster
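Because k-means starts from a random initialization, results can vary from run to run; setting a seed and using the nstart argument stabilizes them. A common heuristic for choosing the number of clusters, not used in the analysis above, is an "elbow" plot of the total within-cluster sum of squares; a minimal sketch:

# total within-cluster sum of squares for k = 1, ..., 8
set.seed(448) # k-means initialization is random
wss <- sapply(1:8, function(k) kmeans(ccso4, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")
# look for the "elbow" where adding clusters stops reducing the SS much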
After choosing a number of clusters, we can look at their basic descriptive statistics.
table(cls) #hierarchical single linkage clusters
## cls
## 1 2 3
## 195 4 1
table(clc) #hierarchical complete linkage clusters
## clc
## 1 2 3
## 185 14 1
table(cla) #hierarchical average linkage clusters
## cla
## 1 2 3
## 195 4 1
table(clk) #k-means clusters
## clk
## 1 2 3
## 35 37 128
ccso5 <- mutate(select(ccso3, -`days in jail`),cls,clc,cla,clk)
summarise(group_by(ccso5, cls),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #single linkage cluster avg
## # A tibble: 3 x 6
## cls clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 195 30.9 27 18437.
## 2 2 4 113 112. 291.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, clc),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #complete linkage cluster avg
## # A tibble: 3 x 6
## clc clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 185 32.0 26 18250.
## 2 2 14 39.6 39.5 15718.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, cla),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #average linkage cluster avg
## # A tibble: 3 x 6
## cla clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 195 30.9 27 18437.
## 2 2 4 113 112. 291.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, clk),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #k-means cluster avg
## # A tibble: 3 x 6
## clk clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 35 57.9 54 35734.
## 2 2 37 29.3 26 45365.
## 3 3 128 26.4 24 10900.
## # ... with 1 more variable: medminsinjail <dbl>
We can also visualize them.
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
ggtitle("Raw Data - Before Clustering")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(cls)) ) +
ggtitle("Hierarchical Single Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~cls) +
ggtitle("Hierarchical Single Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(clc)) ) +
ggtitle("Hierarchical Complete Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~clc) +
ggtitle("Hierarchical Complete Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(cla)) ) +
ggtitle("Hierarchical Average Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~cla) +
ggtitle("Hierarchical Average Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(clk)) ) +
ggtitle("k-Means Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~clk) +
ggtitle("k-Means Clusters")
A frequency table is a good structure for assessing which observations were "correctly" or "incorrectly" clustered, if we knew the "correct" clustering.
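We have no ground-truth labels here, but the same cross-tabulation idea lets us compare two clustering solutions against each other, for example the single linkage and k-means assignments created above:

table(single = cls, kmeans = clk) # how the two clusterings overlap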
For data analysis, though, the right summaries depend on the questions you want to answer. In other words, we should consider how everything we do helps the audience understand the data better and what insights can be gleaned from our investigations.