Clustering
Hierarchical Clustering
k-Means Clustering
- Idea: observations within the same cluster are more similar to each other than to observations in different clusters
- Goal: to identify clusters of observations based on a measure of distance
- Typically there is no known "true" clustering of the data; clustering is unsupervised
- Typically clustering requires quantitative variables
- Ideally, the final clustering produces well-separated groups
- Distances between observations are calculated and stored in a distance matrix; Euclidean distance is the most common choice (see the formula below)
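For observations \(x = (x_1, \ldots, x_p)\) and \(y = (y_1, \ldots, y_p)\) measured on \(p\) quantitative variables, the Euclidean distance is \(d(x, y) = \sqrt{\sum_{j=1}^{p}(x_j - y_j)^2}\).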
- Linkages, or distances between clusters, determine how observations are merged into clusters (the sketch after this list illustrates all three):
- single linkage: distance between the closest points in two different clusters
- complete linkage: distance between the furthest-apart points in two different clusters
- average linkage: average of all pairwise distances between points in one cluster and points in the other cluster
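To make the three linkages concrete, here is a minimal sketch on two toy clusters; the points in A and B are illustrative and not part of the CCSO data.

# two toy clusters of two points each
A <- matrix(c(0, 0,
              1, 0), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 3,
              5, 5), ncol = 2, byrow = TRUE)
# all pairwise Euclidean distances between a point in A and a point in B
pairwise <- outer(seq_len(nrow(A)), seq_len(nrow(B)),
                  Vectorize(function(i, j) sqrt(sum((A[i, ] - B[j, ])^2))))
min(pairwise)  # single linkage: closest pair
max(pairwise)  # complete linkage: furthest-apart pair
mean(pairwise) # average linkage: mean of all pairwise distances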
- There are metrics for selecting a "good" number of clusters, such as the cubic clustering criterion (CCC), pseudo \(F\), and pseudo \(t^2\).
- Higher CCC and pseudo \(F\) values indicate better clustering.
- Lower pseudo \(t^2\) values indicate better clustering.
- We could also look at the dendrogram and decide on an optimal distance: cut the tree at that height and retain the corresponding number of clusters (see the sketch after this list).
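These indices are not computed later in these notes, but the NbClust package provides them (including "ccc" and "pseudot2"); the sketch below uses toy data and illustrative values throughout, computing pseudo \(F\) (the Calinski-Harabasz index, index = "ch") and cutting a dendrogram at a chosen height with cutree().

set.seed(1)
X <- scale(matrix(rnorm(100), ncol = 2)) # toy standardized data, 50 observations
library(NbClust) # assumed installed
nb <- NbClust(data = X, distance = "euclidean",
              min.nc = 2, max.nc = 5,
              method = "average", index = "ch") # "ch" = pseudo F
nb$Best.nc # number of clusters with the best index value
fit <- hclust(dist(X), method = "average")
cutree(fit, h = 2) # cut at a chosen height instead of a chosen number of clusters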
A typical cluster analysis proceeds as follows:
- Remove highly correlated variables and observations that are extreme outliers.
- Check whether re-scaling or standardizing the data is necessary.
- Perform the clustering and determine the number of clusters to use in the final cluster analysis.
- Discuss the resulting clusters and their attributes.
In these examples, we'll work with the CCSO Bookings Data. To begin, let's import it, lowercase the column names, remove the last column, and take a random sample of 200 observations. We work with this smaller subset for speed of computation and ease of plotting.
rm(list=ls())
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ccso <- read_csv("https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv",
col_types = cols(`BOOKING DATE` = col_date(format = "%m/%d/%Y"),
`BOOKING TIME` = col_time(format = "%H:%M:%S"),
`Booking Date/Time` = col_datetime(format = "%m/%d/%y %H:%M:%S"),
`RELEASED DATE` = col_date(format = "%m/%d/%Y"),
`RELEASED TIME` = col_time(format = "%H:%M:%S"),
`Release Date/Time` = col_datetime(format = "%m/%d/%y %H:%M:%S")))
## Warning: Missing column names filled in: 'X35' [35]
## Warning: 315 parsing failures.
## row col expected actual file
## 4756 ZIP CODE a double P61880 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 4757 ZIP CODE a double P61880 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 6207 ZIP CODE no trailing characters -2857 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 8489 ZIP CODE no trailing characters -4409 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## 8490 ZIP CODE no trailing characters -4409 'https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv'
## .... ........ ...................... ...... .........................................................................
## See problems(...) for more details.
colnames(ccso) <- tolower(colnames(ccso))
ccso$daysin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="days")
ccso$minsin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="mins")
ccso$secsin <- difftime(ccso$`release date/time`,ccso$`booking date/time`, units="secs")
sort(table(ccso$`jacket number`),decreasing = TRUE)[1:10] #recycled jacket numbers
##
## 57267 500001 1017622 44836 33235 1007110 1006848 987949 62750
## 305 106 84 56 45 44 42 40 38
## 519849
## 38
sort(table(ccso$`booking number`),decreasing = TRUE)[1:10] #unique identifier
##
## 201100007209 201200005490 201400002843 201200004589 201300001304
## 12 11 11 10 10
## 201300004226 201400003473 201500005048 201500005830 201600000784
## 10 10 10 10 10
ccsosplit <- split(ccso, ccso$`booking number`)
bn <- sapply(ccsosplit, nrow) # number of rows (charges) per booking
# keep one row per booking number; sorting ensures the order matches names(bn)
ccso1 <- distinct(arrange(ccso, `booking number`), `booking number`, .keep_all = TRUE)
dim(ccso1)
## [1] 40470 38
ccso2 <- mutate(ccso1, crimecount = bn)
all.equal(as.character(ccso2$`booking number`),names(bn))
## [1] TRUE
set.seed(448) # for reproducibility of the random sample
ids <- sample(nrow(ccso2), 200) # 200 random row indices
ccso3 <- select(mutate(ccso2, mins = as.vector(minsin)),
                `age at arrest`, `days in jail`, mins, crimecount)
# indices of rows with a missing value in any of the four variables
miss <- unique(c(which(is.na(ccso3$mins)), which(is.na(ccso3$`age at arrest`)),
                 which(is.na(ccso3$crimecount)), which(is.na(ccso3$`days in jail`))))
ccso3 <- ccso3[-miss, ] # drop rows with missing values
ccso3 <- ccso3[ids, ] # keep the 200 sampled rows (note: ids were drawn before the rows above were removed)
rm(ccso, ccso1)
ccso4 <- select(ccso3, -`days in jail`) # drop days in jail from the clustering variables
ccso4 <- scale(ccso4) # standardize each variable to mean 0, sd 1
sum(is.na(ccso4)) # check for NA or missing values
## [1] 0
# could use na.omit(ccso3) to work with a complete dataset (no missing values)
We begin by creating a distance matrix.
# Code Borrowed from DataCamp's Quick-R
d <- dist(ccso4, method = "euclidean") # distance matrix
length(d) # n(n-1)/2 = 200*199/2 pairwise distances
## [1] 19900
Clustering with single linkage.
fits <- hclust(d, method="single") # "method" specifies the linkage
plot(fits) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fits) # display the dendrogram
cls <- cutree(fits, k=3) # cut tree into k=3 clusters
rect.hclust(fits, k=3, border="red") # draw red borders around the 3 clusters
Clustering with complete linkage.
fitc <- hclust(d, method="complete") # "method" specifies the linkage
plot(fitc) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fitc) # display the dendrogram
clc <- cutree(fitc, k=3) # cut tree into k=3 clusters
rect.hclust(fitc, k=3, border="red") # draw red borders around the 3 clusters
Clustering with average linkage.
fita <- hclust(d, method="average") # "method" specifies the linkage
plot(fita) # display the dendrogram
We need to decide how many clusters to keep based on the dendrogram.
plot(fita) # display the dendrogram
cla <- cutree(fita, k=3) # cut tree into k=3 clusters
rect.hclust(fita, k=3, border="red") # draw red borders around the 3 clusters
Some differences between hierarchical clustering and k-means clustering are:
- k-means requires knowing the number of clusters ahead of time
- k-means uses centroids of the points rather than linkages
- k-means iteratively reassigns observations to clusters, starting from a random assignment, until an optimum is reached, where the objective is minimizing variation within clusters
To do the k-means clustering, simply use the kmeans function.
clk <- kmeans(ccso4,centers=3)$cluster
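Because k-means starts from a random initialization, results can vary from run to run; setting a seed and using the nstart argument stabilizes them. A common heuristic for choosing the number of clusters, not used in the analysis above, is an "elbow" plot of the total within-cluster sum of squares; a minimal sketch:

# total within-cluster sum of squares for k = 1, ..., 8
set.seed(448) # k-means initialization is random
wss <- sapply(1:8, function(k) kmeans(ccso4, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")
# look for the "elbow" where adding clusters stops reducing the SS much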
After choosing a number of clusters, we can look at their basic descriptive statistics.
table(cls) #hierarchical single linkage clusters
## cls
## 1 2 3
## 195 4 1
table(clc) #hierarchical complete linkage clusters
## clc
## 1 2 3
## 185 14 1
table(cla) #hierarchical average linkage clusters
## cla
## 1 2 3
## 195 4 1
table(clk) #k-means clusters
## clk
## 1 2 3
## 35 37 128
ccso5 <- mutate(select(ccso3, -`days in jail`),cls,clc,cla,clk)
summarise(group_by(ccso5, cls),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #single linkage cluster avg
## # A tibble: 3 x 6
## cls clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 195 30.9 27 18437.
## 2 2 4 113 112. 291.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, clc),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #complete linkage cluster avg
## # A tibble: 3 x 6
## clc clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 185 32.0 26 18250.
## 2 2 14 39.6 39.5 15718.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, cla),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #average linkage cluster avg
## # A tibble: 3 x 6
## cla clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 195 30.9 27 18437.
## 2 2 4 113 112. 291.
## 3 3 1 25 25 728078.
## # ... with 1 more variable: medminsinjail <dbl>
summarise(group_by(ccso5, clk),
          clustersize = length(`age at arrest`),
          avgageatarrest = mean(`age at arrest`),
          medageatarrest = median(`age at arrest`),
          avgminsinjail = mean(mins),
          medminsinjail = median(mins)) #k-means cluster avg
## # A tibble: 3 x 6
## clk clustersize avgageatarrest medageatarrest avgminsinjail
## <int> <int> <dbl> <dbl> <dbl>
## 1 1 35 57.9 54 35734.
## 2 2 37 29.3 26 45365.
## 3 3 128 26.4 24 10900.
## # ... with 1 more variable: medminsinjail <dbl>
We can also visualize them.
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
ggtitle("Raw Data - Before Clustering")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(cls)) ) +
ggtitle("Hierarchical Single Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~cls) +
ggtitle("Hierarchical Single Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(clc)) ) +
ggtitle("Hierarchical Complete Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~clc) +
ggtitle("Hierarchical Complete Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(cla)) ) +
ggtitle("Hierarchical Average Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~cla) +
ggtitle("Hierarchical Average Linkage Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`, color=factor(clk)) ) +
ggtitle("k-Means Clusters")
ggplot(data=ccso5) +
geom_point(mapping=aes(x=mins, y=`age at arrest`) ) +
facet_wrap(~clk) +
ggtitle("k-Means Clusters")
A frequency table is a good structure for assessing which observations were "correctly" or "incorrectly" clustered, if we knew the "correct" clustering.
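We have no ground-truth labels here, but the same cross-tabulation idea lets us compare two clustering solutions against each other, for example the single linkage and k-means assignments created above:

table(single = cls, kmeans = clk) # how the two clusterings overlap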
For data analysis, though, the right summaries depend on the questions you want to answer. In other words, we should consider how everything we do helps the audience understand the data better and what insights can be gleaned from our investigations.