Review

Things to Cover This Week


R for Data Science - Ch. 3 Data Visualization

The tidyverse's ggplot2 package has changed the way R users create visualizations. It is closely tied to tidy data and offers a unified syntax for creating almost any plot you want. It does this by layering the various elements of a plot.

An empty ggplot (canvas)

We can create an empty plot that contains only the coordinate system. Adding layers is how we see the type of plot and the data. Where the textbook uses its own example data, we'll use the Chicago food inspections data (CFI) and, for most plots, a random subset of 200 rows called CFI3.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
cfi <- read_csv("https://uofi.box.com/shared/static/5637axblfhajotail80yw7j2s4r27hxd.csv", 
    col_types = cols(`Inspection Date` = col_date(format = "%m/%d/%Y")))
colnames(cfi) <- tolower(colnames(cfi))
CFI <- cfi[,-c(17:22)]
colnames(CFI)
##  [1] "inspection id"   "dba name"        "aka name"       
##  [4] "license #"       "facility type"   "risk"           
##  [7] "address"         "city"            "state"          
## [10] "zip"             "inspection date" "inspection type"
## [13] "results"         "violations"      "latitude"       
## [16] "longitude"
CFI$years <- format(CFI$`inspection date`, "%Y") # adding a years variable
# relabeling the risks as High, Medium, and Low
CFI$risknew <- ifelse(CFI$risk == "Risk 1 (High)", "High", CFI$risk)
CFI$risknew <- ifelse(CFI$risk == "Risk 2 (Medium)", "Medium", CFI$risknew)
CFI$risknew <- ifelse(CFI$risk == "Risk 3 (Low)", "Low", CFI$risknew)
CFI2 <- filter(CFI, results=="Fail" | results=="Pass")
CFI2 <- filter(CFI2, risknew == "High" | risknew=="Medium" | risknew=="Low")
CFI3 <- CFI2[sample(1:nrow(CFI2),200),]
ggplot(data = CFI3)
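Because CFI3 is drawn with sample(), each run produces a different subset, so plots and tables may not match the ones shown in these notes. Below is a minimal sketch of one way to make the draw repeatable. The seed value 2019 is an arbitrary assumption (it is not the seed behind these notes), and the mpg data that ships with ggplot2 stands in for CFI2 so the sketch runs without the download.

```r
library(ggplot2)  # also provides the built-in mpg data frame

set.seed(2019)                            # arbitrary seed (an assumption)
idx_a <- sample(seq_len(nrow(mpg)), 50)   # 50 random row numbers
sub_a <- mpg[idx_a, ]

set.seed(2019)                            # same seed, so the same rows
idx_b <- sample(seq_len(nrow(mpg)), 50)
sub_b <- mpg[idx_b, ]

identical(idx_a, idx_b)                   # the two draws agree

p_empty <- ggplot(data = sub_a)           # still just an empty canvas
```

With the course data, the same idea would be a set.seed() call placed right before the line that creates CFI3.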

Aesthetic mappings

Next, we add a layer that determines the type of plot, called a geom function. Every geom function takes a mapping argument, which pairs variables with visual properties, or aesthetics, through aes(). Aesthetics include the shape (shape=), color (color=), and size (size=) of the plotting characters. Usually, we map an aesthetic to a variable already present in the data. Notice that the legend is created automatically.

# the points roughly trace the layout of Chicago
ggplot(data = CFI2) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results))) +
  geom_vline(xintercept=-87.6298) + #adding reference lines for Chicago center
  geom_hline(yintercept=41.8781) #adding reference lines for Chicago center
## Warning: Removed 535 rows containing missing values (geom_point).

# smaller random subset
ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results))) 

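Other aesthetics include transparency (alpha=). What about x and y? They are aesthetics too: they map variables to positions on the axes. Notice the + signs? Each one adds a layer to the plot, and it must come at the end of a line, not the start.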
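To see several aesthetics at once, here is a small sketch that maps shape and transparency to variables. It uses the textbook's mpg data, which ships with ggplot2, as a stand-in so that it runs on its own. The choice of columns (drv for shape, cty for alpha) is only for illustration.

```r
library(ggplot2)

# mpg: displ = engine size, hwy = highway mileage,
# drv = drive type (3 categories), cty = city mileage
p_aes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv, alpha = cty))

p_aes  # draws the plot; a legend is added for shape and for alpha
```

With CFI3, the analogous mapping would be something like shape = factor(results) in place of color.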

We can also set an aesthetic to a fixed value by passing it to the geom function outside of the mapping argument, so that it applies to every point, such as

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue")
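A common slip is to put the fixed value inside aes(). The sketch below contrasts the two cases, again using the built-in mpg data as a stand-in. layer_data() is used only to inspect the colors ggplot2 actually assigned.

```r
library(ggplot2)

# Setting: color is a fixed value, given outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping by mistake: inside aes(), "blue" is treated as a variable with
# one level, so ggplot2 picks a color from its default scale and adds a legend
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(p_set)$colour)  # "blue"
unique(layer_data(p_map)$colour)  # a default scale color, not "blue"
```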

Facets

Facets are helpful for displaying the categories, levels, or discrete values of a variable. Faceting splits the data into subsets and draws each one in its own panel, with a header to remind readers of what's being shown.

With one categorical variable, use facet_wrap

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
  facet_wrap( ~ factor(results))

With two categorical variables, use facet_grid

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
  facet_grid( factor(risknew) ~ factor(results))
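The layout of the panels can be adjusted as well. This sketch uses the built-in mpg data as a stand-in and shows the nrow argument of facet_wrap and the "." placeholder of facet_grid. The variables class, drv, and cyl belong to mpg, not to the inspection data.

```r
library(ggplot2)

p_base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One variable, with the panels wrapped into two rows
p_wrap <- p_base + facet_wrap(~ class, nrow = 2)

# Two variables: rows by drv, columns by cyl
p_grid <- p_base + facet_grid(drv ~ cyl)

# facet_grid with a single variable: "." leaves the rows unfaceted
p_cols <- p_base + facet_grid(. ~ cyl)
```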

Geometric objects

There are several kinds of geometric objects we can show.

  • geom_text adding text to plots
ggplot(CFI3, aes(x=longitude, y=latitude, label = `dba name`))+ 
  geom_text(check_overlap = TRUE) #we prevent overlapping labels

  • geom_line line plots
tt <- table(CFI3$years, CFI3$results) #counting the cross-classifications
ttdf <- data.frame(tt)
colnames(ttdf) <- c("Years", "Rez", "Freq")
ttdf
##    Years  Rez Freq
## 1   2010 Fail    6
## 2   2011 Fail    9
## 3   2012 Fail    6
## 4   2013 Fail    3
## 5   2014 Fail    3
## 6   2015 Fail    4
## 7   2016 Fail    6
## 8   2017 Fail    1
## 9   2018 Fail    2
## 10  2019 Fail    2
## 11  2010 Pass   17
## 12  2011 Pass   18
## 13  2012 Pass   21
## 14  2013 Pass   18
## 15  2014 Pass   18
## 16  2015 Pass   18
## 17  2016 Pass   15
## 18  2017 Pass   22
## 19  2018 Pass   10
## 20  2019 Pass    1
ggplot(data = ttdf) + 
  geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))

  • geom_bar bar charts

A bar plot with one categorical variable

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew)) )

A bar plot with two categorical variables and different colors

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill=factor(results)))

A bar plot with two categorical variables as a facet wrap

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew))) +
  facet_wrap( ~ factor(results))

  • geom_boxplot box plots

  • geom_density kernel density estimates

  • geom_histogram histograms

  • geom_violin violin plots (similar to bean plots)
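The last four geoms in the list all describe the distribution of a numeric variable. Here is a brief sketch of each, using the built-in mpg data as a stand-in so it runs on its own. The binwidth and alpha values are arbitrary choices for illustration.

```r
library(ggplot2)

# Box plot: distribution of a numeric variable within each category
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))

# Violin plot: the same comparison, showing the shape of each distribution
p_violin <- ggplot(data = mpg) +
  geom_violin(mapping = aes(x = drv, y = hwy))

# Histogram: counts of one numeric variable in bins of width 2
p_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Kernel density estimate: a smoothed alternative to the histogram
p_dens <- ggplot(data = mpg) +
  geom_density(mapping = aes(x = hwy, fill = drv), alpha = 0.4)
```

With CFI3, a comparable box plot might put factor(risknew) on x and latitude on y.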

Layering multiple geometric objects

We can have multiple geom functions for the same data.

ggplot(data = ttdf) + 
  geom_point(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez)) +
  geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))
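When every layer uses the same mapping, it can be written once inside ggplot() and the geoms inherit it. The sketch below rebuilds ttdf from the counts printed earlier in these notes so that it runs on its own; the name p_both is just for illustration.

```r
library(ggplot2)

# ttdf rebuilt from the table of counts printed above
ttdf <- data.frame(
  Years = factor(rep(2010:2019, times = 2)),
  Rez   = factor(rep(c("Fail", "Pass"), each = 10)),
  Freq  = c(6, 9, 6, 3, 3, 4, 6, 1, 2, 2,
            17, 18, 21, 18, 18, 18, 15, 22, 10, 1)
)

# The mapping is given once; geom_point() and geom_line() both inherit it
p_both <- ggplot(data = ttdf,
                 mapping = aes(x = Years, y = Freq, group = Rez, color = Rez)) +
  geom_point() +
  geom_line()
```

A mapping placed inside an individual geom still applies to that layer only, which is useful when one layer needs an extra aesthetic.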

Or we can use the same geom function with more than one dataset, by giving a layer its own data argument.

sum(CFI2$`aka name`=="McDonald", na.rm=TRUE)
## [1] 0
table(CFI3$`aka name`)[table(CFI3$`aka name`)>1]
## 
##                       7-ELEVEN         Chipotle Mexican Grill 
##                              2                              2 
## FAT EDDIE'S EMPLOYEE CAFETERIA                            KFC 
##                              2                              3 
##       LaSalle Language Academy        POTBELLY SANDWICH WORKS 
##                              2                              2 
##                     SAVE-A-LOT                         SUBWAY 
##                              2                              3 
##               WALGREENS #15065 
##                              2
dim(CFI3)
## [1] 200  18
cfi2 <- filter(CFI2, results == "Fail" | results== "Pass")
ggplot(data = cfi2, mapping = aes(x = longitude, y = latitude)) + 
  geom_point(mapping = aes(color = factor(results))) + 
  geom_point(data = filter(cfi2, `aka name` == "McDonalds" | `aka name` == "McDonald's"),
             color = "yellow", size = 5)
## Warning: Removed 535 rows containing missing values (geom_point).

Position adjustments

We can change how the geometric objects are arranged in a plot through a position adjustment, which is passed to the geom function as the position argument (outside of aes()). Different geometric objects benefit from different position adjustments.

  • position_fill stacks the bars and stretches each one to the full height of the plot, so the y-axis shows proportions rather than raw counts
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position= "fill")

  • position_dodge places the bars side by side (clustered or grouped)
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)), position = "dodge" )

  • position_stack stacks the bars on top of each other (the default for geom_bar)
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "stack")

  • position_jitter adds a small amount of random noise to each point so that repeated or overlapping points become visible
ggplot(data = CFI3) + 
  geom_point(mapping = aes(x = longitude, y = latitude),  position="jitter")

Another is

  • position_identity places objects exactly where the data put them, with no adjustment
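Each position string is shorthand for a function of the same name, and calling the function directly gives access to its arguments. Below is a minimal sketch with position_jitter, using the built-in mpg data as a stand-in. The width, height, and seed values are arbitrary assumptions chosen for illustration.

```r
library(ggplot2)

# String shortcut: default amount of jitter, different on every draw
p_default <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Function form: explicit jitter amounts and a seed so the plot is repeatable
p_tuned <- ggplot(data = mpg) +
  geom_point(
    mapping  = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.05, height = 0.5, seed = 1)
  )

# position = "identity" leaves every point exactly where the data put it
p_ident <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "identity")
```

The same pattern applies to the bar positions, for example position = position_dodge() in place of position = "dodge".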

Adding titles and labels

See https://ggplot2.tidyverse.org/reference/labs.html for more details on labels. See https://ggplot2.tidyverse.org/reference/guide_legend.html for more details on legends.

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "dodge") +
  labs(x= "Risks", title = "Food-Service Businesses in Chicago: a subsample") +
  guides(fill = guide_legend(title = "Full Pass or Fail"))