
Things to Cover This Week

R for Data Science - Ch. 3 Data Visualization

The tidyverse, and its ggplot2 package in particular, has changed the way R users create visualizations. ggplot2 is closely tied to tidy data and offers a unified syntax for creating almost any plot you want. It does this by layering the various elements of a plot.

An empty ggplot (canvas)

We can create an empty plot that contains only the coordinate system. Adding layers is how we specify the type of plot and show the data. Similarly to the textbook, we'll work with an example dataset: the sba dataset, here a random subset of the data called CFI.

library(tidyverse) # provides read_csv(), filter(), and ggplot()

cfi <- read_csv("", 
    col_types = cols(`Inspection Date` = col_date(format = "%m/%d/%Y")))
colnames(cfi) <- tolower(colnames(cfi)) # lower-casing the column names
CFI <- cfi[, -c(17:22)] # dropping columns 17 through 22
CFI$years <- format(CFI$`inspection date`, "%Y") # adding years variable
# relabeling the risks
CFI$risknew <- ifelse(CFI$risk == "Risk 1 (High)", "High", CFI$risk)
CFI$risknew <- ifelse(CFI$risk == "Risk 2 (Medium)", "Medium", CFI$risknew)
CFI$risknew <- ifelse(CFI$risk == "Risk 3 (Low)", "Low", CFI$risknew)
CFI2 <- filter(CFI, results == "Fail" | results == "Pass") # keeping Pass/Fail results only
CFI2 <- filter(CFI2, risknew == "High" | risknew == "Medium" | risknew == "Low")
CFI3 <- CFI2[sample(1:nrow(CFI2), 200), ] # random subset of 200 rows
ggplot(data = CFI3) # empty canvas: no geom layer yet
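
The same idea can be tried without the inspection file. The sketch below assumes the mpg dataset bundled with ggplot2 as a stand-in (it is not the CFI data used in these notes) and checks that a bare ggplot() call holds no layers yet.

```r
library(ggplot2)

# Assumption: mpg ships with ggplot2 and stands in for CFI3 here.
canvas <- ggplot(data = mpg)

# No geom has been added yet, so the plot object holds zero layers.
length(canvas$layers)

# Printing draws only a blank panel; nothing appears until a geom is added.
print(canvas)
```

Adding any geom function with + turns this canvas into an actual plot.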

Aesthetic mappings

Next, we add a layer with a geom function, which determines the type of plot. Every geom function takes a mapping argument, built with aes(), that pairs visual properties (aesthetics) with variables. Aesthetics include the shape (shape=), color (color=), and size (size=) of the plotting characters. Usually, we map an aesthetic to a variable already present in the data. Notice that the legend is created automatically.

#basically chicago layout
ggplot(data = CFI2) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results))) +
  geom_vline(xintercept=-87.6298) + #adding reference lines for Chicago center
  geom_hline(yintercept=41.8781) #adding reference lines for Chicago center
# smaller random subset
ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results))) 

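Other aesthetics include transparency (alpha=). What about x and y? They are aesthetics too. Notice the + signs? Each one adds a layer to the plot, and it must come at the end of a line, not at the start of the next one.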

We could also set an aesthetic to a fixed value by passing it to the geom function outside of the mapping argument, such as

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue")
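
To make the difference between mapping and setting concrete, here is a minimal sketch. It assumes the mpg dataset bundled with ggplot2 as a stand-in for the CFI data; the variables displ, hwy, and drv belong to mpg, not to these notes.

```r
library(ggplot2)

# Assumption: mpg (bundled with ggplot2) stands in for the CFI data.

# Mapped: color varies with the drv variable, and a legend is created.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# Set: every point is blue, and no legend is needed.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Common mistake: inside aes(), "blue" is treated as a one-level variable,
# so the points get the default palette color and a legend labeled "blue".
mistake <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Printing the three objects one after another shows the difference: only the second plot draws blue points.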


Facets

Adding facets can be helpful for displaying the categories, levels, or discrete values of the data you're visualizing. Faceting splits the plot into panels, one per subset of the data, each with a header that reminds readers what is being shown.

With one categorical variable, use facet_wrap

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
  facet_wrap( ~ factor(results))

With two categorical variables, use facet_grid

ggplot(data = CFI3) +
  geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
  facet_grid( factor(risknew) ~ factor(results))
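
A few common variations on faceting are sketched below. The example assumes the bundled mpg dataset rather than the CFI data, and the choice of nrow = 2 is arbitrary.

```r
library(ggplot2)

# Assumption: mpg stands in for the CFI data; class, drv, and cyl are its columns.
base <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# One variable: wrap the panels into two rows.
wrapped <- base + facet_wrap(~ class, nrow = 2)

# Two variables: rows follow drv, columns follow cyl.
gridded <- base + facet_grid(drv ~ cyl)

# A dot on one side of the formula facets in a single direction only.
columns_only <- base + facet_grid(. ~ cyl)
```

facet_wrap is usually the better fit when a single variable has many levels, because the panels wrap onto several rows instead of forming one long strip.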

Geometric objects

There are several kinds of geometric objects we can show.

  • geom_text adding text to plots
ggplot(CFI3, aes(x=longitude, y=latitude, label = `dba name`))+ 
  geom_text(check_overlap = TRUE) #we prevent overlapping labels

  • geom_line line plots
tt <- table(CFI3$years, CFI3$results) #counting the cross-classifications
ttdf <- data.frame(tt)
colnames(ttdf) <- c("Years", "Rez", "Freq")
ggplot(data = ttdf) + 
  geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))

  • geom_bar bar charts

A bar plot with one categorical variable

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew)) )

A bar plot with two categorical variables and different colors

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill=factor(results)))

A bar plot with two categorical variables as a facet wrap

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew))) +
  facet_wrap( ~ factor(results))

  • geom_boxplot box plots

  • geom_density kernel density estimates

  • geom_histogram histograms

  • geom_violin violin plots (similar to bean plots)
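
The last four geoms in the list above have no example in these notes, so here is a minimal sketch of each. It assumes the bundled mpg dataset, since the notes do not show a numeric CFI variable to summarize; the binwidth of 2 is an arbitrary choice.

```r
library(ggplot2)

# Assumption: mpg stands in; hwy is numeric and class is categorical.

# Box plot: median, quartiles, and outliers of hwy within each class.
box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# Kernel density estimate of hwy.
dens <- ggplot(data = mpg) +
  geom_density(mapping = aes(x = hwy))

# Histogram of hwy; binwidth sets the width of each bar in data units.
histo <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Violin plot: a mirrored density of hwy within each class.
violin <- ggplot(data = mpg) +
  geom_violin(mapping = aes(x = class, y = hwy))
```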

Layering multiple geometric objects

We can have multiple geom functions for the same data.

ggplot(data = ttdf) + 
  geom_point(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez)) +
  geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))
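
When several layers share the same mappings, the mappings can be written once in ggplot() and inherited by every layer. The sketch below assumes the bundled mpg dataset in place of ttdf and uses geom_smooth, which is not otherwise covered in these notes, as the second layer.

```r
library(ggplot2)

# Assumption: mpg stands in for ttdf.
# Mappings given in ggplot() are global: every layer inherits x and y.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # A mapping given inside a geom applies to that layer only.
  geom_point(mapping = aes(color = drv)) +
  # This layer inherits x and y but not color, so it draws a single curve.
  geom_smooth(se = FALSE)
```

Writing the shared mappings once avoids the duplication and keeps the layers consistent if a variable name changes.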

Or we can use the same geom function with more than one dataset.

cfi2 <- filter(CFI2, results == "Fail" | results == "Pass")
ggplot(data = cfi2, mapping = aes(x = longitude, y = latitude)) + 
  geom_point(mapping = aes(color = factor(results))) + 
  geom_point(data = filter(cfi2, `aka name` == "McDonalds" | `aka name` == "McDonald's"),
             color = "yellow", size = 5)

Position adjustments

We can change how geometric objects are arranged in a plot through a position adjustment. It is given with the position argument of the geom function, outside of aes(). Different geometric objects benefit from different position adjustments.

  • position_fill stretches each stacked bar to the full height of the plot, so it shows proportions rather than raw counts
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position= "fill")

  • position_dodge places bars side by side, giving a clustered (grouped) bar chart
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)), position = "dodge" )

  • position_stack stacks bars on top of one another (the default for geom_bar)
ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "stack")

  • position_jitter adds a small amount of random noise to each point so that repeated or overlapping points become visible
ggplot(data = CFI3) + 
  geom_point(mapping = aes(x = longitude, y = latitude),  position="jitter")

Others include

  • position_identity places objects in their corresponding location
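
Each position string is shorthand for a function of the same name, and calling the function directly exposes its settings. The sketch below assumes the bundled mpg dataset; the jitter amounts, the seed, and the alpha value are arbitrary illustrative choices.

```r
library(ggplot2)

# Assumption: mpg stands in for the CFI data.

# position = "jitter" is shorthand for position_jitter() with its defaults.
# Calling the function controls the amount of noise; seed makes it repeatable.
jittered <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             position = position_jitter(width = 0.1, height = 0.1, seed = 1))

# position = "identity" draws each bar exactly where it falls, so the bars
# for different drv values overlap; alpha makes the overlap visible.
overlaid <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv),
           position = "identity", alpha = 0.4)
```

Identity positioning is rarely useful for bars, but it is the default for points, where overlap is usually not hidden by the objects themselves.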

Adding titles and labels

See for more details on labels. See for more details on legends.

ggplot(data = CFI3 ) + 
  geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "dodge") +
  labs(x= "Risks", title = "Food-Service Businesses in Chicago: a subsample") +
  guides(fill = guide_legend(title = "Full Pass or Fail"))
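
labs() accepts more than the axis labels and the title. The sketch below assumes the bundled mpg dataset, and all label text in it is illustrative. It also shows that naming an aesthetic inside labs() sets the legend title, as an alternative to guide_legend(title = ...).

```r
library(ggplot2)

# Assumption: mpg stands in for CFI3; the label text is illustrative only.
labelled <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge") +
  labs(
    x = "Vehicle class",
    y = "Number of models",
    fill = "Drive train",        # legend title for the fill aesthetic
    title = "Car models by class and drive train",
    subtitle = "Bars are dodged rather than stacked",
    caption = "Data: mpg, bundled with ggplot2"
  )
```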