plots in base R
design principles from Knaflic’s Storytelling with Data
R for Data Science Ch. 3 Data visualization
improving the plots with Knaflic’s design principles
The tidyverse, through its ggplot2 package, has changed the way R users create visualizations. ggplot2 is closely tied to tidy data and uses one consistent syntax to create almost any plot you want. It does this by layering the various elements of a plot.
We can create an empty plot that is just the coordinate system. Adding layers is how we specify the type of plot and show the data. We'll follow examples similar to the textbook's, but with the food inspection data read in below (CFI) and a random subset of 200 of its rows (CFI3).
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
cfi <- read_csv("https://uofi.box.com/shared/static/5637axblfhajotail80yw7j2s4r27hxd.csv",
col_types = cols(`Inspection Date` = col_date(format = "%m/%d/%Y")))
colnames(cfi) <- tolower(colnames(cfi))
CFI <- cfi[,-c(17:22)]
colnames(CFI)
## [1] "inspection id" "dba name" "aka name"
## [4] "license #" "facility type" "risk"
## [7] "address" "city" "state"
## [10] "zip" "inspection date" "inspection type"
## [13] "results" "violations" "latitude"
## [16] "longitude"
CFI$years <- format(CFI$`inspection date`, "%Y") #adding years variable
CFI$risknew<-ifelse(CFI$risk=="Risk 1 (High)", "High", CFI$risk) ##relabeling the risks
CFI$risknew<-ifelse(CFI$risk=="Risk 2 (Medium)", "Medium", CFI$risknew) ##
CFI$risknew<-ifelse(CFI$risk=="Risk 3 (Low)", "Low", CFI$risknew) ##
CFI2 <- filter(CFI, results=="Fail" | results=="Pass")
CFI2 <- filter(CFI2, risknew == "High" | risknew=="Medium" | risknew=="Low")
CFI3 <- CFI2[sample(1:nrow(CFI2),200),]
ggplot(data = CFI3)
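To make the idea of layering concrete, here is a minimal sketch. It assumes the mpg data that ships with ggplot2 (chosen only so the code runs without downloading the inspection file); the object names p_empty and p_points are ours, for illustration.

```r
library(ggplot2)

# Data only: this produces a blank panel because no geom layer exists yet.
p_empty <- ggplot(data = mpg)

# Adding a geom layer supplies the plot type and the variables to draw.
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy))

# A ggplot object keeps its layers in a list, so we can count them.
length(p_empty$layers)
length(p_points$layers)

# Printing the object renders the plot.
p_points
```

Storing a plot in an object and adding to it later is optional; chaining everything with + in one statement builds the same plot.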
Next, we add a layer that determines the type of plot, called a geom function. Every geom function has a mapping argument, which takes the visual properties, or aesthetics, wrapped in aes(). Aesthetics can be the shapes (shape=), colors (color=), and sizes (size=) of the plotting characters. Usually, we will map an aesthetic to a variable already present in the data. Notice the legend is created automatically.
#the plotted points roughly trace the layout of Chicago
ggplot(data = CFI2) +
geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results))) +
geom_vline(xintercept=-87.6298) + #adding reference lines for Chicago center
geom_hline(yintercept=41.8781) #adding reference lines for Chicago center
## Warning: Removed 535 rows containing missing values (geom_point).
# smaller random subset
ggplot(data = CFI3) +
geom_point(mapping = aes(x = longitude, y = latitude, color = factor(results)))
Other aesthetics include transparency, alpha=. What about x and y? Notice the + signs?
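As a hedged illustration of these questions, the sketch below again assumes the mpg data from ggplot2 rather than the inspection data. It treats x and y as aesthetics like any other (they map variables to position) and adds shape and alpha in the same aes() call. The name p_aes is ours.

```r
library(ggplot2)

# x and y are aesthetics too: they map variables to positions on the axes.
# shape is mapped to a categorical variable, alpha to a numeric one.
# Note that + ends a line; starting a new line with + is a syntax error.
p_aes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv, alpha = cty))

# The layer's mapping records every aesthetic that was supplied.
names(p_aes$layers[[1]]$mapping)

p_aes
```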
We could also set an aesthetic to a fixed value by passing it to the geom function outside of the mapping argument, such as
ggplot(data = CFI3) +
geom_point(mapping = aes(x = longitude, y = latitude), color = "blue")
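The difference between setting and mapping is easy to get wrong, so here is a small sketch of both, assuming the mpg data from ggplot2 (not the inspection data). The names p_set and p_map are hypothetical.

```r
library(ggplot2)

# Setting: color is outside aes(), so every point is drawn in blue.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping a constant: inside aes(), "blue" is treated as a data value
# (a category with one level), so ggplot2 chooses its own color and
# adds a legend for it.
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

p_set
p_map
```

If the points come out in an unexpected color with a legend labeled "blue", the value was most likely placed inside aes() by mistake.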
Adding facets can be helpful for displaying various categories, levels, or discrete values for the data you’re visualizing. Faceting splits the data into subsets, one per panel, with headers in the plot to remind readers what is being shown.
With one categorical variable, use facet_wrap
ggplot(data = CFI3) +
geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
facet_wrap( ~ factor(results))
With two categorical variables, use facet_grid
ggplot(data = CFI3) +
geom_point(mapping = aes(x = longitude, y = latitude), color = "blue") +
facet_grid( factor(risknew) ~ factor(results))
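The two faceting functions can be compared side by side in the following sketch, which assumes the mpg data from ggplot2 so that it runs on its own. The names p_wrap and p_grid are ours.

```r
library(ggplot2)

# facet_wrap: one variable, with the panels wrapped into 2 rows.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# facet_grid: the variable left of ~ defines rows, the one right of ~ columns.
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```

facet_grid draws a panel for every row and column combination, including combinations with no data, whereas facet_wrap only draws panels for values that occur.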
There are several kinds of geometric objects we can show.
geom_text
adding text to plots
ggplot(CFI3, aes(x=longitude, y=latitude, label = `dba name`))+
geom_text(check_overlap = TRUE) #we prevent overlapping labels
geom_line
line plots
tt <- table(CFI3$years, CFI3$results) #counting the cross-classifications
ttdf <- data.frame(tt)
colnames(ttdf) <- c("Years", "Rez", "Freq")
ttdf
## Years Rez Freq
## 1 2010 Fail 6
## 2 2011 Fail 9
## 3 2012 Fail 6
## 4 2013 Fail 3
## 5 2014 Fail 3
## 6 2015 Fail 4
## 7 2016 Fail 6
## 8 2017 Fail 1
## 9 2018 Fail 2
## 10 2019 Fail 2
## 11 2010 Pass 17
## 12 2011 Pass 18
## 13 2012 Pass 21
## 14 2013 Pass 18
## 15 2014 Pass 18
## 16 2015 Pass 18
## 17 2016 Pass 15
## 18 2017 Pass 22
## 19 2018 Pass 10
## 20 2019 Pass 1
ggplot(data = ttdf) +
geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))
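The table() and data.frame() steps above can also be written with dplyr's count(). The sketch below is one possible alternative, not the approach these notes use. It runs on a small made-up tibble (toy, a hypothetical stand-in for the years and results columns) so that it is self-contained.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the years and results columns.
toy <- tibble(
  years   = c("2010", "2010", "2010", "2011", "2011", "2011"),
  results = c("Pass", "Pass", "Fail", "Pass", "Fail", "Fail")
)

# count() returns one row per years/results combination, with the
# frequency stored in a column named n.
toy_counts <- count(toy, years, results)
toy_counts

# group= tells geom_line which points belong to the same line.
ggplot(data = toy_counts) +
  geom_line(mapping = aes(x = years, y = n, group = results, color = results))
```

One difference to keep in mind: table() keeps combinations that have a count of zero, while count() on character columns leaves them out.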
geom_bar
bar charts
A bar plot with one categorical variable
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew)) )
A bar plot with two categorical variables and different colors
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew), fill=factor(results)))
A bar plot with two categorical variables as a facet wrap
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew))) +
facet_wrap( ~ factor(results))
geom_boxplot
box plots
geom_density
kernel density estimates
geom_histogram
histograms
geom_violin
violin plots (similar to bean plots)
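The last four geoms in this list are not demonstrated in these notes, so here is a brief sketch of each. It assumes the mpg data from ggplot2, with hwy as the numeric variable and class as the grouping variable; all object names are ours.

```r
library(ggplot2)

# Box plot: distribution of a numeric variable within each category.
p_box <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))

# Kernel density estimate of a single numeric variable.
p_dens <- ggplot(data = mpg) +
  geom_density(mapping = aes(x = hwy))

# Histogram; binwidth is in the units of x (an arbitrary choice here).
p_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)

# Violin plot: a mirrored density drawn for each category.
p_violin <- ggplot(data = mpg) +
  geom_violin(mapping = aes(x = class, y = hwy))

p_box
p_dens
p_hist
p_violin
```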
We can have multiple geom functions for the same data.
ggplot(data = ttdf) +
geom_point(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez)) +
geom_line(mapping = aes(x = Years, y = Freq, group=Rez, color=Rez))
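Both layers above repeat the same aes() call. One way to avoid the repetition is to place the shared mapping in ggplot(), where every layer inherits it. The sketch below assumes the economics data that ships with ggplot2, chosen only so the code runs on its own; p_global is a hypothetical name.

```r
library(ggplot2)

# The mapping given to ggplot() is inherited by every layer below it.
p_global <- ggplot(data = economics, mapping = aes(x = date, y = unemploy)) +
  geom_line() +
  geom_point(size = 0.5)

# Two layers, one shared mapping stored at the plot level.
length(p_global$layers)
names(p_global$mapping)

p_global
```

A layer can still add or override aesthetics in its own aes() call; those changes apply to that layer only.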
Or we can use the same geom function for different subsets of the data.
sum(CFI2$`aka name`=="McDonald", na.rm=TRUE)
## [1] 0
table(CFI3$`aka name`)[table(CFI3$`aka name`)>1]
##
## 7-ELEVEN Chipotle Mexican Grill
## 2 2
## FAT EDDIE'S EMPLOYEE CAFETERIA KFC
## 2 3
## LaSalle Language Academy POTBELLY SANDWICH WORKS
## 2 2
## SAVE-A-LOT SUBWAY
## 2 3
## WALGREENS #15065
## 2
dim(CFI3)
## [1] 200 18
cfi2 <- filter(CFI2, results == "Fail" | results== "Pass")
ggplot(data = cfi2, mapping = aes(x = longitude, y = latitude)) +
geom_point(mapping = aes(color = factor(results))) +
geom_point(data = filter(cfi2, `aka name` == "McDonalds" | `aka name` == "McDonald's"), color="yellow", size=5)
## Warning: Removed 535 rows containing missing values (geom_point).
We can change how the geometric objects are arranged in a plot through a position adjustment, which is passed to the geom function as the position argument (outside of the mapping). Different geometric objects benefit from certain position changes.
position_fill
stretches each bar to the top of the plotting area, so it shows proportions rather than raw counts
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position= "fill")
position_dodge
places bars side by side (clustered or grouped bars)
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)), position = "dodge" )
position_stack
stacks bars on top of each other (the default for geom_bar)
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "stack")
position_jitter
adds a small random shift to each point so that repeated or duplicate points become visible
ggplot(data = CFI3) +
geom_point(mapping = aes(x = longitude, y = latitude), position="jitter")
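The string "jitter" uses default amounts of noise. When more control is needed, the position_jitter() function can be passed instead. This sketch assumes the mpg data from ggplot2, and the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# width and height bound the random shift in each direction (in data units);
# seed makes the jitter reproducible from one run to the next.
p_jit <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.5, seed = 42)
  )

p_jit
```

Because jittering moves points away from their true values, small amounts are usually preferable, especially when position carries meaning, as it does with longitude and latitude.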
Others are
position_identity
places objects exactly at their data values, with no adjustment
See https://ggplot2.tidyverse.org/reference/labs.html for more details on labels. See https://ggplot2.tidyverse.org/reference/guide_legend.html for more details on legends.
ggplot(data = CFI3 ) +
geom_bar(mapping = aes(x = factor(risknew), fill = factor(results)) , position = "dodge") +
labs(x= "Risks", title = "Food-Service Businesses in Chicago: a subsample") +
guides(fill = guide_legend(title = "Full Pass or Fail"))
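As a further sketch of labeling, labs() can also set the y-axis label, a caption, and the legend title in one call. The example assumes the mpg data from ggplot2, and the label text is made up for illustration.

```r
library(ggplot2)

p_lab <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv), position = "dodge") +
  labs(
    x = "Vehicle class",
    y = "Number of models",
    fill = "Drive train",          # legend title, named after the aesthetic
    title = "Car models by class",
    caption = "Data: ggplot2::mpg"
  )

p_lab
```

Naming the aesthetic inside labs(), here fill =, is an alternative to guides(fill = guide_legend(title = ...)) when only the legend title needs to change.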