Review


Data Background


Data Visualizations in Base R

The plot function in R is flexible: what it draws depends on the type of object it is given. We explain by example, but first let’s read in the CCSO Bookings Data.

jail <- read.csv("https://uofi.box.com/shared/static/9elozjsg99bgcb7gb546wlfr3r2gc9b7.csv", stringsAsFactors = FALSE)
apply(jail,2,mode)
##         BOOKING.DATE       BOOKING.NUMBER         BOOKING.TIME 
##          "character"          "character"          "character" 
##        CUSTODY.CLASS    EMPLOYMENT.STATUS INCARCERATION.REASON 
##          "character"          "character"          "character" 
##        JACKET.NUMBER          JACKET.TYPE        PRISONER.TYPE 
##          "character"          "character"          "character" 
##        RELEASED.DATE      RELEASED.REASON        RELEASED.TIME 
##          "character"          "character"          "character" 
##       CHARGE.STATUTE           CRIME.CODE         STATUTE.TYPE 
##          "character"          "character"          "character" 
##                 CITY                 RACE                  SEX 
##          "character"          "character"          "character" 
##                STATE             ZIP.CODE          CITIZENSHIP 
##          "character"          "character"          "character" 
##      MARITIAL.STATUS             MILITARY           OCCUPATION 
##          "character"          "character"          "character" 
##               SCHOOL        ARREST.AGENCY        Age.at.Arrest 
##          "character"          "character"          "character" 
##       Age.at.Release    Booking.Date.Time    Release.Date.Time 
##          "character"          "character"          "character" 
##         Days.in.Jail                Hours              Minutes 
##          "character"          "character"          "character" 
##              Seconds                    X 
##          "character"          "character"
apply(jail,2,class)
##         BOOKING.DATE       BOOKING.NUMBER         BOOKING.TIME 
##          "character"          "character"          "character" 
##        CUSTODY.CLASS    EMPLOYMENT.STATUS INCARCERATION.REASON 
##          "character"          "character"          "character" 
##        JACKET.NUMBER          JACKET.TYPE        PRISONER.TYPE 
##          "character"          "character"          "character" 
##        RELEASED.DATE      RELEASED.REASON        RELEASED.TIME 
##          "character"          "character"          "character" 
##       CHARGE.STATUTE           CRIME.CODE         STATUTE.TYPE 
##          "character"          "character"          "character" 
##                 CITY                 RACE                  SEX 
##          "character"          "character"          "character" 
##                STATE             ZIP.CODE          CITIZENSHIP 
##          "character"          "character"          "character" 
##      MARITIAL.STATUS             MILITARY           OCCUPATION 
##          "character"          "character"          "character" 
##               SCHOOL        ARREST.AGENCY        Age.at.Arrest 
##          "character"          "character"          "character" 
##       Age.at.Release    Booking.Date.Time    Release.Date.Time 
##          "character"          "character"          "character" 
##         Days.in.Jail                Hours              Minutes 
##          "character"          "character"          "character" 
##              Seconds                    X 
##          "character"          "character"
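
Every column is reported as "character" above because apply() first converts the data frame to a matrix, and a matrix holds a single type. A minimal sketch with a small made-up data frame (the toy object and its columns are illustrative, not part of the CCSO data) shows how sapply() reports the class of each column separately:

```r
# Toy data frame (hypothetical values) with mixed column types
toy <- data.frame(
  city = c("URBANA", "RANTOUL", "CHAMPAIGN"),
  age  = c(42, 19, 33),
  stringsAsFactors = FALSE
)

# apply() coerces the data frame to a character matrix first,
# so every column looks like "character"
via_apply <- apply(toy, 2, class)

# sapply() visits each column of the data frame as it is stored
via_sapply <- sapply(toy, class)

print(via_apply)
print(via_sapply)
```
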
jail$days.in.jail <- as.numeric(jail$Days.in.Jail)
## Warning: NAs introduced by coercion
jail$hours.in.jail <- as.numeric(jail$Hours)
## Warning: NAs introduced by coercion
jail$minutes.in.jail <- as.numeric(jail$Minutes)
## Warning: NAs introduced by coercion
jail$seconds.in.jail <- as.numeric(jail$Seconds)
## Warning: NAs introduced by coercion
colnames(jail)
##  [1] "BOOKING.DATE"         "BOOKING.NUMBER"       "BOOKING.TIME"        
##  [4] "CUSTODY.CLASS"        "EMPLOYMENT.STATUS"    "INCARCERATION.REASON"
##  [7] "JACKET.NUMBER"        "JACKET.TYPE"          "PRISONER.TYPE"       
## [10] "RELEASED.DATE"        "RELEASED.REASON"      "RELEASED.TIME"       
## [13] "CHARGE.STATUTE"       "CRIME.CODE"           "STATUTE.TYPE"        
## [16] "CITY"                 "RACE"                 "SEX"                 
## [19] "STATE"                "ZIP.CODE"             "CITIZENSHIP"         
## [22] "MARITIAL.STATUS"      "MILITARY"             "OCCUPATION"          
## [25] "SCHOOL"               "ARREST.AGENCY"        "Age.at.Arrest"       
## [28] "Age.at.Release"       "Booking.Date.Time"    "Release.Date.Time"   
## [31] "Days.in.Jail"         "Hours"                "Minutes"             
## [34] "Seconds"              "X"                    "days.in.jail"        
## [37] "hours.in.jail"        "minutes.in.jail"      "seconds.in.jail"
jL <- jail[,-c(31:35)]
jail <- jL
dim(jail)
## [1] 67764    34
head(jail, 2)
##   BOOKING.DATE BOOKING.NUMBER BOOKING.TIME              CUSTODY.CLASS
## 1     1/1/2011      2.011e+11      1:05:19 Sentenced IDOC (CCSO ONLY)
## 2     1/1/2011      2.011e+11      1:05:19 Sentenced IDOC (CCSO ONLY)
##      EMPLOYMENT.STATUS     INCARCERATION.REASON JACKET.NUMBER JACKET.TYPE
## 1 Employed - Full Time Arrest - Without Warrant         31830           A
## 2 Employed - Full Time Arrest - Without Warrant         31830           A
##             PRISONER.TYPE RELEASED.DATE
## 1 Misdemeanor Arraignment     2/28/2011
## 2 Misdemeanor Arraignment     2/28/2011
##                                      RELEASED.REASON RELEASED.TIME
## 1 Sentenced (transfer) to State Corrections        Y       1:17:41
## 2 Sentenced (transfer) to State Corrections        Y       1:17:41
##   CHARGE.STATUTE          CRIME.CODE STATUTE.TYPE    CITY  RACE  SEX
## 1     720-5/12-3             BATTERY           NA RANTOUL White Male
## 2    730-5/5-6-4 PROBATION VIOLATION           NA RANTOUL White Male
##      STATE ZIP.CODE CITIZENSHIP MARITIAL.STATUS MILITARY
## 1 ILLINOIS    61866          US          Single     None
## 2 ILLINOIS    61866          US          Single     None
##                                       OCCUPATION
## 1 SERVICE PERSONNEL(HOTEL,RESTAURANT,NIGHT CLUB)
## 2 SERVICE PERSONNEL(HOTEL,RESTAURANT,NIGHT CLUB)
##                       SCHOOL                      ARREST.AGENCY
## 1 Graduated from high school Champaign County Sherriff's Office
## 2 Graduated from high school Champaign County Sherriff's Office
##   Age.at.Arrest Age.at.Release Booking.Date.Time Release.Date.Time
## 1            42             42  1/01/11 01:05:19  2/28/11 01:17:41
## 2            42             42  1/01/11 01:05:19  2/28/11 01:17:41
##   days.in.jail hours.in.jail minutes.in.jail seconds.in.jail
## 1           58             0              12              22
## 2           58             0              12              22
colnames(jail) <- tolower(colnames(jail))
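
The "NAs introduced by coercion" warnings appear when as.numeric() meets text that does not parse as a number. A small sketch, using made-up values rather than the actual column contents, of how one might locate the offending entries before converting:

```r
# Hypothetical raw values; the blank and "N/A" entries are invented for illustration
raw <- c("58", "0", "", "N/A", "12")

# suppressWarnings() only hides the warning; the NAs are still produced
num <- suppressWarnings(as.numeric(raw))

# Entries that became NA during conversion although they were not NA before
bad <- raw[is.na(num) & !is.na(raw)]

print(num)
print(bad)
print(sum(is.na(num)))
```

Inspecting the entries in bad first makes it easier to decide whether they are genuinely missing or need cleaning.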

Typically we plot two numeric vectors

x<- jail$age.at.arrest
y<- jail$age.at.release
plot(x,y)

The two numeric vectors could be in a single matrix

mat<- matrix(c(jail$age.at.arrest,jail$age.at.release), ncol=2)
plot(mat)

We could create a time series plot (aka “index plot” aka “series plot”) using a single numeric vector

plot(jail$days.in.jail)
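
With a single numeric vector, plot() uses the position of each value (1, 2, 3, and so on) as the x coordinate, so the result reads as a time series only when the rows are stored in time order. A small sketch with made-up values:

```r
# Hypothetical values standing in for a numeric column
v <- c(3, 10, 1, 7, 4)

# The x coordinates that plot(v) uses implicitly
idx <- seq_along(v)

plot(v)        # index plot
plot(idx, v)   # same points with the index passed explicitly
```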

We could take advantage of factors in the data to draw a bar plot of the level counts or a box plot for each level of the factor

jr<-factor(jail$race)
levels(jr)
## [1] ""                       "Asian/Pacific Islander"
## [3] "Black"                  "Hispanic"              
## [5] "Native American"        "Unknown"               
## [7] "White"                  "White (Hispanic)"
plot(jr)

plot(jr, jail$days.in.jail)
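
When the first argument is a factor, plot() dispatches on it: a factor alone gives a bar plot of the level counts, and a factor followed by a numeric vector gives one box plot per level. The sketch below uses a made-up factor and numeric vector to show the base functions that produce equivalent pictures.

```r
# Hypothetical group labels and measurements
g <- factor(c("a", "b", "a", "c", "b", "a"))
y <- c(1, 5, 2, 9, 6, 3)

counts <- table(g)       # level frequencies: a = 3, b = 2, c = 1

plot(g)                  # bar plot of the counts
barplot(counts)          # the same bars built explicitly

plot(g, y)               # one box plot per level of g
bp <- boxplot(y ~ g)     # the same display via the formula interface

print(counts)
print(bp$names)
```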

Or we could use the table function to count the frequency of each level or category explicitly and then draw the counts with barplot

sort(table(jail$crime.code), decreasing=TRUE)[1:5]
## 
##               OTHER TRAFFIC OFFENSES SUSPENDED OR REVOKED DRIVERS LICENSE 
##                                 5983                                 5680 
##              OTHER CRIMINAL OFFENSES                       MISC JAIL CODE 
##                                 5158                                 4686 
##                     DOMESTIC BATTERY 
##                                 4421
ccs <- names(sort(table(jail$crime.code), decreasing=TRUE)[1:5])
# %in% keeps rows matching any of the five codes; == would recycle ccs element by element
jcc <- jail[which(jail$crime.code %in% ccs),]
t1 <- table(jcc$crime.code, jcc$race)

barplot(colSums(t1))

barplot(rowSums(t1))

barplot(t1)

barplot(t1, beside = TRUE)
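
To keep rows whose value matches any of several categories, %in% tests membership, whereas == compares element by element and recycles the shorter vector. A sketch on made-up data (the codes and race labels are invented for illustration) contrasts the two and then builds the stacked and side-by-side bar plots from a two-way table:

```r
# Hypothetical toy data
d <- data.frame(
  code = c("A", "B", "C", "A", "B", "D"),
  race = c("X", "Y", "X", "Y", "X", "X"),
  stringsAsFactors = FALSE
)
keep <- c("B", "A")

# == recycles keep as B, A, B, A, B, A and compares position by position
by_equal <- d$code == keep
# %in% asks, for each row, whether its code appears anywhere in keep
by_in <- d$code %in% keep

print(sum(by_equal))   # rows matched by position only
print(sum(by_in))      # rows whose code is A or B

d_keep <- d[by_in, ]
tab <- table(d_keep$code, d_keep$race)   # rows: code, columns: race

barplot(tab)                  # stacked: one bar per race, segments per code
barplot(tab, beside = TRUE)   # grouped: codes side by side within each race
```

In this toy case the == filter silently misses two of the four matching rows, which is why membership tests are the safer choice for this kind of subset.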

For data frames, we can leverage that structure to produce multiple plots with one plot function

plot(days.in.jail ~ age.at.arrest + age.at.release, data=jail)

jdf<-data.frame(jail$age.at.arrest, jail$age.at.release, jail$race)
plot(jdf)

One popular multivariate plot is with the pairs function

pairs(jdf)

But if we wanted to see the relationship between age at arrest and days in jail separately for each racial category (here restricted to the Black and White categories), we could use a coplot

jdff <- data.frame(ageatarrest=jail$age.at.arrest, daysspent=jail$days.in.jail, race=jail$race)
jdff<-jdff[which(jdff$race=="Black" | jdff$race=="White"),]
jdff$race<-factor(jdff$race)
table(jdff$race)
## 
## Black White 
## 36319 25175
colnames(jdff)
## [1] "ageatarrest" "daysspent"   "race"
coplot(daysspent ~ ageatarrest | race, data=jdff)

## 
##  Missing rows: 50854, 52604, 53780, 53781, 56394, 56646, 56647, 56648, 57044, 58304, 58305, 58306, 58307, 58308, 58309, 58310, 58311, 58312, 59835, 59836, 59837, 59838, 59839, 59911, 59912, 59913, 59914, 59915, 60118, 60119, 60120, 60123, 60124, 60125, 60126, 60328, 60343, 60344, 60345, 60912, 60913, 60914, 60915, 60916, 61199, 61200, 61201, 61370, 61371, 61372, 61373, 61374, 61375, 61386, 61387, 61388, 61491, 61492

Adding Plotting Arguments

Various arguments can be added to plot calls to improve visual clarity and remove distractions. Here’s a short list.

  • type: changes the plotting type
  • xlab: labels the x axis
  • ylab: labels the y axis
  • main: labels the main title of the plot
  • sub: labels the subtitle of the plot
  • pch: changes the plotting character (the point symbol)
  • cex: scales the size of the plotting character
  • lty: changes the line type
  • lwd: changes the line width
  • xlim: adjusts the lower and upper boundaries of the x axis
  • ylim: adjusts the lower and upper boundaries of the y axis
  • col: changes the color of the points and lines

Now let’s go back and add basic arguments that may help improve our interpretation of the plots.

plot(mat, xlab="Age at Arrest", ylab="Age at Release", main="A Basic Scatter Plot", pch="+")

plot(jail$days.in.jail, xlab="Index", ylab="Days in Jail", main="A Series Plot", type="l", lty=2, lwd=2)

plot(jr, jail$days.in.jail, main="Distributions of Days Spent in Jail Per Racial Group")
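
The calls above use xlab, ylab, main, pch, type, lty and lwd. A further sketch, on simulated data rather than the jail records, of the remaining arguments from the list (sub, xlim, ylim and col), plus cex, the standard graphical parameter for scaling point size:

```r
# Simulated data (hypothetical); set.seed() makes the sketch reproducible
set.seed(1)
age  <- round(runif(50, min = 18, max = 70))
days <- rexp(50, rate = 1 / 20)

plot(age, days,
     xlab = "Age at Arrest", ylab = "Days in Jail",
     main = "Simulated Example", sub = "50 simulated bookings",
     xlim = c(18, 70),               # lower and upper bounds of the x axis
     ylim = c(0, max(days)),         # y axis starts at zero
     pch = 19,                       # solid circle
     cex = 0.8,                      # shrink points to 80% of default size
     col = "steelblue")
```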

Paired Programming Tasks to Do (base R; not tidyverse)

  1. Import the Chicago Food Inspections Data
  • Show the data dimension
  • Show the first 6 rows (head)
  • Create a visualization that shows a bar plot for each level of Results (variable)
  • Create an object that stores all the businesses that appear more than once in the dataset. It’s possible that none are repeat businesses.
  • Among the repeat businesses, show the Results as a bar plot
  • Create a visualization that shows the bar plot of Results for each level of Risk
  2. Import the Urbana Rental Inspections Data
  • Show the data dimension
  • Show the first 6 rows (head)
  • Create a visualization that shows a bar plot for each level of Grade (variable)
  • Create an object that stores all the businesses that appear more than once in the dataset. It’s possible that none are repeat businesses.
  • Among the repeat businesses, show the Grade as a bar plot
  • Create a visualization that shows the bar plot of Grade for each level of License Status
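
One possible base R approach to the "appears more than once" step, sketched on a made-up data frame (the column names name and result are hypothetical and will differ in the real files): count each name with table() and keep the rows whose name has a count above one.

```r
# Hypothetical inspection records
insp <- data.frame(
  name   = c("Cafe A", "Diner B", "Cafe A", "Grill C", "Diner B", "Cafe A"),
  result = c("Pass", "Fail", "Pass", "Pass", "Pass", "Fail"),
  stringsAsFactors = FALSE
)

name_counts <- table(insp$name)
repeat_names <- names(name_counts[name_counts > 1])

# Rows belonging to names that occur more than once
repeats <- insp[insp$name %in% repeat_names, ]

print(repeat_names)
print(nrow(repeats))

# Bar plot of results among the repeat entries
barplot(table(repeats$result))
```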