STAT 385 Statistics Programming Methods

Week 10 Notes

Created by Christopher Kinson

Things Covered in this Week’s Notes

Introduction to the tidyverse
Importing data with the tidyverse
Data wrangling with the tidyverse

Introduction to the tidyverse

The tidyverse is a collection of R packages developed with data science in mind. The developers often collaborate and have support through the RStudio company. The tidyverse itself is a package in R that will call upon each package within the collection including: readr, stringr, dplyr, tidyr, tibble, purr, and forcats. These packages and their functionality work extremely well for tibbles, but they also may be executed on data frames, lists, vectors, and matrices when appropriate (depends on the package and the function).

You’ll want to install the tidyverse package to simplify the usage of any of these at will, e.g., install.packages("tidyverse"). Otherwise, you can install each package on its own, e.g., install.packages("readr"). To use any package alone, simply call it, e.g., library(readr). To learn more about any of these packages and their usage, you should read their documentation at https://www.tidyverse.org/packages/.

##Again as I have said repeatedly, make sure your version of R is up-to-date (newest R version is 4.0.3)!

Importing data in the tidyverse

The bulk of what you will handle in life as statisticians will be external datasets. There are several ways to access and import external data but we’ll focus on two: read_table and read_csv using the tidyverse package. The advantages of readr’s importing (e.g. read_csv) over R’s base importing (e.g. read.csv) are: much faster importing, much larger datasets for importing, and more simplified imported data in the form of tibbles.

using `read_delim()`

This is one function for accessing data in general, especially data with tabs or spaces as delimiters or separators. We usually need to supply the file location and the separator. Let’s read in a local data file. First, download the brain.txt to your local computer.

##run this but change the file location
library(tidyverse)
read_delim("C:/Users/CKinson/Downloads/brain.txt", delim="\t") #works without having to first run library(readr)

readr::read_delim("C:/Users/CKinson/Downloads/brain.txt", delim="\t") #works without having to first run library(tidyverse)

using `read_csv()`

This is one function for accessing data with commas as delimiters or separators. It has the same functionality as read_delim but assumes the default delimiter is comma. Typically, we only need to supply the file location. Let’s read in a local data file. First, download the brain.csv to your local computer.

##run this but change the file location
readr::read_csv("C:/Users/CKinson/Downloads/brain.csv")

We can also use the URLs as the file locations in the tidyverse.

library(tidyverse) #just in case

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

##run this but change the file location
readr::read_csv("https://uofi.box.com/shared/static/4v7q5rzxsrceeuzioya99zn2z85mcfno.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   species = col_character(),
##   bodykg = col_double(),
##   braing = col_double()
## )

## # A tibble: 62 x 3
##    species                     bodykg braing
##    <chr>                        <dbl>  <dbl>
##  1 African elephant          6654     5712  
##  2 African giant pouched rat    1        6.6
##  3 Arctic Fox                   3.38    44.5
##  4 Arctic ground squirrel       0.92     5.7
##  5 Asian elephant            2547     4603  
##  6 Baboon                      10.6    180. 
##  7 Big brown bat                0.023    0.3
##  8 Brazilian tapir            160      169  
##  9 Cat                          3.3     25.6
## 10 Chimpanzee                  52.2    440  
## # ... with 52 more rows

One final aspect of importing data with readr is the idea of importing the data through RStudio’s “Import Dataset” wizard (menus, not purely from code). Importing in this way allows us to control how each column should be formatted, how specific we need the parameters of importing to be, and the source of the data. In other words, we can use the wizard to do any specific coercion or formatting at the time of importing (handling dates and times, coercing character columns to numeric, etc.)

Here’s a procedure for using RStudio’s “Import Dataset” wizard:

Go to File > Import Dataset > From Text (readr).

Paste the data file location in the File/URL box.

Look at your Data Preview and click on any columns that need re-formatting.
Once satisfied with the Data Preview, copy the code in the Code Editor box and paste it into your code chunk.

Data wrangling with the tidyverse

Data wrangling is the set of procedures and tasks for cleaning and managing data so that the data can be analyzed easily. It is a crucial and time-consuming aspect of data analysis work. In this course, we are going to opt for a mixture of standard R functionality (Week 8) as well as tidyverse functionality (Week 10) to wrangle data in a neat and useful manner. This week, we focus entirely on the tidyverse.

To access a function’s help page for a package that isn’t currently called, use the ? operator before the name of the package then :: then the name of the function. For example,

?readr::read_csv

## starting httpd help server ... done

?dplyr::arrange

We’re going to explore 5 common actions for data wrangling. Let’s show some examples with the SBA Loans Data.

Here’s how we can import an old version of the SBA Loans Data. Using read_csv() will take seconds to import a large dataset.

library(tidyverse)
Sys.time()

## [1] "2020-12-28 15:00:36 CST"

sba <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   LoanNr_ChkDgt = col_double(),
##   Zip = col_double(),
##   NAICS = col_double(),
##   ApprovalFY = col_double(),
##   Term = col_double(),
##   NoEmp = col_double(),
##   NewExist = col_double(),
##   CreateJob = col_double(),
##   RetainedJob = col_double(),
##   FranchiseCode = col_double(),
##   UrbanRural = col_double()
## )
## i Use `spec()` for the full column specifications.

## Warning: 18 parsing failures.
##    row        col               expected actual                                                                      file
## 699733 ApprovalFY no trailing characters  1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 704031 ApprovalFY no trailing characters  1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 705376 ApprovalFY no trailing characters  1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 710382 ApprovalFY no trailing characters  1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 713246 ApprovalFY no trailing characters  1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## ...... .......... ...................... ...... .........................................................................
## See problems(...) for more details.

Sys.time()

## [1] "2020-12-28 15:02:03 CST"

Next, we create a subset of the data which is much smaller than the original, print the first few rows of the subset, and remove the original larger data set.

set.seed(448)
SBA <- sba[sample(1:dim(sba)[1],200),] #randomized subset of 200 rows
head(SBA, 10) #first 10 rows

## # A tibble: 10 x 27
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    5420744004 GALL~ MANA~ NJ     8050 BANK~ NC        446191 24-May-02   
##  2    3590135009 ELEC~ AUST~ TX    78737 SUPE~ CA        332322 21-Aug-09   
##  3    8087434004 ALLI~ CAMD~ NJ     8102 PNC ~ DE        722211 10-Nov-04   
##  4    4850623002 LENA~ EDGE~ WI    53534 BADG~ WI        333414 19-May-92   
##  5    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  6    8987834001 INNO~ BROO~ CO    80020 KEYB~ OH        423320 15-Jul-05   
##  7    5636874007 MICH~ BERK~ MI    48072 PNC ~ DE        812199 22-Aug-02   
##  8    2641145008 D W ~ TEMPE AZ    85284 NATI~ AZ        541613 16-Aug-07   
##  9    2268455008 Inno~ Stro~ OH    44149 PNC ~ OH        541614 9-Feb-07    
## 10    7385794000 SPYD~ PEOR~ IL    61615 BBCN~ CA        541511 26-Apr-04   
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## #   NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## #   UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## #   DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## #   MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>

rm(sba) #remove the original large data file

When we check the mode of some of the columns in the SBA data, we notice that several columns represent dollar amounts yet have been automatically imported as character type. We have already learned about coercion and the pitfalls of coercion with data.frames. There is one big issue with attempting to coerce the dollar amounts in these columns: the literal dollar sign in each row preceding the number. Thus we’ll briefly mention character manipulation.

String manipulation (preview of regular expressions)

Strings are characters or sequences of characters. We can work with strings to wrangle them into something we need. The more detailed discussion of this is regular expression and stringr package in the tidyverse. For now, we will focus on the specific SBA data application.

We need to remove the dollar signs and commas in order to properly coerce the columns representing dollar amounts to be numeric. We didn’t have to worry about white space, because read_csv removed it by default. Since we need to remove characters from the string, we can use stringr’s str_remove_all() function in the tidyverse. We’ll use the double escape character \\ before the special characters as our pattern to match as in "\\$|\\,".

test1 <- str_remove_all(SBA$DisbursementGross[1:10], "\\,|\\$") #just trying it out for the first 10 values
test1

##  [1] "25000.00"  "12500.00"  "10000.00"  "250000.00" "12000.00"  "644000.00"
##  [7] "80000.00"  "135000.00" "25699.00"  "15000.00"

as.numeric(test1) #checking coercion

##  [1]  25000  12500  10000 250000  12000 644000  80000 135000  25699  15000

Success!

The following actions are typically found in the `dplyr` package in the tidyverse.

Filtering

We can subset observations based on their values using the filter() function. This type of subsetting allows us to only show observations that satisfy a particular condition. The subset() function is never allowed in this course.

Be careful not to confuse these actions in the tidyverse as permanent assignment operations. These actions show us temporary results that are not stored into any object; unless we assign them using the operator <-.

filter(SBA, State == "CA")

## # A tibble: 29 x 27
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  2    5243444006 SUNS~ SANT~ CA    95403 UMPQ~ OR        453998 28-Feb-02   
##  3    1088114002 J.G.~ REDO~ CA    90277 OPUS~ CA        541512 2-Apr-97    
##  4    1810195009 Maje~ CHAT~ CA    91311 BANK~ NC        337920 9-May-06    
##  5    4656573007 CLAS~ LOS ~ CA    90039 CATH~ CA        811121 16-Dec-91   
##  6    4912475004 Tali~ SAN ~ CA    94103 JPMO~ IL        327215 13-Oct-11   
##  7    2333974001 CLAR~ CARM~ CA    93923 BANK~ CA             0 18-Sep-98   
##  8    7944054009 JOHN~ SAN ~ CA    95133 EAST~ CA        423920 15-Oct-04   
##  9    1833025007 ALAN~ SAN ~ CA    94117 U.S.~ OH        531210 22-May-06   
## 10    3478643006 COLO~ REDW~ CA    94063 CITI~ CA        323122 12-Jul-89   
## # ... with 19 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>

filter(SBA, State == "CA", City == "LOS ANGELES")

## # A tibble: 2 x 27
##   LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##           <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
## 1    4656573007 CLAS~ LOS ~ CA    90039 CATH~ CA        811121 16-Dec-91   
## 2    6513714010 JOHN~ LOS ~ CA    90015 HANM~ CA        424320 22-Jul-03   
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## #   NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## #   UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## #   DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## #   MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>

filter(SBA, State == "CA" | City == "LOS ANGELES")

## # A tibble: 29 x 27
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  2    5243444006 SUNS~ SANT~ CA    95403 UMPQ~ OR        453998 28-Feb-02   
##  3    1088114002 J.G.~ REDO~ CA    90277 OPUS~ CA        541512 2-Apr-97    
##  4    1810195009 Maje~ CHAT~ CA    91311 BANK~ NC        337920 9-May-06    
##  5    4656573007 CLAS~ LOS ~ CA    90039 CATH~ CA        811121 16-Dec-91   
##  6    4912475004 Tali~ SAN ~ CA    94103 JPMO~ IL        327215 13-Oct-11   
##  7    2333974001 CLAR~ CARM~ CA    93923 BANK~ CA             0 18-Sep-98   
##  8    7944054009 JOHN~ SAN ~ CA    95133 EAST~ CA        423920 15-Oct-04   
##  9    1833025007 ALAN~ SAN ~ CA    94117 U.S.~ OH        531210 22-May-06   
## 10    3478643006 COLO~ REDW~ CA    94063 CITI~ CA        323122 12-Jul-89   
## # ... with 19 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>

filter(SBA, State == "CA" | State == "GA")

## # A tibble: 31 x 27
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  2    5243444006 SUNS~ SANT~ CA    95403 UMPQ~ OR        453998 28-Feb-02   
##  3    1088114002 J.G.~ REDO~ CA    90277 OPUS~ CA        541512 2-Apr-97    
##  4    1810195009 Maje~ CHAT~ CA    91311 BANK~ NC        337920 9-May-06    
##  5    4656573007 CLAS~ LOS ~ CA    90039 CATH~ CA        811121 16-Dec-91   
##  6    4912475004 Tali~ SAN ~ CA    94103 JPMO~ IL        327215 13-Oct-11   
##  7    2333974001 CLAR~ CARM~ CA    93923 BANK~ CA             0 18-Sep-98   
##  8    5628613003 MOOR~ COLQ~ GA    31737 PEOP~ GA        444130 6-Apr-93    
##  9    7944054009 JOHN~ SAN ~ CA    95133 EAST~ CA        423920 15-Oct-04   
## 10    1833025007 ALAN~ SAN ~ CA    94117 U.S.~ OH        531210 22-May-06   
## # ... with 21 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>

filter(SBA, State == "CA", State == "GA")

## # A tibble: 0 x 27
## # ... with 27 variables: LoanNr_ChkDgt <dbl>, Name <chr>, City <chr>,
## #   State <chr>, Zip <dbl>, Bank <chr>, BankState <chr>, NAICS <dbl>,
## #   ApprovalDate <chr>, ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## #   NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## #   UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## #   DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## #   MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>

Arranging

This action sorts the data by a particular set of columns using the arange() function. We can sort items in ascending (default) or descending order for a particular column. And we can do this for more than one column at once.

arrange(SBA, State, desc(DisbursementGross))

## # A tibble: 200 x 27
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    2764546006 CLEV~ ATTA~ AL    35954 BIRM~ AL        447110 20-Apr-07   
##  2    1632324000 ROGE~ ROGE~ AR    72712 ARVE~ AR             0 21-Oct-97   
##  3    4768083007 SAND~ HARR~ AR    72601 REGI~ AL             0 25-Mar-92   
##  4    5962774007 WARB~ LAKE~ AZ    86403 WELL~ SD        621399 19-Dec-02   
##  5    7071394006 MESA~ PHOE~ AZ    85044 WELL~ SD        811111 29-Dec-03   
##  6    4742415004 Dedi~ TEMPE AZ    85281 JPMO~ IL        711320 22-Jun-11   
##  7    2641145008 D W ~ TEMPE AZ    85284 NATI~ AZ        541613 16-Aug-07   
##  8    3173856008 MOOR~ FULL~ CA    92832 BAY ~ CA        541320 25-Feb-08   
##  9    7526034000 CHEV~ SAN ~ CA    92126 PACI~ CA        524210 16-Jun-04   
## 10    4912475004 Tali~ SAN ~ CA    94103 JPMO~ IL        327215 13-Oct-11   
## # ... with 190 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>

SBA2 <- arrange(SBA, State, desc(DisbursementGross)) 
head(SBA2)

## # A tibble: 6 x 27
##   LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##           <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
## 1    2764546006 CLEV~ ATTA~ AL    35954 BIRM~ AL        447110 20-Apr-07   
## 2    1632324000 ROGE~ ROGE~ AR    72712 ARVE~ AR             0 21-Oct-97   
## 3    4768083007 SAND~ HARR~ AR    72601 REGI~ AL             0 25-Mar-92   
## 4    5962774007 WARB~ LAKE~ AZ    86403 WELL~ SD        621399 19-Dec-02   
## 5    7071394006 MESA~ PHOE~ AZ    85044 WELL~ SD        811111 29-Dec-03   
## 6    4742415004 Dedi~ TEMPE AZ    85281 JPMO~ IL        711320 22-Jun-11   
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## #   NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## #   UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## #   DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## #   MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>

Selecting

This action describes keeping or dropping variables or columns from a dataset using the select() function. We might rename variables or columns via the rename() function. Also, we might change the column order by moving a few columns only and keeping the remaining columns intact (see the everything() function). Second reminder: these are not permanent assignment operations.

select(SBA, State, NoEmp, DisbursementGross) #keeping columns

## # A tibble: 200 x 3
##    State NoEmp DisbursementGross
##    <chr> <dbl> <chr>            
##  1 NJ        1 $25,000.00       
##  2 TX        4 $12,500.00       
##  3 NJ        7 $10,000.00       
##  4 WI       20 $250,000.00      
##  5 CA       11 $12,000.00       
##  6 CO        5 $644,000.00      
##  7 MI        5 $80,000.00       
##  8 AZ       30 $135,000.00      
##  9 OH        0 $25,699.00       
## 10 IL        5 $15,000.00       
## # ... with 190 more rows

select(SBA, -c(State, NoEmp, DisbursementGross)) #dropping columns

## # A tibble: 200 x 24
##    LoanNr_ChkDgt Name  City    Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    5420744004 GALL~ MANA~  8050 BANK~ NC        446191 24-May-02   
##  2    3590135009 ELEC~ AUST~ 78737 SUPE~ CA        332322 21-Aug-09   
##  3    8087434004 ALLI~ CAMD~  8102 PNC ~ DE        722211 10-Nov-04   
##  4    4850623002 LENA~ EDGE~ 53534 BADG~ WI        333414 19-May-92   
##  5    2490395004 Trex~ W HO~ 90069 BANK~ NC        561990 1-Jun-07    
##  6    8987834001 INNO~ BROO~ 80020 KEYB~ OH        423320 15-Jul-05   
##  7    5636874007 MICH~ BERK~ 48072 PNC ~ DE        812199 22-Aug-02   
##  8    2641145008 D W ~ TEMPE 85284 NATI~ AZ        541613 16-Aug-07   
##  9    2268455008 Inno~ Stro~ 44149 PNC ~ OH        541614 9-Feb-07    
## 10    7385794000 SPYD~ PEOR~ 61615 BBCN~ CA        541511 26-Apr-04   
## # ... with 190 more rows, and 16 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## #   UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## #   DisbursementDate <chr>, BalanceGross <chr>, MIS_Status <chr>,
## #   ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>

select( rename(SBA, TotalEmployees = NoEmp), TotalEmployees, State) #renaming columns

## # A tibble: 200 x 2
##    TotalEmployees State
##             <dbl> <chr>
##  1              1 NJ   
##  2              4 TX   
##  3              7 NJ   
##  4             20 WI   
##  5             11 CA   
##  6              5 CO   
##  7              5 MI   
##  8             30 AZ   
##  9              0 OH   
## 10              5 IL   
## # ... with 190 more rows

select(SBA, DisbursementGross, NoEmp, everything()) #moving some columns to the first few columns instead of their current column order

## # A tibble: 200 x 27
##    DisbursementGro~ NoEmp LoanNr_ChkDgt Name  City  State   Zip Bank  BankState
##    <chr>            <dbl>         <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>    
##  1 $25,000.00           1    5420744004 GALL~ MANA~ NJ     8050 BANK~ NC       
##  2 $12,500.00           4    3590135009 ELEC~ AUST~ TX    78737 SUPE~ CA       
##  3 $10,000.00           7    8087434004 ALLI~ CAMD~ NJ     8102 PNC ~ DE       
##  4 $250,000.00         20    4850623002 LENA~ EDGE~ WI    53534 BADG~ WI       
##  5 $12,000.00          11    2490395004 Trex~ W HO~ CA    90069 BANK~ NC       
##  6 $644,000.00          5    8987834001 INNO~ BROO~ CO    80020 KEYB~ OH       
##  7 $80,000.00           5    5636874007 MICH~ BERK~ MI    48072 PNC ~ DE       
##  8 $135,000.00         30    2641145008 D W ~ TEMPE AZ    85284 NATI~ AZ       
##  9 $25,699.00           0    2268455008 Inno~ Stro~ OH    44149 PNC ~ OH       
## 10 $15,000.00           5    7385794000 SPYD~ PEOR~ IL    61615 BBCN~ CA       
## # ... with 190 more rows, and 18 more variables: NAICS <dbl>,
## #   ApprovalDate <chr>, ApprovalFY <dbl>, Term <dbl>, NewExist <dbl>,
## #   CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>, UrbanRural <dbl>,
## #   RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>, DisbursementDate <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>

Mutating

Mutating a data set is the act of creating new variables or overwriting old ones. Mutating is accomplished through the mutate() function.

mutate(SBA, LogDG = log(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))), Portion = as.numeric(str_remove_all(SBA_Appv, "\\,|\\$"))/as.numeric(str_remove_all(GrAppv, "\\,|\\$")))

## # A tibble: 200 x 29
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    5420744004 GALL~ MANA~ NJ     8050 BANK~ NC        446191 24-May-02   
##  2    3590135009 ELEC~ AUST~ TX    78737 SUPE~ CA        332322 21-Aug-09   
##  3    8087434004 ALLI~ CAMD~ NJ     8102 PNC ~ DE        722211 10-Nov-04   
##  4    4850623002 LENA~ EDGE~ WI    53534 BADG~ WI        333414 19-May-92   
##  5    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  6    8987834001 INNO~ BROO~ CO    80020 KEYB~ OH        423320 15-Jul-05   
##  7    5636874007 MICH~ BERK~ MI    48072 PNC ~ DE        812199 22-Aug-02   
##  8    2641145008 D W ~ TEMPE AZ    85284 NATI~ AZ        541613 16-Aug-07   
##  9    2268455008 Inno~ Stro~ OH    44149 PNC ~ OH        541614 9-Feb-07    
## 10    7385794000 SPYD~ PEOR~ IL    61615 BBCN~ CA        541511 26-Apr-04   
## # ... with 190 more rows, and 20 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>, LogDG <dbl>, Portion <dbl>

Functions that are already understood in R as vectorized functions can be used within the mutate() function such as

select( mutate(SBA, Ontology = ifelse(NewExist==2,"new biz", "existing biz")), Ontology, NewExist, NoEmp)

## # A tibble: 200 x 3
##    Ontology     NewExist NoEmp
##    <chr>           <dbl> <dbl>
##  1 existing biz        1     1
##  2 existing biz        1     4
##  3 new biz             2     7
##  4 existing biz        1    20
##  5 existing biz        1    11
##  6 existing biz        1     5
##  7 existing biz        1     5
##  8 existing biz        1    30
##  9 new biz             2     0
## 10 existing biz        1     5
## # ... with 190 more rows

Summarizing

We can create grouped summaries of data by calculating a summary value (frequency, mean, median, etc.) for all members within a group. Ideally, we want the result to be a data frame or other recursive object when possible. With the tidyverse, we need two actions to work in tandem: group_by() and summarise().

The group_by() function allows us to do data processing separately for each group in the data frame. By itself, the function does not affect the data outwardly. The group processing takes place once we combine this function with other dplyr function (verbs), especially summarise(). The group_by() can work with mutate() and arrange() as well when specified.

We can create grouped summaries of data with the summarise function. By itself it collapses a data frame into a vector by aggregating. This implies that we first group the data, then summarise. Also notice the difference between mutate() and summarise() after the group_by().

g <- group_by(SBA, State)

x1<-summarise(g, AvgDG = mean(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))))

## `summarise()` ungrouping output (override with `.groups` argument)

x1

## # A tibble: 46 x 2
##    State    AvgDG
##    <chr>    <dbl>
##  1 AL     362000 
##  2 AR      59300 
##  3 AZ    1077472 
##  4 CA     237028.
##  5 CO     262339 
##  6 CT     200000 
##  7 DE      40000 
##  8 FL     150238.
##  9 GA     133500 
## 10 HI      15000 
## # ... with 36 more rows

x2<-mutate(g, AvgDG = mean(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))))
x2

## # A tibble: 200 x 28
## # Groups:   State [46]
##    LoanNr_ChkDgt Name  City  State   Zip Bank  BankState  NAICS ApprovalDate
##            <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>      <dbl> <chr>       
##  1    5420744004 GALL~ MANA~ NJ     8050 BANK~ NC        446191 24-May-02   
##  2    3590135009 ELEC~ AUST~ TX    78737 SUPE~ CA        332322 21-Aug-09   
##  3    8087434004 ALLI~ CAMD~ NJ     8102 PNC ~ DE        722211 10-Nov-04   
##  4    4850623002 LENA~ EDGE~ WI    53534 BADG~ WI        333414 19-May-92   
##  5    2490395004 Trex~ W HO~ CA    90069 BANK~ NC        561990 1-Jun-07    
##  6    8987834001 INNO~ BROO~ CO    80020 KEYB~ OH        423320 15-Jul-05   
##  7    5636874007 MICH~ BERK~ MI    48072 PNC ~ DE        812199 22-Aug-02   
##  8    2641145008 D W ~ TEMPE AZ    85284 NATI~ AZ        541613 16-Aug-07   
##  9    2268455008 Inno~ Stro~ OH    44149 PNC ~ OH        541614 9-Feb-07    
## 10    7385794000 SPYD~ PEOR~ IL    61615 BBCN~ CA        541511 26-Apr-04   
## # ... with 190 more rows, and 19 more variables: ApprovalFY <dbl>, Term <dbl>,
## #   NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## #   FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## #   ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## #   BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## #   SBA_Appv <chr>, AvgDG <dbl>

dim(x1);dim(x2)

## [1] 46  2

## [1] 200  28

Combining

What happens when you need to work with multiple datasets at once? What happens when the one dataset you have is not enough information? Where do you get the additional information? Combining data sets is a very useful data wrangling operation. Grabbing information from another dataset and adding it to your current one potentially increases your information. Combining data could mean different things in various disciplines or the same thing with different terms such as concatenating, merging, binding, appending, or joining.

Concatenating is the act of combining objects or strings together and is typically done with glue() in the tidyverse. In order to use the glue function, we may have to install the glue package separately and call it separately as in glue::glue().

In the tidyverse (dplyr), binding (or appending) is the act of combining two or more objects by stacking one on top of the other such as bind_rows() or stacking one next to the other such as bind_cols().

Merging (or joining) usually implies combining two or more objects with different columns of information into one single object. This merging would require each of the different data objects to have one column in common with a unique identifying information such as an ID variable or geographic location. There are at least 3 situations that can occur when merging objects.

Observations in the two (or more) separate objects could not match each other.

Data 1

ID	Salary
A	$10K
B	$11K
D	$12K

Data 2

ID	Number
C	2175551234
E	2175551235
F	2175551236

Merged Data

ID	Salary	Number
A	$10K
B	$11K
D	$12K
C		2175551234
E		2175551235
F		2175551236

Observations in the two (or more) separate objects could match each other one-to-one.

Data 1

ID	Salary
A	$10K
B	$11K
D	$12K

Data 2

ID	Number
A	2175551214
B	2175551224
D	2175551244

Merged Data

ID	Salary	Number
A	$10K	2175551214
B	$11K	2175551224
D	$12K	2175551244

Observations in the two (or more) separate objects could match each other one-to-many (or many-to-one).

Data 1

ID	Salary
A	$10K
D	$12K

Data 2

ID	Number
A	2175551214
A	2175551204
D	2175551244

Merged Data

ID	Salary	Number
A	$10K	2175551214
A	$10K	2175551204
D	$12K	2175551244

How we merge (or join) the data depends on which of the three situations is intended for the data management. Only keeping the matches (#2 and #3 above) could be accomplished using an inner join (inner_join() in R’s tidyverse-dplyr). Keeping the matches (#2 and #3 above) and non-matches (#1 above) could be accomplished using a full join (full_join() in R’s tidyverse-dplyr). Whenever the common column of the different data objects contain the same information but have different column names, the easiest fix is to rename the column in one of the two objects (rename() in R’s tidyverse-dplyr).

The code example below highlights the different meanings for combining data in R coding.

irisdat1 <- iris[1:100, c(1,4,5)]
irisdat2 <- iris[1:100, c(2,3,5)]
irisdat3 <- iris[101:150, c(2,3,5)]

head(irisdat1)

##   Sepal.Length Petal.Width Species
## 1          5.1         0.2  setosa
## 2          4.9         0.2  setosa
## 3          4.7         0.2  setosa
## 4          4.6         0.2  setosa
## 5          5.0         0.2  setosa
## 6          5.4         0.4  setosa

head(irisdat2)

##   Sepal.Width Petal.Length Species
## 1         3.5          1.4  setosa
## 2         3.0          1.4  setosa
## 3         3.2          1.3  setosa
## 4         3.1          1.5  setosa
## 5         3.6          1.4  setosa
## 6         3.9          1.7  setosa

head(irisdat3)

##     Sepal.Width Petal.Length   Species
## 101         3.3          6.0 virginica
## 102         2.7          5.1 virginica
## 103         3.0          5.9 virginica
## 104         2.9          5.6 virginica
## 105         3.0          5.8 virginica
## 106         3.0          6.6 virginica

glue::glue("The iris we love most is the {iris$Species} species.")[1:10]

## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.

head(bind_cols(irisdat1, irisdat2[,-3]))

##   Sepal.Length Petal.Width Species Sepal.Width Petal.Length
## 1          5.1         0.2  setosa         3.5          1.4
## 2          4.9         0.2  setosa         3.0          1.4
## 3          4.7         0.2  setosa         3.2          1.3
## 4          4.6         0.2  setosa         3.1          1.5
## 5          5.0         0.2  setosa         3.6          1.4
## 6          5.4         0.4  setosa         3.9          1.7

head(bind_rows(irisdat2, irisdat3))

##   Sepal.Width Petal.Length Species
## 1         3.5          1.4  setosa
## 2         3.0          1.4  setosa
## 3         3.2          1.3  setosa
## 4         3.1          1.5  setosa
## 5         3.6          1.4  setosa
## 6         3.9          1.7  setosa

head(inner_join(irisdat1, irisdat2, "Species"))

##   Sepal.Length Petal.Width Species Sepal.Width Petal.Length
## 1          5.1         0.2  setosa         3.5          1.4
## 2          5.1         0.2  setosa         3.0          1.4
## 3          5.1         0.2  setosa         3.2          1.3
## 4          5.1         0.2  setosa         3.1          1.5
## 5          5.1         0.2  setosa         3.6          1.4
## 6          5.1         0.2  setosa         3.9          1.7

head(inner_join(mutate(irisdat1, NewSpecies = paste0(Species, 1:100)), mutate(irisdat2, NewSpecies = paste0(Species, 1:100)), "NewSpecies"))

##   Sepal.Length Petal.Width Species.x NewSpecies Sepal.Width Petal.Length
## 1          5.1         0.2    setosa    setosa1         3.5          1.4
## 2          4.9         0.2    setosa    setosa2         3.0          1.4
## 3          4.7         0.2    setosa    setosa3         3.2          1.3
## 4          4.6         0.2    setosa    setosa4         3.1          1.5
## 5          5.0         0.2    setosa    setosa5         3.6          1.4
## 6          5.4         0.4    setosa    setosa6         3.9          1.7
##   Species.y
## 1    setosa
## 2    setosa
## 3    setosa
## 4    setosa
## 5    setosa
## 6    setosa

There are a host of other functions and operations to be discovered in tidyverse’s dplyr. Please check out the RStudio cheatsheets for dplyr https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf.

STAT 385 Statistics Programming Methods - Fall 2020

Week 10 Notes

Created by Christopher Kinson

Things Covered in this Week’s Notes

Introduction to the tidyverse

Importing data in the tidyverse

using read_delim()

using read_csv()

Data wrangling with the tidyverse

String manipulation (preview of regular expressions)

The following actions are typically found in the dplyr package in the tidyverse.

Filtering

Arranging

Selecting

Mutating

Summarizing

Combining

using `read_delim()`

using `read_csv()`

The following actions are typically found in the `dplyr` package in the tidyverse.