Introduction to the tidyverse
Importing data with the tidyverse
Data wrangling with the tidyverse
The tidyverse is a collection of R packages developed with data science in mind. The developers often collaborate and have support through the RStudio company. The tidyverse
itself is a package in R that will call upon each package within the collection including: readr
, stringr
, dplyr
, tidyr
, tibble
, purr
, and forcats
. These packages and their functionality work extremely well for tibbles, but they also may be executed on data frames, lists, vectors, and matrices when appropriate (depends on the package and the function).
You’ll want to install the tidyverse
package to simplify the usage of any of these at will, e.g., install.packages("tidyverse")
. Otherwise, you can install each package on its own, e.g., install.packages("readr")
. To use any package alone, simply call it, e.g., library(readr)
. To learn more about any of these packages and their usage, you should read their documentation at https://www.tidyverse.org/packages/.
##Again as I have said repeatedly, make sure your version of R is up-to-date (newest R version is 4.0.3)!
The bulk of what you will handle in life as statisticians will be external datasets. There are several ways to access and import external data but we’ll focus on two: read_table and read_csv using the tidyverse package. The advantages of readr’s importing (e.g. read_csv) over R’s base importing (e.g. read.csv) are: much faster importing, much larger datasets for importing, and more simplified imported data in the form of tibbles.
read_delim()
This is one function for accessing data in general, especially data with tabs or spaces as delimiters or separators. We usually need to supply the file location and the separator. Let’s read in a local data file. First, download the brain.txt to your local computer.
##run this but change the file location
library(tidyverse)
read_delim("C:/Users/CKinson/Downloads/brain.txt", delim="\t") #works without having to first run library(readr)
readr::read_delim("C:/Users/CKinson/Downloads/brain.txt", delim="\t") #works without having to first run library(tidyverse)
read_csv()
This is one function for accessing data with commas as delimiters or separators. It has the same functionality as read_delim
but assumes the default delimiter is comma. Typically, we only need to supply the file location. Let’s read in a local data file. First, download the brain.csv to your local computer.
##run this but change the file location
readr::read_csv("C:/Users/CKinson/Downloads/brain.csv")
We can also use the URLs as the file locations in the tidyverse.
library(tidyverse) #just in case
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##run this but change the file location
readr::read_csv("https://uofi.box.com/shared/static/4v7q5rzxsrceeuzioya99zn2z85mcfno.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## species = col_character(),
## bodykg = col_double(),
## braing = col_double()
## )
## # A tibble: 62 x 3
## species bodykg braing
## <chr> <dbl> <dbl>
## 1 African elephant 6654 5712
## 2 African giant pouched rat 1 6.6
## 3 Arctic Fox 3.38 44.5
## 4 Arctic ground squirrel 0.92 5.7
## 5 Asian elephant 2547 4603
## 6 Baboon 10.6 180.
## 7 Big brown bat 0.023 0.3
## 8 Brazilian tapir 160 169
## 9 Cat 3.3 25.6
## 10 Chimpanzee 52.2 440
## # ... with 52 more rows
One final aspect of importing data with readr
is the idea of importing the data through RStudio’s “Import Dataset” wizard (menus, not purely from code). Importing in this way allows us to control how each column should be formatted, how specific we need the parameters of importing to be, and the source of the data. In other words, we can use the wizard to do any specific coercion or formatting at the time of importing (handling dates and times, coercing character columns to numeric, etc.)
Here’s a procedure for using RStudio’s “Import Dataset” wizard:
Look at your Data Preview and click on any columns that need re-formatting.
Once satisfied with the Data Preview, copy the code in the Code Editor box and paste it into your code chunk.
Data wrangling is the set of procedures and tasks for cleaning and managing data so that the data can be analyzed easily. It is a crucial and time-consuming aspect of data analysis work. In this course, we are going to opt for a mixture of standard R functionality (Week 8) as well as tidyverse functionality (Week 10) to wrangle data in a neat and useful manner. This week, we focus entirely on the tidyverse.
To access a function’s help page for a package that isn’t currently called, use the ?
operator before the name of the package then ::
then the name of the function. For example,
?readr::read_csv
## starting httpd help server ... done
?dplyr::arrange
We’re going to explore 5 common actions for data wrangling. Let’s show some examples with the SBA Loans Data.
Here’s how we can import an old version of the SBA Loans Data. Using read_csv()
will take seconds to import a large dataset.
library(tidyverse)
Sys.time()
## [1] "2020-12-28 15:00:36 CST"
sba <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## LoanNr_ChkDgt = col_double(),
## Zip = col_double(),
## NAICS = col_double(),
## ApprovalFY = col_double(),
## Term = col_double(),
## NoEmp = col_double(),
## NewExist = col_double(),
## CreateJob = col_double(),
## RetainedJob = col_double(),
## FranchiseCode = col_double(),
## UrbanRural = col_double()
## )
## i Use `spec()` for the full column specifications.
## Warning: 18 parsing failures.
## row col expected actual file
## 699733 ApprovalFY no trailing characters 1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 704031 ApprovalFY no trailing characters 1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 705376 ApprovalFY no trailing characters 1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 710382 ApprovalFY no trailing characters 1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 713246 ApprovalFY no trailing characters 1976A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## ...... .......... ...................... ...... .........................................................................
## See problems(...) for more details.
Sys.time()
## [1] "2020-12-28 15:02:03 CST"
Next, we create a subset of the data which is much smaller than the original, print the first few rows of the subset, and remove the original larger data set.
set.seed(448)
SBA <- sba[sample(1:dim(sba)[1],200),] #randomized subset of 200 rows
head(SBA, 10) #first 10 rows
## # A tibble: 10 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 5420744004 GALL~ MANA~ NJ 8050 BANK~ NC 446191 24-May-02
## 2 3590135009 ELEC~ AUST~ TX 78737 SUPE~ CA 332322 21-Aug-09
## 3 8087434004 ALLI~ CAMD~ NJ 8102 PNC ~ DE 722211 10-Nov-04
## 4 4850623002 LENA~ EDGE~ WI 53534 BADG~ WI 333414 19-May-92
## 5 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 6 8987834001 INNO~ BROO~ CO 80020 KEYB~ OH 423320 15-Jul-05
## 7 5636874007 MICH~ BERK~ MI 48072 PNC ~ DE 812199 22-Aug-02
## 8 2641145008 D W ~ TEMPE AZ 85284 NATI~ AZ 541613 16-Aug-07
## 9 2268455008 Inno~ Stro~ OH 44149 PNC ~ OH 541614 9-Feb-07
## 10 7385794000 SPYD~ PEOR~ IL 61615 BBCN~ CA 541511 26-Apr-04
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## # NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## # UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## # DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## # MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>
rm(sba) #remove the original large data file
When we check the mode of some of the columns in the SBA data, we notice that several columns represent dollar amounts yet have been automatically imported as character type. We have already learned about coercion and the pitfalls of coercion with data.frames. There is one big issue with attempting to coerce the dollar amounts in these columns: the literal dollar sign in each row preceding the number. Thus we’ll briefly mention character manipulation.
Strings are characters or sequences of characters. We can work with strings to wrangle them into something we need. The more detailed discussion of this is regular expression and stringr
package in the tidyverse. For now, we will focus on the specific SBA data application.
We need to remove the dollar signs and commas in order to properly coerce the columns representing dollar amounts to be numeric. We didn’t have to worry about white space, because read_csv removed it by default. Since we need to remove characters from the string, we can use stringr
’s str_remove_all()
function in the tidyverse. We’ll use the double escape character \\
before the special characters as our pattern to match as in "\\$|\\,"
.
test1 <- str_remove_all(SBA$DisbursementGross[1:10], "\\,|\\$") #just trying it out for the first 10 values
test1
## [1] "25000.00" "12500.00" "10000.00" "250000.00" "12000.00" "644000.00"
## [7] "80000.00" "135000.00" "25699.00" "15000.00"
as.numeric(test1) #checking coercion
## [1] 25000 12500 10000 250000 12000 644000 80000 135000 25699 15000
Success!
dplyr
package in the tidyverse.We can subset observations based on their values using the filter()
function. This type of subsetting allows us to only show observations that satisfy a particular condition. The subset()
function is never allowed in this course.
Be careful not to confuse these actions in the tidyverse as permanent assignment operations. These actions show us temporary results that are not stored into any object; unless we assign them using the operator <-
.
filter(SBA, State == "CA")
## # A tibble: 29 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 2 5243444006 SUNS~ SANT~ CA 95403 UMPQ~ OR 453998 28-Feb-02
## 3 1088114002 J.G.~ REDO~ CA 90277 OPUS~ CA 541512 2-Apr-97
## 4 1810195009 Maje~ CHAT~ CA 91311 BANK~ NC 337920 9-May-06
## 5 4656573007 CLAS~ LOS ~ CA 90039 CATH~ CA 811121 16-Dec-91
## 6 4912475004 Tali~ SAN ~ CA 94103 JPMO~ IL 327215 13-Oct-11
## 7 2333974001 CLAR~ CARM~ CA 93923 BANK~ CA 0 18-Sep-98
## 8 7944054009 JOHN~ SAN ~ CA 95133 EAST~ CA 423920 15-Oct-04
## 9 1833025007 ALAN~ SAN ~ CA 94117 U.S.~ OH 531210 22-May-06
## 10 3478643006 COLO~ REDW~ CA 94063 CITI~ CA 323122 12-Jul-89
## # ... with 19 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>
filter(SBA, State == "CA", City == "LOS ANGELES")
## # A tibble: 2 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 4656573007 CLAS~ LOS ~ CA 90039 CATH~ CA 811121 16-Dec-91
## 2 6513714010 JOHN~ LOS ~ CA 90015 HANM~ CA 424320 22-Jul-03
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## # NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## # UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## # DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## # MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>
filter(SBA, State == "CA" | City == "LOS ANGELES")
## # A tibble: 29 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 2 5243444006 SUNS~ SANT~ CA 95403 UMPQ~ OR 453998 28-Feb-02
## 3 1088114002 J.G.~ REDO~ CA 90277 OPUS~ CA 541512 2-Apr-97
## 4 1810195009 Maje~ CHAT~ CA 91311 BANK~ NC 337920 9-May-06
## 5 4656573007 CLAS~ LOS ~ CA 90039 CATH~ CA 811121 16-Dec-91
## 6 4912475004 Tali~ SAN ~ CA 94103 JPMO~ IL 327215 13-Oct-11
## 7 2333974001 CLAR~ CARM~ CA 93923 BANK~ CA 0 18-Sep-98
## 8 7944054009 JOHN~ SAN ~ CA 95133 EAST~ CA 423920 15-Oct-04
## 9 1833025007 ALAN~ SAN ~ CA 94117 U.S.~ OH 531210 22-May-06
## 10 3478643006 COLO~ REDW~ CA 94063 CITI~ CA 323122 12-Jul-89
## # ... with 19 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>
filter(SBA, State == "CA" | State == "GA")
## # A tibble: 31 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 2 5243444006 SUNS~ SANT~ CA 95403 UMPQ~ OR 453998 28-Feb-02
## 3 1088114002 J.G.~ REDO~ CA 90277 OPUS~ CA 541512 2-Apr-97
## 4 1810195009 Maje~ CHAT~ CA 91311 BANK~ NC 337920 9-May-06
## 5 4656573007 CLAS~ LOS ~ CA 90039 CATH~ CA 811121 16-Dec-91
## 6 4912475004 Tali~ SAN ~ CA 94103 JPMO~ IL 327215 13-Oct-11
## 7 2333974001 CLAR~ CARM~ CA 93923 BANK~ CA 0 18-Sep-98
## 8 5628613003 MOOR~ COLQ~ GA 31737 PEOP~ GA 444130 6-Apr-93
## 9 7944054009 JOHN~ SAN ~ CA 95133 EAST~ CA 423920 15-Oct-04
## 10 1833025007 ALAN~ SAN ~ CA 94117 U.S.~ OH 531210 22-May-06
## # ... with 21 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>
filter(SBA, State == "CA", State == "GA")
## # A tibble: 0 x 27
## # ... with 27 variables: LoanNr_ChkDgt <dbl>, Name <chr>, City <chr>,
## # State <chr>, Zip <dbl>, Bank <chr>, BankState <chr>, NAICS <dbl>,
## # ApprovalDate <chr>, ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## # NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## # UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## # DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## # MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>
This action sorts the data by a particular set of columns using the arange()
function. We can sort items in ascending (default) or descending order for a particular column. And we can do this for more than one column at once.
arrange(SBA, State, desc(DisbursementGross))
## # A tibble: 200 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2764546006 CLEV~ ATTA~ AL 35954 BIRM~ AL 447110 20-Apr-07
## 2 1632324000 ROGE~ ROGE~ AR 72712 ARVE~ AR 0 21-Oct-97
## 3 4768083007 SAND~ HARR~ AR 72601 REGI~ AL 0 25-Mar-92
## 4 5962774007 WARB~ LAKE~ AZ 86403 WELL~ SD 621399 19-Dec-02
## 5 7071394006 MESA~ PHOE~ AZ 85044 WELL~ SD 811111 29-Dec-03
## 6 4742415004 Dedi~ TEMPE AZ 85281 JPMO~ IL 711320 22-Jun-11
## 7 2641145008 D W ~ TEMPE AZ 85284 NATI~ AZ 541613 16-Aug-07
## 8 3173856008 MOOR~ FULL~ CA 92832 BAY ~ CA 541320 25-Feb-08
## 9 7526034000 CHEV~ SAN ~ CA 92126 PACI~ CA 524210 16-Jun-04
## 10 4912475004 Tali~ SAN ~ CA 94103 JPMO~ IL 327215 13-Oct-11
## # ... with 190 more rows, and 18 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>
SBA2 <- arrange(SBA, State, desc(DisbursementGross))
head(SBA2)
## # A tibble: 6 x 27
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 2764546006 CLEV~ ATTA~ AL 35954 BIRM~ AL 447110 20-Apr-07
## 2 1632324000 ROGE~ ROGE~ AR 72712 ARVE~ AR 0 21-Oct-97
## 3 4768083007 SAND~ HARR~ AR 72601 REGI~ AL 0 25-Mar-92
## 4 5962774007 WARB~ LAKE~ AZ 86403 WELL~ SD 621399 19-Dec-02
## 5 7071394006 MESA~ PHOE~ AZ 85044 WELL~ SD 811111 29-Dec-03
## 6 4742415004 Dedi~ TEMPE AZ 85281 JPMO~ IL 711320 22-Jun-11
## # ... with 18 more variables: ApprovalFY <dbl>, Term <dbl>, NoEmp <dbl>,
## # NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## # UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## # DisbursementDate <chr>, DisbursementGross <chr>, BalanceGross <chr>,
## # MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>
This action describes keeping or dropping variables or columns from a dataset using the select()
function. We might rename variables or columns via the rename()
function. Also, we might change the column order by moving a few columns only and keeping the remaining columns intact (see the everything()
function). Second reminder: these are not permanent assignment operations.
select(SBA, State, NoEmp, DisbursementGross) #keeping columns
## # A tibble: 200 x 3
## State NoEmp DisbursementGross
## <chr> <dbl> <chr>
## 1 NJ 1 $25,000.00
## 2 TX 4 $12,500.00
## 3 NJ 7 $10,000.00
## 4 WI 20 $250,000.00
## 5 CA 11 $12,000.00
## 6 CO 5 $644,000.00
## 7 MI 5 $80,000.00
## 8 AZ 30 $135,000.00
## 9 OH 0 $25,699.00
## 10 IL 5 $15,000.00
## # ... with 190 more rows
select(SBA, -c(State, NoEmp, DisbursementGross)) #dropping columns
## # A tibble: 200 x 24
## LoanNr_ChkDgt Name City Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 5420744004 GALL~ MANA~ 8050 BANK~ NC 446191 24-May-02
## 2 3590135009 ELEC~ AUST~ 78737 SUPE~ CA 332322 21-Aug-09
## 3 8087434004 ALLI~ CAMD~ 8102 PNC ~ DE 722211 10-Nov-04
## 4 4850623002 LENA~ EDGE~ 53534 BADG~ WI 333414 19-May-92
## 5 2490395004 Trex~ W HO~ 90069 BANK~ NC 561990 1-Jun-07
## 6 8987834001 INNO~ BROO~ 80020 KEYB~ OH 423320 15-Jul-05
## 7 5636874007 MICH~ BERK~ 48072 PNC ~ DE 812199 22-Aug-02
## 8 2641145008 D W ~ TEMPE 85284 NATI~ AZ 541613 16-Aug-07
## 9 2268455008 Inno~ Stro~ 44149 PNC ~ OH 541614 9-Feb-07
## 10 7385794000 SPYD~ PEOR~ 61615 BBCN~ CA 541511 26-Apr-04
## # ... with 190 more rows, and 16 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>,
## # UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>,
## # DisbursementDate <chr>, BalanceGross <chr>, MIS_Status <chr>,
## # ChgOffPrinGr <chr>, GrAppv <chr>, SBA_Appv <chr>
select( rename(SBA, TotalEmployees = NoEmp), TotalEmployees, State) #renaming columns
## # A tibble: 200 x 2
## TotalEmployees State
## <dbl> <chr>
## 1 1 NJ
## 2 4 TX
## 3 7 NJ
## 4 20 WI
## 5 11 CA
## 6 5 CO
## 7 5 MI
## 8 30 AZ
## 9 0 OH
## 10 5 IL
## # ... with 190 more rows
select(SBA, DisbursementGross, NoEmp, everything()) #moving some columns to the first few columns instead of their current column order
## # A tibble: 200 x 27
## DisbursementGro~ NoEmp LoanNr_ChkDgt Name City State Zip Bank BankState
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 $25,000.00 1 5420744004 GALL~ MANA~ NJ 8050 BANK~ NC
## 2 $12,500.00 4 3590135009 ELEC~ AUST~ TX 78737 SUPE~ CA
## 3 $10,000.00 7 8087434004 ALLI~ CAMD~ NJ 8102 PNC ~ DE
## 4 $250,000.00 20 4850623002 LENA~ EDGE~ WI 53534 BADG~ WI
## 5 $12,000.00 11 2490395004 Trex~ W HO~ CA 90069 BANK~ NC
## 6 $644,000.00 5 8987834001 INNO~ BROO~ CO 80020 KEYB~ OH
## 7 $80,000.00 5 5636874007 MICH~ BERK~ MI 48072 PNC ~ DE
## 8 $135,000.00 30 2641145008 D W ~ TEMPE AZ 85284 NATI~ AZ
## 9 $25,699.00 0 2268455008 Inno~ Stro~ OH 44149 PNC ~ OH
## 10 $15,000.00 5 7385794000 SPYD~ PEOR~ IL 61615 BBCN~ CA
## # ... with 190 more rows, and 18 more variables: NAICS <dbl>,
## # ApprovalDate <chr>, ApprovalFY <dbl>, Term <dbl>, NewExist <dbl>,
## # CreateJob <dbl>, RetainedJob <dbl>, FranchiseCode <dbl>, UrbanRural <dbl>,
## # RevLineCr <chr>, LowDoc <chr>, ChgOffDate <chr>, DisbursementDate <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>
Mutating a data set is the act of creating new variables or overwriting old ones. Mutating is accomplished through the mutate()
function.
mutate(SBA, LogDG = log(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))), Portion = as.numeric(str_remove_all(SBA_Appv, "\\,|\\$"))/as.numeric(str_remove_all(GrAppv, "\\,|\\$")))
## # A tibble: 200 x 29
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 5420744004 GALL~ MANA~ NJ 8050 BANK~ NC 446191 24-May-02
## 2 3590135009 ELEC~ AUST~ TX 78737 SUPE~ CA 332322 21-Aug-09
## 3 8087434004 ALLI~ CAMD~ NJ 8102 PNC ~ DE 722211 10-Nov-04
## 4 4850623002 LENA~ EDGE~ WI 53534 BADG~ WI 333414 19-May-92
## 5 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 6 8987834001 INNO~ BROO~ CO 80020 KEYB~ OH 423320 15-Jul-05
## 7 5636874007 MICH~ BERK~ MI 48072 PNC ~ DE 812199 22-Aug-02
## 8 2641145008 D W ~ TEMPE AZ 85284 NATI~ AZ 541613 16-Aug-07
## 9 2268455008 Inno~ Stro~ OH 44149 PNC ~ OH 541614 9-Feb-07
## 10 7385794000 SPYD~ PEOR~ IL 61615 BBCN~ CA 541511 26-Apr-04
## # ... with 190 more rows, and 20 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>, LogDG <dbl>, Portion <dbl>
Functions that are already understood in R as vectorized functions can be used within the mutate()
function such as
select( mutate(SBA, Ontology = ifelse(NewExist==2,"new biz", "existing biz")), Ontology, NewExist, NoEmp)
## # A tibble: 200 x 3
## Ontology NewExist NoEmp
## <chr> <dbl> <dbl>
## 1 existing biz 1 1
## 2 existing biz 1 4
## 3 new biz 2 7
## 4 existing biz 1 20
## 5 existing biz 1 11
## 6 existing biz 1 5
## 7 existing biz 1 5
## 8 existing biz 1 30
## 9 new biz 2 0
## 10 existing biz 1 5
## # ... with 190 more rows
We can create grouped summaries of data by calculating a summary value (frequency, mean, median, etc.) for all members within a group. Ideally, we want the result to be a data frame or other recursive object when possible. With the tidyverse, we need two actions to work in tandem: group_by()
and summarise()
.
The group_by()
function allows us to do data processing separately for each group in the data frame. By itself, the function does not affect the data outwardly. The group processing takes place once we combine this function with other dplyr function (verbs), especially summarise()
. The group_by()
can work with mutate()
and arrange()
as well when specified.
We can create grouped summaries of data with the summarise
function. By itself it collapses a data frame into a vector by aggregating. This implies that we first group the data, then summarise. Also notice the difference between mutate()
and summarise()
after the group_by()
.
g <- group_by(SBA, State)
x1<-summarise(g, AvgDG = mean(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))))
## `summarise()` ungrouping output (override with `.groups` argument)
x1
## # A tibble: 46 x 2
## State AvgDG
## <chr> <dbl>
## 1 AL 362000
## 2 AR 59300
## 3 AZ 1077472
## 4 CA 237028.
## 5 CO 262339
## 6 CT 200000
## 7 DE 40000
## 8 FL 150238.
## 9 GA 133500
## 10 HI 15000
## # ... with 36 more rows
x2<-mutate(g, AvgDG = mean(as.numeric(str_remove_all(DisbursementGross, "\\,|\\$"))))
x2
## # A tibble: 200 x 28
## # Groups: State [46]
## LoanNr_ChkDgt Name City State Zip Bank BankState NAICS ApprovalDate
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 5420744004 GALL~ MANA~ NJ 8050 BANK~ NC 446191 24-May-02
## 2 3590135009 ELEC~ AUST~ TX 78737 SUPE~ CA 332322 21-Aug-09
## 3 8087434004 ALLI~ CAMD~ NJ 8102 PNC ~ DE 722211 10-Nov-04
## 4 4850623002 LENA~ EDGE~ WI 53534 BADG~ WI 333414 19-May-92
## 5 2490395004 Trex~ W HO~ CA 90069 BANK~ NC 561990 1-Jun-07
## 6 8987834001 INNO~ BROO~ CO 80020 KEYB~ OH 423320 15-Jul-05
## 7 5636874007 MICH~ BERK~ MI 48072 PNC ~ DE 812199 22-Aug-02
## 8 2641145008 D W ~ TEMPE AZ 85284 NATI~ AZ 541613 16-Aug-07
## 9 2268455008 Inno~ Stro~ OH 44149 PNC ~ OH 541614 9-Feb-07
## 10 7385794000 SPYD~ PEOR~ IL 61615 BBCN~ CA 541511 26-Apr-04
## # ... with 190 more rows, and 19 more variables: ApprovalFY <dbl>, Term <dbl>,
## # NoEmp <dbl>, NewExist <dbl>, CreateJob <dbl>, RetainedJob <dbl>,
## # FranchiseCode <dbl>, UrbanRural <dbl>, RevLineCr <chr>, LowDoc <chr>,
## # ChgOffDate <chr>, DisbursementDate <chr>, DisbursementGross <chr>,
## # BalanceGross <chr>, MIS_Status <chr>, ChgOffPrinGr <chr>, GrAppv <chr>,
## # SBA_Appv <chr>, AvgDG <dbl>
dim(x1);dim(x2)
## [1] 46 2
## [1] 200 28
What happens when you need to work with multiple datasets at once? What happens when the one dataset you have is not enough information? Where do you get the additional information? Combining data sets is a very useful data wrangling operation. Grabbing information from another dataset and adding it to your current one potentially increases your information. Combining data could mean different things in various disciplines or the same thing with different terms such as concatenating, merging, binding, appending, or joining.
Concatenating is the act of combining objects or strings together and is typically done with glue()
in the tidyverse. In order to use the glue function, we may have to install the glue package separately and call it separately as in glue::glue()
.
In the tidyverse (dplyr), binding (or appending) is the act of combining two or more objects by stacking one on top of the other such as bind_rows()
or stacking one next to the other such as bind_cols()
.
Merging (or joining) usually implies combining two or more objects with different columns of information into one single object. This merging would require each of the different data objects to have one column in common with a unique identifying information such as an ID variable or geographic location. There are at least 3 situations that can occur when merging objects.
Data 1
ID | Salary |
---|---|
A | $10K |
B | $11K |
D | $12K |
Data 2
ID | Number |
---|---|
C | 2175551234 |
E | 2175551235 |
F | 2175551236 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | |
B | $11K | |
D | $12K | |
C | 2175551234 | |
E | 2175551235 | |
F | 2175551236 |
Data 1
ID | Salary |
---|---|
A | $10K |
B | $11K |
D | $12K |
Data 2
ID | Number |
---|---|
A | 2175551214 |
B | 2175551224 |
D | 2175551244 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | 2175551214 |
B | $11K | 2175551224 |
D | $12K | 2175551244 |
Data 1
ID | Salary |
---|---|
A | $10K |
D | $12K |
Data 2
ID | Number |
---|---|
A | 2175551214 |
A | 2175551204 |
D | 2175551244 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | 2175551214 |
A | $10K | 2175551204 |
D | $12K | 2175551244 |
How we merge (or join) the data depends on which of the three situations is intended for the data management. Only keeping the matches (#2 and #3 above) could be accomplished using an inner join (inner_join()
in R’s tidyverse-dplyr). Keeping the matches (#2 and #3 above) and non-matches (#1 above) could be accomplished using a full join (full_join()
in R’s tidyverse-dplyr). Whenever the common column of the different data objects contain the same information but have different column names, the easiest fix is to rename the column in one of the two objects (rename()
in R’s tidyverse-dplyr).
The code example below highlights the different meanings for combining data in R coding.
irisdat1 <- iris[1:100, c(1,4,5)]
irisdat2 <- iris[1:100, c(2,3,5)]
irisdat3 <- iris[101:150, c(2,3,5)]
head(irisdat1)
## Sepal.Length Petal.Width Species
## 1 5.1 0.2 setosa
## 2 4.9 0.2 setosa
## 3 4.7 0.2 setosa
## 4 4.6 0.2 setosa
## 5 5.0 0.2 setosa
## 6 5.4 0.4 setosa
head(irisdat2)
## Sepal.Width Petal.Length Species
## 1 3.5 1.4 setosa
## 2 3.0 1.4 setosa
## 3 3.2 1.3 setosa
## 4 3.1 1.5 setosa
## 5 3.6 1.4 setosa
## 6 3.9 1.7 setosa
head(irisdat3)
## Sepal.Width Petal.Length Species
## 101 3.3 6.0 virginica
## 102 2.7 5.1 virginica
## 103 3.0 5.9 virginica
## 104 2.9 5.6 virginica
## 105 3.0 5.8 virginica
## 106 3.0 6.6 virginica
glue::glue("The iris we love most is the {iris$Species} species.")[1:10]
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
## The iris we love most is the setosa species.
head(bind_cols(irisdat1, irisdat2[,-3]))
## Sepal.Length Petal.Width Species Sepal.Width Petal.Length
## 1 5.1 0.2 setosa 3.5 1.4
## 2 4.9 0.2 setosa 3.0 1.4
## 3 4.7 0.2 setosa 3.2 1.3
## 4 4.6 0.2 setosa 3.1 1.5
## 5 5.0 0.2 setosa 3.6 1.4
## 6 5.4 0.4 setosa 3.9 1.7
head(bind_rows(irisdat2, irisdat3))
## Sepal.Width Petal.Length Species
## 1 3.5 1.4 setosa
## 2 3.0 1.4 setosa
## 3 3.2 1.3 setosa
## 4 3.1 1.5 setosa
## 5 3.6 1.4 setosa
## 6 3.9 1.7 setosa
head(inner_join(irisdat1, irisdat2, "Species"))
## Sepal.Length Petal.Width Species Sepal.Width Petal.Length
## 1 5.1 0.2 setosa 3.5 1.4
## 2 5.1 0.2 setosa 3.0 1.4
## 3 5.1 0.2 setosa 3.2 1.3
## 4 5.1 0.2 setosa 3.1 1.5
## 5 5.1 0.2 setosa 3.6 1.4
## 6 5.1 0.2 setosa 3.9 1.7
head(inner_join(mutate(irisdat1, NewSpecies = paste0(Species, 1:100)), mutate(irisdat2, NewSpecies = paste0(Species, 1:100)), "NewSpecies"))
## Sepal.Length Petal.Width Species.x NewSpecies Sepal.Width Petal.Length
## 1 5.1 0.2 setosa setosa1 3.5 1.4
## 2 4.9 0.2 setosa setosa2 3.0 1.4
## 3 4.7 0.2 setosa setosa3 3.2 1.3
## 4 4.6 0.2 setosa setosa4 3.1 1.5
## 5 5.0 0.2 setosa setosa5 3.6 1.4
## 6 5.4 0.4 setosa setosa6 3.9 1.7
## Species.y
## 1 setosa
## 2 setosa
## 3 setosa
## 4 setosa
## 5 setosa
## 6 setosa
There are a host of other functions and operations to be discovered in tidyverse’s dplyr. Please check out the RStudio cheatsheets for dplyr https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf.