Week 4 Notes

Created by Prof. Christopher Kinson


Things Covered in this Week's Notes

Validating data

Cleaning data

Validating data

Validation essentially means checking something for accuracy. Validating data means checking a dataset for invalid or inaccurate entries. This checking is important because, as data engineers, we do not want the analysts and data scientists to start their analysis with bad data. The first step consists of defining errors. Data errors, or data glitches, are entries that should not be there and may be caused by human error or machine error. Not always, but often, these errors are fixable with enough context and background information about the data.

Some common data errors include duplicate observations, missing values, out-of-range or impossible numeric values, inconsistent units of measurement, misspelled words, and inconsistent text casing.

In this section of the notes, we will utilize data validation strategies that could help us identify data errors.

Strategy 1 (Subsetting and Arranging) We can view a printout of a subset based on specific conditions to identify values that either violate or satisfy the conditions. Arranging may help organize the values to quickly assess start or end points. Do review subsetting with characters in the Week 3 Notes, and pay special attention to how to implement regular expressions in your preferred programming language. There are alternatives to using regular expressions for dealing with character strings, just as there are coding alternatives for achieving most results in this course. Students may use whichever approach they are comfortable with, so long as it aligns with an assignment's instructions.

#SAS
filename jail url "https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt";
proc import datafile=jail out=jail_data dbms=csv replace;
 getnames=yes;
run;
proc sort data=jail_data;
 by descending age_at_arrest;
run;
data jail_data0;
 set jail_data;
 if days_in_jail>2 and state eq "ILLINOIS";
 keep booking_date booking_number sex race days_in_jail;
run;
proc print data=jail_data0 (obs=10);
run;

#R
library(tidyverse)
jail_data <- read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
jail_data %>%
  arrange(desc(`Age at Arrest`)) %>%
  filter(`Days in Jail` > 2, STATE == "ILLINOIS") %>%
  select(`BOOKING DATE`, `BOOKING NUMBER`, SEX, RACE, `Days in Jail`)

#Python
TO BE ADDED LATER
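
Until then, here is a minimal pandas sketch of the same subsetting and arranging. Using pandas is an assumption here (not the course's official Python approach); the URL and column names are taken from the R example above.

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
subset = (jail_data
          .sort_values("Age at Arrest", ascending=False)        # arrange descending by age
          .query("`Days in Jail` > 2 and STATE == 'ILLINOIS'")  # keep rows meeting both conditions
          [["BOOKING DATE", "BOOKING NUMBER", "SEX", "RACE", "Days in Jail"]])
print(subset.head(10))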

Strategy 2 (Counting Frequencies and Duplicates) We can create a frequency table of a select number of variables. Since it is a frequency table, it can be cross-classified by design. Ideally, the variables whose frequencies we want should be grouping or categorical variables. Frequency tables also aid in determining the number of distinct (or unique) values. It is common for a statistical dataset to contain an ID variable, one that uniquely identifies each observation (e.g. UIN, Social Security Number, 10-digit phone number). When that is not the case, figuring out which variable to trust as an ID variable can be difficult, especially when attempting to discover duplicate observations. One idea is to combine two or more variables (through concatenation) to create a new ID variable.

#SAS
proc freq data=jail_data nlevels ;
 tables crime_code / out=crimefreq; /*to reduce the amount of output, first export the frequency table then print the first 10 obs*/
run;
proc print data=crimefreq (obs=10);
run;
proc freq data=jail_data nlevels noprint;
 tables sex*race / out=demofreq; /*to reduce the amount of output, first export the frequency table then print the first 10 obs*/
run;
proc print data=demofreq ;
 where race ne "";
run;

#R (in tidyverse)
jail_data %>% count(`CRIME CODE`)
jail_data %>% count(SEX, RACE) %>% pivot_wider(names_from=SEX, values_from=n)

#Python
TO BE ADDED LATER
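
Until then, a minimal pandas sketch of the frequency tables (pandas is assumed; the URL and column names come from the R examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")

# one-way frequency table of CRIME CODE; print only the first 10 levels
print(jail_data["CRIME CODE"].value_counts().head(10))

# cross-classified frequency table of SEX by RACE
print(pd.crosstab(jail_data["RACE"], jail_data["SEX"]))

# number of distinct levels, akin to PROC FREQ's NLEVELS
print(jail_data["CRIME CODE"].nunique())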

Strategy 3 (Summary Statistics) We can view a numeric summary of variables based on descriptive statistics such as minimum, maximum, median, mean, and standard deviation. This summary might include information about the number of missing values or NAs, specific quantiles, and extreme observations (similar to a minimum and maximum).

#SAS
proc means data=jail_data min q1 median q3 max mean std nmiss;
 var age_at_arrest;
run;

#R
sumtable_age <- c(summary(jail_data$`Age at Arrest`), sd(jail_data$`Age at Arrest`, na.rm=TRUE), sum(is.na(jail_data$`Age at Arrest`)))
names(sumtable_age) <- c(names(summary(jail_data$`Age at Arrest`)), "Std Dev", "# NAs")
sumtable_age

#Python
TO BE ADDED LATER
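
Until then, a minimal pandas sketch of the numeric summary (pandas is assumed; the URL and column name come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
age = jail_data["Age at Arrest"]
print(age.describe())              # count, mean, std, min, quartiles, max
print("# NAs:", age.isna().sum())  # number of missing values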

Strategy 4 (Visualizing Distributions) Although not a focus of our course, producing data visualizations of the distributions of certain variables would allow for a quick inspection of “reasonableness” of the values. For a numeric variable, we can use histograms to show shape and possible skewness and box plots to show symmetry (above and below the median) and potential outliers (1.5*IQR Rule aka “Box Plot Rule”). For a categorical variable, we can use bar plots for a nicer visual than a frequency table.

#SAS
title1 'For IL residents';
 proc sgplot data=jail_data;
 histogram age_at_arrest;
 where state eq "ILLINOIS";
run;
proc sgplot data=jail_data;
 vbox age_at_arrest;
 where state eq "ILLINOIS";
run;
proc sgplot data=jail_data;
 vbar race;
 where state eq "ILLINOIS";
run;
title1;

#R
hist(filter(jail_data,STATE=="ILLINOIS")$`Age at Arrest`, main="", xlab="Age at Arrest for IL Residents")
boxplot(filter(jail_data,STATE=="ILLINOIS")$`Age at Arrest`, main="", xlab="Age at Arrest for IL Residents")
barplot(table(filter(jail_data,STATE=="ILLINOIS")$RACE), main="", xlab="Race for IL Residents")

#Python
TO BE ADDED LATER
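
Until then, a minimal sketch of the same three plots using pandas and matplotlib (both are assumptions here; the URL and column names come from the examples above):

import pandas as pd
import matplotlib.pyplot as plt

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
il = jail_data[jail_data["STATE"] == "ILLINOIS"]

il["Age at Arrest"].plot.hist()       # histogram: shape and skewness
plt.xlabel("Age at Arrest for IL Residents")
plt.show()

il["Age at Arrest"].plot.box()        # box plot: symmetry and outliers
plt.show()

il["RACE"].value_counts().plot.bar()  # bar plot for a categorical variable
plt.xlabel("Race for IL Residents")
plt.show()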

Read the following for more information: SAS's PROC FREQ, SAS's PROC MEANS, SAS's PROC SGPLOT, SAS's PROC SORT, plotting in base R, plotting in R's tidyverse, pivots in R's tidyverse, and readr in R's tidyverse.


Cleaning data

You need context and background information about the data to truly fix data errors. Guessing is not appropriate, especially when money or lives are at stake. Data cleaning is a data-specific task that can be tedious and painful. Expect to spend a long time (double whatever time you have allotted) on validating and cleaning data. Careful and methodical fixing of errors may yield wondrous results for your analytics team (but don't spend too much time!).

Some common data cleaning approaches include:

Removing duplicate observations Removing duplicates is helpful when the duplicated observations add no new information. We exercise caution when removing observations by never deleting anything from permanent data files. Work with a renamed version of the data in your programming language; always keep the original as the original. Creating a counter variable can help identify non-unique observations for removal.

#SAS
proc freq data=jail_data nlevels;
 tables booking_number*jacket_number / out=jail_data2;
 ods select nlevels; 
 *where city eq "DANVILLE" ; /*memory issues so let's subset rows*/
run;
proc freq data=jail_data2 nlevels;
 tables booking_number;
 ods select nlevels;
run;
data jail_data3;
 set jail_data;
 newid= compress(catx("",booking_number,jacket_number));
run;
proc freq data=jail_data3 nlevels;
 tables newid;
 ods select nlevels;
run;
proc print data=jail_data3 (obs=10);
 var newid;
run;
/* an alternative way to count unique levels in DATA step
proc sort data=jail_data3;
 by newid; 
run; 
data jail_data3; 
 set jail_data3; 
 count+1; 
 by newid; 
 if first.newid then count=1; 
run;
proc print data=jail_data3 (obs=10);
 var booking_number jacket_number newid crime_code count;
run;
*/

#R
table_bn <- table(jail_data$`BOOKING NUMBER`) 
length(table_bn[table_bn>1])
table_jn <- table(jail_data$`JACKET NUMBER`) 
length(table_jn[table_jn>1])
jail_data2 <- jail_data %>% distinct(`BOOKING NUMBER`, `JACKET NUMBER`)
table_bookingnumber <- table(jail_data2$`BOOKING NUMBER`)
length(table_bookingnumber[table_bookingnumber>1])
jail_data3 <- jail_data %>% mutate(newid = str_c(`BOOKING NUMBER`, `JACKET NUMBER`)) %>% distinct(newid)
table_newid <- table(jail_data3$newid)
length(table_newid[table_newid>1])

#Python
TO BE ADDED LATER
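
Until then, a minimal pandas sketch of building a combined ID and removing duplicates (pandas is assumed; the URL and column names come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")

# concatenate two variables into a new ID, mirroring catx()/compress() in SAS
jail_data["newid"] = (jail_data["BOOKING NUMBER"].astype(str)
                      + jail_data["JACKET NUMBER"].astype(str))

print(jail_data["newid"].nunique())           # distinct levels of the new ID
print(jail_data["newid"].duplicated().sum())  # rows that repeat an earlier ID
jail_dedup = jail_data.drop_duplicates(subset="newid")  # keep the first row per ID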

Fixing rounding errors and inconsistent units of measurement Numeric inconsistencies can plague data analysis. Fixing them can be as simple as converting the existing variable to another, more accessible unit of measure (e.g. centimeters instead of inches). Re-formatting the variable may solve other numeric errors, especially when the numeric value we want is initially treated as a character string. Rounding errors may be fixed quickly with a built-in function such as ceiling, floor, or round.

#SAS
data jail_data4;
 set jail_data;
 newhours=minutes/60;
 newhoursrd=round(minutes/60,0.001);
run;
proc print data=jail_data4 (obs=10);
 var minutes newhours newhoursrd;
run;

#R
jail_data %>%
  mutate(newhours = Minutes/60, newhoursrd = round(newhours, 3)) %>%
  select(RACE, SEX, Minutes, newhours, newhoursrd)

#Python
TO BE ADDED LATER
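
Until then, a minimal pandas sketch of the unit conversion and rounding (pandas is assumed; the URL and column names come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
jail_data["newhours"] = jail_data["Minutes"] / 60         # convert minutes to hours
jail_data["newhoursrd"] = jail_data["newhours"].round(3)  # round to three decimal places
print(jail_data[["RACE", "SEX", "Minutes", "newhours", "newhoursrd"]].head(10))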

Removing or replacing missing values A missing value may actually be informative for a data analysis. Thus, be cautious when removing missing values from a dataset. Consider why or how the missing values came to be. If it is missing completely at random, then removal is usually safe.

#SAS
data jail_data5;
 set jail_data;
 if city eq "CHAMPAIGN";
 newid = compress(catx("",booking_number,jacket_number));
 keep age_at_arrest newid;
 if age_at_arrest ne . and newid ne "";
run;
proc sort data=jail_data5;
 by newid;
run;
data jail_data6;
 set jail_data5;
 count+1; 
 by newid; 
 if first.newid then count=1;
 if count eq 1;
 drop count;
run;
proc print data=jail_data6 (obs=10);
run;

#R
jail_data %>%
  mutate(newid = str_c(`BOOKING NUMBER`, `JACKET NUMBER`)) %>%
  filter(CITY == "CHAMPAIGN") %>%
  select(`Age at Arrest`, newid) %>%
  drop_na() %>%
  distinct(newid, .keep_all=TRUE)

#Python
TO BE ADDED LATER
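
Until then, a minimal pandas sketch of dropping missing values and duplicates (pandas is assumed; the URL and column names come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
jail_data["newid"] = (jail_data["BOOKING NUMBER"].astype(str)
                      + jail_data["JACKET NUMBER"].astype(str))

champaign = (jail_data[jail_data["CITY"] == "CHAMPAIGN"]
             [["Age at Arrest", "newid"]]
             .dropna()                          # remove rows with missing values
             .drop_duplicates(subset="newid"))  # one row per ID
print(champaign.head(10))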

Limiting a distribution to its realistic set of observations This essentially means subsetting with conditions on numeric variables. Review the Week 3 Notes on Subsetting with numbers.

#SAS
data jail_data7;
 set jail_data;
 if city eq "CHAMPAIGN" and 16 < age_at_arrest < 22 ;
 newid = compress(catx("",booking_number,jacket_number));
 if age_at_arrest ne . and newid ne "";
run;
proc sort data=jail_data7;
 by newid;
run;
data jail_data8;
 set jail_data7;
 count+1; 
 by newid; 
 if first.newid then count=1;
 if count eq 1;
 drop count;
run;
proc print data=jail_data8 (obs=10);
run;

#R
jail_data %>%
  mutate(newid = str_c(`BOOKING NUMBER`, `JACKET NUMBER`)) %>%
  filter(CITY == "CHAMPAIGN", `Age at Arrest` > 16, `Age at Arrest` < 22) %>%
  distinct(newid, .keep_all=TRUE)

#Python
TO BE ADDED LATER
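
Until then, a minimal pandas sketch of restricting a numeric variable to a realistic range (pandas is assumed; the URL and column names come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
jail_data["newid"] = (jail_data["BOOKING NUMBER"].astype(str)
                      + jail_data["JACKET NUMBER"].astype(str))

realistic = (jail_data
             .query("CITY == 'CHAMPAIGN' and `Age at Arrest` > 16 and `Age at Arrest` < 22")
             .drop_duplicates(subset="newid"))
print(realistic.head(10))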

Correcting and subsetting with dates This might also be considered subsetting and applying functions and operators for dates and times. Do review Week 2 Notes on Formatting Data.

#SAS
data jail_data9;
set jail_data;
 newid = compress(catx("",booking_number,jacket_number));
 if city eq "CHAMPAIGN" and booking_date > '31Dec2011'd and released_date < '01Jan2013'd;
run;
proc print data=jail_data9 (obs=6);
run;
proc print data=jail_data9 (firstobs=4700 obs=4705);
run;

#R
j1 <- jail_data %>%
  mutate(newid = str_c(`BOOKING NUMBER`, `JACKET NUMBER`),
         booking_date = lubridate::mdy(`BOOKING DATE`),
         released_date = lubridate::mdy(`RELEASED DATE`)) %>%
  filter(CITY == "CHAMPAIGN", booking_date > "2011-12-31", released_date < "2013-01-01")
head(j1)
tail(j1)

#Python
TO BE ADDED LATER
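
Until then, a minimal pandas sketch of parsing and subsetting with dates (pandas is assumed; the URL and column names come from the examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")

# parse the date strings, then subset to bookings within calendar year 2012
jail_data["booking_date"] = pd.to_datetime(jail_data["BOOKING DATE"])
jail_data["released_date"] = pd.to_datetime(jail_data["RELEASED DATE"])
j1 = jail_data[(jail_data["CITY"] == "CHAMPAIGN")
               & (jail_data["booking_date"] > "2011-12-31")
               & (jail_data["released_date"] < "2013-01-01")]
print(j1.head())
print(j1.tail())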

Correcting misspelled words, abbreviations, or text cases With unstructured text data, correcting spellings comes up often. When it is important that words be spelled correctly, a data engineer can spend time addressing it. However, it may be untenable to attempt to correctly spell every word in a text dataset. (Also consider the text source's intent, such as African American Vernacular English and other vernaculars that appear in writing.) Another aspect of text data cleaning is deciding what to do with the case of the words in the text. Processing text might go more smoothly if we lowercase (or uppercase) all text. For specific variables in a structured dataset, changing the case to proper case might be beneficial (e.g. people's names, business names).

#SAS
data jail_data10;
 set jail_data;
 if state eq "ILLINOIS";
 where prxmatch("/^U\w+A/", city); /*must use forward slashes to identify regex pattern*/
run;
data jail11;
 set jail_data10;
 city = prxchange("s/^U\w+A/URBANA/", -1, city); /*must use s with forward slash then separate the replacement string with forward slash*/
run;
proc freq data=jail11;
 tables city; 
run;
data jail12;
 set jail11;
 city = prxchange("s/\`//", -1, city);
run;
proc freq data=jail12;
 tables city; 
run;
data jail13;
 set jail12;
 city = prxchange("s/\sPH//", -1, city);
run;
proc freq data=jail13;
 tables city; 
run;

#R
colnames(jail_data) <- tolower(colnames(jail_data))
jd1 <- jail_data %>% filter(state=="ILLINOIS")
table(str_extract(jd1$city, "^U\\w+A"))
uids <- which(!is.na(str_extract(jd1$city,"^U\\w+A")))
jd2 <- jd1[uids,] # not a tidyverse way, just plain ol' R
jd1$newcity <- str_replace(jd1$city,"^U\\w+A","URBANA") #to clean use str_replace()
jd1$newcity <- str_replace(jd1$newcity,"^U\\w+A`","URBANA") #to clean use str_replace()
jd1$newcity <- str_replace(jd1$newcity,"^U\\w+A PH","URBANA") #to clean use str_replace()

#Python
TO BE ADDED LATER
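
Until then, a minimal pandas sketch of the same text cleaning (pandas is assumed; the URL, column names, and regex patterns come from the SAS and R examples above):

import pandas as pd

jail_data = pd.read_csv("https://uofi.box.com/shared/static/q9bafe6strodwyxghlvoziatgd91mnhs.txt")
jail_data.columns = jail_data.columns.str.lower()         # lower-case the column names
jd1 = jail_data[jail_data["state"] == "ILLINOIS"].copy()

# inspect the values that start with U and end with A (likely URBANA misspellings)
print(jd1["city"].str.extract(r"^(U\w+A)", expand=False).value_counts())

# standardize them to URBANA, then strip the stray backtick and " PH" suffixes
jd1["newcity"] = (jd1["city"]
                  .str.replace(r"^U\w+A", "URBANA", regex=True)
                  .str.replace("URBANA`", "URBANA", regex=False)
                  .str.replace("URBANA PH", "URBANA", regex=False))
print(jd1["newcity"].value_counts().head(10))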

Read the following for more information: SAS's most commonly used functions, SAS's regular expressions documentation, SAS's regular expressions cheat sheet, lubridate in R's tidyverse, dplyr in R's tidyverse, and stringr in R's tidyverse.