STAT 440 Statistical Data Management

Week 7 Notes

Created by Prof. Christopher Kinson

Things Covered in this Week’s Notes

Intro to SQL
Assigning data in SQL
Subsetting data in SQL
Arranging data in SQL
Data validating in SQL
Data cleaning in SQL
Combining data in SQL
Summarizing data in SQL

Intro to SQL

SQL stands for structured query language and represents a sort of grammar of data management and wrangling. Some important grammar or translations are: database (data frame), table (e.g. a subset; a data frame), fields (columns), records (rows), query (any data wrangling task).

The intention behind the language is to have a general method for handling data. SQL can be accessed in almost any platform and programming language including the three in this course (R, SAS, and Python). There are some subtleties in SQL that make it distinct and different from how programming languages access and handle data. Although it is structured and general, SQL has many versions developed by different users and companies for their own specific purposes. In this course, we plan on using the most common SQL statements, clauses, and keywords to avoid tasks that are inoperable on the 3 programming languages. Below is a general form of a typical query.

SELECT object-item <, ...object-item> 
FROM from-list 
<WHERE sql-expression> 
<GROUP BY object-item <, ...object-item >> 
<HAVING sql-expression> 
<ORDER BY order-by-item <DESC> <, ...order-by-item>>

The SELECT statement specifies which columns need to be in the resulting table once the query is complete. New variables may be created in the SELECT statement.

The FROM clause points the dataset (the source of the query). Usually, this dataset exists within the database.

The WHERE clause is a way to select records based on conditions.

The GROUP BY clause groups data for processing. It helps to remove duplicates when there is a unique identifier in the data.

The HAVING clause allows conditions to be placed on the groups for group processing. This clause is relevant when a GROUP BY clause exists.

The ORDER BY clause is the way to arrange the data. The default behavior is sorting in ascending order.

The order of these statements, clauses, and keywords matters and must be strictly followed. Meaning, a WHERE clause cannot appear before a FROM clause. Almost every query requires a SELECT statement.

To begin using SQL in R, we need to install and access the “sqldf” package. Afterwards, it’s a good idea to become familiar with the package (or library) using the help function ?sqldf in R. We see that using SQL with data frames is a nice feature and can be done with the following syntax.

library(sqldf)
iris1 <- iris # using iris data
sqldf("SELECT *
       FROM iris1")

To begin using SQL in SAS, we need to simply use PROC SQL. Afterwards, it’s a good idea to become familiar with the SQL procedure in SAS by reading the documentation here https://documentation.sas.com/?docsetId=sqlproc&docsetVersion=9.4&docsetTarget=n0w2pkrm208upln11i9r4ogwyvow.htm&locale=en. We see that using SQL with a dataset is a nice feature and can be done with the following syntax.

data iris1;
 set sashelp.iris; /* using iris data */
run;
proc sql ;
 SELECT *
 FROM iris1
 ;
quit;

To begin using SQL in Python, we need to import the module “sqlite3”. Afterwards, it’s a good idea to become familiar with the package (or module) by reading some documentation and usage examples here https://datatofish.com/pandas-dataframe-to-sql/. We see that using SQL with Pandas dataframes is a nice feature and can be done with the following syntax borrowed from Datatofish. A few things to keep in mind…

Python does not connect to SQL automatically behind the scense, so we must manually connect them and manually close the connection when we are finished with the database
We use Pandas dataframes to print out a nice SQL table based on a database
The SQL database and the table do not exist beforehand, thus we must manually create them
The database we create will be an empty database, but we will add tables in it that are the Pandas dataframes themselves
The tables we create must have unique names if inside of the same database. We may not overwrite them in the same database.
The output of printing a SQL table using fetchall() does not look pretty, which is why we convert the SQL table to a Pandas dataframe.
After closing the connection, we cannot do any more querying on that database; unless we open the connection and change the table names (or database name).

#just to access the iris data
import pandas as pd
import numpy
import sklearn
from sklearn import datasets
Iris = datasets.load_iris()
iris0 = pd.DataFrame(data= numpy.c_[Iris['data'], Iris['target']],
                     columns= Iris['feature_names'] + ['target'])
iris0 = iris0.rename(columns={'sepal length (cm)': 'sepallength', 'sepal width (cm)': 'sepalwidth', 'petal length (cm)': 'petallength', 'petal width (cm)': 'petalwidth', 'target': 'species'})
print(iris0)


#to do the querying
import sqlite3 

# create a new database (currently empty) and connect with the myTable database 
conn = sqlite3.connect("myiris01.db") 
c = conn.cursor() # cursor object 

#create table name to be placed inside the database
c.execute('CREATE TABLE iris01 (sepallength number, sepalwidth number, petallength number, petalwidth number, species number)')
conn.commit()
df = iris0.copy()
df.to_sql('iris01', conn, if_exists='replace', index = False)

#show the resulting SQl table in a pretty form using Pandas dataframes. Notice the SQL syntax is inside of quotes inside of read_sl_query.
print(pd.read_sql_query("SELECT * \
                        FROM iris01;", conn))

c.close() # closing the cursor which was pointing to connection to the database
conn.close() # closing the connection

Assigning data in SQL

We’ve seen object assignment done with an assignment operator such as x=y in SAS, Python, and R or x<-y in R. In SQL, new fields that are not yet in existence can be created with the SELECT statement using the AS keyword such as SELECT alpha+1 AS beta. Because assignment operators often involve functions and calculations based on existing fields, it’s critical to know and familiarize oneself with SQL’s common operators and functions. Below is a table of common operations and functions.

Operation or Function	SQL Syntax
Addition	+
Subtraction	-
Multiplication	*
Division	/
Exponentiation	POWER()
Modulus	%
Equal to (for comparison)	=
Not Equal to	<>
Greater than	>
Less than	<
Greater than or equal to	>=
Less than or equal to	<=
And	AND
Or	OR
Negation (aka Not)	NOT
Square root	SQRT()
Absolute value	ABS()
Logarithm (natural)	LOG()
Exponential	EXP()
Mean	AVG()

Read the following for more information on SQL’s operators and functions SQL operators and SQL functions.

Subsetting data in SQL

In this section, we discuss typical subsetting with numeric information and not characters or strings. In other programming languages, subsetting with conditions might also be called filtering.

Some common tasks that are included in the process of subsetting are: selecting variables (SELECT, or SELECT * to select all fields), selecting observations through conditions (WHERE for general observations or HAVING for grouped data conditions), assigning new variables (using the AS keyword in the SELECT statement), and removing missing values (using IS NULL in the WHERE statement such as WHERE Name IS NULL). Below is a coding template with common subsetting tasks.

SELECT iid 
FROM dating_data
WHERE met = 1 AND dec = 1
GROUP BY iid

SELECT *, 100.0*candidatevotes/totalvotes AS nc  
FROM elect
GROUP BY state
HAVING candidatevotes = max(candidatevotes)

In general, dates and times may be considered as numeric information. Thus we should recall the formatting dates and times in Week 2. However, in SQL particularly, dates are handled more easily as character strings. This is quite contrary to the way we’ve done date time handling in the norm. All we would need to do is to convert the date and time fields back to be character format using our programming language of choice (not using SQL). Afterwards we can use those strings and their format to subset much like the example below.

SELECT * 
FROM covid
WHERE date2 = '2020-07-23'

Subsetting characters can be hit or miss depending on the programming language’s connection to SQL and the version of SQL used. See the SQL functions SQL functions for usage examples.

Arranging data in SQL

Arranging data is simply ordering and sorting the data; thus we use ORDER BY. The default ordering is ascending. If we want descending order, then we use the DESC keyword after the field name. We can also limit the amount of observations we want to show using the LIMIT n clause where n is a number of records to be shown in the result. This works in both R and Python’s verson of SQL. In SAS, we can use the PROC SQL option outobs=n to limit the number of observations being output. Below is a coding example.

SELECT deaths, cases
FROM covid
ORDER BY by deaths DESC
LIMIT 10

In SAS we can use:
proc sql outobs=10;
 SELECT deaths, cases
 FROM covid
 ORDER BY by deaths DESC
;
quit;

Data validating in SQL

Strategy 1 (Subsetting and Arranging) We can view a print out of a subset based on specific conditions to identify values that either go against or with the conditions. Arranging may help organize the values to quickly assess start or end points.

SELECT deaths, cases
FROM covid
WHERE deaths > 1000 AND date2 = '2020-07-23'
ORDER BY by deaths DESC

Strategy 2 (Counting Frequencies and Duplicates) We can create a frequency table of a select number of variables. Since it’s a frequency table, it has potential to be cross-classified by design. Ideally, the variables for which we want to know the frequency should be grouping or categorical variables. Frequency tables also aid in determining the number of distinct (or unique) values.

SELECT *, count(PlayerID) AS ntimes
FROM nba

#vs
SELECT *, count(PlayerID) AS ntimes
FROM nba
GROUP BY PlayerID

Strategy 3 (Summary Statistics) We can view a numeric summary of variables based on descriptive statistics such as minimum, maximum, median, mean, and standard deviation. This summary might include information about the number of missing values or NAs, specific quantiles, and extreme observations (similar to a minimum and maximum).

SELECT sum(G) AS sumg, AVG(G) AS meang, MIN(g) as ming
FROM nba
GROUP BY PlayerID

Data cleaning in SQL

Removing duplicate observations Removing duplicates when the observation’s duplication add no new information is helpful in data cleaning.

SELECT PlayerID
FROM nba
GROUP BY PlayerID

#vs
SELECT DISTINCT PlayerID
FROM nba

Fixing rounding errors and inconsistent units of measurement Numeric inconsistencies can plague data analysis. Fixing them can be as simple as converting the existing variable to another more accessible unit of measure (e.g. centimeters instead of inches).

SELECT *, 100.0*candidatevotes/totalvotes AS nc, candidatevotes*1.0  AS cv
FROM elect
GROUP BY state
HAVING candidatevotes = max(candidatevotes)

Removing or replacing missing values A missing value may actually be informative for a data analysis. Thus, be cautious when removing missing values from a dataset. Consider why or how the missing values came to be. If it is missing completely at random, then removal is usually safe.

SELECT *
FROM nba
WHERE Player IS NOT NULL

Limiting a distribution to its realistic set of observations This essentially means subsetting with conditions on numeric variables. Review the Week 3 Notes on Subsetting with numbers.

SELECT *
FROM nba
WHERE AST > 100

Correcting and subsetting with dates This might also be considered subsetting and applying functions and operators for dates and times. Do review Week 2 Notes on Formatting Data.

covid <- read_csv("https://uofi.box.com/shared/static/s17ox0f1k7htzjpd2mudiagp7vt3uueb.csv")
covid$date2 <- as.character(covid$date)
sqldf("SELECT * 
      FROM covid
      WHERE date2 = '2020-07-23' ")

Combining data in SQL

In SQL, concatenating is the act of combining objects or strings together and is typically done with CONCAT().

In SQL, binding (or appending) is the act of combining two or more objects by stacking one on top of the other using UNION or stacking one next to the other with SQL joins such as FULL JOIN or INNER JOIN.

There are at least 3 situations that can occur when merging objects.

Observations in the two (or more) separate objects could not match each other.

Data 1

ID	Salary
A	$10K
B	$11K
D	$12K

Data 2

ID	Number
C	2175551234
E	2175551235
F	2175551236

Merged Data

ID	Salary	Number
A	$10K
B	$11K
D	$12K
C		2175551234
E		2175551235
F		2175551236

Observations in the two (or more) separate objects could match each other one-to-one.

Data 1

ID	Salary
A	$10K
B	$11K
D	$12K

Data 2

ID	Number
A	2175551214
B	2175551224
D	2175551244

Merged Data

ID	Salary	Number
A	$10K	2175551214
B	$11K	2175551224
D	$12K	2175551244

Observations in the two (or more) separate objects could match each other one-to-many (or many-to-one).

Data 1

ID	Salary
A	$10K
D	$12K

Data 2

ID	Number
A	2175551214
A	2175551204
D	2175551244

Merged Data

ID	Salary	Number
A	$10K	2175551214
A	$10K	2175551204
D	$12K	2175551244

How we merge (or join) the data depends on which of the three situations is intended for the data management. Only keeping the matches (#2 and #3 above) could be accomplished using an inner join (INNER JOIN) in SQL. Keeping the matches (#2 and #3 above) and non-matches (#1 above) could be accomplished using a full join (FULL JOIN) in SQL. ONe differentiator in SQL is that the column names do not need to be the same in order to join two or more datasets.

The code example below highlights joining data in SQL. Inner joins can be done without the explicit usage of INNER JOIN.

#no explicit INNER JOIN
coet <- sqldf("SELECT covid2.state, party, totalvotes, deaths, cases
      FROM covid2, elect2
      WHERE covid2.state = elect2.state
      ")

#vs
#explicit INNER JOIN and using AS keyword for aliasing data names
coett <- sqldf("SELECT cc.state, party, totalvotes, deaths, cases
      FROM covid2 as cc INNER JOIN elect2 as ee ON cc.state=ee.state
      ")

Summarizing data

Summarizing data in SQL can be accomplished with certain summary functions or mathematical operations on grouped data. See https://www.w3schools.com/sql/sql_ref_sqlserver.asp for lists of some SQL functions. See https://www.w3schools.com/sql/sql_operators.asp for lists of some SQL operations.

Below is an SQL example (technically in R due to using assignment operators <-) of summarizing data when we want to calculate the average length and width of certain iris plants. In particular we want to find the average petal width and sepal width of iris plants who have petal lengths greater than or equal to 0.5 inches in increments of 0.5 inches. To do this, first we need to convert the lengths and widths to inches, then filter to only keep petal lengths greater than or equal to 0.5 inches, then make the new groups based on the increments, then summarize by averaging the petal widths and sepal widths for the new groups. The code below is written as strictly SQL. You can try it in your programming language to see the results. After running this SQL syntax below (and adjusting for how your programming language uses assignment operators), the results are the same as the results from the Week 6 Notes.

x1  sqldf("SELECT *, `Sepal.Length` AS sl, `Sepal.Width` AS sw, `Petal.Length` AS pl, `Petal.Width` AS pw 
      FROM iris1
      ")
x2 <- sqldf("SELECT sl, sw, pl, pw, sl/2.54 AS slin, sw/2.54 AS swin, pl/2.54 AS plin, pw/2.54 AS pwin
            FROM x1
            WHERE plin >= 0.5")
x21 <- sqldf("SELECT *, 1 AS newgroup
             FROM x2
             WHERE plin < 1.0")
x22 <- sqldf("SELECT *, 2 AS newgroup
             FROM x2
             WHERE plin BETWEEN 1.0 AND 1.4999")
x23 <- sqldf("SELECT *, 3 AS newgroup
             FROM x2
             WHERE plin BETWEEN 1.5 AND 1.9999")
x24 <- sqldf("SELECT *, 4 AS newgroup
             FROM x2
             WHERE plin BETWEEN 2.0 AND 2.4999")
x25 <- sqldf("SELECT *, 5 AS newgroup
             FROM x2
             WHERE plin >= 2.5")
#now append them (using UNION)
x3 <- sqldf("SELECT *
            FROM x21
            UNION
            SELECT *
            FROM x22
            UNION
            SELECT *
            FROM x23
            UNION
            SELECT *
            FROM x24
            UNION
            SELECT *
            FROM x25  ")
sqldf("SELECT newgroup, AVG(pwin) AS avgpwin, AVG(swin) AS avgswin
      FROM x3
      GROUP BY newgroup")

STAT 440 Statistical Data Management - Summer 2020

Week 7 Notes

Created by Prof. Christopher Kinson

Things Covered in this Week’s Notes

Intro to SQL

Assigning data in SQL

Subsetting data in SQL

Arranging data in SQL

Data validating in SQL

Data cleaning in SQL

Combining data in SQL

Summarizing data