Intro to SQL
Assigning data in SQL
Subsetting data in SQL
Arranging data in SQL
Data validating in SQL
Data cleaning in SQL
Combining data in SQL
Summarizing data in SQL
SQL stands for structured query language and represents a sort of grammar of data management and wrangling. Some important grammar or translations are: database (data frame), table (e.g. a subset; a data frame), fields (columns), records (rows), query (any data wrangling task).
The intention behind the language is to have a general method for handling data. SQL can be accessed in almost any platform and programming language including the three in this course (R, SAS, and Python). There are some subtleties in SQL that make it distinct and different from how programming languages access and handle data. Although it is structured and general, SQL has many versions developed by different users and companies for their own specific purposes. In this course, we plan on using the most common SQL statements, clauses, and keywords to avoid tasks that are inoperable on the 3 programming languages. Below is a general form of a typical query.
SELECT object-item <, ...object-item>
FROM from-list
<WHERE sql-expression>
<GROUP BY object-item <, ...object-item >>
<HAVING sql-expression>
<ORDER BY order-by-item <DESC> <, ...order-by-item>>
The SELECT
statement specifies which columns need to be in the resulting table once the query is complete. New variables may be created in the SELECT
statement.
The FROM
clause points the dataset (the source of the query). Usually, this dataset exists within the database.
The WHERE
clause is a way to select records based on conditions.
The GROUP BY
clause groups data for processing. It helps to remove duplicates when there is a unique identifier in the data.
The HAVING
clause allows conditions to be placed on the groups for group processing. This clause is relevant when a GROUP BY
clause exists.
The ORDER BY
clause is the way to arrange the data. The default behavior is sorting in ascending order.
The order of these statements, clauses, and keywords matters and must be strictly followed. Meaning, a WHERE
clause cannot appear before a FROM
clause. Almost every query requires a SELECT
statement.
To begin using SQL in R, we need to install and access the “sqldf” package. Afterwards, it’s a good idea to become familiar with the package (or library) using the help function ?sqldf
in R. We see that using SQL with data frames is a nice feature and can be done with the following syntax.
library(sqldf)
iris1 <- iris # using iris data
sqldf("SELECT *
FROM iris1")
To begin using SQL in SAS, we need to simply use PROC SQL
. Afterwards, it’s a good idea to become familiar with the SQL procedure in SAS by reading the documentation here https://documentation.sas.com/?docsetId=sqlproc&docsetVersion=9.4&docsetTarget=n0w2pkrm208upln11i9r4ogwyvow.htm&locale=en. We see that using SQL with a dataset is a nice feature and can be done with the following syntax.
data iris1;
set sashelp.iris; /* using iris data */
run;
proc sql ;
SELECT *
FROM iris1
;
quit;
To begin using SQL in Python, we need to import the module “sqlite3”. Afterwards, it’s a good idea to become familiar with the package (or module) by reading some documentation and usage examples here https://datatofish.com/pandas-dataframe-to-sql/. We see that using SQL with Pandas dataframes is a nice feature and can be done with the following syntax borrowed from Datatofish. A few things to keep in mind…
Python does not connect to SQL automatically behind the scense, so we must manually connect them and manually close the connection when we are finished with the database
We use Pandas dataframes to print out a nice SQL table based on a database
The SQL database and the table do not exist beforehand, thus we must manually create them
The database we create will be an empty database, but we will add tables in it that are the Pandas dataframes themselves
The tables we create must have unique names if inside of the same database. We may not overwrite them in the same database.
The output of printing a SQL table using fetchall()
does not look pretty, which is why we convert the SQL table to a Pandas dataframe.
After closing the connection, we cannot do any more querying on that database; unless we open the connection and change the table names (or database name).
#just to access the iris data
import pandas as pd
import numpy
import sklearn
from sklearn import datasets
Iris = datasets.load_iris()
iris0 = pd.DataFrame(data= numpy.c_[Iris['data'], Iris['target']],
columns= Iris['feature_names'] + ['target'])
iris0 = iris0.rename(columns={'sepal length (cm)': 'sepallength', 'sepal width (cm)': 'sepalwidth', 'petal length (cm)': 'petallength', 'petal width (cm)': 'petalwidth', 'target': 'species'})
print(iris0)
#to do the querying
import sqlite3
# create a new database (currently empty) and connect with the myTable database
conn = sqlite3.connect("myiris01.db")
c = conn.cursor() # cursor object
#create table name to be placed inside the database
c.execute('CREATE TABLE iris01 (sepallength number, sepalwidth number, petallength number, petalwidth number, species number)')
conn.commit()
df = iris0.copy()
df.to_sql('iris01', conn, if_exists='replace', index = False)
#show the resulting SQl table in a pretty form using Pandas dataframes. Notice the SQL syntax is inside of quotes inside of read_sl_query.
print(pd.read_sql_query("SELECT * \
FROM iris01;", conn))
c.close() # closing the cursor which was pointing to connection to the database
conn.close() # closing the connection
We’ve seen object assignment done with an assignment operator such as x=y
in SAS, Python, and R or x<-y
in R. In SQL, new fields that are not yet in existence can be created with the SELECT
statement using the AS
keyword such as SELECT alpha+1 AS beta
. Because assignment operators often involve functions and calculations based on existing fields, it’s critical to know and familiarize oneself with SQL’s common operators and functions. Below is a table of common operations and functions.
Operation or Function | SQL Syntax |
---|---|
Addition | + |
Subtraction | - |
Multiplication | * |
Division | / |
Exponentiation | POWER() |
Modulus | % |
Equal to (for comparison) | = |
Not Equal to | <> |
Greater than | > |
Less than | < |
Greater than or equal to | >= |
Less than or equal to | <= |
And | AND |
Or | OR |
Negation (aka Not) | NOT |
Square root | SQRT() |
Absolute value | ABS() |
Logarithm (natural) | LOG() |
Exponential | EXP() |
Mean | AVG() |
Read the following for more information on SQL’s operators and functions SQL operators and SQL functions.
In this section, we discuss typical subsetting with numeric information and not characters or strings. In other programming languages, subsetting with conditions might also be called filtering.
Some common tasks that are included in the process of subsetting are: selecting variables (SELECT
, or SELECT *
to select all fields), selecting observations through conditions (WHERE
for general observations or HAVING
for grouped data conditions), assigning new variables (using the AS
keyword in the SELECT
statement), and removing missing values (using IS NULL
in the WHERE
statement such as WHERE Name IS NULL
). Below is a coding template with common subsetting tasks.
SELECT iid
FROM dating_data
WHERE met = 1 AND dec = 1
GROUP BY iid
SELECT *, 100.0*candidatevotes/totalvotes AS nc
FROM elect
GROUP BY state
HAVING candidatevotes = max(candidatevotes)
In general, dates and times may be considered as numeric information. Thus we should recall the formatting dates and times in Week 2. However, in SQL particularly, dates are handled more easily as character strings. This is quite contrary to the way we’ve done date time handling in the norm. All we would need to do is to convert the date and time fields back to be character format using our programming language of choice (not using SQL). Afterwards we can use those strings and their format to subset much like the example below.
SELECT *
FROM covid
WHERE date2 = '2020-07-23'
Subsetting characters can be hit or miss depending on the programming language’s connection to SQL and the version of SQL used. See the SQL functions SQL functions for usage examples.
Arranging data is simply ordering and sorting the data; thus we use ORDER BY
. The default ordering is ascending. If we want descending order, then we use the DESC
keyword after the field name. We can also limit the amount of observations we want to show using the LIMIT n
clause where n is a number of records to be shown in the result. This works in both R and Python’s verson of SQL. In SAS, we can use the PROC SQL option outobs=n to limit the number of observations being output. Below is a coding example.
SELECT deaths, cases
FROM covid
ORDER BY by deaths DESC
LIMIT 10
In SAS we can use:
proc sql outobs=10;
SELECT deaths, cases
FROM covid
ORDER BY by deaths DESC
;
quit;
Strategy 1 (Subsetting and Arranging) We can view a print out of a subset based on specific conditions to identify values that either go against or with the conditions. Arranging may help organize the values to quickly assess start or end points.
SELECT deaths, cases
FROM covid
WHERE deaths > 1000 AND date2 = '2020-07-23'
ORDER BY by deaths DESC
Strategy 2 (Counting Frequencies and Duplicates) We can create a frequency table of a select number of variables. Since it’s a frequency table, it has potential to be cross-classified by design. Ideally, the variables for which we want to know the frequency should be grouping or categorical variables. Frequency tables also aid in determining the number of distinct (or unique) values.
SELECT *, count(PlayerID) AS ntimes
FROM nba
#vs
SELECT *, count(PlayerID) AS ntimes
FROM nba
GROUP BY PlayerID
Strategy 3 (Summary Statistics) We can view a numeric summary of variables based on descriptive statistics such as minimum, maximum, median, mean, and standard deviation. This summary might include information about the number of missing values or NAs, specific quantiles, and extreme observations (similar to a minimum and maximum).
SELECT sum(G) AS sumg, AVG(G) AS meang, MIN(g) as ming
FROM nba
GROUP BY PlayerID
Removing duplicate observations Removing duplicates when the observation’s duplication add no new information is helpful in data cleaning.
SELECT PlayerID
FROM nba
GROUP BY PlayerID
#vs
SELECT DISTINCT PlayerID
FROM nba
Fixing rounding errors and inconsistent units of measurement Numeric inconsistencies can plague data analysis. Fixing them can be as simple as converting the existing variable to another more accessible unit of measure (e.g. centimeters instead of inches).
SELECT *, 100.0*candidatevotes/totalvotes AS nc, candidatevotes*1.0 AS cv
FROM elect
GROUP BY state
HAVING candidatevotes = max(candidatevotes)
Removing or replacing missing values A missing value may actually be informative for a data analysis. Thus, be cautious when removing missing values from a dataset. Consider why or how the missing values came to be. If it is missing completely at random, then removal is usually safe.
SELECT *
FROM nba
WHERE Player IS NOT NULL
Limiting a distribution to its realistic set of observations This essentially means subsetting with conditions on numeric variables. Review the Week 3 Notes on Subsetting with numbers.
SELECT *
FROM nba
WHERE AST > 100
Correcting and subsetting with dates This might also be considered subsetting and applying functions and operators for dates and times. Do review Week 2 Notes on Formatting Data.
covid <- read_csv("https://uofi.box.com/shared/static/s17ox0f1k7htzjpd2mudiagp7vt3uueb.csv")
covid$date2 <- as.character(covid$date)
sqldf("SELECT *
FROM covid
WHERE date2 = '2020-07-23' ")
In SQL, concatenating is the act of combining objects or strings together and is typically done with CONCAT()
.
In SQL, binding (or appending) is the act of combining two or more objects by stacking one on top of the other using UNION
or stacking one next to the other with SQL joins such as FULL JOIN
or INNER JOIN
.
There are at least 3 situations that can occur when merging objects.
Data 1
ID | Salary |
---|---|
A | $10K |
B | $11K |
D | $12K |
Data 2
ID | Number |
---|---|
C | 2175551234 |
E | 2175551235 |
F | 2175551236 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | |
B | $11K | |
D | $12K | |
C | 2175551234 | |
E | 2175551235 | |
F | 2175551236 |
Data 1
ID | Salary |
---|---|
A | $10K |
B | $11K |
D | $12K |
Data 2
ID | Number |
---|---|
A | 2175551214 |
B | 2175551224 |
D | 2175551244 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | 2175551214 |
B | $11K | 2175551224 |
D | $12K | 2175551244 |
Data 1
ID | Salary |
---|---|
A | $10K |
D | $12K |
Data 2
ID | Number |
---|---|
A | 2175551214 |
A | 2175551204 |
D | 2175551244 |
Merged Data
ID | Salary | Number |
---|---|---|
A | $10K | 2175551214 |
A | $10K | 2175551204 |
D | $12K | 2175551244 |
How we merge (or join) the data depends on which of the three situations is intended for the data management. Only keeping the matches (#2 and #3 above) could be accomplished using an inner join (INNER JOIN
) in SQL. Keeping the matches (#2 and #3 above) and non-matches (#1 above) could be accomplished using a full join (FULL JOIN
) in SQL. ONe differentiator in SQL is that the column names do not need to be the same in order to join two or more datasets.
The code example below highlights joining data in SQL. Inner joins can be done without the explicit usage of INNER JOIN
.
#no explicit INNER JOIN
coet <- sqldf("SELECT covid2.state, party, totalvotes, deaths, cases
FROM covid2, elect2
WHERE covid2.state = elect2.state
")
#vs
#explicit INNER JOIN and using AS keyword for aliasing data names
coett <- sqldf("SELECT cc.state, party, totalvotes, deaths, cases
FROM covid2 as cc INNER JOIN elect2 as ee ON cc.state=ee.state
")
Summarizing data in SQL can be accomplished with certain summary functions or mathematical operations on grouped data. See https://www.w3schools.com/sql/sql_ref_sqlserver.asp for lists of some SQL functions. See https://www.w3schools.com/sql/sql_operators.asp for lists of some SQL operations.
Below is an SQL example (technically in R due to using assignment operators <-
) of summarizing data when we want to calculate the average length and width of certain iris plants. In particular we want to find the average petal width and sepal width of iris plants who have petal lengths greater than or equal to 0.5 inches in increments of 0.5 inches. To do this, first we need to convert the lengths and widths to inches, then filter to only keep petal lengths greater than or equal to 0.5 inches, then make the new groups based on the increments, then summarize by averaging the petal widths and sepal widths for the new groups. The code below is written as strictly SQL. You can try it in your programming language to see the results. After running this SQL syntax below (and adjusting for how your programming language uses assignment operators), the results are the same as the results from the Week 6 Notes.
x1 sqldf("SELECT *, `Sepal.Length` AS sl, `Sepal.Width` AS sw, `Petal.Length` AS pl, `Petal.Width` AS pw
FROM iris1
")
x2 <- sqldf("SELECT sl, sw, pl, pw, sl/2.54 AS slin, sw/2.54 AS swin, pl/2.54 AS plin, pw/2.54 AS pwin
FROM x1
WHERE plin >= 0.5")
x21 <- sqldf("SELECT *, 1 AS newgroup
FROM x2
WHERE plin < 1.0")
x22 <- sqldf("SELECT *, 2 AS newgroup
FROM x2
WHERE plin BETWEEN 1.0 AND 1.4999")
x23 <- sqldf("SELECT *, 3 AS newgroup
FROM x2
WHERE plin BETWEEN 1.5 AND 1.9999")
x24 <- sqldf("SELECT *, 4 AS newgroup
FROM x2
WHERE plin BETWEEN 2.0 AND 2.4999")
x25 <- sqldf("SELECT *, 5 AS newgroup
FROM x2
WHERE plin >= 2.5")
#now append them (using UNION)
x3 <- sqldf("SELECT *
FROM x21
UNION
SELECT *
FROM x22
UNION
SELECT *
FROM x23
UNION
SELECT *
FROM x24
UNION
SELECT *
FROM x25 ")
sqldf("SELECT newgroup, AVG(pwin) AS avgpwin, AVG(swin) AS avgswin
FROM x3
GROUP BY newgroup")