Week 3 Notes

Created by Prof. Christopher Kinson


Things Covered in this Week’s Notes


Assigning objects

Technically, we have already assigned objects when we named a dataset for the previous homework assignments. Here, assigning objects is the act of creating an object, data structure, or variable in your programming language. The objects that we create should have a purpose, such as a named dataset for data analysis or a new variable based on existing variables. Assigning an object is usually done with an assignment operator, and the object should be named. Often, we use mathematical, statistical, and logical (conditional) operators when creating a new object. These operators may produce specific data structures or data types that we must also consider.

When naming an object, it’s best to use a naming convention that is clear, simple, and short. Different programming languages have different rules about the character length of a named object and restrictions about which characters or numbers may be included in the name. Be aware of the programming language’s rules (or prescriptive suggestions): Python, R, and SAS.
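As a small illustration of such rules in Python, the sketch below checks a few candidate names with the built-in str.isidentifier() method and the standard keyword module. The candidate names and the helper is_valid_name are hypothetical, chosen only for this example. A name is usable only if it is a valid identifier and not a reserved word.

```python
import keyword

# Hypothetical candidate names, chosen only for illustration.
candidates = ["case_rate", "2nd_week", "death-rate", "class", "x"]

def is_valid_name(name):
    """Return True if `name` can be used as a Python variable name."""
    # Must follow identifier rules (letters, digits, underscores; no leading digit)
    # and must not be a reserved word such as `class`.
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_valid_name(name) for name in candidates}
for name, ok in results.items():
    print(name, ok)
```

Here "2nd_week" fails because it starts with a digit, "death-rate" fails because of the hyphen, and "class" fails because it is a reserved word.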

Object assignment is itself done with an operator, such as x=y in SAS, Python, and R, or x<-y in R. Thus, it’s critical to become familiar with a programming language’s common operators and functions. Below is a table of common operations and functions.

Operation or Function           SAS Syntax   R Syntax   Python Syntax
Addition                        +            +          +
Subtraction                     -            -          -
Multiplication                  *            *          *
Division                        /            /          /
Exponentiation                  **           ^          **
Modulus                         mod()        %%         %
Equal to (for comparison)       eq           ==         ==
Not equal to                    ne           !=         !=
Greater than                    gt           >          >
Less than                       lt           <          <
Greater than or equal to        ge           >=         >=
Less than or equal to           le           <=         <=
And                             and          &          and
Or                              or           |          or
Negation (aka Not)              not          !          not
Square root                     sqrt()       sqrt()     sqrt() in math library
Absolute value                  abs()        abs()      abs()
Logarithm (natural)             log()        log()      log() in math library
Exponential                     exp()        exp()      exp() in math library
Mean                            mean()       mean()     mean() in statistics library
Standard deviation (sample)     std()        sd()       stdev() in statistics library

Read the following for more information: SAS operators, SAS functions, Python operators, Python math library, Python statistics library, and R.
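To make the Python column of the table concrete, here is a minimal sketch that assigns new objects using several of the listed operators and functions. The variable names and the sample values are made up for illustration.

```python
import math
import statistics

x = 7
y = 2

power = x ** y                      # exponentiation
remainder = x % y                   # modulus
quotient = x / y                    # division (always a float in Python 3)
check = (x > y) and not (x == y)    # comparison combined with logical operators

root = math.sqrt(power)             # square root lives in the math library
values = [2, 4, 4, 6]               # made-up sample
avg = statistics.mean(values)       # mean from the statistics library
sd = statistics.stdev(values)       # sample standard deviation (n - 1 denominator)

print(power, remainder, quotient, check, root, avg, round(sd, 3))
```

Note that each result has its own data type: power and remainder are integers, quotient and root are floats, and check is a Boolean.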


Subsetting with numbers

For the sake of this course, subsetting a dataset means using conditions and other directives to create a dataset that is usually no larger than the original one. In this section, we discuss typical subsetting with numeric information rather than characters or strings. Dates and times may be treated as numeric information, so we should recall the formatting of dates and times from Week 2. In other programming languages, subsetting with conditions might also be called filtering.

Below is a snapshot of the US State-level COVID-19 Historical Data from the New York Times (6284 observations and 5 columns). Access the data here https://uofi.box.com/shared/static/tpjb8jh73b4aa76u17e0or0r9b7yozb8.csv. Original source: https://github.com/nytimes/covid-19-data.

Some common tasks that are included in the process of subsetting are: selecting variables, selecting observations through conditions, assigning new variables (with or without operators and functions), and removing missing values. Below is a coding template with common subsetting tasks.

#SAS
data iris2;
 set sashelp.iris;
 sepalarea = sepallength*sepalwidth*constant('pi'); /*assigning variables*/
 if petalwidth le 5; /*selecting observations through conditions*/
 where sepalwidth is not missing; /*removing missing values*/
 keep sepallength sepalwidth sepalarea;  /*selecting variables*/
run;
proc contents data=iris2;
run;
proc print data=iris2 (obs=10);
run;

#R
library(tidyverse)
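iris2 <- iris %>%
  mutate(Sepal.Area=Sepal.Length*Sepal.Width*pi) %>% #assigning variables
  filter(Petal.Width <= 0.5) %>% #selecting observations through conditions
  drop_na(Sepal.Width) %>% #removing missing values
  select(Sepal.Length, Sepal.Width, Sepal.Area) #selecting variables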
str(iris2)
iris2[1:10,]

#Python
import math
import pandas
from sklearn import datasets
iris = datasets.load_iris()
iris2 = pandas.DataFrame(iris.data, columns = iris.feature_names)
# clean the column names, e.g. "sepal length (cm)" becomes "sepallength"
iris2.columns = iris2.columns.str.replace('(cm)', '', regex=False)
iris2.columns = iris2.columns.str.replace(' ', '', regex=False)
iris2['sepalarea'] = iris2['sepallength']*iris2['sepalwidth']*math.pi  # assigning variables
iris3 = iris2[iris2.petalwidth <= 0.5]  # selecting observations through conditions
iris4 = iris3.dropna(subset=['sepalwidth'])  # removing missing values
iris5 = iris4[['sepallength', 'sepalwidth', 'sepalarea']]  # selecting variables
iris5.info()
iris5.head(10)
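Because dates can be treated as numeric information, the same subsetting tasks apply to them. The sketch below uses only the Python standard library and a handful of made-up rows laid out like the New York Times state-level file. The column names (date, state, fips, cases, deaths) and all values are assumptions for this example, not the actual data.

```python
import csv
import io
from datetime import date

# Made-up rows; the column layout is assumed to mirror the NYT state-level file.
raw = """date,state,fips,cases,deaths
2020-03-01,Illinois,17,3,0
2020-03-08,Illinois,17,7,0
2020-03-08,Washington,53,136,18
2020-03-15,Illinois,17,93,0
2020-03-15,Washington,53,769,42
"""

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    rows.append({
        "date": date.fromisoformat(rec["date"]),  # text to date, so it can be compared
        "state": rec["state"],
        "cases": int(rec["cases"]),
        "deaths": int(rec["deaths"]),
    })  # fips is left out: selecting variables

cutoff = date(2020, 3, 8)
# selecting observations through a date condition and a numeric condition
subset = [r for r in rows if r["date"] >= cutoff and r["cases"] > 50]

for r in subset:
    print(r["date"], r["state"], r["cases"], r["deaths"])
```

With pandas, the same idea would be expressed as a Boolean mask on a date column after converting it to a date type.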

Try It Out!

After reading the notes, students should be able to attempt these problems.

  • Import the US State-level COVID-19 Historical Data from the New York Times in your preferred programming language. Print the first 10 observations of the data and the descriptor portion of the whole data.

  • Consider what type of questions and analyses a data analyst may have for the data.

  • Based on your questions, create more than one subset that are sensible for the questions/analyses.


Subsetting with characters

In this section, we discuss typical subsetting with characters and strings. Character strings can be quite difficult to wrangle. This difficulty may be due to character encoding and differences in how computers interpret strings. One remedy for this difficulty is regular expressions (regex for short), which were conceptualized in the 1950s by Stephen Kleene. A regular expression is a standardized pattern for finding strings and characters. Regex exists separately from programming languages (much like SQL) and is incorporated into a programming language via a library or module.

Regex can be used to pick out certain characters in a character vector, and this is helpful for subsetting. Regex can be used in R within the “tidyverse” package. Regex can be used in Python within the “re” library. In SAS, regex can be called with the set of RX functions (general regex) or PRX (Perl-specific regular expressions). Read more here R, Python, and SAS. Below is a table containing common regex syntax for finding characters and strings. The table’s example is the following sentence:

"Friends of the Geese are hosting a memorial service Saturday for the 175 geese killed this week by the Urbana Park District in its 'charity harvest.'"
Regex Syntax   String Found
\w             “FriendsoftheGeesearehostingamemorialserviceSaturdayforthe175geesekilledthisweekbytheUrbanaParkDistrictinitscharityharvest”
\d             “175”
[\.\']         “'.'”
[^\.]          “Friends of the Geese are hosting a memorial service Saturday for the 175 geese killed this week by the Urbana Park District in its 'charity harvest'”
[A-M]          “FriendsGeeseDistrict”
s+             “FriendsGeesehostingservicegeesethisDistrictitsharvest”
s+|S+          “FriendsGeesehostingserviceSaturdaygeesethisDistrictitsharvest”
e{2}           “Geesegeeseweek”

See https://www.rexegg.com/ (especially their Quick-Start https://www.rexegg.com/regex-quickstart.html) for more information on regular expressions in general.
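A few rows of the table can be reproduced with Python's "re" library, as in the sketch below. It assumes that the character-level rows (such as \d) collect every matching character, while the word-level rows (such as e{2} and [A-M]) list the whole words that contain a match, which is how the table's results appear to have been built. Words are obtained here by splitting on whitespace.

```python
import re

sentence = ("Friends of the Geese are hosting a memorial service Saturday "
            "for the 175 geese killed this week by the Urbana Park District "
            "in its 'charity harvest.'")

# Character-level matches: collect each matching character.
digits = "".join(re.findall(r"\d", sentence))      # every digit
punct = "".join(re.findall(r"[.']", sentence))     # periods and apostrophes

# Word-level matches: keep the words that contain a match.
words = sentence.split()
double_e = [w for w in words if re.search(r"e{2}", w)]       # "ee" inside the word
upper_a_to_m = [w for w in words if re.search(r"[A-M]", w)]  # an uppercase A through M

print(digits)
print(punct)
print(double_e)
print(upper_a_to_m)
```

re.findall returns all non-overlapping matches as a list, while re.search only reports whether a pattern occurs somewhere in a string, which makes it convenient for subsetting a list of words.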

Try It Out!

After reading the notes, students should be able to attempt these problems.

  • Using the sentence about the geese above, do the following in a programming language:

    • Find the number of times the letter “s” appears

    • Count the number of times a word begins with the letter “t”

    • Find the number of times any digit between 1-9 appears

    • Count the number of words that contain the letter “g”

    • Which words have consecutive characters appearing twice?

    • Show only the words excluding punctuation such that each word appears in its own vector.


Arranging data

Arranging data is simply ordering and sorting the data. Arranging may mean selecting certain columns or variables to appear first, second, and so on. It may also mean sorting the dataset hierarchically by its columns (sort by var1, then by var2, etc.). Because programming languages differ, we should not expect sorting hierarchies to work the same way everywhere. For example, sorting in SAS such that var1 is sorted first and then var2 may behave differently in R or Python; one might need to reverse the order and sort var2 then var1 in order to get the same result.

Below are coding templates with common arranging tasks.

#SAS
proc sort data=iris2;
 by descending sepallength sepalarea;
run;
proc print data=iris2 (obs=10);
run;

#R
arrange(iris2, desc(Sepal.Length), Sepal.Area)[1:10,]

#Python
iris5.sort_values(by=['sepallength', 'sepalarea'], ascending=[False,True]).head(10)
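To see why the order of the sort keys matters, here is a minimal sketch using Python's built-in sorted() on a few made-up rows. The names var1 and var2 follow the prose above; the values are invented for illustration.

```python
# Made-up rows with two sort keys.
rows = [
    {"var1": 2, "var2": 1},
    {"var1": 1, "var2": 2},
    {"var1": 2, "var2": 0},
    {"var1": 1, "var2": 0},
]

def as_pairs(data):
    """Show each row as a (var1, var2) tuple for easy comparison."""
    return [(r["var1"], r["var2"]) for r in data]

# Sort by var1 first, breaking ties with var2.
by_var1_then_var2 = as_pairs(sorted(rows, key=lambda r: (r["var1"], r["var2"])))

# Sort by var2 first, breaking ties with var1.
by_var2_then_var1 = as_pairs(sorted(rows, key=lambda r: (r["var2"], r["var1"])))

# Descending var1, then ascending var2 (same pattern as the templates above).
desc_var1_then_var2 = as_pairs(sorted(rows, key=lambda r: (-r["var1"], r["var2"])))

print(by_var1_then_var2)
print(by_var2_then_var1)
print(desc_var1_then_var2)
```

The three results differ, so it is worth printing the first few rows after any sort to confirm the hierarchy is the one intended.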

Try It Out!

After reading the notes, students should be able to attempt these problems.

Subset1

Subset2

For each image, do the following:

  1. Figure out how the data is arranged and subsetted based on the image (no programming required).

  2. Use a programming language to get the same arrangement of the subset.


Highlighting loops and conditional execution

Conditional execution typically involves a statement with conditions such that an expression or expressions may be executed (or run). A very common statement across most programming languages is the if statement. The if statement requires a TRUE logical condition in order for the expressions to run. Multiple expressions or grouped expressions can be used. Multiple conditions can be combined with logical operators such as & (“and”) and | (“or”) in R, or and and or in SAS and Python. The else statement is used when the user intends for an alternative set of expressions to be executed when the logical condition is FALSE. Below is a coding template for if and else statements.

#SAS
data test;
x = 0;
i = 1;
if i le 1 then 
  x = 20;
  else 
    x = 0;
run;
proc print data = test;
run;

#R
i = 1
if (i <= 1) {
  x = 20
  } else {
    x = 0
  }
x

#Python
i = 1
if i <= 1:
    x = 20
else:
    x = 0
print(x)
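Conditions can also be combined. The sketch below, with made-up values for i and j, uses Python's and and or operators together with elif, which checks a second condition only when the first one is FALSE.

```python
i = 1
j = 7

if i <= 1 and j > 5:      # both conditions must be True
    x = 20
elif i <= 1 or j > 5:     # reached only if the first test failed
    x = 10
else:                     # neither condition holds
    x = 0

print(x)
```

Only the first branch whose condition is TRUE runs, so the order of the tests matters.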

It is quite common to find conditional execution combined with loops. Loops are coding structures that perform repetitive actions on expressions. Usually, the repetitive actions happen in the body of the loop. Loops simplify code in which the same actions would otherwise be written out repeatedly over several lines. They can also be used to show how values change over iterations. Loops can also be used inside of other loops, called nested loops, for more complicated algorithmic programming. There are 2 main types of loops: index-controlled loops and condition-controlled loops.
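As a minimal sketch of these ideas in Python, the first loop below combines a loop with conditional execution and records how x changes over the iterations, and the second is a nested loop. The names history and pairs are made up for this example.

```python
# A loop combined with conditional execution.
history = []                      # records how x changes over the iterations
for i in range(1, 5):             # i takes the values 1, 2, 3, 4
    x = 5 * i
    if x % 2 == 0:
        label = "even"
    else:
        label = "odd"
    history.append((i, x, label))

# A nested loop: the inner loop runs completely for each value of the outer index.
pairs = []
for i in range(1, 3):             # outer index: 1, 2
    for j in range(1, 4):         # inner index: 1, 2, 3
        pairs.append((i, j, i * j))

print(history)
print(pairs)
```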

With index-controlled loops (e.g. for or do), actions are repeated for a predetermined number of times using an index or counter.

The index takes its values from the loop’s expression (for example, 1 to 4) and changes each time the body executes, until it reaches the last value of that expression. The loop stops at that point, so we must choose the range (or dimension) of the expression carefully. Below is a coding template for an index-controlled loop.

#SAS
data test;
x = 0;
i = 1;
do i = 1 to 4;
x = 5*i;
end;
run;
proc print data = test;
run;

#R
for (i in 1:4) {
 x = 5*i
}
x

#Python
for i in range(1, 5):  # i takes the values 1, 2, 3, 4, matching the SAS and R loops
    x=5*i

print(x)

With condition-controlled loops (e.g. while, do while, do until, repeat), actions are repeated as long as (or until) a condition is satisfied. These conditions may be checked at the top or at the bottom of the loop.

Below is a diagram for condition-controlled loops when the condition is checked at the top.

For conditions that need to be checked at the top, the logical condition should be TRUE at the very beginning before the grouped expressions can be repeated. Once the condition is FALSE, the loop stops. Below is a coding template for a condition-controlled loop where the condition is checked at the top.

#SAS
data test;
x = 0;
i = 1;
do while (i < 5);
x = 5*i;
i+1;
end;
run;
proc print data = test;
run;

#R
i = 1
while (i < 5) {
 x = 5*i
 i = i + 1
}
x

#Python
i = 1
while i < 5:
  x = 5*i
  i = i + 1

print(x)

Below is a diagram for condition-controlled loops when the condition is checked at the bottom.

For conditions that need to be checked at the bottom, the expressions execute until the condition is TRUE, so the body always runs at least once. When the condition becomes TRUE, the break or end statement exits the loop. Without the break, the loop never stops. Below is a coding template for a condition-controlled loop where the condition is checked at the bottom.

#SAS
data test;
x = 0;
i = 1;
do until (i >= 5);
x = 5*i;
i+1;
end;
run;
proc print data=test;
run;

#R
i = 1
repeat{
 x = 5*i
 i = i + 1
 if (i >= 5) break
}
x

#Python #no pre-existing structure but we can improvise
i = 1
while True:
  x = 5*i
  i = i + 1

  if i >= 5:  # same stopping condition as the SAS and R versions
    break

print(x)
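The practical difference between the two kinds of condition-controlled loops shows up when the stopping condition is already met at the start. The sketch below counts how many times the body runs in each case; the function names top_checked and bottom_checked are hypothetical.

```python
def top_checked(i):
    """Condition tested before each pass: the body may run zero times."""
    runs = 0
    while i < 5:
        runs += 1
        i += 1
    return runs

def bottom_checked(i):
    """Condition tested after each pass: the body runs at least once."""
    runs = 0
    while True:
        runs += 1
        i += 1
        if i >= 5:
            break
    return runs

print(top_checked(1), bottom_checked(1))    # both loops start below the limit
print(top_checked(10), bottom_checked(10))  # both loops start above the limit
```

Starting from i = 1 the two loops do the same amount of work, but starting from i = 10 the top-checked loop never enters its body while the bottom-checked loop still runs once.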

Try It Out!

After reading the notes, students should be able to attempt these problems.

  • For both problems below, use loop(s) or conditional execution(s) or both.

    1. In the US State-level COVID-19 Historical Data from the New York Times, find the 10 states with the highest case and death rates, where the rate denominator is each 7-day period. Be sure to arrange the data in descending order of case rate then descending order of death rate.

    2. In the OKCupid Profiles Data, find the 10 most verbose self-centered profiles based on how long the user’s statement is for each of the essay questions (essay0-essay9) and how many times they refer to themselves using the pronouns “I” and “me”. Be sure to arrange the data in descending order of verbose self-centeredness.