Week 5 Notes

Created by Christopher Kinson


Things Covered in this Week’s Notes


ICYMI: The Git Golden Rule is to always pull your repo before you push.

ICYMI: An Introduction to R by Venables, Smith and the R Core Team is one of the reference textbooks in the syllabus.


An Introduction to R Chapter 6 Lists and data frames

Lists

Lists (recursive structures) are one of the most flexible data objects to work with. Lists are made up of components, and the components may be of any mode. To create a list, we use the function list().

?list
## starting httpd help server ... done
y <- list(y1=letters, y2=LETTERS, y3=iris[1:26,])
y
## $y1
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
## 
## $y2
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
## 
## $y3
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
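To check that the components of y really do have different modes, one option is to inspect the list with a few base R helpers. This is a small sketch that rebuilds the y from above; the variable names n_components, component_names, and component_modes are only illustrative.

```r
# Rebuild the example list from above
y <- list(y1 = letters, y2 = LETTERS, y3 = iris[1:26, ])

# Number of components and their names
n_components <- length(y)
component_names <- names(y)

# Mode of each component: the data frame component is itself of mode "list"
component_modes <- sapply(y, mode)

n_components
component_names
component_modes
```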

Components in a list are always numbered and can be referenced by position through indexing, much like the elements of arrays and matrices. Components do not have to be named, but naming them allows for a more flexible way to reference the components in the list.

y[[1]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
y$y1
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
y[[2]][1]
## [1] "A"
y[[3]]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
y[[3]][,1]
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
## [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0
y[[3]]$Species
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [21] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
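The examples above all use double brackets or $. As a complementary sketch, the code below contrasts single and double brackets on the same y: single brackets return a sub-list, while double brackets return the component itself. The names sub_list and component are only illustrative.

```r
y <- list(y1 = letters, y2 = LETTERS, y3 = iris[1:26, ])

# Single brackets keep the list wrapper: the result is a list of length 1
sub_list <- y[1]
class(sub_list)
length(sub_list)

# Double brackets extract the component itself, by position or by name
component <- y[["y1"]]
class(component)
length(component)
```

This difference matters when passing a component to a function that expects a vector, since y[1] is still a list.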

Data Frames

A data frame (a rectangular data object with rows and columns) is a list of class “data.frame”. To create a data frame, use data.frame(). If an object is not a data frame, then it can be coerced into a data frame using as.data.frame().

data.frame(y1=letters, y2=LETTERS)
##    y1 y2
## 1   a  A
## 2   b  B
## 3   c  C
## 4   d  D
## 5   e  E
## 6   f  F
## 7   g  G
## 8   h  H
## 9   i  I
## 10  j  J
## 11  k  K
## 12  l  L
## 13  m  M
## 14  n  N
## 15  o  O
## 16  p  P
## 17  q  Q
## 18  r  R
## 19  s  S
## 20  t  T
## 21  u  U
## 22  v  V
## 23  w  W
## 24  x  X
## 25  y  Y
## 26  z  Z
as.data.frame(cbind(letters,order(letters)))
##    letters V2
## 1        a  1
## 2        b  2
## 3        c  3
## 4        d  4
## 5        e  5
## 6        f  6
## 7        g  7
## 8        h  8
## 9        i  9
## 10       j 10
## 11       k 11
## 12       l 12
## 13       m 13
## 14       n 14
## 15       o 15
## 16       p 16
## 17       q 17
## 18       r 18
## 19       s 19
## 20       t 20
## 21       u 21
## 22       v 22
## 23       w 23
## 24       x 24
## 25       y 25
## 26       z 26
data.frame(newcol=2, `newer col`=letters)
##    newcol newer.col
## 1       2         a
## 2       2         b
## 3       2         c
## 4       2         d
## 5       2         e
## 6       2         f
## 7       2         g
## 8       2         h
## 9       2         i
## 10      2         j
## 11      2         k
## 12      2         l
## 13      2         m
## 14      2         n
## 15      2         o
## 16      2         p
## 17      2         q
## 18      2         r
## 19      2         s
## 20      2         t
## 21      2         u
## 22      2         v
## 23      2         w
## 24      2         x
## 25      2         y
## 26      2         z
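Two points from the examples above can be checked directly: a data frame is stored as a list of columns, and coercing a matrix built with cbind() carries the matrix's single type into the data frame. The sketch below uses the same objects as above; df, m, coerced, and was_numeric are illustrative names.

```r
df <- data.frame(y1 = letters, y2 = LETTERS)

# A data frame is stored as a list whose components are the columns
is.list(df)
class(df)
length(df)   # number of columns, not rows

# cbind() on a character and a numeric vector builds a character matrix,
# so the coerced data frame holds the numbers as text (factors before R 4.0.0)
m <- cbind(letters, order(letters))
coerced <- as.data.frame(m)
was_numeric <- is.numeric(coerced$V2)
was_numeric

# Convert the column back to numbers (as.character() guards against factors)
coerced$V2 <- as.numeric(as.character(coerced$V2))
is.numeric(coerced$V2)
```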

Data frames in R have some quirky default behaviors including:

  • character vectors are coerced to be factors (the default before R 4.0.0; since then, only when stringsAsFactors = TRUE is set)

  • variable names are automatically adjusted, sometimes in unexpected ways (for example, `newer col` became newer.col above)

  • rows may be named with a variable or other object

  • all columns must end up with the same length; a shorter vector is recycled only when its length divides the number of rows evenly
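Each of these behaviors can be reproduced in a few lines. The sketch below is illustrative only, with made-up object names. Note that data.frame() does recycle a shorter vector when its length divides the number of rows evenly (as newcol=2 did above) and raises an error otherwise.

```r
# 1. Character-to-factor coercion: the default before R 4.0.0, opt-in since then
df_factor <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df_factor$x)

# 2. Name adjustment: check.names = FALSE keeps a non-syntactic name as typed
df_names <- data.frame(`newer col` = 1:2, check.names = FALSE)
names(df_names)

# 3. Row names taken from another object
df_rows <- data.frame(x = 1:3, row.names = c("r1", "r2", "r3"))
rownames(df_rows)

# 4. Recycling: allowed when the shorter length divides the longer one evenly
df_recycled <- data.frame(a = 1:4, b = 1:2)
nrow(df_recycled)

# ...and an error otherwise; the message is captured instead of stopping
msg <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) conditionMessage(e))
msg
```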

Tibbles (Bonus content)

Tibbles (tidyverse data frames) were created to eliminate the data.frame quirks mentioned above and to smooth out data processing.

library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
tibble(y1=letters, y2=LETTERS)
## # A tibble: 26 x 2
##    y1    y2   
##    <chr> <chr>
##  1 a     A    
##  2 b     B    
##  3 c     C    
##  4 d     D    
##  5 e     E    
##  6 f     F    
##  7 g     G    
##  8 h     H    
##  9 i     I    
## 10 j     J    
## # ... with 16 more rows
as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 140 more rows
tibble(newcol=2, `newer col`=letters)
## # A tibble: 26 x 2
##    newcol `newer col`
##     <dbl> <chr>      
##  1      2 a          
##  2      2 b          
##  3      2 c          
##  4      2 d          
##  5      2 e          
##  6      2 f          
##  7      2 g          
##  8      2 h          
##  9      2 i          
## 10      2 j          
## # ... with 16 more rows
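To see how tibbles differ from data frames on these points, the sketch below checks column types, names, and recycling. It assumes the tibble package is installed; the object names are illustrative.

```r
library(tibble)

# Character vectors stay character
tb <- tibble(y1 = letters, `newer col` = LETTERS)
class(tb$y1)

# Non-syntactic names are kept as typed
names(tb)

# Only length-1 vectors are recycled
tb_ok <- tibble(a = 1:4, b = 0)
nrow(tb_ok)

# Any other length mismatch is an error, captured here as a flag
recycle_error <- tryCatch(tibble(a = 1:4, b = 1:2),
                          error = function(e) "error")
recycle_error
```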

One quirky default behavior of tibbles occurs when printing them. By default, a tibble with more than 20 rows prints only its first 10 rows and only the columns that fit on one screen. Long values in the columns are abbreviated (written in short form), while any columns that do not fit are listed by name without their values. But we can alter the default behavior with options().

iris2 <- tibble(iris,spec2=paste0(iris$Species,seq(0,1,length.out = 50)))
iris2
## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species spec2              
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>              
##  1          5.1         3.5          1.4         0.2 setosa  setosa0            
##  2          4.9         3            1.4         0.2 setosa  setosa0.0204081632~
##  3          4.7         3.2          1.3         0.2 setosa  setosa0.0408163265~
##  4          4.6         3.1          1.5         0.2 setosa  setosa0.0612244897~
##  5          5           3.6          1.4         0.2 setosa  setosa0.0816326530~
##  6          5.4         3.9          1.7         0.4 setosa  setosa0.1020408163~
##  7          4.6         3.4          1.4         0.3 setosa  setosa0.1224489795~
##  8          5           3.4          1.5         0.2 setosa  setosa0.1428571428~
##  9          4.4         2.9          1.4         0.2 setosa  setosa0.1632653061~
## 10          4.9         3.1          1.5         0.1 setosa  setosa0.1836734693~
## # ... with 140 more rows
options(tibble.print_max = 50, tibble.print_min = 10)
as_tibble(iris2)
## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species spec2              
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>              
##  1          5.1         3.5          1.4         0.2 setosa  setosa0            
##  2          4.9         3            1.4         0.2 setosa  setosa0.0204081632~
##  3          4.7         3.2          1.3         0.2 setosa  setosa0.0408163265~
##  4          4.6         3.1          1.5         0.2 setosa  setosa0.0612244897~
##  5          5           3.6          1.4         0.2 setosa  setosa0.0816326530~
##  6          5.4         3.9          1.7         0.4 setosa  setosa0.1020408163~
##  7          4.6         3.4          1.4         0.3 setosa  setosa0.1224489795~
##  8          5           3.4          1.5         0.2 setosa  setosa0.1428571428~
##  9          4.4         2.9          1.4         0.2 setosa  setosa0.1632653061~
## 10          4.9         3.1          1.5         0.1 setosa  setosa0.1836734693~
## # ... with 140 more rows
options(tibble.width = Inf)
as_tibble(iris2)
## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
##    spec2                   
##    <chr>                   
##  1 setosa0                 
##  2 setosa0.0204081632653061
##  3 setosa0.0408163265306122
##  4 setosa0.0612244897959184
##  5 setosa0.0816326530612245
##  6 setosa0.102040816326531 
##  7 setosa0.122448979591837 
##  8 setosa0.142857142857143 
##  9 setosa0.163265306122449 
## 10 setosa0.183673469387755 
## # ... with 140 more rows
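The options() calls above change printing for the rest of the session. As an alternative sketch, the n and width arguments of print() apply to a single call only, and setting an option to NULL removes it so that the package default applies again. This assumes the tibble package is installed.

```r
library(tibble)

iris2 <- tibble(iris, spec2 = paste0(iris$Species, seq(0, 1, length.out = 50)))

# Show 15 rows and every column for this call only
print(iris2, n = 15, width = Inf)

# Remove the session-wide settings so the package defaults apply again
options(tibble.print_max = NULL, tibble.print_min = NULL, tibble.width = NULL)
```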

Reminder: A data frame is a list. A tibble is a data frame. A tibble is a list. But not every list is a data frame, and not every list (or data frame) is a tibble.
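These relationships can be checked with the is.*() family of functions. The sketch below assumes the tibble package is installed; plain_list, df, and tb are illustrative objects.

```r
library(tibble)

plain_list <- list(a = 1:3, b = letters)   # components of unequal length
df <- data.frame(a = 1:3)
tb <- as_tibble(df)

# Data frames and tibbles are lists
c(is.list(df), is.list(tb))

# A tibble is a data frame, but a plain data frame is not a tibble
c(is.data.frame(tb), is_tibble(df))

# A plain list is neither
c(is.data.frame(plain_list), is_tibble(plain_list))

# The class vector shows the layering
class(tb)
```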


Missing values

Missing values are often represented as blanks, “.”, or " " in raw data, and R marks them as NA. They are slightly different from null values and unknown values. Null values (NULL in R) are undefined values, often used in R code to create empty objects. Unknown values are usually marked as “unknown” in a dataset; older datasets might use “9999” or “99999” instead. A missing value could be unknown, NULL, or an actual value that just never made it into the data frame. But unknown values are not necessarily missing when they are represented as “unknown” or “9999” within a data frame. There are some things we can glean from missing values.
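In R, a missing value is stored as NA, which is distinct from NULL and from a sentinel code such as 9999. The sketch below uses a made-up vector x to show the difference and one way to recode a sentinel to NA.

```r
# Hypothetical measurements: NA is missing, 9999 is an "unknown" code
x <- c(12, 9999, 7, NA, 15)

# NA occupies a position in a vector; NULL has length zero
length(NA)
length(NULL)

# is.na() finds missing values; the sentinel 9999 is not detected
is.na(x)

# Recode the sentinel to NA (%in% avoids NA in the logical index)
x[x %in% c(9999, 99999)] <- NA
sum(is.na(x))

# Most summaries return NA unless missing values are removed
mean(x)
mean(x, na.rm = TRUE)
```

Leaving a code like 9999 in a numeric column would silently inflate summaries such as the mean, which is why recoding before analysis is a common first step.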

Missing not at random is a category of missing values in which the reason the values are missing is related to the values that are missing. This category occurs in the real world and does not allow for data analysis to be done without bias.

Missing at random is a category of missing values in which the missingness is related to the observed values but not to the missing values themselves. Despite the name, the missingness for this category is not truly random. This category occurs in the real world, and an analysis that ignores the pattern of missingness may be biased.

Missing completely at random is a category of missing values in which the missingness is unrelated to both the missing values and the observed values (data that we can see). This category rarely occurs in the real world, but does allow for data analysis to be done without bias.
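A small simulation can illustrate why the category matters. The sketch below is not from the notes: it generates made-up data, removes values completely at random in one copy and not at random in another (larger values are more likely to be removed), and compares the means. The 30% rate and the plogis() missingness rule are arbitrary assumptions.

```r
set.seed(1)

# Hypothetical complete data: 10,000 values with mean 50 and sd 10
x <- rnorm(10000, mean = 50, sd = 10)

# MCAR: every value has the same 30% chance of being missing
x_mcar <- x
x_mcar[runif(length(x)) < 0.3] <- NA

# MNAR: larger values are more likely to be missing
p_missing <- plogis((x - 50) / 5)
x_mnar <- x
x_mnar[runif(length(x)) < p_missing] <- NA

# Compare the mean of the complete data with the two incomplete versions
mean_full <- mean(x)
mean_mcar <- mean(x_mcar, na.rm = TRUE)
mean_mnar <- mean(x_mnar, na.rm = TRUE)

round(c(full = mean_full, mcar = mean_mcar, mnar = mean_mnar), 2)
```

Under these assumptions the MCAR mean should stay close to the full-data mean, while the MNAR mean should be noticeably lower because the large values were preferentially removed.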