An Introduction to R Chapter 6 Lists and data frames
Missing values
Lists (recursive structures) are among the most flexible data objects to work with. A list is made up of components, and the components may be of any mode, including other lists. To create one, we use the function list().
?list
## starting httpd help server ... done
y <- list(y1=letters, y2=LETTERS, y3=iris[1:26,])
y
## $y1
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
##
## $y2
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
##
## $y3
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
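To illustrate the point that components may be of any mode, here is a minimal sketch using a small list that is not part of the original example. The object `rec` and its component names are hypothetical and chosen only for illustration.

```r
# A hypothetical list mixing modes, including a nested list.
rec <- list(id = 1L, tags = c("a", "b"), nested = list(flag = TRUE, value = 3.14))

length(rec)         # number of top-level components: 3
names(rec)          # "id" "tags" "nested"
sapply(rec, mode)   # mode of each component: numeric, character, list
str(rec)            # compact summary of the whole structure
```

str() is often easier to read than printing the list itself when components are large, as with the iris rows above.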
Components in a list are always numbered and can be referenced by position through indexing, much like the elements of arrays and matrices. Components do not have to be named, but naming them gives a more flexible way to reference them.
y[[1]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
y$y1
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
y[[2]][1]
## [1] "A"
y[[3]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
y[[3]][,1]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
## [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0
y[[3]]$Species
## [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## [21] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
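The calls above all use double brackets or `$`. As a short supplementary sketch (single brackets are not used in the original examples), the following shows how `[` differs from `[[` on the same list `y`, which is rebuilt here so the block runs on its own.

```r
y <- list(y1 = letters, y2 = LETTERS, y3 = iris[1:26, ])

class(y[1])                  # "list": single brackets keep the list wrapper
class(y[[1]])                # "character": double brackets extract the component
identical(y[["y2"]], y$y2)   # TRUE: two ways to reference a component by name
names(y[c("y1", "y3")])      # "y1" "y3": single brackets can select several components
y[[3]][2, "Sepal.Width"]     # row 2 of one column inside the data frame: 3
```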
A data frame (a rectangular data object with rows and columns) is a list of class “data.frame”. To create a data frame, use data.frame(). An object that is not a data frame can be coerced into one using as.data.frame().
data.frame(y1=letters, y2=LETTERS)
## y1 y2
## 1 a A
## 2 b B
## 3 c C
## 4 d D
## 5 e E
## 6 f F
## 7 g G
## 8 h H
## 9 i I
## 10 j J
## 11 k K
## 12 l L
## 13 m M
## 14 n N
## 15 o O
## 16 p P
## 17 q Q
## 18 r R
## 19 s S
## 20 t T
## 21 u U
## 22 v V
## 23 w W
## 24 x X
## 25 y Y
## 26 z Z
as.data.frame(cbind(letters,order(letters)))
## letters V2
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
## 5 e 5
## 6 f 6
## 7 g 7
## 8 h 8
## 9 i 9
## 10 j 10
## 11 k 11
## 12 l 12
## 13 m 13
## 14 n 14
## 15 o 15
## 16 p 16
## 17 q 17
## 18 r 18
## 19 s 19
## 20 t 20
## 21 u 21
## 22 v 22
## 23 w 23
## 24 x 24
## 25 y 25
## 26 z 26
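One detail of the as.data.frame(cbind(...)) call is easy to miss: cbind() first builds a matrix, and a matrix holds a single type, so the numbers are converted to text before the data frame is created. The sketch below checks this; the names `m`, `df_from_matrix`, `df_direct`, and `rank` are hypothetical.

```r
m <- cbind(letters, order(letters))
typeof(m)                    # "character": the integers were coerced to text

df_from_matrix <- as.data.frame(m)
is.numeric(df_from_matrix$V2)   # FALSE: the column is no longer numeric

# Building the data frame directly keeps each column's own type.
df_direct <- data.frame(letters = letters, rank = order(letters))
class(df_direct$rank)        # "integer"
```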
data.frame(newcol=2, `newer col`=letters)
## newcol newer.col
## 1 2 a
## 2 2 b
## 3 2 c
## 4 2 d
## 5 2 e
## 6 2 f
## 7 2 g
## 8 2 h
## 9 2 i
## 10 2 j
## 11 2 k
## 12 2 l
## 13 2 m
## 14 2 n
## 15 2 o
## 16 2 p
## 17 2 q
## 18 2 r
## 19 2 s
## 20 2 t
## 21 2 u
## 22 2 v
## 23 2 w
## 24 2 x
## 25 2 y
## 26 2 z
Data frames in R have some quirky default behaviors, including:
character vectors are coerced to factors (the default before R 4.0.0; since then only when stringsAsFactors = TRUE)
variable names are adjusted automatically, sometimes in unexpected ways
rows may be named with a variable or other object
columns of different lengths are recycled silently when the shorter length divides the longer one evenly
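A few of these behaviors can be reproduced directly. The following is a minimal sketch with hypothetical objects (`chr`, `fct`, `named`). stringsAsFactors is set explicitly in both directions so that the result does not depend on the R version in use.

```r
# Name adjustment: check.names = TRUE is the default.
names(data.frame(`newer col` = 1:2))                        # "newer.col"
names(data.frame(`newer col` = 1:2, check.names = FALSE))   # "newer col"

# Factor conversion, requested explicitly in both directions.
chr <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
fct <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(chr$x)   # "character"
class(fct$x)   # "factor"

# Row names taken from a vector.
named <- data.frame(score = c(90, 85), row.names = c("ann", "bob"))
named["bob", "score"]   # 85
```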
Tibbles (the tidyverse version of a data frame) were created to remove the data.frame quirks mentioned above and so smooth out data processing.
library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
tibble(y1=letters, y2=LETTERS)
## # A tibble: 26 x 2
## y1 y2
## <chr> <chr>
## 1 a A
## 2 b B
## 3 c C
## 4 d D
## 5 e E
## 6 f F
## 7 g G
## 8 h H
## 9 i I
## 10 j J
## # ... with 16 more rows
as_tibble(iris)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ... with 140 more rows
tibble(newcol=2, `newer col`=letters)
## # A tibble: 26 x 2
## newcol `newer col`
## <dbl> <chr>
## 1 2 a
## 2 2 b
## 3 2 c
## 4 2 d
## 5 2 e
## 6 2 f
## 7 2 g
## 8 2 h
## 9 2 i
## 10 2 j
## # ... with 16 more rows
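The differences from data.frame() can also be checked programmatically rather than read off the printout. This sketch assumes the tibble package is installed; the objects `tb` and `res` are hypothetical.

```r
library(tibble)  # assumed to be installed; also attached by library(tidyverse)

tb <- tibble(newcol = 2, `newer col` = letters[1:4])
names(tb)               # "newcol" "newer col": the name is kept as written
class(tb$`newer col`)   # "character": no conversion to factor
nrow(tb)                # 4: the length-one value 2 was recycled

# Vectors longer than one are not recycled; tibble() signals an error instead.
res <- try(tibble(a = 1:4, b = 1:2), silent = TRUE)
inherits(res, "try-error")   # TRUE
```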
One quirky default behavior of tibbles shows up when printing them. By default, only the first 10 rows and the columns that fit on one screen are printed. Long values are truncated (marked with ~ in the output below), and columns that do not fit are listed by name beneath the printout rather than shown with their values. These defaults can be changed.
iris2 <- tibble(iris,spec2=paste0(iris$Species,seq(0,1,length.out = 50)))
iris2
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species spec2
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa setosa0
## 2 4.9 3 1.4 0.2 setosa setosa0.0204081632~
## 3 4.7 3.2 1.3 0.2 setosa setosa0.0408163265~
## 4 4.6 3.1 1.5 0.2 setosa setosa0.0612244897~
## 5 5 3.6 1.4 0.2 setosa setosa0.0816326530~
## 6 5.4 3.9 1.7 0.4 setosa setosa0.1020408163~
## 7 4.6 3.4 1.4 0.3 setosa setosa0.1224489795~
## 8 5 3.4 1.5 0.2 setosa setosa0.1428571428~
## 9 4.4 2.9 1.4 0.2 setosa setosa0.1632653061~
## 10 4.9 3.1 1.5 0.1 setosa setosa0.1836734693~
## # ... with 140 more rows
options(tibble.print_max = 50, tibble.print_min = 10)  # tibbles with more than 50 rows print only their first 10
as_tibble(iris2)
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species spec2
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa setosa0
## 2 4.9 3 1.4 0.2 setosa setosa0.0204081632~
## 3 4.7 3.2 1.3 0.2 setosa setosa0.0408163265~
## 4 4.6 3.1 1.5 0.2 setosa setosa0.0612244897~
## 5 5 3.6 1.4 0.2 setosa setosa0.0816326530~
## 6 5.4 3.9 1.7 0.4 setosa setosa0.1020408163~
## 7 4.6 3.4 1.4 0.3 setosa setosa0.1224489795~
## 8 5 3.4 1.5 0.2 setosa setosa0.1428571428~
## 9 4.4 2.9 1.4 0.2 setosa setosa0.1632653061~
## 10 4.9 3.1 1.5 0.1 setosa setosa0.1836734693~
## # ... with 140 more rows
options(tibble.width = Inf)
as_tibble(iris2)
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## spec2
## <chr>
## 1 setosa0
## 2 setosa0.0204081632653061
## 3 setosa0.0408163265306122
## 4 setosa0.0612244897959184
## 5 setosa0.0816326530612245
## 6 setosa0.102040816326531
## 7 setosa0.122448979591837
## 8 setosa0.142857142857143
## 9 setosa0.163265306122449
## 10 setosa0.183673469387755
## # ... with 140 more rows
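The options() calls above change printing for the whole session. As an alternative sketch, the print method for tibbles also accepts `n` and `width` arguments for a single call. This assumes the tibble package is installed; `tb` is a hypothetical name.

```r
library(tibble)  # assumed to be installed

tb <- as_tibble(iris)
print(tb, n = 15)               # show the first 15 rows for this call only
print(tb, n = 3, width = Inf)   # 3 rows, all columns regardless of screen width
```

Because nothing is stored in options(), later printouts go back to the defaults.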
Reminder: A data frame is a list, and a tibble is a data frame, so a tibble is also a list. The reverse does not hold: a list is not necessarily a data frame, and a list is not necessarily a tibble.
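These relationships can be confirmed with the standard type predicates. A minimal sketch, assuming the tibble package is installed; `tb` and `plain` are hypothetical names.

```r
library(tibble)  # assumed to be installed

tb    <- as_tibble(iris)
plain <- list(a = 1:3, b = letters)   # components of unequal length

is.list(iris)          # TRUE:  a data frame is a list
is.data.frame(tb)      # TRUE:  a tibble is a data frame
is.list(tb)            # TRUE:  so a tibble is a list
is.data.frame(plain)   # FALSE: this list is not a data frame
is_tibble(iris)        # FALSE: a plain data frame is not a tibble
class(tb)              # "tbl_df" "tbl" "data.frame"
```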
Missing values are often represented as blanks, “.”, or " " in data frames. They differ slightly from null values and unknown values. Null values (NULL in R) are undefined values, often used in R code to create empty objects. Unknown values are usually marked as “unknown” in a dataset; older datasets might use a code such as “9999” or “99999” instead. A missing value could be unknown, NULL, or an actual value that simply never made it into the data frame. Unknown values, however, are not necessarily missing when they are recorded as “unknown” or “9999” within a data frame. How values came to be missing also tells us something, and missingness is usually sorted into the three categories below.
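As a brief aside on the representations described above: within R itself, a missing value is stored as NA, so blanks and codes such as 9999 are usually recoded to NA after import. The sketch below uses a made-up data frame (`raw`, with hypothetical columns `id`, `age`, and `status`). Which codes count as missing is an assumption that depends on the dataset's documentation.

```r
# Hypothetical survey extract using the codes mentioned above.
raw <- data.frame(
  id     = 1:5,
  age    = c(34, 9999, 51, NA, 28),
  status = c("ok", ".", "unknown", " ", "ok"),
  stringsAsFactors = FALSE
)

# Recode sentinel values to NA. %in% never returns NA, so it is safe for
# subsetting even when the column already contains NA.
raw$age[raw$age %in% c(9999, 99999)] <- NA
raw$status[trimws(raw$status) %in% c("", ".")] <- NA   # "unknown" is kept as a recorded value

colSums(is.na(raw))           # count of NA per column: id 0, age 2, status 2
mean(raw$age, na.rm = TRUE)   # mean of the three observed ages

# NULL is the absence of an object, not a missing value inside one.
length(NULL)   # 0
length(NA)     # 1
```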
Missing not at random (MNAR) is a category of missing values in which the reason a value is missing is related to the missing value itself. This category occurs in the real world and does not allow data analysis to be done without bias.
Missing at random (MAR) is a category of missing values in which the missingness is related to the observed values but not to the missing values themselves. Despite the name, the missingness is not truly random. This category occurs in the real world, and an analysis that ignores the relationship is generally biased.
Missing completely at random (MCAR) is a category of missing values in which the missingness is unrelated to both the missing values and the observed values (data that we can see). This category rarely occurs in the real world, but it does allow data analysis to be done without bias.
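The three categories can be made concrete with a small simulation. This is only a sketch under stated assumptions: the variables `age` and `income`, the linear relationship between them, the seed, and the missingness rates are all invented for illustration and do not come from any dataset in this chapter.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
n      <- 1000
age    <- runif(n, 20, 80)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)   # assumed: income rises with age

# MCAR: every value has the same 30% chance of being missing.
inc_mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance of being missing depends only on the observed age.
inc_mar <- ifelse(runif(n) < 0.6 * (age - 20) / 60, NA, income)

# MNAR: missingness depends on income itself;
# here the top 30% of incomes are never recorded.
inc_mnar <- ifelse(income > quantile(income, 0.7), NA, income)

# Compare the mean of the observed values under each mechanism.
means <- c(
  full = mean(income),
  mcar = mean(inc_mcar, na.rm = TRUE),
  mar  = mean(inc_mar,  na.rm = TRUE),
  mnar = mean(inc_mnar, na.rm = TRUE)
)
round(means, 1)
```

Under MNAR the observed mean must fall below the full mean here, because only the largest values were removed. Under MCAR the observed mean should typically stay close to the full mean. Under MAR it typically drifts lower in this setup, since older (and so higher-income) cases are missing more often, unless age is taken into account.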