Things Covered in this Week’s Notes


Text Analysis

Text analysis, aka text mining, is the process of extracting important features (data mining) from documents (text), without reading each document oneself, in order to make decisions.

Useful Definitions

  • character: a symbol such as a letter, number, or punctuation mark

  • word: a term; a series of characters with meaning, often fundamental to a language

  • n-gram: a contiguous group of n words (e.g., a bigram is a pair of adjacent words)

  • string: a sequence of characters, words, or other data

  • document: a written, printed, or electronic item containing text (characters, words, etc.)

  • corpus: a collection of documents

  • term document matrix (TDM): a data structure with individual terms as rows and individual documents as columns

  • bag of words: a text mining method that treats individual words, ignoring their order, as the features to be extracted

  • stopword: a very common word (e.g., “the” or “and” in English) that isn’t informative for analysis
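
For a concrete picture of the last few definitions, here is a toy example using the tm package (introduced in Step 3 below): a corpus of two tiny documents and the term document matrix built from it.

library(tm)
toy <- VCorpus(VectorSource(c("the cat sat", "the cat ran fast"))) #a corpus of 2 documents
as.matrix(TermDocumentMatrix(toy)) #the TDM: one row per term, one column per document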

A Procedure for Text Analysis

The following notes are sourced from T. Kwartler’s Text Mining in Practice with R book, but use tidyverse functions from the stringr package in place of base R’s regular expression functions (see https://stringr.tidyverse.org/articles/from-base.html), along with Amelia McNamara and Nicholas J. Horton’s Wrangling Categorical Data in R https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1356375.

  1. Define the problem and specific goals.
  2. Identify the text or documents needed.
  3. Organize (structure) the documents & the text within the documents.
  4. Extract features and analyze the structured data.
  5. Make a recommendation or decision based on the analysis.

1. Define the problem and specific goals. Think about the first two steps of the PPDAC-Investigative Cycle.

Suppose these questions were of analytical interest:

  • Is OkCupid full of creeps?

  • How many creeps were active on OkCupid at this time and place?

  • What makes someone creepy or a creep?

2. Identify the text or documents needed.

Think about the D of PPDAC-Investigative Cycle.

Let’s import the OkCupid data.

library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv", 
    col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M"))) #takes about 20 seconds on laptop
head(okc)
## # A tibble: 6 x 31
##     age body_type diet  drinks drugs education essay0 essay1 essay2 essay3
##   <dbl> <chr>     <chr> <chr>  <chr> <chr>     <chr>  <chr>  <chr>  <chr> 
## 1    22 a little~ stri~ socia~ never working ~ "abou~ "curr~ "maki~ "the ~
## 2    35 average   most~ often  some~ working ~ "i am~ dedic~ "bein~ <NA>  
## 3    38 thin      anyt~ socia~ <NA>  graduate~ "i'm ~ "i ma~ "impr~ "my l~
## 4    23 thin      vege~ socia~ <NA>  working ~ i wor~ readi~ "play~ socia~
## 5    29 athletic  <NA>  socia~ never graduate~ "hey ~ work ~ "crea~ i smi~
## 6    29 average   most~ socia~ <NA>  graduate~ "i'm ~ "buil~ "imag~ "i ha~
## # ... with 21 more variables: essay4 <chr>, essay5 <chr>, essay6 <chr>,
## #   essay7 <chr>, essay8 <chr>, essay9 <chr>, ethnicity <chr>,
## #   height <dbl>, income <dbl>, job <chr>, last_online <dttm>,
## #   location <chr>, offspring <chr>, orientation <chr>, pets <chr>,
## #   religion <chr>, sex <chr>, sign <chr>, smokes <chr>, speaks <chr>,
## #   status <chr>
colnames(okc) <- tolower(colnames(okc))

Suppose that, to address creepiness, we focus on essay8 (“The most private thing I am willing to admit”) and essay9 (“You should message me if…”). We may, however, need to look at other essay questions as well.

set.seed(448)
si <- sample(1:nrow(okc),20) #random sample of 20 rows
okc$essay8[si]
##  [1] "i'm still not really sure what a hipster is :o/"                                                                                                                                                                                                                                                                                                                                                                                                                                                
##  [2] "my tumblr url:<br />\nasinfulthinker.tumblr.com<br />\n<br />\nhave fun with that."                                                                                                                                                                                                                                                                                                                                                                                                             
##  [3] "oh, no. you are not going to get me to answer this one."                                                                                                                                                                                                                                                                                                                                                                                                                                        
##  [4] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
##  [5] "lando calrissian is my spirit animal."                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [6] "hmmm... i'm not private about much..."                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [7] "i'm not entirely myself. ask me about it."                                                                                                                                                                                                                                                                                                                                                                                                                                                      
##  [8] "on a dating website or in person?"                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [9] "i think terry gross' snort when she laughs is kinda hot!"                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [10] "is only said in bed"                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [11] "i need to be working at this very moment. eish, the web.<br />\n<br />\nand not sure if this is private, but most people think it's odd or\nalmost impossible (in this country at least): i have never owned a\ncar. i lived in la without a car, in fact. it's not that fun to be\nthe only girl on the bus heading back from hollywood at 2am\nsaturday night, but i understand that homeless people need a place\nto stay warm too and i don't mind. i do have a driver's licence,\nhowever."
## [12] "i will admit to people i trust"                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [13] "i secretly want to be a back-up singer....like back in the day for\nray charles or aretha but like...now."                                                                                                                                                                                                                                                                                                                                                                                      
## [14] "i have a fear of blood and i can't stand scary movies"                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [15] "i love to sit on the couch, curl up in a blanket and watch\nmarathons of law and order svu, or vh1 reality t.v. ..."                                                                                                                                                                                                                                                                                                                                                                            
## [16] "unsure :/"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [17] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [18] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [19] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [20] "i got hammered drunk and saw ke$ha perform when she came to sf last\nyear. we r who we r"
okc$essay9[si]
##  [1] "you're bored, just feel like it, or you want to get to know me."                                                                                                                                                                                                                                                                                                                                                                                                                                   
##  [2] "you've read this far and still seem interested in getting to know\nme.<br />\nsmiles guaranteed.<br />\noh and don't be shy, i just like people 0=)<br />\n<br />\np.s. if you can guess which one of my top three favorite super\nheroes, i will personally send you a batch of homemade cookies."                                                                                                                                                                                                
##  [3] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
##  [4] "if you're exciting and fun to be with"                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [5] "you are a woody allen muse, dancer, harpist, jazz vocalist, fashion\ndesigner, gallery curator, you can help me with my french...<br />\n<br />\nor less specifically, you're a classy lady looking for someone to\nbe classy with."                                                                                                                                                                                                                                                               
##  [6] "you really enjoy food... you are looking for stimulation... are\nattracted to energetic women... have an artistic side... are\nconfident but not cocky... are inspired by something... can't sit\nstill... you're willing to try anything at least once... are easy\nin conversation... you don't take yourself too seriously.. you'd\nrather do than watch... you read or listen to npr but hopefully\nboth... you like to dance... you give great hugs... kids and dogs\nlove you..."            
##  [7] "you want to<br />\nyou like to talk about books/movies/television<br />\nyou feel like going somewhere in the city on our bikes!<br />\nyou don't take yourself too seriously.<br />\nyou're into beards<br />\n<br />\nyou should definitely message me if you want at the least, a new\nfriend."                                                                                                                                                                                                 
##  [8] "you have:<br />\na healthy sense of humor<br />\na good balance of confidence and humility<br />\nno allergies to dogs<br />\nan understanding that the left lane is for passing"                                                                                                                                                                                                                                                                                                                  
##  [9] "i'm looking to find a good, kind-hearted woman who's easy on the\neyes. message me if you like: art, history, art openings, politics,\nnpr, tennis, farmer's markets, cooking great meals at home,\nshooting pool, alameda antique fair, exploring sf, exploring nyc,\njogging, hiking in marin, yosemite, getting the hell out of\ndodge.<br />\n<br />\nalso, message me if you would dance the nasty boogie woogie with\nme:<br />\n<br />\nhttp://www.youtube.com/watch?v=syfcidcf8gk"         
## [10] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [11] "... you are funny.<br />\nand if you also have a zest for life and find that the world is too\ninteresting, and there are too many things to see and learn, to\nstay home and watch lots of television. i don't understand people\nwho \"get bored\" :)"                                                                                                                                                                                                                                           
## [12] "you are intelligent and want to have a fun, comfortable and\ninteresting \"date\", and learn that the reality of who i am in real\ntime is much more than the profile."                                                                                                                                                                                                                                                                                                                            
## [13] "you love to dance, hike, travel, camp, and cook...(not necessarily\nin that order)<br />\nyou are taller than me<br />\nyou don't hate los angeles<br />\nyou are silly<br />\nyou are really silly.<br />\nyou also know how to be serious.<br />\nyou like wine and cheese.<br />\nyou are creative in some way.<br />\nyou are spiritual in some way.<br />\nyou like affection.<br />\nyou like to give as much as you like to take.<br />\nif you are looking for something more than casual."
## [14] "you are fun-loving, enjoy random deep conversations, and love great\nfood."                                                                                                                                                                                                                                                                                                                                                                                                                        
## [15] "you are awesome!!!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [16] "you like redheads, music and love."                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [17] "you are a real girl who wants to be treated well by a real\nman.<br />\n<br />\nyou are not a hoodrat, golddigger, or shady chick"                                                                                                                                                                                                                                                                                                                                                                 
## [18] "1) you have a sense of curiosity<br />\n2) you treat people who you owe nothing to well"                                                                                                                                                                                                                                                                                                                                                                                                           
## [19] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [20] "you've got a kick ass laugh and you like showing it off"

There are some basic statistics we can compute based on character counts. We can also count the number of times a character or string appears in each element of a vector.

nc8 <- str_length(okc$essay8[si]) #counts number of characters
mean(nc8, na.rm = TRUE)
## [1] 79.5
median(nc8, na.rm = TRUE)
## [1] 50
okc$essay8[si][2]
## [1] "my tumblr url:<br />\nasinfulthinker.tumblr.com<br />\n<br />\nhave fun with that."
str_count(okc$essay8[si][2], "i") #counts the number of times a pattern appears in a string
## [1] 3
str_detect(okc$essay8[si],"like|love") #returns logicals
##  [1] FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE  TRUE FALSE  TRUE FALSE    NA    NA    NA FALSE
str_which(okc$essay8[si],"like|love") #returns the indices of matching elements
## [1] 13 15
str_subset(okc$essay8[si],"like|love") #returns the vector values themselves
## [1] "i secretly want to be a back-up singer....like back in the day for\nray charles or aretha but like...now."          
## [2] "i love to sit on the couch, curl up in a blanket and watch\nmarathons of law and order svu, or vh1 reality t.v. ..."

We can also replace certain strings with another string, and split strings apart.

str_replace_all(okc$essay8[si][2], '[\n]', " ")
## [1] "my tumblr url:<br /> asinfulthinker.tumblr.com<br /> <br /> have fun with that."
str_split(okc$essay8[si][2],"[\n]") #splits the string at every newline "\n", returning a list
## [[1]]
## [1] "my tumblr url:<br />"            "asinfulthinker.tumblr.com<br />"
## [3] "<br />"                          "have fun with that."

3. Organize (structure) the documents & the text within the documents.

The text is collected and placed into a corpus. Afterwards, we can clean it.

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
e8 <- data.frame(doc_id=si,text=okc$essay8[si],stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(e8))

Processing the data is often the first step in text mining projects. We usually want to do the following data preparations (in this order):

  • Set all words to lowercase
  • Remove stopwords and extremely rare words
  • Remove punctuation and other symbols
  • Remove unnecessary whitespace
  • Remove numbers (unless they have importance in the Problem i.e. Step 1)
  • Correct spelling errors
#R code from Kwartler's book

# Return NA instead of tolower error
tryTolower <- function(x){
  # return NA when there is an error
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, 'error'))
    y = tolower(x)
  return(y)
}

#custom.stopwords <- c(stopwords("english"), "word1", "word2") #replace "word1", "word2" with additional useless words

clean.corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tryTolower))
  #corpus <- tm_map(corpus, removeWords, custom.stopwords)
  corpus <- tm_map(corpus, removeWords, stopwords('english'))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  return(corpus)
}

newcorpus <- clean.corpus(corpus)

For misspellings, you can do several things. You can fix the spelling in different software such as MS Word or Excel, fix the spelling manually for a small number of items, or use R with the qdap library (assuming you have Java on your computer) to handle many misspelled words at once.

library(qdap)
which_misspelled(object) #object stands in for a character string you want to check

fix.text <- function(myStr) {
  check <- check_spelling(myStr)
  splitted <- strsplit(myStr, split = ' ')
  for (i in 1:length(check$row)) {
    splitted[[check$row[i]]][as.numeric(check$word.no[i])] <- check$suggestion[i]
  }
  df <- unlist(lapply(splitted, function(x) paste(x, collapse = ' ')))
  return(df)
}
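
If you don’t have Java, the hunspell package is one alternative spell checker; the following is a minimal sketch of my own (hunspell is not covered in Kwartler’s book and is assumed to be installed).

library(hunspell) #Java-free spell checker (assumed installed)
bad <- hunspell(okc$essay8[si][2])[[1]] #misspelled words found in one essay
hunspell_suggest(bad) #a list of suggested corrections for each misspelled word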

4. Extract features and analyze the structured data.

Visualizations

Now that our corpus is cleaned, we view it and place it into a TDM to extract term frequencies.

as.list(newcorpus)
## $`27563`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 28
## 
## $`42609`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 47
## 
## $`59167`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 23
## 
## $`32014`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: NA
## 
## $`47585`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 30
## 
## $`42271`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 17
## 
## $`14469`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 14
## 
## $`51796`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 22
## 
## $`47343`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 41
## 
## $`35425`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 9
## 
## $`46773`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 269
## 
## $`54734`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 24
## 
## $`1820`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 66
## 
## $`13655`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 30
## 
## $`31981`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 73
## 
## $`36735`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 7
## 
## $`12002`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: NA
## 
## $`12363`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: NA
## 
## $`33125`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: NA
## 
## $`46197`
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 58
tdm<-TermDocumentMatrix(newcorpus, control=list(weighting=weightTf))
tdm.essay8 <- as.matrix(tdm)
rownames(tdm.essay8)
##   [1] "admit"                     "almost"                   
##   [3] "animal"                    "answer"                   
##   [5] "aretha"                    "asinfulthinkertumblrcombr"
##   [7] "ask"                       "back"                     
##   [9] "bed"                       "blanket"                  
##  [11] "blood"                     "bus"                      
##  [13] "calrissian"                "came"                     
##  [15] "car"                       "charles"                  
##  [17] "couch"                     "country"                  
##  [19] "curl"                      "dating"                   
##  [21] "day"                       "drivers"                  
##  [23] "drunk"                     "eish"                     
##  [25] "entirely"                  "fact"                     
##  [27] "fear"                      "fun"                      
##  [29] "get"                       "girl"                     
##  [31] "going"                     "got"                      
##  [33] "gross"                     "hammered"                 
##  [35] "heading"                   "hipster"                  
##  [37] "hmmm"                      "hollywood"                
##  [39] "homeless"                  "hot"                      
##  [41] "however"                   "impossible"               
##  [43] "keha"                      "kinda"                    
##  [45] "lando"                     "last"                     
##  [47] "laughs"                    "law"                      
##  [49] "least"                     "licence"                  
##  [51] "likenow"                   "lived"                    
##  [53] "love"                      "marathons"                
##  [55] "mind"                      "moment"                   
##  [57] "movies"                    "much"                     
##  [59] "need"                      "never"                    
##  [61] "night"                     "odd"                      
##  [63] "one"                       "order"                    
##  [65] "owned"                     "people"                   
##  [67] "perform"                   "person"                   
##  [69] "place"                     "private"                  
##  [71] "ray"                       "reality"                  
##  [73] "really"                    "said"                     
##  [75] "saturday"                  "saw"                      
##  [77] "scary"                     "secretly"                 
##  [79] "singerlike"                "sit"                      
##  [81] "snort"                     "spirit"                   
##  [83] "stand"                     "stay"                     
##  [85] "still"                     "sure"                     
##  [87] "svu"                       "terry"                    
##  [89] "think"                     "trust"                    
##  [91] "tumblr"                    "understand"               
##  [93] "unsure"                    "urlbr"                    
##  [95] "want"                      "warm"                     
##  [97] "watch"                     "webbr"                    
##  [99] "website"                   "will"                     
## [101] "without"                   "working"                  
## [103] "year"
sfq <- data.frame(words = names(sort(rowSums(tdm.essay8), decreasing = TRUE)),
                  freqs = sort(rowSums(tdm.essay8), decreasing = TRUE),
                  row.names = NULL)
sfq
##                         words freqs
## 1                        back     3
## 2                      people     3
## 3                         car     2
## 4                         fun     2
## 5                        need     2
## 6                     private     2
## 7                        sure     2
## 8                       think     2
## 9                       admit     1
## 10                     almost     1
## 11                     animal     1
## 12                     answer     1
## 13                     aretha     1
## 14  asinfulthinkertumblrcombr     1
## 15                        ask     1
## 16                        bed     1
## 17                    blanket     1
## 18                      blood     1
## 19                        bus     1
## 20                 calrissian     1
## 21                       came     1
## 22                    charles     1
## 23                      couch     1
## 24                    country     1
## 25                       curl     1
## 26                     dating     1
## 27                        day     1
## 28                    drivers     1
## 29                      drunk     1
## 30                       eish     1
## 31                   entirely     1
## 32                       fact     1
## 33                       fear     1
## 34                        get     1
## 35                       girl     1
## 36                      going     1
## 37                        got     1
## 38                      gross     1
## 39                   hammered     1
## 40                    heading     1
## 41                    hipster     1
## 42                       hmmm     1
## 43                  hollywood     1
## 44                   homeless     1
## 45                        hot     1
## 46                    however     1
## 47                 impossible     1
## 48                       keha     1
## 49                      kinda     1
## 50                      lando     1
## 51                       last     1
## 52                     laughs     1
## 53                        law     1
## 54                      least     1
## 55                    licence     1
## 56                    likenow     1
## 57                      lived     1
## 58                       love     1
## 59                  marathons     1
## 60                       mind     1
## 61                     moment     1
## 62                     movies     1
## 63                       much     1
## 64                      never     1
## 65                      night     1
## 66                        odd     1
## 67                        one     1
## 68                      order     1
## 69                      owned     1
## 70                    perform     1
## 71                     person     1
## 72                      place     1
## 73                        ray     1
## 74                    reality     1
## 75                     really     1
## 76                       said     1
## 77                   saturday     1
## 78                        saw     1
## 79                      scary     1
## 80                   secretly     1
## 81                 singerlike     1
## 82                        sit     1
## 83                      snort     1
## 84                     spirit     1
## 85                      stand     1
## 86                       stay     1
## 87                      still     1
## 88                        svu     1
## 89                      terry     1
## 90                      trust     1
## 91                     tumblr     1
## 92                 understand     1
## 93                     unsure     1
## 94                      urlbr     1
## 95                       want     1
## 96                       warm     1
## 97                      watch     1
## 98                      webbr     1
## 99                    website     1
## 100                      will     1
## 101                   without     1
## 102                   working     1
## 103                      year     1

Here’s a bar plot with term frequencies sorted.

ggplot(sfq[1:20,], mapping = aes(x = reorder(words, freqs), y = freqs)) +
  geom_bar(stat= "identity", fill=rgb(0/255,191/255,196/255)) +
  coord_flip() +
  scale_colour_hue() +
  labs(x= "Words", title = "20 Most Frequent Words (Essay8 Subset)") +
  theme(panel.background = element_blank(), axis.ticks.x = element_blank(),axis.ticks.y = element_blank())

  #guides(fill = guide_legend(title = ""))
#  geom_text(aes(label= freqs), colour= "white",hjust= 1.25, size= 5.0)

We can make a word cloud, which also depicts term frequencies.

library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(sfq$words,sfq$freqs, min.freq = 1, max.words = 30, colors=blues9) #c(rgb(0/255,191/255,196/255),rgb(248/255,118/255,109/255)))

We can find association scores in \([0,1]\) between one word of interest and the other terms in the TDM.

library(tm)
associations<-findAssocs(tdm, 'private', 0.11)
associations<-as.data.frame(associations) 
associations$terms<-row.names(associations)
associations$terms<-factor(associations$terms, levels=associations$terms)

ggplot(associations, aes(y=terms)) +
  geom_point(aes(x=private), data=associations, size=1) +
  theme(text=element_text(size=10), axis.title.y=element_blank(), panel.background = element_blank())

Clustering

We can view the clustering of words by first removing terms (rows) that have lots of zeroes (sparsity).

library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.12.0
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
library(circlize)
## ========================================
## circlize version 0.4.8
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: http://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization 
##   in R. Bioinformatics 2014.
## ========================================
tdm2 <- removeSparseTerms(tdm, sparse=0.95) #the lower the sparse parameter, the fewer words are selected
#dim(tdm2)
hc <- hclust(dist(tdm2, method="euclidean"), method="complete")
#plot(hc)
hcd <- as.dendrogram(hc)
#clusMember <- cutree(hc,4) #choosing 4 clusters
hcd<-color_labels(hcd,4, col = c( rgb(0/255,191/255,196/255),rgb(19/255,41/255,75/255), rgb(255/255,74/255,39/255), rgb(248/255,118/255,109/255)))
hcd<-color_branches(hcd,4, col = c( rgb(0/255,191/255,196/255),rgb(19/255,41/255,75/255), rgb(255/255,74/255,39/255), rgb(248/255,118/255,109/255)))

plot(hcd, main = "Essay 8 Subset", type = "triangle",yaxt='n')

circlize_dendrogram(hcd, labels_track_height = 0.5, dend_track_height = 0.4)

Sentiment Analysis

For sentiment analysis (extracting emotional intent from text), we run into several difficulties due to lexicons (dictionaries of words and their accepted meanings), cultural meanings of words, and social and temporal changes in word usage.

The textbook uses the qdap package for polarity scoring, but this requires Java. If you don’t have Java (like me), it’s not going to work. Lots of computers these days don’t need Java, and it takes up a decent chunk of hard drive space.
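
As a Java-free alternative, here is a minimal sketch using the tidytext package’s bing lexicon, which labels words as positive or negative. This is my own sketch, not the textbook’s method, and it assumes tidytext is installed.

library(tidyverse)
library(tidytext) #assumed installed; provides unnest_tokens() and get_sentiments()
e8_words <- tibble(doc_id = si, text = okc$essay8[si]) %>%
  filter(!is.na(text)) %>%
  unnest_tokens(word, text) #one row per word per document

e8_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>% #attach positive/negative labels
  count(doc_id, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% #one column per sentiment label
  mutate(polarity = positive - negative) #crude per-document polarity score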

Text Classification

We’ve seen classifiers before, such as logistic regression and k Nearest Neighbors. In those methods we had a non-numeric response variable with categories; that type of response follows a binomial distribution (for 2 categories) or a multinomial distribution (for more than 2 categories). The response is unchanged in the text analysis paradigm; the subjects are documents. The only difference is that each independent variable (predictor) is now a unique term. We create a model matrix from the document term matrix (DTM). All of the data preparation steps we did before, including partitioning the data into training and testing sets and cleaning the text, are required for document classification.

Suppose we want to predict whether the users are single or not based on the way that they respond to Essay 9: “You should message me if…”.

library(tidyverse)
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv", 
    col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M"))) #takes about 20 seconds on laptop
head(okc)
## # A tibble: 6 x 31
##     age body_type diet  drinks drugs education essay0 essay1 essay2 essay3
##   <dbl> <chr>     <chr> <chr>  <chr> <chr>     <chr>  <chr>  <chr>  <chr> 
## 1    22 a little~ stri~ socia~ never working ~ "abou~ "curr~ "maki~ "the ~
## 2    35 average   most~ often  some~ working ~ "i am~ dedic~ "bein~ <NA>  
## 3    38 thin      anyt~ socia~ <NA>  graduate~ "i'm ~ "i ma~ "impr~ "my l~
## 4    23 thin      vege~ socia~ <NA>  working ~ i wor~ readi~ "play~ socia~
## 5    29 athletic  <NA>  socia~ never graduate~ "hey ~ work ~ "crea~ i smi~
## 6    29 average   most~ socia~ <NA>  graduate~ "i'm ~ "buil~ "imag~ "i ha~
## # ... with 21 more variables: essay4 <chr>, essay5 <chr>, essay6 <chr>,
## #   essay7 <chr>, essay8 <chr>, essay9 <chr>, ethnicity <chr>,
## #   height <dbl>, income <dbl>, job <chr>, last_online <dttm>,
## #   location <chr>, offspring <chr>, orientation <chr>, pets <chr>,
## #   religion <chr>, sex <chr>, sign <chr>, smokes <chr>, speaks <chr>,
## #   status <chr>
colnames(okc) <- tolower(colnames(okc))

length(okc$status)
## [1] 59946
okc$status2 <- recode(okc$status, available="single", married="not single", `seeing someone`="not single")
okc2 <- filter(okc, status2=="single" | status2=="not single")
table(okc2$status2) #now a binary response
## 
## not single     single 
##       2374      57562
rm(okc)

Split the data into training and testing sets.

#a preliminary text cleaning
e9 <- str_replace_all(okc2$essay9, "[\n]", " ")
e9 <- str_replace_all(e9, "<br|<em>|</strong>|<strong>|</em>|<p|</p>", " ")
okc2$e9 <- e9

set.seed(448)
id0 <- sample(nrow(okc2),500) #random sample of 500 rows for simplicity and quick results
okc500 <- okc2[id0,]

tots <- nrow(okc500)
set.seed(448)
ids <- sample(1:tots,round(tots*0.7))
length(ids)
## [1] 350
train.dat <- okc500[ids,c("status2","e9")]
test.dat<- okc500[-ids,c("status2","e9")]

table(train.dat$status2)
## 
## not single     single 
##         13        337
table(test.dat$status2)
## 
## not single     single 
##          8        142
#rm(okc2)

Clean the text data.

library(tm)

cleanthedata <- function(x){
  x <- tolower(x)
  x <- removeWords(x, stopwords('en'))
  x <- removePunctuation(x)
  x <- stripWhitespace(x)
  return(x)
}

train.dat.x <- cleanthedata(train.dat$e9)

Now put the text into model matrix form as a DTM. The match.matrix() function below builds a DTM and, when given an original matrix, conforms the new DTM’s vocabulary to it; we need this later so the test set’s columns match the training set’s.

match.matrix <- function(text.col, original.matrix=NULL, weighting=weightTf){
  control <- list(weighting=weighting)
  training.col <- sapply(as.vector(text.col,mode='character'), function(x){iconv(x, to='UTF8', sub='byte')})
  corpus <- VCorpus(VectorSource(training.col))
  matrix <- DocumentTermMatrix(corpus, control=control)
  if (!is.null(original.matrix)) {
    #terms in the original matrix that are missing from the new one
    terms <- colnames(original.matrix[, which(!colnames(original.matrix) %in% colnames(matrix))])
    weight <- 0
    if (attr(original.matrix,'weighting')[2] =='tfidf')
      weight <- 0.000000001 #tiny nonzero weight so tf-idf weighting is preserved
    amat <- matrix(weight, nrow=nrow(matrix), ncol=length(terms))
    colnames(amat) <- terms
    rownames(amat) <- rownames(matrix)
    #keep only terms seen in the original matrix, then append the filler columns
    fixed <- as.DocumentTermMatrix(cbind(matrix[, which(colnames(matrix) %in% colnames(original.matrix))], amat), weighting=weighting)
    matrix <- fixed
  }
  matrix <- matrix[, sort(colnames(matrix))]
  gc()
  return(matrix)
}

With this many predictors the DTM is large but mostly zeroes, so let’s store it as a sparse matrix with the Matrix package’s Matrix() function; the zero entries are stored implicitly (and print as “.”), which makes the matrix much easier to handle.

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 3.0-1
train.dtm <- match.matrix(train.dat.x, weighting=tm::weightTfIdf)
## Warning in weighting(x): empty document(s): 10 11 13 24 56 62 70 72 76 77
## 81 84 90 92 93 96 104 105 116 125 127 129 130 137 140 141 145 146 151 159
## 164 170 171 172 182 183 199 208 210 211 216 221 223 225 226 228 230 232 234
## 247 252 254 257 261 262 263 264 270 271 273 274 277 283 285 288 295 299 301
## 302 305 308 322 335 343 346
train.matrix <- Matrix(as.matrix(train.dtm), sparse=TRUE)
rm(train.dat.x)

Now we can fit a lasso logistic regression model for the binary response, using 10-fold cross-validation to select the tuning parameter with the lowest misclassification rate. For a binary response coerced to a factor, the last level in alphabetical order is the target class (the level being predicted).

cv <- cv.glmnet(train.matrix, y=as.factor(train.dat$status2), alpha=1, family='binomial', nfolds=10, intercept=FALSE, type.measure ='class')
#table(train.dat$status2)
plot(cv)

Model performance on the training set.

#training
preds<-predict(cv,train.matrix,type="class", s=cv$lambda.1se)
cm <- table(preds, train.dat$status2) # confusion matrix
mean(preds!=train.dat$status2) # misclassification rate
## [1] 0.2885714
mean(preds==train.dat$status2) #classification rate
## [1] 0.7114286
sum(train.dat$status2=="single")/length(train.dat$status2) #response proportion
## [1] 0.9628571
rm(train.matrix,preds,cm)

Model performance on the testing set.

test.matrix<-Matrix(as.matrix(match.matrix(cleanthedata(test.dat$e9), weighting=tm::weightTfIdf, original.matrix=train.dtm)), sparse=TRUE)
## Warning in weighting(x): empty document(s): 6 11 16 18 19 33 34 41 44 45 47
## 51 52 54 67 68 70 78 85 86 87 89 90 96 99 103 111 113 114 115 132 135 140
## 147 149
## Warning in weighting(x): empty document(s): 6 11 16 18 19 33 34 41 44 45 47
## 51 52 54 67 68 70 78 85 86 87 89 90 96 99 103 111 113 114 115 132 134 135
## 140 147 149
## Warning in weighting(x): unreferenced term(s): 11cd 12a 2010 4055 40s
## 420 500lbs 50s abilities absurdity accept accompanies acquainted acronyms
## across actively actual add addition admit adorable advanced advertising
## advice affectionate affinity afterwards aging agreement alabamian album
## alcohol alfred alone alternative although angeles anime anoush ans answer
## answers anthing antimatch antonio anybody anywho apartment apologies
## appearence apply appreciates approach architecture arent arrondissements
## artisan artist artistic artists ask asking aspire associated astronomy
## astute atleast attach attracted attraction audiance available avoid awarded
## away awkward back bad baked balk ballpark banks banter bar bars based
## bdsm beach beam beard becomes bed bedroom beers behind benefit berkeley
## bets beware beyond biking bingo bitching black blog boardgames boarding
## bombs bond bong bonuses books bookstore boot booty boring bother bought
## boulder boundaries bout bowl braces brainstorm brave bribing bridge
## bridges brilliant bros brunch brushed bulk burger burning business butch
## cake calling calls cam came camp campfolk candy cape capoeira car cared
## caring carlos carry cart cascading cat catch catches cats cellphone centric
## challenge change characterize chase chat chatting cheap checked cheers
## cheese chess children chose chuckle cigarettes claim class classical cleans
## clearly clever cliche climbing clingy close closer clown clubbar clubbing
## cmon coat cocreation coddled coffeedrink coherently coladas collaborate
## colors comes comic committed communicative compassionate compel complain
## completely compliment concert concerts conscious conscioussuper consider
## consideration considering content contents control conversationalist
## converse convo cooking corners correct correctness countries covers
## cracks create creating creativity crew criterion critical crossed cthulhu
## cuba cuddly curiosity current customer cutie cuz dabbling damn daniels
## dark darkness darn daters days deal dealbreaker deathspeed debt decent
## dedicated demon depressing depth descent describe desired destination
## determined develop deviants dialogue die died dinner direction disagree
## disappointed discover discovered discuss disdain disgust distance
## distant dnd dodgers doesnt dolores dont doors dork dorky dotty douchebag
## drama drawn dream dreams drinking drinks drinkscoffee driving drugs
## dude duke dumb dunes duper dynamic dyslexic early easily easy easygoing
## eating ecstatic educated effort eggs elbow ella elsewhere elusive email
## emails embrace emotional emotionally encouraged end endings endless
## energized energy engage english enjoyed enjoys enthusiastic entire
## entrepreneurial epicurean episode equal err esp etc european evenkeeled
## event events ever everythings evolve example excels excess exchanged
## excitement expansive expecting experienced exploration exploring express
## extremely extroverted eyes fabulous facial failures fairly fake fall
## familiar fangs fantastic farcompelled farmers fashioned faster father
## favor favors feelings fell femininity ferry festivals fey fickle figure
## figured finally financially finding fingers flattery flipflops flirting
## flirts focus folk following follows foodie foods foot football forest
## form fortune forum found fourhand freaky freespirited friendly fucking
## fulfilling fulfillment fullest funloving future gal gave gee geeks
## geeky generally ghia giant girls given giving glances glasses goals
## goes golddigger goofball goofy google gorgeous gorgonzola gosling gotta
## gotten grab gracious grade grades grammar graveyard ground groups growth
## gt70 guess guinea guitar guylol gym habits haha haiku haikus hair haired
## halfbrain handsize handsome handstands handy hanging hangout happened
## happy hard hardworking harry hate havoc head headed headgames heal health
## healthcare healthy hearing heart heated heavy heck hell helping helps
## hide hikerun hilarity hills hipster hissing hit hmm hole homemade homework
## homophones honesty honor honorable hoodrat hopefully hopeless hoping
## horror hotcutesexy however hrefhttpenwikipediaorgwikiprincesselointaine
## hrefhttpwordsmithorgwordsprincesselointainehtml
## hrefhttpwwwyoutubecomwatchvqudqbdxtskyamplistpl062714bd0745e016ampfeatureplppplayall
## hrefhttpxkcdcom1027 hrefinterestsfreaksfreaks hrefinterestsgeeksgeeks
## hrefinterestsprojectproject httpwwwyoutubecomwatchvao68m8ppqky
## httpwwwyoutubecomwatchvnlmtoeiquo humble hurt hwp icing ideas
## ignored ilink imaginative importantly impressed includes
## individuals info infoenthusiasm inner inside inspire institutions
## integrity intellect intelligence intense intention intentions
## interests22i27mlonesomesincemonkeydied27s0adrivinghalfcrazyrightfootfollowsleftaround0athinkdarnthingslazy22
## interestsartistsartists interestsartsyartsy interestschampagnechampagne
## interestscreativecreative interestsdragdrag interestseducatededucated
## interestsgenderfuckgenderfuckers interestskinksterskinksters
## interestsmindfulnessmindfulness interestsmusiciansmusicians interface
## internet interwebs intimate intimidated intresting introverted involvement
## jack jest jesus job jobims joining joint joking journey joy judgement
## judges jump justice justplain karman kayak keen keha kick kilter kinda
## kindhearted kings kink kinky kisses knolls knowing knowledge knowledgeable
## knows knw korea laid lake lands lately later laughs laws laying lazy
## learned learning left leftist legal legged legs length letting levels
## lick lie lied lifes lifestyle liked liking list listening locals lointaine
## lointaines lolinternet lonesome longing longterm loose los lost lots
## low lower machines macho mainly maintain mammals managing manga manners
## marathons margaritas marital markets marrow matchcom matter mature meaning
## meaningful measure meat medicine mediocre mellow members memorable men
## mention merritt merry mesh messages messaging metal meter metropolis middle
## mile miles mindset mine minus miserably misery mite modern moment money
## monkey monogamy month morning motel mother motivated mouse mouth movie
## movies mundane museums mustache mutual name naturally nature necessarily
## necessity needy negotiable neither nerd nerdywittysarcastic neuron
## neverending nevermind night nonmonogamous normal norms note noted nothing
## noticed oakland ocean odds offended offense often oil okc okcupid omnivore
## ongoing opening opposite order organized oriented original outcome outdoors
## owe owner owning paddle painting pants paper papers paragraph parameters
## parenthetical parker parts party passport past path patience paying pbr
## peaks peeked peoples per percentage percentages perfectly performance
## perhaps period permission personality personally photos physical physics
## piano pick pickup picky pieces pig pina piqued pizza plain plan plane
## planning plans plate playful plays pleasant pleasepretty point pointlessly
## polite political politically politics poly popular porsche possibilites
## possibilities post pot potter practice predetermine presenting presumptive
## pretend price pride primarily princess priorities prioritiesone privacy
## professional profiles progressive project projectoutingactivityadventure
## projects proper provide prufrock psychologically punch punctuation punk
## puppies puppy purse quality queens queer question questions quick quickly
## quiet quite rad raise ramifications random rate raw readily realize realty
## reason reasonably reasons recently recipes recommendations recommended
## refridgerator relate relationships relaxed relaxing remark renovating
## renting reparte reply require reserved resonance responding responsible
## rest reward rich ridecreative ridiculous rings ritual rivers road roadtrip
## robbing rocking role romantically romps room rooms ropesbondage route rule
## run running rush ryan sabotaging safe salsa sarah sarcastic sassy says
## scabs scare scene science scifi scorpios search second section secure seem
## seems selfsummary selfsupporting sensitive sensuality sentence sentences
## series serve settle shady shark sharks shaved shift shoot shop shops shore
## short showsevents side sight sightreading sign silence simply sincere
## sincerity sing sisters sit sites situations sketch ski skiing skull sky
## skyscraper slopes slow smack small smell snatching snob snow social society
## soirees someones sometime sometimes somewhat soon soonev
dim(test.matrix)
## [1]  150 1705
#testing
testpreds<-predict(cv,test.matrix,type='class', s=cv$lambda.min)
cmt <- table(testpreds, test.dat$status2) # confusion matrix
mean(testpreds!=test.dat$status2) # misclassification rate
## [1] 0.32
mean(testpreds==test.dat$status2) # classification rate
## [1] 0.68
sum(test.dat$status2=="single")/length(test.dat$status2) #response proportion
## [1] 0.9466667
rm(testpreds,cmt)

Please check https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for ideas on how to predict categories that appear much less frequently than others (imbalanced data); one such idea is sketched below.
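
For instance, undersampling balances the classes by keeping only as many majority-class rows as there are minority-class rows. Here is a minimal sketch of my own (not from the linked post), using the training data from above:

set.seed(448)
minority <- filter(train.dat, status2 == "not single")
majority <- filter(train.dat, status2 == "single")
balanced <- bind_rows(minority, sample_n(majority, nrow(minority))) #equal class counts
table(balanced$status2)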

5. Make a recommendation or decision based on the analysis.

This part depends on your Problem and Plan. Answering the questions you posed is what helps the audience most. Making recommendations is an extension of answering those questions: you suggest that the audience do something concrete with your analysis.