Things Covered in this Week’s Notes

Generalized Linear Models (GLM)

Here are some examples of response variables, their distributions, and link functions.

Response Type                    Distribution   Link Function \(g(\mu)\)
continuous                       normal         identity: \(g(\mu)=\mu\)
binary or events/trials          binomial       logit: \(g(\mu)=\mathrm{log} \big( \mu/(1-\mu) \big)\)
count                            Poisson        log: \(g(\mu)=\mathrm{log} \big(\mu \big)\)
count with some extreme values   gamma          reciprocal: \(g(\mu)=1/\mu\)
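
Each row of the table corresponds to a family argument in R’s glm(). As a minimal sketch, with simulated toy data (the variables and coefficients below are made up purely for illustration):

set.seed(1)
toy <- data.frame(x = runif(100, 0, 10))
toy$ycont  <- 2 + 0.5 * toy$x + rnorm(100)             #continuous response
toy$ybin   <- rbinom(100, 1, plogis(-1 + 0.3 * toy$x)) #binary response
toy$ycount <- rpois(100, exp(0.2 + 0.15 * toy$x))      #count response

glm(ycont ~ x, data = toy, family = gaussian)     #identity link (the default)
glm(ybin ~ x, data = toy, family = binomial)      #logit link (the default)
glm(ycount ~ x, data = toy, family = poisson)     #log link (the default)
glm(ycount + 1 ~ x, data = toy, family = Gamma)   #inverse (reciprocal) link (the default); Gamma requires a strictly positive response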

We’ll work with the SBA Loans Data.

library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
SBA <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", 
                                             col_types = cols(ApprovalDate = col_date(format = "%d-%b-%y"), 
                                                              BalanceGross = col_number(), ChgOffDate = col_date(format = "%d-%b-%y"), 
                                                              ChgOffPrinGr = col_number(), DisbursementDate = col_date(format = "%d-%b-%y"), 
                                                              DisbursementGross = col_number(), 
                                                              ApprovalFY = col_integer(),
                                                              GrAppv = col_number(), SBA_Appv = col_number()))
## Warning: 18 parsing failures.
##    row        col               expected actual                                                                      file
## 699733 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 704031 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 705376 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 710382 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 713246 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## ...... .......... ...................... ...... .........................................................................
## See problems(...) for more details.

Next, we apply a few simple cleaning steps, create a much smaller subset of the data, and remove the original large data set.

colnames(SBA) <- tolower(colnames(SBA))

SBA$logdisbursementgross <- log(SBA$disbursementgross) #add the log-transformed disbursement as a new column
set.seed(448)
sba <- SBA[sample(nrow(SBA), 200), ] #draw a reproducible subsample of 200 loans

rm(SBA) #remove the original large data file

GLM: Linear Regression

A Linear Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.
s1 <- select(sba, term, noemp, createjob, retainedjob, sba_appv, grappv, disbursementgross, logdisbursementgross)
dim(s1)
## [1] 200   8
cor(s1)
##                            term        noemp createjob  retainedjob
## term                 1.00000000  0.141462735 0.2765796  0.075721031
## noemp                0.14146274  1.000000000 0.3816089 -0.003400163
## createjob            0.27657957  0.381608898 1.0000000  0.360524016
## retainedjob          0.07572103 -0.003400163 0.3605240  1.000000000
## sba_appv             0.45963460  0.265908673 0.3620452  0.125686786
## grappv               0.42965160  0.207493987 0.3221656  0.136934332
## disbursementgross    0.41119624  0.205734901 0.3117202  0.149545039
## logdisbursementgross 0.55776902  0.166107375 0.2542944  0.107471651
##                       sba_appv    grappv disbursementgross
## term                 0.4596346 0.4296516         0.4111962
## noemp                0.2659087 0.2074940         0.2057349
## createjob            0.3620452 0.3221656         0.3117202
## retainedjob          0.1256868 0.1369343         0.1495450
## sba_appv             1.0000000 0.9895666         0.9788813
## grappv               0.9895666 1.0000000         0.9916146
## disbursementgross    0.9788813 0.9916146         1.0000000
## logdisbursementgross 0.7116531 0.7081799         0.7199597
##                      logdisbursementgross
## term                            0.5577690
## noemp                           0.1661074
## createjob                       0.2542944
## retainedjob                     0.1074717
## sba_appv                        0.7116531
## grappv                          0.7081799
## disbursementgross               0.7199597
## logdisbursementgross            1.0000000
pairs(s1)

pairs(select(s1, sba_appv, grappv))

car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + sba_appv + grappv, data=s1))
##        term       noemp   createjob retainedjob    sba_appv      grappv 
##    1.338369    1.383718    1.567531    1.209614   66.835055   62.498746
#library(car)
#sba_appv and grappv are nearly collinear (r = 0.99 above), which inflates both VIFs past 60;
#dropping sba_appv brings every remaining VIF near 1 (values above ~10 are a common red flag)
car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1))
##        term       noemp   createjob retainedjob      grappv 
##    1.261148    1.215218    1.498213    1.184595    1.311217
  • Fit the model.
#traditional linear regression
lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1)
## 
## Call:
## lm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, data = s1)
## 
## Coefficients:
## (Intercept)         term        noemp    createjob  retainedjob  
##   1.041e+01    5.100e-03    1.883e-04   -6.307e-03    3.096e-03  
##      grappv  
##   2.492e-06
#glm: linear regression
lin <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1, family=gaussian)
  • Check the model diagnostics and model fit:
    • Root mean square error
    • \(R^2\)
    • adjusted \(R^2\)
summary(lin)
## 
## Call:
## glm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, family = gaussian, data = s1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4285  -0.3700   0.1042   0.5729   1.8396  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.041e+01  1.063e-01  97.903  < 2e-16 ***
## term         5.100e-03  8.458e-04   6.029 8.16e-09 ***
## noemp        1.883e-04  7.455e-04   0.253    0.801    
## createjob   -6.307e-03  1.228e-02  -0.514    0.608    
## retainedjob  3.096e-03  1.035e-02   0.299    0.765    
## grappv       2.492e-06  2.296e-07  10.854  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6939444)
## 
##     Null deviance: 321.25  on 199  degrees of freedom
## Residual deviance: 134.63  on 194  degrees of freedom
## AIC: 502.41
## 
## Number of Fisher Scoring iterations: 2
library(rsq)
library(modelr)
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
  • Check for influential observations; remove any influential points and refit the model.
names(influence(lin))
## [1] "hat"          "coefficients" "sigma"        "dev.res"     
## [5] "pear.res"
fivenum(abs(influence(lin)$dev.res))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
fivenum(abs(influence(lin)$pear.res))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
names(summary(lin))
##  [1] "call"           "terms"          "family"         "deviance"      
##  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
##  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
## [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
## [17] "cov.scaled"
fivenum(abs(summary(lin)$deviance.resid))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
plot(1:length(summary(lin)$deviance.resid),summary(lin)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)

ri <- which(abs(summary(lin)$deviance.resid)>3)

plot(logdisbursementgross ~ grappv, data=s1)
points(logdisbursementgross[ri] ~ grappv[ri], data=s1, col=c("blue","orange"), pch=16)

s1wori1 <- s1[-ri[1],]
s1wori2 <- s1[-ri[2],]
linwori1 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori1, family=gaussian)
linwori2 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori2, family=gaussian)
rsq::rsq(lin, adj=TRUE, data=s1) #adjusted R-squared
## [1] 0.5701362
rsq::rsq(linwori1, adj=TRUE, data=s1wori1) #adjusted R-squared
## [1] 0.6539762
rsq::rsq(linwori2, adj=TRUE, data=s1wori2) #adjusted R-squared
## [1] 0.5870418
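
Cook’s distance is another common influence measure we could use here; a quick sketch with the same fit (the 4/n cutoff is just a conventional rule of thumb):

cd <- cooks.distance(lin) #one value per observation
which(cd > 4/nrow(s1)) #observations above the conventional 4/n cutoff
plot(cd, type = "h")
abline(h = 4/nrow(s1), lty = 2)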
  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria (AIC, BIC, Mallows’ \(C_p\)) or personal choice.
    • Automatic selection (forward, backward, stepwise) decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
MASS::stepAIC(lin, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = logdisbursementgross ~ term + grappv, family = gaussian, 
##     data = s1)
## 
## Coefficients:
## (Intercept)         term       grappv  
##   1.042e+01    5.038e-03    2.479e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.8     AIC: 496.7
MASS::stepAIC(lin, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, family = gaussian, data = s1)
## 
## Coefficients:
## (Intercept)         term        noemp    createjob  retainedjob  
##   1.041e+01    5.100e-03    1.883e-04   -6.307e-03    3.096e-03  
##      grappv  
##   2.492e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  194 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.6     AIC: 502.4
MASS::stepAIC(lin, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = logdisbursementgross ~ term + grappv, family = gaussian, 
##     data = s1)
## 
## Coefficients:
## (Intercept)         term       grappv  
##   1.042e+01    5.038e-03    2.479e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.8     AIC: 496.7
library(leaps)
comps <- leaps::regsubsets(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data = s1, nbest=4)
plot(comps,scale="adjr2")

plot(comps,scale="bic")

plot(comps,scale="Cp")

#car::subsets(comps, statistic="rss") #legend may not be visible
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.
lin$coefficients
##   (Intercept)          term         noemp     createjob   retainedjob 
##  1.040590e+01  5.099629e-03  1.882764e-04 -6.307249e-03  3.096399e-03 
##        grappv 
##  2.491616e-06
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
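
With the model in hand, predicting for a new loan is straightforward; the values below are hypothetical, chosen only to show the mechanics:

newloan <- data.frame(term = 120, noemp = 5, createjob = 2, retainedjob = 3, grappv = 50000) #hypothetical new loan
predict(lin, newdata = newloan) #predicted log disbursement
exp(predict(lin, newdata = newloan)) #back-transformed to dollars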

GLM: Logistic Regression

A Logistic Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.

  • Fit the model.

s2 <- sba %>%
  mutate(defaultbin = ifelse(mis_status == "CHGOFF", 1, 0),
         ruralbin   = ifelse(urbanrural == 2, 1, 0),
         newbizbin  = ifelse(newexist == 2, 1, 0),
         lowdocbin  = ifelse(lowdoc == "Y", 1, 0)) %>%
  dplyr::select(defaultbin, term, grappv, lowdocbin, newbizbin, ruralbin)
s2 <- s2[-which(is.na(s2)),] #caution: which() on is.na(data.frame) returns matrix positions, not row numbers; s2[complete.cases(s2), ] is the safer idiom
dim(s2)
## [1] 199   6
logi <- glm(factor(defaultbin) ~ term+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=binomial, data=s2)
  • Check the model diagnostics and model fit:
    • Area under the curve (AUC)
    • generalized \(R^2\)
    • max-rescaled \(R^2\)
    • a classification table
summary(logi)
## 
## Call:
## glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) + 
##     factor(newbizbin) + factor(ruralbin), family = binomial, 
##     data = s2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.37514  -0.68471  -0.31033  -0.05937   2.82170  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         3.938e-01  4.789e-01   0.822  0.41088   
## term               -1.726e-02  6.473e-03  -2.667  0.00765 **
## grappv             -5.495e-06  2.702e-06  -2.033  0.04200 * 
## factor(lowdocbin)1 -2.234e+00  1.059e+00  -2.110  0.03487 * 
## factor(newbizbin)1  4.319e-01  4.622e-01   0.935  0.34999   
## factor(ruralbin)1  -3.935e-01  8.167e-01  -0.482  0.62990   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178.79  on 198  degrees of freedom
## Residual deviance: 139.84  on 193  degrees of freedom
## AIC: 151.84
## 
## Number of Fisher Scoring iterations: 7
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
length(s2$defaultbin)
## [1] 199
length(logi$fitted.values)
## [1] 199
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
table(predicted = ifelse(predict(logi, type = "response") > 0.5, 1, 0), actual = s2$defaultbin) #classification table (confusion matrix)
##          actual
## predicted   0   1
##         0 162  29
##         1   4   4
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291
sum(s2$defaultbin)/length(s2$defaultbin) #prior response rate; note it equals the misclassification rate above, so at the 0.5 threshold the model does no better than always predicting 0
## [1] 0.1658291
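
As an aside, AUC is usually computed from the fitted probabilities rather than the thresholded 0/1 classes, since thresholding discards the ranking information; a sketch with the same fit:

pROC::auc(pROC::roc(s2$defaultbin, predict(logi, type = "response"))) #AUC from fitted probabilities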
  • Check for influential observations; remove any influential points and refit the model.
names(influence(logi))
## [1] "hat"          "coefficients" "sigma"        "dev.res"     
## [5] "pear.res"
fivenum(abs(influence(logi)$dev.res))
##          150          165           62           66          154 
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
fivenum(abs(influence(logi)$pear.res))
##          150          165           62           66          154 
## 0.0001735545 0.1478580768 0.4187092099 0.6949778818 7.2505521891
names(summary(logi))
##  [1] "call"           "terms"          "family"         "deviance"      
##  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
##  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
## [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
## [17] "cov.scaled"
fivenum(abs(summary(logi)$deviance.resid))
##          150          165           62           66          154 
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
plot(1:length(summary(logi)$deviance.resid),summary(logi)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)
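
As in the linear case, we could drop the most extreme observation (row 154 above) and refit to check whether the fit is sensitive to it; a sketch:

ri2 <- which.max(abs(influence(logi)$pear.res)) #observation with the largest Pearson residual
logi2 <- update(logi, data = s2[-ri2, ]) #refit without it
rsq::rsq(logi2, adj = TRUE) #compare with the adjusted R-squared above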

  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria or personal choice.
    • Automatic selection decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
#library(MASS)
MASS::stepAIC(logi, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin), 
##     family = binomial, data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          4.874e-01          -1.720e-02          -5.630e-06  
## factor(lowdocbin)1  
##         -2.129e+00  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  195 Residual
## Null Deviance:       178.8 
## Residual Deviance: 141   AIC: 149
MASS::stepAIC(logi, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) + 
##     factor(newbizbin) + factor(ruralbin), family = binomial, 
##     data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          3.938e-01          -1.726e-02          -5.495e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -2.234e+00           4.319e-01          -3.935e-01  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  193 Residual
## Null Deviance:       178.8 
## Residual Deviance: 139.8     AIC: 151.8
MASS::stepAIC(logi, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin), 
##     family = binomial, data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          4.874e-01          -1.720e-02          -5.630e-06  
## factor(lowdocbin)1  
##         -2.129e+00  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  195 Residual
## Null Deviance:       178.8 
## Residual Deviance: 141   AIC: 149
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.
logi$coefficients #MLEs
##        (Intercept)               term             grappv 
##       3.937918e-01      -1.726428e-02      -5.495462e-06 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##      -2.233606e+00       4.319310e-01      -3.935468e-01
exp(coef(logi)) #exp(MLEs)
##        (Intercept)               term             grappv 
##          1.4825918          0.9828839          0.9999945 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##          0.1071414          1.5402289          0.6746597
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291

Interpreting the parameter estimates \(\hat\beta\).

When interpreting logistic regression results, most of the attention falls on the odds ratios.

  • Odds ratios are bounded on \([0,\infty)\).

  • An odds ratio of 1 indicates independence, meaning the predictor has essentially no effect on the odds of the response.

  • If 1 falls within the odds ratio’s confidence interval, then the effect is neither practically nor statistically useful.

  • An odds ratio \(< 1\) indicates that the predictor tends to decrease the odds of the response level of interest.

  • An odds ratio \(> 1\) indicates that the predictor tends to increase the odds of the response level of interest.

Now suppose the following is true:

  • binary response \(y\) with event=1 (success) or event=0 (failure)

  • numeric predictor \(W\)

  • categorical predictor \(X\) with reference coding (Yes or No categories)

Then the interpretation might be: “Holding all other predictors constant,

  • a 1-unit increase in \(W\) multiplies the estimated odds of event=1 by \(\exp(\hat\beta_W)\)

  • the estimated odds of event=1 for the ‘Yes’ category are \(\exp(\hat\beta_X)\) times the estimated odds for the ‘No’ category.”
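
A convenient way to see the odds ratios alongside confidence intervals is to exponentiate the coefficients and their profile-likelihood intervals; a sketch using the logi fit from above:

exp(cbind(OR = coef(logi), confint(logi))) #odds ratios with 95% profile-likelihood CIs; intervals covering 1 flag unhelpful predictors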

Warning! When complete separation or quasi-separation occurs…

This indicates that the model can predict the response perfectly, which implies that

  • unique MLEs do not exist

  • the parameter estimates become extremely large

In R, you may see the warning “fitted probabilities numerically 0 or 1 occurred” printed in red text.
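
A toy example that triggers this warning (the data are fabricated so that x separates y perfectly):

sep <- data.frame(x = 1:6, y = c(0, 0, 0, 1, 1, 1)) #x completely separates y
glm(y ~ x, data = sep, family = binomial) #warns that fitted probabilities of 0 or 1 occurred; the slope estimate blows up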

Potential causes are:

  • multicollinearity

  • sparsity: not all of the response levels are observed for each of the predictors (or their categories) included in the model

  • sample size is small

  • response level of interest is rare

Potential workarounds:

  • remove or combine the predictors (or sparse categories) responsible for the separation

  • collect more data, especially on the rare response level

  • use a penalized-likelihood fit such as Firth’s logistic regression (e.g., via the logistf package)

GLM: Count Regression

To use the gamma model, we should observe that the variance is roughly proportional to the mean squared.

To use the Poisson model, we should observe that the variance is close to the mean (equidispersion).
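
A quick empirical check of this variance-to-mean relationship for the disbursement amounts we are about to model (a sketch; Poisson expects the variance-to-mean ratio to be near 1, while gamma expects the variance to scale with the mean squared):

m <- mean(sba$disbursementgross, na.rm = TRUE)
v <- var(sba$disbursementgross, na.rm = TRUE)
c(mean = m, variance = v, var.to.mean = v/m, var.to.mean.sq = v/m^2)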

A Count Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.

  • Fit the model.

is.integer(sba$disbursementgross)
## [1] FALSE
s3 <- sba %>%
  mutate(defaultbin = ifelse(mis_status == "CHGOFF", 1, 0),
         ruralbin   = ifelse(urbanrural == 2, 1, 0),
         newbizbin  = ifelse(newexist == 2, 1, 0),
         lowdocbin  = ifelse(lowdoc == "Y", 1, 0),
         dg         = as.integer(round(disbursementgross)))
s3 <- s3[-which(is.na(s3)),] #caution: same fragile NA-removal idiom as above; complete.cases() would be too aggressive here because columns like chgoffdate are NA for most loans
colnames(s3)
##  [1] "loannr_chkdgt"        "name"                 "city"                
##  [4] "state"                "zip"                  "bank"                
##  [7] "bankstate"            "naics"                "approvaldate"        
## [10] "approvalfy"           "term"                 "noemp"               
## [13] "newexist"             "createjob"            "retainedjob"         
## [16] "franchisecode"        "urbanrural"           "revlinecr"           
## [19] "lowdoc"               "chgoffdate"           "disbursementdate"    
## [22] "disbursementgross"    "balancegross"         "mis_status"          
## [25] "chgoffpringr"         "grappv"               "sba_appv"            
## [28] "logdisbursementgross" "defaultbin"           "ruralbin"            
## [31] "newbizbin"            "lowdocbin"            "dg"
is.integer(s3$dg)
## [1] TRUE
pois <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=poisson, data=s3)
  • Check the model diagnostics and model fit:
    • Root mean square error
    • \(R^2\)
    • adjusted \(R^2\)
summary(pois)
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         1.118e+01  4.186e-04 26713.45   <2e-16 ***
## term                4.504e-03  1.839e-06  2449.52   <2e-16 ***
## noemp               2.049e-04  9.880e-07   207.39   <2e-16 ***
## createjob           3.678e-03  1.847e-05   199.06   <2e-16 ***
## retainedjob        -4.907e-03  2.789e-05  -175.94   <2e-16 ***
## grappv              1.245e-06  2.077e-10  5993.91   <2e-16 ***
## factor(lowdocbin)1 -6.485e-01  8.178e-04  -793.00   <2e-16 ***
## factor(newbizbin)1  2.632e-02  4.487e-04    58.66   <2e-16 ***
## factor(ruralbin)1  -1.947e-02  6.281e-04   -30.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: 12346883
## 
## Number of Fisher Scoring iterations: 5
modelr::rmse(pois, data=s3) #RMSE
## [1] 347160.1
rsq::rsq(pois,adj=FALSE,data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois,adj=TRUE,data=s3) #adjusted R-squared
## [1] 0.7983932
  • Check for influential observations; remove any influential points and refit the model.
summary(influence(pois)$dev.res)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -498.77 -211.76  -62.50  -37.36   92.05  963.18
  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria or personal choice.
    • Automatic selection decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
MASS::stepAIC(pois, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
MASS::stepAIC(pois, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
MASS::stepAIC(pois, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.

Now suppose the following is true:

  • count response \(y\)

  • numeric predictor \(W\)

  • categorical predictor \(X\) with reference coding (Yes or No categories)

Then the interpretation might be: “Holding all other predictors constant,

  • a 1-unit increase in \(W\) predicts an increase of \(\hat\beta_W\) in the log count of the response (equivalently, it multiplies the predicted count by \(\exp(\hat\beta_W)\))

  • the predicted log count of the response for the ‘Yes’ category is \(\hat\beta_X\) higher (or lower, if \(\hat\beta_X\) is negative) than the predicted log count for the ‘No’ category.”

pois$coefficients #MLEs
##        (Intercept)               term              noemp 
##       1.118193e+01       4.503829e-03       2.048979e-04 
##          createjob        retainedjob             grappv 
##       3.677625e-03      -4.907495e-03       1.245080e-06 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##      -6.485168e-01       2.632085e-02      -1.946734e-02
exp(coef(pois)) #exp(MLEs)
##        (Intercept)               term              noemp 
##       7.182097e+04       1.004514e+00       1.000205e+00 
##          createjob        retainedjob             grappv 
##       1.003684e+00       9.951045e-01       1.000001e+00 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##       5.228206e-01       1.026670e+00       9.807209e-01
rsq::rsq(pois, adj=FALSE, data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois, adj=TRUE, data=s3) #adjusted R-squared
## [1] 0.7983932

Overdispersion

Overdispersion is a phenomenon that sometimes occurs in data modeled with a Poisson distribution, and it adversely affects inference on the model parameters (the reported standard errors are too small). It means that there is much more variation in the data than the model expects. The dispersion \(\phi\), estimated here as the residual deviance divided by its degrees of freedom, is expected to be equal to 1 for a well-specified Poisson model.

For the Poisson case, if the dispersion estimate after fitting is far from 1, then we have a dispersion issue in our model that needs addressing:

  • the data are overdispersed if the dispersion estimate is \(> 1\)

  • the data are underdispersed if the dispersion estimate is \(< 1\)

We can account for overdispersion by adjusting the scale parameter of the model, i.e., by fitting a quasi-Poisson model.

summary(pois) #model without adjusting for dispersion
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         1.118e+01  4.186e-04 26713.45   <2e-16 ***
## term                4.504e-03  1.839e-06  2449.52   <2e-16 ***
## noemp               2.049e-04  9.880e-07   207.39   <2e-16 ***
## createjob           3.678e-03  1.847e-05   199.06   <2e-16 ***
## retainedjob        -4.907e-03  2.789e-05  -175.94   <2e-16 ***
## grappv              1.245e-06  2.077e-10  5993.91   <2e-16 ***
## factor(lowdocbin)1 -6.485e-01  8.178e-04  -793.00   <2e-16 ***
## factor(newbizbin)1  2.632e-02  4.487e-04    58.66   <2e-16 ***
## factor(ruralbin)1  -1.947e-02  6.281e-04   -30.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: 12346883
## 
## Number of Fisher Scoring iterations: 5
summary(pois)$deviance/summary(pois)$df.residual #raw dispersion estimate
## [1] 64629.38
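
The dispersion that summary() reports for the quasi-Poisson fit below is estimated from the Pearson statistic rather than the deviance, which is why it will not exactly match the deviance-based ratio above; a sketch of that calculation:

sum(residuals(pois, type = "pearson")^2)/df.residual(pois) #Pearson-based dispersion estimate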
pois2 <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=quasipoisson, data=s3)
 #model with adjusting for dispersion
summary(pois2)
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = quasipoisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.118e+01  1.110e-01 100.712  < 2e-16 ***
## term                4.504e-03  4.877e-04   9.235  < 2e-16 ***
## noemp               2.049e-04  2.621e-04   0.782  0.43525    
## createjob           3.678e-03  4.900e-03   0.750  0.45389    
## retainedjob        -4.907e-03  7.399e-03  -0.663  0.50794    
## grappv              1.245e-06  5.510e-08  22.598  < 2e-16 ***
## factor(lowdocbin)1 -6.485e-01  2.169e-01  -2.990  0.00316 ** 
## factor(newbizbin)1  2.632e-02  1.190e-01   0.221  0.82520    
## factor(ruralbin)1  -1.947e-02  1.666e-01  -0.117  0.90711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 70355.79)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
#What changed between the results of the two models - pois vs pois2?

An alternative to the scale-adjusted quasi-Poisson model is to use a different count model altogether, such as the negative binomial, which arises as a gamma mixture of Poisson distributions. See https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf for more details.
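
Since MASS is already loaded, a sketch of that alternative for our data might look like the following; glm.nb() estimates the extra dispersion parameter \(\theta\) along with the coefficients:

nb <- MASS::glm.nb(dg ~ term + noemp + createjob + retainedjob + grappv +
                     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), data = s3)
summary(nb)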