Things Covered in this Week’s Notes

Generalized Linear Models (GLM)

Here are some examples of response variables, their distributions, and link functions.

Response Type                    Distribution   Link Function \(g(\mu)\)
continuous                       normal         identity: \(g(\mu)=\mu\)
binary or events/trials          binomial       logit: \(g(\mu)=\mathrm{log} \big( \mu/(1-\mu) \big)\)
count                            Poisson        log: \(g(\mu)=\mathrm{log} \big(\mu \big)\)
count with some extreme values   gamma          reciprocal: \(g(\mu)=1/\mu\)
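
Each row of the table corresponds to a family argument in R’s glm(). As a minimal sketch, with simulated toy data (the variables and coefficients below are made up purely for illustration):

set.seed(1)
toy <- data.frame(x = runif(100, 0, 10))
toy$ycont  <- 2 + 0.5 * toy$x + rnorm(100)             #continuous response
toy$ybin   <- rbinom(100, 1, plogis(-1 + 0.3 * toy$x)) #binary response
toy$ycount <- rpois(100, exp(0.2 + 0.15 * toy$x))      #count response

glm(ycont ~ x, data = toy, family = gaussian)     #identity link (the default)
glm(ybin ~ x, data = toy, family = binomial)      #logit link (the default)
glm(ycount ~ x, data = toy, family = poisson)     #log link (the default)
glm(ycount + 1 ~ x, data = toy, family = Gamma)   #inverse (reciprocal) link (the default); Gamma requires a strictly positive response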

We’ll work with the SBA Loans Data.

library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
SBA <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv", 
                                             col_types = cols(ApprovalDate = col_date(format = "%d-%b-%y"), 
                                                              BalanceGross = col_number(), ChgOffDate = col_date(format = "%d-%b-%y"), 
                                                              ChgOffPrinGr = col_number(), DisbursementDate = col_date(format = "%d-%b-%y"), 
                                                              DisbursementGross = col_number(), 
                                                              ApprovalFY = col_integer(),
                                                              GrAppv = col_number(), SBA_Appv = col_number()))
## Warning: 18 parsing failures.
##    row        col               expected actual                                                                      file
## 699733 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 704031 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 705376 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 710382 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 713246 ApprovalFY no trailing characters      A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## ...... .......... ...................... ...... .........................................................................
## See problems(...) for more details.

Next, we apply a few simple cleaning steps, create a much smaller subset of the data, and remove the original large data set.

colnames(SBA) <- tolower(colnames(SBA))

SBA$logdisbursementgross <- log(SBA$disbursementgross) #add the log-transformed disbursement as a new column
set.seed(448)
sba <- SBA[sample(nrow(SBA), 200), ] #draw a reproducible subsample of 200 loans

rm(SBA) #remove the original large data file

GLM: Linear Regression

A Linear Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.
s1 <- select(sba, term, noemp, createjob, retainedjob, sba_appv, grappv, disbursementgross, logdisbursementgross)
dim(s1)
## [1] 200   8
cor(s1)
##                            term        noemp createjob  retainedjob
## term                 1.00000000  0.141462735 0.2765796  0.075721031
## noemp                0.14146274  1.000000000 0.3816089 -0.003400163
## createjob            0.27657957  0.381608898 1.0000000  0.360524016
## retainedjob          0.07572103 -0.003400163 0.3605240  1.000000000
## sba_appv             0.45963460  0.265908673 0.3620452  0.125686786
## grappv               0.42965160  0.207493987 0.3221656  0.136934332
## disbursementgross    0.41119624  0.205734901 0.3117202  0.149545039
## logdisbursementgross 0.55776902  0.166107375 0.2542944  0.107471651
##                       sba_appv    grappv disbursementgross
## term                 0.4596346 0.4296516         0.4111962
## noemp                0.2659087 0.2074940         0.2057349
## createjob            0.3620452 0.3221656         0.3117202
## retainedjob          0.1256868 0.1369343         0.1495450
## sba_appv             1.0000000 0.9895666         0.9788813
## grappv               0.9895666 1.0000000         0.9916146
## disbursementgross    0.9788813 0.9916146         1.0000000
## logdisbursementgross 0.7116531 0.7081799         0.7199597
##                      logdisbursementgross
## term                            0.5577690
## noemp                           0.1661074
## createjob                       0.2542944
## retainedjob                     0.1074717
## sba_appv                        0.7116531
## grappv                          0.7081799
## disbursementgross               0.7199597
## logdisbursementgross            1.0000000
pairs(s1)

pairs(select(s1, sba_appv, grappv))

car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + sba_appv + grappv, data=s1))
##        term       noemp   createjob retainedjob    sba_appv      grappv 
##    1.338369    1.383718    1.567531    1.209614   66.835055   62.498746
#library(car)
#sba_appv and grappv are nearly collinear (r = 0.99 above), which inflates both VIFs past 60;
#dropping sba_appv brings every remaining VIF near 1 (values above ~10 are a common red flag)
car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1))
##        term       noemp   createjob retainedjob      grappv 
##    1.261148    1.215218    1.498213    1.184595    1.311217
  • Fit the model.
#traditional linear regression
lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1)
## 
## Call:
## lm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, data = s1)
## 
## Coefficients:
## (Intercept)         term        noemp    createjob  retainedjob  
##   1.041e+01    5.100e-03    1.883e-04   -6.307e-03    3.096e-03  
##      grappv  
##   2.492e-06
#glm: linear regression
lin <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1, family=gaussian)
  • Check the model diagnostics and model fit:
    • Root mean square error
    • \(R^2\)
    • adjusted \(R^2\)
summary(lin)
## 
## Call:
## glm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, family = gaussian, data = s1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4285  -0.3700   0.1042   0.5729   1.8396  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.041e+01  1.063e-01  97.903  < 2e-16 ***
## term         5.100e-03  8.458e-04   6.029 8.16e-09 ***
## noemp        1.883e-04  7.455e-04   0.253    0.801    
## createjob   -6.307e-03  1.228e-02  -0.514    0.608    
## retainedjob  3.096e-03  1.035e-02   0.299    0.765    
## grappv       2.492e-06  2.296e-07  10.854  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6939444)
## 
##     Null deviance: 321.25  on 199  degrees of freedom
## Residual deviance: 134.63  on 194  degrees of freedom
## AIC: 502.41
## 
## Number of Fisher Scoring iterations: 2
library(rsq)
library(modelr)
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
  • Check for influential observations; remove any influential points and refit the model.
names(influence(lin))
## [1] "hat"          "coefficients" "sigma"        "dev.res"     
## [5] "pear.res"
fivenum(abs(influence(lin)$dev.res))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
fivenum(abs(influence(lin)$pear.res))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
names(summary(lin))
##  [1] "call"           "terms"          "family"         "deviance"      
##  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
##  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
## [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
## [17] "cov.scaled"
fivenum(abs(summary(lin)$deviance.resid))
##       123        46       199        21       159 
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
plot(1:length(summary(lin)$deviance.resid),summary(lin)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)

ri <- which(abs(summary(lin)$deviance.resid)>3)

plot(logdisbursementgross ~ grappv, data=s1)
points(logdisbursementgross[ri] ~ grappv[ri], data=s1, col=c("blue","orange"), pch=16)

s1wori1 <- s1[-ri[1],]
s1wori2 <- s1[-ri[2],]
linwori1 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori1, family=gaussian)
linwori2 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori2, family=gaussian)
rsq::rsq(lin, adj=TRUE, data=s1) #adjusted R-squared
## [1] 0.5701362
rsq::rsq(linwori1, adj=TRUE, data=s1wori1) #adjusted R-squared
## [1] 0.6539762
rsq::rsq(linwori2, adj=TRUE, data=s1wori2) #adjusted R-squared
## [1] 0.5870418
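
Cook’s distance is another common influence measure we could use here; a quick sketch with the same fit (the 4/n cutoff is just a conventional rule of thumb):

cd <- cooks.distance(lin) #one value per observation
which(cd > 4/nrow(s1)) #observations above the conventional 4/n cutoff
plot(cd, type = "h")
abline(h = 4/nrow(s1), lty = 2)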
  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria (AIC, BIC, Mallows’ \(C_p\)) or personal choice.
    • Automatic selection (forward, backward, stepwise) decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
MASS::stepAIC(lin, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = logdisbursementgross ~ term + grappv, family = gaussian, 
##     data = s1)
## 
## Coefficients:
## (Intercept)         term       grappv  
##   1.042e+01    5.038e-03    2.479e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.8     AIC: 496.7
MASS::stepAIC(lin, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = logdisbursementgross ~ term + noemp + createjob + 
##     retainedjob + grappv, family = gaussian, data = s1)
## 
## Coefficients:
## (Intercept)         term        noemp    createjob  retainedjob  
##   1.041e+01    5.100e-03    1.883e-04   -6.307e-03    3.096e-03  
##      grappv  
##   2.492e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  194 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.6     AIC: 502.4
MASS::stepAIC(lin, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = logdisbursementgross ~ term + grappv, family = gaussian, 
##     data = s1)
## 
## Coefficients:
## (Intercept)         term       grappv  
##   1.042e+01    5.038e-03    2.479e-06  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       321.3 
## Residual Deviance: 134.8     AIC: 496.7
library(leaps)
comps <- leaps::regsubsets(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data = s1, nbest=4)
plot(comps,scale="adjr2")

plot(comps,scale="bic")

plot(comps,scale="Cp")

#car::subsets(comps, statistic="rss") #legend may not be visible
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.
lin$coefficients
##   (Intercept)          term         noemp     createjob   retainedjob 
##  1.040590e+01  5.099629e-03  1.882764e-04 -6.307249e-03  3.096399e-03 
##        grappv 
##  2.491616e-06
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
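
With the model in hand, predicting for a new loan is straightforward; the values below are hypothetical, chosen only to show the mechanics:

newloan <- data.frame(term = 120, noemp = 5, createjob = 2, retainedjob = 3, grappv = 50000) #hypothetical new loan
predict(lin, newdata = newloan) #predicted log disbursement
exp(predict(lin, newdata = newloan)) #back-transformed to dollars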

GLM: Logistic Regression

A Logistic Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.

  • Fit the model.

s2 <- sba %>%
  mutate(defaultbin = ifelse(mis_status == "CHGOFF", 1, 0),
         ruralbin   = ifelse(urbanrural == 2, 1, 0),
         newbizbin  = ifelse(newexist == 2, 1, 0),
         lowdocbin  = ifelse(lowdoc == "Y", 1, 0)) %>%
  dplyr::select(defaultbin, term, grappv, lowdocbin, newbizbin, ruralbin)
s2 <- s2[-which(is.na(s2)),] #caution: which() on is.na(data.frame) returns matrix positions, not row numbers; s2[complete.cases(s2), ] is the safer idiom
dim(s2)
## [1] 199   6
logi <- glm(factor(defaultbin) ~ term+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=binomial, data=s2)
  • Check the model diagnostics and model fit:
    • Area under the curve (AUC)
    • generalized \(R^2\)
    • max-rescaled \(R^2\)
    • a classification table
summary(logi)
## 
## Call:
## glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) + 
##     factor(newbizbin) + factor(ruralbin), family = binomial, 
##     data = s2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.37514  -0.68471  -0.31033  -0.05937   2.82170  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         3.938e-01  4.789e-01   0.822  0.41088   
## term               -1.726e-02  6.473e-03  -2.667  0.00765 **
## grappv             -5.495e-06  2.702e-06  -2.033  0.04200 * 
## factor(lowdocbin)1 -2.234e+00  1.059e+00  -2.110  0.03487 * 
## factor(newbizbin)1  4.319e-01  4.622e-01   0.935  0.34999   
## factor(ruralbin)1  -3.935e-01  8.167e-01  -0.482  0.62990   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 178.79  on 198  degrees of freedom
## Residual deviance: 139.84  on 193  degrees of freedom
## AIC: 151.84
## 
## Number of Fisher Scoring iterations: 7
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
length(s2$defaultbin)
## [1] 199
length(logi$fitted.values)
## [1] 199
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
table(predicted = ifelse(predict(logi, type = "response") > 0.5, 1, 0), actual = s2$defaultbin) #classification table (confusion matrix)
##          actual
## predicted   0   1
##         0 162  29
##         1   4   4
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291
sum(s2$defaultbin)/length(s2$defaultbin) #prior response rate; note it equals the misclassification rate above, so at the 0.5 threshold the model does no better than always predicting 0
## [1] 0.1658291
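
As an aside, AUC is usually computed from the fitted probabilities rather than the thresholded 0/1 classes, since thresholding discards the ranking information; a sketch with the same fit:

pROC::auc(pROC::roc(s2$defaultbin, predict(logi, type = "response"))) #AUC from fitted probabilities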
  • Check for influential observations; remove any influential points and refit the model.
names(influence(logi))
## [1] "hat"          "coefficients" "sigma"        "dev.res"     
## [5] "pear.res"
fivenum(abs(influence(logi)$dev.res))
##          150          165           62           66          154 
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
fivenum(abs(influence(logi)$pear.res))
##          150          165           62           66          154 
## 0.0001735545 0.1478580768 0.4187092099 0.6949778818 7.2505521891
names(summary(logi))
##  [1] "call"           "terms"          "family"         "deviance"      
##  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
##  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
## [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
## [17] "cov.scaled"
fivenum(abs(summary(logi)$deviance.resid))
##          150          165           62           66          154 
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
plot(1:length(summary(logi)$deviance.resid),summary(logi)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)
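
As in the linear case, we could drop the most extreme observation (row 154 above) and refit to check whether the fit is sensitive to it; a sketch:

ri2 <- which.max(abs(influence(logi)$pear.res)) #observation with the largest Pearson residual
logi2 <- update(logi, data = s2[-ri2, ]) #refit without it
rsq::rsq(logi2, adj = TRUE) #compare with the adjusted R-squared above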

  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria or personal choice.
    • Automatic selection decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
#library(MASS)
MASS::stepAIC(logi, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin), 
##     family = binomial, data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          4.874e-01          -1.720e-02          -5.630e-06  
## factor(lowdocbin)1  
##         -2.129e+00  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  195 Residual
## Null Deviance:       178.8 
## Residual Deviance: 141   AIC: 149
MASS::stepAIC(logi, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) + 
##     factor(newbizbin) + factor(ruralbin), family = binomial, 
##     data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          3.938e-01          -1.726e-02          -5.495e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -2.234e+00           4.319e-01          -3.935e-01  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  193 Residual
## Null Deviance:       178.8 
## Residual Deviance: 139.8     AIC: 151.8
MASS::stepAIC(logi, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin), 
##     family = binomial, data = s2)
## 
## Coefficients:
##        (Intercept)                term              grappv  
##          4.874e-01          -1.720e-02          -5.630e-06  
## factor(lowdocbin)1  
##         -2.129e+00  
## 
## Degrees of Freedom: 198 Total (i.e. Null);  195 Residual
## Null Deviance:       178.8 
## Residual Deviance: 141   AIC: 149
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.
logi$coefficients #MLEs
##        (Intercept)               term             grappv 
##       3.937918e-01      -1.726428e-02      -5.495462e-06 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##      -2.233606e+00       4.319310e-01      -3.935468e-01
exp(coef(logi)) #exp(MLEs)
##        (Intercept)               term             grappv 
##          1.4825918          0.9828839          0.9999945 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##          0.1071414          1.5402289          0.6746597
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291

Interpreting the parameter estimates \(\hat\beta\).

When interpreting logistic regression results, most of the attention falls on the odds ratios.

  • Odds ratios are bounded on \([0,\infty)\).

  • An odds ratio of 1 indicates independence, meaning the predictor has essentially no effect on the odds of the response.

  • If 1 falls within the odds ratio’s confidence interval, then the effect is neither practically nor statistically useful.

  • An odds ratio \(< 1\) indicates that the predictor tends to decrease the odds of the response level of interest.

  • An odds ratio \(> 1\) indicates that the predictor tends to increase the odds of the response level of interest.

Now suppose the following is true:

  • binary response \(y\) with event=1 (success) or event=0 (failure)

  • numeric predictor \(W\)

  • categorical predictor \(X\) with reference coding (Yes or No categories)

Then the interpretation might be: “Holding all other predictors constant,

  • a 1-unit increase in \(W\) multiplies the estimated odds of event=1 by \(\exp(\hat\beta_W)\)

  • the estimated odds of event=1 for the ‘Yes’ category are \(\exp(\hat\beta_X)\) times the estimated odds for the ‘No’ category.”
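
A convenient way to see the odds ratios alongside confidence intervals is to exponentiate the coefficients and their profile-likelihood intervals; a sketch using the logi fit from above:

exp(cbind(OR = coef(logi), confint(logi))) #odds ratios with 95% profile-likelihood CIs; intervals covering 1 flag unhelpful predictors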

Warning! When complete separation or quasi-separation occurs…

This indicates that the model can predict the response perfectly, which implies that

  • unique MLEs do not exist

  • the parameter estimates become extremely large

In R, you may see the warning “fitted probabilities numerically 0 or 1 occurred” printed in red text.
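
A toy example that triggers this warning (the data are fabricated so that x separates y perfectly):

sep <- data.frame(x = 1:6, y = c(0, 0, 0, 1, 1, 1)) #x completely separates y
glm(y ~ x, data = sep, family = binomial) #warns that fitted probabilities of 0 or 1 occurred; the slope estimate blows up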

Potential causes are:

  • multicollinearity

  • sparsity: not all of the response levels are observed for each of the predictors (or their categories) included in the model

  • sample size is small

  • response level of interest is rare

Potential workarounds:

  • remove or combine the predictors (or sparse categories) responsible for the separation

  • collect more data, especially on the rare response level

  • use a penalized-likelihood fit such as Firth’s logistic regression (e.g., via the logistf package)

GLM: Count Regression

To use the gamma model, we should observe that the variance is roughly proportional to the mean squared.

To use the Poisson model, we should observe that the variance is close to the mean (equidispersion).
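
A quick empirical check of this variance-to-mean relationship for the disbursement amounts we are about to model (a sketch; Poisson expects the variance-to-mean ratio to be near 1, while gamma expects the variance to scale with the mean squared):

m <- mean(sba$disbursementgross, na.rm = TRUE)
v <- var(sba$disbursementgross, na.rm = TRUE)
c(mean = m, variance = v, var.to.mean = v/m, var.to.mean.sq = v/m^2)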

A Count Regression Modeling Strategy

  • Detect any multicollinearity and remove the offending predictors from the model before fitting.

  • Fit the model.

is.integer(sba$disbursementgross)
## [1] FALSE
s3 <- sba %>%
  mutate(defaultbin = ifelse(mis_status == "CHGOFF", 1, 0),
         ruralbin   = ifelse(urbanrural == 2, 1, 0),
         newbizbin  = ifelse(newexist == 2, 1, 0),
         lowdocbin  = ifelse(lowdoc == "Y", 1, 0),
         dg         = as.integer(round(disbursementgross)))
s3 <- s3[-which(is.na(s3)),] #caution: same fragile NA-removal idiom as above; complete.cases() would be too aggressive here because columns like chgoffdate are NA for most loans
colnames(s3)
##  [1] "loannr_chkdgt"        "name"                 "city"                
##  [4] "state"                "zip"                  "bank"                
##  [7] "bankstate"            "naics"                "approvaldate"        
## [10] "approvalfy"           "term"                 "noemp"               
## [13] "newexist"             "createjob"            "retainedjob"         
## [16] "franchisecode"        "urbanrural"           "revlinecr"           
## [19] "lowdoc"               "chgoffdate"           "disbursementdate"    
## [22] "disbursementgross"    "balancegross"         "mis_status"          
## [25] "chgoffpringr"         "grappv"               "sba_appv"            
## [28] "logdisbursementgross" "defaultbin"           "ruralbin"            
## [31] "newbizbin"            "lowdocbin"            "dg"
is.integer(s3$dg)
## [1] TRUE
pois <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=poisson, data=s3)
  • Check the model diagnostics and model fit:
    • Root mean square error
    • \(R^2\)
    • adjusted \(R^2\)
summary(pois)
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         1.118e+01  4.186e-04 26713.45   <2e-16 ***
## term                4.504e-03  1.839e-06  2449.52   <2e-16 ***
## noemp               2.049e-04  9.880e-07   207.39   <2e-16 ***
## createjob           3.678e-03  1.847e-05   199.06   <2e-16 ***
## retainedjob        -4.907e-03  2.789e-05  -175.94   <2e-16 ***
## grappv              1.245e-06  2.077e-10  5993.91   <2e-16 ***
## factor(lowdocbin)1 -6.485e-01  8.178e-04  -793.00   <2e-16 ***
## factor(newbizbin)1  2.632e-02  4.487e-04    58.66   <2e-16 ***
## factor(ruralbin)1  -1.947e-02  6.281e-04   -30.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: 12346883
## 
## Number of Fisher Scoring iterations: 5
modelr::rmse(pois, data=s3) #RMSE
## [1] 347160.1
rsq::rsq(pois,adj=FALSE,data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois,adj=TRUE,data=s3) #adjusted R-squared
## [1] 0.7983932
  • Check for influential observations; remove any influential points and refit the model.
summary(influence(pois)$dev.res)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -498.77 -211.76  -62.50  -37.36   92.05  963.18
  • Check model fit and diagnostics again.

  • If not satisfied with the model, you can attempt model selection.

    • When doing model selection, keep in mind that the pool of candidate models is large.
    • Human decisions may be based on information criteria or personal choice.
    • Automatic selection decisions may be based on each variable’s contribution to the F statistic or its analogue.
    • No single approach will reliably choose among all of these models.
    • Use more than one approach, and understand why certain predictors are entering your models at all.
    • The modeling goals are improved prediction and clear model interpretation.
MASS::stepAIC(pois, direction = "both", trace = FALSE) #stepwise selection
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
MASS::stepAIC(pois, direction = "forward", trace = FALSE) #forward selection (we start from the full model, so there is nothing to add and the full model is returned)
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
MASS::stepAIC(pois, direction = "backward", trace = FALSE) #backward selection
## 
## Call:  glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Coefficients:
##        (Intercept)                term               noemp  
##          1.118e+01           4.504e-03           2.049e-04  
##          createjob         retainedjob              grappv  
##          3.678e-03          -4.907e-03           1.245e-06  
## factor(lowdocbin)1  factor(newbizbin)1   factor(ruralbin)1  
##         -6.485e-01           2.632e-02          -1.947e-02  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  191 Residual
## Null Deviance:       5e+07 
## Residual Deviance: 12340000  AIC: 12350000
  • Once the final model has been selected, interpret the parameter estimates \(\hat\beta\) and assess prediction performance.

Now suppose the following is true:

  • count response \(y\)

  • numeric predictor \(W\)

  • categorical predictor \(X\) with reference coding (Yes or No categories)

Then the interpretation might be: “Holding all other predictors constant,

  • a 1-unit increase in \(W\) predicts an increase of \(\hat\beta_W\) in the log count of the response (equivalently, it multiplies the predicted count by \(\exp(\hat\beta_W)\))

  • the predicted log count of the response for the ‘Yes’ category is \(\hat\beta_X\) higher (or lower, if \(\hat\beta_X\) is negative) than the predicted log count for the ‘No’ category.”

pois$coefficients #MLEs
##        (Intercept)               term              noemp 
##       1.118193e+01       4.503829e-03       2.048979e-04 
##          createjob        retainedjob             grappv 
##       3.677625e-03      -4.907495e-03       1.245080e-06 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##      -6.485168e-01       2.632085e-02      -1.946734e-02
exp(coef(pois)) #exp(MLEs)
##        (Intercept)               term              noemp 
##       7.182097e+04       1.004514e+00       1.000205e+00 
##          createjob        retainedjob             grappv 
##       1.003684e+00       9.951045e-01       1.000001e+00 
## factor(lowdocbin)1 factor(newbizbin)1  factor(ruralbin)1 
##       5.228206e-01       1.026670e+00       9.807209e-01
rsq::rsq(pois, adj=FALSE, data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois, adj=TRUE, data=s3) #adjusted R-squared
## [1] 0.7983932

Overdispersion

Overdispersion is a phenomenon that sometimes occurs in data modeled with a Poisson distribution, and it adversely affects inference on the model parameters (the reported standard errors are too small). It means that there is much more variation in the data than the model expects. The dispersion \(\phi\), estimated here as the residual deviance divided by its degrees of freedom, is expected to be equal to 1 for a well-specified Poisson model.

For the Poisson case, if the dispersion estimate after fitting is far from 1, then we have a dispersion issue in our model that needs addressing:

  • the data are overdispersed if the dispersion estimate is \(> 1\)

  • the data are underdispersed if the dispersion estimate is \(< 1\)

We can account for overdispersion by adjusting the scale parameter of the model, i.e., by fitting a quasi-Poisson model.

summary(pois) #model without adjusting for dispersion
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = poisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         1.118e+01  4.186e-04 26713.45   <2e-16 ***
## term                4.504e-03  1.839e-06  2449.52   <2e-16 ***
## noemp               2.049e-04  9.880e-07   207.39   <2e-16 ***
## createjob           3.678e-03  1.847e-05   199.06   <2e-16 ***
## retainedjob        -4.907e-03  2.789e-05  -175.94   <2e-16 ***
## grappv              1.245e-06  2.077e-10  5993.91   <2e-16 ***
## factor(lowdocbin)1 -6.485e-01  8.178e-04  -793.00   <2e-16 ***
## factor(newbizbin)1  2.632e-02  4.487e-04    58.66   <2e-16 ***
## factor(ruralbin)1  -1.947e-02  6.281e-04   -30.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: 12346883
## 
## Number of Fisher Scoring iterations: 5
summary(pois)$deviance/summary(pois)$df.residual #raw dispersion estimate
## [1] 64629.38
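
The dispersion that summary() reports for the quasi-Poisson fit below is estimated from the Pearson statistic rather than the deviance, which is why it will not exactly match the deviance-based ratio above; a sketch of that calculation:

sum(residuals(pois, type = "pearson")^2)/df.residual(pois) #Pearson-based dispersion estimate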
pois2 <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=quasipoisson, data=s3)
 #model with adjusting for dispersion
summary(pois2)
## 
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv + 
##     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), 
##     family = quasipoisson, data = s3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -498.77  -211.76   -62.50    92.05   963.18  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.118e+01  1.110e-01 100.712  < 2e-16 ***
## term                4.504e-03  4.877e-04   9.235  < 2e-16 ***
## noemp               2.049e-04  2.621e-04   0.782  0.43525    
## createjob           3.678e-03  4.900e-03   0.750  0.45389    
## retainedjob        -4.907e-03  7.399e-03  -0.663  0.50794    
## grappv              1.245e-06  5.510e-08  22.598  < 2e-16 ***
## factor(lowdocbin)1 -6.485e-01  2.169e-01  -2.990  0.00316 ** 
## factor(newbizbin)1  2.632e-02  1.190e-01   0.221  0.82520    
## factor(ruralbin)1  -1.947e-02  1.666e-01  -0.117  0.90711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 70355.79)
## 
##     Null deviance: 50004643  on 199  degrees of freedom
## Residual deviance: 12344211  on 191  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
#What changed between the results of the two models - pois vs pois2?

An alternative to the scale-adjusted quasi-Poisson model is to use a different count model altogether, such as the negative binomial, which arises as a gamma mixture of Poisson distributions. See https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf for more details.
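
Since MASS is already loaded, a sketch of that alternative for our data might look like the following; glm.nb() estimates the extra dispersion parameter \(\theta\) along with the coefficients:

nb <- MASS::glm.nb(dg ~ term + noemp + createjob + retainedjob + grappv +
                     factor(lowdocbin) + factor(newbizbin) + factor(ruralbin), data = s3)
summary(nb)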