Generalized Linear Models (GLM)
GLM: Linear Regression
GLM: Logistic Regression
GLM: Count Regression
Broad class of models including linear regression, logistic regression, and other discrete-response models
GLM is the acronym for generalized linear models
Made up of 3 components
Parameter estimates come from maximum likelihood
Here are some examples of response variables, their distributions, and link functions.
Response Type | Distribution | Link Function | \(g(\mu)\) |
---|---|---|---|
continuous | normal | identity | \(g(\mu)=\mu\) |
binary or events/trials | binomial | logit | \(g(\mu)=\mathrm{log} \big( \mu/(1-\mu) \big)\) |
count | poisson | log | \(g(\mu)=\mathrm{log} \big(\mu \big)\) |
count with some extreme values | gamma | reciprocal | \(g(\mu)=1/\mu\) |
We’ll work with the SBA Loans Data.
library(tidyverse)
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
SBA <- read_csv("https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv",
col_types = cols(ApprovalDate = col_date(format = "%d-%b-%y"),
BalanceGross = col_number(), ChgOffDate = col_date(format = "%d-%b-%y"),
ChgOffPrinGr = col_number(), DisbursementDate = col_date(format = "%d-%b-%y"),
DisbursementGross = col_number(),
ApprovalFY = col_integer(),
GrAppv = col_number(), SBA_Appv = col_number()))
## Warning: 18 parsing failures.
## row col expected actual file
## 699733 ApprovalFY no trailing characters A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 704031 ApprovalFY no trailing characters A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 705376 ApprovalFY no trailing characters A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 710382 ApprovalFY no trailing characters A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## 713246 ApprovalFY no trailing characters A 'https://uofi.box.com/shared/static/vi37omgitiaa2yyplrom779qvwk1g14x.csv'
## ...... .......... ...................... ...... .........................................................................
## See problems(...) for more details.
Next, we make simple cleaning steps, create a subset of the data which is much smaller than the original, and remove the original large data set.
colnames(SBA) <- tolower(colnames(SBA))
SBA[,28] <- log(SBA$disbursementgross)
colnames(SBA)[28]<-"logdisbursementgross"
set.seed(448)
sba <- SBA[sample(1:dim(SBA)[1],200),]
rm(SBA) #remove the original large data file
Attempts to answer questions such as
response \(y\) is continuous
residuals \((y-\hat{y})\) ~ \(N(\mu,\sigma^2)\)
predictor(s) are \(x_j\) and may be continuous or binary
We are modeling \(\mathrm{E}(Y)=\beta_0 + \beta_1 x_{1}+ \ldots + \beta_p x_{p}\)
We are predicting \(\hat{\mu} = \hat\beta_0 + \hat\beta_1 x_{1}+ \ldots + \hat\beta_p x_{p}\)
s1 <- select(sba, term, noemp, createjob, retainedjob, sba_appv, grappv, disbursementgross, logdisbursementgross)
dim(s1)
## [1] 200 8
cor(s1)
## term noemp createjob retainedjob
## term 1.00000000 0.141462735 0.2765796 0.075721031
## noemp 0.14146274 1.000000000 0.3816089 -0.003400163
## createjob 0.27657957 0.381608898 1.0000000 0.360524016
## retainedjob 0.07572103 -0.003400163 0.3605240 1.000000000
## sba_appv 0.45963460 0.265908673 0.3620452 0.125686786
## grappv 0.42965160 0.207493987 0.3221656 0.136934332
## disbursementgross 0.41119624 0.205734901 0.3117202 0.149545039
## logdisbursementgross 0.55776902 0.166107375 0.2542944 0.107471651
## sba_appv grappv disbursementgross
## term 0.4596346 0.4296516 0.4111962
## noemp 0.2659087 0.2074940 0.2057349
## createjob 0.3620452 0.3221656 0.3117202
## retainedjob 0.1256868 0.1369343 0.1495450
## sba_appv 1.0000000 0.9895666 0.9788813
## grappv 0.9895666 1.0000000 0.9916146
## disbursementgross 0.9788813 0.9916146 1.0000000
## logdisbursementgross 0.7116531 0.7081799 0.7199597
## logdisbursementgross
## term 0.5577690
## noemp 0.1661074
## createjob 0.2542944
## retainedjob 0.1074717
## sba_appv 0.7116531
## grappv 0.7081799
## disbursementgross 0.7199597
## logdisbursementgross 1.0000000
pairs(s1)
pairs(select(s1, sba_appv, grappv))
car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + sba_appv + grappv, data=s1))
## term noemp createjob retainedjob sba_appv grappv
## 1.338369 1.383718 1.567531 1.209614 66.835055 62.498746
#library(car)
car::vif(lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1))
## term noemp createjob retainedjob grappv
## 1.261148 1.215218 1.498213 1.184595 1.311217
#traditional linear regression
lm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1)
##
## Call:
## lm(formula = logdisbursementgross ~ term + noemp + createjob +
## retainedjob + grappv, data = s1)
##
## Coefficients:
## (Intercept) term noemp createjob retainedjob
## 1.041e+01 5.100e-03 1.883e-04 -6.307e-03 3.096e-03
## grappv
## 2.492e-06
#glm: linear regression
lin <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1, family=gaussian)
summary(lin)
##
## Call:
## glm(formula = logdisbursementgross ~ term + noemp + createjob +
## retainedjob + grappv, family = gaussian, data = s1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4285 -0.3700 0.1042 0.5729 1.8396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.041e+01 1.063e-01 97.903 < 2e-16 ***
## term 5.100e-03 8.458e-04 6.029 8.16e-09 ***
## noemp 1.883e-04 7.455e-04 0.253 0.801
## createjob -6.307e-03 1.228e-02 -0.514 0.608
## retainedjob 3.096e-03 1.035e-02 0.299 0.765
## grappv 2.492e-06 2.296e-07 10.854 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.6939444)
##
## Null deviance: 321.25 on 199 degrees of freedom
## Residual deviance: 134.63 on 194 degrees of freedom
## AIC: 502.41
##
## Number of Fisher Scoring iterations: 2
library(rsq,modelr)
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
names(influence(lin))
## [1] "hat" "coefficients" "sigma" "dev.res"
## [5] "pear.res"
fivenum(abs(influence(lin)$dev.res))
## 123 46 199 21 159
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
fivenum(abs(influence(lin)$pear.res))
## 123 46 199 21 159
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
names(summary(lin))
## [1] "call" "terms" "family" "deviance"
## [5] "aic" "contrasts" "df.residual" "null.deviance"
## [9] "df.null" "iter" "deviance.resid" "coefficients"
## [13] "aliased" "dispersion" "df" "cov.unscaled"
## [17] "cov.scaled"
fivenum(abs(summary(lin)$deviance.resid))
## 123 46 199 21 159
## 0.0027204 0.2149730 0.4916502 0.8273653 3.4284854
plot(1:length(summary(lin)$deviance.resid),summary(lin)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)
ri <- which(abs(summary(lin)$deviance.resid)>3)
plot(logdisbursementgross ~ grappv, data=s1)
points(logdisbursementgross[ri] ~ grappv[ri], data=s1, col=c("blue","orange"), pch=16)
s1wori1 <- s1[-ri[1],]
s1wori2 <- s1[-ri[2],]
linwori1 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori1, family=gaussian)
linwori2 <- glm(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data=s1wori2, family=gaussian)
rsq::rsq(lin, adj=TRUE, data=s1) #adjusted R-squared
## [1] 0.5701362
rsq::rsq(linwori1, adj=TRUE, data=s1wori1) #adjusted R-squared
## [1] 0.6539762
rsq::rsq(linwori2, adj=TRUE, data=s1wori2) #adjusted R-squared
## [1] 0.5870418
Check model fit and diagnostics again.
If not satisfied with the model, you can attempt model selection.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
MASS::stepAIC(lin, direction = "both", trace = FALSE) #stepwise selection
##
## Call: glm(formula = logdisbursementgross ~ term + grappv, family = gaussian,
## data = s1)
##
## Coefficients:
## (Intercept) term grappv
## 1.042e+01 5.038e-03 2.479e-06
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 321.3
## Residual Deviance: 134.8 AIC: 496.7
MASS::stepAIC(lin, direction = "forward", trace = FALSE) #forward selection
##
## Call: glm(formula = logdisbursementgross ~ term + noemp + createjob +
## retainedjob + grappv, family = gaussian, data = s1)
##
## Coefficients:
## (Intercept) term noemp createjob retainedjob
## 1.041e+01 5.100e-03 1.883e-04 -6.307e-03 3.096e-03
## grappv
## 2.492e-06
##
## Degrees of Freedom: 199 Total (i.e. Null); 194 Residual
## Null Deviance: 321.3
## Residual Deviance: 134.6 AIC: 502.4
MASS::stepAIC(lin, direction = "backward", trace = FALSE) #backward selection
##
## Call: glm(formula = logdisbursementgross ~ term + grappv, family = gaussian,
## data = s1)
##
## Coefficients:
## (Intercept) term grappv
## 1.042e+01 5.038e-03 2.479e-06
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 321.3
## Residual Deviance: 134.8 AIC: 496.7
library(leaps)
comps <- leaps::regsubsets(logdisbursementgross ~ term + noemp + createjob + retainedjob + grappv, data = s1, nbest=4)
plot(comps,scale="adjr2")
plot(comps,scale="bic")
plot(comps,scale="Cp")
#car::subsets(comps, statistic="rss") #legend may not be visible
lin$coefficients
## (Intercept) term noemp createjob retainedjob
## 1.040590e+01 5.099629e-03 1.882764e-04 -6.307249e-03 3.096399e-03
## grappv
## 2.491616e-06
modelr::rmse(lin, data=s1) #RMSE
## [1] 0.8204426
rsq::rsq(lin,adj=FALSE,data=s1) #R-squared
## [1] 0.5809368
rsq::rsq(lin,adj=TRUE,data=s1) #adjusted R-squared
## [1] 0.5701362
Attempts to answer questions such as
response is \(y = 0 \textrm{ or } 1\)
\(y\) ~ \(Ber(\pi)\)
predictor(s) are \(x_j\) and may be categorical or continuous
We are modeling \(\mathrm{logit}(\pi)=\mathrm{log}(\frac{\pi}{1-\pi})=\beta_0 + \beta_1 x_{1}+ \ldots + \beta_p x_{p}\)
We are predicting \(\hat{\pi} = \frac{1}{1+e^{-(\hat\beta_0 + \hat\beta_1 x_{1}+ \ldots + \hat\beta_p x_{p})}}\)
Detect any multicollinearity and remove those predictors from the model before fitting.
Fit the model.
s2 <- dplyr::select(mutate(sba, defaultbin = ifelse(mis_status=="CHGOFF", 1, 0), ruralbin = ifelse(urbanrural==2, 1, 0), newbizbin = ifelse(newexist==2, 1, 0), lowdocbin = ifelse(lowdoc=="Y", 1, 0)), defaultbin, term, grappv, lowdocbin, newbizbin, ruralbin)
s2 <- s2[-which(is.na(s2)),]
dim(s2)
## [1] 199 6
logi <- glm(factor(defaultbin) ~ term+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=binomial, data=s2)
summary(logi)
##
## Call:
## glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) +
## factor(newbizbin) + factor(ruralbin), family = binomial,
## data = s2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.37514 -0.68471 -0.31033 -0.05937 2.82170
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.938e-01 4.789e-01 0.822 0.41088
## term -1.726e-02 6.473e-03 -2.667 0.00765 **
## grappv -5.495e-06 2.702e-06 -2.033 0.04200 *
## factor(lowdocbin)1 -2.234e+00 1.059e+00 -2.110 0.03487 *
## factor(newbizbin)1 4.319e-01 4.622e-01 0.935 0.34999
## factor(ruralbin)1 -3.935e-01 8.167e-01 -0.482 0.62990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.79 on 198 degrees of freedom
## Residual deviance: 139.84 on 193 degrees of freedom
## AIC: 151.84
##
## Number of Fisher Scoring iterations: 7
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
length(s2$defaultbin)
## [1] 199
length(logi$fitted.values)
## [1] 199
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
table(predicted = ifelse(predict(logi, type = "response") > 0.5, 1, 0), actual = s2$defaultbin) #classification table (confusion matrix)
## actual
## predicted 0 1
## 0 162 29
## 1 4 4
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291
sum(s2$defaultbin)/length(s2$defaultbin) #prior response rate
## [1] 0.1658291
names(influence(logi))
## [1] "hat" "coefficients" "sigma" "dev.res"
## [5] "pear.res"
fivenum(abs(influence(logi)$dev.res))
## 150 165 62 66 154
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
fivenum(abs(influence(logi)$pear.res))
## 150 165 62 66 154
## 0.0001735545 0.1478580768 0.4187092099 0.6949778818 7.2505521891
names(summary(logi))
## [1] "call" "terms" "family" "deviance"
## [5] "aic" "contrasts" "df.residual" "null.deviance"
## [9] "df.null" "iter" "deviance.resid" "coefficients"
## [13] "aliased" "dispersion" "df" "cov.unscaled"
## [17] "cov.scaled"
fivenum(abs(summary(logi)$deviance.resid))
## 150 165 62 66 154
## 0.0002454432 0.2079731586 0.5683981718 0.8877455845 2.8217011447
plot(1:length(summary(logi)$deviance.resid),summary(logi)$deviance.resid)
abline(h=-2, lty=2)
abline(h=0)
abline(h=2, lty=2)
Check model fit and diagnostics again.
If not satisfied with the model, you can attempt model selection.
#library(MASS)
MASS::stepAIC(logi, direction = "both", trace = FALSE) #stepwise selection
##
## Call: glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin),
## family = binomial, data = s2)
##
## Coefficients:
## (Intercept) term grappv
## 4.874e-01 -1.720e-02 -5.630e-06
## factor(lowdocbin)1
## -2.129e+00
##
## Degrees of Freedom: 198 Total (i.e. Null); 195 Residual
## Null Deviance: 178.8
## Residual Deviance: 141 AIC: 149
MASS::stepAIC(logi, direction = "forward", trace = FALSE) #forward selection
##
## Call: glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin) +
## factor(newbizbin) + factor(ruralbin), family = binomial,
## data = s2)
##
## Coefficients:
## (Intercept) term grappv
## 3.938e-01 -1.726e-02 -5.495e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -2.234e+00 4.319e-01 -3.935e-01
##
## Degrees of Freedom: 198 Total (i.e. Null); 193 Residual
## Null Deviance: 178.8
## Residual Deviance: 139.8 AIC: 151.8
MASS::stepAIC(logi, direction = "backward", trace = FALSE) #backward selection
##
## Call: glm(formula = factor(defaultbin) ~ term + grappv + factor(lowdocbin),
## family = binomial, data = s2)
##
## Coefficients:
## (Intercept) term grappv
## 4.874e-01 -1.720e-02 -5.630e-06
## factor(lowdocbin)1
## -2.129e+00
##
## Degrees of Freedom: 198 Total (i.e. Null); 195 Residual
## Null Deviance: 178.8
## Residual Deviance: 141 AIC: 149
logi$coefficients #MLEs
## (Intercept) term grappv
## 3.937918e-01 -1.726428e-02 -5.495462e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -2.233606e+00 4.319310e-01 -3.935468e-01
exp(coef(logi)) #exp(MLEs)
## (Intercept) term grappv
## 1.4825918 0.9828839 0.9999945
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## 0.1071414 1.5402289 0.6746597
pROC::auc(roc(s2$defaultbin, ifelse(predict(logi, type = "response") > 0.5, 1, 0))) #AUC
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5486
rsq::rsq(logi, adj=FALSE, data=s2) #R-squared
## [1] 0.1970723
rsq::rsq(logi, adj=TRUE, data=s2) #adjusted R-squared
## [1] 0.1762711
mean(s2$defaultbin != ifelse(predict(logi, type = "response") > 0.5, 1, 0)) #misclassification rate
## [1] 0.1658291
When interpreting the logistic regression model results, it’s easy to focus on the odds ratio results.
Odds ratio are bounded from \([0,\infty)\)
An odds ratio of 1 indicates independence which means the effect this predictor has on the response is practically useless.
If 1 is within the odds ratio confidence interval, then the effect is not practically nor statistically useful.
Odds ratio \(< 1\) indicate that the predictor tends to decrease the odds of the response level of interest
Odds ratio \(> 1\) indicate that the predictor tends to increase the odds of the response level of interest
Now suppose the following is true: - binary response \(y\) with event=1 (success) or event=0 (failure) - numeric predictor \(W\)
- categorical predictor \(X\) with reference coding (Yes or No categories)
Then the interpretation might be " Holding all other predictors constant,
- a 1-unit increase in \(W\) predicts an increase in the odds of event=1 by exp(\(\hat\beta_W\)) times
- the estimated odds of event=1 for the ‘Yes’ category is exp(\(\hat\beta_X\)) times the estimated odds for the ‘No’ category. "
This indicates that the model can predict the response perfectly. Implying
the unique MLEs do not exist
extremely large parameter estimates
In R, you may see the following warning in red text “fitted probabilities numerically 0 or 1 occurred.”
Potential causes are:
multicollinearity
sparsity: not all of the response levels are observed for each of the predictors (or their categories) included in the model
sample size is small
response level of interest is rare
Potential workarounds:
modeling the exact logistic regression (computationally intensive!)
http://support.sas.com/kb/22/599.html for more general guidance
http://documentation.sas.com/?docsetId=statug&docsetVersion=14.3&docsetTarget=statug_logistic_examples14.htm&locale=en for better understanding
To use the gamma model, we should observe the variance will be close to the mean squared
To use the Poisson model, we should observe the variance will be close to the mean
Detect any multicollinearity and remove those predictors from the model before fitting.
Fit the model.
is.integer(sba$disbursementgross)
## [1] FALSE
s3 <- mutate(sba, defaultbin = ifelse(mis_status=="CHGOFF", 1, 0), ruralbin = ifelse(urbanrural==2, 1, 0), newbizbin = ifelse(newexist==2, 1, 0), lowdocbin = ifelse(lowdoc=="Y", 1, 0), dg = as.integer(round(sba$disbursementgross)))
s3 <- s3[-which(is.na(s3)),]
colnames(s3)
## [1] "loannr_chkdgt" "name" "city"
## [4] "state" "zip" "bank"
## [7] "bankstate" "naics" "approvaldate"
## [10] "approvalfy" "term" "noemp"
## [13] "newexist" "createjob" "retainedjob"
## [16] "franchisecode" "urbanrural" "revlinecr"
## [19] "lowdoc" "chgoffdate" "disbursementdate"
## [22] "disbursementgross" "balancegross" "mis_status"
## [25] "chgoffpringr" "grappv" "sba_appv"
## [28] "logdisbursementgross" "defaultbin" "ruralbin"
## [31] "newbizbin" "lowdocbin" "dg"
is.integer(s3$dg)
## [1] TRUE
pois <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=poisson, data=s3)
summary(pois)
##
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = poisson, data = s3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -498.77 -211.76 -62.50 92.05 963.18
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.118e+01 4.186e-04 26713.45 <2e-16 ***
## term 4.504e-03 1.839e-06 2449.52 <2e-16 ***
## noemp 2.049e-04 9.880e-07 207.39 <2e-16 ***
## createjob 3.678e-03 1.847e-05 199.06 <2e-16 ***
## retainedjob -4.907e-03 2.789e-05 -175.94 <2e-16 ***
## grappv 1.245e-06 2.077e-10 5993.91 <2e-16 ***
## factor(lowdocbin)1 -6.485e-01 8.178e-04 -793.00 <2e-16 ***
## factor(newbizbin)1 2.632e-02 4.487e-04 58.66 <2e-16 ***
## factor(ruralbin)1 -1.947e-02 6.281e-04 -30.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 50004643 on 199 degrees of freedom
## Residual deviance: 12344211 on 191 degrees of freedom
## AIC: 12346883
##
## Number of Fisher Scoring iterations: 5
modelr::rmse(pois, data=s3) #RMSE
## [1] 347160.1
rsq::rsq(pois,adj=FALSE,data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois,adj=TRUE,data=s3) #adjusted R-squared
## [1] 0.7983932
summary(influence(pois)$dev.res)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -498.77 -211.76 -62.50 -37.36 92.05 963.18
Check model fit and diagnostics again.
If not satisfied with the model, you can attempt model selection.
MASS::stepAIC(pois, direction = "both", trace = FALSE) #stepwise selection
##
## Call: glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = poisson, data = s3)
##
## Coefficients:
## (Intercept) term noemp
## 1.118e+01 4.504e-03 2.049e-04
## createjob retainedjob grappv
## 3.678e-03 -4.907e-03 1.245e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -6.485e-01 2.632e-02 -1.947e-02
##
## Degrees of Freedom: 199 Total (i.e. Null); 191 Residual
## Null Deviance: 5e+07
## Residual Deviance: 12340000 AIC: 12350000
MASS::stepAIC(pois, direction = "forward", trace = FALSE) #forward selection
##
## Call: glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = poisson, data = s3)
##
## Coefficients:
## (Intercept) term noemp
## 1.118e+01 4.504e-03 2.049e-04
## createjob retainedjob grappv
## 3.678e-03 -4.907e-03 1.245e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -6.485e-01 2.632e-02 -1.947e-02
##
## Degrees of Freedom: 199 Total (i.e. Null); 191 Residual
## Null Deviance: 5e+07
## Residual Deviance: 12340000 AIC: 12350000
MASS::stepAIC(pois, direction = "backward", trace = FALSE) #backward selection
##
## Call: glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = poisson, data = s3)
##
## Coefficients:
## (Intercept) term noemp
## 1.118e+01 4.504e-03 2.049e-04
## createjob retainedjob grappv
## 3.678e-03 -4.907e-03 1.245e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -6.485e-01 2.632e-02 -1.947e-02
##
## Degrees of Freedom: 199 Total (i.e. Null); 191 Residual
## Null Deviance: 5e+07
## Residual Deviance: 12340000 AIC: 12350000
Now suppose the following is true: - count response \(y\) - numeric predictor \(W\)
- categorical predictor \(X\) with reference coding (Yes or No categories)
Then the interpretation might be " Holding all other predictors constant,
- a 1-unit increase in \(W\) predicts an increase in the log count of the response by \(\hat\beta_W\)
- the predicted log count of the response for the ‘Yes’ category is \(\hat\beta_X\) higher (or lower if \(\hat\beta_X\) is negative) than the predicted log count for the ‘No’ category. "
pois$coefficients #MLEs
## (Intercept) term noemp
## 1.118193e+01 4.503829e-03 2.048979e-04
## createjob retainedjob grappv
## 3.677625e-03 -4.907495e-03 1.245080e-06
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## -6.485168e-01 2.632085e-02 -1.946734e-02
exp(coef(pois)) #exp(MLEs)
## (Intercept) term noemp
## 7.182097e+04 1.004514e+00 1.000205e+00
## createjob retainedjob grappv
## 1.003684e+00 9.951045e-01 1.000001e+00
## factor(lowdocbin)1 factor(newbizbin)1 factor(ruralbin)1
## 5.228206e-01 1.026670e+00 9.807209e-01
rsq::rsq(pois, adj=FALSE, data=s3) #R-squared
## [1] 0.806498
rsq::rsq(pois, adj=TRUE, data=s3) #adjusted R-squared
## [1] 0.7983932
Overdispersion is a phenomenon that sometimes occurs in data that are modeled with Poisson distributions that adversely affects model parameter estimates. It means that there is a much larger variation in the data than expected. Dispersion \(\phi\), which represents the residual deviance divided by degrees of freedom, is expected to be equal to 1 for any given Poisson model.
For the Poisson case, if the dispersion estimate after fitting is far from 1, then we have a dispersion issue in our model that needs addressing.
- data is overdispersed if the dispersion estimate \(> 1\)
- data is underdispersed if the dispersion estimate \(< 1\)
We can model for overdispersion by adjusting the scale parameter of the model with a quasi-Poisson model.
summary(pois) #model without adjusting for dispersion
##
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = poisson, data = s3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -498.77 -211.76 -62.50 92.05 963.18
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.118e+01 4.186e-04 26713.45 <2e-16 ***
## term 4.504e-03 1.839e-06 2449.52 <2e-16 ***
## noemp 2.049e-04 9.880e-07 207.39 <2e-16 ***
## createjob 3.678e-03 1.847e-05 199.06 <2e-16 ***
## retainedjob -4.907e-03 2.789e-05 -175.94 <2e-16 ***
## grappv 1.245e-06 2.077e-10 5993.91 <2e-16 ***
## factor(lowdocbin)1 -6.485e-01 8.178e-04 -793.00 <2e-16 ***
## factor(newbizbin)1 2.632e-02 4.487e-04 58.66 <2e-16 ***
## factor(ruralbin)1 -1.947e-02 6.281e-04 -30.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 50004643 on 199 degrees of freedom
## Residual deviance: 12344211 on 191 degrees of freedom
## AIC: 12346883
##
## Number of Fisher Scoring iterations: 5
summary(pois)$deviance/summary(pois)$df.residual #raw dispersion estimate
## [1] 64629.38
pois2 <- glm( dg ~ term+ noemp+ createjob+ retainedjob+ grappv+ factor(lowdocbin)+ factor(newbizbin)+ factor(ruralbin), family=quasipoisson, data=s3)
#model with adjusting for dispersion
summary(pois2)
##
## Call:
## glm(formula = dg ~ term + noemp + createjob + retainedjob + grappv +
## factor(lowdocbin) + factor(newbizbin) + factor(ruralbin),
## family = quasipoisson, data = s3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -498.77 -211.76 -62.50 92.05 963.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.118e+01 1.110e-01 100.712 < 2e-16 ***
## term 4.504e-03 4.877e-04 9.235 < 2e-16 ***
## noemp 2.049e-04 2.621e-04 0.782 0.43525
## createjob 3.678e-03 4.900e-03 0.750 0.45389
## retainedjob -4.907e-03 7.399e-03 -0.663 0.50794
## grappv 1.245e-06 5.510e-08 22.598 < 2e-16 ***
## factor(lowdocbin)1 -6.485e-01 2.169e-01 -2.990 0.00316 **
## factor(newbizbin)1 2.632e-02 1.190e-01 0.221 0.82520
## factor(ruralbin)1 -1.947e-02 1.666e-01 -0.117 0.90711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 70355.79)
##
## Null deviance: 50004643 on 199 degrees of freedom
## Residual deviance: 12344211 on 191 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
#What changed between the results of the two models - pois vs pois2?
An alternative to using the scale-adjusted quasi-Poisson model is to use a different model altogether that is still relevant for count data such as the negative binomial, which is related to the gamma distribution. See https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf for more details.