Why Linear Regression is not Ideal

Count data measures the frequency of an event. Because of this, count data is always discrete and non-negative. Linear regression, however, typically assumes the data is continuous and can take negative values.

Linear regression assumes that the relationship between the independent and dependent variables is linear and that the dependent variable is normally distributed with constant variance (homoscedasticity). These assumptions often do not hold true for count data, leading to several issues:

  1. Negative Predictions: Linear regression can produce negative predicted values, which are meaningless for count data.

  2. Heteroscedasticity: Count data often exhibits heteroscedasticity, where the variance increases with the mean. Recall that homoscedasticity (equal variance) must hold for linear regression to be valid.

  3. Non-normality: Linear regression assumes that the residuals (differences between observed and predicted values) are normally distributed, an assumption count data rarely satisfies.
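To make issue 1 concrete, here is a minimal sketch with made-up toy counts: fitting ordinary least squares to convex count data yields a line that dips below zero at low values of the predictor.

```python
import numpy as np

# Toy count data (illustrative): zeros at low x, rapid growth at high x
x = np.arange(1, 11, dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 2, 4, 7, 12], dtype=float)

# Ordinary least squares fit: y ≈ b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)

# The fitted line dips below zero for small x: a negative "count"
pred_at_x1 = b0 + b1 * 1.0
print(f"predicted count at x=1: {pred_at_x1:.2f}")
```

Even though every observed count is non-negative, the least-squares line predicts a negative count for the smallest x value.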

Hopefully, by now, you’re thinking, “Well Gianni, you have made it clear that linear regression is a terrible idea for my count data. Tell me more about this Poisson regression.”

Introducing: Poisson Regression

Poisson regression is designed explicitly for count data and assumes that the counts follow a Poisson distribution. The Poisson distribution models the probability of a given number of events happening in a fixed interval of time or space, given a known average rate of occurrence.

In Poisson regression, we have the following assumptions:

  1. Each observation is independent of the others (i.e. the number of ice cream cones sold today will not influence the number sold tomorrow).
  2. The mean and the variance of your model are identical. Formally, \(E(Y_i) = Var(Y_i)\). I will cover this concept in more depth, along with what to do when it is violated, in Part 2.
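The second assumption can be checked numerically. This sketch draws a large sample from a Poisson distribution and confirms that the sample mean and sample variance land on (nearly) the same value:

```python
import numpy as np

rng = np.random.default_rng(42)

# 100,000 draws from a Poisson distribution with rate λ = 3
samples = rng.poisson(lam=3.0, size=100_000)

# For a Poisson(λ) variable, E(Y) = Var(Y) = λ
print(round(samples.mean(), 2), round(samples.var(), 2))
```

Both numbers come out close to 3, the rate parameter λ.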

Formal Math Notation

If you are familiar with the formula for the linear regression model, you will find the Poisson regression formula quite familiar:

\(\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}\) where \(\lambda_i\) is the expected count for the \(i\)-th observation and \(\beta_0, \beta_1, \ldots, \beta_p\) are our regression coefficients.
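Because the model is linear on the log scale, you exponentiate the linear predictor to recover the expected count. A quick sketch with hypothetical coefficients (the numbers below are made up for illustration, not fitted from data):

```python
import numpy as np

# Hypothetical coefficients: intercept and one slope for a math score
beta0, beta1 = -5.25, 0.07
math_score = 60

# Linear predictor on the log scale, then back-transform with exp
log_lambda = beta0 + beta1 * math_score
expected_count = np.exp(log_lambda)
print(round(expected_count, 3))  # → 0.35
```

Note that the back-transformed \(\lambda_i\) is always positive, which is exactly the property linear regression lacked.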

Implementation in Python

Let’s show a simple example of how to apply Poisson regression in Python. We will use a simple data set, modeling the number of awards given to high school students. Our two predictors are the score on their math final and a categorical variable of three levels for which program they are in: Vocational, General, and Academic. Here is a preview of our data set, which we will call awards:

| id  | num_awards | prog       | math |
|-----|------------|------------|------|
| 45  | 0          | Vocational | 41   |
| 108 | 0          | General    | 41   |
| 15  | 0          | Vocational | 44   |
| 67  | 0          | Vocational | 42   |
| 153 | 0          | Vocational | 40   |
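The awards data itself is not bundled with Python, so if you want to follow along without the original file, a stand-in with the same columns can be simulated. The rates below are invented purely so the snippet runs end to end:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Simulated predictors: math final scores and program membership
math = rng.integers(30, 80, size=n)
prog = rng.choice(['Vocational', 'General', 'Academic'], size=n)

# Made-up Poisson rates: more awards for higher math scores,
# with a bump for the Academic program
lam = np.exp(-5.0 + 0.07 * math + 0.8 * (prog == 'Academic'))

awards = pd.DataFrame({
    'id': np.arange(1, n + 1),
    'num_awards': rng.poisson(lam),
    'prog': prog,
    'math': math,
})
print(awards.head())
```

The resulting DataFrame has the same shape and column names as the real data set, which is enough to run the model code below.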

We can visualize our counts using a histogram to get a better idea of the distribution of our counts:

[Figure: Counts Separated by Program Type]

Our response variable shows no hint of normality. Given its clear right skew, Poisson regression seems to be the most fitting choice.
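A histogram like the one described can be drawn with matplotlib. This sketch builds a small placeholder DataFrame so it runs standalone; in practice you would pass your own awards data with the same columns.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for saving to file
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data (illustrative); substitute your awards DataFrame
awards = pd.DataFrame({
    'num_awards': [0, 0, 1, 0, 2, 0, 1, 3],
    'prog': ['Vocational', 'General', 'Academic', 'Vocational',
             'Academic', 'General', 'Academic', 'Academic'],
})

# Overlaid histograms of the counts, one per program
fig, ax = plt.subplots()
for program, group in awards.groupby('prog'):
    ax.hist(group['num_awards'], alpha=0.5, label=program)
ax.set_xlabel('Number of awards')
ax.set_ylabel('Frequency')
ax.set_title('Counts Separated by Program Type')
ax.legend()
fig.savefig('award_counts.png')
```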

We can use the following Python code:

import statsmodels.formula.api as smf

# Poisson regression of award counts on program type and math score
poisson_model = smf.poisson('num_awards ~ C(prog) + math', data=awards).fit()
print(poisson_model.summary())

We now have the following model:

                          Poisson Regression Results
==============================================================================
Dep. Variable:             num_awards   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Thu, 16 May 2024   Pseudo R-squ.:                  0.2118
Time:                        10:38:47   Log-Likelihood:                -182.75
converged:                       True   LL-Null:                       -231.86
Covariance Type:            nonrobust   LLR p-value:                 3.747e-21
=========================================================================================
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                -5.2471      0.658     -7.969      0.000      -6.538      -3.957
C(prog)[T.Academic]       1.0839      0.358      3.025      0.002       0.382       1.786
C(prog)[T.Vocational]     0.3698      0.441      0.838      0.402      -0.495       1.234
math                      0.0702      0.011      6.619      0.000       0.049       0.091
=========================================================================================

With the statsmodels summary, you can examine the coefficients, model statistics, significance tests, etc., just as you would with any other type of regression. There is one thing to keep in mind when interpreting coefficients here: recall that we are modeling the log of the expected count. So, for example, for every additional point a student receives on the math final, the log of the expected number of awards is expected to increase by 0.0702.
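Exponentiating a coefficient converts it from the log scale to a multiplicative effect on the expected count. Using the math coefficient from the summary above:

```python
import numpy as np

# Coefficient for math from the fitted model summary
beta_math = 0.0702

# Each extra point on the math final multiplies the expected
# number of awards by exp(0.0702), about a 7.3% increase
rate_ratio = np.exp(beta_math)
print(round(rate_ratio, 3))  # → 1.073
```

This exponentiated value is often called an incidence-rate ratio, and it is usually the easiest way to communicate a Poisson coefficient to a non-statistical audience.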
