Why Linear Regression Is Not Ideal
Count data measures the frequency of an event. Because it counts occurrences, count data is always discrete and non-negative. Linear regression, however, assumes the response is continuous and allows it to take negative values.
Linear regression assumes that the relationship between the independent and dependent variables is linear and that the dependent variable is normally distributed with constant variance (homoscedasticity). These assumptions often do not hold true for count data, leading to several issues:
- **Negative predictions:** Linear regression can produce negative predicted values, which are not meaningful for count data (see the sketch after this list).
- **Heteroscedasticity:** Count data often exhibits heteroscedasticity, where the variance grows with the mean. Recall that homoscedasticity (constant variance) must hold for linear regression to be valid.
- **Non-normality:** Linear regression assumes that the residuals (differences between observed and predicted values) are normally distributed, an assumption that rarely holds for count data.
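To make the first point concrete, here is a minimal sketch, using synthetic data generated purely for illustration, showing ordinary least squares happily predicting negative counts:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic count data: Poisson counts whose mean grows with x
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = rng.poisson(lam=np.exp(-1 + 0.3 * x))

# Fit ordinary least squares to the raw counts
X = sm.add_constant(x)
ols_model = sm.OLS(y, X).fit()
preds = ols_model.predict(X)

# Some fitted values fall below zero -- impossible for counts
print(f"{(preds < 0).sum()} of {len(preds)} predictions are negative")
```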
Hopefully, by now, you’re thinking, “Well Gianni, you have made it clear that linear regression is a terrible idea for my count data, so tell me more about this Poisson regression!”
Introducing: Poisson Regression
Poisson regression is designed explicitly for count data and assumes that the counts follow a Poisson distribution. The Poisson distribution models the probability of a given number of events happening in a fixed interval of time or space, given a known average rate of occurrence.
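For reference, if \(Y\) follows a Poisson distribution with rate \(\lambda\), the probability of observing exactly \(k\) events is:

\(P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots\)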
In Poisson regression, we have the following assumptions:
- Each observation is independent of the others (e.g. the number of ice cream cones sold today will not influence the number sold tomorrow).
- The mean and the variance of the response are equal. Formally, \(E(Y_i) = Var(Y_i)\). I will introduce this concept more in depth, as well as what to do when it is violated, in Part 2. A quick informal check is sketched after this list.
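Here is a minimal sketch of that informal check, assuming a pandas DataFrame `awards` like the one introduced in the next section:

```python
# Equidispersion check: for a Poisson response, the sample mean
# and variance should be roughly equal
mean = awards['num_awards'].mean()
var = awards['num_awards'].var()
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```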
Formal Math Notation
If you are familiar with the formula for a linear regression model, the Poisson regression model will look quite familiar:
\(\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}\), where \(\lambda_i\) is the expected count for the \(i\)-th observation and \(\beta_0, \beta_1, \ldots, \beta_p\) are our regression coefficients.
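Exponentiating both sides shows why the log link guarantees non-negative predictions:

\(\lambda_i = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}}\)

Since the exponential function is always positive, the model can never produce a negative expected count, sidestepping the first problem with linear regression noted above.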
Implementation in Python
Let’s walk through a simple example of how to apply Poisson regression in Python. We will use a small data set modeling the number of awards given to high school students. Our two predictors are the score on their math final and a three-level categorical variable for their program: Vocational, General, or Academic. Here is a preview of our data set, which we will call awards:
| id | num_awards | prog | math |
|---|---|---|---|
| 45 | 0 | Vocational | 41 |
| 108 | 0 | General | 41 |
| 15 | 0 | Vocational | 44 |
| 67 | 0 | Vocational | 42 |
| 153 | 0 | Vocational | 40 |
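If you want to follow along, the data can be loaded with pandas. The file name below is hypothetical, so point it at wherever your copy of the data lives:

```python
import pandas as pd

# Hypothetical file name; substitute the path to your copy of the data
awards = pd.read_csv('awards.csv')
print(awards.head())
```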
We can visualize our counts using a histogram to get a better idea of the distribution of our counts:
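A minimal sketch with matplotlib, again assuming the `awards` DataFrame from above:

```python
import matplotlib.pyplot as plt

# Histogram of the response variable, one bin per integer count
plt.hist(awards['num_awards'], bins=range(awards['num_awards'].max() + 2))
plt.xlabel('Number of awards')
plt.ylabel('Frequency')
plt.show()
```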
Our response variable shows no hint of normality. Given the clear right skew, Poisson regression seems to be a fitting choice.
We can use the following Python code:
```python
import statsmodels.formula.api as smf

# Poisson regression model; C(prog) treats prog as a categorical predictor
poisson_model = smf.poisson('num_awards ~ C(prog) + math', data=awards).fit()
print(poisson_model.summary())  # produces the summary shown below
```
We now have the following model:
| Dep. Variable: | num_awards | No. Observations: | 200 |
|---|---|---|---|
| Model: | Poisson | Df Residuals: | 196 |
| Method: | MLE | Df Model: | 3 |
| Date: | Thu, 16 May 2024 | Pseudo R-squ.: | 0.2118 |
| Time: | 10:38:47 | Log-Likelihood: | -182.75 |
| converged: | True | LL-Null: | -231.86 |
| Covariance Type: | nonrobust | LLR p-value: | 3.747e-21 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.2471 | 0.658 | -7.969 | 0.000 | -6.538 | -3.957 |
| C(prog)[T.Academic] | 1.0839 | 0.358 | 3.025 | 0.002 | 0.382 | 1.786 |
| C(prog)[T.Vocational] | 0.3698 | 0.441 | 0.838 | 0.402 | -0.495 | 1.234 |
| math | 0.0702 | 0.011 | 6.619 | 0.000 | 0.049 | 0.091 |
With the statsmodels summary, you can examine the coefficients, model statistics, significance tests, etc., just as you would with any other type of regression. There is one thing to keep in mind when interpreting coefficients here: recall that we are modeling the log of the expected count. So, for example, for every additional point a student earns on the math final, the log of the expected award count increases by 0.0702, holding program constant. Equivalently, exponentiating gives \(e^{0.0702} \approx 1.073\), so each additional point multiplies the expected number of awards by about 1.07.
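If you prefer reading effects on the multiplicative scale directly, here is a small sketch using the fitted model from above:

```python
import numpy as np

# Exponentiate the coefficients to get rate ratios
# (multiplicative effects on the expected count)
rate_ratios = np.exp(poisson_model.params)
print(rate_ratios)
```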