Why Linear Regression Is Not Ideal
Count data measures the frequency of an event. Because it counts occurrences, count data is always discrete and non-negative. Linear regression, however, assumes the response is continuous and allows it to take negative values.
Linear regression assumes that the relationship between the independent and dependent variables is linear and that the dependent variable is normally distributed with constant variance (homoscedasticity). These assumptions often do not hold true for count data, leading to several issues:
- **Negative predictions:** Linear regression can produce negative predicted values, which are not meaningful for count data (see the sketch after this list).
- **Heteroscedasticity:** Count data often exhibits heteroscedasticity, where the variance grows with the mean. Recall that homoscedasticity (constant variance) must hold for linear regression to be valid.
- **Non-normality:** Linear regression assumes that the residuals (differences between observed and predicted values) are normally distributed, an assumption that rarely holds for count data.
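To make the first point concrete, here is a minimal sketch, using synthetic data generated purely for illustration, showing ordinary least squares happily predicting negative counts:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic count data: Poisson counts whose mean grows with x
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = rng.poisson(lam=np.exp(-1 + 0.3 * x))

# Fit ordinary least squares to the raw counts
X = sm.add_constant(x)
ols_model = sm.OLS(y, X).fit()
preds = ols_model.predict(X)

# Some fitted values fall below zero -- impossible for counts
print(f"{(preds < 0).sum()} of {len(preds)} predictions are negative")
```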
Hopefully, by now, you’re thinking, “Well Gianni, you have made it clear that linear regression is a terrible idea for my count data, so tell me more about this Poisson regression!”
Introducing: Poisson Regression
Poisson regression is designed explicitly for count data and assumes that the counts follow a Poisson distribution. The Poisson distribution models the probability of a given number of events happening in a fixed interval of time or space, given a known average rate of occurrence.
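For reference, if \(Y\) follows a Poisson distribution with rate \(\lambda\), the probability of observing exactly \(k\) events is:

\(P(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots\)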
In Poisson regression, we have the following assumptions:
- Each observation is independent of the others (e.g. the number of ice cream cones sold today will not influence the number sold tomorrow).
- The mean and the variance of the response are equal. Formally, \(E(Y_i) = Var(Y_i)\). I will introduce this concept more in depth, as well as what to do when it is violated, in Part 2. A quick informal check is sketched after this list.
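Here is a minimal sketch of that informal check, assuming a pandas DataFrame `awards` like the one introduced in the next section:

```python
# Equidispersion check: for a Poisson response, the sample mean
# and variance should be roughly equal
mean = awards['num_awards'].mean()
var = awards['num_awards'].var()
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```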
Formal Math Notation
If you are familiar with the formula for a linear regression model, the Poisson regression model will look quite familiar:
\(\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}\), where \(\lambda_i\) is the expected count for the \(i\)-th observation and \(\beta_0, \beta_1, \ldots, \beta_p\) are our regression coefficients.
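Exponentiating both sides shows why the log link guarantees non-negative predictions:

\(\lambda_i = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}}\)

Since the exponential function is always positive, the model can never produce a negative expected count, sidestepping the first problem with linear regression noted above.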
Implementation in Python
Let’s walk through a simple example of how to apply Poisson regression in Python. We will use a small data set modeling the number of awards given to high school students. Our two predictors are the score on their math final and a three-level categorical variable for their program: Vocational, General, or Academic. Here is a preview of our data set, which we will call awards:
| id | num_awards | prog | math |
|---|---|---|---|
| 45 | 0 | Vocational | 41 |
| 108 | 0 | General | 41 |
| 15 | 0 | Vocational | 44 |
| 67 | 0 | Vocational | 42 |
| 153 | 0 | Vocational | 40 |
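If you want to follow along, the data can be loaded with pandas. The file name below is hypothetical, so point it at wherever your copy of the data lives:

```python
import pandas as pd

# Hypothetical file name; substitute the path to your copy of the data
awards = pd.read_csv('awards.csv')
print(awards.head())
```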
We can visualize our counts using a histogram to get a better idea of the distribution of our counts:
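A minimal sketch with matplotlib, again assuming the `awards` DataFrame from above:

```python
import matplotlib.pyplot as plt

# Histogram of the response variable, one bin per integer count
plt.hist(awards['num_awards'], bins=range(awards['num_awards'].max() + 2))
plt.xlabel('Number of awards')
plt.ylabel('Frequency')
plt.show()
```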
Our response variable shows no hint of normality. Given the clear right skew, Poisson regression seems to be a fitting choice.
We can use the following Python code:
```python
import statsmodels.formula.api as smf

# Poisson regression model; C(prog) treats prog as a categorical predictor
poisson_model = smf.poisson('num_awards ~ C(prog) + math', data=awards).fit()
print(poisson_model.summary())  # produces the summary shown below
```
We now have the following model:
| Dep. Variable: | num_awards | No. Observations: | 200 |
|---|---|---|---|
| Model: | Poisson | Df Residuals: | 196 |
| Method: | MLE | Df Model: | 3 |
| Date: | Thu, 16 May 2024 | Pseudo R-squ.: | 0.2118 |
| Time: | 10:38:47 | Log-Likelihood: | -182.75 |
| converged: | True | LL-Null: | -231.86 |
| Covariance Type: | nonrobust | LLR p-value: | 3.747e-21 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.2471 | 0.658 | -7.969 | 0.000 | -6.538 | -3.957 |
| C(prog)[T.Academic] | 1.0839 | 0.358 | 3.025 | 0.002 | 0.382 | 1.786 |
| C(prog)[T.Vocational] | 0.3698 | 0.441 | 0.838 | 0.402 | -0.495 | 1.234 |
| math | 0.0702 | 0.011 | 6.619 | 0.000 | 0.049 | 0.091 |
With the statsmodels summary, you can examine the coefficients, model statistics, significance tests, etc., just as you would with any other type of regression. There is one thing to keep in mind when interpreting coefficients here: recall that we are modeling the log of the expected count. So, for example, for every additional point a student earns on the math final, the log of the expected award count increases by 0.0702, holding program constant. Equivalently, exponentiating gives \(e^{0.0702} \approx 1.073\), so each additional point multiplies the expected number of awards by about 1.07.
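If you prefer reading effects on the multiplicative scale directly, here is a small sketch using the fitted model from above:

```python
import numpy as np

# Exponentiate the coefficients to get rate ratios
# (multiplicative effects on the expected count)
rate_ratios = np.exp(poisson_model.params)
print(rate_ratios)
```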