Pseudo R-squared: Understanding Goodness-of-Fit in Logistic Regression

kameshcodes

1. Import Libraries


import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

2. Simulate Data


n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = np.random.rand(n_samples)
X4 = np.random.rand(n_samples)

noise = np.random.normal(0, 0.1, n_samples)

beta_0 = -1  # Intercept coefficient
beta_1 = 2.5
beta_2 = -1.5
beta_3 = 1.2
beta_4 = -0.8

linear_combination = (beta_0 + beta_1 * X1 + beta_2 * X2 + beta_3 * X3 + beta_4 * X4 + noise)
# logistic transformation
probabilities = 1 / (1 + np.exp(-linear_combination))

# Generate binary response y
y = np.random.binomial(1, probabilities)

# Create a DataFrame
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'X3': X3,
    'X4': X4,
    'y': y
})

data
X1 X2 X3 X4 y
0 0.374540 0.031429 0.642032 0.051682 1
1 0.950714 0.636410 0.084140 0.531355 0
2 0.731994 0.314356 0.161629 0.540635 1
3 0.598658 0.508571 0.898554 0.637430 0
4 0.156019 0.907566 0.606429 0.726091 0
... ... ... ... ... ...
95 0.493796 0.349210 0.522243 0.930757 0
96 0.522733 0.725956 0.769994 0.858413 0
97 0.427541 0.897110 0.215821 0.428994 0
98 0.025419 0.887086 0.622890 0.750871 0
99 0.107891 0.779876 0.085347 0.754543 0

100 rows × 5 columns

3. Logistic Modelling


X = sm.add_constant(data[['X1', 'X2', 'X3', 'X4']])
y = data['y']

logit_model = sm.Logit(y, X)
model = logit_model.fit()

print(model.summary())
Optimization terminated successfully.
         Current function value: 0.500991
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 10 Dec 2024   Pseudo R-squ.:                  0.2696
Time:                        20:50:40   Log-Likelihood:                -50.099
converged:                       True   LL-Null:                       -68.593
Covariance Type:            nonrobust   LLR p-value:                 1.812e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3730      1.001     -0.372      0.710      -2.336       1.590
X1             3.4532      0.908      3.804      0.000       1.674       5.232
X2            -3.2572      0.934     -3.486      0.000      -5.089      -1.426
X3             1.0942      0.901      1.214      0.225      -0.672       2.860
X4            -1.1455      0.864     -1.325      0.185      -2.840       0.549
==============================================================================

Here,

 Pseudo R-squ. : 0.2696

But why does a logistic regression model report a pseudo-$R^2$, and not the usual $R^2$, in its summary?



Let's first understand: what is $R^2$?

In linear regression, $R^2$ measures the proportion of variance in the dependent variable $y$ that is explained by the independent variables $X$, which indicates goodness of fit. It is defined as:


$$R^2 = \frac{\text{SSR}}{\text{SST}}$$

where:

  • $\text{SSR} = \sum_{i=1}^n \left( \hat{y}_i - \bar{y} \right)^2$

  • $\text{SST} = \sum_{i=1}^n \left( y_i - \bar{y} \right)^2$
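As a quick illustration of the definition above (a toy example, not part of the original notebook), $R^2$ can be computed by hand from SSR and SST for a simple linear fit:

```python
import numpy as np

# Toy example: compute R^2 by hand from the SSR/SST definition
# for a simple one-variable linear fit.
rng = np.random.default_rng(0)
x = rng.random(50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)  # linear signal plus small noise

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2 = ssr / sst

print(f"R^2 = {r2:.4f}")
```

For a simple regression with an intercept, this ratio equals the squared correlation between $x$ and $y$, which is why it reads as "proportion of variance explained".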


Now

  • In linear regression, the dependent variable is continuous, which makes it possible to calculate measures like the regression sum of squares (SSR) and the total sum of squares (SST). These form the basis of $R^2$, a measure of how well the model explains the data.

  • $R^2$ assumes a linear relationship and works by dividing the total variation in the dependent variable into explained and unexplained parts. This only works when the outcomes are continuous.

  • In logistic regression, the dependent variable is binary (e.g., $0$ or $1$), so traditional variance-based calculations like SSR and SST don’t apply. Binary outcomes don’t have variance in the same way continuous data does.

  • Logistic regression uses likelihood-based methods (like the log-likelihood or deviance) to measure how well the model fits the data. These methods replace $R^2$ for evaluating the model’s performance.


Remark


You might think that the variance $p(1-p)$ of the Bernoulli distribution, which logistic regression uses to model the probability of success $p$, could be split into "explained" and "unexplained" variance, as in linear regression.

However, this variance, $p(1-p)$, only reflects the uncertainty in the predicted probabilities. It doesn’t represent the variance of the actual target variable, which is binary ($0$ or $1$). Because of this, we can’t apply the traditional idea of explained and unexplained variance in logistic regression.


Reference: Agresti, A. (2002). Categorical Data Analysis. Wiley.


The Role of Pseudo-$R^2$


In logistic regression, traditional $R^2$ doesn't work because it's designed for continuous outcomes, not binary ones. To fill this gap, we use pseudo-$R^2$ measures to assess how well the model fits the data by comparing it to a baseline (null) model.

Here are some common types of pseudo-$R^2$:

  • McFadden's $R^2$ – this is the one reported by the statsmodels library.
  • Cox and Snell's $R^2$
  • Nagelkerke's $R^2$

While pseudo-$R^2$ measures do not directly quantify explained variance, they serve as relative indicators of how well the model fits the data compared to a null model. Pseudo-$R^2$ provides a way to interpret model effectiveness in a manner conceptually similar to, but calculated differently from, traditional $R^2$.


What is a null model?


A null model is the simplest logistic regression model: it includes no predictors, only an intercept. It predicts the same probability for all observations, equal to the proportion of $1$'s in the dataset.

  • For example, if 30% of customers in a dataset made a purchase, the null model predicts $p = P(\text{purchase}) = 0.3$ for every customer in the dataset.
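Because the null model predicts the same probability for everyone, its log-likelihood can be computed by hand. A small sketch with a hypothetical binary vector (3 purchases out of 10 customers, matching the 30% example):

```python
import numpy as np

# Hypothetical toy outcomes: 3 purchases out of 10 customers.
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])

p_bar = y.mean()  # the null model's single predicted probability (0.3 here)

# Bernoulli log-likelihood of the null model: every observation gets p_bar.
ll_null = np.sum(y * np.log(p_bar) + (1 - y) * np.log(1 - p_bar))

print(f"null prediction p = {p_bar:.1f}, log-likelihood = {ll_null:.4f}")
```

This is the same quantity statsmodels exposes as `model.llnull`, which the pseudo-$R^2$ formulas below use as their baseline.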


1. McFadden's $R^2$


McFadden's $R^2$ evaluates the goodness-of-fit of a logistic regression, analogous to $R^2$ in linear regression but designed for categorical response models in classification tasks.

It provides insight into how much the model improves upon the baseline scenario where no predictors are included, effectively answering the question: “Is the model doing significantly better than random guessing?”


Formula:

$$R^2 = 1 - \frac{\text{log-likelihood of the fitted model}}{\text{log-likelihood of the null model}}$$


$$R^2_{\text{McFadden}} = 1 - \frac{\ln(L_{\text{model}})}{\ln(L_{\text{null}})}$$

  • $L_{\text{model}}$: the likelihood of the fitted model.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).

Interpretation:

  • Range: from $0$ to $1$, where higher values indicate better model fit.
  • Values between 0.2 and 0.4 are generally considered a good fit.

Advantages: It benchmarks the improvement of the fitted model over the null model.

Limitations: It yields inherently smaller values than traditional $R^2$, so a low value does not always imply a poor model.

print(model.summary()) # we have this model from above
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 10 Dec 2024   Pseudo R-squ.:                  0.2696
Time:                        20:50:40   Log-Likelihood:                -50.099
converged:                       True   LL-Null:                       -68.593
Covariance Type:            nonrobust   LLR p-value:                 1.812e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3730      1.001     -0.372      0.710      -2.336       1.590
X1             3.4532      0.908      3.804      0.000       1.674       5.232
X2            -3.2572      0.934     -3.486      0.000      -5.089      -1.426
X3             1.0942      0.901      1.214      0.225      -0.672       2.860
X4            -1.1455      0.864     -1.325      0.185      -2.840       0.549
==============================================================================
mcfadden_r2 = 1 - (model.llf / model.llnull)   # llf: log-likelihood of full model
                                               # llnull: log-likelihood of null model

print(f"McFadden's pseudo R-squared: {mcfadden_r2:.4f}")
McFadden's pseudo R-squared: 0.2696

2. Adjusted McFadden's $R^2$

Similar to McFadden’s $R^2$, but penalizes for the number of predictors to account for model complexity.

Formula:

$$R^2_{\text{Adjusted McFadden}} = 1 - \frac{\ln(L_{\text{model}}) - k}{\ln(L_{\text{null}})}$$

where $k$ is the number of predictors.

k = model.df_model  # Number of predictors
adjusted_mcfadden_r2 = 1 - ((model.llf - k) / model.llnull)

print(f"Adjusted McFadden's pseudo R-squared: {adjusted_mcfadden_r2:.4f}")
Adjusted McFadden's pseudo R-squared: 0.2113

3. Cox and Snell's $R^2$



Formula:

$$R^2 = 1 - \left( \frac{\text{likelihood of the null model}}{\text{likelihood of the fitted model}} \right)^{\frac{2}{n}}$$


$$= 1 - \left(\frac{L_{\text{null}}}{L_{\text{model}}}\right)^{\frac{2}{n}}$$

  • $L_{\text{model}}$: the likelihood of the fitted model.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).
  • $n$: the total number of observations.

Interpretation:

  • Range: from $0$ to a theoretical maximum less than $1$, where higher values indicate better model fit.
  • The maximum value depends on the sample size and the model’s characteristics.

Advantages: It provides an interpretable measure of fit, consistent with the $R^2$ concept from linear regression.

Limitations: Its maximum value is less than $1$ and depends on the dataset and model characteristics, which can make comparisons between models or datasets less straightforward.

L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)

cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)

print(f"Cox and Snell's pseudo R-squared: {cox_snell_r2:.4f}")
Cox and Snell's pseudo R-squared: 0.3092

4. Nagelkerke's $R^2$


It adjusts Cox and Snell's $R^2$ to ensure its maximum value is $1$, standardizing it for better comparability.

Formula:

$$R^2 = \frac{\text{Cox-Snell } R^2}{1 - \left(\text{likelihood of the null model}\right)^{\frac{2}{n}}}$$


$$= \frac{R^2_{\text{Cox-Snell}}}{1 - \left(L_{\text{null}}\right)^{\frac{2}{n}}}$$

  • $R^2_{\text{Cox-Snell}}$: the Cox and Snell pseudo $R^2$.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).
  • $n$: the total number of observations.

Interpretation:

  • Ranges from $0$ to $1$, with higher values indicating better model fit.

Advantages:

  • Adjusts Cox and Snell's $R^2$ so it can reach $1$, making it easier to interpret.
  • It helps compare how well different models or datasets fit.
  • Feels more similar to the $R^2$ used in linear regression, which makes it more familiar and intuitive.

Disadvantages:

  • It doesn’t show how much variation in the data is explained, as it’s still an approximation.
  • Depends on likelihood values, so it can be affected by the assumptions of the model.

L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)

cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)  #cox-snell

nagelkerke_r2 = cox_snell_r2 / (1 - np.power(L_null, 2 / model.nobs))

print(f"Nagelkerke's pseudo R-squared: {nagelkerke_r2:.4f}")
Nagelkerke's pseudo R-squared: 0.4142
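To wrap up, all four pseudo-$R^2$ values can be recomputed directly from the two log-likelihoods reported in the model summary above, staying on the log scale throughout (exponentiating log-likelihoods, as done in the snippets above, can underflow for large datasets):

```python
import numpy as np

# Log-likelihoods reported in the model summary above, plus n and k.
llf, llnull = -50.099, -68.593
n, k = 100, 4

mcfadden     = 1 - llf / llnull
adj_mcfadden = 1 - (llf - k) / llnull
cox_snell    = 1 - np.exp(2 * (llnull - llf) / n)  # = 1 - (L_null/L_model)^(2/n)
nagelkerke   = cox_snell / (1 - np.exp(2 * llnull / n))

for name, val in [("McFadden", mcfadden), ("Adjusted McFadden", adj_mcfadden),
                  ("Cox-Snell", cox_snell), ("Nagelkerke", nagelkerke)]:
    print(f"{name:>17}: {val:.4f}")
```

Note the ordering McFadden < Cox-Snell < Nagelkerke on this dataset, consistent with the values computed step by step above: each measure rescales the same likelihood-ratio information differently.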
