Pseudo R-squared: Understanding Goodness-of-Fit in Logistic Regression

kameshcodes

1. Import Libraries


import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(42)

2. Simulate Data


n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = np.random.rand(n_samples)
X4 = np.random.rand(n_samples)

noise = np.random.normal(0, 0.1, n_samples)

beta_0 = -1  # Intercept coefficient
beta_1 = 2.5
beta_2 = -1.5
beta_3 = 1.2
beta_4 = -0.8

linear_combination = (beta_0 + beta_1 * X1 + beta_2 * X2 + beta_3 * X3 + beta_4 * X4 + noise)
# logistic transformation
probabilities = 1 / (1 + np.exp(-linear_combination))

# Generate binary response y
y = np.random.binomial(1, probabilities)

# Create a DataFrame
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'X3': X3,
    'X4': X4,
    'y': y
})

data
X1 X2 X3 X4 y
0 0.374540 0.031429 0.642032 0.051682 1
1 0.950714 0.636410 0.084140 0.531355 0
2 0.731994 0.314356 0.161629 0.540635 1
3 0.598658 0.508571 0.898554 0.637430 0
4 0.156019 0.907566 0.606429 0.726091 0
... ... ... ... ... ...
95 0.493796 0.349210 0.522243 0.930757 0
96 0.522733 0.725956 0.769994 0.858413 0
97 0.427541 0.897110 0.215821 0.428994 0
98 0.025419 0.887086 0.622890 0.750871 0
99 0.107891 0.779876 0.085347 0.754543 0

100 rows × 5 columns

3. Logistic Modelling


X = sm.add_constant(data[['X1', 'X2', 'X3', 'X4']])
y = data['y']

logit_model = sm.Logit(y, X)
model = logit_model.fit()

print(model.summary())
Optimization terminated successfully.
         Current function value: 0.500991
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 10 Dec 2024   Pseudo R-squ.:                  0.2696
Time:                        20:50:40   Log-Likelihood:                -50.099
converged:                       True   LL-Null:                       -68.593
Covariance Type:            nonrobust   LLR p-value:                 1.812e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3730      1.001     -0.372      0.710      -2.336       1.590
X1             3.4532      0.908      3.804      0.000       1.674       5.232
X2            -3.2572      0.934     -3.486      0.000      -5.089      -1.426
X3             1.0942      0.901      1.214      0.225      -0.672       2.860
X4            -1.1455      0.864     -1.325      0.185      -2.840       0.549
==============================================================================

Here,

 Pseudo R-squ. : 0.2696

But why does a logistic regression model report a pseudo-$R^2$, and not the usual $R^2$, in its summary?



Let's first understand: what is $R^2$?

In linear regression, $R^2$ measures the proportion of variance in the dependent variable $y$ that is explained by the independent variables $X$, which indicates goodness of fit. It is defined as:


$$R^2 = \frac{\text{SSR}}{\text{SST}}$$

where:

  • $\text{SSR} = \sum_{i=1}^n \left( \hat{y}_i - \bar{y} \right)^2$

  • $\text{SST} = \sum_{i=1}^n \left( y_i - \bar{y} \right)^2$
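As a quick illustration of the definition above (a toy example, not part of the original notebook), $R^2$ can be computed by hand from SSR and SST for a simple linear fit:

```python
import numpy as np

# Toy example: compute R^2 by hand from the SSR/SST definition
# for a simple one-variable linear fit.
rng = np.random.default_rng(0)
x = rng.random(50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)  # linear signal plus small noise

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2 = ssr / sst

print(f"R^2 = {r2:.4f}")
```

For a simple regression with an intercept, this ratio equals the squared correlation between $x$ and $y$, which is why it reads as "proportion of variance explained".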


Now

  • In linear regression, the dependent variable is continuous, which makes it possible to calculate measures like the regression sum of squares (SSR) and the total sum of squares (SST). These form the basis of $R^2$, a measure of how well the model explains the data.

  • $R^2$ assumes a linear relationship and works by dividing the total variation in the dependent variable into explained and unexplained parts. This only works when the outcomes are continuous.

  • In logistic regression, the dependent variable is binary (e.g., $0$ or $1$), so traditional variance-based calculations like SSR and SST don’t apply. Binary outcomes don’t have variance in the same way continuous data does.

  • Logistic regression uses likelihood-based methods (like the log-likelihood or deviance) to measure how well the model fits the data. These methods replace $R^2$ for evaluating the model’s performance.


Remark


You might think that the variance $p(1-p)$ of the Bernoulli distribution, which logistic regression uses to model the probability of success $p$, could be split into "explained" and "unexplained" variance, as in linear regression.

However, this variance, $p(1-p)$, only reflects the uncertainty in the predicted probabilities. It doesn’t represent the variance of the actual target variable, which is binary ($0$ or $1$). Because of this, we can’t apply the traditional idea of explained and unexplained variance in logistic regression.


Reference: Agresti, A. (2002). Categorical Data Analysis. Wiley.


The Role of Pseudo-$R^2$


In logistic regression, traditional $R^2$ doesn't work because it's designed for continuous outcomes, not binary ones. To fill this gap, we use pseudo-$R^2$ measures to assess how well the model fits the data by comparing it to a baseline (null) model.

Here are some common types of pseudo-$R^2$:

  • McFadden's $R^2$ – this is the one reported by the statsmodels library.
  • Cox and Snell's $R^2$
  • Nagelkerke's $R^2$

While pseudo-$R^2$ measures do not directly quantify explained variance, they serve as relative indicators of how well the model fits the data compared to a null model. Pseudo-$R^2$ provides a way to interpret model effectiveness in a manner conceptually similar to, but calculated differently from, traditional $R^2$.


What is a null model?


A null model is the simplest logistic regression model: it includes no predictors, only an intercept. It predicts the same probability for all observations, equal to the proportion of $1$'s in the dataset.

  • For example, if 30% of customers in a dataset made a purchase, the null model predicts $p = P(\text{purchase}) = 0.3$ for every customer in the dataset.
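Because the null model predicts the same probability for everyone, its log-likelihood can be computed by hand. A small sketch with a hypothetical binary vector (3 purchases out of 10 customers, matching the 30% example):

```python
import numpy as np

# Hypothetical toy outcomes: 3 purchases out of 10 customers.
y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])

p_bar = y.mean()  # the null model's single predicted probability (0.3 here)

# Bernoulli log-likelihood of the null model: every observation gets p_bar.
ll_null = np.sum(y * np.log(p_bar) + (1 - y) * np.log(1 - p_bar))

print(f"null prediction p = {p_bar:.1f}, log-likelihood = {ll_null:.4f}")
```

This is the same quantity statsmodels exposes as `model.llnull`, which the pseudo-$R^2$ formulas below use as their baseline.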


1. McFadden's $R^2$


McFadden's $R^2$ evaluates the goodness-of-fit of a logistic regression, analogous to $R^2$ in linear regression but designed for categorical response models in classification tasks.

It provides insight into how much the model improves upon the baseline scenario where no predictors are included, effectively answering the question: “Is the model doing significantly better than random guessing?”


Formula:

$$R^2 = 1 - \frac{\text{log-likelihood of the fitted model}}{\text{log-likelihood of the null model}}$$


$$R^2_{\text{McFadden}} = 1 - \frac{\ln(L_{\text{model}})}{\ln(L_{\text{null}})}$$

  • $L_{\text{model}}$: the likelihood of the fitted model.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).

Interpretation:

  • Range: from $0$ to $1$, where higher values indicate better model fit.
  • Values between 0.2 and 0.4 are generally considered a good fit.

Advantages: It benchmarks the improvement of the fitted model over the null model.

Limitations: It yields inherently smaller values than traditional $R^2$, so a low value does not always imply a poor model.

print(model.summary()) # we have this model from above
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 10 Dec 2024   Pseudo R-squ.:                  0.2696
Time:                        20:50:40   Log-Likelihood:                -50.099
converged:                       True   LL-Null:                       -68.593
Covariance Type:            nonrobust   LLR p-value:                 1.812e-07
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.3730      1.001     -0.372      0.710      -2.336       1.590
X1             3.4532      0.908      3.804      0.000       1.674       5.232
X2            -3.2572      0.934     -3.486      0.000      -5.089      -1.426
X3             1.0942      0.901      1.214      0.225      -0.672       2.860
X4            -1.1455      0.864     -1.325      0.185      -2.840       0.549
==============================================================================
mcfadden_r2 = 1 - (model.llf / model.llnull)   # llf: log-likelihood of full model
                                               # llnull: log-likelihood of null model

print(f"McFadden's pseudo R-squared: {mcfadden_r2:.4f}")
McFadden's pseudo R-squared: 0.2696

2. Adjusted McFadden's $R^2$

Similar to McFadden’s $R^2$, but penalizes for the number of predictors to account for model complexity.

Formula:

$$R^2_{\text{Adjusted McFadden}} = 1 - \frac{\ln(L_{\text{model}}) - k}{\ln(L_{\text{null}})}$$

where $k$ is the number of predictors.

k = model.df_model  # Number of predictors
adjusted_mcfadden_r2 = 1 - ((model.llf - k) / model.llnull)

print(f"Adjusted McFadden's pseudo R-squared: {adjusted_mcfadden_r2:.4f}")
Adjusted McFadden's pseudo R-squared: 0.2113

3. Cox and Snell's $R^2$



Formula:

$$R^2 = 1 - \left( \frac{\text{likelihood of the null model}}{\text{likelihood of the fitted model}} \right)^{\frac{2}{n}}$$


$$= 1 - \left(\frac{L_{\text{null}}}{L_{\text{model}}}\right)^{\frac{2}{n}}$$

  • $L_{\text{model}}$: the likelihood of the fitted model.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).
  • $n$: the total number of observations.

Interpretation:

  • Range: from $0$ to a theoretical maximum less than $1$, where higher values indicate better model fit.
  • The maximum value depends on the sample size and the model’s characteristics.

Advantages: It provides an interpretable measure of fit, consistent with the $R^2$ concept from linear regression.

Limitations: Its maximum value is less than $1$ and depends on the dataset and model characteristics, which can make comparisons between models or datasets less straightforward.

L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)

cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)

print(f"Cox and Snell's pseudo R-squared: {cox_snell_r2:.4f}")
Cox and Snell's pseudo R-squared: 0.3092

4. Nagelkerke's $R^2$


It adjusts Cox and Snell's $R^2$ to ensure its maximum value is $1$, standardizing it for better comparability.

Formula:

$$R^2 = \frac{\text{Cox-Snell } R^2}{1 - \left(\text{likelihood of the null model}\right)^{\frac{2}{n}}}$$


$$= \frac{R^2_{\text{Cox-Snell}}}{1 - \left(L_{\text{null}}\right)^{\frac{2}{n}}}$$

  • $R^2_{\text{Cox-Snell}}$: the Cox and Snell pseudo $R^2$.
  • $L_{\text{null}}$: the likelihood of the null model (a model with only an intercept).
  • $n$: the total number of observations.

Interpretation:

  • Ranges from $0$ to $1$, with higher values indicating better model fit.

Advantages:

  • Adjusts Cox and Snell's $R^2$ so it can reach $1$, making it easier to interpret.
  • It helps compare how well different models or datasets fit.
  • Feels more similar to the $R^2$ used in linear regression, which makes it more familiar and intuitive.

Disadvantages:

  • It doesn’t show how much variation in the data is explained, as it’s still an approximation.
  • Depends on likelihood values, so it can be affected by the assumptions of the model.

L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)

cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)  #cox-snell

nagelkerke_r2 = cox_snell_r2 / (1 - np.power(L_null, 2 / model.nobs))

print(f"Nagelkerke's pseudo R-squared: {nagelkerke_r2:.4f}")
Nagelkerke's pseudo R-squared: 0.4142
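To wrap up, all four pseudo-$R^2$ values can be recomputed directly from the two log-likelihoods reported in the model summary above, staying on the log scale throughout (exponentiating log-likelihoods, as done in the snippets above, can underflow for large datasets):

```python
import numpy as np

# Log-likelihoods reported in the model summary above, plus n and k.
llf, llnull = -50.099, -68.593
n, k = 100, 4

mcfadden     = 1 - llf / llnull
adj_mcfadden = 1 - (llf - k) / llnull
cox_snell    = 1 - np.exp(2 * (llnull - llf) / n)  # = 1 - (L_null/L_model)^(2/n)
nagelkerke   = cox_snell / (1 - np.exp(2 * llnull / n))

for name, val in [("McFadden", mcfadden), ("Adjusted McFadden", adj_mcfadden),
                  ("Cox-Snell", cox_snell), ("Nagelkerke", nagelkerke)]:
    print(f"{name:>17}: {val:.4f}")
```

Note the ordering McFadden < Cox-Snell < Nagelkerke on this dataset, consistent with the values computed step by step above: each measure rescales the same likelihood-ratio information differently.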
