import numpy as np
import pandas as pd
import statsmodels.api as sm
np.random.seed(42)
n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = np.random.rand(n_samples)
X4 = np.random.rand(n_samples)
noise = np.random.normal(0, 0.1, n_samples)
beta_0 = -1 # Intercept coefficient
beta_1 = 2.5
beta_2 = -1.5
beta_3 = 1.2
beta_4 = -0.8
linear_combination = (beta_0 + beta_1 * X1 + beta_2 * X2 + beta_3 * X3 + beta_4 * X4 + noise)
# Logistic transformation
probabilities = 1 / (1 + np.exp(-linear_combination))
# Generate binary response y
y = np.random.binomial(1, probabilities)
# Create a DataFrame
data = pd.DataFrame({
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4,
'y': y
})
data
| | X1 | X2 | X3 | X4 | y |
|---|---|---|---|---|---|
| 0 | 0.374540 | 0.031429 | 0.642032 | 0.051682 | 1 |
| 1 | 0.950714 | 0.636410 | 0.084140 | 0.531355 | 0 |
| 2 | 0.731994 | 0.314356 | 0.161629 | 0.540635 | 1 |
| 3 | 0.598658 | 0.508571 | 0.898554 | 0.637430 | 0 |
| 4 | 0.156019 | 0.907566 | 0.606429 | 0.726091 | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0.493796 | 0.349210 | 0.522243 | 0.930757 | 0 |
| 96 | 0.522733 | 0.725956 | 0.769994 | 0.858413 | 0 |
| 97 | 0.427541 | 0.897110 | 0.215821 | 0.428994 | 0 |
| 98 | 0.025419 | 0.887086 | 0.622890 | 0.750871 | 0 |
| 99 | 0.107891 | 0.779876 | 0.085347 | 0.754543 | 0 |
100 rows × 5 columns
X = sm.add_constant(data[['X1', 'X2', 'X3', 'X4']])
y = data['y']
logit_model = sm.Logit(y, X)
model = logit_model.fit()
print(model.summary())
Optimization terminated successfully.
Current function value: 0.500991
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: Logit Df Residuals: 95
Method: MLE Df Model: 4
Date: Tue, 10 Dec 2024 Pseudo R-squ.: 0.2696
Time: 20:50:40 Log-Likelihood: -50.099
converged: True LL-Null: -68.593
Covariance Type: nonrobust LLR p-value: 1.812e-07
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3730 1.001 -0.372 0.710 -2.336 1.590
X1 3.4532 0.908 3.804 0.000 1.674 5.232
X2 -3.2572 0.934 -3.486 0.000 -5.089 -1.426
X3 1.0942 0.901 1.214 0.225 -0.672 2.860
X4 -1.1455 0.864 -1.325 0.185 -2.840 0.549
==============================================================================
Here,

Pseudo R-squ.: 0.2696

Let's first understand what $R^2$ is.

In linear regression, $R^2$ measures the proportion of variance in the dependent variable that is explained by the independent variables in the model, which indicates goodness of fit. It is defined as:

$$R^2 = \frac{SSR}{SST}$$

where $SSR$ is the regression (explained) sum of squares and $SST$ is the total sum of squares.
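As a quick illustration, the ratio $SSR/SST$ can be computed by hand for a simple least-squares fit. This is a sketch on hypothetical simulated data, not part of the logistic example below:

```python
import numpy as np

# Hypothetical continuous data with a roughly linear relationship
rng = np.random.default_rng(0)
x = rng.random(50)
y = 2.0 * x + rng.normal(0, 0.1, 50)

# Fit a line and split the total variation into explained and residual parts
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # SSR: variation explained by the fit
ss_tot = np.sum((y - y.mean()) ** 2)      # SST: total variation in y
r2 = ss_reg / ss_tot
print(f"R^2: {r2:.4f}")
```

With such a strong linear signal and small noise, $R^2$ lands close to 1.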
Now, why doesn't this carry over to logistic regression?

In linear regression, the dependent variable is continuous, which makes it possible to calculate measures like the Sum of Squares Regression (SSR) and the Total Sum of Squares (SST). These form the basis of $R^2$, a measure of how well the model explains the data.
$R^2$ assumes a linear relationship and works by dividing the total variation in the dependent variable into explained and unexplained parts. This only works when the outcomes are continuous.
In logistic regression, the dependent variable is binary (e.g., $0$ or $1$), so traditional variance-based calculations like $SSR$ and $SST$ don't apply. Binary outcomes don't have variance in the same way continuous data does.
Logistic regression instead uses likelihood-based methods (like the log-likelihood or $-2LL$, also called the deviance) to measure how well the model fits the data. These methods replace $R^2$ for evaluating the model's performance.
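The log-likelihood and $-2LL$ are simple to compute directly from predicted probabilities. A minimal sketch, using hypothetical outcomes and probabilities (not the model fitted above):

```python
import numpy as np

# Hypothetical binary outcomes and model-predicted probabilities
y_obs = np.array([1, 0, 1, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])

# Bernoulli log-likelihood: log(p) where y=1, log(1-p) where y=0
log_lik = np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))
deviance = -2 * log_lik  # the "-2LL" commonly reported for logistic models
print(f"log-likelihood: {log_lik:.4f}, -2LL: {deviance:.4f}")
```

A better-fitting model assigns higher probabilities to the observed outcomes, pushing the log-likelihood toward 0 and the deviance down.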
Remark
You might think that the variance from the Bernoulli distribution, which logistic regression uses to model the probability of success $p$, could be split into "explained" and "unexplained" variance, like in linear regression.
However, this variance, $p(1-p)$, only reflects the uncertainty in the predicted probabilities. It doesn't represent the variance of the actual target variable, which is binary ($0$ or $1$). Because of this, we can't apply the traditional idea of explained and unexplained variance in logistic regression.
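A two-line check of what $p(1-p)$ looks like across a few hypothetical predicted probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities from a logistic model
p = np.array([0.2, 0.5, 0.8])

# Bernoulli variance p(1-p): uncertainty of the prediction itself,
# largest at p = 0.5 and shrinking toward 0 as p approaches 0 or 1
bernoulli_var = p * (1 - p)
print(bernoulli_var)  # [0.16 0.25 0.16]
```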
In logistic regression, traditional $R^2$ doesn't work because it's designed for continuous outcomes, not binary ones. To fill this gap, we use Pseudo-$R^2$ measures, which assess how well the model fits the data by comparing it to a baseline (null) model.
Here are some common types of Pseudo-$R^2$:

- McFadden's $R^2$ (and its adjusted variant)
- Cox and Snell's $R^2$
- Nagelkerke's $R^2$

Each of these can be computed from the log-likelihoods exposed by a fitted model in the statsmodels library. While Pseudo-$R^2$ measures do not directly measure explained variance, they serve as relative indicators of how well the model fits the data compared to a null model. Pseudo-$R^2$ provides a way to interpret model effectiveness in a conceptually similar, but differently calculated, manner to traditional $R^2$.
A null model is the simplest logistic regression model: it includes no predictors, only an intercept. It predicts the same probability for all observations, equal to the proportion of $1$'s in the dataset.
McFadden's $R^2$ evaluates the goodness-of-fit of a logistic regression model, analogous to $R^2$ in linear regression but designed for categorical response models in classification tasks.
It provides insight into how much the model improves upon the baseline scenario where no predictors are included, effectively answering the question: “Is the model doing significantly better than random guessing?”
Formula:

$$R^2_{McFadden} = 1 - \frac{\ln L_{model}}{\ln L_{null}}$$

where $L_{model}$ is the likelihood of the fitted model and $L_{null}$ is the likelihood of the null model.

Interpretation: A value of $0$ means the model does no better than the null model, and values closer to $1$ indicate greater improvement. In practice, values between roughly $0.2$ and $0.4$ are already considered a very good fit.
Advantages: It benchmarks the improvement of the fitted model over the null model.
Limitations: It yields inherently smaller values than traditional $R^2$, so a low value does not always imply a poor model.
print(model.summary())  # we have this model from above
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: Logit Df Residuals: 95
Method: MLE Df Model: 4
Date: Tue, 10 Dec 2024 Pseudo R-squ.: 0.2696
Time: 20:50:40 Log-Likelihood: -50.099
converged: True LL-Null: -68.593
Covariance Type: nonrobust LLR p-value: 1.812e-07
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.3730 1.001 -0.372 0.710 -2.336 1.590
X1 3.4532 0.908 3.804 0.000 1.674 5.232
X2 -3.2572 0.934 -3.486 0.000 -5.089 -1.426
X3 1.0942 0.901 1.214 0.225 -0.672 2.860
X4 -1.1455 0.864 -1.325 0.185 -2.840 0.549
==============================================================================
mcfadden_r2 = 1 - (model.llf / model.llnull) # llf: log-likelihood of full model
# llnull: log-likelihood of null model
print(f"McFadden's pseudo R-squared: {mcfadden_r2:.4f}")
McFadden's pseudo R-squared: 0.2696
Adjusted McFadden's $R^2$ is similar to McFadden's $R^2$, but it penalizes for the number of predictors to account for model complexity.
Formula:

$$R^2_{Adj.\,McFadden} = 1 - \frac{\ln L_{model} - k}{\ln L_{null}}$$

where $k$ is the number of predictors.
k = model.df_model # Number of predictors
adjusted_mcfadden_r2 = 1 - ((model.llf - k) / model.llnull)
print(f"Adjusted McFadden's pseudo R-squared: {adjusted_mcfadden_r2:.4f}")
Adjusted McFadden's pseudo R-squared: 0.2113
Cox and Snell's $R^2$ is based on the ratio of the likelihoods of the null and fitted models, rescaled by the sample size.

Formula:

$$R^2_{CS} = 1 - \left(\frac{L_{null}}{L_{model}}\right)^{2/n}$$

where $L_{null}$ and $L_{model}$ are the likelihoods of the null and fitted models and $n$ is the number of observations.

Interpretation: Higher values indicate a greater improvement over the null model.

Advantages: It provides an interpretable measure of fit, consistent with the $R^2$ concept from linear regression.

Limitations: Its range runs from $0$ to a maximum that is strictly less than $1$, which can make comparisons between models or datasets less straightforward. Its maximum depends on the dataset and model characteristics.
L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)
cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)
print(f"Cox and Snell's pseudo R-squared: {cox_snell_r2:.4f}")
Cox and Snell's pseudo R-squared: 0.3092
Nagelkerke's $R^2$ adjusts Cox and Snell's $R^2$ to ensure its maximum value is $1$, standardizing it for better comparability.
Formula:

$$R^2_{N} = \frac{R^2_{CS}}{1 - L_{null}^{2/n}}$$

Interpretation: Values range from $0$ to $1$; higher values indicate a better fit relative to the null model.

Advantages: Its full $0$-to-$1$ range makes values easier to compare across models and datasets.

Disadvantages: The rescaling inflates values relative to other pseudo-$R^2$ measures, so it still should not be read as a proportion of explained variance.
L_null = np.exp(model.llnull)
L_model = np.exp(model.llf)
cox_snell_r2 = 1 - np.power(L_null / L_model, 2 / model.nobs)  # Cox and Snell's R^2
nagelkerke_r2 = cox_snell_r2 / (1 - np.power(L_null, 2 / model.nobs))
print(f"Nagelkerke's pseudo R-squared: {nagelkerke_r2:.4f}")
Nagelkerke's pseudo R-squared: 0.4142