
I've only just started learning logistic regression, so bear with me. As I understand it, the logistic regression model is as follows:

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X $$

where $p(X)$ is the probability of success given $X$.

Please tell me if I'm wrong, but isn't the logistic regression model effectively a linear regression model with a logit link transformation of the binomial response?

I ran into a problem when I was playing around with a data frame. I'm trying to fit a logistic regression model to a data set with two explanatory variables, water (binary) and concentration (continuous), and a binomial response variable (number of successes out of a number of trials).

> trout_new
   Concentration Water Total Successes       prop      logit
47          2880     0   143       105 0.73426573  1.0163742
5            180     1    68         4 0.05882353 -2.7725887
7            180     1   109        11 0.10091743 -2.1870722
41          1440     0   100        14 0.14000000 -1.8152900
35           360     0    61         1 0.01639344 -4.0943446
34           360     0   145        21 0.14482759 -1.7757591
44          1440     0   113         3 0.02654867 -3.6018681
39           720     0    99        40 0.40404040 -0.3886580
4             90     1   122         9 0.07377049 -2.5301632
14           720     1    87         3 0.03448276 -3.3322045
11           360     1   129         9 0.06976744 -2.5902672
17          1440     1   140        60 0.42857143 -0.2876821

Column prop is the proportion of successes (Successes/Total), and column logit is the log odds of success, calculated as log(prop/(1-prop)).
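
For reference, a minimal sketch of how those two columns can be derived from Successes and Total:

trout_new$prop  <- trout_new$Successes / trout_new$Total        # proportion of successes
trout_new$logit <- log(trout_new$prop / (1 - trout_new$prop))   # empirical log odds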

I fit a logistic regression model as follows:

> trout_model <- glm(cbind(Successes, Total - Successes) ~ Water + Concentration,
+                    family = binomial, data = trout_new)
> summary(trout_model)

Call:
glm(formula = cbind(Successes, Total - Successes) ~ Water + Concentration, 
    family = binomial, data = trout_new)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.9030  -2.4959  -0.5221   1.5977   6.8773  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.820e+00  1.856e-01 -15.193   <2e-16 ***
Water          2.404e-01  1.728e-01   1.391    0.164    
Concentration  1.245e-03  9.557e-05  13.029   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 375.75  on 11  degrees of freedom
Residual deviance: 141.66  on  9  degrees of freedom
AIC: 195.51

Number of Fisher Scoring iterations: 5

Then I fit a linear model to the log odds of success; I tried both glm(..., family = gaussian) and lm().

> trout_model1 <- glm(logit ~ Water + Concentration,
+                     family = gaussian, data = trout_new)
> summary(trout_model1)

Call:
glm(formula = logit ~ Water + Concentration, family = gaussian, 
    data = trout_new)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.12774  -0.57066   0.08564   0.70799   1.99285  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)   -3.2888858  0.8145197  -4.038  0.00294 **
Water          0.3817345  0.8213712   0.465  0.65315   
Concentration  0.0012602  0.0005232   2.409  0.03932 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1.615836)

    Null deviance: 24.689  on 11  degrees of freedom
Residual deviance: 14.543  on  9  degrees of freedom
AIC: 44.361

Number of Fisher Scoring iterations: 2
> trout_model2 <- lm(logit ~ Water + Concentration, data = trout_new)
> summary(trout_model2)

Call:
lm(formula = logit ~ Water + Concentration, data = trout_new)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.12774 -0.57066  0.08564  0.70799  1.99285 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)   -3.2888858  0.8145197  -4.038  0.00294 **
Water          0.3817345  0.8213712   0.465  0.65315   
Concentration  0.0012602  0.0005232   2.409  0.03932 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.271 on 9 degrees of freedom
Multiple R-squared:  0.411,	Adjusted R-squared:  0.2801 
F-statistic:  3.14 on 2 and 9 DF,  p-value: 0.09239

The coefficients from the linear regression on the log odds (logit transformation) are different from those of the logistic regression. Can someone explain why they differ? Is there more to logistic regression that I'm overlooking? Does the difference lie in maximum likelihood vs. least squares estimation?

If the log odds of the response equal a linear combination of the parameters, as textbooks suggest, how can I model this in R using the lm() function? Or is that not possible?

asked Dec 30, 2022 at 14:30

2 Answers


Three reasons come to mind.

  1. The expected value of a transformation need not equal that same transformation of the expected value; Jensen's inequality is one result related to this. (A quick numeric illustration follows this list.)
  2. You can do this when you have multiple events corresponding to the same features, as you appear to have. For each value of concentration and the "water" indicator variable, you have multiple observations, some of which are successes and some of which are failures. Logistic regression need not operate on such data. It is entirely possible, perhaps even more common, for every success and failure to correspond to a unique combination of features. Then you cannot calculate the log odds, as the event either happened (probability 1ドル$) or did not happen (probability 0ドル$), and dividing by zero or taking the logarithm of zero is problematic.
  3. Why should they be equal? The optimization procedures are totally different. One involves minimizing squared loss (corresponding to maximum likelihood estimation for a Gaussian conditional distribution) while the other involves minimizing log loss (corresponding to maximum likelihood estimation for a binomial conditional distribution). See the sketch after the code below.
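
For the first point, a quick numeric illustration with made-up proportions: the mean of the logits is not the logit of the mean.

p <- c(0.1, 0.2, 0.9)
mean(qlogis(p))   # mean of the logits: about -0.46 (qlogis() is the logit function)
qlogis(mean(p))   # logit of the mean: about -0.41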

For that second one, consider how you would define the log-odds in a table like the one generated by the following code.

set.seed(2022) # soon will be time for a new seed value!
N <- 100
concentration <- runif(N, 0, 1)  # continuous feature
water <- rbinom(N, 1, 0.5)       # binary feature
total <- rep(1, N)               # every row is a single trial...
successes <- rbinom(N, 1, 0.5)   # ...so the observed proportion is always 0 or 1
d <- data.frame(
  concentration,
  water,
  total,
  successes
)
d
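
And for the third point, a rough sketch of the two losses being minimized directly, assuming the question's trout_new data frame is loaded; up to optimizer tolerance, the two optima reproduce the glm() and lm() coefficients respectively.

X <- model.matrix(~ Water + Concentration, data = trout_new)

# Binomial negative log-likelihood (log loss), what glm(family = binomial) minimizes;
# plogis(eta, log.p = TRUE) is a numerically stable log(p), plogis(-eta, ...) is log(1 - p)
log_loss <- function(beta) {
  eta <- X %*% beta
  -sum(trout_new$Successes * plogis(eta, log.p = TRUE) +
       (trout_new$Total - trout_new$Successes) * plogis(-eta, log.p = TRUE))
}

# Squared loss on the empirical log odds, what lm() minimizes
square_loss <- function(beta) sum((trout_new$logit - X %*% beta)^2)

optim(c(0, 0, 0), log_loss, method = "BFGS")$par     # approx. coef(trout_model)
optim(c(0, 0, 0), square_loss, method = "BFGS")$par  # approx. coef(trout_model2)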
answered Dec 30, 2022 at 15:24
  • Thank you! You're right, I wouldn't be able to do a logit transformation on values of zero events. I'll have to learn more about the different maximum likelihood optimization procedures. – Commented Dec 30, 2022 at 18:48

In addition to Dave's answer, I think this has already been addressed in: What is the difference between logit-transformed linear regression, logistic regression, and a logistic mixed model?

My suggestion is to think about the experimental/sampling design. Since your data do indeed look binomial, a GLM might in principle be more powerful and appropriate than a linear model (Warton, D. I., & Hui, F. K. (2011). The arcsine is asinine: the analysis of proportions in ecology. Ecology, 92(1), 3-10).

However, given the repeated, discrete values of concentration, your data look like they come from a manipulative experiment (sensu Hurlbert, S. H. (1984). Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 54(2), 187-211). If this is the case, then to avoid pseudoreplication a linear model (with some transformation!) or a GLMM (i.e. a generalized linear mixed model) might be preferable (see e.g. Waller, B. M., Warmelink, L., Liebal, K., Micheletta, J., & Slocombe, K. E. (2013). Pseudoreplication: a widespread problem in primate communication research. Animal Behaviour, 86(2), 483-488; Arnqvist, G. (2020). Mixed models offer no freedom from degrees of freedom. Trends in Ecology & Evolution, 35(4), 329-335).
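
As a rough sketch of the GLMM option (the grouping variable tank below is hypothetical; use whatever experimental unit was actually replicated in your design):

library(lme4)  # assumed to be installed; provides glmer()

# 'tank' is a hypothetical column identifying the replicated experimental unit
trout_glmm <- glmer(cbind(Successes, Total - Successes) ~ Water + Concentration + (1 | tank),
                    family = binomial, data = trout_new)
summary(trout_glmm)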

answered Dec 30, 2022 at 16:47
