Discussion:
Proc Logistic and Genmod with Interactions
LB
2006-04-19 15:54:08 UTC
Permalink
Hello,
I'm doing a two-stage conditional regression model. The first stage
models presence-absence with proc logistic, and the second models
abundance given presence with proc genmod.
I have five independent variables, and I included all 2-way interactions
and then reduced using backward elimination. This worked ok with the
logistic regression, but I encounter a problem when I get to the second
stage, because there are only 30 samples at this point and the model won't
converge with that many interaction variables.
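For concreteness, the full stage-1 model (before any reduction) looked
roughly like this. The data set and variable names are only placeholders,
with HABITAT standing in for my categorical variable; the bar-and-@2
notation expands to all main effects plus all two-way interactions:

proc logistic data=survey;
   class habitat;
   model present(event='1') = habitat|cover|x2|x3|x4 @2;
run;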
So when is it ok NOT to check for interactions? I can't really drop any
of the main effects, but I feel like not including interactions is somehow
wrong... Has anyone else encountered something like this before?
Thanks.
Peter Flom
2006-04-19 16:15:33 UTC
Permalink
<<<
I'm doing a two-stage conditional regression model. The first stage
models presence-absence with proc logistic, and the second models
abundance given presence with proc genmod.

I have five independent variables, and I included all 2-way interactions
and then reduced using backward elimination. This worked ok with the
logistic regression, but I encounter a problem when I get to the second
stage, because there are only 30 samples at this point and the model won't converge with that many interaction variables.
Backward elimination is a BAD way to do model selection. Bad bad bad. For extensive reasons, search the archives using keywords stepwise and author = David Cassell.

Briefly, it gives wrong p values. Wrong standard errors. Maybe nonsensical models. But most of all, when you do this, you are letting the computer do your thinking for you.

<<<
So when is it ok NOT to check for interactions? I can't really drop any
of the main effects, but I feel like not including interactions is somehow
wrong... Has anyone else encountered something like this before?
Well, why include the interactions?
Why stop at 2 way?
Why do you feel you have to include main effects?

With only 30 samples, you shouldn't even be including all 5 main effects, because it is likely you are overfitting.

It's hard to give more advice, because you haven't said what you are trying to do, what your DV is, what your IVs are, how they are distributed, how they were gotten, or anything. So, at this point, all I can say is that what you are doing is not a good thing to do.

If you write back to SAS-L with details, then someone here (maybe even me) may be able to help you.

Sorry

Those who have been here a while will know why I am signing this

David - in training (grin, duck, run)


Peter



Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
http://cduhr.ndri.org
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
LB
2006-04-19 18:48:31 UTC
Permalink
OK thanks, I'll try to explain this better.
First off, when I say I did backward elimination, I meant I did it
manually. I'd run the model and then take out one variable at a time,
based on which had the highest p-value, and did not take out any main
effects if they were included in an interaction that was significant. A
former stats prof of mine advised to do it this way. I'll look into using
stepwise, but aren't there problems with that too?

So here is what I'm trying to do:
I have results from a field survey to look at a species abundance and
distribution. At ~100 locations, I have a point count of the number of
individuals observed. I want to test what characteristics of the habitat
influence their abundance. These were all variables that were also
recorded during the survey (e.g., % cover, habitat type). In total, I
have five independent variables (1 categorical, 4 continuous) that I want
to include.

Because there are a lot of zeros in the data (thus violating any
homogeneity and normality assumptions, and simply transforming the raw
data did not fix this) the two-part conditional approach seemed like a
good way to model the data. Out of the 100 locations, individuals were
present at 30. So first, I used a logistic model to see what variables
significantly affected presence/absence (absent=0, present=1).

Then, using only the sites with individuals present (n=30), I modelled the
abundance, given presence, with a generalized linear model with a negative
binomial distribution. This time, the dependent variable was the # of
individuals observed, rather than presence/absence. All of the same
independent variables and interactions were included, hence when I began
to realize I had a problem.
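In case it helps, stage 2 was roughly the following (same placeholder
names as above; COUNT is the point count):

proc genmod data=survey;
   where present = 1;   /* the 30 sites with individuals present */
   class habitat;
   model count = habitat|cover|x2|x3|x4 @2 / dist=negbin link=log;
run;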

I realize that I may be including too many independent variables and
overfitting the model. There are a couple of the effects that I
don't "think" will be that important, but don't have an statistical
justification for not including them. I get rather confused about when to
include interactions, and how far to go (2-way, 3-way, etc.), and then how
the heck to interpret them when they are significant. I also have major
multi-collinearity issues when I start adding interactions.
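(As an aside, I've read that centering the continuous predictors before
forming the products can reduce that kind of collinearity. If I have the
syntax right, something like

proc standard data=survey mean=0 out=centered;
   var cover x2 x3 x4;
run;

and then fitting the models to the CENTERED data set.)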

Hope this makes more sense and sorry if it doesn't. I appreciate any
feedback.
Peter Flom
2006-04-19 19:12:45 UTC
Permalink
<<<
OK thanks, I'll try to explain this better.


First off, when I say I did backward elimination, I meant I did it
manually. I'd run the model and then take out one variable at a time, based on which had the highest p-value, and did not take out any main effects if they were included in an interaction that was significant. A former stats prof of mine advised to do it this way. I'll look into using stepwise, but aren't there problems with that too?
Stepwise is WORSE than backwards. When I said to search on stepwise, it was because I think that has been in the title of many of the threads about how bad ALL these methods are. (Backwards is bad, forwards is worse, stepwise combines the worst of both)

Doing it manually, if all you are doing is what you say, at least eliminates the problem of having interaction terms without main effects, but is still not a good method.

<<<
So here is what I'm trying to do:
I have results from a field survey to look at a species abundance and
distribution. At ~100 locations, I have a point count of the number of
individuals observed. I want to test what characteristics of the habitat
influence their abundance. These were all variables that were also
recorded during the survey (e.g., % cover, habitat type). In total, I
have five independent variables (1 categorical, 4 continuous) that I want to include.
Then you probably want to be using one of the SURVEY PROCS. Luckily, we have an expert on these PROCs right here on SAS-L,
(cue David)

<<<
Because there are a lot of zeros in the data (thus violating any
homogeneity and normality assumptions, and simply transforming the raw data did not fix this) the two-part conditional approach seemed like a good way to model the data. Out of the 100 locations, individuals were present at 30. So first, I used a logistic model to see what variables significantly affected presence/absence (absent=0, present=1).
Per the above, you probably want to switch to PROC SURVEYLOGISTIC
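A minimal sketch, assuming the survey really had a design. The STRATA and
CLUSTER variables here are hypothetical; they only make sense if such
things exist in your sampling plan:

proc surveylogistic data=survey;
   strata region;    /* hypothetical design stratum */
   cluster site;     /* hypothetical sampling cluster */
   class habitat;
   model present(event='1') = habitat cover x2 x3 x4;
run;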

<<<
Then using only the sites with individuals present (n=30), I modelled the abundance, given presence, with a generalized linear model with a negative binomial distribution. This time, the dependent variable was the # of individuals observed, rather than presence/absence. All of the same independent variables and interactions were included, hence when I began to realize I had a problem.

I realize that I may be including too many independent variables and
overfitting the model. There are a couple of the effects that I
don't "think" will be that important, but don't have an statistical
justification for not including them. I get rather confused about when to include interactions, and how far to go (2-way, 3-way, etc.), and then how the heck to interpret them when they are significant. I also have major multi-collinearity issues when I start adding interactions.
OK, first: like a lot of people, you have put the cart before the horse.
The horse is substance. The cart is statistics. You picked 5 independent
variables. So, based on your own knowledge, you excluded the infinite
number of variables you MIGHT have included. You had no statistical
justification for excluding them, you just think you know what you are
doing. This is as it should be.
Randomly including variables is known as PROC FISH :-)

Now, take these 5. You say you have some that you don't think are important. Good. Take them out. See what happens. Do parameter estimates change (for the other variables)?
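A rough sketch of what I mean, with made-up data set and variable names;
the ODS statements just capture each fit's estimates so you can lay them
side by side:

ods output ParameterEstimates=pe_full;
proc genmod data=present_sites;
   class habitat;
   model count = habitat cover x2 x3 x4 / dist=negbin;
run;

ods output ParameterEstimates=pe_reduced;
proc genmod data=present_sites;
   class habitat;
   model count = habitat cover x2 / dist=negbin;
run;

proc print data=pe_full;    var Parameter Estimate StdErr; run;
proc print data=pe_reduced; var Parameter Estimate StdErr; run;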

Next, as to which interactions: Well, again, which ones do you think MIGHT make sense?

Figure out this stuff substantively, and then we can figure out what to do statistically.

Hope this helps

Peter
LB
2006-04-19 21:01:13 UTC
Permalink
On Wed, 19 Apr 2006 15:12:45 -0400, Peter Flom <***@NDRI.ORG> wrote:

Thanks... I'll try your advice, but I also have a follow-up question...
Post by Peter Flom
Stepwise is WORSE than backwards. When I said to search on stepwise, it
was because I think that has been in the title of many of the threads
about how bad ALL these methods are. (Backwards is bad, forwards is
worse, stepwise combines the worst of both)
Post by Peter Flom
Doing it manually, if all you are doing is what you say, at least
eliminates the problem of having interaction terms without main effects,
but is still not a good method.

Maybe I'm missing something here, but if these are ALL bad, what is a GOOD
way to reduce a model?? Or do you just leave all of the insignificant
terms in there or something? This just doesn't make sense to me.
Kevin Roland Viel
2006-04-19 21:34:22 UTC
Permalink
Post by LB
Maybe I'm missing something here, but if these are ALL bad, what is a GOOD
way to reduce a model?? Or do you just leave all of the insignificant
terms in there or something? This just doesn't make sense to me.
I'm in the same learning position myself. There is, of course, dissent in
my department on this issue, but it just makes journal clubs that much
more fun :)

First, insignificant parameter estimates are not statistically distinguishable from zero. What is
the effect of removing them on the point estimate(s) and the corresponding
precision of the variable(s) of interest?

My committee favored the a priori development of a model. They encouraged
me to explore different representations of certain covariates (splines,
quadratics, etc.), which I could eliminate via an LRT. However, if I said
that I expected that variable X might confound the relationship between
the DV and IV, then I left it in the model, regardless of the p-value.
Perhaps in future analyses, I might exclude variable X.
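For the quadratics, the mechanics were roughly as below (invented names;
each fit's log likelihood comes from GENMOD's "Criteria For Assessing
Goodness Of Fit" table, and the LRT statistic is twice the difference in
log likelihoods, here on 1 df):

proc genmod data=present_sites;
   model count = cover cover*cover / dist=negbin;
run;

proc genmod data=present_sites;
   model count = cover / dist=negbin;
run;

* p-value in a data step: p = 1 - probchi(2*(ll_full - ll_reduced), 1);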

A second approach might be to accept a "gold standard" model, then remove
certain variables, checking on the point estimate and precision. In
addition to significance, you might also use the AIC.

However, I think I tend to favor my committee's sage approach. Know what
data are available to you, hit the literature and "experts" (in my case
other researchers, nurses, doctors, law enforcement, EMTs, etc), develop
a model, then plot, plot, plot, and then run the model and plot again.

Good luck,

Kevin

PS I used to have some fascination with the magical powers of a model, as
if I might reveal something hidden, something important. Now I see that it
is my data, just my data. My analogy to modeling is having a cube of
jello and a hemispherical mold over which to fit it, or is it a pyramidal
mold or a rectangular mold, or...???

Kevin Viel
Department of Epidemiology
Rollins School of Public Health
Emory University
Atlanta, GA 30322
Swank, Paul R
2006-04-19 23:21:22 UTC
Permalink
Except for the survey part, this sounds like a job for ZINB, the
zero-inflated negative binomial, rather than a two-stage analysis. There
have been some dynamite posts on the list by Dale McLerran that you could
search for. However, its appropriateness is going to depend on your
meaning of survey.
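If memory serves, a ZINB can be coded directly in PROC NLMIXED along
these lines. The covariate names are placeholders (just one shown per
piece), and you should check the likelihood against Dale's posts before
trusting it:

proc nlmixed data=survey;
   parms b0=0 b1=0 g0=0 g1=0 alpha=1;
   /* probability of a structural zero */
   p0 = 1 / (1 + exp(-(g0 + g1*cover)));
   /* negative binomial mean */
   mu = exp(b0 + b1*cover);
   /* zero-inflated negative binomial log likelihood */
   if count = 0 then
      ll = log(p0 + (1-p0) * (1 + alpha*mu)**(-1/alpha));
   else
      ll = log(1-p0) + lgamma(count + 1/alpha) - lgamma(1/alpha)
           - lgamma(count + 1) + count*log(alpha*mu)
           - (count + 1/alpha)*log(1 + alpha*mu);
   model count ~ general(ll);
run;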


Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Director of Research, Center for Improving the Readiness of Children for
Learning and Education (C.I.R.C.L.E.)
Medical School
UT Health Science Center at Houston

David L Cassell
2006-04-20 04:41:27 UTC
Permalink
Post by LB
I'm doing a two-stage conditional regression model. The first stage
models presence-absence with proc logistic, and the second models
abundance given presence with proc genmod.
I have five independent variables, and I included all 2-way interactions
and then reduced using backward elimination. This worked ok with the
logistic regression, but I encounter a problem when I get to the second
stage, because there are only 30 samples at this point and the model won't
converge with that many interaction variables.
So when is it ok NOT to check for interactions? I can't really drop any
of the main effects, but I feel like not including interactions is somehow
wrong... Has anyone else encountered something like this before?
I agree with Peter on every point.

Automatic backward selection is bad. Automatic stepwise is also bad.
Any selection method needs to be done with full attention to detail.

<hypothetical case>
What do you do if everything works with all the variables in, but once you
drop a non-significant variable out, suddenly you find that the residuals
no longer look normal? Oops.
</hypothetical case>

You have to handle this at a human-intervention level. I'll let you sneak
over and use PROC GLMSELECT, but I'll still want you to check your model
assumptions very carefully, and make sure that model violations are not
driving the parameter estimates you end up with.
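Something like the sketch below, with placeholder names, and with the
caveat that GLMSELECT (experimental as of 9.1, if I recall) fits a
normal-errors linear model, so on your counts it is a screening step at
best, not a final model. The SPLIT option lets the LASSO treat the CLASS
variable's dummy columns separately:

proc glmselect data=present_sites;
   class habitat / split;
   model count = habitat|cover|x2|x3|x4 @2
         / selection=lasso(choose=aic);
run;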


When is it okay not to check for interactions? Anytime the theory doesn't
point you toward interactions as key components of the model.
Anytime the experts don't think the interactions matter. Anytime you
have so many main effects that you cannot estimate interactions or
separate them from other main effects.

There are a LOT of times when you don't need the interactions. Why
are you so sure you are supposed to check them?

Oh, and are your data from a sample survey, or from an observational
study, or just a haphazard bunch of people? The context matters
too.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

David L Cassell
2006-04-20 04:45:28 UTC
Permalink
Post by LB
<<<
I'm doing a two-stage conditional regression model. The first stage
models presence-absence with proc logistic, and the second models
abundance given presence with proc genmod.
I have five independent variables, and I included all 2-way interactions
and then reduced using backward elimination. This worked ok with the
logistic regression, but I encounter a problem when I get to the second
stage, because there are only 30 samples at this point and the model won't
converge with that many interaction variables.
Backward elimination is a BAD way to do model selection. Bad bad bad. For
extensive reasons, search the archives using keywords stepwise and author =
David Cassell.
Briefly, it gives wrong p values. Wrong standard errors. Maybe nonsensical
models. But most of all, when you do this, you are letting the computer do
your thinking for you.
<<<
So when is it ok NOT to check for interactions? I can't really drop any
of the main effects, but I feel like not including interactions is somehow
wrong... Has anyone else encountered something like this before?
Well, why include the interactions?
Why stop at 2 way?
Why do you feel you have to include main effects?
With only 30 samples, you shouldn't even be including all 5 main effects,
because it is likely you are overfitting.
It's hard to give more advice, because you haven't said what you are trying
to do, what your DV is, what your IVs are, how they are distributed, how
they were gotten, or anything. So, at this point, all I can say is that
what you are doing is not a good thing to do.
If you write back to SAS-L with details, then someone here (maybe even me)
may be able to help you.
Sorry
Those who have been here a while will know why I am signing this
David - in training (grin, duck, run)
Now don't worry about a thing, Peter. We'll have those training
wheels off in no time.

(ducks and speeds off before Peter can throw his trike at me :-)

David, who thinks Peter isn't grouchy enough yet :-) :-)
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

David L Cassell
2006-04-20 05:07:06 UTC
Permalink
Post by LB
OK thanks, I'll try to explain this better.
First off, when I say I did backward elimination, I meant I did it
manually. I'd run the model and then take out one variable at a time,
based on which had the highest p-value, and did not take out any main
effects if they were included in an interaction that was significant. A
former stats prof of mine advised to do it this way. I'll look into using
stepwise, but aren't there problems with that too?
I have results from a field survey to look at a species abundance and
distribution. At ~100 locations, I have a point count of the number of
individuals observed. I want to test what characteristics of the habitat
influence their abundance. These were all variables that were also
recorded during the survey (e.g., % cover, habitat type). In total, I
have five independent variables (1 categorical, 4 continuous) that I want
to include.
Because there are a lot of zeros in the data (thus violating any
homogeneity and normality assumptions, and simply transforming the raw
data did not fix this) the two-part conditional approach seemed like a
good way to model the data. Out of the 100 locations, individuals were
present at 30. So first, I used a logistic model to see what variables
significantly affected presence/absence (absent=0, present=1).
Then, using only the sites with individuals present (n=30), I modelled the
abundance, given presence, with a generalized linear model with a negative
binomial distribution. This time, the dependent variable was the # of
individuals observed, rather than presence/absence. All of the same
independent variables and interactions were included, hence when I began
to realize I had a problem.
I realize that I may be including too many independent variables and
overfitting the model. There are a couple of the effects that I
don't "think" will be that important, but don't have an statistical
justification for not including them. I get rather confused about when to
include interactions, and how far to go (2-way, 3-way, etc.), and then how
the heck to interpret them when they are significant. I also have major
multi-collinearity issues when I start adding interactions.
Hope this makes more sense and sorry if it doesn't. I appreciate any
feedback.
Okay, I am going to agree with Peter on everything, except one point.

You don't have a survey sample task. So stick with PROC REG, PROC
LOGISTIC, etc.

Unless these data come from a carefully-designed environmentally-based
survey sample, like one of the EMAP samples built by the U.S. EPA, you
have an observational study. There will not be a sample design with known
design effects, nor will there be discernible sampling weights. So just
analyze it using standard linear regression, logistic regression, etc.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

David L Cassell
2006-04-20 05:21:37 UTC
Permalink
Post by LB
Thanks... I'll try your advice, but I also have a follow-up question...
Post by Peter Flom
Stepwise is WORSE than backwards. When I said to search on stepwise, it
was because I think that has been in the title of many of the threads
about how bad ALL these methods are. (Backwards is bad, forwards is
worse, stepwise combines the worst of both)
Post by Peter Flom
Doing it manually, if all you are doing is what you say, at least
eliminates the problem of having interaction terms without main effects,
but is still not a good method.
Maybe I'm missing something here, but if these are ALL bad, what is a GOOD
way to reduce a model?? Or do you just leave all of the insignificant
terms in there or something? This just doesn't make sense to me.
There are more BAD ways to reduce a model than there are good ways.
Peter is just pointing you away from the bad ways that people actually
continue to use. (Thanks, Dr. Flom!) He is also protecting you from one
of my massive rants on the subject of what goes wrong with stepwise
regression, and why you should not use it. To paraphrase the great
von Neumann, "Anyone who believes the results of stepwise regression
is living in a state of sin." :-) [Computer science people will get the
joke. If there is one. You may feel that my career as a comedian is not
likely to blossom.]

The most important way to reduce a model is working with subject matter
experts. Do this whenever possible. If Expert #1 can tell you that X2 and
X4 are usually correlated, because they are really surrogates for the
unmeasured variable U3, then you can make wise decisions. And those
decisions may be quite different than if Expert #2 tells you that X2 and X4
are usually correlated, because research has shown that X2 is driven by
previous levels of X4.

For large numbers of variables, the best approach may involve Bayesian
methods; LASSO and/or LAR (as implemented in PROC GLMSELECT); factor
analysis or similar data reduction methods before the analysis; tools like
principal component regression or partial least squares regression (as in
PROC PLS); measurement error models as in SAS/ETS; or maybe even
structural equation models as in PROC CALIS. It all depends on the nature
of the data and the nature of the problem and the expertise you can
extract from subject matter experts, in many cases from research articles
by said experts. And, of course, said experts are sometimes wrong, or may
disagree violently with one another.
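To make just one of those concrete: a quick principal-components screen
might look like the sketch below (placeholder names again, and whether
the components mean anything to your subject-matter experts is a whole
separate question):

proc princomp data=survey out=scores n=2;
   var cover x2 x3 x4;
run;

proc genmod data=scores;
   where present = 1;
   model count = prin1 prin2 / dist=negbin;
run;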

Once you have applied subject-matter expertise to building your model,
then you're at the Frank Harrell stage. Fit your model. Study your
regression diagnostics. Once things are working and your model is viable,
then you can look at parameter estimates and assess which variables are
providing meaningful contributions to your model.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

