Should the binary, categorical variables be standardized?

Discussion:

Ming Chen

2006-11-21 04:47:46 UTC

Hi All,

Now I am reading the famous "The Elements of Statistical Learning" and trying
to carry out some of the examples using SAS.
For one example in the linear regression section, the authors state that
they fitted the linear model after standardizing all the predictors.
However, among the predictors, there are two categorical variables.
Basically I have two questions:

1. I can understand why we standardize continuous variables. But what about
the categorical variables?
2. Suppose we can, does the standardization mean we can transform the
categorical variables into continuous variables?

Also, it would be great if you could give me some references about how and when
to rescale variables.

Thanks

Ming

BruceBrad

2006-11-21 06:42:12 UTC

Post by Ming Chen
1. I can understand why we standardize continuous variables. But what about
the categorical variables?
2. Suppose we can, does the standardization mean we can transform the
categorical variables into continuous variables?

Peter Flom

2006-11-21 10:49:18 UTC

Post by BruceBrad

Post by Ming Chen
1. I can understand why we standardize continuous variables. But what about
the categorical variables?
2. Suppose we can, does the standardization mean we can transform the
categorical variables into continuous variables?

There is nothing to stop you standardising a binary variable (i.e.
subtracting the mean value and dividing by the standard deviation of
each variable). However, this is usually not very informative (and it
doesn't mean that we have transformed them into continuous variables).

The main reason for standardising variables is if they have no natural
metric, and so you want to describe the impact of a one standard
deviation increase on the dependent variable.

I missed the original post to which Bruce responded. But let me add a
little to his reply (with which I agree).

Standardizing a binary variable makes no difference to the fit, really,
and, I think, just confuses the interpretation.
A binary variable has a mean and an sd if you code the levels as numbers.
But these numbers don't really mean anything, so the mean and sd also
don't mean anything. E.g., if your variable is 'sex' (male or female),
and you code male as 0 and female as 1, and half the population
is male, then the mean is .5 and the sd is .5. Standardizing means that the
regression coefficient is about an increase of 1 SD, or 1/2 of a
'male-female unit'... huh? Unstandardized, it's about the effect of
being female as opposed to male. That's much clearer.
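
A minimal sketch of the point above, using simulated data (the 0/1 coding, the sample size and the true coefficients are all made up for illustration). It fits the same one-predictor regression with the raw and the standardized indicator and compares the results: the fitted values should coincide, and the slope should simply be rescaled by the sd of the indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical coding: 0 = male, 1 = female.
female = rng.integers(0, 2, size=n).astype(float)
y = 2.0 + 3.0 * female + rng.normal(size=n)


def ols(x, y):
    """Least-squares fit of y on an intercept and one predictor."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0], beta[1], X @ beta


sd = female.std()                      # population sd, sqrt(p * (1 - p))
z = (female - female.mean()) / sd      # standardized indicator

b0_raw, b1_raw, fit_raw = ols(female, y)
b0_std, b1_std, fit_std = ols(z, y)

same_fit = np.allclose(fit_raw, fit_std)   # identical predictions
slope_ratio = b1_std / b1_raw              # equals sd of the indicator

print("same fitted values:", same_fit)
print("sd of indicator   :", sd)
print("slope ratio       :", slope_ratio)
```

The raw slope reads as "female versus male"; the standardized slope is the same effect multiplied by the sd, which is the harder-to-read "per 1 SD of sex" quantity.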

In the original subject line, it says 'binary, categorical' and in the
message it says 'categorical'. While standardizing binary values seems
kind of silly, it doesn't really do any harm. The same is not true if
there are more than two categories. This would be a big mistake.

And, of course, as Bruce noted, nothing will change a categorical
variable into a continuous one.

Peter

Wensui Liu

2006-11-21 12:52:41 UTC

Peter,

Conceptually, I don't see what difference there is between a
binary variable and a multi-level categorical variable, which can be
encoded as multiple binaries.

Could you please shed some light on it?

Thanks.

--
WenSui Liu
(http://spaces.msn.com/statcompute/blog)
Senior Decision Support Analyst
Cincinnati Children Hospital Medical Center

Peter Flom

2006-11-21 13:27:50 UTC

Post by Wensui Liu
Conceptually, I don't see what difference there is between a
binary variable and a multi-level categorical variable, which can be
encoded as multiple binaries.
Could you please shed some light on it?

There are two issues:
1) A multicategory variable might not be coded this way. Suppose, for
example, the variable was race/ethnicity, coded e.g. White = 1, Black =
2, Asian = 3, etc. If you leave it this way, SAS won't know that the
numbers are arbitrary, and will be perfectly willing to standardize the
variable. It will have a mean and an sd. No problem for SAS, big
problem for the end user. Obviously, this is nonsensical. But I've
seen it done. More than once. (When I am in charge of data entry, I
make sure categorical items are coded as letters. This makes this sort
of nonsense easier to avoid).

2) Somewhat more sensibly, one could code race as a series of binary
variables. Obviously, the proportions will not be equal. If it's
typical of the US population, it might be something like 70% White, 10%
Black, 10% Latino, 5% Asian, 5% other. Now, since these are 0/1 variables,
the variance of each is determined by its proportion p: the SD is
(p*(1-p))^.5. For Whites, the SD will be
(.7*.3)^.5 = .46
for Blacks (and for Latinos)
(.9*.1)^.5 = .3
for Asians
(.95*.05)^.5 = .22

Now, when we look at the regression output, we will be comparing an
increase of .22 in 'Asianness' to one of .3 in 'Blackness', to one of
.46 in 'Whiteness'.

Oy vey!

What a mess!
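
The arithmetic in point 2 can be reproduced with a short sketch. The shares are the illustrative ones quoted above (not real census figures), and `dummy_sd` is just a helper name chosen here; it applies the usual sqrt(p * (1 - p)) formula for a 0/1 variable.

```python
import math

# Illustrative population shares from the post above.
shares = {"White": 0.70, "Black": 0.10, "Latino": 0.10, "Asian": 0.05, "Other": 0.05}


def dummy_sd(p):
    """Standard deviation of a 0/1 indicator that equals 1 with proportion p."""
    return math.sqrt(p * (1.0 - p))


sds = {group: round(dummy_sd(p), 2) for group, p in shares.items()}

for group, sd in sds.items():
    print(f"{group:7s} share={shares[group]:.2f}  sd={sd:.2f}")
```

Each standardized dummy therefore gets its own unit, so "a 1 SD increase" means a different thing for every category.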

Peter

Wensui Liu

2006-11-21 21:31:21 UTC

Ming,

In the context of data mining, encoding an N-level categorical variable
as N-1 binaries and then standardizing them makes perfect sense. In
the SAS GLMSELECT procedure, the LARS and LASSO variable selection methods
use this logic, if I understand correctly. On the other hand, it
also makes sense to do so in neural networks if your purpose is
prediction rather than drawing inference.
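
As a rough illustration of the encoding described above (a sketch in NumPy with a made-up three-level variable, not a description of what GLMSELECT does internally), the N-1 indicator columns can be built against a reference level and then z-scored column by column:

```python
import numpy as np

# Hypothetical 3-level categorical variable.
levels = np.array(["a", "b", "c", "a", "c", "b", "a", "a"])

categories = np.unique(levels)                 # sorted unique levels
reference, others = categories[0], categories[1:]

# N-1 indicator columns; rows at the reference level are all zeros.
dummies = np.column_stack([(levels == c).astype(float) for c in others])

# Column-wise z-scores using the population sd (ddof=0).
z = (dummies - dummies.mean(axis=0)) / dummies.std(axis=0)

print("reference level:", reference)
print(z.round(3))
```

This puts every column on unit variance for the fitting step, which is the numerical motivation; the interpretability concerns raised elsewhere in the thread still apply to the resulting coefficients.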

Here is my $0.02.


David L Cassell

2006-11-23 04:30:33 UTC

Post by Ming Chen
1. I can understand why we standardize continuous variables. But what about
the categorical variables?
2. Suppose we can, does the standardization mean we can transform the
categorical variables into continuous variables?
Also, it would be great if you could give me some references about how and when
to rescale variables.

[1] You can never convert categorical variables into continuous variables
by a simple linear transform.

[2] You do not need to convert binary 0/1 variables by standardizing
them. Changing them from 0 & 1 to maybe -0.6547 & +1.5275 is
hardly going to help anything. They will still have the same correlation
with other regressor variables, and you have made them horrible to
interpret.

[3] Standardizing continuous variables is not always the answer either.
You see people do it when they have a variable like YEAR and they
would like to have the variable, plus its square, plus its cube (just as
an example, mind you), and they know that standardizing YEAR will
make the YEAR*YEAR and YEAR**3 variables come out as fairly
uncorrelated with YEAR, as opposed to leaving YEAR as ranging from
(say) 1955 to 2006 and thereby having serious correlations with the
square and cubic terms. But in many cases, it is just not that effective.
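
A small sketch of the YEAR example in [3], assuming one observation per year from 1955 to 2006 (an assumption made here for illustration; real data will not be this evenly spread). It compares the correlation of YEAR with its square and cube before and after standardizing.

```python
import numpy as np

year = np.arange(1955, 2007, dtype=float)   # 1955, ..., 2006


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


# Raw YEAR against its own powers.
raw_sq = corr(year, year**2)
raw_cu = corr(year, year**3)

# Standardized YEAR against its own powers.
z = (year - year.mean()) / year.std()
std_sq = corr(z, z**2)
std_cu = corr(z, z**3)

print("raw:          corr(x, x^2) =", raw_sq, " corr(x, x^3) =", raw_cu)
print("standardized: corr(z, z^2) =", std_sq, " corr(z, z^3) =", std_cu)
```

With this evenly spaced, symmetric range, the squared term becomes essentially uncorrelated with standardized YEAR, but the cubic term stays strongly correlated with it, which fits the caveat that standardizing is often less effective than hoped.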

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330


Wensui Liu

2006-11-23 13:07:23 UTC

David,

Yes, statistically, standardizing a binary variable won't help any. But
numerically, standardizing is desirable.

Also, please take a look at LARS in PROC GLMSELECT. I would be
surprised if SAS doesn't do such standardization.

Happy Thanksgiving!


Peter Flom

2006-11-23 13:32:50 UTC

Post by Wensui Liu
Yes, statistically, standardizing a binary variable won't help any. But
numerically, standardizing is desirable.
Also, please take a look at LARS in PROC GLMSELECT. I would be
surprised if SAS doesn't do such standardization.
Happy Thanksgiving!

I've just been reading an article by Efron et al. on the LARS
algorithm. They do standardize all the variables for the computations,
but they then convert them back to the original metric in the output.
The article is about an algorithm used in R; right now I don't have time
to check what GLMSELECT does (I'm home, the docs are at work, SAS isn't
licensed for my home computer, blah blah blah), but I would be surprised
if this was different in SAS. So, as David and I said, you (the end user)
needn't standardize dichotomies, and probably shouldn't. What the
algorithm does internally is its own business. :-)
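
The "standardize internally, report in the original metric" pattern can be sketched as follows. This uses ordinary least squares on simulated data as a stand-in, since the back-conversion step is the same idea; it is not the LARS algorithm itself and not a description of GLMSELECT's implementation. All variable names and the data-generating values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.normal(50.0, 10.0, size=n),             # a continuous predictor
    rng.integers(0, 2, size=n).astype(float),   # a 0/1 predictor
])
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)


def fit_with_intercept(X, y):
    """Least-squares fit; returns (intercept, slopes)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


# Step 1: standardize for the computation.
mean, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / sd
a_std, b_std = fit_with_intercept(Z, y)

# Step 2: convert the coefficients back to the original metric.
b_orig = b_std / sd
a_orig = a_std - np.sum(b_std * mean / sd)

# For comparison: fit directly on the unstandardized predictors.
a_direct, b_direct = fit_with_intercept(X, y)

print("back-converted:", a_orig, b_orig)
print("direct fit    :", a_direct, b_direct)
```

The two sets of coefficients should agree, so the user sees "per unit" and "female versus male" effects even though the fitting was done on z-scores.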

Happy thanksgiving to you, too

Peter