Discussion:
get count of unique values for all variables in a dataset
TimBraam
2005-10-07 18:41:48 UTC
All,

I've been asked to write a macro that will provide counts of the number
of unique values of each variable in a dataset.

I've written something very brute force - using the dictionary tables
to get the names of all the variables in the dataset, then using PROC
FREQ and a macro loop to generate a table statement for each variable
(code below). This approach does okay on a medium-size dataset, but
when I ran it with 100,000 observations and 3000 variables, it took
over 15 minutes. Can anyone think of a better approach? Using SQL with
count and distinct was no better than FREQ.

Basically, these folks want to 'guess' whether their variables are
categorical or continuous based on the number of distinct values as a
proportion of the number of observations in the data set, which should
explain the bulk of my so-far-developed code below.

Thanks,

Tim Braam
US Census Bureau


%macro makedata ;
data work.big (keep=nvars1-nvars2900 cvars1-cvars101);
array nvars (2900) ;
array cvars (101) $32 ;
array lim (2801:2900) ;
retain lim2801 - lim2900 1;

do j=2801 to 2900 ;
lim(j)=ceil(100000*ranuni(2)) ;
end ;

do m=1 to 100000 ; /* number of observations in dataset */
do j=1 to 2800 ;
nvars(j)=int(100*ranuni(1)) ;
end ;

do j=2801 to 2900 ;
nvars(j)=int(lim(j)*ranuni(3)) ;
end ;

do k=1 to 101 ;
cvars(k)=put(nvars(K+2799),words32.) ;
end ;
output ;
end ;
run ;

proc sql noprint ;
create index cvars2 on sasuser.big(cvars2) ;
create index mc on sasuser.big (cvars3, cvars4) ;
quit ;
%mend ;
/* run this only once */
/*%makedata ; */

%macro getsizes(dsn=sashelp.class,critval=10) ;
options nomprint ;

%let lib=%scan(&dsn,1,'.') ;
%let dsname=%scan(&dsn,2,'.') ;

proc sql noprint ;
select ceil((nobs - delobs)* %eval(&critval/100)) into :crit
from dictionary.tables where libname=upcase("&lib") and
memname=upcase("&dsname") ;
select name into :vname1 - :vname32000
from dictionary.columns where libname=upcase("&lib") and
memname=upcase("&dsname") ;
%let nvars=%eval(&sqlobs) ;
quit ;

proc freq data=sasuser.big;
%do j=1 %to &nvars ;
table &&vname&j / noprint nopercent nocum nocol nofreq norow
out=&&vname&j.._c(keep=&&vname&j);
%end ;
run;

data counts;
length vname $32 count 8 vrole $10;
stop;
run ;

proc sql noprint ;
%do j=1 %to &nvars ;
insert into counts select "&&vname&j" as vname, count(*) as count, ' '
as vrole from &&vname&j.._c;
drop table &&vname&j.._c;
%end ;
update counts set vrole= case when count < &crit then 'Category' else
'Continuous' end ;
quit ;

%mend ;
%getsizes(dsn=work.big,critval=10) ;
Chang Chung
2005-10-07 18:56:44 UTC
On Fri, 7 Oct 2005 11:41:48 -0700, TimBraam <***@CENSUS.GOV>
wrote:
...
Post by TimBraam
Basically, these folks want to 'guess' whether their variables are
categorical or continuous based on the number of distinct values as a
proportion of the number of observations in the data set, which should
explain the bulk of my so-far-developed code below.
...

Hi, Tim,
If this is the purpose, then I wonder if it would be easier and faster to
randomly select a certain percentage of the observations first, then count
the number of unique values in each variable. With fewer observations, any
method of counting will run faster. Just a thought.
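Something along these lines might do it (an untested sketch; the 10% rate,
the seed, and the name nvars1 from Tim's test data are just placeholders):

data sample;
  set work.big;
  if ranuni(12345) <= 0.10; /* keep roughly 10% of the observations */
run;

proc sql;
  select count(distinct nvars1) as n_distinct
  from sample;
quit;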
Cheers,
Chang
Jeff Abolafia
2005-10-07 19:51:47 UTC
/* create statements to get unique values of all variables */
proc sql noprint ;
select catx(' ',"count(distinct ",name,") as ",name)
into :vars separated by ','
from sashelp.vcolumn
where libname="DER" and memname="PKANLY"
;
quit ;
%put &vars ;

/* get unique values of all variables */
proc sql noprint ;
create table onex as
select &vars
from der.pkanly
;
quit ;
/* transpose to 1 record/variable */
proc transpose data=onex out=onext;
run;


Jeffrey Abolafia
Assistant Director, Biostatistics
RHO, inc.
6330 Quadrangle Drive, Suite 500
Chapel Hill, NC 27517
Phone: (919)-408-8000 Ext. 331
Fax: (919)-408-0999
Email: ***@rhoworld.com
Dennis Diskin
2005-10-07 19:38:23 UTC
Tim,
This probably won't run any faster, but this will produce all your tables.
ods listing close;
ods output OneWayFreqs(match_all)=a_(drop=f_:);
proc freq data = sashelp.vcolumn;
tables _all_ / nocum nopercent;
run;
If any of these variables are non-integer numerics, you will potentially
spend a lot of time building the counts no matter how you do it.
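One way to blunt that cost is to let a format collapse the values into
bins, since PROC FREQ counts levels by formatted value (an untested
sketch; nvars1-nvars2900 are the names from Tim's test data):

proc freq data=work.big;
  format nvars1-nvars2900 12.2; /* group levels at 2-decimal resolution */
  tables _all_ / nocum nopercent;
run;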
HTH,
Dennis Diskin
Phil Scott
2005-10-10 03:42:19 UTC
Post by Dennis Diskin
This probably won't run any faster, but this will produce all your tables.
ods listing close;
ods output OneWayFreqs(match_all)=3Da_(drop=3Df_:)
proc freq data =3D sashelp.vcolumn / nocum nopercent;
tables _all_ ;
run;
What are the '3D' values above? Is it supposed to read like this:
ods listing close;
ods output OneWayFreqs(match_all)=a_(drop=f_:)

proc freq data = sashelp.vcolumn / nocum nopercent;

tables _all_ ;

run;
Pudding Man
2005-10-07 20:57:22 UTC
What am I missing, here? Didn't Ian W. demonstrate a very nice
hash object solution to this problem a scant 2 weeks ago?

Tim is with Census, and Census is a "900 lb. Gorilla" of SAS licensees,
so V9 should be available, if not installed ...

Search the archives for Ian's post under subject
"PROC SQL Select Limit ...".

Skoal,
Puddin'

*****************************************************************
*** Puddin' Man PuddingDotMan at GmailDotCom **
*****************************************************************;
Patnaik, Tirthankar
2005-10-08 07:34:43 UTC
-----Original Message-----
Fehd, Ronald J
Sent: Saturday, October 08, 2005 12:34 AM
Subject: Re: get count of unique values for all variables in a dataset
This code looks good from where I sit.
What's the problem with taking 15 minutes to crunch the numbers?
You're only going to run this once on a data set, right?
I have developed a similar routine that does a freq of each variable
and shows the high and low N(default=5) values.
you use SQL count, I use proc freq.
the numbers are the same
and I suspect the SQL will be faster
since it doesn't have to calculate percent as does proc freq.
you're having to do a pass of the data set
for every variable, there's no way around that.
I've been working with this problem for some time now. The issue is the
same: get the distinct values of all the variables in a dataset, to decide
how to classify each variable.

Of course, you can use Roy Pardee's solution of using SQL to get the
distinct entries. The catch is that you'll need a pass through the dataset
each time, just as Ron Fehd has pointed out. I use this method as a
benchmark.

As it turns out, you don't have to do a pass of the dataset for each var.
You can get all the unique values of every variable with _one_ pass (or as
many as you like) of the dataset. The trick is to use a hash for each var.
The problem then is that you need enough RAM to hold all the unique values
of all the variables. With the file sizes I typically use (<2GB), I can get
by with the gig of RAM on my Thinkpad. However, this is not practical for
large datasets, so I instead set the number of passes in which I want to
collect the unique values of all the vars.

Consider:

575 /* ------------------------------------------------------------------------- */
576 options nomprint;
577 /* Create the dataset, with missing values too. */
578 %let size=10;
579 %let N=1000;
580
581 data ATest(compress=binary);
582 array norm norm1-norm&size.;
583 do i = 1 to &n.;
584 do j = 1 to &size.;
585 norm{j} = ceil(1e5*ranuni(i));
586 if norm{j} <= j*5e2 then norm{j} = .;
587 end;
588 output;
589 end;
590 drop j i;
591 run;

NOTE: The data set WORK.ATEST has 1000 observations and 10 variables.
NOTE: Compressing data set WORK.ATEST increased size by 18.18 percent.
Compressed is 13 pages; un-compressed would require 11 pages.
NOTE: DATA statement used (Total process time):
real time 1.37 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
Memory 154k


592
593 /* Get the list of vars in the ds. */
594
595 ods listing close;
596 ods output Variables=cnt_Atest;
597 proc contents data=Atest;
598 run;

NOTE: The data set WORK.CNT_ATEST has 10 observations and 6 variables.

599 ods listing;
600
601 /* Use the SQL method to get the number of distinct entries. */
602 /* Generate the code, and put in a macro. */
603 proc sql _method stimer buffersize=1024000 noprint;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k

604 select "count(distinct "||left(trim(Variable))||")" into :SqlVList
605 separated by ","
606 from cnt_Atest;

NOTE: SQL execution methods chosen are:

sqxslct
sqxfil
sqxsrc( WORK.CNT_ATEST )
NOTE: PROCEDURE SQL used (Total process time):
real time 0.07 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k


608 %put SqlVList = &SqlVlist.;
SqlVList = count(distinct norm1),count(distinct norm2),count(distinct norm3),count(distinct
norm4),count(distinct norm5),count(distinct norm6),count(distinct norm7),count(distinct
norm8),count(distinct norm9),count(distinct norm10)
609
610
611 /* Use the macro for the code. Use feedback to confirm code generated. */
612 proc sql _method feedback stimer buffersize=1024000;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k

613 select &SqlVList.
614 from ATest;
NOTE: Statement transforms to:

select COUNT(distinct ATEST.norm1), COUNT(distinct ATEST.norm2), COUNT(distinct
ATEST.norm3), COUNT(distinct ATEST.norm4), COUNT(distinct ATEST.norm5), COUNT(distinct
ATEST.norm6), COUNT(distinct ATEST.norm7), COUNT(distinct ATEST.norm8), COUNT(distinct
ATEST.norm9), COUNT(distinct ATEST.norm10)
from WORK.ATEST;


NOTE: SQL execution methods chosen are:

sqxslct
sqxsumn
sqxsrc( WORK.ATEST )
NOTE: PROCEDURE SQL used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k


616
617
618 /* Getting the number of distinct entries using a hash. */
619 /* Method for two variables first. */
620 options nocenter msglevel=i fullstimer source source2 compress=yes;
621 data _null_;
622 set Atest(obs=1 keep=norm1 norm2);
623
624 declare hash hh1();
625 rc = hh1.DefineKey("norm1");
626 rc = hh1.DefineData("norm1");
627 rc = hh1.DefineDone();
628
629 declare hash hh2();
630 rc = hh2.DefineKey("norm2");
631 rc = hh2.DefineData("norm2");
632 rc = hh2.DefineDone();
633
634 do until(eof);
635 set Atest(obs=max keep=norm1 norm2) end=eof;
636 rc1 = hh1.replace();
637 rc2 = hh2.replace();
638 end;
639 count1 = hh1.Num_Items;
640 count2 = hh2.Num_Items;
641 put count1= count2=;
642 run;

count1=998 count2=992
NOTE: There were 1 observations read from the data set WORK.ATEST.
NOTE: There were 1000 observations read from the data set WORK.ATEST.
NOTE: DATA statement used (Total process time):
real time 2.92 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 513k

We've got output from the hash. There's a difference of 1 between the two sets of counts because SQL's COUNT(DISTINCT) ignores missing values, while the hash keeps the missing value as one of its keys.
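To reconcile the two, one could add the missing value back in (a sketch;
max(missing(...)) is 1 exactly when the column contains a missing value):

proc sql;
  select count(distinct norm1) + max(missing(norm1)) as n_incl_missing
  from ATest;
quit;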


643
644 /* We then extend the method to multiple vars. This involves using the vname()
645 * function. Use SQL again to get the list of vars.
646 */
647 proc sql _method stimer buffersize=1024000 noprint;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k

648 select Variable into :VList separated by " "
649 from cnt_Atest;

NOTE: SQL execution methods chosen are:

sqxslct
sqxfil
sqxsrc( WORK.CNT_ATEST )
NOTE: SQL Statement used (Total process time):
real time 0.65 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 97k

650 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k

We now extend the hash method to more than two vars, in fact all the vars in the dataset:


651 %put VList = &Vlist.;
VList = norm1 norm2 norm3 norm4 norm5 norm6 norm7 norm8 norm9 norm10
652
653 %macro GetDis;
654 %let NL = 0;
655 %do %until (%scan(&vlist.,&NL.) eq );
656 %let NL = %eval(&NL. + 1);
657 %end;
658 %let NL = %eval(&NL. - 1);
659 %put NL = &NL.;
660 data DisGet(keep=Name Distinct);
661 length Name $30;
662 set Atest(obs=1 keep=&vlist.);
663 array av{&NL.} $30 vname1-vname&NL.;
664 array v{&NL.} $30 &vlist.;
665 do i = 1 to &NL.;
666 av{i} = vname(v{i});
667 end;
668 %do i = 1 %to &NL.;
669 put vname&i.= @;
670 declare hash hh&i.();
671 rc = hh&i..DefineKey(vname&i.);
672 rc = hh&i..DefineData(vname&i.);
673 rc = hh&i..DefineDone();
674 %end;
675 do until(eof);
676 set Atest(obs=max keep=&vlist.) end=eof;
677 %do i = 1 %to &NL.;
678 rc&i. = hh&i..replace();
679 %end;
680 end;
681 %do i = 1 %to &NL.;
682 Name = vname&i.;
683 Distinct = hh&i..Num_Items;
684 output;
685 %end;
686 run;
687 %mend;
688 %GetDis;
NL = 10

vname1=norm1 vname2=norm2 vname3=norm3 vname4=norm4 vname5=norm5 vname6=norm6 vname7=norm7
vname8=norm8 vname9=norm9 vname10=norm10
NOTE: There were 1 observations read from the data set WORK.ATEST.
NOTE: There were 1000 observations read from the data set WORK.ATEST.
NOTE: The data set WORK.DISGET has 10 observations and 2 variables.
NOTE: Compressing data set WORK.DISGET increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
real time 1.65 seconds
user cpu time 0.00 seconds
system cpu time 0.03 seconds
Memory 777k


As one can see, the hash method scales better than the SQL way: at this toy size the log still favors SQL, but the hash pulls ahead as the data grow, and the gap widens with size. Of course, as pointed out above, memory considerations restrict scalability. We then use the multi-pass method.
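In outline, the multi-pass wrapper looks something like this (a sketch; it
assumes a hypothetical variant of %GetDis that takes the list of variables
to count as a parameter, which the version above does not yet do):

%macro GetDisMulti(passes=2);
  %local NL per p i start stop slice;
  /* count the words in &vlist., as in %GetDis */
  %let NL = 0;
  %do %until (%scan(&vlist., &NL.) eq );
    %let NL = %eval(&NL. + 1);
  %end;
  %let NL = %eval(&NL. - 1);
  /* integer ceiling of NL/passes */
  %let per = %eval((&NL. + &passes. - 1) / &passes.);
  %do p = 1 %to &passes.;
    %let start = %eval((&p. - 1) * &per. + 1);
    %let stop = %eval(&p. * &per.);
    %if &stop. > &NL. %then %let stop = &NL.;
    %let slice = ;
    %do i = &start. %to &stop.;
      %let slice = &slice. %scan(&vlist., &i.);
    %end;
    %put Pass &p. counts: &slice.;
    /* %GetDis(vlist=&slice., out=DisGet&p.) -- the assumed variant */
  %end;
%mend GetDisMulti;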

HTH and best,
-Tir


Tirthankar Patnaik
India Analytics Center
Citibank, N. A.
+91-44-5228 6385
+91-98410 69545
Ian Whitlock
2005-10-08 20:34:59 UTC
Tim,

I used your data and one hash table each for the numeric and the character
variables, based on the code that Pudding Man <***@GMAIL.COM>
suggested you look up in the recent SAS-L archives. The macro also counts
missing values, since the code was originally written to solve such a
problem and I didn't want to lose that ability. Sub-macros capture the code
that would otherwise be duplicated for numeric and character variables. For
the times shown I set the internal parameter EFF=1, which stops counting a
variable once the critical count is reached; to get complete counts, set it
to 0.

Here is part of the log.

141 options nomprint ;
142 %makedata (n=100000,t=NC)
NOTE: The data set WORK.BIG has 100000 observations and 3001 variables.
NOTE: DATA statement used (Total process time):
real time 5:53.31
cpu time 1:56.20
143 %getsizes (out=sizes , critpct=10)
NOTE: There were 1 observations read from the data set WORK.BIG.
NOTE: DATA statement used (Total process time):
real time 1.74 seconds
cpu time 0.06 seconds

NOTE: There were 100000 observations read from the data set WORK.BIG.
NOTE: The data set WORK.SIZES has 3001 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 7:09.65
cpu time 5:52.29

Six minutes to make the data and seven to process it - looks reasonable to
me. Here are the macros.

%macro makedata (n=100000,t=NC);
data work.big
(keep= %if %index(&t,N) %then nvars1-nvars2900 ;
%if %index(&t,C) %then cvars1-cvars101 ;
);
array nvars (2900) ;
array cvars (101) $32 ;
array lim (2801:2900) ;
retain lim2801 - lim2900 1;
do j=2801 to 2900 ;
lim(j)=ceil(100000*ranuni(2)) ;
end ;
do m=1 to &n ; /* number of observations in dataset */
do j=1 to 2800 ;
nvars(j)=int(100*ranuni(1)) ;
end ;
do j=2801 to 2900 ;
nvars(j)=int(lim(j)*ranuni(3)) ;
end ;
do k=1 to 101 ;
cvars(k)=put(nvars(K+2799),words32.) ;
end ;
output ;
end ;
run ;
%mend makedata ;

%macro getsizes ( data = &syslast
, out = sizes
, critpct = 10 ) ;
%local nvc nvn critcnt ;
%let data = &data ;
data _null_ ;
set &data nobs=nobs ;
call symputx ( "critcnt" , nobs * &critpct / 100 ) ;
array __xn (*) _numeric_ ;
call symputx ( "nvn" , dim(__xn) ) ;
array __xc (*) _character_ ;
call symputx ( "nvc" , dim(__xc) ) ;
stop ;
run ;

data &out ( keep = var ndistinct misscnt role) ;
goto main ; %* buffer must be at top of step ;
read: set &data end = eof ; return ;
main:
%if &nvn > 0 %then %__gszsetup ( t = n ) ;
%if &nvc > 0 %then %__gszsetup ( t = c ) ;
length var $ 32 role $10 ndistinct misscnt 8 ;
do until ( eof ) ;
link read ;
%if &nvn > 0 %then %__gszloop ( t = n ) ;
%if &nvc > 0 %then %__gszloop ( t = c ) ;
end ;
%if &nvn > 0 %then %__gszeof ( t = n ) ;
%if &nvc > 0 %then %__gszeof ( t = c ) ;
***remn.output(dataset: "work.lookn") ;
***remc.output(dataset: "work.lookc") ;
run ;
%mend getsizes ;

%macro __gszsetup ( t = n /* or c */ ) ;
%let t = %upcase (&t) ;
array __x&t (&&nv&t )
%if &t = C %then _character_ ;
%else _numeric_ ;
;
array __cts&t (&&nv&t) _temporary_ ;
array __mct&t (&&nv&t) _temporary_ ;
%if &t = C %then
%str(length __valC $200 ;) ;
if _n_ = 1 then
do ;
declare hash rem&t(hashexp:16) ;
rc = rem&t..defineKey("__ix", "__val&t" ) ;
rc = rem&t..defineData ( "__ix", "__val&t" ) ;
call missing ( __ix, __val&t ) ;
rc = rem&t..defineDone() ;
end ;
%mend __gszsetup ;

%macro __gszloop ( t = n /* or c */
, eff = 1
/* 1 counts stop at critical pt
0 no stop
*/
) ;
%let t = %upcase(&t) ;
do __ix = 1 to dim ( __x&t ) ;
__val&t = __x&t[__ix] ;
if missing ( __val&t ) then __mct&t[__ix] + 1 ;
else
%if &eff %then
%do ;
if __cts&t[__ix] >= &critcnt then ;
else
%end ;
do ;
__rc = rem&t..add() ;
if __rc = 0 then __cts&t[__ix] + 1 ;
end ;
end ;
%mend __gszloop ;

%macro __gszeof ( t = n /* or c */ ) ;
%let t = %upcase(&t) ;
do __ix = 1 to dim ( __x&t ) ;
var = vname ( __x&t[__ix] ) ;
ndistinct = __cts&t[__ix] ;
if ndistinct >= &critcnt then role = "Continuous" ;
else role = "Category" ;
misscnt = __mct&t[__ix] ;
output ;
end ;
%mend __gszeof ;

Your macro contains several bugs. In at least one place the name
SASUSER.BIG is used where WORK.BIG is meant. You have the expression

ceil((nobs - delobs)* %eval(&critval/100))

Since %EVAL does integer arithmetic, %EVAL(&critval/100) is 0 for any
CRITVAL from 1 to 99, so CRIT always comes out 0. I assumed this was a
mistake.
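The repair is to drop %EVAL and let SQL do the arithmetic in floating
point (a sketch of the one-line fix):

select ceil((nobs - delobs) * &critval / 100) into :crit
from dictionary.tables where libname=upcase("&lib") and
memname=upcase("&dsname") ;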

Ian Whitlock
Dennis Diskin
2005-10-10 11:42:19 UTC
Phil,
I have no idea. There were no "3D"s in my post. Maybe an artifact of your
mail/newsgroup program?
Regards
Dennis Diskin
Howard Schreier <hs AT dc-sug DOT org>
2005-10-10 11:47:36 UTC
Specifically, an artifact of "quoted-printable" encoding: the equals sign
is hex 3D, so a literal "=" comes through as "=3D" whenever one program
encodes the text and a later one fails to decode it.

http://en.wikipedia.org/wiki/Quoted-printable
David L Cassell
2005-10-10 18:53:27 UTC
It's an artifact of the translation tables used by the mail readers, in
conjunction with any interesting features attached by the original Mail
User Agent. I get it a lot on this system.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

Peter
2005-10-12 09:14:52 UTC
May I return to the original post? A hint of something else is there.

"Basically, these folks want to 'guess' whether their variables are
categorical or continuous"

I think that question can be answered without "counting all unique
values".

For a similar investigation (one that needed to run on SAS 8), I used proc
means to collect percentiles for all numeric vars. It seemed reasonable
that when a numeric var is categorical, it will show up fairly clearly
across the percentiles - probably as integers. However, many vars are
character, so for those I also ran a datastep with a very old-fashioned
array process: count occurrences of only the first so many unique values
(I used 20) - if there are more distinct values than that, it is more of an
address-type or name-id-type variable.
The approach isn't foolproof, but in three passes (two in proc means,
one in the datastep), this quick analysis reveals appropriate candidates
for proper categorical analysis.

Rather than post my 150 lines of code, I invite challenges to the
principle.
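In outline, though, the idea is something like this (a sketch of the
principle, not my actual program; cvars1 is a name from Tim's test data,
the character step is shown for a single variable, and the AUTONAME option
assumes SAS 9):

/* numeric side: percentiles for every numeric var in one pass */
proc means data=work.big noprint;
  var _numeric_;
  output out=pcts p1= p25= p50= p75= p99= / autoname;
run;
/* integer-valued, narrow-range percentiles suggest a categorical var */

/* character side: remember only the first 20 distinct values */
data _null_;
  set work.big end=eof;
  array seenval {20} $32 _temporary_;
  retain nseen 0 toomany 0;
  found = 0;
  do i = 1 to nseen;
    if cvars1 = seenval{i} then found = 1;
  end;
  if not found then do;
    if nseen < 20 then do;
      nseen + 1;
      seenval{nseen} = cvars1;
    end;
    else toomany = 1; /* over 20 distinct: likely an id-type variable */
  end;
  if eof then put nseen= toomany=;
run;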

Regards
Peter
(currently available for contract in UK)
TimBraam
2005-10-12 11:50:41 UTC
My thanks to everyone who responded, particularly Ian Whitlock who
provided exactly the solution I was seeking. I assumed there must be a
hash table approach, but I wasn't making any progress with my attempts
at coding one. When I searched the archives I came up with solutions
only involving a single variable - and proc sql does fine for that, but
that solution suffers badly with a large number of variables. And, as
the intent here is to provide a 'utility' program that will run on
datasets with potentially large numbers of both observations and
variables, a different approach was required. The hash table solution
provided by Ian seems to work as well as anything can be expected to
work with this rather goofy problem.

I say goofy because one would hope that people would know their data
well enough to not have to use such an approach to classify their
variables. To some extent some of these users know their data too well
- and have thus decided against creating any formal documentation (data
dictionary?) to go along with their data. Works fine as long as only
the one savvy group has access; the issue at hand is to provide a
mechanism to enable sharing the data among a broader audience.

We have categorical variables (e.g. HS codes used to classify
imports/exports) that can have tens of thousands of distinct values.

Thanks again,

Tim Braam
US Census Bureau