-----Original Message-----
Fehd, Ronald J
Sent: Saturday, October 08, 2005 12:34 AM
Subject: Re: get count of unique values for all variables in a dataset
This code looks good from where I sit.
What's the problem with taking 15 minutes to crunch the numbers?
You're only going to run this once on a data set, right?
I have developed a similar routine that does a freq of each variable
and shows the high and low N (default=5) values.
You use SQL COUNT; I use PROC FREQ.
The numbers are the same,
and I suspect the SQL will be faster,
since it doesn't have to calculate percents as PROC FREQ does.
You're having to do a pass of the data set
for every variable; there's no way around that.
I've been working with this problem for some time now. The issue is the same: get the distinct values of all the variables in a dataset, to decide whether a variable should be classified as categorical.
Of course, you can use Roy Pardee's solution of using SQL to get the distinct entries. It's just that you'll need a pass through the dataset for each variable, as Ron Fehd has pointed out. I use this method as a benchmark.
As it turns out, you don't have to do a pass of the dataset for each variable: you can get all the unique values of every variable with _one_ pass (or as many passes as you like) of the dataset. The trick is to use a hash for each variable. The catch is that you then need enough RAM to hold all the unique values of all the variables at once. With the file sizes I typically work with (<2GB), I can get by with the gig of RAM on my ThinkPad.
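Outside SAS, the one-pass trick is simply "keep one hash (set) of seen values per variable." A minimal Python sketch of the idea, using made-up column names and data:

```python
import csv
from io import StringIO

# One pass over the rows: maintain one set of seen values per column.
# Memory use grows with the total number of distinct values across all
# columns, which is exactly the limitation discussed above.
data = StringIO("""a,b
1,x
2,x
1,y
""")

reader = csv.DictReader(data)
seen = {name: set() for name in reader.fieldnames}
for row in reader:                 # a single pass over the data
    for name, value in row.items():
        seen[name].add(value)

distinct = {name: len(values) for name, values in seen.items()}
print(distinct)                    # {'a': 2, 'b': 2}
```

The per-column sets play the role of the per-variable hash objects in the SAS code below.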
For large datasets this is not practical, so instead I fix the number of passes I'm willing to make to collect all the unique values of all the variables.
Consider:
575 /* ------------------------------------------------------------------------- */
576 options nomprint;
577 /* Create the dataset, with missing values too. */
578 %let size=10;
579 %let N=1000;
580
581 data ATest(compress=binary);
582 array norm norm1-norm&size.;
583 do i = 1 to &n.;
584 do j = 1 to &size.;
585 norm{j} = ceil(1e5*ranuni(i));
586 if norm{j} <= j*5e2 then norm{j} = .;
587 end;
588 output;
589 end;
590 drop j i;
591 run;
NOTE: The data set WORK.ATEST has 1000 observations and 10 variables.
NOTE: Compressing data set WORK.ATEST increased size by 18.18 percent.
Compressed is 13 pages; un-compressed would require 11 pages.
NOTE: DATA statement used (Total process time):
real time 1.37 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
Memory 154k
592
593 /* Get the list of vars in the ds. */
594
595 ods listing close;
596 ods output Variables=cnt_Atest;
597 proc contents data=Atest;
598 run;
NOTE: The data set WORK.CNT_ATEST has 10 observations and 6 variables.
599 ods listing;
600
601 /* Use the SQL method to get the number of distinct entries. */
602 /* Generate the code, and put in a macro. */
603 proc sql _method stimer buffersize=1024000 noprint;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k
604 select "count(distinct "||left(trim(Variable))||")" into :SqlVList
605 separated by ","
606 from cnt_Atest;
NOTE: SQL execution methods chosen are:
sqxslct
sqxfil
sqxsrc( WORK.CNT_ATEST )
NOTE: PROCEDURE SQL used (Total process time):
real time 0.07 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k
608 %put SqlVList = &SqlVlist.;
SqlVList = count(distinct norm1),count(distinct norm2),count(distinct norm3),count(distinct
norm4),count(distinct norm5),count(distinct norm6),count(distinct norm7),count(distinct
norm8),count(distinct norm9),count(distinct norm10)
609
610
611 /* Use the macro for the code. Use feedback to confirm code generated. */
612 proc sql _method feedback stimer buffersize=1024000;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k
613 select &SqlVList.
614 from ATest;
NOTE: Statement transforms to:
select COUNT(distinct ATEST.norm1), COUNT(distinct ATEST.norm2), COUNT(distinct
ATEST.norm3), COUNT(distinct ATEST.norm4), COUNT(distinct ATEST.norm5), COUNT(distinct
ATEST.norm6), COUNT(distinct ATEST.norm7), COUNT(distinct ATEST.norm8), COUNT(distinct
ATEST.norm9), COUNT(distinct ATEST.norm10)
from WORK.ATEST;
NOTE: SQL execution methods chosen are:
sqxslct
sqxsumn
sqxsrc( WORK.ATEST )
NOTE: PROCEDURE SQL used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k
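The same SQL approach works in any engine: read the column names from the catalog (the analogue of the PROC CONTENTS output above), then generate one COUNT(DISTINCT col) per column in a single SELECT. A sqlite3 sketch with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE atest (norm1 INTEGER, norm2 INTEGER)")
con.executemany("INSERT INTO atest VALUES (?, ?)",
                [(1, 10), (2, 10), (1, 20), (None, 30)])

# PRAGMA table_info is sqlite's catalog; row[1] is the column name.
cols = [row[1] for row in con.execute("PRAGMA table_info(atest)")]
select_list = ", ".join(f"COUNT(DISTINCT {c})" for c in cols)
counts = con.execute(f"SELECT {select_list} FROM atest").fetchone()
result = dict(zip(cols, counts))
print(result)                      # {'norm1': 2, 'norm2': 3}
```

Note that COUNT(DISTINCT) skips NULLs, so norm1's NULL row is not counted; this matters in the comparison with the hash method further down.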
616
617
618 /* Getting the number of distinct entries using a hash. */
619 /* Method for two variables first. */
620 options nocenter msglevel=i fullstimer source source2 compress=yes;
621 data _null_;
622 set Atest(obs=1 keep=norm1 norm2);
623
624 declare hash hh1();
625 rc = hh1.DefineKey("norm1");
626 rc = hh1.DefineData("norm1");
627 rc = hh1.DefineDone();
628
629 declare hash hh2();
630 rc = hh2.DefineKey("norm2");
631 rc = hh2.DefineData("norm2");
632 rc = hh2.DefineDone();
633
634 do until(eof);
635 set Atest(obs=max keep=norm1 norm2) end=eof;
636 rc1 = hh1.replace();
637 rc2 = hh2.replace();
638 end;
639 count1 = hh1.Num_Items;
640 count2 = hh2.Num_Items;
641 put count1= count2=;
642 run;
count1=998 count2=992
NOTE: There were 1 observations read from the data set WORK.ATEST.
NOTE: There were 1000 observations read from the data set WORK.ATEST.
NOTE: DATA statement used (Total process time):
real time 2.92 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 513k
We've got output from the hash. There's a difference of 1 between the two methods' counts because the hash stores the missing value as an ordinary key, while SQL's COUNT(DISTINCT) ignores missing values.
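The off-by-one is easy to reproduce outside SAS: SQL's COUNT(DISTINCT) skips NULLs entirely, while a hash keyed on the variable stores the missing value as just another key. A small sqlite3 illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# SQL: COUNT(DISTINCT x) ignores the NULL row entirely.
sql_count = con.execute("SELECT COUNT(DISTINCT x) FROM t").fetchone()[0]

# Hash/set: None (the analogue of a SAS missing value) is an ordinary key.
hash_count = len({row[0] for row in con.execute("SELECT x FROM t")})

print(sql_count, hash_count)       # 2 3
```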
643
644 /* We then extend the method to multiple vars. This involves using the vname()
645 * function. Use SQL again to get the list of vars.
646 */
647 proc sql _method stimer buffersize=1024000 noprint;
NOTE: SQL Statement used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 15k
648 select Variable into :VList separated by " "
649 from cnt_Atest;
NOTE: SQL execution methods chosen are:
sqxslct
sqxfil
sqxsrc( WORK.CNT_ATEST )
NOTE: SQL Statement used (Total process time):
real time 0.65 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 97k
650 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.08 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2k
We now extend the hash method to more than two vars, in fact all the vars in the dataset:
651 %put VList = &Vlist.;
VList = norm1 norm2 norm3 norm4 norm5 norm6 norm7 norm8 norm9 norm10
652
653 %macro GetDis;
654 %let NL = 0;
655 %do %until (%scan(&vlist.,&NL.) eq );
656 %let NL = %eval(&NL. + 1);
657 %end;
658 %let NL = %eval(&NL. - 1);
659 %put NL = &NL.;
660 data DisGet(keep=Name Distinct);
661 length Name $30;
662 set Atest(obs=1 keep=&vlist.);
663 array av{&NL.} $30 vname1-vname&NL.;
664 array v{&NL.} $30 &vlist.;
665 do i = 1 to &NL.;
666 av{i} = vname(v{i});
667 end;
668 %do i = 1 %to &NL.;
669 put vname&i.= @;
670 declare hash hh&i.();
671 rc = hh&i..DefineKey(vname&i.);
672 rc = hh&i..DefineData(vname&i.);
673 rc = hh&i..DefineDone();
674 %end;
675 do until(eof);
676 set Atest(obs=max keep=&vlist.) end=eof;
677 %do i = 1 %to &NL.;
678 rc&i. = hh&i..replace();
679 %end;
680 end;
681 %do i = 1 %to &NL.;
682 Name = vname&i.;
683 Distinct = hh&i..Num_Items;
684 output;
685 %end;
686 run;
687 %mend;
688 %GetDis;
NL = 10
vname1=norm1 vname2=norm2 vname3=norm3 vname4=norm4 vname5=norm5 vname6=norm6 vname7=norm7
vname8=norm8 vname9=norm9 vname10=norm10
NOTE: There were 1 observations read from the data set WORK.ATEST.
NOTE: There were 1000 observations read from the data set WORK.ATEST.
NOTE: The data set WORK.DISGET has 10 observations and 2 variables.
NOTE: Compressing data set WORK.DISGET increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
real time 1.65 seconds
user cpu time 0.00 seconds
system cpu time 0.03 seconds
Memory 777k
As one can see, the hash method is faster than the SQL way, and the gap between the two widens as the dataset grows. Of course, as pointed out above, memory considerations restrict scalability; for larger datasets we fall back to the multi-pass method.
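The multi-pass fallback mentioned above is straightforward: split the variable list into batches small enough to fit in memory, and make one full pass of the data per batch. A Python sketch (batch size, column names, and data are made up):

```python
# Multi-pass distinct counting: one full pass per batch of columns,
# so peak memory is bounded by the distinct values of a single batch.
rows = [(1, "x", 5.0), (2, "x", 5.0), (1, "y", 6.0)]
columns = ["a", "b", "c"]
batch_size = 2                     # tune to available RAM

distinct = {}
for start in range(0, len(columns), batch_size):
    batch = range(start, min(start + batch_size, len(columns)))
    seen = {i: set() for i in batch}
    for row in rows:               # one pass of the data per batch
        for i in batch:
            seen[i].add(row[i])
    for i in batch:
        distinct[columns[i]] = len(seen[i])

print(distinct)                    # {'a': 2, 'b': 2, 'c': 2}
```

With batch_size equal to the number of variables this degenerates to the one-pass hash method; with batch_size 1 it degenerates to one pass per variable, like the SQL benchmark.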
HTH and best,
-Tir
Tirthankar Patnaik
India Analytics Center
Citibank, N. A.
+91-44-5228 6385
+91-98410 69545
-----Original Message-----
From: TimBraam
I've been asked to write a macro that will provide counts of the number of unique values of each variable in a dataset.
I've written something very brute force - using the dictionary tables to get the names of all the variables in the dataset, then using PROC FREQ and a macro loop to generate a table statement for each variable (code below). This approach does okay on a medium-size dataset, but when I ran it with 100,000 observations and 3000 variables, it took over 15 minutes. Can anyone think of a better approach? Using SQL with count and distinct was no better than FREQ.
Basically, these folks want to 'guess' whether their variables are categorical or continuous based on the number of distinct values as a proportion of the number of observations in the data set, which should explain the bulk of my so-far-developed code below.
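The guessing rule described here - flag a variable as categorical when its distinct count falls below a fraction of the observation count - is a one-liner once the counts exist. A sketch in Python, with hypothetical counts and the same 10% threshold as critval=10 below:

```python
def classify(distinct_counts, nobs, critval=10):
    """Label each variable Category/Continuous by its distinct-value ratio."""
    crit = -(-nobs * critval // 100)   # ceil(nobs * critval / 100)
    return {name: ("Category" if n < crit else "Continuous")
            for name, n in distinct_counts.items()}

counts = {"sex": 2, "income": 950}     # hypothetical distinct counts
print(classify(counts, nobs=1000))
# {'sex': 'Category', 'income': 'Continuous'}
```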
Thanks,
Tim Braam
US Census Bureau
%macro makedata ;
data work.big (keep=nvars1-nvars2900 cvars1-cvars101);
array nvars (2900) ;
array cvars (101) $32 ;
array lim (2801:2900) ;
retain lim2801 - lim2900 1;
do j=2801 to 2900 ;
lim(j)=ceil(100000*ranuni(2)) ;
end ;
do m=1 to 100000 ; /* number of observations in dataset */
do j=1 to 2800 ;
nvars(j)=int(100*ranuni(1)) ;
end ;
do j=2801 to 2900 ;
nvars(j)=int(lim(j)*ranuni(3)) ;
end ;
do k=1 to 101 ;
cvars(k)=put(nvars(K+2799),words32.) ;
end ;
output ;
end ;
run ;
proc sql noprint ;
create index cvars2 on work.big(cvars2) ;
create index mc on work.big (cvars3, cvars4) ;
quit ;
%mend ;
/* run this only once */
/*%makedata ; */
%macro getsizes(dsn=sashelp.class,critval=10) ;
options nomprint ;
%let lib=%scan(&dsn,1,'.') ;
%let dsname=%scan(&dsn,2,'.') ;
proc sql noprint ;
select ceil((nobs - delobs) * &critval / 100) into :crit
from dictionary.tables where libname=upcase("&lib") and
memname=upcase("&dsname") ;
select name into :vname1 - :vname32000
from dictionary.columns where libname=upcase("&lib") and
memname=upcase("&dsname") ;
%let nvars=%eval(&sqlobs) ;
quit ;
proc freq data=&dsn;
%do j=1 %to &nvars ;
table &&vname&j /noprint nopercent nocum nocol nofreq norow out
=&&vname&j.._c(keep=&&vname&j);
%end ;
run;
data counts;
length vname $32 count 8 vrole $10;
stop;
run ;
proc sql noprint ;
%do j=1 %to &nvars ;
insert into counts select "&&vname&j" as vname, count(*) as
count, ' '
as vrole from &&vname&j.._c;
drop table &&vname&j.._c;
%end ;
update counts set vrole= case when count < &crit then
'Category' else
'Continuous' end ;
quit ;
%mend ;
%getsizes(dsn=work.big,critval=10) ;