Discussion:
Deleting SAS Data from a SAS DATASET
(too old to reply)
SUBSCRIBE SAS-L Chandra Gadde
2008-08-15 17:14:22 UTC
Permalink
Hi All

I have several SAS datasets that are very very big. (50GB of size). Every
month, the data is being appended to these datasets. I need to deleted the
data which is greater than 24 months. What is the best method to do this?
Please help me.

I tried PROC SQS and DATA STEP. But these two are taking very long time.

Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;

proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Patrick
2008-08-15 19:04:39 UTC
Permalink
Hi
You could use a 'modify' statement (data step). This statement does
the changes in place and needs therefore only have of the space, BUT:
there is a risk of corrupting your data; look it up in the manual.

I also think that some re-designing could solve some problems and
improve performance. I assume you're loading kind of a data mart with
a history of 24 months.

What you could do (just an idea):
- Create a SAS file per month (eg: prod.master_date&yyyymm)
- Create a SAS view for the most recent 24 SAS files (prod.master).
- Create an index over the date field.

Advantages:
- Loading of new data will be much faster (create new file with
current month, re-create view, delete SAS file with data older 24
months).
- As the SAS view has the same name like previously the SAS data file
your users won't even get that something changed (assuming they only
read the data).
- Queries using the index will perform much faster.
- SAS powerusers can also be told the new data organisation and they
can query the monthly files directly
- Smaller files are also much better to handle for backup & recovery
processes.

HTH
Patrick
Mary
2008-08-15 18:17:25 UTC
Permalink
One thing you might do is to add an index on the snap_dt to the dataset; =
if that's there then you should be able to delete the records in place:

proc sql;
delete from prod.master_date;
where snap_dt =3D "&end_dt"d;
quit;
run;

In both the ways you are trying now you are creating new data sets =
rather than deleting records from the current data set; it would seem to =
me that a SQL delete statement would be faster than creating new =
datasets even if there isn't an index on the date.=20

-Mary
----- Original Message -----=20
From: SUBSCRIBE SAS-L Chandra Gadde=20
To: SAS-***@LISTSERV.UGA.EDU=20
Sent: Friday, August 15, 2008 12:14 PM
Subject: Deleting SAS Data from a SAS DATASET


Hi All

I have several SAS datasets that are very very big. (50GB of size). =
Every
month, the data is being appended to these datasets. I need to deleted =
the
data which is greater than 24 months. What is the best method to do =
this?
Please help me.

I tried PROC SQS and DATA STEP. But these two are taking very long =
time.

Data prod.Master_data;
set prod.master_date;
if snap_dt =3D "&end_dt"d then delete;
run;

proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Patrick
2008-08-15 19:18:37 UTC
Permalink
Hi
You've got good ideas how to delete the old observations. My modify
proposal was wrong. Sorry about that!
I still think that changing the data model would be a good idea.
Regards
Patrick
Lou
2008-08-16 00:30:28 UTC
Permalink
Post by Mary
One thing you might do is to add an index on the snap_dt to the dataset; =
proc sql;
delete from prod.master_date;
where snap_dt =3D "&end_dt"d;
quit;
run;
Your SQL code will work whether or not there's an index on the dataset. The
problem however, is that the observations aren't physically deleted, they're
just marked for deletion. The size of the dataset on disk remains the same,
and the next time you process it, it'll take just as much time (more or
less) to read.

And trying it on a very simple, one-variable dataset with 100,000
observations, where half are deleted, the SQL step used .10 CPU seconds and
a data step recreating the dataset without the deleted observations consumed
.06 CPU seconds. Wall time was a couple seconds shorter for the data step
too.

Different datasets on different machines might come up with different
results, but it doesn't look certain to me that deleting in place is faster.
Post by Mary
In both the ways you are trying now you are creating new data sets =
rather than deleting records from the current data set; it would seem to =
me that a SQL delete statement would be faster than creating new =
datasets even if there isn't an index on the date.=20
-Mary
----- Original Message -----=20
From: SUBSCRIBE SAS-L Chandra Gadde=20
Sent: Friday, August 15, 2008 12:14 PM
Subject: Deleting SAS Data from a SAS DATASET
Hi All
I have several SAS datasets that are very very big. (50GB of size). =
Every
month, the data is being appended to these datasets. I need to deleted =
the
data which is greater than 24 months. What is the best method to do =
this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long =
time.
Data prod.Master_data;
set prod.master_date;
if snap_dt =3D "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Choate,
2008-08-15 18:30:16 UTC
Permalink
Chandra - this isn't my expertise area - I'm sure others may have better
ideas but I'll make three comments.

If your datasets have a large number of variables, in your datastep
method a WHERE clause will be more efficient - the IF statement causes
each record to be fully read before it can be selected for deletion. A
WHERE clause would only read the single variable in the case of records
that meet the criteria for deletion.

Data prod.Master_date;
set prod.master_date;
where snap_dt ne "&end_dt"d;
run;


You probably would be better off changing your data model though - my
first thought would to be to store your data as separate yearly or
monthly files and then access them with a view of the past two years or
24 months. With each month or year you would delete the oldest file and
change the view one time period forward.

*set up the data;
data time1 time2 time3 time4;
do i = 1 to 100;
do t = 1 to 4;
if t=1 then output time1;
if t=2 then output time2;
if t=3 then output time3;
if t=4 then output time4;
end;
end;
run;

*three successive example views;
data filet1 / view=filet1;
set time1 time2;
run;
data filet2 / view=filet2;
set time2 time3;
run;
data filet3 / view=filet3;
set time3 time4;
run;


One more thought - since the data are appended - if the file is not
reordered and you have the record counts from each append - then you can
use firstobs to only read starting with the recent time period. I would
replace the append with a datastep that skips the appended data from 24
months prior and appends the new data in a single pass:

*set up the data;
data history time2 time3 time4;
do i = 1 to 100;
do t = 1 to 4;
if t=1 then output history;
if t=2 then output time2;
if t=3 then output time3;
if t=4 then output time4;
end;
end;
run;

proc append base=history data=time2;
run;

*two example single pass delete/appends;
data history;
do obsnum=101 to last;
set history time3 point=obsnum nobs=last;
output;
end;
stop;
run;

data history;
do obsnum=101 to last;
set history time4 point=obsnum nobs=last;
output;
end;
stop;
run;

You'll need to store prior record counts and then access them for the
obsnum starting position.

hth

Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-***@LISTSERV.UGA.EDU] On Behalf Of
SUBSCRIBE SAS-L Chandra Gadde
Sent: Friday, August 15, 2008 10:14 AM
To: SAS-***@LISTSERV.UGA.EDU
Subject: Deleting SAS Data from a SAS DATASET

Hi All

I have several SAS datasets that are very very big. (50GB of size).
Every
month, the data is being appended to these datasets. I need to deleted
the
data which is greater than 24 months. What is the best method to do
this?
Please help me.

I tried PROC SQS and DATA STEP. But these two are taking very long time.

Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;

proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Jack Hamilton
2008-08-15 19:05:50 UTC
Permalink
The disadvantages of using SQL delete are (1) NOBS= is no longer
accurate, and (2) POINT= may behave unexpectedly.

If those are not issues, then the delete would be faster, with or
without an index, than recreating the data set (especially if the data
set has indexes that would need to be rebuilt).


--
Jack Hamilton
Post by Mary
One thing you might do is to add an index on the snap_dt to the
dataset; if that's there then you should be able to delete the
proc sql;
delete from prod.master_date;
where snap_dt = "&end_dt"d;
quit;
run;
In both the ways you are trying now you are creating new data sets
rather than deleting records from the current data set; it would
seem to me that a SQL delete statement would be faster than creating
new datasets even if there isn't an index on the date.
-Mary
----- Original Message -----
From: SUBSCRIBE SAS-L Chandra Gadde
Sent: Friday, August 15, 2008 12:14 PM
Subject: Deleting SAS Data from a SAS DATASET
Hi All
I have several SAS datasets that are very very big. (50GB of size).
Every
month, the data is being appended to these datasets. I need to
deleted the
data which is greater than 24 months. What is the best method to do
this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long
time.
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
./ ADD NAME=Data _null_,
2008-08-15 20:51:25 UTC
Permalink
Summary: PROC DATASETS; AGE statement. + VIEWs

This won't help you delete data from your very big data set, but you
may find this example interesting.

You say you append data monthly to a big data set then when big gets
too big you need to clean out the old. And that takes a very long
time.

However if you don't physically append but use a view to
append/combine you may find it easier to get rid of the unwanted old
data.

Consider this code. it pushes MonthlyUpdate onto the stack of 24 data
sets and the 24th data set is deleted. Then all the data sets get
renamed to produce a new group of 24. You can see from the notes how
the operation works. The data sets don't have to use a numbered range
M01-M24 I did that for my convenience.

proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;

NOTE: Deleting WORK.M24 (memtype=DATA).
NOTE: Aging the name WORK.M23 to WORK.M24 (memtype=DATA).
NOTE: Aging the name WORK.M22 to WORK.M23 (memtype=DATA).
NOTE: Aging the name WORK.M21 to WORK.M22 (memtype=DATA).
NOTE: Aging the name WORK.M20 to WORK.M21 (memtype=DATA).
NOTE: Aging the name WORK.M19 to WORK.M20 (memtype=DATA).
NOTE: Aging the name WORK.M18 to WORK.M19 (memtype=DATA).
NOTE: Aging the name WORK.M17 to WORK.M18 (memtype=DATA).
NOTE: Aging the name WORK.M16 to WORK.M17 (memtype=DATA).
NOTE: Aging the name WORK.M15 to WORK.M16 (memtype=DATA).
NOTE: Aging the name WORK.M14 to WORK.M15 (memtype=DATA).
NOTE: Aging the name WORK.M13 to WORK.M14 (memtype=DATA).
NOTE: Aging the name WORK.M12 to WORK.M13 (memtype=DATA).
NOTE: Aging the name WORK.M11 to WORK.M12 (memtype=DATA).
NOTE: Aging the name WORK.M10 to WORK.M11 (memtype=DATA).
NOTE: Aging the name WORK.M09 to WORK.M10 (memtype=DATA).
NOTE: Aging the name WORK.M08 to WORK.M09 (memtype=DATA).
NOTE: Aging the name WORK.M07 to WORK.M08 (memtype=DATA).
NOTE: Aging the name WORK.M06 to WORK.M07 (memtype=DATA).
NOTE: Aging the name WORK.M05 to WORK.M06 (memtype=DATA).
NOTE: Aging the name WORK.M04 to WORK.M05 (memtype=DATA).
NOTE: Aging the name WORK.M03 to WORK.M04 (memtype=DATA).
NOTE: Aging the name WORK.M02 to WORK.M03 (memtype=DATA).
NOTE: Aging the name WORK.M01 to WORK.M02 (memtype=DATA).
NOTE: Aging the name WORK.MONTHLYUPDATE to WORK.M01 (memtype=DATA).


Here is a complete example. Somewhat contrived but I think is covers
most of the relevant points.

/* create example data base 2 years of monthly data */
filename makedata temp;
data _null_;
file makedata;
do m = 24 to 1 by -1;
month = intnx('MONTH',today(),-m,'B');
format month date9.;
put 'data m' m z2. ';';
put +3 'month="' month +(-1)'"d;';
put +3 'do day = month to intnx("month",month,0,"E");';
put +6 'output;';
put +6 'end;';
put +3 'Format month monyy. day date.;';
put +3 'run;';
end;
run;
%inc makedata / source2;

/* macro var to make long list of member easier to manage */
%let members = ;
proc sql noprint;
select memname into :members separated by ' '
from dictionary.members
where libname eq 'WORK' and memname eqt 'M';
quit;
run;
%put NOTE: MEMBERS=&members;

/* a 2 year view */
data view24 / view=view24;
set &members open=defer;
run;

/* New data for montly update */
data MonthlyUpdate;
month = intnx('MONTH',today(),0,'B');
do day = month to intnx("Month",month,0,"E");
output;
end;
format month monyy.;
run;

proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;

/* use the view to access the past 24 months of data */
proc print data=view24;
run;
Post by SUBSCRIBE SAS-L Chandra Gadde
Hi All
I have several SAS datasets that are very very big. (50GB of size). Every
month, the data is being appended to these datasets. I need to deleted the
data which is greater than 24 months. What is the best method to do this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long time.
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Mary
2008-08-15 21:40:12 UTC
Permalink
I also read that this could be a problem on machines that have multiple =
processors, such as running DB2; if the data is clustered and each =
processor takes care of one cluster, then you could have queries all =
hitting the same processor instead of distributing the load across all =
processors, thus potentially slowing down queries in peak periods where =
you have multiple queries at the same time, if queries tend to be based =
on the month (such as bank statements). Thus I'd proceed with caution =
on the idea of splitting one table into 24 tables with one view without =
making sure the other consequences of doing so are fully understood.

-Mary
----- Original Message -----=20
From: Mary=20
To: SAS-***@LISTSERV.UGA.EDU=20
Sent: Friday, August 15, 2008 4:09 PM
Subject: Re: Re: Deleting SAS Data from a SAS DATASET


A view helps on deletes, but I wonder how it affects performance of =
querying the data- wouldn't storing the data in 24 different locations =
cause a significant slowdown in perfomance upon querying the data versus =
having it all in one table that is indexed? If this data is queryied a =
lot but only deleted once a month, the time in querying (which probably =
is in peak time) could be much more important than the time in deleting =
(which could be run when the computer is not busy, such as nights or =
weekends). =20

-Mary
Nat Wooding
2008-08-15 21:50:09 UTC
Permalink
I have been watching this thread today and I just now recall an example
that Paul Dorfman gave at a RUG several years ago and no, I don't recall
exactly where or when other than it was NESUG or SESUG in the past 5 years.
Anyway, Paul had a client who was storing something like the past 12 months
of transactions as variables andat the start of each month needed to stick
the just ended month on the end and drop the oldest month. Paul used Peek
to read the last 11 months of each obs as a single chunk of data, append
the current's month, and then used Poke to write the whole new obs as a
unit. The time reduction was dramatic.

It probably would not work here unless you knew a whole lot about the
structure of the data set and it has a very fixed structure but I was so
impressed with the solution that I like to mention it whenever it seems
even remotely useful.

Nat Wooding
Environmental Specialist III
Dominion, Environmental Biology
4111 Castlewood Rd
Richmond, VA 23234
Phone:804-271-5313, Fax: 804-271-2977

CONFIDENTIALITY NOTICE: This electronic message contains
information which may be legally confidential and/or privileged and
does not in any case represent a firm ENERGY COMMODITY bid or offer
relating thereto which binds the sender without an additional
express written confirmation to that effect. The information is
intended solely for the individual or entity named above and access
by anyone else is unauthorized. If you are not the intended
recipient, any disclosure, copying, distribution, or use of the
contents of this information is prohibited and may be unlawful. If
you have received this electronic transmission in error, please
reply immediately to the sender that you have received the message
in error, and delete it. Thank you.
Mary
2008-08-15 21:09:46 UTC
Permalink
A view helps on deletes, but I wonder how it affects performance of =
querying the data- wouldn't storing the data in 24 different locations =
cause a significant slowdown in perfomance upon querying the data versus =
having it all in one table that is indexed? If this data is queryied a =
lot but only deleted once a month, the time in querying (which probably =
is in peak time) could be much more important than the time in deleting =
(which could be run when the computer is not busy, such as nights or =
weekends). =20

-Mary
----- Original Message -----=20
From: ./ ADD NAME=3DData _null_,=20
To: SAS-***@LISTSERV.UGA.EDU=20
Sent: Friday, August 15, 2008 3:51 PM
Subject: Re: Deleting SAS Data from a SAS DATASET


Summary: PROC DATASETS; AGE statement. + VIEWs

This won't help you delete data from your very big data set, but you
may find this example interesting.

You say you append data monthly to a big data set then when big gets
too big you need to clean out the old. And that takes a very long
time.

However if you don't physically append but use a view to
append/combine you may find it easier to get rid of the unwanted old
data.

Consider this code. it pushes MonthlyUpdate onto the stack of 24 data
sets and the 24th data set is deleted. Then all the data sets get
renamed to produce a new group of 24. You can see from the notes how
the operation works. The data sets don't have to use a numbered range
M01-M24 I did that for my convenience.

proc datasets library=3Dwork;
age MonthlyUpdate m01-m24;
run;
quit;

NOTE: Deleting WORK.M24 (memtype=3DDATA).
NOTE: Aging the name WORK.M23 to WORK.M24 (memtype=3DDATA).
NOTE: Aging the name WORK.M22 to WORK.M23 (memtype=3DDATA).
NOTE: Aging the name WORK.M21 to WORK.M22 (memtype=3DDATA).
NOTE: Aging the name WORK.M20 to WORK.M21 (memtype=3DDATA).
NOTE: Aging the name WORK.M19 to WORK.M20 (memtype=3DDATA).
NOTE: Aging the name WORK.M18 to WORK.M19 (memtype=3DDATA).
NOTE: Aging the name WORK.M17 to WORK.M18 (memtype=3DDATA).
NOTE: Aging the name WORK.M16 to WORK.M17 (memtype=3DDATA).
NOTE: Aging the name WORK.M15 to WORK.M16 (memtype=3DDATA).
NOTE: Aging the name WORK.M14 to WORK.M15 (memtype=3DDATA).
NOTE: Aging the name WORK.M13 to WORK.M14 (memtype=3DDATA).
NOTE: Aging the name WORK.M12 to WORK.M13 (memtype=3DDATA).
NOTE: Aging the name WORK.M11 to WORK.M12 (memtype=3DDATA).
NOTE: Aging the name WORK.M10 to WORK.M11 (memtype=3DDATA).
NOTE: Aging the name WORK.M09 to WORK.M10 (memtype=3DDATA).
NOTE: Aging the name WORK.M08 to WORK.M09 (memtype=3DDATA).
NOTE: Aging the name WORK.M07 to WORK.M08 (memtype=3DDATA).
NOTE: Aging the name WORK.M06 to WORK.M07 (memtype=3DDATA).
NOTE: Aging the name WORK.M05 to WORK.M06 (memtype=3DDATA).
NOTE: Aging the name WORK.M04 to WORK.M05 (memtype=3DDATA).
NOTE: Aging the name WORK.M03 to WORK.M04 (memtype=3DDATA).
NOTE: Aging the name WORK.M02 to WORK.M03 (memtype=3DDATA).
NOTE: Aging the name WORK.M01 to WORK.M02 (memtype=3DDATA).
NOTE: Aging the name WORK.MONTHLYUPDATE to WORK.M01 (memtype=3DDATA).


Here is a complete example. Somewhat contrived but I think is covers
most of the relevant points.

/* create example data base 2 years of monthly data */
filename makedata temp;
data _null_;
file makedata;
do m =3D 24 to 1 by -1;
month =3D intnx('MONTH',today(),-m,'B');
format month date9.;
put 'data m' m z2. ';';
put +3 'month=3D"' month +(-1)'"d;';
put +3 'do day =3D month to intnx("month",month,0,"E");';
put +6 'output;';
put +6 'end;';
put +3 'Format month monyy. day date.;';
put +3 'run;';
end;
run;
%inc makedata / source2;

/* macro var to make long list of member easier to manage */
%let members =3D ;
proc sql noprint;
select memname into :members separated by ' '
from dictionary.members
where libname eq 'WORK' and memname eqt 'M';
quit;
run;
%put NOTE: MEMBERS=3D&members;

/* a 2 year view */
data view24 / view=3Dview24;
set &members open=3Ddefer;
run;

/* New data for montly update */
data MonthlyUpdate;
month =3D intnx('MONTH',today(),0,'B');
do day =3D month to intnx("Month",month,0,"E");
output;
end;
format month monyy.;
run;

proc datasets library=3Dwork;
age MonthlyUpdate m01-m24;
run;
quit;

/* use the view to access the past 24 months of data */
proc print data=3Dview24;
run;
Post by SUBSCRIBE SAS-L Chandra Gadde
Hi All
I have several SAS datasets that are very very big. (50GB of size). =
Every
Post by SUBSCRIBE SAS-L Chandra Gadde
month, the data is being appended to these datasets. I need to =
deleted the
Post by SUBSCRIBE SAS-L Chandra Gadde
data which is greater than 24 months. What is the best method to do =
this?
Post by SUBSCRIBE SAS-L Chandra Gadde
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long =
time.
Post by SUBSCRIBE SAS-L Chandra Gadde
Data prod.Master_data;
set prod.master_date;
if snap_dt =3D "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
./ ADD NAME=Data _null_,
2008-08-15 21:59:33 UTC
Permalink
A view helps on deletes, but I wonder how it affects performance of querying the data- wouldn't storing the data in 24 different locations cause a significant slowdown in perfomance upon querying the data versus having it all in one table that is indexed? If this data is queryied a lot but only deleted once a month, the time in querying (which probably is in peak time) could be much more important than the time in deleting (which could be run when the computer is not busy, such as nights or weekends).
It is not the VIEW that has an influence on deleting the old data.

I would think that having the 24 indexed data sets might be about as
good as having the giant data set. I would think the indexes could be
used while accessing the data through views. Where's that guy that
says he knows everything about using indexed data sets?

I would agree that much depends on how the data is used. And I don't
know the answers to those questions.
-Mary
----- Original Message -----
From: ./ ADD NAME=Data _null_,
Sent: Friday, August 15, 2008 3:51 PM
Subject: Re: Deleting SAS Data from a SAS DATASET
Summary: PROC DATASETS; AGE statement. + VIEWs
This won't help you delete data from your very big data set, but you
may find this example interesting.
You say you append data monthly to a big data set then when big gets
too big you need to clean out the old. And that takes a very long
time.
However if you don't physically append but use a view to
append/combine you may find it easier to get rid of the unwanted old
data.
Consider this code. it pushes MonthlyUpdate onto the stack of 24 data
sets and the 24th data set is deleted. Then all the data sets get
renamed to produce a new group of 24. You can see from the notes how
the operation works. The data sets don't have to use a numbered range
M01-M24 I did that for my convenience.
proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;
NOTE: Deleting WORK.M24 (memtype=DATA).
NOTE: Aging the name WORK.M23 to WORK.M24 (memtype=DATA).
NOTE: Aging the name WORK.M22 to WORK.M23 (memtype=DATA).
NOTE: Aging the name WORK.M21 to WORK.M22 (memtype=DATA).
NOTE: Aging the name WORK.M20 to WORK.M21 (memtype=DATA).
NOTE: Aging the name WORK.M19 to WORK.M20 (memtype=DATA).
NOTE: Aging the name WORK.M18 to WORK.M19 (memtype=DATA).
NOTE: Aging the name WORK.M17 to WORK.M18 (memtype=DATA).
NOTE: Aging the name WORK.M16 to WORK.M17 (memtype=DATA).
NOTE: Aging the name WORK.M15 to WORK.M16 (memtype=DATA).
NOTE: Aging the name WORK.M14 to WORK.M15 (memtype=DATA).
NOTE: Aging the name WORK.M13 to WORK.M14 (memtype=DATA).
NOTE: Aging the name WORK.M12 to WORK.M13 (memtype=DATA).
NOTE: Aging the name WORK.M11 to WORK.M12 (memtype=DATA).
NOTE: Aging the name WORK.M10 to WORK.M11 (memtype=DATA).
NOTE: Aging the name WORK.M09 to WORK.M10 (memtype=DATA).
NOTE: Aging the name WORK.M08 to WORK.M09 (memtype=DATA).
NOTE: Aging the name WORK.M07 to WORK.M08 (memtype=DATA).
NOTE: Aging the name WORK.M06 to WORK.M07 (memtype=DATA).
NOTE: Aging the name WORK.M05 to WORK.M06 (memtype=DATA).
NOTE: Aging the name WORK.M04 to WORK.M05 (memtype=DATA).
NOTE: Aging the name WORK.M03 to WORK.M04 (memtype=DATA).
NOTE: Aging the name WORK.M02 to WORK.M03 (memtype=DATA).
NOTE: Aging the name WORK.M01 to WORK.M02 (memtype=DATA).
NOTE: Aging the name WORK.MONTHLYUPDATE to WORK.M01 (memtype=DATA).
Here is a complete example. Somewhat contrived but I think is covers
most of the relevant points.
/* create example data base 2 years of monthly data */
filename makedata temp;
data _null_;
file makedata;
do m = 24 to 1 by -1;
month = intnx('MONTH',today(),-m,'B');
format month date9.;
put 'data m' m z2. ';';
put +3 'month="' month +(-1)'"d;';
put +3 'do day = month to intnx("month",month,0,"E");';
put +6 'output;';
put +6 'end;';
put +3 'Format month monyy. day date.;';
put +3 'run;';
end;
run;
%inc makedata / source2;
/* macro var to make long list of member easier to manage */
%let members = ;
proc sql noprint;
select memname into :members separated by ' '
from dictionary.members
where libname eq 'WORK' and memname eqt 'M';
quit;
run;
%put NOTE: MEMBERS=&members;
/* a 2 year view */
data view24 / view=view24;
set &members open=defer;
run;
/* New data for montly update */
data MonthlyUpdate;
month = intnx('MONTH',today(),0,'B');
do day = month to intnx("Month",month,0,"E");
output;
end;
format month monyy.;
run;
proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;
/* use the view to access the past 24 months of data */
proc print data=view24;
run;
Post by SUBSCRIBE SAS-L Chandra Gadde
Hi All
I have several SAS datasets that are very very big. (50GB of size). Every
month, the data is being appended to these datasets. I need to deleted the
data which is greater than 24 months. What is the best method to do this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long time.
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Chang Chung
2008-08-15 22:03:52 UTC
Permalink
Post by Mary
A view helps on deletes, but I wonder how it affects performance of
querying the data- wouldn't storing the data in 24 different locations
cause a significant slowdown in perfomance upon querying the data versus
having it all in one table that is indexed? If this data is queryied a
lot but only deleted once a month, the time in querying (which probably is
in peak time) could be much more important than the time in deleting
(which could be run when the computer is not busy, such as nights or
weekends).

hi,
good point! but as shown below, you can make a view of multiple datasets
having the separate indexes, and where seems to be fine with this.
interesting. ran on sas 9.1.3 sp4 on windows.
cheers,
chang

/* test datasets */
data one two;
set sashelp.class;
ds = "one";
output one;
ds = "two";
output two;
run;

/* separately index on age */
proc datasets lib=work;
modify one;
index delete age;
index create age;
modify two;
index delete age;
index create age;
quit;

/* make a view of two datasets appended */
proc sql feedback;
drop view work.oneTwo;
create view oneTwo as
select * from one union select * from two
order by age;
quit;

/* where is utilizing the indexes */
options msglevel=i;
data sixteenAndOlder;
set oneTwo; /* we are using the view */
where age >= 16;
run;
/* on log
INFO: Index Age selected for WHERE clause optimization.
INFO: Index Age selected for WHERE clause optimization.
*/
Chang Chung
2008-08-15 22:39:19 UTC
Permalink
And what about the time of the database administrator who now needs to keep
track of 24 indexes to each index in the previous table? I just can't see
our DB2 administrators would have ever thought to organize the data this
way, though we had many very large tables, like the user had, that were
purged by month; it does seem like it would be a lot of "people work" to
manage 24 tables plus one index instead of just one table, even if querying
didn't suffer(and I do think that it could).
...
hi,
There are only 24, not like thousands or millions! :-) and with any DB
system, if you have a huge table, then it is more likely that it will map to
multiple physical files behind the scene anyway.
In terms of querying speed, i rather think it can be faster. One way is to
spawn many sas sessions to do multiple separate queries at the same time,
then put together... sort of like do-it-yourself parallel processing.
implementation is left as homework. happy friday!
cheers,
chang
Mary
2008-08-15 22:20:13 UTC
Permalink
And what about the time of the database administrator who now needs to =
keep track of 24 indexes to each index in the previous table? I just =
can't see our DB2 administrators would have ever thought to organize the =
data this way, though we had many very large tables, like the user had, =
that were purged by month; it does seem like it would be a lot of =
"people work" to manage 24 tables plus one index instead of just one =
table, even if querying didn't suffer(and I do think that it could).=20

-Mary
----- Original Message -----=20
From: Chang Chung=20
To: SAS-***@LISTSERV.UGA.EDU=20
Sent: Friday, August 15, 2008 5:03 PM
Subject: Re: Deleting SAS Data from a SAS DATASET



hi,
good point! but as shown below, you can make a view of multiple =
datasets
having the separate indexes, and where seems to be fine with this.
interesting. ran on sas 9.1.3 sp4 on windows.
cheers,
chang

/* test datasets */
data one two;
set sashelp.class;
ds =3D "one";
output one;
ds =3D "two";
output two;
run;

/* separately index on age */
proc datasets lib=3Dwork;
modify one;
index delete age;
index create age;
modify two;
index delete age;
index create age;
quit;

/* make a view of two datasets appended */
proc sql feedback;
drop view work.oneTwo;
create view oneTwo as
select * from one union select * from two
order by age;
quit;

/* where is utilizing the indexes */
options msglevel=3Di;
data sixteenAndOlder;
set oneTwo; /* we are using the view */
where age >=3D 16;
run;
/* on log
INFO: Index Age selected for WHERE clause optimization.
INFO: Index Age selected for WHERE clause optimization.
*/
Dave Scocca
2008-08-15 22:54:08 UTC
Permalink
Post by Chang Chung
good point! but as shown below, you can make a view of multiple datasets
having the separate indexes, and where seems to be fine with this.
interesting. ran on sas 9.1.3 sp4 on windows.
/* make a view of two datasets appended */
proc sql feedback;
drop view work.oneTwo;
create view oneTwo as
select * from one union select * from two
order by age;
quit;
One thought on this--at least in my general approach to using views--is you
probably want to avoid the ORDER BY statement--especially if it's the variable
you're indexing on and you want to use a subset.

For any amount of data above a trivial one, there's no sense in forcing the
view to re-order th
e data if it doesn't have to.

Dave
Dale McLerran
2008-08-15 22:38:46 UTC
Permalink
I am not expert with indexes, but I would expect that an index on each
data set would be employed even when the data are accessed through
a view. If that is correct, there would be clear advantage to indexing
data for each month separately. If the data are in one large file,
then when old data are deleted and new data added, the entire index
will need to be recreated on all of the data. That is a big operation.
If, instead, data are maintained in separate files for each month and
each month of data is indexed independently, then indexing needs to
be performed only for the new data. The indexes for months which do
not age out would not need to be touched.

Advantage, VIEWS.

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: ***@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
Post by Mary
Subject: Re: Deleting SAS Data from a SAS DATASET
Date: Friday, August 15, 2008, 2:59 PM
Post by Mary
A view helps on deletes, but I wonder how it affects
performance of querying the data- wouldn't storing the
data in 24 different locations cause a significant slowdown
in perfomance upon querying the data versus having it all in
one table that is indexed? If this data is queryied a lot
but only deleted once a month, the time in querying (which
probably is in peak time) could be much more important than
the time in deleting (which could be run when the computer
is not busy, such as nights or weekends).
It is not the VIEW that has an influence on deleting the
old data.
I would think that having the 24 indexed data sets might be
about as
good as having the giant data set. I would think the
indexes could be
used while accessing the data through views. Where's
that guy that
says he knows everything about using indexed data sets?
I would agree that much depends on how the data is used.
And I don't
know the answers to those questions.
Post by Mary
-Mary
----- Original Message -----
From: ./ ADD NAME=Data _null_,
Sent: Friday, August 15, 2008 3:51 PM
Subject: Re: Deleting SAS Data from a SAS DATASET
Summary: PROC DATASETS; AGE statement. + VIEWs
This won't help you delete data from your very
big data set, but you
Post by Mary
may find this example interesting.
You say you append data monthly to a big data set
then when big gets
Post by Mary
too big you need to clean out the old. And that
takes a very long
Post by Mary
time.
However if you don't physically append but use a
view to
Post by Mary
append/combine you may find it easier to get rid of
the unwanted old
Post by Mary
data.
Consider this code. it pushes MonthlyUpdate onto the
stack of 24 data
Post by Mary
sets and the 24th data set is deleted. Then all the
data sets get
Post by Mary
renamed to produce a new group of 24. You can see
from the notes how
Post by Mary
the operation works. The data sets don't have to
use a numbered range
Post by Mary
M01-M24 I did that for my convenience.
proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;
NOTE: Deleting WORK.M24 (memtype=DATA).
NOTE: Aging the name WORK.M23 to WORK.M24
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M22 to WORK.M23
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M21 to WORK.M22
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M20 to WORK.M21
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M19 to WORK.M20
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M18 to WORK.M19
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M17 to WORK.M18
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M16 to WORK.M17
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M15 to WORK.M16
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M14 to WORK.M15
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M13 to WORK.M14
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M12 to WORK.M13
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M11 to WORK.M12
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M10 to WORK.M11
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M09 to WORK.M10
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M08 to WORK.M09
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M07 to WORK.M08
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M06 to WORK.M07
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M05 to WORK.M06
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M04 to WORK.M05
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M03 to WORK.M04
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M02 to WORK.M03
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.M01 to WORK.M02
(memtype=DATA).
Post by Mary
NOTE: Aging the name WORK.MONTHLYUPDATE to WORK.M01
(memtype=DATA).
Post by Mary
Here is a complete example. Somewhat contrived but I
think is covers
Post by Mary
most of the relevant points.
/* create example data base 2 years of monthly data
*/
Post by Mary
filename makedata temp;
data _null_;
file makedata;
do m = 24 to 1 by -1;
month =
intnx('MONTH',today(),-m,'B');
Post by Mary
format month date9.;
put 'data m' m z2. ';';
put +3 'month="' month
+(-1)'"d;';
Post by Mary
put +3 'do day = month to
intnx("month",month,0,"E");';
Post by Mary
put +6 'output;';
put +6 'end;';
put +3 'Format month monyy. day
date.;';
Post by Mary
put +3 'run;';
end;
run;
%inc makedata / source2;
/* macro var to make long list of member easier to
manage */
Post by Mary
%let members = ;
proc sql noprint;
select memname into :members separated by '
'
Post by Mary
from dictionary.members
where libname eq 'WORK' and memname
eqt 'M';
Post by Mary
quit;
run;
%put NOTE: MEMBERS=&members;
/* a 2 year view */
data view24 / view=view24;
set &members open=defer;
run;
/* New data for montly update */
data MonthlyUpdate;
month =
intnx('MONTH',today(),0,'B');
Post by Mary
do day = month to
intnx("Month",month,0,"E");
Post by Mary
output;
end;
format month monyy.;
run;
proc datasets library=work;
age MonthlyUpdate m01-m24;
run;
quit;
/* use the view to access the past 24 months of data
*/
Post by Mary
proc print data=view24;
run;
On 8/15/08, SUBSCRIBE SAS-L Chandra Gadde
Post by SUBSCRIBE SAS-L Chandra Gadde
Hi All
I have several SAS datasets that are very very
big. (50GB of size). Every
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
month, the data is being appended to these
datasets. I need to deleted the
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
data which is greater than 24 months. What is
the best method to do this?
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
Please help me.
I tried PROC SQS and DATA STEP. But these two
are taking very long time.
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then
delete;
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
run;
proc sql;
crete table prod.master_date from
prod.master_date
Post by Mary
Post by SUBSCRIBE SAS-L Chandra Gadde
where snap_dt ne "&end_dt"d;
quit;
unknown
2008-08-15 23:16:13 UTC
Permalink
Post by Dave Scocca
Post by Chang Chung
good point! but as shown below, you can make a view of multiple datasets
having the separate indexes, and where seems to be fine with this.
interesting. ran on sas 9.1.3 sp4 on windows.
/* make a view of two datasets appended */
proc sql feedback;
drop view work.oneTwo;
create view oneTwo as
select * from one union select * from two
order by age;
quit;
One thought on this--at least in my general approach to using views--is you
probably want to avoid the ORDER BY statement--especially if it's the variable
you're indexing on and you want to use a subset.
For any amount of data above a trivial one, there's no sense in forcing the
view to re-order th
e data if it doesn't have to.
Eliminating ORDER BY is not going to do any good unless the ALL option is
invoked for the UNION operator. Otherwise the view is going to purge
duplicate rows, and will sort the rows to support that process. Ian
mentioned the ALL option but did not emphasize its importance.
Post by Dave Scocca
Dave
Ian Whitlock
2008-08-15 23:07:31 UTC
Permalink
Summary: Suggestions for maintaining large data set
#iw-value=1

Chandra,

You are essentially making a copy of a 50GB file every month. Of
course, it takes long. Consequently you either have to copy the data
more efficiently or modify how the data is stored and/or used.

Have you looked at system options to make this as efficient as
possible? Big buffers, many buffers. Different disk handlers for
input and output would also speed up the process since it is mainly IO
bound. (These topics have been discussed on SAS-L, in papers, and
documentation. I have not found the techniques necessary, but then I
have not worked with 50 GB data sets.)

You are adding about 2 gig per month. That is 4% increase. Perhaps
accepting the increase one month and then modifying the second month
would make more sense, if the data is accessed infrequently. To do
this you need to control all programs that read the data set.

It might pay to combine the previous suggestion with the MODIFY
command. I suspect that SAS can skip over deleted records pretty
quickly.

It might pay to store the data by month and each month create a new
view

data prod.master / view = prod.master ;
set prod.mon_x ... prod.mon_y ;
run ;

The process could even be automated with something like

%macro prod_list ( end_dt = 1jul2008 ) ;
%local i mon ;
%do i = -11 %to 0 ;
%let mon = %sysfunc(intnx(month,"&end_dt"d,&i)) ;
%let mon = %sysfunc(putn(&mon, monyy.)) ;
%do ; prod.&mon %end ;
%end ;
%mend prod_list ;

%macro last_month() ;
%local mon ;
%let mon = %sysfunc(today()) ;
%sysfunc(intnx(month,&mon,-1),date9.)
%mend last_month ;

data prod.master / view = prod.master ;
set %prod_list(end_dt = %last_month()) ;
run ;

SQL with the UNION CORR ALL operator provides another way to make the
view which might provide SQL options for efficiency. (Not tested.)

The view takes seconds to set up. There is a cost each time the view
is accessed but this cost may be quite preferable to making a copy
each month. One advantage to this solution is it could get you
permanently out of the loop. The only programs requiring change would
be those that

1) use options such as NOBS= that will not work with views,
2) access meta data refering to TYPE=DATA,
3) use SQL programs and refer to TABLE PROD.MASTER, and
4) use PROD.MASTER as output.

Ian Whitlock
================

Date: Fri, 15 Aug 2008 13:14:22 -0400
Reply-To: SUBSCRIBE SAS-L Chandra Gadde <***@GMAIL.COM>
Sender: "SAS(r) Discussion"
From: SUBSCRIBE SAS-L Chandra Gadde <***@GMAIL.COM>
Subject: Deleting SAS Data from a SAS DATASET

Hi All

I have several SAS datasets that are very very big. (50GB of size).
Every month, the data is being appended to these datasets. I need to
deleted the data which is greater than 24 months. What is the best
method to do this? Please help me.

I tried PROC SQS and DATA STEP. But these two are taking very long
time.

Data prod.Master_data; set prod.master_date; if snap_dt = "&end_dt"d
then delete; run;

proc sql; crete table prod.master_date from prod.master_date where
snap_dt ne "&end_dt"d; quit;
Ian Whitlock
2008-08-16 15:26:49 UTC
Permalink
Summary: SAS Scalable Performance Data Server
#iw-value=1

Chandra,

You now have a number of reasonable suggestions to your problem that
place the burden on the programmer, but I did not see mention of the
SAS Scalable Performance Data Server. This is a SAS product designed
to handle large amounts of data. Considering the size of your data
set suggests that perhaps it is time to spend money on the problem.
Even the SPDE engine included with the base product might help,
particularly if the computer has several CPUs.

Ian Whitlock
===============

Date: Fri, 15 Aug 2008 13:14:22 -0400
Reply-To: SUBSCRIBE SAS-L Chandra Gadde <***@GMAIL.COM>
Sender: "SAS(r) Discussion"
From: SUBSCRIBE SAS-L Chandra Gadde <***@GMAIL.COM>
Subject: Deleting SAS Data from a SAS DATASET

Hi All

I have several SAS datasets that are very very big. (50GB of size).
Every month, the data is being appended to these datasets. I need to
deleted the data which is greater than 24 months. What is the best
method to do this? Please help me.

I tried PROC SQS and DATA STEP. But these two are taking very long
time.

Data prod.Master_data; set prod.master_date; if snap_dt = "&end_dt"d
then delete; run;

proc sql; crete table prod.master_date from prod.master_date where
snap_dt ne "&end_dt"d; quit;
SUBSCRIBE SAS-L Chandra Gadde
2008-08-18 14:26:14 UTC
Permalink
When I used this code, I am receiving the following warning. It completely
deleted my prod.master_date SAS Dataset. Do you know what is going on here?

WARNING: This SAS global statement is not supported in PROC SQL. It has
been ignored.
Post by Mary
One thing you might do is to add an index on the snap_dt to the dataset;
proc sql;
delete from prod.master_date;
where snap_dt = "&end_dt"d;
quit;
run;
In both the ways you are trying now you are creating new data sets rather
than deleting records from the current data set; it would seem to me that a
SQL delete statement would be faster than creating new datasets even if
there isn't an index on the date.
Post by Mary
-Mary
----- Original Message -----
From: SUBSCRIBE SAS-L Chandra Gadde
Sent: Friday, August 15, 2008 12:14 PM
Subject: Deleting SAS Data from a SAS DATASET
Hi All
I have several SAS datasets that are very very big. (50GB of size). Every
month, the data is being appended to these datasets. I need to deleted
the
Post by Mary
data which is greater than 24 months. What is the best method to do this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long time.
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Mary
2008-08-18 14:02:45 UTC
Permalink
Chandra,

This is an excellent suggestion from Ian! Perhaps just recreate the =
data set (you would in that case want to recreate, and thus NOT use the =
SQL delete statement) once every 2-3 months and delete 2-3 months worth =
of data at that time; it would seem that that wouldn't take too much =
longer than deleting 1 month at a time, and thus the total time would be =
reduced by as much as 50%.

And this would meet my favorite efficiency marker, since I've met a lot =
of people who don't want to work weekends and have yet to meet a =
computer who cares, "less people time!"

-Mary
----- Original Message -----=20
From: Ian Whitlock=20
To: SAS-***@LISTSERV.UGA.EDU=20
Sent: Friday, August 15, 2008 6:07 PM
Subject: Re: Deleting SAS Data from a SAS DATASET


Summary: Suggestions for maintaining large data set
#iw-value=3D1

Chandra,


You are adding about 2 gig per month. That is 4% increase. Perhaps
accepting the increase one month and then modifying the second month
would make more sense, if the data is accessed infrequently. To do
this you need to control all programs that read the data set.
unknown
2008-08-18 15:38:07 UTC
Permalink
On Mon, 18 Aug 2008 10:26:14 -0400, SUBSCRIBE SAS-L Chandra Gadde
Post by SUBSCRIBE SAS-L Chandra Gadde
When I used this code, I am receiving the following warning. It completely
deleted my prod.master_date SAS Dataset. Do you know what is going on here?
WARNING: This SAS global statement is not supported in PROC SQL. It has
been ignored.
In PROC SQL, WHERE must be a clause within a statement, not a freestanding
statement. Get rid of the semicolon immediately before the word "where".

Incidentally, I think the warning is inaccurate in referring to a WHERE
statement as a global statement.
Post by SUBSCRIBE SAS-L Chandra Gadde
Post by Mary
One thing you might do is to add an index on the snap_dt to the dataset;
proc sql;
delete from prod.master_date;
where snap_dt = "&end_dt"d;
quit;
run;
In both the ways you are trying now you are creating new data sets rather
than deleting records from the current data set; it would seem to me that a
SQL delete statement would be faster than creating new datasets even if
there isn't an index on the date.
Post by Mary
-Mary
----- Original Message -----
From: SUBSCRIBE SAS-L Chandra Gadde
Sent: Friday, August 15, 2008 12:14 PM
Subject: Deleting SAS Data from a SAS DATASET
Hi All
I have several SAS datasets that are very very big. (50GB of size). Every
month, the data is being appended to these datasets. I need to deleted
the
Post by Mary
data which is greater than 24 months. What is the best method to do this?
Please help me.
I tried PROC SQS and DATA STEP. But these two are taking very long time.
Data prod.Master_data;
set prod.master_date;
if snap_dt = "&end_dt"d then delete;
run;
proc sql;
crete table prod.master_date from prod.master_date
where snap_dt ne "&end_dt"d;
quit;
Loading...