Discussion:
SQL Joins of Very Large Oracle Tables
Paul Bartells
2008-06-19 21:41:33 UTC
Hey, guys. I need some help. I have two jobs that execute on a UNIX
server and access one particular Oracle database. Each needs to pull
information from four tables of varying sizes, ranging from 18.8M to 213M
observations. I have tried a number of techniques to try to get past
messages like:
ERROR: Write to WORK.'#tf0021'n.UTILITY failed. File is full and may be
damaged.
NOTE: Error was encountered during utility-file processing.
You may be able to execute the SQL statement successfully
if you allocate more space to the WORK
library.
ERROR: There is not enough WORK disk space to store the results of an
internal sorting phase.
ERROR: An error has
occurred.
ERROR: Write to WORK.CCT_LEG_CALLS4.DATA failed. File is full and may be
damaged.

I have tried restructuring the query, limiting the number of fields
retrieved, filtering out as many rows as I can based on dates or other
criteria, breaking the query into pieces, etc., with minimal improvement.
To my knowledge, these tables are not indexed. I have tried HASH methods,
but I haven't been able to make them work. I was told that pass-through
connections might perform better, but they don't work with local files. I
feel certain there must be some other approach that would be effective,
but I've pretty much run out of ideas.

So, if you gurus would ponder the following query and offer some
suggestions, I would be eternally grateful.

proc sql;

create table cct_leg_calls4 as

select a.*,
b.conhistid,
b.starttime as cont_start_dttm,
c.actionid as action_cd,
c.useractionnotes,
c.sysactionnotes,
e.sysacc

from cct_reps a,
ATLX.CONTACTSEGMENT b,
ATLX.CONTACTACTION c,
ATLX.CONTACTHISTORY d,
ATLX.ACCOUNT e

where
a.employeeid eq b.employeeid and
b.conhistid eq c.conhistid and
b.conhistid eq d.conhistid and
d.accountid eq e.accountid and
b.channelid eq 'CSR' and
b.conhistid ge 202195000 and
c.actionid not in ('OT') and
datepart(b.starttime) ge mdy&start_d and
datepart(b.starttime) le mdy&end_d and
b.starttime ne . and
b.conhistid ne . and
c.conhistid ne . and
d.conhistid ne . and
d.accountid ne . and
e.accountid ne . ; quit;

Table work.cct_reps has roughly 8400 observations.
Table ATLX.CONTACTSEGMENT has 213 million observations.
Table ATLX.CONTACTACTION has 155 million observations.
Table ATLX.CONTACTHISTORY has 198 million observations.
Table ATLX.ACCOUNT has 19 million observations.

This query will churn for several hours (8+) before it finally chokes.

Ideas, suggestions, questions?

Thanks.

Paul Bartells
TXU Energy
Irving, TX
Mary
2008-06-19 21:49:38 UTC
You might try breaking it up into multiple queries, starting with
combining cct_reps and CONTACTSEGMENT (your a to b), along with all your
clauses on those; then join that table to your c table, and so forth.
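
A rough sketch of that first step (the WORK table name below is just a
placeholder; the filters and macro variables are the ones from your query):

proc sql;
   /* step 1: join the small local table to CONTACTSEGMENT only,
      keeping every filter that touches table b in this step */
   create table work.step1_ab as
   select a.*,
          b.conhistid,
          b.starttime as cont_start_dttm
     from cct_reps a,
          ATLX.CONTACTSEGMENT b
    where a.employeeid = b.employeeid
      and b.channelid = 'CSR'
      and b.conhistid ge 202195000
      and b.conhistid ne .
      and b.starttime ne .
      and datepart(b.starttime) between mdy&start_d and mdy&end_d;
quit;

The result can then be joined to CONTACTACTION (with its actionid
filter), then to CONTACTHISTORY, then to ACCOUNT, each step shrinking
the data before the next join.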

-Mary
Jack Hamilton
2008-06-19 21:56:16 UTC
A common approach to this kind of problem is to get permission to create
an Oracle table with the data in your one SAS table (cct_reps in this
case). Then you can run the entire query in Oracle via passthrough.

It also looks like only one of your WHERE conditions uses the local
table. If you can't create a copy of the local table in Oracle, perhaps
you could create a passthrough view of all the remaining rows and
columns, and join that with your local table. You might have to have
the database do a sort on employeeid.

Another possibility is to generate a WHERE clause containing the
employeeids you want to keep, and use that in the Oracle passthrough
query. There are limits on the size of IN lists, though, and yours might
be too large.
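
For the first approach, a minimal sketch (the libref, connection
options, and scratch schema/table names are placeholders your DBA would
supply; the DATEPART/MDY date window would also need to be rewritten in
Oracle terms, e.g. with TO_DATE):

/* copy the ~8,400-row SAS table into Oracle, then run the whole join
   there via explicit pass-through */
libname orawrk oracle user=&orauser password=&orapw path=&orapath
        schema=scratch;

data orawrk.cct_reps_tmp;
   set work.cct_reps;
run;

proc sql;
   connect to oracle (user=&orauser password=&orapw path=&orapath);
   create table cct_leg_calls4 as
   select * from connection to oracle
      (select a.*, b.conhistid, b.starttime, c.actionid,
              c.useractionnotes, c.sysactionnotes, e.sysacc
         from scratch.cct_reps_tmp a,
              atlx.contactsegment  b,
              atlx.contactaction   c,
              atlx.contacthistory  d,
              atlx.account         e
        where a.employeeid = b.employeeid
          and b.conhistid  = c.conhistid
          and b.conhistid  = d.conhistid
          and d.accountid  = e.accountid
          and b.channelid  = 'CSR'
          and b.conhistid >= 202195000
          and c.actionid  not in ('OT')
          /* plus the starttime window, expressed with TO_DATE */);
   disconnect from oracle;
quit;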



--
Jack Hamilton
Sacramento, California
***@alumni.stanford.org <== Use this, not ***@stanfordalumni.org
Mary
2008-06-19 22:08:33 UTC
One thing another user tried was to put all the IDs he wanted to
pull into a macro variable. He reported back that it worked great! Here's
that discussion.

-Mary

Hi, Dan,

Howard Schreier gave you code to get your variables into macro
variables - here's how to take it from there (you'd substitute your SQL
in the second SQL of the run_pulls code below). Also make sure, before
you run it, that the quoted id values do not contain spaces:

Good luck !!!

-Mary

data example;
   do _n_ = 21 to 100000;
      id = put(_n_, z5.);
      output;
   end;
run;

data listset(drop=id);
   run + 1;
   length list $ 20000;
   do _n_ = 1 to 500 until(nomore);
      set example end=nomore;
      list = catx(',', list, quote(id));
   end;
run;

data allset;
run;

%macro run_pulls(model_count);

   proc sql noprint;
      select list into :idlist
         from listset
         where run = &model_count;
   quit;

   proc sql;
      create table tempset as
         select id from example
         where id in (&idlist);
   quit;

   data allset;
      set allset tempset;
   run;

%mend run_pulls;

%macro docalls;
   %local i;

   proc sql noprint;
      select count(*) into :model_count
         from listset;
   quit;

   %do i = 1 %to &model_count;
      %run_pulls(&i);
   %end;

%mend docalls;

%docalls;



----- Original Message -----
From: Dan
To: SAS-***@LISTSERV.UGA.EDU
Sent: Tuesday, January 22, 2008 2:27 PM
Subject: PROC SQL: Using local database when querying DSN


I need to use a local file to filter the results I am receiving in a PROC
SQL statement on the fly.

I have a database of ID's:
Local.IDList

My code so far, without joining Local.IDList is here:
PROC SQL;
CONNECT TO DB2(DSN=DSNServ user=&ID password=&PWD);
CREATE TABLE Local.SampleSet AS
SELECT * FROM CONNECTION TO DB2
(
SELECT
S.ID,
S.Date,
S.CDType,
A.Parp,
K.PartNum
FROM DD.FullList S
LEFT OUTER JOIN DD.ParpList A ON
S.CDSysKey = A.CDSysKey
LEFT OUTER JOIN DD.PartList K ON
A.PartIndex = K.PartIndex
WHERE K.PartNum in ('01','07','09')
);
%PUT &SQLXMSG;
DISCONNECT FROM DB2;

My problem is that the above pull is over 700 GB, so I cannot just pull it
and later filter it. My other problem is that I do not have write access
to the DSN server, so I cannot upload my file to a table. Is what I ask
even possible?

IDList is too long to put into the SQL itself, as it is over 100,000 rows.
One idea I thought of was a macro that would run the query once for each
item in IDList, but I wouldn't know how to implement that, and it doesn't
seem very efficient either, especially for the size of the data sets I am
working with.

Is what I ask even possible to do without blowing up my computer and/or
temp space with the massive size of the tables I am querying against?

Thanks for your time,
Dan


Sigurd Hermansen
2008-06-20 02:21:59 UTC
Paul:
We don't usually get eternal gratitude on the 'L. Glad to help when we
can ....

I'd start with

....
from
(
   (select * from cct_reps) as a
   inner join
   (select * from ATLX.CONTACTSEGMENT
      where channelid = 'CSR'
        and conhistid > 202195000
        and starttime ne .
        and conhistid ne .
        and datepart(starttime)
            between mdy&start_d and mdy&end_d
   ) as b
   on a.employeeid = b.employeeid
)
;
With cct_reps in the 8,400 tuple range, the multiples of employeeids
that remain after subsetting will determine the upper bound on the
yield of the outer select query. Of course, even a subset of 213 million
tuples (rows) can be very large. Try running the query to select the
tuples into a WORK dataset and observe the cardinality of the relation
(number of tuples) it selects.
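
For example, a quick cardinality check of the CONTACTSEGMENT subset
before materializing it (just a sketch, reusing the filters above):

proc sql;
   /* how many rows would the first-step subset keep? */
   select count(*) as seg_subset_rows
     from ATLX.CONTACTSEGMENT
    where channelid = 'CSR'
      and conhistid > 202195000
      and starttime ne .
      and datepart(starttime) between mdy&start_d and mdy&end_d;
quit;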

Follow the same strategy to reduce the search space as much as possible
at each step. Remove the irrelevant tuples prior to each join.


ATLX.CONTACTACTION c,
ATLX.CONTACTHISTORY d,
ATLX.ACCOUNT e

where
b.conhistid eq c.conhistid and
b.conhistid eq d.conhistid and
d.accountid eq e.accountid and
c.actionid not in ('OT') and
c.conhistid ne . and
d.conhistid ne . and
d.accountid ne . and
e.accountid ne . ; quit;

None of this will succeed if SAS SQL in fact is optimizing the query.
Unless you have large numbers of multiples of join keys in at least one
of the tables, progressive subsetting of tuples prior to the next join
should succeed. I suspect in this case that SAS SQL is transferring full
sets of tuples from Oracle to SAS and then attempting to sort them.
Sometimes it works best to override the SAS SQL query optimizer.
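
Before overriding anything it helps to see what the optimizer chose and
what SAS actually shipped to Oracle. A sketch using PROC SQL's _METHOD
and _TREE options plus SAS/ACCESS tracing (the small test query is only
an illustration; trace output varies by release):

options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc sql _method _tree;
   /* join plan appears in the log; the trace shows the SQL sent to Oracle */
   create table work.plan_check as
   select b.conhistid
     from cct_reps a, ATLX.CONTACTSEGMENT b
    where a.employeeid = b.employeeid
      and b.channelid = 'CSR'
      and b.conhistid ge 202195000;
quit;

options sastrace=off;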

Next steps could be to test a "pass-thru" query to check whether
existing Oracle indexes on employeeid and other id's will limit tables
on the Oracle side to manageable datasets. More on that later.
S


A. Michielsen
2008-06-21 03:18:02 UTC
Paul,
I think you have a solution for now - but here's some information
you may find of use in the future....
I'll type inside your original message.
ERROR: Write to WORK.'#tf0021'n.UTILITY failed. File is full...
if you allocate more space to the WORK...
This means your SAS Work space has run out of room - disk full.
Because all of this query can't be passed to Oracle, each of the
tables is brought over from Oracle individually - in effect,
SELECT * FROM each Oracle table. The best solution would be
to move as much of the 'work' away from SAS as possible - that is, move
the work back over to Oracle.
Instead of the original five-table query, you might consider:
proc sql;
create table tmp_calls4 as
select b.employeeid,
b.conhistid,
b.starttime ,
c.actionid ,
c.useractionnotes,
c.sysactionnotes,
e.sysacc
from ATLX.CONTACTSEGMENT b,
ATLX.CONTACTACTION c,
ATLX.CONTACTHISTORY d,
ATLX.ACCOUNT e
where b.conhistid eq c.conhistid and
b.conhistid eq d.conhistid and
d.accountid eq e.accountid and
b.channelid eq 'CSR' and
b.conhistid ge 202195000 and
c.actionid not in ('OT') and
datepart(b.starttime) ge mdy&start_d and
datepart(b.starttime) le mdy&end_d and
b.starttime ne . and
b.conhistid ne . and
c.conhistid ne . and
d.conhistid ne . and
d.accountid ne . and
e.accountid ne . ; quit;

proc sql;
create table cct_leg_calls4 as
select a.*,
b.conhistid,
b.starttime as cont_start_dttm,
b.actionid as action_cd,
b.useractionnotes,
b.sysactionnotes,
b.sysacc
from cct_reps a,
tmp_calls4 b
where a.employeeid eq b.employeeid; quit;

Likewise, if you still run out of WORK space with the same error, the
first alternate query could be modified into a SAS pass-through query,
which absolutely would move the joins over to Oracle.
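
A sketch of what that pass-through version might look like (connection
options are placeholders, and the date bounds are hard-coded here purely
for illustration - in practice you would generate the TO_DATE arguments
from &start_d and &end_d):

proc sql;
   connect to oracle (user=&orauser password=&orapw path=&orapath);
   create table tmp_calls4 as
   select * from connection to oracle
      (select b.employeeid, b.conhistid, b.starttime,
              c.actionid, c.useractionnotes, c.sysactionnotes, e.sysacc
         from atlx.contactsegment b,
              atlx.contactaction  c,
              atlx.contacthistory d,
              atlx.account        e
        where b.conhistid  = c.conhistid
          and b.conhistid  = d.conhistid
          and d.accountid  = e.accountid
          and b.channelid  = 'CSR'
          and b.conhistid >= 202195000
          and c.actionid  not in ('OT')
          and b.starttime is not null   /* the SAS ne . tests map to IS NOT NULL */
          and b.starttime >= to_date('01-JAN-2008','DD-MON-YYYY')
          and b.starttime <  to_date('01-FEB-2008','DD-MON-YYYY'));
   disconnect from oracle;
quit;

The second, small join back to cct_reps then stays exactly as in the
second query above.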
