JL Computer Consultancy
A slightly unusual use of pipeline functions |
July 2006 |
A large part of my job is to find ways to make databases “go faster” without making any significant changes to structure or code. “Significant” is a bit vague, of course, but typically means things like “invisible to the application” as far as the structure is concerned, or “very localized” if code is involved, and always means “minimum risk, minimum side-effects”.
I have come across two examples recently where short data-loading scripts could be modified very locally to reduce the workload resources used – particularly CPU and disk activity, quite dramatically. It isn’t a mechanism that will be universally appropriate – but for the right data pattern it’s certainly worth considering.
The requirement is quite simple. We had data loads where rows belong to ‘groups’, need to be given some form of ‘group id’ as they get into the database. There are many ways to implement this type of requirement, with various costs involved. One option is to load the data, create a summary (or parent) table for the loaded data, then use the summary table to generate a sequence-based id. Another option would be to play around with analytic functions, partitioning on the “group” columns, and using a mechanism to bring in a “group id” on the first row of each “group”.
Every option introduces its own drawbacks – the first example above, for example, may require you to do too much work populating and updating real (or perhaps global temporary) tables; the second option may do excessive amounts of sorting as a side effect of using analytic functions if the rows are wide rows.
SO, as an extra weapon in the armoury, here’s a method that uses pipeline functions to minimise the I/O and CPU costs of sorting or using tables. The example is given as a script with in-line comments. The notional task is to add a sequence number to rows, based on grouping values by multiple columns.
rem
rem Script: pipe_sequence.sql
rem Author: Jonathan Lewis
rem Dated: June 2006
rem Purpose: Getting a sequence number onto grouped data.
rem
rem Last tested
rem 10.2.0.1
rem 9.2.0.6
rem Not tested
rem 10.1.0.4
rem Not relevant
rem 8.1.7.4
rem
rem Notes:
rem There are various ways you might group data, and then
rem add a sequence number (or other unique identifier) to
rem rows that belong to the same group.
rem
rem This case was based on a 'store card' number used to group
rem together purchases that occured within a short time period
rem of each other. (The real system defined as visit as all
rem purchases within 10 minutes of each other - which required
rem playing games with rounding on a time interval).
rem
rem Other options
rem use of analytic functions
rem creating global temporary tables
rem calling functions that return either nextval or currval
rem
rem Enhancements:
rem Change the function to use pl/sql array fetches (bulk collect)
rem
drop table t1;
drop sequence s1;
drop type jpl_table_type;
drop type jpl_scalar_type;
rem
rem We will be collating rows with the same value for columns
rem group_1 and group_2, and giving them a single sequence value
rem
create sequence s1;
create table t1 as
select
rownum id,
mod(rownum-1,10) group_1,
lpad(mod(rownum-1,10),10,'0') group_2,
trunc((rownum-1)/10) data1,
rpad('x',100) padding
from
all_Objects
where
rownum <= 100
;
rem
rem A pipeline function needs an array type as its return,
rem a corresponding scalar type for each row created
rem
create type jpl_scalar_type as object(
group_1 number,
group_2 varchar2(10),
seq_id number
);
/
create type jpl_table_type as table of jpl_scalar_type;
/
rem
rem We select distinct the columns that are the identifying
rem columns – by hiding this query in a pipeline function we
rem a) can use the nextval construct
rem b) avoid creating an intermediate table
rem c) minimise the width of the row being sorted
rem
create or replace function pipe_fun
return jpl_table_type
pipelined
as
begin
for r in (
select group_1, group_2, s1.nextval seq_id
from (
select distinct group_1, group_2
from t1
order by group_1, group_2
)
)
loop
pipe row (jpl_scalar_type( r.group_1, r.group_2, r.seq_id));
end loop;
return;
end;
.
/
rem
rem To demonstrate how to use the mechanism we join the
rem the pipeline function to the base table on the grouping
rem columns – which allows us to append the sequence value
rem to the end of the row in (apparently) one step
rem
break on seq_id skip 1
column padding noprint
select
/*+ ordered use_hash (t) */
v.seq_id,
t.group_1,
t.group_2,
t.id,
t.data1,
t.padding
from
table(pipe_fun) v,
t1 t
where
t.group_1 = v.group_1
and t.group_2 = v.group_2
order by
v.seq_id, t.group_1, t.group_2, t.id
;
rem
rem Note – there is an “implementation detail” in 9i
rem which means you hit three rows in the dictionary
rem cache every time you fetch a row from the pipeline
rem function. This is a concurrency threat that has been
rem eliminated in 10g.
rem




