
file consists of a collection of slotted-pages, and each
slotted-page contains a collection of records. A page is
a fixed-size block of allocated memory. A slotted-page is
a page that contains an array of byte-offsets that reference
the start of each record allocated within the page. Typi-
cally, each attribute of the record is stored in consecutive
memory addresses on the slotted page, in what is called the
N-ary storage model. Common implementations of the N-
ary storage model place variable-length attributes after the
fixed-length attributes, and use fixed-length placeholders
to reference the relocated attributes.
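To make this layout concrete, the following is a minimal sketch of a slotted page and of the fixed-length placeholder used for a relocated variable-length attribute; all structure names, field widths, and the 8 KB page size are illustrative assumptions rather than details of any particular system.

#include <cstddef>
#include <cstdint>

// Illustrative slotted page (names and sizes are assumptions).
constexpr std::size_t kPageSize = 8192;  // one fixed-size page

struct SlottedPageHeader {
    std::uint16_t record_count;    // number of slots in use
    std::uint16_t free_space_end;  // records grow downward from here
};

// One fixed-size page: a header, an array of byte offsets (one per
// record, each referencing the start of that record within the page),
// free space, and the record data itself at the end of the page.
struct SlottedPage {
    SlottedPageHeader header;
    std::uint16_t     slot[1];     // slot[i] = byte offset of record i
    // ... free space, then record data allocated from the page's end
};

// Under the N-ary storage model, a record's fixed-length attributes
// are stored contiguously; each variable-length attribute is replaced
// by a fixed-length placeholder that references the relocated bytes.
struct VarAttrPlaceholder {
    std::uint16_t offset;  // where the variable-length bytes begin
    std::uint16_t length;  // how many bytes they occupy
};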
A recently proposed storage model, called PAX, ver-
tically decomposes each record, storing each attribute in
subdivisions of a page, called mini-pages [2]. The PAX
storage model improves the cache utilization for queries
that access only a few attributes from a high percentage
of records, such as during a sequential file scan. This im-
provement in cache utilization is a direct result of being
able to pack attributes of many different records into the
same cache line. However, for access plans where only a
fraction of the records are accessed, such as during an index
scan, the PAX storage model can result in a greater number
of cache misses. In such cases, N-ary may actually perform
better, especially as the number of attributes accessed per
record increases.
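For contrast, a rough sketch of a PAX-style page follows; the attribute names, the per-page record capacity, and the use of plain arrays as mini-pages are assumptions made for illustration, not the exact format from [2].

#include <cstddef>
#include <cstdint>

// Rough PAX-style layout: one mini-page (here simply an array) per
// attribute, so the values of a single attribute from many records
// are packed densely and share cache lines.
constexpr std::size_t kRecordsPerPage = 256;  // assumed capacity

struct PaxPageSketch {
    std::uint16_t record_count;
    std::int32_t  a[kRecordsPerPage];  // mini-page for attribute a
    std::int32_t  b[kRecordsPerPage];  // mini-page for attribute b
    float         c[kRecordsPerPage];  // mini-page for attribute c
    // variable-length attributes would need per-value offsets; omitted
};

// A scan filtering on attribute `a` touches only the `a` mini-page, so
// a 32-byte cache line delivers eight useful 4-byte values; an index
// scan that probes a few scattered records instead pays a separate
// miss per mini-page it touches, which is the weakness noted above.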
As previously demonstrated, both the N-ary and the
PAX storage models work well for different families of ac-
cess plans. A choice between the two storage models is dif-
ficult as the query workload may be large or may vary over
time. A better storage model should incorporate the best
characteristics of both models; namely, vertical decompo-
sition and groups of sequentially allocated attributes. This
storage model should also provide the flexibility to adapt to
changing workloads. In this paper, we propose a novel data
storage technique, called Data Morphing (DM), that meets
these goals.
This paper makes the following contributions:
• We present a flexible page architecture that is a gener-
alization of the previously proposed PAX page architec-
ture. In the Data Morphing page architecture, attributes
of the same tuple can be stored in non-contiguous groups,
which increases the spatial locality of memory accesses.
As a result, fewer cache misses are incurred during query
processing.
• We present two algorithms for calculating the attribute
groups that are stored on each page. The Naive algorithm
performs an exhaustive search over the possible attribute
layouts, but is prohibitively expensive for relations with a
large number of attributes. The HillClimb algorithm instead
performs a greedy search of the layout space, trading guar-
anteed optimality for a much lower running time; a small
illustrative sketch of such a greedy search follows this list.
• We present an empirical evaluation of the Data Morphing
technique. We show that the partitioning algorithms cal-
culate attribute groupings that reduce the number of cache
misses incurred during query execution. As a direct result
of improving the cache utilization, our prototype DBMS
implementing the DM technique can evaluate queries up
to 45% faster than the N-ary storage model and up to 25%
faster than the PAX model.
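As a rough illustration of the greedy search mentioned in the second contribution (a sketch only, not the actual HillClimb algorithm, whose cost model is described later in the paper), the code below starts with every attribute in its own group and repeatedly applies the single best pairwise merge of two groups under an assumed cost function.

#include <cstddef>
#include <vector>
#include <functional>

// Sketch of a greedy (hill-climbing) search over attribute groupings.
// The cost function is a placeholder standing in for a cache-miss
// estimate; it is an assumption of this sketch, not the paper's model.
using Group  = std::vector<int>;          // attribute indices
using Layout = std::vector<Group>;        // a partition of the attributes
using CostFn = std::function<double(const Layout&)>;

Layout greedy_grouping(int num_attributes, const CostFn& cost) {
    // Start fully decomposed: one group per attribute.
    Layout layout;
    for (int a = 0; a < num_attributes; ++a) layout.push_back({a});

    bool improved = true;
    while (improved && layout.size() > 1) {
        improved = false;
        double best_cost = cost(layout);
        std::size_t best_i = 0, best_j = 0;

        // Evaluate every pairwise merge of two groups.
        for (std::size_t i = 0; i < layout.size(); ++i) {
            for (std::size_t j = i + 1; j < layout.size(); ++j) {
                Layout candidate = layout;
                candidate[i].insert(candidate[i].end(),
                                    candidate[j].begin(), candidate[j].end());
                candidate.erase(candidate.begin() +
                                static_cast<std::ptrdiff_t>(j));
                double c = cost(candidate);
                if (c < best_cost) {
                    best_cost = c; best_i = i; best_j = j; improved = true;
                }
            }
        }
        // Apply the single best merge, if any, and repeat.
        if (improved) {
            layout[best_i].insert(layout[best_i].end(),
                                  layout[best_j].begin(),
                                  layout[best_j].end());
            layout.erase(layout.begin() +
                         static_cast<std::ptrdiff_t>(best_j));
        }
    }
    return layout;
}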
The remainder of this paper is organized as follows:
Section 2 presents an example motivating the need for Data
Morphing. Section 3 provides definitions for the common
terms used in the subsequent sections. In Section 4, we
present the Data Morphing technique. A detailed experi-
mental evaluation is presented in Section 5. Related work
is discussed in Section 6, and Section 7 contains our con-
clusions.
2 Motivating Example
As described in the introduction, neither the N-ary storage
model nor the PAX storage model provides an optimal data
layout in all cases. To illustrate this, we use the relation
and query shown in Figure 1 in the following example.
Client (id: Integer, priority: Integer, name: Varchar(32),
        usage: Real, address: Varchar(32),
        location: Integer) key (id);
SELECT location, usage
FROM Client WHERE priority < 12
Figure 1: Client Relation and Query
In Figure 1, the Client relation consists of six at-
tributes of types Integer, Real, and Varchar. The Integer and
Real data types are each four bytes in size. The Varchar(32)
data type is a variable length string with a maximum length
of 32 bytes. The selectivity of the predicate on priority
is 12.5%, and the processor cache line size is 32 bytes. Fig-
ure 2 shows a single record in the Client relation as it is
stored on a page using the N-ary storage model; as in typ-
ical implementations, the variable-length strings are stored
at the end of the fixed-length attributes. To visualize the at-
tributes that are read into the processor cache as the result
of a cache miss, the width of the diagram is shown to be the
same size as a cache line.
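Assuming, purely for illustration, that each relocated Varchar is referenced by a 4-byte placeholder (Figure 2 shows the actual layout), the fixed-length prefix of one N-ary Client record can be pictured as follows.

#include <cstdint>

// Fixed-length prefix of one Client record under the N-ary model.
// The 4-byte placeholders for name and address are assumptions made
// for illustration; the relocated string bytes follow the prefix.
struct ClientFixedPart {
    std::int32_t  id;                   // 4 bytes
    std::int32_t  priority;             // 4 bytes
    std::uint32_t name_placeholder;     // assumed 4-byte reference
    float         usage;                // 4 bytes
    std::uint32_t address_placeholder;  // assumed 4-byte reference
    std::int32_t  location;             // 4 bytes
};

// 24 bytes in total, so the prefix fits within one 32-byte cache line.
static_assert(sizeof(ClientFixedPart) <= 32,
              "fixed-length prefix fits in a 32-byte cache line");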
Assuming that a sequential file scan is used to answer
this query, the records are retrieved as follows. First, the
priority attribute is requested from memory. This
memory access incurs a cache miss to read the contigu-
ous block of memory containing the attribute into the pro-
cessor cache. If the value of the priority attribute is
less than 12, the location and usage attributes are
accessed. Because the location and usage attributes are
contained in the processor cache, retrieving them incurs no
additional cache misses. The select operator then continues
with the next record, until all of the records have been read.
The cost of answering this query totals approximately one
cache miss per record.
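The access pattern just described can be sketched as a scan loop over the hypothetical ClientFixedPart layout from the previous sketch; the point is that evaluating the predicate and then projecting location and usage touch the same cache line.

#include <cstdint>
#include <vector>

struct QueryResult {
    std::int32_t location;
    float        usage;
};

// Sequential scan for: SELECT location, usage FROM Client
//                      WHERE priority < 12.
// Reading r.priority pulls the record's fixed-length prefix into the
// cache (roughly one miss per record); the later reads of r.location
// and r.usage for qualifying records hit in that same cache line.
std::vector<QueryResult> scan_clients(
        const std::vector<ClientFixedPart>& page) {
    std::vector<QueryResult> results;
    for (const ClientFixedPart& r : page) {
        if (r.priority < 12) {               // predicate: ~12.5% qualify
            results.push_back({r.location, r.usage});
        }
    }
    return results;
}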
The PAX storage model vertically decomposes the
records into zones, called mini-pages, as shown in Fig-
ure 3; for simplicity, we do not show all of the details.
In the figure, the attributes are presented as belonging