
Abstract:Vertical data partitioning technology logically stores database table attributes satisfying certain semantic conditions in the same
physical block, so as to reduce the cost of data accessing and improve the efficiency of query processing. Every query is usually related
only to the table’s some attributes in the database, so a subset of the table’s attributes can be used to get the accurate query results.
Reasonable vertical data partitioning can make most queries answered without scanning the whole table, so as to reduce the amount of
data accessing and improve the efficiency of query processing. Traditional database vertical partitioning methods are mainly based on
heuristic rulesset by experts. The granularity of partitioning is coarse, and it can not provide different partition optimizations according to
the characteristics of workload. Besides, when the scale of workload or the number of attributes becomes large, the execution time of the
existing methods are too long to meet the performance requirements of online real-time tuning of database. Therefore, a method called
spectral clustering based vertical partitioning (SCVP) isproposed for the online environment. The idea of phased solution is adapted to
reduce the time complexity of the algorithm and speed up partitioning. Firstly, SCVP reduces the solution space by increasing the
constraint conditions, that is, generating initial partitions by spectral clustering. Secondly, SCVP designs the algorithm to search solution
space, that is, the initial partitions are optimized by combining frequent itemset mining and greedy search. In order to further improve the
performanceof SCVP under high-dimensional attributes, a new methodcalled special clusteringbased vertical partitioningredesign (SCVP-
R) is proposed which is an improved version of SCVP. SCVP-R optimizes the partitions combiner component of SCVP by introducing
sympatric-competition mechanism, double-elimination mechanism,and loopmechanism.The experimental results on differentdatasetsshow
thatSCVPandSCVP-Rhavefasterexecutiontimeandbetterperformancethanthecurrentstate-of-the-artverticalpartitioningmethod.
Key words:verticalpartitioning;spectralclustering;frequentitemsets;greedysearch;multistagedecision
P
′
=
{
P
1
, . . . , P
k
}
, k ⩽ n
垂直分区是数据库物理设计中的经典技术.它将一个表或多个表纵向切分成不相交的多个列组,每个列组拥
有相同数量的元组.其形式化定义为:给定关系表 的属性集合 、某时间段t内的查询工作负载
和一个成本函数 ,寻找一个最优的垂直分区方案 ,使得数据库的查询执
行成本最低,即:
例如,给定某关系表 和 上的两个查询:
:SELECT FROM ;
:SELECT FROM .
查询 和 需要分别在属性 、 和 、 、 上执行投影操作.未采取任何垂直分区策略时,执行 、
需要访问 的全部属性a
0
–a
4
;若将表R分为 和 两个分区,则 只需要访问分区
中的 、 属性,而 只需要访问分区 的 、 、 属性便可完成查询操作.可以看出,采用合适的垂直分
区策略可以减少数据访问的代价,提高查询处理的性能.
n
k
=
n − 1
k − 1
+ k ·
n − 1
k
求解最优垂直分区方案的最直接方法是暴力枚举法,即枚举所有可能的分区方案,并为每个方案计算执行所
有工作负载的预期访问成本,选择成本最小的方案.Bell数是指使用暴力枚举的方式求出垂直分区所有可能组合
数.对于含有 个属性关系表,其对应的Bell数 能够通过公式 得到,其中 ,
, .以TPC-H数据集中含有8个属性的customer表为例,其垂直分区的可
能方案共有 个.
暴力枚举法的时间复杂度为 ,显然非常低效.为此,人们设计了多种高效的垂直分区方法.目前垂直分区
领域性能最好的算法是HillsClimb
[1]
.它将单列作为初始分区方案,每次选择将使总I/O成本降低最多的两个分区
合并,在此基础上不断迭代,直到I/O成本不再减少.HillsClimb算法通过自底向上的迭代过程寻找较优解,每次迭
代,都在上一轮迭代的方案下进行决策,虽然不能保证全局下的最优解,但避免了暴力枚举的复杂性.HillsClimb
的时间复杂度为 ,其中 为表的属性数, 为查询负载规模, 为迭代次数.它在低维属性表和轻量负载下
表现优异,但一旦属性维度或负载规模提高时,其算法执行时间便呈指数级和高斜率线性增长,经实验验证,面对
200维属性的表、2000条查询规模的工作负载,其平均执行时间高达2901s.
在线分区系统
[2−5]
是一种根据工作负载特征变化的显著程度,决定是否调用分区算法进行在线的分区部署,它
的目标是保证数据库系统在各个时间段都能采用合适的分区策略,保持最佳的性能,因此它对分区算法的执行时
刘鹏举等:基于谱聚类的在线数据库垂直分区多阶段生成方法 2805
评论