基于谱聚类的在线数据库垂直分区多阶段生成方法_软件学报2023_华为.pdf

迹部景吾

29页

3次

2024-08-23

免费下载



基于谱聚类的在线数据库垂直分区多阶段生成方法

刘鹏举

1,2

,李好洋

1,2

,王天一

1,2

,刘欢

1,2

,孙路明

1,2

,任逸飞

,李翠平

1,2

,陈红

1,2

(数据工程与知识工程教育部重点实验室(中国人民大学),北京100872)

(中国人民大学信息学院,北京100872)

(华为云数据库创新Lab,广东深圳518100)

通信作者:李翠平,E-mail:licuiping@ruc.edu.cn

摘　要:垂直数据分区技术从逻辑上将满足一定语义条件的数据库表属性存放在同一个物理块中,进而降低数据

访问成本,提高查询效率.数据库查询负载中的每条查询通常只与数据库表中的部分属性有关,因此只需使用数据

库表的某个属性子集便可以得到准确的查询结果.合理的垂直数据分区方式可以使大多数查询负载不需要扫描完

整数据库就可以完成查询任务,从而达到减少数据访问量,提高查询处理效率的目的.传统的数据库垂直分区方法

主要基于专家设置的启发式规则,分区策略粒度较粗,且不能根据负载的特征进行有针对性的分区优化.同时,当

负载规模较大或者属性个数较多时,现有垂直分区方法执行时间过长,尤其无法满足数据库在线实时调优的性能

需求.为此,提出在线环境下基于谱聚类的垂直数据分区方法(spectralclusteringbasedverticalpartitioning,SCVP),

采用分阶段求解的思想,减少算法时间复杂度,加快分区执行速度.首先通过增加约束条件缩小解空间(即根据谱

聚类生成初始分区),然后对解空间设计算法进行精细的搜索(即采用频繁项集和贪心搜索相结合的策略对初始分

区进行优化).为了进一步提升SCVP在高维属性下的性能,提出了SCVP的改进版本SCVP-R(spectralclustering

basedverticalpartitioningredesign).SCVP-R通过引入同域竞争机制、双败淘汰机制和循环机制,对SCVP在分区

优化过程中的合并方案进行了进一步优化.在不同数据集上的实验结果表明,相比于目前最好的垂直分区方法,

SCVP和SCVP-R有着更快的执行时间和更好的性能表现.

关键词:垂直分区;谱聚类;频繁项集;贪心搜索;多阶段决策

中图法分类号:TP311

中文引用格式:刘鹏举,李好洋,王天一,刘欢,孙路明,任逸飞,李翠平,陈红.基于谱聚类的在线数据库垂直分区多阶段生成方

法.软件学报,2023,34(6):2804–2832.http://www.jos.org.cn/1000-9825/6496.htm

英文引用格式:LiuPJ,LiHY,WangTY,LiuH,SunLM,ShenYF,LiCP,ChenH.Multi-stageMethodforOnlineVerticalData

PartitioningBasedonSpectralClustering.RuanJianXueBao/JournalofSoftware,2023,34(6):2804–2832(inChinese).http://www.

jos.org.cn/1000-9825/6496.htm

Multi-stage Method for Online Vertical Data Partitioning Based on Spectral Clustering

LIU Peng-Ju

1,2

, LI Hao-Yang

1,2

, WANG Tian-Yi

1,2

, LIU Huan

1,2

, SUN Lu-Ming

1,2

, REN Yi-Fei

, LI Cui-Ping

1,2

CHENHong

1,2

(Key Laboratory of Data Engineering and Knowledge Engineering of the Ministry of Education (Renmin University of China), Beijing

100872,China)

(SchoolofInformation,RenminUniversityofChina,Beijing100872,China)

(HuaweiCloudDatabaseInnovationLab,Shenzhen518100,China)



*

基金项目:国家重点研发计划(2018YFB1004401);国家自然科学基金(61772537,61772536,62072460,62076245,62172424)

收稿时间:2021-06-25;修改时间:2021-08-21;采用时间:2021-09-25;jos在线出版时间:2022-11-16

CNKI网络首发时间:2022-11-18



软件学报ISSN1000-9825,CODENRUXUEW E-mail:jos@iscas.ac.cn

Journal of Software,2023,34(6):2804−2832[doi:10.13328/j.cnki.jos.006496] http://www.jos.org.cn

Abstract:Vertical data partitioning technology logically stores database table attributes satisfying certain semantic conditions in the same

physical block, so as to reduce the cost of data accessing and improve the efficiency of query processing. Every query is usually related

only to the table’s some attributes in the database, so a subset of the table’s attributes can be used to get the accurate query results.

Reasonable vertical data partitioning can make most queries answered without scanning the whole table, so as to reduce the amount of

data accessing and improve the efficiency of query processing. Traditional database vertical partitioning methods are mainly based on

heuristic rulesset by experts. The granularity of partitioning is coarse, and it can not provide different partition optimizations according to

the characteristics of workload. Besides, when the scale of workload or the number of attributes becomes large, the execution time of the

existing methods are too long to meet the performance requirements of online real-time tuning of database. Therefore, a method called

spectral clustering based vertical partitioning (SCVP) isproposed for the online environment. The idea of phased solution is adapted to

reduce the time complexity of the algorithm and speed up partitioning. Firstly, SCVP reduces the solution space by increasing the

constraint conditions, that is, generating initial partitions by spectral clustering. Secondly, SCVP designs the algorithm to search solution

space, that is, the initial partitions are optimized by combining frequent itemset mining and greedy search. In order to further improve the

performanceof SCVP under high-dimensional attributes, a new methodcalled special clusteringbased vertical partitioningredesign (SCVP-

R) is proposed which is an improved version of SCVP. SCVP-R optimizes the partitions combiner component of SCVP by introducing

sympatric-competition mechanism, double-elimination mechanism,and loopmechanism.The experimental results on differentdatasetsshow

thatSCVPandSCVP-Rhavefasterexecutiontimeandbetterperformancethanthecurrentstate-of-the-artverticalpartitioningmethod.

Key words:verticalpartitioning;spectralclustering;frequentitemsets;greedysearch;multistagedecision



A =

{

, . . . , a

}

{

, . . . , q

}

′

{

, . . . , P

}

, k ⩽ n

垂直分区是数据库物理设计中的经典技术.它将一个表或多个表纵向切分成不相交的多个列组,每个列组拥

有相同数量的元组.其形式化定义为:给定关系表 的属性集合 、某时间段t内的查询工作负载

和一个成本函数 ,寻找一个最优的垂直分区方案 ,使得数据库的查询执

行成本最低,即:

′

= argmin

(

, P

)

(1)

(

, a

)

例如,给定某关系表 和 上的两个查询:

, a

 :SELECT FROM ;

, a

 :SELECT FROM .

(

, a

)

(

, a

)

查询 和 需要分别在属性 、 和 、 、 上执行投影操作.未采取任何垂直分区策略时,执行 、

需要访问 的全部属性a

–a

;若将表R分为 和 两个分区,则 只需要访问分区

中的 、 属性,而 只需要访问分区 的 、 、 属性便可完成查询操作.可以看出,采用合适的垂直分

区策略可以减少数据访问的代价,提高查询处理的性能.

n+1



k=0













= 1







n − 1

k − 1



+ k ·



n − 1



= 1

= 4140

求解最优垂直分区方案的最直接方法是暴力枚举法,即枚举所有可能的分区方案,并为每个方案计算执行所

有工作负载的预期访问成本,选择成本最小的方案.Bell数是指使用暴力枚举的方式求出垂直分区所有可能组合

数.对于含有 个属性关系表,其对应的Bell数 能够通过公式 得到,其中 ,

, .以TPC-H数据集中含有8个属性的customer表为例,其垂直分区的可

能方案共有 个.

(

)





暴力枚举法的时间复杂度为 ,显然非常低效.为此,人们设计了多种高效的垂直分区方法.目前垂直分区

领域性能最好的算法是HillsClimb

[1]

.它将单列作为初始分区方案,每次选择将使总I/O成本降低最多的两个分区

合并,在此基础上不断迭代,直到I/O成本不再减少.HillsClimb算法通过自底向上的迭代过程寻找较优解,每次迭

代,都在上一轮迭代的方案下进行决策,虽然不能保证全局下的最优解,但避免了暴力枚举的复杂性.HillsClimb

的时间复杂度为 ,其中 为表的属性数, 为查询负载规模, 为迭代次数.它在低维属性表和轻量负载下

表现优异,但一旦属性维度或负载规模提高时,其算法执行时间便呈指数级和高斜率线性增长,经实验验证,面对

200维属性的表、2000条查询规模的工作负载,其平均执行时间高达2901s.

在线分区系统

[2−5]

是一种根据工作负载特征变化的显著程度,决定是否调用分区算法进行在线的分区部署,它

的目标是保证数据库系统在各个时间段都能采用合适的分区策略,保持最佳的性能,因此它对分区算法的执行时

刘鹏举等:基于谱聚类的在线数据库垂直分区多阶段生成方法 2805

of 29

免费下载

关注

评论