In-Database Time Series Clustering.pdf

Apache IoTDB

26页

0次

2025-02-25

100墨值下载

In-Database Time Series Clustering

YUNXIANG SU, Tsinghua University, China

KENNY YE LIANG, Tsinghua University, China

SHAOXU SONG

∗

, BNRist, Tsinghua University , China

Time series data are often clustered repeatedly across various time ranges to mine frequent subsequence

patterns from dierent periods, which could further support downstream applications. Existing state-of-the-art

(SOTA) time series clustering method, such as K-Shape, can prociently cluster time series data referring

to their shapes. However, in-database time series clustering problem has been neglected, especially in IoT

scenarios with large-volume data and high eciency demands. Most time series databases employ LSM-Tree

based storage to support intensive writings, yet causing underlying data points out-of-order in timestamps.

Therefore, to apply existing out-of-database methods, all data points must be fully loaded into memory and

chronologically sorted. Additionally, out-of-database methods must cluster from scratch each time, making

them inecient when handling queries across dierent time ranges. In this work, we propose an in-database

adaptation of SOTA time series clustering method K-Shape. Moreover, to solve the problem that K-Shape

cannot eciently handle long time series, we propose Medoid-Shape, as well as its in-database adaptation

for further acceleration. Extensive experiments are conducted to demonstrate the higher eciency of our

proposals, with comparable eectiveness. Remarkably, all proposals have already been implemented in an

open-source commodity time series database, Apache IoTDB.

CCS Concepts: • Information systems

→

Database query processing; • Computing methodologies

→

Machine learning.

Additional Key Words and Phrases: time series clustering, database query processing

ACM Reference Format:

Yunxiang Su, Kenny Ye Liang, and Shaoxu Song. 2025. In-Database Time Series Clustering. Proc. ACM Manag.

Data 3, 1 (SIGMOD), Article 46 (February 2025), 26 pages. https://doi.org/10.1145/3709696

1 Introduction

Time series clustering is of great importance for analysis. For example, time series clustering

could assist pattern mining of daily stock prices in nance [

], serve anomalous subsequence

detection for yearly climate analysis in meteorology [

], facilitate the analysis of the characteristics

associated with sleep apnea [

] and so on. The state-of-the-art (SOTA) time series clustering

method K-Shape [

] can prociently cluster time series by shapes and achieve signicantly

better accuracy than other existing time series clustering methods.

However, K-Shape unfortunately faces challenges when meeting IoT scenarios, where extensive

time series data stored in databases pose serious challenges for time series clustering. On the one

hand, the arrival of IoT data is often out-of-order, due to transmission issues or sensor failures [

Most commodity time series databases employ Log Structured Merged Tree (LSM-Tree) [

] to

∗

Shaoxu Song (https://sxsong.github.io/) is the corresponding author.

Authors’ Contact Information: Yunxiang Su, Tsinghua University, China, suyx21@mails.tsinghua.edu.cn; Kenny Ye Liang,

Tsinghua University, China, liangy24@mails.tsinghua.edu.cn; Shaoxu Song, BNRist, Tsinghua University , China, sxsong@

tsinghua.edu.cn.

This work is licensed under a Creative Commons Attribution International 4.0 License.

ACM 2836-6573/2025/2-ART46

https://doi.org/10.1145/3709696

Proc. ACM Manag. Data, Vol. 3, No. 1 (SIGMOD), Article 46. Publication date: February 2025.

46:2 Yunxiang Su, Kenny Ye Liang, and Shaoxu Song

!"#$%&'(

'#)

*+,'%#(

,#-.&$/

0#1.

2&3#

!"#

$!"#%

45&'#(6&7#)8

*#-.$+9)7 "

*:$$#-.8;33$#3&.9+-

()&

$()&%

$%&

$()&%

;)<&=#-.

$()&%

# $ # % &

;%%82&3#78

>$&"#$7#)

!"#$%&&'(&)*+,-$,.$/01*+21($3)&(4 !5#$6+-)1$7(401*

?')&.#)8;33$#3&.9+-

()&

))'

()&

))'

!8#$/+551($7(401*

!)#$9+:($;('+(4$4*,'(5$+-$/01*+21($3)&(4$

*+,'%#.#84#@

A-=+,'%#.#84#@

0:,B#$

>9,#

!"#$%&''#)

*+,'%#,#-.&$/

C&9%#)84#@

2+779B%#8>9,#8D&-3#7

Fig. 1. The workflow of in-database K-Shape

handle such out-of-order arrivals, where data points are batched into les with possibly overlapped

time intervals on disk. Therefore, applying the existing K-Shape algorithm typically involves loading

all the data outside databases and sorting chronologically. Unfortunately, it is obviously inecient

in terms of I/O and incurs extra time.

On the other hand, time series clustering often needs to be performed multiple times over

dierent time ranges in IoT databases to discover various patterns across dierent time periods. For

instance, a forging machine in a steel processing factory may handle dierent types of steel bars

during various time periods, depending on the production orders. When processing each type of

steel bar, the machine may exhibit several working patterns inuenced by its operating conditions

and the environment. To identify all machine working patterns across dierent product types,

analysts need to cluster subsequences of time series data over various time intervals to achieve a

comprehensive clustering result. Apparently, loading all related data out of databases and clustering

from scratch each time is extremely time-consuming.

1.1 Challenges

Considering such IoT scenarios with time series databases, the challenges of in-database time series

clustering are as follows.

(1) Time series are often stored out-of-order in LSM-Tree databases owing to delayed arrivals,

while existing clustering methods require data in chronological order. It incurs extra pre-processing

time overheads to sort data chronologically. Therefore, we propose in-database adaptations for

existing clustering method K-Shape, which can handle the out-of-order issues in LSM-Tree.

(2) Clustering may need to be performed repeatedly with dierent time lters for dierent tasks.

K-Shape requires clustering from scratch each time, making it inecient. To support frequent

clustering queries, we propose in-database K-Shape with pre-computation.

(3) K-Shape is also inecient in handling long subsequences due its high complexity with respect

to the subsequence length. Therefore, we propose Medoid-Shape and in-database Medoid-Shape

for acceleration, by leveraging approximate clustering and avoiding time-consuming eigenvector

decomposition.

To tackle these challenges, the paper mainly focuses on how to eciently cluster time series in

databases by leveraging database properties. We primarily concentrate on achieving high eciency

for in-database time series clustering. The following example illustrates the challenges of LSM-Tree

based database storage.

Example 1. Figure 1(a) presents a time series stored in an LSM-Tree based time series database,

recording the solar radiation intensity in a wind farm, where the x-axis denotes time and the y-axis

denotes page number. Data points are batched into 4 pages

𝑃

, 𝑃

on disk, each denoted by a gray

rectangle. The higher the page number is, the later the data points arrive in the database. The dotted

Proc. ACM Manag. Data, Vol. 3, No. 1 (SIGMOD), Article 46. Publication date: February 2025.

of 26

100墨值下载

关注

评论