ICDE 2024: GaussML: An End-to-End In-database Machine Learning System (Huawei)
GaussML: An End-to-End In-Database Machine Learning System

Guoliang Li, Ji Sun, Lijie Xu§, Shifu Li, Jiang Wang, Wen Nie
Tsinghua University; Huawei Company; §ETH Zürich
liguoliang@tsinghua.edu.cn, {sunji11,niewen2,lishifu,wangjiang16}@huawei.com, lijie.xu@inf.ethz.ch
Abstract—In-database machine learning (in-DB ML) is appealing to database users with security and privacy concerns, as it avoids copying data out of the database to a separate machine learning system. The common way to implement in-DB ML is the ML-as-UDF approach, which uses User-Defined Functions (UDFs) within SQL to implement ML training and prediction. However, UDFs may introduce security risks through vulnerable code, and they suffer from performance problems because they are constrained by the data access and execution patterns of SQL query operators. To address these limitations, we propose a new in-database machine learning system, GaussML, which provides end-to-end machine learning ability with a native SQL interface. To support ML training and inference within SQL queries, GaussML directly integrates typical ML operators into the query engine without UDFs. GaussML also introduces an ML-aware cardinality and cost estimator to optimize the SQL+ML query plan. Moreover, GaussML leverages Single Instruction Multiple Data (SIMD) and data prefetching techniques to accelerate the ML operators for training. We have implemented a series of algorithms inside GaussML in the openGauss database. Compared to state-of-the-art in-DB ML systems such as Apache MADlib, GaussML achieves a 2-6× speed-up in extensive experiments.
I. INTRODUCTION
Machine learning (ML) is now widely used for data analysis tasks. Researchers and engineers invest substantial effort into designing user-friendly machine learning interfaces [1], [2], constructing end-to-end machine learning pipelines [3]–[6], accelerating model training [7]–[9], managing training data effectively [10]–[12], and developing end-to-end machine learning systems [13]–[17]. However, most of these systems provide a Python interface that requires extracting the training data from data storage systems, typically databases.
Nowadays, relational databases like openGauss [18] are widely used in large commercial businesses, including government clouds and financial services. For these businesses, the data stored in databases is one of their most valuable assets, so copying data out of the database for ML training creates a high risk of data leakage. Even within the same company or organization, the data infrastructure department and the application departments are separated, and data access is strictly controlled for security reasons. If an application team (e.g., the AI team) wants to train models, all it can obtain is a small, stale sample of the data. To train on the fresh, full dataset, a promising solution is in-database machine learning, where the database natively supports ML training and inference. In this way, customers can use the SQL interface for end-to-end ML tasks, which makes it easier to query data and perform ML training than writing complex Python programs.
In-database machine learning has been studied for many years [19], [20]. The most common approach is ML-as-UDF, which uses User-Defined Functions (UDFs) within SQL to implement the computation of model training and prediction. For instance, the state-of-the-art in-DB ML tool, Apache MADlib, leverages PostgreSQL's UDFs to execute Stochastic Gradient Descent (SGD), while relying on additional Python drivers to handle ML iterations. However, UDF-based approaches suffer from two problems. (1) Security risks: UDFs can contain vulnerable code that leaks data read from the database to the outside. (2) Efficiency limitations: UDF-based solutions are constrained by the data access and execution patterns of SQL query operators. As a result, they currently support only standard SGD and lack support for more efficient methods like mini-batch SGD, because mini-batch access would be inefficient under the naive tuple-at-a-time data scan of SQL. Furthermore, UDF-based solutions cannot be comprehensively optimized in conjunction with the query plan, because they cannot participate in the plan generation process.
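To make the mini-batch limitation concrete, here is a minimal NumPy sketch (an illustration, not GaussML's or MADlib's actual code) contrasting tuple-at-a-time SGD, which mirrors the row-by-row access pattern a UDF receives from a SQL scan, with mini-batch SGD that processes blocks of rows in one vectorized step:

```python
import numpy as np

def sgd_per_tuple(X, y, lr=0.1, epochs=5):
    """Standard SGD: one gradient step per tuple, mirroring the
    row-at-a-time access pattern a UDF gets from a SQL scan."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w -= lr * (xi @ w - yi) * xi        # squared-loss gradient, one row
    return w

def sgd_mini_batch(X, y, lr=0.1, epochs=5, batch=64):
    """Mini-batch SGD: one vectorized gradient step per block of rows."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for s in range(0, len(X), batch):
            Xb, yb = X[s:s + batch], y[s:s + batch]
            w -= lr * Xb.T @ (Xb @ w - yb) / len(Xb)   # averaged over the batch
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.01, size=1000)
w = sgd_mini_batch(X, y)
```

Both variants reach similar accuracy on this toy linear-regression task, but the mini-batch variant performs one vectorized update per block of rows rather than one update per row, which is exactly the access pattern a naive SQL scan cannot provide to a UDF.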
To address these problems, we propose GaussML, a fully in-database machine learning system whose components are seamlessly integrated with the native database kernel. It has three advantages. First, GaussML designs native executors to avoid data transfer, which reduces the risk of data leakage and accelerates data analysis, without introducing vulnerable code like UDFs. Second, GaussML can co-optimize the traditional query execution path and the machine learning operators, which further improves the performance of end-to-end machine learning. Third, GaussML contains ML-operator-specific optimizations such as SIMD and data prefetching, further improving performance.
In summary, we make the following contributions.
(1) We propose a new in-database machine learning system, GaussML, which supports efficient end-to-end machine learning using SQL queries. We seamlessly integrate GaussML into the open-source database openGauss¹.
(2) We design an ML-aware cardinality and cost estimator in GaussML, which extends the database optimizer to support complex SQL queries with machine learning (see Section III).
¹ https://gitee.com/opengauss-db4ai/openGauss-server
(3) We summarize the common computation patterns of widely used ML algorithms and develop native operators to accelerate model training; each operator can be organized as a node in the database's native query plan tree. The operators are accelerated with SIMD and data prefetching techniques, and they support distributed computing (see Section IV).
(4) We conduct extensive experiments comparing GaussML with state-of-the-art machine learning systems, including MADlib and ML-A (an ML engine implemented with a popular Python library); the results show that our system outperforms existing methods by 2-6× (see Section V).
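As an illustration of contribution (3), the following Python sketch (hypothetical and much simplified relative to the C implementation inside openGauss) shows how a gradient descent operator can be organized as a node in a plan tree, pulling batches from a scan child in Volcano style so that each update is a vectorized, SIMD-friendly computation:

```python
import numpy as np

class Scan:
    """Leaf plan node: yields fixed-size batches of rows from a table."""
    def __init__(self, table, batch=64):
        self.table, self.batch = table, batch
    def batches(self):
        for s in range(0, len(self.table), self.batch):
            yield self.table[s:s + self.batch]

class GradientDescent:
    """ML plan node: pulls batches from its child and updates the
    model weights with one vectorized gradient step per batch."""
    def __init__(self, child, dim, lr=0.1, epochs=10):
        self.child, self.lr, self.epochs = child, lr, epochs
        self.w = np.zeros(dim)
    def execute(self):
        for _ in range(self.epochs):
            for batch in self.child.batches():
                X, y = batch[:, :-1], batch[:, -1]        # last column = label
                self.w -= self.lr * X.T @ (X @ self.w - y) / len(batch)
        return self.w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
table = np.column_stack([X, X @ w_true])                  # noise-free labels
w = GradientDescent(Scan(table), dim=3).execute()
```

Because the training operator is an ordinary plan node, the optimizer can place it above any scan, join, or projection, which is the property the UDF-based designs lack.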
II. SYSTEM OVERVIEW
We first define the end-to-end machine learning (ML) process in Section II-A, and then introduce the architecture of GaussML from a database perspective, showing how ML is seamlessly integrated into a relational database, in Section II-B. Finally, we introduce the native MLSQL grammar for using GaussML in Section II-C.
A. End-to-end Machine Learning
Given a relational database D with tables {t1, t2, t3, ..., tn} and a complicated data-analytics problem P that involves both database operators and ML algorithms, an end-to-end ML pipeline for solving the problem proceeds as follows. (i) Feature engineering: creating a training view V from D that includes the transformed features relevant to P; (ii) Model training: training a model on view V for the target of P; (iii) Model inference: selecting the proper model and using it to predict labels for given tuples; (iv) Data analysis: fetching and analyzing the tuples that satisfy constraints on the model's prediction results. We next present three typical scenarios to illustrate the benefits of conducting end-to-end machine learning inside the database.
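The four pipeline steps can be sketched in plain NumPy as follows (toy data and a logistic-regression model chosen purely for illustration; in GaussML each step would instead be expressed in SQL):

```python
import numpy as np

# Toy "table": 500 rows of (feature_a, feature_b), with a derived binary label.
rng = np.random.default_rng(2)
raw = np.column_stack([rng.normal(5, 2, 500), rng.normal(0, 1, 500)])
label = (raw[:, 0] + 3 * raw[:, 1] > 5).astype(float)

# (i) Feature engineering: build a "training view" of normalized features.
view = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# (ii) Model training: logistic regression via batch gradient descent.
w, b = np.zeros(view.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(view @ w + b)))
    w -= 0.1 * view.T @ (p - label) / len(view)
    b -= 0.1 * float(np.mean(p - label))

# (iii) Model inference: predict labels for the same (or new) tuples.
pred = (1.0 / (1.0 + np.exp(-(view @ w + b))) > 0.5).astype(float)

# (iv) Data analysis: fetch tuples whose predicted label satisfies a constraint.
positives = raw[pred == 1.0]
accuracy = float(np.mean(pred == label))
```

Steps (i) and (iv) are precisely the operations a database already does well (projection, aggregation, selection), which is why pushing the whole pipeline into the engine pays off.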
Scenario 1: Data analysts spend a lot of effort preprocessing the dataset for easier training and better representation (i.e., feature engineering). This process frequently uses both data manipulation operations (e.g., join, projection, aggregation) and data preprocessing operations (e.g., normalization). The former can be optimized by the database optimizer, and the latter can exploit the database's native data statistics. In contrast, these operations can consume a large amount of time (sometimes more than model training itself) in other ML systems. Moreover, because the whole process is conducted inside the database, data access is controlled by the database's permission subsystem, so GaussML is more secure and trustworthy when training on sensitive data in core business.
Similarly, in the inference phase, users can directly obtain the predicted targets from the database without touching the model or the original data. This avoids data transfer overhead and guarantees the security of the original data.
Scenario 2: Real users commonly perform data clustering in which the features are split across different relations, so they must assemble a result containing all features before training. For example, when users train K-means, they can write a SQL statement that creates the model from a subquery joining all the tables. In GaussML, if the distance metric is L1 or L2, we can factorize the distance computation across the different tables and push it down to the scan nodes. For 1:n or n:n joins, GaussML significantly reduces the training overhead with such ML-DB co-optimization methods.
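The factorization rests on the decomposability of squared L2 distance: for a joined tuple (tA, tB) and a centroid split as (cA, cB), ||t - c||² = ||tA - cA||² + ||tB - cB||², so each A-side partial distance can be computed once per A-tuple and reused across all of its join partners. A minimal sketch with hypothetical data (not GaussML's implementation):

```python
import numpy as np

def l2_sq(u, v):
    d = np.asarray(u) - np.asarray(v)
    return float(d @ d)

# A 1:n join: each row of B references one row of A via a foreign key.
A = {1: np.array([0.0, 1.0]), 2: np.array([2.0, 2.0])}                  # id -> A-features
B = [(1, np.array([5.0])), (1, np.array([6.0])), (2, np.array([7.0]))]  # (fk, B-features)

# One centroid, split into the dimensions coming from A and from B.
c_A, c_B = np.array([1.0, 1.0]), np.array([5.5])

# Naive: materialize the join, then compute the full distance per joined tuple.
naive = [l2_sq(np.concatenate([A[fk], fb]), np.concatenate([c_A, c_B]))
         for fk, fb in B]

# Factorized: each A-side partial distance is computed once and reused
# across all join partners; only the B-side partial is computed per B row.
partial_A = {k: l2_sq(v, c_A) for k, v in A.items()}
factorized = [partial_A[fk] + l2_sq(fb, c_B) for fk, fb in B]

assert np.allclose(naive, factorized)
```

For a 1:n join the naive plan recomputes the A-side distance n times per A-tuple; the factorized plan computes it once, which is where the savings come from.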
Scenario 3: In this scenario, the model's prediction results are taken as constraints for selecting the desired tuples above the scan node. For example, to select patients with anxiety, we filter the scanned tuples using a well-trained anxiety model. Moreover, if we merge data from different data sources under such constraints, the database pushes down the predicates for better performance. As Figure 1 shows, we train three models on individual tables t1, t2 and t3, and then predict the labels on new tables t1, t2 and t3. In a traditional relational database, tables are usually joined on primary-foreign keys, taking the larger side as the outer table. GaussML, however, must also consider the prediction cost, because the cost differences between models cannot be neglected. During this process, GaussML can choose the optimal execution path using its model-aware cost estimator and plan selector. Moreover, GaussML computes ML-based constraints on the fly to avoid schema changes and storage overhead, and to reduce the risk of information leakage.
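A simplified illustration of why prediction cost matters for ordering (hypothetical per-model costs and selectivities; GaussML's actual cost model also covers join paths): with several model-based predicates, a tuple reaches a later predicate only if it passed all earlier ones, and for independent predicates, sorting by cost / (1 - selectivity) minimizes the expected per-tuple cost:

```python
from itertools import permutations

# Hypothetical per-model stats: cost of one prediction, and the selectivity
# of the predicate "prediction satisfies the constraint".
filters = {
    "model_t1": {"cost": 5.0, "selectivity": 0.1},
    "model_t2": {"cost": 1.0, "selectivity": 0.8},
    "model_t3": {"cost": 20.0, "selectivity": 0.5},
}

def expected_cost(order):
    """Per-tuple cost of applying model predicates in this order: a tuple
    reaches predicate i only if it passed all earlier predicates."""
    cost, pass_prob = 0.0, 1.0
    for name in order:
        cost += pass_prob * filters[name]["cost"]
        pass_prob *= filters[name]["selectivity"]
    return cost

# Cost-based choice: exhaustively search all orders (fine for a few models).
best = min(permutations(filters), key=expected_cost)

# Classic shortcut: sorting by rank = cost / (1 - selectivity) yields the
# same optimal order for independent predicates.
ranked = tuple(sorted(filters,
                      key=lambda n: filters[n]["cost"] / (1 - filters[n]["selectivity"])))
assert best == ranked
```

Note that the cheapest predicate is not always first: a cheap but unselective model can be worth deferring, which is exactly the trade-off a model-aware cost estimator captures.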
B. GaussML Architecture
As Figure 2 shows, GaussML is composed of five major components, which together offer full-fledged machine learning capability inside a relational database (openGauss in this paper).
MLSQL Parser. In this layer, GaussML extends SQL into MLSQL by seamlessly integrating machine learning operations into SQL. It supports model training with create model ... with ..., and model inference with the expression predict by .... The detailed usage is introduced in Section II-C. Note that GaussML supports the PBE (parse-bind-execute) protocol, a lazy parameter-binding approach to query execution, so SQL statements with the same template are parsed only once.
MLSQL Optimizer. In this layer, GaussML optimizes MLSQL to handle both Scenario 1 and Scenario 2. The GaussML optimizer not only uses a model-aware cost estimator to find the optimal data access path and model prediction order, but also performs interleaved optimizations between data access operators and model training operators. We also design a brand-new in-database cardinality estimation component customized for machine learning operations in GaussML. The details of the optimizer are presented in Section III.
MLSQL Executor. In this layer, we define a set of ML operators to support high-performance ML execution. GaussML supports over 20 popular machine learning algorithms, whose training executors are composed of four basic operators: a matrix computation operator, a statistics operator, a gradient descent operator, and a distance computation operator. To increase parallelism and reduce the latency of model training, we also develop parallel training