ICDE 2024: GaussML: An End-to-End In-database Machine Learning System (Huawei)
GaussML: An End-to-End In-Database Machine Learning System

Guoliang Li, Ji Sun, Lijie Xu§, Shifu Li, Jiang Wang, Wen Nie
Tsinghua University; Huawei Company; §ETH Zürich
liguoliang@tsinghua.edu.cn, {sunji11,niewen2,lishifu,wangjiang16}@huawei.com, lijie.xu@inf.ethz.ch
Abstract—In-database machine learning (in-DB ML) is appealing to database users with security and privacy concerns, as it avoids copying data out of the database to a separate machine learning system. The common way to implement in-DB ML is the ML-as-UDF approach, which uses User-Defined Functions (UDFs) within SQL to implement ML training and prediction. However, UDFs may introduce security risks through vulnerable code, and they suffer from performance problems because they are constrained by the data access and execution patterns of SQL query operators. To address these limitations, we propose a new in-database machine learning system, GaussML, which provides end-to-end machine learning ability with a native SQL interface. To support ML training and inference within SQL queries, GaussML directly integrates typical ML operators into the query engine without UDFs. GaussML also introduces an ML-aware cardinality and cost estimator to optimize the SQL+ML query plan. Moreover, GaussML leverages Single Instruction Multiple Data (SIMD) and data prefetching techniques to accelerate the ML operators for training. We have implemented a series of algorithms inside GaussML in the openGauss database. Compared to state-of-the-art in-DB ML systems such as Apache MADlib, GaussML achieves a 2-6× speed-up in extensive experiments.
I. INTRODUCTION
Machine learning (ML) is now widely used for data analysis tasks. Researchers and engineers invest substantial effort into designing user-friendly machine learning interfaces [1], [2], constructing end-to-end machine learning pipelines [3]–[6], accelerating model training [7]–[9], managing training data effectively [10]–[12], and developing end-to-end machine learning systems [13]–[17]. However, most of these systems provide a Python interface that requires extracting the training data from data storage systems, typically databases.
Nowadays, relational databases like openGauss [18] are widely used in large commercial businesses, including government clouds and financial services. For these businesses, the data stored in databases is one of their most valuable assets, so copying data out of the database for ML training creates a high risk of data leakage. Even within the same company or organization, the data infrastructure department and the application departments are separated, and data access is strictly controlled for security reasons. If an application team (e.g., the AI team) wants to train models, all it can obtain is a small, stale sample of the data. To train on the fresh, full dataset, a promising solution is in-database machine learning, where the database natively supports ML training and inference. In this way, customers can use the SQL interface for end-to-end ML tasks, which makes it easier to query data and perform ML training than writing complex Python programs.
In-database machine learning has been studied for many years [19], [20]. The most common approach is ML-as-UDF, which uses User-Defined Functions (UDFs) within SQL to implement the computation of model training and prediction. For instance, the state-of-the-art in-DB ML tool, Apache MADlib, leverages PostgreSQL's UDFs to execute Stochastic Gradient Descent (SGD), while relying on additional Python drivers to handle ML iterations. However, UDF-based approaches suffer from two problems. (1) Security risks: UDFs can contain vulnerable code that leaks data read from the database to the outside. (2) Efficiency limitations: UDF-based solutions are constrained by the data access and execution patterns of SQL query operators. As a result, they currently support only standard SGD and lack support for more efficient methods like mini-batch SGD, because mini-batch access would be inefficient under the naive tuple-at-a-time data scan of SQL. Furthermore, UDF-based solutions cannot be comprehensively optimized in conjunction with the query plan, because they cannot participate in the plan generation process.
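To make the mini-batch limitation concrete, here is a minimal NumPy sketch (an illustration, not GaussML's or MADlib's actual code) contrasting tuple-at-a-time SGD, which mirrors the row-by-row access pattern a UDF receives from a SQL scan, with mini-batch SGD that processes blocks of rows in one vectorized step:

```python
import numpy as np

def sgd_per_tuple(X, y, lr=0.1, epochs=5):
    """Standard SGD: one gradient step per tuple, mirroring the
    row-at-a-time access pattern a UDF gets from a SQL scan."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w -= lr * (xi @ w - yi) * xi        # squared-loss gradient, one row
    return w

def sgd_mini_batch(X, y, lr=0.1, epochs=5, batch=64):
    """Mini-batch SGD: one vectorized gradient step per block of rows."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for s in range(0, len(X), batch):
            Xb, yb = X[s:s + batch], y[s:s + batch]
            w -= lr * Xb.T @ (Xb @ w - yb) / len(Xb)   # averaged over the batch
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.01, size=1000)
w = sgd_mini_batch(X, y)
```

Both variants reach similar accuracy on this toy linear-regression task, but the mini-batch variant performs one vectorized update per block of rows rather than one update per row, which is exactly the access pattern a naive SQL scan cannot provide to a UDF.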
To address these problems, we propose GaussML, a fully in-database machine learning system whose components are seamlessly integrated with the native database kernel. It has three advantages. First, GaussML designs native executors to avoid data transfer, which reduces the risk of data leakage and accelerates data analysis, without introducing vulnerable code like UDFs. Second, GaussML can co-optimize the traditional query execution path and the machine learning operators, which further improves the performance of end-to-end machine learning. Third, GaussML contains ML-operator-specific optimizations such as SIMD and data prefetching, further improving performance.
In summary, we make the following contributions.
(1) We propose a new in-database machine learning system, GaussML, which supports efficient end-to-end machine learning using SQL queries. We seamlessly integrate GaussML into the open-source database openGauss¹.
(2) We design an ML-aware cardinality and cost estimator in GaussML, which extends the database optimizer to support complex SQL queries with machine learning (see Section III).
¹ https://gitee.com/opengauss-db4ai/openGauss-server
(3) We summarize the common computation patterns of widely used ML algorithms and develop native operators to accelerate model training; each operator can be organized as a node in the database's native query plan tree. The operators are accelerated with SIMD and data prefetching techniques, and they support distributed computing (see Section IV).
(4) We conduct extensive experiments comparing GaussML with state-of-the-art machine learning systems, including MADlib and ML-A (an ML engine implemented with a popular Python library); the results show that our system outperforms existing methods by 2-6× (see Section V).
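As an illustration of contribution (3), the following Python sketch (hypothetical and much simplified relative to the C implementation inside openGauss) shows how a gradient descent operator can be organized as a node in a plan tree, pulling batches from a scan child in Volcano style so that each update is a vectorized, SIMD-friendly computation:

```python
import numpy as np

class Scan:
    """Leaf plan node: yields fixed-size batches of rows from a table."""
    def __init__(self, table, batch=64):
        self.table, self.batch = table, batch
    def batches(self):
        for s in range(0, len(self.table), self.batch):
            yield self.table[s:s + self.batch]

class GradientDescent:
    """ML plan node: pulls batches from its child and updates the
    model weights with one vectorized gradient step per batch."""
    def __init__(self, child, dim, lr=0.1, epochs=10):
        self.child, self.lr, self.epochs = child, lr, epochs
        self.w = np.zeros(dim)
    def execute(self):
        for _ in range(self.epochs):
            for batch in self.child.batches():
                X, y = batch[:, :-1], batch[:, -1]        # last column = label
                self.w -= self.lr * X.T @ (X @ self.w - y) / len(batch)
        return self.w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
table = np.column_stack([X, X @ w_true])                  # noise-free labels
w = GradientDescent(Scan(table), dim=3).execute()
```

Because the training operator is an ordinary plan node, the optimizer can place it above any scan, join, or projection, which is the property the UDF-based designs lack.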
II. SYSTEM OVERVIEW
We first define the end-to-end machine learning (ML) process in Section II-A, and then introduce the architecture of GaussML from a database perspective, showing how ML is seamlessly integrated into a relational database, in Section II-B. Finally, we introduce the native MLSQL grammar for using GaussML in Section II-C.
A. End-to-end Machine Learning
Given a relational database D with tables {t1, t2, t3, ..., tn} and a complicated data-analytics problem P that involves both database operators and ML algorithms, an end-to-end ML pipeline for solving the problem proceeds as follows. (i) Feature engineering: creating a training view V from D that includes the transformed features relevant to P; (ii) Model training: training a model on view V for the target of P; (iii) Model inference: selecting the proper model and using it to predict labels for given tuples; (iv) Data analysis: fetching and analyzing the tuples that satisfy constraints on the model's prediction results. We next present three typical scenarios to illustrate the benefits of conducting end-to-end machine learning inside the database.
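The four pipeline steps can be sketched in plain NumPy as follows (toy data and a logistic-regression model chosen purely for illustration; in GaussML each step would instead be expressed in SQL):

```python
import numpy as np

# Toy "table": 500 rows of (feature_a, feature_b), with a derived binary label.
rng = np.random.default_rng(2)
raw = np.column_stack([rng.normal(5, 2, 500), rng.normal(0, 1, 500)])
label = (raw[:, 0] + 3 * raw[:, 1] > 5).astype(float)

# (i) Feature engineering: build a "training view" of normalized features.
view = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# (ii) Model training: logistic regression via batch gradient descent.
w, b = np.zeros(view.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(view @ w + b)))
    w -= 0.1 * view.T @ (p - label) / len(view)
    b -= 0.1 * float(np.mean(p - label))

# (iii) Model inference: predict labels for the same (or new) tuples.
pred = (1.0 / (1.0 + np.exp(-(view @ w + b))) > 0.5).astype(float)

# (iv) Data analysis: fetch tuples whose predicted label satisfies a constraint.
positives = raw[pred == 1.0]
accuracy = float(np.mean(pred == label))
```

Steps (i) and (iv) are precisely the operations a database already does well (projection, aggregation, selection), which is why pushing the whole pipeline into the engine pays off.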
Scenario 1: Data analysts spend a lot of effort preprocessing the dataset for easier training and better representation (i.e., feature engineering). This process frequently uses both data manipulation operations (e.g., join, projection, aggregation) and data preprocessing operations (e.g., normalization). The former can be optimized by the database optimizer, and the latter can exploit the database's native data statistics. In contrast, these operations can consume a large amount of time (sometimes more than model training itself) in other ML systems. Moreover, because the whole process is conducted inside the database, data access is controlled by the database's permission subsystem, so GaussML is more secure and trustworthy when training on sensitive data in core business.
Similarly, in the inference phase, users can directly obtain the predicted targets from the database without touching the model or the original data. This avoids data transfer overhead and guarantees the security of the original data.
Scenario 2: Real users commonly perform data clustering in which the features are split across different relations, so they must assemble a result containing all features before training. For example, when users train K-means, they can write a SQL statement that creates the model from a subquery joining all the tables. In GaussML, if the distance metric is L1 or L2, we can factorize the distance computation across the different tables and push it down to the scan nodes. For 1:n or n:n joins, GaussML significantly reduces the training overhead with such ML-DB co-optimization methods.
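The factorization rests on the decomposability of squared L2 distance: for a joined tuple (tA, tB) and a centroid split as (cA, cB), ||t - c||² = ||tA - cA||² + ||tB - cB||², so each A-side partial distance can be computed once per A-tuple and reused across all of its join partners. A minimal sketch with hypothetical data (not GaussML's implementation):

```python
import numpy as np

def l2_sq(u, v):
    d = np.asarray(u) - np.asarray(v)
    return float(d @ d)

# A 1:n join: each row of B references one row of A via a foreign key.
A = {1: np.array([0.0, 1.0]), 2: np.array([2.0, 2.0])}                  # id -> A-features
B = [(1, np.array([5.0])), (1, np.array([6.0])), (2, np.array([7.0]))]  # (fk, B-features)

# One centroid, split into the dimensions coming from A and from B.
c_A, c_B = np.array([1.0, 1.0]), np.array([5.5])

# Naive: materialize the join, then compute the full distance per joined tuple.
naive = [l2_sq(np.concatenate([A[fk], fb]), np.concatenate([c_A, c_B]))
         for fk, fb in B]

# Factorized: each A-side partial distance is computed once and reused
# across all join partners; only the B-side partial is computed per B row.
partial_A = {k: l2_sq(v, c_A) for k, v in A.items()}
factorized = [partial_A[fk] + l2_sq(fb, c_B) for fk, fb in B]

assert np.allclose(naive, factorized)
```

For a 1:n join the naive plan recomputes the A-side distance n times per A-tuple; the factorized plan computes it once, which is where the savings come from.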
Scenario 3: In this scenario, the model's prediction results are taken as constraints for selecting the desired tuples above the scan node. For example, to select patients with anxiety, we filter the scanned tuples using a well-trained anxiety model. Moreover, if we merge data from different data sources under such constraints, the database pushes down the predicates for better performance. As Figure 1 shows, we train three models on individual tables t1, t2 and t3, and then predict the labels on new tables t1, t2 and t3. In a traditional relational database, tables are usually joined on primary-foreign keys, taking the larger side as the outer table. GaussML, however, must also consider the prediction cost, because the cost differences between models cannot be neglected. During this process, GaussML can choose the optimal execution path using its model-aware cost estimator and plan selector. Moreover, GaussML computes ML-based constraints on the fly to avoid schema changes and storage overhead, and to reduce the risk of information leakage.
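A simplified illustration of why prediction cost matters for ordering (hypothetical per-model costs and selectivities; GaussML's actual cost model also covers join paths): with several model-based predicates, a tuple reaches a later predicate only if it passed all earlier ones, and for independent predicates, sorting by cost / (1 - selectivity) minimizes the expected per-tuple cost:

```python
from itertools import permutations

# Hypothetical per-model stats: cost of one prediction, and the selectivity
# of the predicate "prediction satisfies the constraint".
filters = {
    "model_t1": {"cost": 5.0, "selectivity": 0.1},
    "model_t2": {"cost": 1.0, "selectivity": 0.8},
    "model_t3": {"cost": 20.0, "selectivity": 0.5},
}

def expected_cost(order):
    """Per-tuple cost of applying model predicates in this order: a tuple
    reaches predicate i only if it passed all earlier predicates."""
    cost, pass_prob = 0.0, 1.0
    for name in order:
        cost += pass_prob * filters[name]["cost"]
        pass_prob *= filters[name]["selectivity"]
    return cost

# Cost-based choice: exhaustively search all orders (fine for a few models).
best = min(permutations(filters), key=expected_cost)

# Classic shortcut: sorting by rank = cost / (1 - selectivity) yields the
# same optimal order for independent predicates.
ranked = tuple(sorted(filters,
                      key=lambda n: filters[n]["cost"] / (1 - filters[n]["selectivity"])))
assert best == ranked
```

Note that the cheapest predicate is not always first: a cheap but unselective model can be worth deferring, which is exactly the trade-off a model-aware cost estimator captures.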
B. GaussML Architecture
As Figure 2 shows, GaussML is composed of five major components, which together offer full-fledged machine learning capability inside a relational database (openGauss in this paper).
MLSQL Parser. In this layer, GaussML extends SQL into MLSQL by seamlessly integrating machine learning operations into SQL. It supports model training with create model ... with ..., and model inference with the expression predict by .... The detailed usage is introduced in Section II-C. Note that GaussML supports the PBE (parse-bind-execute) protocol, a lazy parameter-binding approach to query execution, so SQL statements with the same template are parsed only once.
MLSQL Optimizer. In this layer, GaussML optimizes MLSQL to handle both Scenario 1 and Scenario 2. The GaussML optimizer not only uses a model-aware cost estimator to find the optimal data access path and model prediction order, but also performs interleaved optimizations between data access operators and model training operators. We also design a brand-new in-database cardinality estimation component customized for machine learning operations in GaussML. The details of the optimizer are presented in Section III.
MLSQL Executor. In this layer, we define a set of ML operators to support high-performance ML execution. GaussML supports over 20 popular machine learning algorithms, whose training executors are composed of four basic operators: a matrix computation operator, a statistics operator, a gradient descent operator, and a distance computation operator. To increase parallelism and reduce the latency of model training, we also develop parallel training