(3) We summarize the common computation patterns of
widely-used ML algorithms, and then develop native
operators to accelerate model training, each of which can be
organized as a node in the database's native query plan tree.
These operators are accelerated by SIMD and data prefetching
techniques, and they support distributed computing (see Section IV).
(4) We conduct extensive experiments comparing with state-
of-the-art machine learning systems, including MADlib and
ML-A (an ML engine implemented with a popular Python
library), and the results show that our system outperforms
existing methods by 2-6 times (see Section V).
II. SYSTEM OVERVIEW
We first define the end-to-end machine learning (ML) pro-
cess in Section II-A, and then introduce the architecture of
GaussML from a database perspective to show how ML
is seamlessly integrated into the relational database in Section II-B.
Finally, we introduce the native MLSQL grammar for using
GaussML in Section II-C.
A. End-to-end Machine Learning
Given a relational database D with tables {t_1, t_2, t_3, · · · , t_n}
and a complicated data analytic problem P involving both
database operators and ML algorithms, the steps of an
end-to-end ML pipeline for solving the problem are as
follows. (i) Feature engineering: creating a training view V
from D that includes transformed features related to the problem
P; (ii) Model training: training a model from view V for
the target of problem P; (iii) Model inference: selecting
the proper model and using it to predict the labels of
given tuples; (iv) Data analysis: fetching and analyzing tuples
that satisfy constraints on the model prediction results. We
next show three typical scenarios to better illustrate the
benefits of conducting end-to-end machine learning within the
database.
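The four steps above can be sketched end-to-end in a few lines of Python (a minimal illustration only: the tables, the normalization, and the trivial mean-threshold "model" standing in for a real ML algorithm are all hypothetical):

```python
# Minimal end-to-end sketch of the four pipeline steps, using plain
# Python lists in place of database tables (illustrative only).

# Two hypothetical base tables, joined on "id".
t1 = [{"id": 1, "age": 25}, {"id": 2, "age": 40}, {"id": 3, "age": 60}]
t2 = [{"id": 1, "label": 0}, {"id": 2, "label": 0}, {"id": 3, "label": 1}]

# (i) Feature engineering: build a training view V by joining t1 and t2
# and normalizing the "age" feature into [0, 1].
joined = [{**a, **b} for a in t1 for b in t2 if a["id"] == b["id"]]
lo = min(r["age"] for r in joined)
hi = max(r["age"] for r in joined)
V = [{"x": (r["age"] - lo) / (hi - lo), "label": r["label"]} for r in joined]

# (ii) Model training: fit a trivial threshold "model" (midpoint between
# the per-class feature means) -- a stand-in for a real ML algorithm.
mean = lambda xs: sum(xs) / len(xs)
m0 = mean([r["x"] for r in V if r["label"] == 0])
m1 = mean([r["x"] for r in V if r["label"] == 1])
threshold = (m0 + m1) / 2
model = lambda x: int(x > threshold)

# (iii) Model inference: predict labels for new tuples.
new_tuples = [{"id": 4, "x": 0.1}, {"id": 5, "x": 0.9}]
preds = {r["id"]: model(r["x"]) for r in new_tuples}

# (iv) Data analysis: fetch tuples whose predicted label satisfies a
# constraint (here, predicted label == 1).
positives = [r for r in new_tuples if preds[r["id"]] == 1]
print(positives)   # -> [{'id': 5, 'x': 0.9}]
```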
Scenario 1: Data analysts spend a lot of effort on prepro-
cessing the dataset for easier training and better representation
(i.e., feature engineering). In this process, data manipulation
operations (e.g., join, projection, aggregation) and data pre-
processing operations (e.g., normalization) are often used. The
former can be optimized by the database optimizer,
and the latter can utilize the native data statistics in the
database. In contrast, these operations would take a large
amount of time (even more than model training itself) in other
ML systems. Moreover, as the whole process is conducted
inside the database, data access is controlled by the database's
permission subsystem, so GaussML is more secure and trusted
when training on sensitive data in core business.
Similarly, in the inference phase, users can directly obtain
the predicted targets from the database without touching the model
or the original data. This avoids data transfer overhead
and guarantees the security of the original data.
Scenario 2: It is common for users to perform data
clustering where the features are split across different relations,
so a result containing all features must be obtained before
training. For example, when users train K-means on such data, they
can write a SQL statement that creates a model from a subquery joining
all the tables. In GaussML, if the distance is L1 or L2, we can
also factorize the distance computation across the different tables and
push it down to the scan nodes. For 1:n or n:n joins, GaussML
will significantly reduce the training overhead with ML-DB
co-optimization methods.
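The factorization idea can be illustrated with squared L2 distance: because ||x - c||^2 decomposes additively over feature blocks, each table's partial distance to a centroid can be computed once at its scan node and reused for every joined row, instead of recomputing the full distance on the materialized join. A small sketch under hypothetical tables and a hypothetical centroid:

```python
# Sketch of factorizing a squared-L2 distance across a 1:n join.
# Features are split between table r (one row per key) and table s
# (many rows per key). Each r-side partial distance is computed once
# per key and reused for all matching s rows.

r = {1: (1.0, 2.0), 2: (3.0, 0.0)}      # key -> features (a, b)
s = [(1, 5.0), (1, 6.0), (2, 7.0)]      # rows of (key, feature d)

centroid = (1.0, 1.0, 5.0)              # centroid over (a, b, d)

def sq(u, v):
    """Squared L2 distance between equal-length vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

# Partial distances for the r-block, computed once per key (push-down
# to the scan node).
part_r = {k: sq(feats, centroid[:2]) for k, feats in r.items()}

# Each joined tuple's distance = reused r-partial + its s-partial.
factorized = [part_r[k] + sq((d,), centroid[2:]) for k, d in s]

# Reference: materialize the join, then compute full distances.
naive = [sq(r[k] + (d,), centroid) for k, d in s]

print(factorized)   # -> [1.0, 2.0, 9.0]
assert factorized == naive
```

The same decomposition applies to the L1 distance, since |x - c| also sums independently over feature blocks.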
Scenario 3: In this scenario, the model prediction results are
taken as constraints for selecting the desired tuples above the
scan node. For example, if we want to select patients with
anxiety, we should filter the scanned tuples using a well-trained
anxiety model. Moreover, if we merge data from different
data sources with constraints, the database will push down
the predicates for better performance. As Figure 1 shows,
we train three models on individual tables t_1, t_2 and t_3,
and then we predict the labels on new tables t'_1, t'_2 and t'_3.
In a traditional relational database, tables are often joined on
primary-foreign keys, with the larger side taken as the outer table.
However, GaussML should also consider the prediction cost,
because the cost differences among models cannot be neglected.
During the process, GaussML can offer the optimal execution
path by using advanced model-aware cost estimator and plan
selector. Moreover, GaussML computes ML-based constraints
on the fly to avoid schema change and storage overhead, and
reduce the risk of information leakage.
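One ingredient of such model-aware planning can be sketched as predicate ordering: when several trained models act as filters, evaluating cheap and highly selective models first minimizes the expected prediction cost. The per-tuple costs and selectivities below are hypothetical, and ranking by cost / (1 - selectivity) is the classic ordering rule for independent predicates, not necessarily GaussML's exact estimator:

```python
# Sketch of model-aware ordering of ML-based filter predicates.

models = [
    # (name, per-tuple prediction cost, selectivity = fraction passing)
    ("deep_net", 50.0, 0.30),
    ("log_reg",   1.0, 0.50),
    ("tree",      5.0, 0.10),
]

def expected_cost(order, n=1000.0):
    """Expected total cost of applying predicates in the given order:
    each predicate only sees the tuples that passed the earlier ones."""
    total, survivors = 0.0, n
    for _, cost, sel in order:
        total += survivors * cost
        survivors *= sel
    return total

# Rank by cost / (1 - selectivity): cheap, strongly filtering first.
best = sorted(models, key=lambda m: m[1] / (1.0 - m[2]))
print([m[0] for m in best])          # -> ['log_reg', 'tree', 'deep_net']

expensive_first = expected_cost(models)   # deep model evaluated first
optimized = expected_cost(best)
assert optimized < expensive_first
```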
B. GaussML Architecture
As Figure 2 shows, GaussML is composed of five major
components, and they offer full-fledged machine learning
ability inside relational database (openGauss in this paper).
MLSQL Parser. In this layer, GaussML extends SQL to
support MLSQL by seamlessly integrating machine learning
operations into SQL. It supports model training through create
model ... with ..., and model inference through the expression
predict by .... The detailed usage will be introduced in Sec-
tion II-C. Note that GaussML supports the PBE (i.e., parse-bind-
execute) protocol, a lazy parameter binding approach to
query execution, so SQL statements with the same template are parsed
only once.
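The parse-once behavior behind PBE can be sketched with a small plan cache keyed by statement template; the "parse" step here is a hypothetical stand-in (splitting at placeholders), not the real parser:

```python
# Sketch of the parse-once behavior of a PBE (parse-bind-execute)
# protocol: statements sharing a template are parsed a single time,
# and only parameter binding happens per execution.

parse_calls = 0
plan_cache = {}                      # template -> prepared "plan"

def prepare(template):
    """Parse the template once and cache the result (Parse phase)."""
    global parse_calls
    if template not in plan_cache:
        parse_calls += 1
        # Stand-in for building a plan tree: split at '?' placeholders.
        plan_cache[template] = template.split("?")
    return plan_cache[template]

def execute(template, params):
    """Bind parameters into the cached plan (Bind + Execute phases)."""
    parts = prepare(template)
    assert len(parts) == len(params) + 1
    return "".join(p + str(v) for p, v in zip(parts, params)) + parts[-1]

sql = "SELECT * FROM patients WHERE age > ? AND score < ?"
execute(sql, [30, 0.5])
execute(sql, [40, 0.9])              # same template: no re-parse
print(parse_calls)                   # -> 1
```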
MLSQL Optimizer. In this layer, GaussML supports opti-
mization on MLSQL to handle both Scenario 1 and Scenario
2. The GaussML optimizer not only uses a model-aware cost
estimator to find the optimal data access path and model
prediction order, but also conducts interleaving optimizations
between data access operators and model training operators. We
also design a brand-new in-database cardinality estimation
component customized for machine learning operations in
GaussML. The details of the optimizer will be introduced
in Section III.
MLSQL Executor. In this layer, we define a set of ML op-
erators to support high-performance ML execution. GaussML
supports over 20 popular machine learning algorithms, and
the training executors of these algorithms are composed of
four basic operators: a matrix computation operator, a
statistics operator, a gradient descent operator, and a distance com-
putation operator. To increase the parallelism and reduce the
latency of model training, we also develop parallel training