biggy: An Implementation of Unified Framework
for Big Data Management System
Yao Wu, Henan Guan
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Beijing, China
School of Information, Renmin University of China, Beijing, China
Email: {ideamaxwu, henanguan}@ruc.edu.cn
Abstract—Various tools, software and systems have been proposed
and implemented to tackle the challenges of big data with different
emphases, e.g., data analysis, data transaction, data query, data
storage, data visualization and data privacy. In this paper, we propose
datar, a new perspective on and a unified framework for Big Data
Management System (BDMS) from the standpoint of system
architecture, leveraging ideas from mainstream computer architecture.
We introduce five key components of datar by reviewing the current
status of BDMS. Datar features a configuration chain of
pluggable engines, automatic dataflow on job pipelines, intelligent
self-driving system management and interactive user interfaces.
Moreover, we present biggy as an implementation of datar, with
manipulation details demonstrated by four running examples.
Evaluations of efficiency and scalability are carried out to show
the performance. Our work argues that the envisioned datar is
a feasible solution to the unified framework of BDMS, which
can manage big data pluggably, automatically and intelligently
with specific functionalities, where specific functionalities refer
to input, storage, computation, control and output of big data.
Index Terms—big data management system, data processing,
unified framework, datar, biggy
I. INTRODUCTION
A. Motivation of Datar
When Alan Turing posed the question “Can machines
think?” [1], the imitation game began. Von Neumann undertook
engineering research on computers and described a logical
design of a computer using the stored-program concept, which
is known as the Von Neumann architecture [2]. Charles Babbage
proposed the Analytical Engine, a design for a mechanical
general-purpose computer. The goal of these pioneers was to
design a computing machine better than the human brain, one that
could liberate humans from manual work and tedious computation.
During the last several decades, data management principles
such as the relational model of data, physical and logical
independence, declarative querying and cost-based optimization
have led to several fields of research and a prosperous
industry. Many novel challenges and opportunities associated
with big data necessitate rethinking many aspects of these data
management platforms, while retaining other desirable aspects.
The practical and theoretical contributions [3], [4] of Bachman and
Codd opened up research on databases. Progress in data
management has never stopped, with Ingres [5],
Postgres [6], Mariposa [7], C-Store [8], VoltDB [9], AsterixDB
[10] and P-Store [11] in the database field, as well as Megastore
[12], Spanner [13], MillWheel [14], Azure CosmosDB¹ and
TiDB² in the distributed system field.
Fig. 1. A typical workflow for big data management.
Michael Stonebraker proposed “One Size Doesn’t Fit All”,
and in this paper, we try to argue that “All Can Fit in One”.
As the computing power of machines grows stronger, we
can sense a shift from computation to data management, aimed at
extracting more insight and knowledge from data.
Jim Gray foresaw the transformation from computation-intensive
to data-intensive scientific discovery and put forward
The Fourth Paradigm [15]. He also held that the way to
cope with such a paradigm was to develop a new generation
of computing tools to manage, visualize, and analyze massive
data. Because a Big Data Management System (BDMS)
is a complex set of functionalities, we think it necessary to
propose a unified architecture to guide the design of BDMS.
From these observations, we identify five
main components in BDMS, which provide a basis for
understanding our proposed datar architecture.
As shown in Fig. 1, a BDMS consists of several core components,
such as collection, storage, processing and visualization.
Compared with traditional database systems, the BDMS architecture is
more flexible and open to varied requirements arising from different
focuses. In this paper, we unify the BDMS as datar, a
general framework for designing and building BDMS, named by
analogy with the term computer.
1 https://azure.microsoft.com/zh-cn/services/cosmos-db/
2 https://www.pingcap.com/
arXiv:1810.09378v1 [cs.DB] 22 Oct 2018
(a) Computer Architecture (b) Datar Architecture
Fig. 2. Computer VS Datar comparison in terms of architecture.
As is well known, the mainstream computer architecture
is divided into five parts, i.e., input, storage, computation,
control and output, of which computation is the center.
On closer inspection, a BDMS is much the
same as a computer, consisting of (data) input, (data) storage,
(data) computation (query/analysis), (data) control
(transaction/recovery) and (data) output (visualization), of which data
storage is the center. In other words, we can call a computer
a Fast Computation Processing System (FCPS). Likewise, a
BDMS can be called a datar, focusing on data. We use Fig. 2
to illustrate the similarities and differences between a computer
and a datar in terms of architecture. In Fig. 2 (a), the five core
components of a computer are shown in separate rectangles,
while in Fig. 2 (b), the five corresponding parts of a datar are shown. A
computer and a datar share similar functionalities, with
different emphases on computation or data storage.
B. Concept of Datar
Definition (Datar) A datar is a set of coherent software/systems
based on a unified architecture that can manage
(big) data pluggably, automatically and intelligently with
specific functionalities, where specific functionalities refer to
input, storage, computation, control and output of the (big)
data. Datar features Interactive Interface Clients,
Pluggable Engines Configuration, Automatic Dataflow on Job
Pipelines and Intelligent Self-driving System Management
based on the unified framework. In this paper, we implement
datar with these features as biggy, a data-storage-centered
implementation of datar.
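
To make the definition more concrete, the following is a minimal sketch of a configuration chain of pluggable engines with automatic dataflow. It is an illustration only: the names `Datar`, `plug`, `run` and `STAGES` are hypothetical and are not taken from biggy, and the placement of control between computation and output is an assumption made for the example.

```python
from typing import Callable, Dict, List

# The five functionalities named in the definition.
# Assumption: this linear order is chosen for illustration only.
STAGES: List[str] = ["input", "storage", "computation", "control", "output"]

# An engine is modeled as any callable that transforms a list of records.
Engine = Callable[[list], list]


class Datar:
    """Toy configuration chain: one pluggable engine per functionality."""

    def __init__(self) -> None:
        self.engines: Dict[str, Engine] = {}

    def plug(self, stage: str, engine: Engine) -> "Datar":
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.engines[stage] = engine  # replaces any engine plugged earlier
        return self

    def run(self, records: list) -> list:
        missing = [s for s in STAGES if s not in self.engines]
        if missing:
            raise RuntimeError(f"no engine plugged for: {missing}")
        for stage in STAGES:  # dataflow along the job pipeline
            records = self.engines[stage](records)
        return records


stored: list = []  # stands in for a storage engine's backing store


def store(records: list) -> list:
    stored.extend(records)
    return records


datar = (
    Datar()
    .plug("input", lambda rs: [r.strip() for r in rs])
    .plug("storage", store)
    .plug("computation", lambda rs: [r.upper() for r in rs])
    .plug("control", lambda rs: [r for r in rs if r])  # drop empty records
    .plug("output", lambda rs: sorted(rs))
)

result = datar.run([" b ", "a", "  "])
```

Swapping one engine, for example `datar.plug("computation", other_engine)`, leaves the other four slots untouched, which is the sense in which the engines are pluggable in this sketch.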
A datar, i.e., a full-function BDMS, consists of five parts:
data input, data storage, data computation, data control and
data output. Compared with the computation-centered computer,
a datar is data-centered. We take AsterixDB [10] as an
example, which is a new, full-function BDMS. Data input is
how data gets into the system. In AsterixDB, the data feed is
a built-in mechanism that allows new data to be continuously
ingested into the system from external sources, incrementally
populating the datasets and their associated indexes [16]. Data
storage is how the data is stored in the system and how the
indexes are built. In AsterixDB, data and indexes are stored based
on the LSM structure [17]. Data computation is how to mine
valuable information from stored data. A range of methods
can be applied, such as the popular in-memory computation
framework Spark on AsterixDB [18]. In addition, the execution
of data processing is also part of data analysis, like Hyracks
[19] in AsterixDB. Data control is how to control data while
it is being processed. It differs from traditional database
systems, which have strict ACID properties. Another important
aspect of datar is data output, e.g., visualization. Cloudberry³
is a research prototype that supports interactive analytics and
visualization of large amounts of spatial-temporal data using
AsterixDB. Because AsterixDB has all these features, it lets
us explain the five main components of BDMS with a single
system. The key drawback of taking AsterixDB as a BDMS
is that it is a strongly coupled system, which is not suitable
for the varied and dynamic requirements of real scenarios when
processing big data. Nor is it easy for developers to combine
it with newly emerging engines. Datar is proposed to provide a
unified framework for building your own BDMS more flexibly.
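
For readers unfamiliar with the LSM structure mentioned above for data storage, the following is a minimal sketch of the general idea: writes go to an in-memory buffer, the buffer is flushed as an immutable sorted run, and reads consult the newest data first. This is a toy illustration, not AsterixDB's implementation; the class name `TinyLSM` and the `memtable_limit` parameter are hypothetical, and merging (compaction) of runs is omitted.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM store: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Dict[str, str] = {}
        self.runs: List[List[Tuple[str, str]]] = []  # newest run is last

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Write the buffer out as one sorted, immutable run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run first
            # (key,) sorts just before any (key, value) pair.
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")  # buffer reaches the limit and is flushed as a run
db.put("a", "3")  # newer value for "a" stays in the buffer for now
```

In this sketch, `db.get("a")` returns the newer value because the in-memory buffer is checked before any flushed run.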
C. Contributions of Datar
With the development of Internet services, data content is
growing rapidly, and we have to face the challenges of
handling such big data. Data system research has entered a new
era, which shifts the traditional concepts from row-based stores
to column-based stores, from disk-based queries to in-memory
analysis, and from ACID properties to the CAP theorem.
Big data shows great value in real applications, and challenges
arise. Various tools and systems have been proposed and developed
to tackle these challenges with different emphases. In this paper,
we describe the BDMS from a new perspective, the view of
computer architecture, to propose a unified framework, datar.
We focus our attention on the system architecture of BDMS
and break it down into five main components to elaborate.
The envisioned datar is implemented as biggy with favorable
features. The key contributions can be summarized as follows.
We review current big data management systems by five
core components and state our contributions.
3 http://cloudberry.ics.uci.edu/