
Chen Wang¹,², Xiangdong Huang¹∗, Jialin Qiao¹, Tian Jiang¹, Lei Rui¹, Jinrui Zhang³, Rong Kang¹, Julian Feinauer⁴, Kevin A. McGrail⁵, Peng Wang⁶, Diaohan Luo¹, Jun Yuan¹, Jianmin Wang¹, Jiaguang Sun¹
Figure 1: Main modules of IoTDB. Labels shown in the figure: Client (JDBC + SQL-like Language, Native API, RESTful API); Query Engine (Index Manager, Data Reader Manager, Cache Manager); Storage Engine (Time Detector, Ordered Memtable, Out-of-order Memtable; TsFile, O3-TsFile, D-File, Index File); Cluster Engine (Raft Protocol, Series-based Partitioner, Time-based Partitioner, Single-node IoTDB); Metadata Management (Schema Management); File Sync; Spark Adaptor and Hive Adaptor (Hadoop ecosystems); IoT Devices (Edge-side IoTDB, TsFile Lib, PLC Controller; series register, PLC data, MQTT, etc.).
performance and poor compression for IoT workloads. InfluxDB improves on these aspects by about 10x [5], but still lags in query performance, especially when aggregating large amounts of historical data. In practice, Parquet [3] is the usual choice for time-series storage in analytics and OLAP workloads within the Hadoop ecosystem, but it requires ETL from the time-series database. Moreover, optimization beyond plain columnar storage is needed to support native time-series data with efficient storage and queries.
In this demo, we introduce Apache IoTDB [1], a native time-series storage format and database for both edge and cloud computing. It has the following key features: (1) IoTDB has a lightweight architecture that runs on edge appliances and a cluster version for the data center, both built from the same code base, as well as efficient data synchronization from edge to cloud. (2) IoTDB provides a novel columnar file format, called TsFile (Time-series File), as its main storage format to optimize data organization, size reduction, and query performance for time-series data. (3) IoTDB supports high-throughput ingestion through an elaborate buffering design and storage strategy that handle frequent out-of-order data ingestion and sorted data queries. (4) IoTDB leverages metadata in TsFiles and index files to support low-latency queries and complex similarity search. (5) TsFile, as the file format of IoTDB, can also be accessed directly in the Hadoop ecosystem by Spark and Hive for data analysis.
Currently, IoTDB supports an ingestion rate of up to 30 million data points per second on a single node, with latencies of hundreds of milliseconds for raw data queries and tens of milliseconds for aggregation queries on billions of data points. A more comprehensive functional and experimental evaluation is publicly available [4]. IoTDB has been deployed in production environments by several industry users.
2 SYSTEM OVERVIEW
The architecture of IoTDB is shown in Figure 1. IoTDB is designed to manage huge volumes of time-series data points from IoT devices, where one data point is logically represented as <device, sensor, timestamp, value>. Here, the device and sensor identifiers together form a unique time-series ID. The Metadata Management module manages the namespace of devices with a tree structure. For instance, Location1.Windfarm2.Manufacturer3.Turbine4 is a full path that describes a single wind turbine. IoTDB stores the data in an open, native time-series file format, so that both database access through the Query/Storage Engine and Hadoop/Spark access operate on a single copy of the data. It also serves as a distributed time-series database, in which the Cluster Engine partitions data across nodes by groups of time-series, while time-based data slicing is applied on each node to improve performance. IoTDB provides an SQL-like language, a native API, and a RESTful API to access the data. We introduce the main features in the following subsections.
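The data model and tree-structured namespace described above can be illustrated with a minimal sketch. The class and method names (`DataPoint`, `MetadataTree`, `register`, `devices_under`) are hypothetical and chosen for illustration only; they are not IoTDB's actual API, and the sketch assumes that leaf nodes of the tree correspond to devices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataPoint:
    """One logical data point: <device, sensor, timestamp, value>."""
    device: str      # full device path, e.g. "Location1.Windfarm2.Manufacturer3.Turbine4"
    sensor: str
    timestamp: int
    value: float

    @property
    def series_id(self) -> Tuple[str, str]:
        # Device and sensor identifiers together identify one time-series.
        return (self.device, self.sensor)


@dataclass
class PathNode:
    name: str
    children: Dict[str, "PathNode"] = field(default_factory=dict)


class MetadataTree:
    """Hypothetical tree-structured namespace; leaves are treated as devices."""

    def __init__(self) -> None:
        self.root = PathNode("root")

    def register(self, device_path: str) -> None:
        # Create one tree level per dot-separated path segment.
        node = self.root
        for part in device_path.split("."):
            node = node.children.setdefault(part, PathNode(part))

    def devices_under(self, prefix: str) -> List[str]:
        # Walk down to the prefix node, then collect every leaf below it.
        node = self.root
        parts = prefix.split(".") if prefix else []
        for part in parts:
            if part not in node.children:
                return []
            node = node.children[part]
        found: List[str] = []
        stack = [(node, parts)]
        while stack:
            current, path = stack.pop()
            if not current.children and path:
                found.append(".".join(path))
            for name, child in current.children.items():
                stack.append((child, path + [name]))
        return sorted(found)
```

With such a tree, a prefix such as `Location1.Windfarm2` resolves to every turbine registered beneath it, which is the kind of lookup a hierarchical namespace makes cheap.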
2.1 Uniform Edge-Cloud Design
In IoT scenarios, edge computing and cloud-side deployment are equally important. Therefore, IoTDB is designed to fit three deployment models: 1) file-based storage or an embedded time-series database on an edge appliance such as a Raspberry Pi, 2) a standalone time-series database on an industrial PC, and 3) a distributed time-series database or a Hadoop cluster with the TsFile storage format.
Typically, IoT devices collect data from sensors and industrial controllers and send it to the data center in real time using customized or standard protocols such as MQTT. However, in some cases, edge intelligence requires real-time analytics, such as fault alerts, that retrieve data from a local data store. Therefore, IoTDB has a lightweight, embedded version that can be deployed on IoT devices; its minimal runtime memory requirement is 32 MB, and it can run on an ARM7 processor. Local storage is also mandatory to prevent data loss during a temporary network outage. In this scenario, TsFile Lib allows the devices to persist data in TsFile format, and the generated TsFiles can afterwards be directly synchronized and merged with an active IoTDB instance on the cloud using the File Sync module.
On the cloud side, the Cluster Engine implements a Raft-based protocol to manage multiple IoTDB nodes. In cluster mode, data partitions can be defined according to both time slice and time-series ID. The distribution of data and query operations is completely transparent to end users.
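As a rough illustration of partitioning by both time-series ID and time slice, the sketch below maps a data point to a (series group, time slice) pair. The hash function, the number of groups, and the one-day slice width are all assumptions made for this example; the paper does not specify how IoTDB computes its partitions.

```python
import zlib
from typing import Tuple

DAY_MS = 86_400_000  # assumed slice width: one day in milliseconds


def series_partition(device: str, num_groups: int) -> int:
    """Assign a device's time-series to a group by hashing (assumed scheme)."""
    return zlib.crc32(device.encode("utf-8")) % num_groups


def time_partition(timestamp_ms: int, slice_ms: int = DAY_MS) -> int:
    """Index of the fixed-width time slice that contains the timestamp."""
    return timestamp_ms // slice_ms


def locate(device: str, timestamp_ms: int,
           num_groups: int = 3, slice_ms: int = DAY_MS) -> Tuple[int, int]:
    """Return the (series group, time slice) pair for one data point."""
    return (series_partition(device, num_groups),
            time_partition(timestamp_ms, slice_ms))
```

Because the mapping is deterministic, a client or coordinator can route both writes and queries for a given series and time range without the end user knowing where the data lives.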
2.2 TsFile Format
TsFile is the primary data file format for time-series data storage in IoTDB. Figure 2 shows the structure of the TsFile. TsFile is similar to Parquet but optimized for time-series data. A TsFile mainly consists of two parts: the data content (Chunks, Pages) and the index. Each chunk stores the data of one time-series for a certain time range. Inside a chunk, the data is split into several pages, which are the fundamental units of data storage on disk. Each page stores data points in a pair of columns, i.e., the timestamps and the values. Timestamps are encoded by second-order differencing, and the value field supports compression algorithms such as bitmap, Gorilla, and RLE to save disk space. Snappy is also employed for additional compression of historical data.
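To make second-order differencing concrete, the following sketch stores the first timestamp, the first delta, and then the difference between consecutive deltas. It illustrates the general idea only; it is not TsFile's on-disk layout, which is not described here, and the function names are hypothetical.

```python
from typing import List


def encode_second_order(timestamps: List[int]) -> List[int]:
    """Encode as [first value, first delta, delta-of-delta, ...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    first_delta = timestamps[1] - timestamps[0]
    encoded = [timestamps[0], first_delta]
    prev_delta = first_delta
    for i in range(2, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        encoded.append(delta - prev_delta)  # second-order difference
        prev_delta = delta
    return encoded


def decode_second_order(encoded: List[int]) -> List[int]:
    """Invert encode_second_order by accumulating deltas."""
    if len(encoded) < 2:
        return list(encoded)
    delta = encoded[1]
    decoded = [encoded[0], encoded[0] + delta]
    for dod in encoded[2:]:
        delta += dod
        decoded.append(decoded[-1] + delta)
    return decoded
```

For sensors that sample at a nearly fixed interval, almost all second-order differences are zero or very small, so they can be stored in far fewer bits than the raw timestamps.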
To accelerate queries, the data in the chunks of each time-series is ordered by time in TsFile. In this way, queries with time range filters can quickly skip the chunks that fall outside the given time window.
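The chunk-skipping idea can be sketched as follows, assuming each chunk exposes its minimum and maximum timestamp and is never empty. The `Chunk` class and `range_query` function are hypothetical illustrations, not TsFile's reader API.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical in-memory chunk; timestamps are sorted and non-empty."""
    timestamps: List[int]
    values: List[float]

    @property
    def start_time(self) -> int:
        return self.timestamps[0]

    @property
    def end_time(self) -> int:
        return self.timestamps[-1]


def range_query(chunks: List[Chunk], t_min: int, t_max: int) -> List[Tuple[int, float]]:
    """Return all points with t_min <= timestamp <= t_max."""
    result: List[Tuple[int, float]] = []
    for chunk in chunks:
        # Skip chunks whose time range does not overlap the query window.
        if chunk.end_time < t_min or chunk.start_time > t_max:
            continue
        # Inside an overlapping chunk, binary search works because data is time-ordered.
        lo = bisect_left(chunk.timestamps, t_min)
        hi = bisect_right(chunk.timestamps, t_max)
        result.extend(zip(chunk.timestamps[lo:hi], chunk.values[lo:hi]))
    return result
```

Only the start and end times are inspected for chunks outside the window, so the cost of a narrow time-range query depends mainly on the number of overlapping chunks rather than on the total data volume.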