IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. XX, XXX XXXX 2
methods directly model the activity “Chat with friends” with
discrete keywords including “Chat”, “with”, and “friends”.
Leaving aside the disruption of the semantic coherence of
sentences, we cannot enumerate all keywords. To address this
challenge, we propose a new method to model a semantic
trajectory into a geospatial sequence and a semantic sequence,
respectively. In a semantic sequence, we embed semantic texts
into vectors, capturing the rich and deep semantics. Then,
we combine the spatial and semantic similarity into a spatial-
semantic similarity metric, while supporting popular measures
including DTW [23], LCSS [24], and EDR [25].
Challenge II: How to design a scalable distributed se-
mantic trajectory similarity search framework? Compared
to raw trajectories that consist of locations, semantic trajec-
tories introduce semantic events, leading to a great increase
in data sizes. Taking the widely used trajectory datasets T-
drive [26] and GeoLife [27] as examples, enriching them with
semantics according to online map APIs [15], [28] leads to
10 × storage increase. Even worse, owing to the intricacy
involved in semantic distance computation, the calculation of
similarity between semantic trajectories entails higher com-
plexity. According to our experimental study, after incorpo-
rating semantic information into T-drive, the processing time
required for a random single query on a single machine
exceeds 2,000 seconds. As a result, semantic trajectories
may easily exceed the storage capacity and computational
capability of a single machine, necessitating a distributed
and scalable framework. However, most existing distributed
trajectory query processing approaches only support spatial-
aware distributed computation. They partition trajectory data
using methods such as IDs [29], round-robin [3], or techniques
based on separated trajectory points and Minimum Bounding
Rectangles (MBRs) [30], [31]. However, these methods are
unable to partition data based on the semantic information
of trajectories. As a result, most trajectories within the same
partition are semantically non-similar, leading to significant
redundant computation. On the other hand, existing semantic
trajectory query methods do not support large-scale distributed
processing and do not have pruning ability (i.e., partition
skipping) during partitioning. To address this challenge, we
propose a two-phase and computation-aware partitioning ar-
chitecture over Spark [32]. In contrast to existing trajec-
tory partitioning methods that typically use IDs [29], round-
robin [3], or methods based on separated trajectory points
and Minimum Bounding Rectangles (MBRs) [30], [33] to
perform direct trajectory partitioning, we leverage both the
fine-grained semantic and spatial features of trajectories, as
well as the inter-trajectory distances to enable two-phase and
computation-aware distributed partitioning.
Challenge III: How to accelerate the query process? With
the above computation-aware distributed data partitioning, the
input trajectory dataset is assigned to different partitions.
Given a query trajectory, it will be broadcast into relevant
partitions for parallel searching. This brings the idea of de-
signing an effective index in each partition for local pruning.
Existing studies typically build inverted indexes over keywords
and manage trajectories with multiple inverted files. In each
file, they build traditional trajectory indexes, including tree
indexes [34], [35] and grid indexes [36], [37]. However,
the traditional indexes cannot reduce the time complexity of
scanning a partition, but only reduce the number of evaluated
trajectories. To address this challenge, we extend the HNSW
index [38], which was proposed for vector retrieval, for
semantic trajectory similarity search. Specifically, in each data
partition, we organize trajectories into a nearest neighbor graph
by connecting trajectories that are close in spatial-semantic
distance with edges. Based on that, we build a semantic
trajectory oriented HNSW index, called ST-HNSW, where the
query starts at a node in the graph and iteratively traverses
its neighboring nodes. Besides, we also develop a series of
optimization techniques on ST-HNSW for further acceleration.
Overall, we revisit semantic trajectory similarity search
from semantic modeling, distributed partitioning, and opti-
mized query. And we make the following contributions.
• We propose an expressive semantic trajectory modeling
method. Then, we adopt a linear combination way to
calculate the spatial-semantic inter-trajectory similarities,
while supporting three trajectory distance measures.
• We build a distributed framework for semantic trajec-
tory similarity search, incorporating a novel two-phase,
computation-aware partitioning strategy, which enables
efficient query processing by allowing the pruning of
most irrelevant partitions when a query is executed.
• We develop the ST-HNSW index in each data partition for
local query acceleration. Besides, a series of optimization
techniques is designed to further speed up local queries.
• Extensive experiments on three large-scale datasets
demonstrate the effectiveness, efficiency, and scalability
of the proposed framework over competitors.
II. PRELIMINARIES
A. Problem Statement
When reviewing existing semantic trajectory representation
methods, there are individual-based forms [13], [14], [19],
[22], [39] and global-based forms [40], [41]. The definitions
of the two forms of semantic trajectory are as follows.
Definition 1. (Individual-based Semantic Trajectory) An
individual-based semantic trajectory is an ordered sequence
of points, i.e., τ = ⟨p
1
, p
2
, ..., p
n
⟩, where each point p
i
(1 ≤ i ≤ n) contains a location l
i
and an optional keyword
set K
i
. Here K
i
= {key
1
, key
2
, ...} is used to describe the
attributes of the corresponding location p
i
.
Definition 2. (Global-based Semantic Trajectory) A global-
based semantic trajectory is an ordered point sequence
associated with a description keyword set, i.e., τ =
{⟨p
1
, p
2
, ..., p
n
⟩, K
i
}, where K = {key
1
, key
2
, ...} is used to
describe the attributes of the whole trajectory.
Under the definition of semantic trajectories mentioned
above, STSS aims to find other similar trajectories in a
semantic trajectory dataset for a given query trajectory. The
definition is as follows.
Definition 3. (Semantic Trajectory Similarity Search,
STSS) Given a set of individual-based or global-based se-
mantic trajectories T , a query trajectory τ
q
, a positive integer
This article has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2025.3623108
© 2025 IEEE. All rights reserved, including rights for text and data mining and training of artificial intelligence and similar technologies. Personal use is permitted,
but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ZTE Corporation (One Dept). Downloaded on October 28,2025 at 06:50:47 UTC from IEEE Xplore. Restrictions apply.
评论