4362 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 37, NO. 7, JULY 2025
noises by DBSCAN in Fig. 1(a) strongly suggest the need to focus
on refining C2. (See more examples in Fig. 8 in Section VI).
In this study, we propose to repair the inaccurate data points
during clustering. For example, as shown in Fig. 1(b),anarrow
(a → b) denotes that a point is repaired from location a to
location b. Since points are concentrated after repairing, two
clusters are formed in Fig. 1(c).
The problem of clustering with repairing, however, is non-
trivial. Simply repairing noise points to the closest clusters is
not sufficient, e.g., repairing all the noise points to C1 in Fig. 1
does not help in identifying the second cluster C2. It should
be considered that dirty points may possibly form clusters with
repairing (i.e., C2).
Clustering and cleaning data provide significant benefits in
various real-world applications. For example, in personal at-
tribute analysis [26], cleaning GPS data improves clustering
accuracy, enhancing the interpretation of regional studies. In
healthcare [20], cleaning GPS data enhances the clustering of
lifelog data, ensuring better accuracy in identifying individuals’
activity patterns. In location-based services [31], cleaning GPS
data ensures correct coordinates, which is crucial for accurate
clustering in tasks such as attendance tracking and urban plan-
ning. As shown in Figs. 9 and 10 of application study, cleaning
enhances the accuracy and reliability of location-based services
like attendance tracking.
B. Contribution
Our proposed DORC techniques complement existing repair
methods. Compared to these methods (details in Section II),
DORC has two major advantages: (1) it does not rely on external
knowledge of integrity constraints or rules, and (2) it utilizes
the density information embedded within the data, which many
methods overlook.
This paper focuses on the grid-based
GDORC algorithm, which
simultaneously repairs and clusters with a constant factor ap-
proximation. Compared to
QDORC and LDORC [29], GDORC
achieves comparable accuracy but with reduced time, as it
searches data cells instead of individual points.
Our major contributions in this paper are summarized as:
1) We formalize the
DORC problem of simultaneous cluster-
ing and repairing (presented in Section III). In particular,
no additional parameters are introduced for
DORC besides
the density and distance thresholds η and ε for clustering.
2) We provide the hardness analysis and tractable special
case for the
DORC problem, and formulate DORC as an ILP
problem (detailed in Section IV).
3) We devise a constant factor approximation method,
GDORC, based on a grid that decomposes the data space
into cells (described in Section V). The correctness and the
complexity of the algorithm are analyzed in Propositions 6
and 7, respectively. Furthermore, the approximation ratio
of
GDORC is examined in Proposition 8.
4) We conduct extensive experiments on real datasets (illus-
trated in Section VI ). The results demonstrate that our
proposed
GDORC approach has both high clustering and
repairing performances, while keeping time costs low.
Full proofs of major results can be found at [18].
II. R
ELATED WORK
A. Clustering
Density-based clustering is widely used (see [17]). Given
a distance threshold ε and a minimum neighborhood size η
(MinPts),
DBSCAN [7] categorizes points into cores, borders,
and noise. In this study, we adhere to these established set-
tings. Numerous algorithms elaborate on the concept of
DB-
SCAN. Various versions of DBSCAN have been developed, in-
cluding the incremental version [6], the distributed version NG-
DBSCAN [21] and DBSCAN-MS [34], and the dynamic ver-
sion [9].
DBSCAN++ [13] aims to reduce computational cost by
estimating densities for subsets of points.
DBSVEC [32] integrates
support vectors to reduce unnecessary range queries, enhancing
efficiency.
ANYDBC [22] compresses data into subsets and labels
objects based on connected components to reduce time cost.
DBSCAN-DIST [1] combines the concepts of density and minimax
distance to formalize density-based clustering by a loss function.
Several methods focus on stream data. DISC [16] intro-
duces a density-based incremental strategy with elaborations
on points’ types, including ex-cores and neo-cores. D
ENFOR-
EST [15] presents a concept of ”nostalgic core” and uses spanning
trees for better cluster quality and efficiency.
All these studies focus primarily on identifying noise/non-
noise points, while the large amount of dirty data (identified as
noises) are still not employed to form clusters. In contrast, our
proposal considers cleaning and clustering with dirty data. While
outlier detection (see [12] for a survey) identifies dirty points,
data repairing further modifies these points for correction. In-
deed, our proposal incorporates density-based
DBSCAN.Inthis
sense, data repairing and outlier detection are complementary.
B. Repairing
Besides the minimum modification model [33], which we
also adopt, t he deletion model [4] is commonly used to identify
the minimum removal of dirty data. In this context,
DBSCAN
can be seen as a deletion-based technique, eliminating dirty
data. However, simply discarding large amounts of dirty data
as noise using existing clustering methods is not ideal and can
significantly affect clustering results.
Recent data repair methods, such as
HOLOCLEAN [5], IMR [35],
RSR [19], and MISC [28], address data errors in various ways.
H
OLOCLEAN integrates multiple data sources to detect and repair
anomalies, but relies on external data and integrity constraints.
IMR is an incremental regression method for time-series data,
which is less effective for static data. RSR uses regular ex-
pressions to repair s equence and token values, but requires
predefined rules.
Our study, as mentioned in the introduction, focuses on the
density information within the data, without relying on ex-
ternal knowledge or constraints. Our approach complements
these techniques when additional information, such as rules,
is available. MISC repairs data errors on selected attributes
but is less suited for datasets like GPS data, where errors im-
pact both longitude and latitude simultaneously. Therefore, we
perform full-feature repairs directly on GPS data, where MISC’s
advantages are less applicable.
Authorized licensed use limited to: Tsinghua University. Downloaded on July 28,2025 at 07:57:48 UTC from IEEE Xplore. Restrictions apply.
评论