
Panel (a): ChatGPT for table generation

Prompt: Please generate a table regarding "united states house of representatives elections, 2004 in Ohio", the attributes are ["district", "incumbent", "first elected", "candidates"], and the values under "district" column are ["Ohio's_1st_congressional_district", "Ohio's_2nd_congressional_district", "Ohio's_3rd_congressional_district"]. The values under other columns are missing, which you should fill in with your knowledge.

Response:
District | Incumbent | First Elected | Candidates
Ohio's_1st_congressional_district | Steve Chabot (R) | 1994 | Steve Chabot (R), Emily Johnson (D), Alex Smith (I)
Ohio's_2nd_congressional_district | Rob Portman (R) | 1993 (special) | Rob Portman (R), Sarah Davis (D), James White (I)
Ohio's_3rd_congressional_district | Mike Turner (R) | 2002 (special) | Mike Turner (R), Jennifer Lee (D), Robert Green (I)

Panel (b): Retrieved relevant data for verification (2004 United States House of Representatives Elections in Ohio)

Ohio's 1st Congressional District election, 2004
Party | Candidate | Votes | %
Republican | Steve Chabot | 173,430 | 59.83
Democratic | Greg Harris | 116,235 | 40.10
Independent | Rich Stevenson | 198 | 0.07

Ohio's 2nd Congressional District election, 2004
Party | Candidate | Votes | %
Republican | Rob Portman | 227,102 | 71.70
Democratic | Charles W. Sanders | 89,598 | 28.29
Independent | James J. Condit, Jr. | 60 | 0.02

Figure 1: (a) ChatGPT generates values in tuples; (b) Fusionery retrieves relevant data from third-party sources for verification and aggregates conflicts to provide reliable data.
To avoid the above dilemmas, this paper proposes the on-demand fusion query, which resolves between-source conflicts using only the query-related data and avoids accessing all the data in a centralised data management system. The advantages of the on-demand fusion query can be summarized in three aspects. (I) Real-time data fusion. It utilizes only the query-related data, which commonly makes up a small proportion of the data in a centralised data management system and can therefore be processed in real time. (II) Adaptive to data updates. Both the query step and the data fusion step of the on-demand fusion query can be completed in real time; thus, it can adapt to frequent data updates. (III) Free of data matching. By matching data from various sources with users' intents (i.e., queries), on-demand fusion queries effectively sidestep the need for explicit across-source data matching. This brings two benefits: (1) well-defined query constraints provide clear match criteria; (2) many-to-many comparisons in across-source data matching are reduced to one-to-many comparisons, lowering the time complexity. Despite the progress made by a few studies [23, 42, 51], two challenges remain in developing on-demand fusion queries over heterogeneous multi-source data.
Challenge I: How to support unified queries across multi-source heterogeneous data? Due to the data type heterogeneity (i.e., structured, semi-structured, and unstructured data) and the semantic heterogeneity (i.e., different sources involve different vocabularies) of heterogeneous multi-source data, there is still no proven solution for unified queries across multi-source heterogeneous data.
To solve data type heterogeneity, we convert heterogeneous data into knowledge graphs and formulate querying as a knowledge graph matching problem. Due to the richness of semantic information on both nodes and edges, knowledge graph matching is much more complex than plain graph matching. Specifically, the search space of knowledge graph matching is exponential in the scale of the knowledge graph. Given a query graph with $|V_q|$ nodes and $|R_q|$ edges and a data graph with $|V_d|$ nodes and $|R_d|$ edges, and taking the simplest solution, BFS, as an example, the time complexity is $O((|V_q| + |R_q|)(|V_d| + |R_d|))$ in the best case, which is infeasible in practice. To speed up knowledge graph matching, we introduce a knowledge line graph transformation to decouple semantic information from graph structure, reducing the time complexity of graph matching to $O(|R_q||R_d|)$. To solve semantic heterogeneity, we focus on approximate matching over semantic information encoded by pre-trained language models, which excel at capturing semantic relations between words. For example, they can capture similar meanings for different terms, such as "spouse", "wife" and "husband"; meanwhile, they can also distinguish different meanings for similar words, such as "Apple Inc" and "Big Apple".
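As a rough illustration of the two ideas above, the following toy Python sketch treats each triple (edge) as a unit and compares every query triple against every data triple, which yields $|R_q||R_d|$ comparisons. It is not Fusionery's implementation: `line_graph` uses the classical line-graph construction as an assumed stand-in for the knowledge line graph transformation, `match_relations` is a hypothetical helper, and `toy_similarity` is a hand-written lookup with made-up scores that stands in for similarities derived from a pre-trained language model.

```python
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation label, tail entity)

# Made-up similarity scores; a real system would compare embeddings instead.
_TOY_SIM = {
    frozenset(("spouse", "wife")): 0.9,
    frozenset(("spouse", "husband")): 0.9,
    frozenset(("wife", "husband")): 0.8,
}


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a language-model similarity between labels."""
    if a == b:
        return 1.0
    return _TOY_SIM.get(frozenset((a, b)), 0.0)


def line_graph(triples: List[Triple]) -> Dict[int, Set[int]]:
    """Classical line graph: each triple becomes a node, and two triples
    are adjacent when they share at least one entity."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(triples))}
    for i in range(len(triples)):
        for j in range(i + 1, len(triples)):
            ents_i = {triples[i][0], triples[i][2]}
            ents_j = {triples[j][0], triples[j][2]}
            if ents_i & ents_j:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def match_relations(
    query: List[Triple],
    data: List[Triple],
    sim: Callable[[str, str], float],
    threshold: float,
) -> Tuple[List[Tuple[int, int]], int]:
    """Compare every query triple with every data triple by relation label.
    Returns the matched index pairs and the number of comparisons made."""
    matches = []
    comparisons = 0
    for qi, di in product(range(len(query)), range(len(data))):
        comparisons += 1
        if sim(query[qi][1], data[di][1]) >= threshold:
            matches.append((qi, di))
    return matches, comparisons


# Toy data with placeholder entities.
query = [("?x", "spouse", "?y")]
data = [("alice", "wife", "bob"), ("bob", "works_at", "acme")]
matches, n = match_relations(query, data, toy_similarity, threshold=0.85)
print(matches, n)
print(line_graph(data))
```

In this sketch the threshold controls how loosely relation labels may match: lowering it admits more candidates, raising it admits fewer. A full matcher would also have to check that matched triples are connected consistently, which the adjacency returned by `line_graph` could support.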
Challenge II: How to perform high-quality data fusion in the on-demand setting? The performance of data fusion is highly dependent on data-hungry probability estimations. In the on-demand setting, we only have a small amount of query-related data; thus, it is necessary to cope with the data starvation of data fusion and develop a novel on-demand data fusion method.
To this end, we develop an Expectation Maximization (EM)-style learning strategy that consists of two steps. (i) The data veracity estimation learns the probability that a data item is a correct answer to the query and (ii) the source trustworthiness estimation learns the probability that a data source provides the correct data. The two steps are repeated iteratively until convergence. Besides, considering that observed data is limited, we propose an incremental estimation for source trustworthiness based on the historical estimate and the current query results. Furthermore, to improve the effectiveness and efficiency of data fusion, we design an autonomous semantic matching threshold update mechanism to strike a balance between retrieval precision and recall.
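The alternating scheme described above can be sketched in Python as follows. This is a minimal heuristic under stated assumptions, not the estimator used by Fusionery: veracity is taken to be a normalized trust-weighted vote, trustworthiness is the mean veracity of the values a source provides, and the incremental estimate is a simple convex blend of the historical trust and the current query's evidence. The function `fuse_claims` and the parameters `default_trust` and `history_weight` are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def fuse_claims(
    claims: List[Tuple[str, str]],
    prior_trust: Optional[Dict[str, float]] = None,
    default_trust: float = 0.8,
    history_weight: float = 0.5,
    max_iter: int = 50,
    tol: float = 1e-6,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """claims: (source, value) pairs answering one query.
    Returns (veracity per value, trust per source)."""
    sources = {s for s, _ in claims}
    # Historical trust estimates; unseen sources get an assumed default.
    prior = {s: (prior_trust or {}).get(s, default_trust) for s in sources}
    trust = dict(prior)
    veracity: Dict[str, float] = {}

    for _ in range(max_iter):
        # Step (i): data veracity as a normalized trust-weighted vote.
        score: Dict[str, float] = defaultdict(float)
        for s, v in claims:
            score[v] += trust[s]
        total = sum(score.values()) or 1.0  # guard against all-zero trust
        veracity = {v: sc / total for v, sc in score.items()}

        # Step (ii): source trust from the veracity of what each source
        # claimed, blended with the historical estimate (assumed form).
        provided: Dict[str, List[float]] = defaultdict(list)
        for s, v in claims:
            provided[s].append(veracity[v])
        new_trust = {
            s: history_weight * prior[s]
            + (1.0 - history_weight) * sum(p) / len(p)
            for s, p in provided.items()
        }

        # Stop once trust values no longer change noticeably.
        delta = max(abs(new_trust[s] - trust[s]) for s in sources)
        trust = new_trust
        if delta < tol:
            break

    return veracity, trust


# Toy example with placeholder sources and values.
claims = [("s1", "value_a"), ("s2", "value_a"), ("s3", "value_b")]
veracity, trust = fuse_claims(claims)
print(veracity)
print(trust)
```

Blending with the historical estimate keeps a source's trust from swinging on the handful of claims a single query returns, which is one plausible way to cope with limited observed data in the on-demand setting.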
Incorporating optimization strategies that address the challenges mentioned above, we propose Fusionery, an efficient framework for on-demand fusion queries over heterogeneous data.
1.1 Motivating Example
There are several potential applications for Fusionery, such as retrieval-based data cleaning [3, 12] and verified generative AI [43]. Here, we reinforce the motivations for Fusionery by illustrating a real-world application in the realm of verified generative AI.