Entity Reconciliation in Big Data Sources: a Systematic Mapping Study

J.G. Enríquez, F.J. Domínguez-Mayo, M.J. Escalona, Margaret Ross, Geoff Staples

Research output: Contribution to journal › Article

Abstract

The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, which is full of big, open and heterogeneous data sources. The problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process whereby organizations can unify different, heterogeneous data sources in order to relate possible matches of non-obvious entities. Beyond the complexity introduced by the heterogeneity of the data sources, further difficulties such as the large number of records and differences among languages must be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conference papers and workshop papers published from 2010 to 2017 that address this problem, first to understand the state of the art, and then to identify any gaps in current research. Eleven digital libraries were analyzed following a systematic, semi-automatic and rigorous process that resulted in 61 primary studies, which represent a great variety of intelligent proposals that aim to solve ER. The conclusion is that most of the research focuses on the operational phase rather than the design phase, and most studies have been tested on real-world data sources, many of them heterogeneous, but only a few are applied in industry. There is a clear trend towards techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned.
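The clustering/blocking approach mentioned in the abstract can be illustrated with a minimal sketch. The records, field names (`id`, `name`), the token-based blocking key and the Jaccard similarity threshold below are all illustrative assumptions, not the method of any particular primary study: blocking restricts comparisons to records that share at least one name token, and a token-set Jaccard score then separates matches from near-misses.

```python
# Minimal entity-reconciliation sketch: blocking + Jaccard similarity.
# Toy records and field names are hypothetical; real ER pipelines use
# richer blocking keys, similarity measures and learned thresholds.

def tokens(s):
    """Lower-case token set of a name, with dots treated as separators."""
    return set(s.lower().replace(".", " ").split())

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def reconcile(source_a, source_b, threshold=0.5):
    # Blocking: index source_b records by their name tokens, so each
    # source_a record is compared only against candidates sharing a
    # token -- avoiding the quadratic all-pairs comparison.
    index = {}
    for rec in source_b:
        for t in tokens(rec["name"]):
            index.setdefault(t, set()).add(rec["id"])
    by_id = {rec["id"]: rec for rec in source_b}

    matches = []
    for ra in source_a:
        candidates = set()
        for t in tokens(ra["name"]):
            candidates |= index.get(t, set())
        for cid in sorted(candidates):
            if jaccard(ra["name"], by_id[cid]["name"]) >= threshold:
                matches.append((ra["id"], cid))
    return matches

# Two heterogeneous sources describing overlapping real-world entities.
a = [{"id": "a1", "name": "J. G. Enriquez"}, {"id": "a2", "name": "Margaret Ross"}]
b = [{"id": "b1", "name": "Enriquez J.G."}, {"id": "b2", "name": "Geoff Staples"}]
print(reconcile(a, b))  # a1 and b1 refer to the same person
```

The blocking index is what makes this viable at Big Data scale: only record pairs that share a token are ever scored, so most of the cross-product is never examined.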
Original language: English
Pages (from-to): 14-27
Journal: Expert Systems with Applications
Volume: 80
Issue number: 1
DOI: 10.1016/j.eswa.2017.03.010
Publication status: Published - 10 Mar 2017

Fingerprint

Digital libraries
Automation
Big data
Industry

Cite this

Enríquez, J. G., Domínguez-Mayo, F. J., Escalona, M. J., Ross, M., & Staples, G. (2017). Entity Reconciliation in Big Data Sources: a Systematic Mapping Study. Expert Systems with Applications, 80(1), 14-27. https://doi.org/10.1016/j.eswa.2017.03.010
Enríquez, J.G.; Domínguez-Mayo, F.J.; Escalona, M.J.; Ross, Margaret; Staples, Geoff. / Entity Reconciliation in Big Data Sources: a Systematic Mapping Study. In: Expert Systems with Applications. 2017; Vol. 80, No. 1. pp. 14-27.
@article{8d90928485e3420ba64b6ef88691be6a,
title = "Entity Reconciliation in Big Data Sources: a Systematic Mapping Study",
abstract = "The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, which is full of big, open and heterogeneous data sources. The problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process whereby organizations can unify different, heterogeneous data sources in order to relate possible matches of non-obvious entities. Beyond the complexity introduced by the heterogeneity of the data sources, further difficulties such as the large number of records and differences among languages must be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conference papers and workshop papers published from 2010 to 2017 that address this problem, first to understand the state of the art, and then to identify any gaps in current research. Eleven digital libraries were analyzed following a systematic, semi-automatic and rigorous process that resulted in 61 primary studies, which represent a great variety of intelligent proposals that aim to solve ER. The conclusion is that most of the research focuses on the operational phase rather than the design phase, and most studies have been tested on real-world data sources, many of them heterogeneous, but only a few are applied in industry. There is a clear trend towards techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned.",
author = "J.G. Enr{\'i}quez and F.J. Dom{\'i}nguez-Mayo and M.J. Escalona and Margaret Ross and Geoff Staples",
year = "2017",
month = "3",
day = "10",
doi = "10.1016/j.eswa.2017.03.010",
language = "English",
volume = "80",
pages = "14--27",
journal = "Expert Systems with Applications",
issn = "0957-4174",
publisher = "Elsevier Limited",
number = "1",

}

Enríquez, JG, Domínguez-Mayo, FJ, Escalona, MJ, Ross, M & Staples, G 2017, 'Entity Reconciliation in Big Data Sources: a Systematic Mapping Study', Expert Systems with Applications, vol. 80, no. 1, pp. 14-27. https://doi.org/10.1016/j.eswa.2017.03.010

Entity Reconciliation in Big Data Sources: a Systematic Mapping Study. / Enríquez, J.G.; Domínguez-Mayo, F.J.; Escalona, M.J.; Ross, Margaret; Staples, Geoff.

In: Expert Systems with Applications, Vol. 80, No. 1, 10.03.2017, p. 14-27.


TY - JOUR

T1 - Entity Reconciliation in Big Data Sources

T2 - a Systematic Mapping Study

AU - Enríquez, J.G.

AU - Domínguez-Mayo, F.J.

AU - Escalona, M.J.

AU - Ross, Margaret

AU - Staples, Geoff

PY - 2017/3/10

Y1 - 2017/3/10

N2 - The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, which is full of big, open and heterogeneous data sources. The problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process whereby organizations can unify different, heterogeneous data sources in order to relate possible matches of non-obvious entities. Beyond the complexity introduced by the heterogeneity of the data sources, further difficulties such as the large number of records and differences among languages must be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conference papers and workshop papers published from 2010 to 2017 that address this problem, first to understand the state of the art, and then to identify any gaps in current research. Eleven digital libraries were analyzed following a systematic, semi-automatic and rigorous process that resulted in 61 primary studies, which represent a great variety of intelligent proposals that aim to solve ER. The conclusion is that most of the research focuses on the operational phase rather than the design phase, and most studies have been tested on real-world data sources, many of them heterogeneous, but only a few are applied in industry. There is a clear trend towards techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned.

AB - The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, which is full of big, open and heterogeneous data sources. The problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process whereby organizations can unify different, heterogeneous data sources in order to relate possible matches of non-obvious entities. Beyond the complexity introduced by the heterogeneity of the data sources, further difficulties such as the large number of records and differences among languages must be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conference papers and workshop papers published from 2010 to 2017 that address this problem, first to understand the state of the art, and then to identify any gaps in current research. Eleven digital libraries were analyzed following a systematic, semi-automatic and rigorous process that resulted in 61 primary studies, which represent a great variety of intelligent proposals that aim to solve ER. The conclusion is that most of the research focuses on the operational phase rather than the design phase, and most studies have been tested on real-world data sources, many of them heterogeneous, but only a few are applied in industry. There is a clear trend towards techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned.

U2 - 10.1016/j.eswa.2017.03.010

DO - 10.1016/j.eswa.2017.03.010

M3 - Article

VL - 80

SP - 14

EP - 27

JO - Expert Systems with Applications

JF - Expert Systems with Applications

SN - 0957-4174

IS - 1

ER -