A Heuristic Approach To Record Deduplication

  • Unique Paper ID: 142476
  • Volume: 2
  • Issue: 2
  • PageNo: 148-154
  • Abstract:
  • Databases and related technologies play a major role in the growing use of computers. Many global data repositories collect data from multiple sources, which increases the likelihood of duplicate records. Duplicates in a database typically result from misleading words and differing writing styles. Duplicate records degrade system performance because retrieving the correct, relevant data takes longer. Clean, replica-free repositories allow retrieval of higher-quality information. Record deduplication is the process of identifying and removing duplicates in a database. Existing approaches to designing a deduplication function include the domain-knowledge approach, the probabilistic approach, and the machine-learning approach. These approaches require human judgment and substantial computation time. To address this problem, this project proposes a model that uses genetic programming to design the deduplication function for identifying duplicate records in a data repository. Genetic Programming (GP) is a heuristic approach that automatically suggests a deduplication function based on the evidence present in the data repository. The deduplication function predicts whether two records are duplicates. Its main goal is to avoid the problems caused by duplicate values in the database. The proposed model uses the Jaro-Winkler similarity function to compute the similarity between records.
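
The Jaro-Winkler similarity named in the abstract can be illustrated with a minimal sketch. This is a textbook formulation, not the authors' implementation: the function names, the prefix scale p = 0.1, the prefix cap of 4, and the fixed threshold in `likely_duplicate` are all assumptions made for illustration. In the proposed model the decision function is evolved by genetic programming rather than set by hand.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults p=0.1, cap 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def likely_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Hypothetical fixed-threshold rule, used here only to show the score in use."""
    return jaro_winkler(a, b) >= threshold


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(likely_duplicate("MARTHA", "MARHTA"))
```

For the classic pair "MARTHA" and "MARHTA", all six characters match with one transposition, giving a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961 after the three-character prefix boost.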

Cite This Article

  • ISSN: 2349-6002
