A Fast Filtering Scheme for Large Database Cleansing

Sung Sam Y; Li Zhao; Sun Peng

dc.contributor.author	Sung Sam Y
dc.contributor.author	Li Zhao
dc.contributor.author	Sun Peng
dc.date.accessioned	2018-01-22T17:24:54Z
dc.date.available	2018-01-22T17:24:54Z
dc.date.issued	2002
dc.identifier.uri	http://hdl.handle.net/123456789/6950
dc.description.abstract	Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are now common, it is necessary to reduce the cleansing time. Data cleansing consists of two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time for each comparison. Based on TI-Similarity, we propose a new detection method, RAR, to further reduce the number of comparisons. With RAR and TI-Similarity, our new approach for cleansing large databases is composed of two processes: a filtering process and a pruning process. In the filtering process, a fast scan of the database is carried out with RAR and TI-Similarity. This process guarantees the detection of potential duplicate records but may introduce false positives. In the pruning process, the duplicate result from the filtering process is pruned to eliminate the false positives using more trustworthy comparison methods. The performance study shows that our approach is efficient and scalable for cleansing large databases, and is about an order of magnitude faster than existing cleansing methods.
dc.format	application/pdf
dc.title	A Fast Filtering Scheme for Large Database Cleansing
dc.type	journal-article
dc.source.journal	CIKM'02
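
The two-process structure described in the abstract (a fast filtering scan that may admit false positives, followed by a pruning pass with a more trustworthy comparison) can be illustrated with a minimal Python sketch. This record does not define TI-Similarity or RAR, so the sketch does not implement them. As labeled stand-ins it uses token-set Jaccard overlap for the cheap comparison, a sorted sliding window for detection, and the standard library's difflib.SequenceMatcher for the slower comparison. Unlike the filtering process in the abstract, this stand-in filter does not guarantee that every duplicate is detected. All function names, thresholds, and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def cheap_similarity(a: str, b: str) -> float:
    """Fast stand-in comparison (NOT TI-Similarity): Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def careful_similarity(a: str, b: str) -> float:
    """Slower stand-in for a 'more trustworthy' comparison: character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_candidates(records, window=3, loose=0.2):
    """Filtering process (stand-in for RAR): sort the records, then compare each
    one with its next window-1 neighbours using the cheap comparison and a
    deliberately loose threshold, so false positives are expected."""
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    candidates = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if cheap_similarity(records[i], records[j]) >= loose:
                candidates.append((min(i, j), max(i, j)))
    return candidates


def prune(records, candidates, strict=0.8):
    """Pruning process: re-check only the candidate pairs with the slower
    comparison and a stricter threshold to discard false positives."""
    return [
        (i, j)
        for i, j in candidates
        if careful_similarity(records[i], records[j]) >= strict
    ]


# Hypothetical sample data: records 0 and 1 are the same entry typed two ways.
records = [
    "John Smith 12 Oak Street",
    "John Smith 12 Oak St",
    "John Smith 99 Pine Road",
    "Mary Jones 4 Elm Avenue",
]
candidates = filter_candidates(records)
duplicates = prune(records, candidates)
print("candidates after filtering:", candidates)
print("duplicates after pruning:", duplicates)
```

The point of the split is that the expensive comparison runs only on the pairs that survive the cheap scan. In this toy data the loose pass also keeps pairs that merely share a name, and the strict pass is what removes them.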