A Fast Filtering Scheme for Large Database Cleansing

dc.contributor.author Sung Sam Y
dc.contributor.author Li Zhao
dc.contributor.author Sun Peng
dc.date.accessioned 2018-01-22T17:24:54Z
dc.date.available 2018-01-22T17:24:54Z
dc.date.issued 2002
dc.identifier.uri http://hdl.handle.net/123456789/6950
dc.description.abstract Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are now common, it is necessary to reduce the cleansing time. Data cleansing consists of two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time for each comparison. Based on TI-Similarity, we propose a new detection method, RAR, to further reduce the number of comparisons. With RAR and TI-Similarity, our new approach for cleansing large databases is composed of two processes: a filtering process and a pruning process. In the filtering process, a fast scan of the database is carried out with RAR and TI-Similarity. This process guarantees the detection of potential duplicate records but may introduce false positives. In the pruning process, the duplicate result from the filtering process is pruned to eliminate the false positives using more trustworthy comparison methods. The performance study shows that our approach is efficient and scalable for cleansing large databases, and is about an order of magnitude faster than existing cleansing methods.
dc.format application/pdf
dc.title A Fast Filtering Scheme for Large Database Cleansing
dc.type journal-article
dc.source.journal CIKM'02
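
The two-process structure described in the abstract (a fast filtering pass that never misses a potential duplicate, followed by a pruning pass that removes false positives) can be illustrated with a minimal sketch. This record does not define TI-Similarity or RAR, so the sketch does not implement either: as a stand-in assumption, the cheap comparison is the string length difference (a known lower bound on edit distance), the detection step is a plain all-pairs scan, and the "more trustworthy" comparison is Levenshtein distance. All function names and the `max_dist` threshold are hypothetical.

```python
from itertools import combinations


def cheap_lower_bound(a: str, b: str) -> int:
    """Length difference: a cheap lower bound on edit distance.

    Stand-in for a fast comparison method; NOT the paper's TI-Similarity.
    """
    return abs(len(a) - len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def filter_candidates(records, max_dist):
    """Filtering pass: keep every pair the cheap bound cannot rule out.

    Because the bound never exceeds the true edit distance, no true
    duplicate pair is dropped here, but false positives can remain.
    An all-pairs scan is used for simplicity; it is NOT the paper's RAR.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cheap_lower_bound(records[i], records[j]) <= max_dist
    ]


def prune_candidates(records, candidates, max_dist):
    """Pruning pass: confirm each candidate with the expensive comparison."""
    return [
        (i, j)
        for i, j in candidates
        if edit_distance(records[i], records[j]) <= max_dist
    ]


def find_duplicates(records, max_dist=2):
    """Run the filtering pass, then prune its false positives."""
    candidates = filter_candidates(records, max_dist)
    return prune_candidates(records, candidates, max_dist)


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Jane Smyth", "J. Smith Jr., Esq."]
    print(filter_candidates(names, 2))  # candidate pairs after filtering
    print(find_duplicates(names, 2))    # pairs that survive pruning
```

In this sketch the saving comes from running the expensive comparison only on pairs that survive the filter. Any cheap test can serve as the filter as long as it never rejects a pair that the trustworthy comparison would accept.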

