Analyzing The Data Of Data Storage

INTRODUCTION

Duplicate data is defined as the existence of the same data in several records, a condition also known as redundancy. The definition admits different interpretations. A data warehouse contains voluminous data that is mined and analyzed to support better decision making. In any data warehouse, data comes from a large number of sources, and the result is growth in the volume of data and in its duplication. To clean the data, data preprocessing is performed; it includes data cleaning, data integration, data reduction, and so on, all of which attempt to clean the data and make the decision-making process much easier.

DUPLICATE DATA DETECTION

There are several ways to detect duplicate data. Two of those mentioned in the papers are:

1. Pre-duplicate record detection phase: Here, the data is standardized. Field values that represent the same data in different forms are converted to one specific format, so that duplicate entries are not erroneously designated as non-duplicate values. This is an inexpensive stage for identifying candidate duplicate entries, which are later used for comparison (a standardization sketch follows this list).

2. Detection using factors: The pre-duplicate record detection stage prepares the data, but the elimination step must still retain only one copy of the duplicate data and remove the rest. For this purpose, a threshold value is calculated for all the records, and a similarity threshold value is calculated for elimination purposes. All the possible pairs are selected from the clusters
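To make the first phase concrete, the following is a minimal standardization sketch in Python. The field names, record layout, and normalization rules (trimming, lower-casing, collapsing whitespace, stripping stray punctuation) are illustrative assumptions, not the exact procedure described in the papers.

```python
import re

def standardize_record(record):
    """Convert free-form field values to one common format so that
    duplicates are not missed because of formatting differences.
    The cleaning rules here are illustrative assumptions."""
    clean = {}
    for field, value in record.items():
        value = value.strip().lower()             # trim whitespace, unify case
        value = re.sub(r"\s+", " ", value)        # collapse repeated spaces
        value = re.sub(r"[^\w\s@.-]", "", value)  # drop stray punctuation
        clean[field] = value
    return clean

# Two records that refer to the same entity but differ only in formatting.
a = standardize_record({"name": "  John  SMITH ", "city": "New-York!"})
b = standardize_record({"name": "john smith",     "city": "new-york"})
print(a == b)  # True once both are in the same standard format
```

After standardization, records that describe the same entity compare equal field by field, which is what makes the later similarity comparison meaningful.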
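The second phase can likewise be sketched as a pairwise similarity comparison against a fixed threshold, keeping one copy of each duplicate group. The use of difflib.SequenceMatcher, the 0.85 threshold, and the comparison of every possible pair (rather than only pairs drawn from within each cluster, as the papers describe) are simplifying assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLD = 0.85  # assumed similarity threshold; tuned per data set in practice

def similarity(rec1, rec2):
    """Average field-by-field string similarity between two records."""
    scores = [SequenceMatcher(None, rec1[f], rec2[f]).ratio() for f in rec1]
    return sum(scores) / len(scores)

def eliminate_duplicates(records):
    """Compare all possible pairs and keep one copy of each duplicate group."""
    dropped = set()
    for i, j in combinations(range(len(records)), 2):
        if i in dropped or j in dropped:
            continue
        if similarity(records[i], records[j]) >= THRESHOLD:
            dropped.add(j)  # retain the first copy, remove the rest
    return [r for k, r in enumerate(records) if k not in dropped]

records = [
    {"name": "john smith",   "city": "new york"},
    {"name": "jon smith",    "city": "new york"},   # near-duplicate
    {"name": "ada lovelace", "city": "london"},
]
print(len(eliminate_duplicates(records)))  # 2 -- one copy of the duplicate kept
```

In a real pipeline, restricting comparisons to pairs drawn from within the same cluster keeps the otherwise quadratic number of pairs manageable.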
