Regional Information Center for Science and Technology - How valuable is your data? A quantitative approach using data mining

Abstract:

Unstructured textual data has grown rapidly over the past two decades in domains such as the enterprise, the web, and science. With such a surfeit of data, a question arises naturally: how valuable is one piece of data compared to another? In an enterprise, the answer to this question would determine how valuable that data is to the enterprise. In this paper, we build a framework using data mining that quantifies the value of data. We first identify a specific notion of "value" that is motivated by applications in Enterprise unstructured Information Management (EIM). Namely, we posit that for several applications in EIM, the value of unstructured data is determined by the associations it captures between concepts. The more such associations the data contains, the more valuable it is. Next, we build a framework using data mining that "counts" the number of associations in data. Our framework uses clustering and frequent itemsets, and it normalizes for data size. We demonstrate our approach on two of the most widely used text benchmark datasets: Reuters and 20 Newsgroups. Our general intuition is that a corpus of professionally written news articles is more valuable (in the sense of capturing more associations between concepts) than a corpus of newsgroup postings of variable quality written by non-experts. Our quantitative approach indeed reaches the same inference.
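The idea of "counting" associations and normalizing for data size can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the algorithm's details, so the sketch assumes that an association is a pair of terms co-occurring in at least `min_support` documents (a frequent 2-itemset), that normalization means dividing by the number of documents, and it omits the clustering step entirely. The function names `frequent_pairs` and `association_score` and the toy corpora are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support):
    """Return term pairs that co-occur in at least min_support documents.

    Assumption: each document is treated as a set of whitespace-separated
    terms, and a frequent 2-itemset stands in for an "association".
    """
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def association_score(docs, min_support=2):
    """Frequent pairs per document (a hypothetical size normalization)."""
    if not docs:
        return 0.0
    return len(frequent_pairs(docs, min_support)) / len(docs)


# Toy corpora (invented for illustration, not drawn from the benchmarks).
corpus_a = [
    "oil prices rise opec",
    "opec cuts oil output",
    "oil prices fall opec",
]
corpus_b = [
    "anyone selling a bike",
    "my modem is broken",
    "best pizza in town",
]

score_a = association_score(corpus_a)
score_b = association_score(corpus_b)
print(score_a, score_b)
```

Under these assumptions, the topically focused toy corpus scores higher than the scattered one because its documents repeat the same term pairs. A fuller version would likely cluster documents first and mine itemsets within each cluster, as the abstract indicates.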