DocumentCode
628151
Title
Terms extraction from unstructured data silos
Author
Lomotey, Richard K. ; Deters, Ralph
Author_Institution
Dept. of Comput. Sci., Univ. of Saskatchewan, Saskatoon, SK, Canada
fYear
2013
fDate
2-6 June 2013
Firstpage
19
Lastpage
24
Abstract
The major challenge that the big data era brings to the services computing landscape is the debris of unstructured data. This high-dimensional data is in heterogeneous formats, is schemaless, and in some cases requires multiple storage APIs. This situation has made it almost impractical to apply existing data mining techniques, which are designed for schema-based data sources, in a knowledge discovery in databases (KDD) process. In this paper, a tool called TouchR is proposed which algorithmically relies on the Hidden Markov Model (HMM) to extract terms from data silos, specifically distributed NoSQL databases, which we model as a network graph. Our use case graph consists of storage nodes such as CouchDB, Neo4J, and DynamoDB. The evaluation of TouchR shows high accuracy for terms extraction and organization.
Keywords
SQL; data mining; distributed databases; document handling; graph theory; hidden Markov models; network theory (graphs); API; CouchDB; DynamoDB; HMM; KDD process; Neo4J; TouchR tool; data mining techniques; distributed NoSQL database; heterogeneous-schemaless high-dimensional data; hidden Markov model; knowledge discovery-in-database process; network graph; schema-based data sources; storage nodes; term extraction; term organization; unstructured data silos; Data mining; Dictionaries; Distributed databases; Feature extraction; Hidden Markov models; Mathematical model; Hidden Markov Model (HMM); NoSQL; Unstructured data mining; big data; terms extraction;
fLanguage
English
Publisher
IEEE
Conference_Titel
System of Systems Engineering (SoSE), 2013 8th International Conference on
Conference_Location
Maui, HI
Print_ISBN
978-1-4673-5596-4
Type
conf
DOI
10.1109/SYSoSE.2013.6575236
Filename
6575236