DocumentCode :
3717487
Title :
Using Word2Vec to process big text data
Author :
Long Ma;Yanqing Zhang
Author_Institution :
Computer Science Department, Georgia State University, Atlanta, Georgia
fYear :
2015
Firstpage :
2895
Lastpage :
2897
Abstract :
Big data refers to broad data sets used in many fields. Processing a huge data set is time-consuming work, not only because of its large volume but also because data types and structures can be varied and complex. Many data mining and machine learning techniques are currently applied to big data problems; some of them can construct good learning algorithms given many training examples. However, considering the data dimension, learning is more efficient if the algorithm is capable of selecting useful features or reducing the feature dimension. Word2Vec, proposed and supported by Google, is not an individual algorithm; it consists of two learning models, Continuous Bag of Words (CBOW) and Skip-gram. By feeding text data into one of these models, Word2Vec outputs word vectors that can represent a large piece of text or even an entire article. In our work, we first train on the data with a Word2Vec model and evaluate word similarity. In addition, we cluster similar words together and map the data onto the generated clusters, so that the data dimension is decreased.
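The pipeline the abstract describes (train a Word2Vec-style model, then measure cosine similarity between word vectors) can be sketched as follows. This is a minimal numpy implementation of the Skip-gram model with a softmax output layer; the toy corpus, window size, embedding dimension, and learning rate are illustrative assumptions, not the paper's actual setup or Google's optimized implementation.

```python
import numpy as np

np.random.seed(0)
corpus = "the quick brown fox jumps over the lazy dog the fox is quick".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8  # vocabulary size, embedding dimension (assumed small for the demo)

# Build (center, context) Skip-gram training pairs with a window of 2.
pairs = []
for i, w in enumerate(corpus):
    for j in range(max(0, i - 2), min(len(corpus), i + 3)):
        if j != i:
            pairs.append((w2i[w], w2i[corpus[j]]))

W_in = np.random.randn(V, D) * 0.1   # input matrix: row v is the vector for word v
W_out = np.random.randn(D, V) * 0.1  # output (prediction) weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.05
for epoch in range(200):
    for c, o in pairs:
        h = W_in[c]                  # hidden layer = the center word's vector
        p = softmax(h @ W_out)       # predicted distribution over context words
        p[o] -= 1.0                  # gradient of cross-entropy w.r.t. the logits
        grad_in = W_out @ p          # gradient for the center word's vector
        W_out -= lr * np.outer(h, p)
        W_in[c] -= lr * grad_in

def similarity(a, b):
    """Cosine similarity between two learned word vectors."""
    va, vb = W_in[w2i[a]], W_in[w2i[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(similarity("quick", "fox"))
```

The clustering step the abstract mentions would then run a clustering algorithm (e.g., k-means) over the rows of `W_in`, so that many words map to a small number of cluster IDs and the feature dimension shrinks accordingly.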
Keywords :
"Big data","Training data","Training","Clustering algorithms","Machine learning algorithms","Algorithm design and analysis","Data models"
Publisher :
ieee
Conference_Titel :
2015 IEEE International Conference on Big Data (Big Data)
Type :
conf
DOI :
10.1109/BigData.2015.7364114
Filename :
7364114