Title of article :
A Two-level Semi-supervised Clustering Technique for News Articles
Author/Authors :
Sadjadi, S.M (Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran); Mashayekhi, H (Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran); Hassanpour, H (Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran)
Pages :
10
From page :
2648
To page :
2657
Abstract :
The web and social media are flooded with news articles, in both volume and diversity. Document clustering is a widely used technique for organizing and managing data by dividing it into smaller groups. One factor that influences clustering quality is how documents are represented. Some traditional document representation methods depend on word frequencies and produce large, sparse document vectors. These methods cannot preserve proximity information between documents. Neural network-based methods do preserve proximity information, but they suffer from poor interpretability. Conceptual text representation methods overcome the shortcomings of both, yet semi-supervised text clustering does not currently use concept-based document representation. This paper presents a two-level semi-supervised text clustering method that uses labeled and unlabeled data simultaneously to achieve higher clustering quality. In the first level, documents are represented based on concepts extracted from the raw corpus. In the second level, the semi-supervised clustering process uses unlabeled data to capture the overall structure of the clusters and a small amount of labeled data to adjust the cluster centers. Experiments on the Reuters-21578 and BBC News data collections show that the proposed model outperforms other semi-supervised approaches in both text classification and text clustering.
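The second level described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a generic seeded k-means variant in which a few labeled vectors initialize and anchor the cluster centers while unlabeled vectors shape the clusters. The function names (`seeded_kmeans`, `nearest`, `mean`) and the toy 2-D vectors, which stand in for concept-based document vectors, are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def seeded_kmeans(unlabeled, labeled, n_iter=20):
    """Assumed seeded k-means, not the paper's exact procedure.

    labeled:   list of (vector, cluster_index) pairs
    unlabeled: list of vectors
    Assumes every cluster index 0..K-1 has at least one labeled example.
    """
    k = max(lbl for _, lbl in labeled) + 1
    # Initial centers come from the labeled examples only.
    centers = [mean([v for v, lbl in labeled if lbl == c]) for c in range(k)]
    assign = []
    for _ in range(n_iter):
        # Unlabeled points follow the nearest current center.
        new_assign = [nearest(v, centers) for v in unlabeled]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Recompute each center from its labeled and unlabeled members;
        # labeled points always stay in their known cluster.
        for c in range(k):
            members = [v for v, lbl in labeled if lbl == c]
            members += [v for v, a in zip(unlabeled, assign) if a == c]
            centers[c] = mean(members)
    return centers, assign


# Toy data: two well-separated groups, one labeled example per cluster.
labeled = [([0.0, 0.0], 0), ([10.0, 10.0], 1)]
unlabeled = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 9.0]]
centers, assign = seeded_kmeans(unlabeled, labeled)
print(assign)
```

In this sketch the labeled points act only as fixed members of their clusters; how the paper actually weights labeled against unlabeled data is not stated in the abstract.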
Farsi abstract (English translation) :
Web pages and social media are filled with news, in both volume and diversity. Document clustering is a useful technique that is widely used to organize and manage data by dividing it into smaller groups. One of the factors affecting clustering quality is how documents are represented. Some traditional document representation methods depend on word frequencies in the text and produce large, sparse document vectors. These methods cannot preserve proximity information between documents. In addition, neural network-based methods that preserve proximity information suffer from poor interpretability. Concept-based text representation methods overcome the shortcomings of earlier methods, but semi-supervised text clustering methods do not currently use concept-based document representation. This paper presents a concept-based semi-supervised clustering method for news texts that uses labeled and unlabeled data simultaneously to achieve higher clustering quality. In the first stage, documents are represented based on concepts extracted from the document collection. Then, the semi-supervised clustering process simultaneously applies unlabeled data to capture the overall structure of the clusters and a small amount of labeled data to adjust the cluster centers. Experiments on the Reuters-21578 and BBC News datasets show that the proposed model outperforms other semi-supervised methods in both text classification and text clustering.
Keywords :
Document clustering, News clustering, Two-level clustering, Semi-supervised, Word embedding
Journal title :
International Journal of Engineering
Serial Year :
2021
Record number :
2698698