Record number:
1122902
Article title:
A new one-sided feature selection method for classifying imbalanced text data
Title in another language:
A Novel One Sided Feature Selection Method for Imbalanced Text Classification
Authors:
Pouramini, Jafar (Payame Noor University, Tehran, Faculty of Engineering, Department of Information Technology Engineering); Minaei-Bidgoli, Behrouz (Iran University of Science and Technology, Tehran, School of Computer Engineering); Esmaeili, Mahdi (Islamic Azad University of Kashan, Faculty of Computer Engineering)
Number of pages:
20
From page:
21
To page:
40
Keywords:
Feature selection, filter method, imbalanced data, text classification
Persian abstract (translated):
An imbalanced data distribution degrades the performance of classifiers. The solutions proposed for this problem fall into several categories, among which sampling-based methods and algorithm-based methods are the most important. Feature selection has also attracted attention as a way to improve classification performance on imbalanced data. This paper presents a new one-sided feature selection method for imbalanced text classification. The proposed method uses the distribution of features to compute how indicative each feature is. To compare the performance of the proposed method, various feature selection methods were implemented, and the C4.5 decision tree and Naive Bayes were used for evaluation. Experimental results on the Reuters-21578 and WebKB corpora, in terms of the Micro-F, Macro-F, and G-mean measures, show that the proposed method improves classifier performance considerably compared to the other methods.
English abstract:
Imbalanced data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. Classification algorithms tend toward the majority class and may even treat minority-class instances as outliers. Text data is one of the areas where imbalance occurs. The amount of textual information is rapidly increasing in the form of books, reports, and papers, and fast, precise processing of this volume of information requires efficient automatic methods; text classification is one of the key processing tools. A further problem in text classification is high-dimensional data, which makes learning algorithms impractical, and the problem grows when the text data are also imbalanced. An imbalanced data distribution reduces the performance of classifiers. The solutions proposed for this problem fall into several categories, among which sampling-based and algorithm-based methods are the most important. Feature selection is also considered one of the solutions to the imbalance problem. In this research, a new one-sided feature selection method is presented for imbalanced data classification. The proposed method calculates the indicator rate of a feature using the feature's distribution: the documents are divided into parts based on whether they contain the feature and whether they belong to the positive class. According to this partition, a new feature selection method is suggested, using the following criteria. If a feature occurs in most positive-class documents, it is a good indicator of the positive class; therefore, it should receive a high score for this class.
This can be expressed as the proportion of positive-class documents that contain the feature. Likewise, if most of the documents containing the feature belong to the positive class, the feature should receive a high score as a class indicator; this can be expressed as the proportion of documents containing the feature that belong to the positive class. If most of the documents that do not contain the feature are not in the positive class, the feature should receive a high score as a representative of this class. Moreover, if most of the documents that are not in the positive class do not contain the feature, the feature should also receive a high score. Using the proposed method, each feature's score is computed; the features are then sorted in descending order of score, and the required number of features is selected from the top of the list. To evaluate the performance of the proposed method, different feature selection methods such as Gini, DFS, MI, and FAST were implemented, and the C4.5 decision tree and Naive Bayes were used for assessment. The results on the Reuters-21578 and WebKB corpora, in terms of the Micro-F, Macro-F, and G-mean criteria, show that the proposed method considerably improves the efficiency of the classifiers compared with the other methods.
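The four criteria in the abstract can be sketched from a per-feature contingency table (documents split by feature presence and positive-class membership). The abstract does not state how the four proportions are combined into a single score, so the product used below, and the helper names, are illustrative assumptions rather than the paper's exact formula:

```python
def one_sided_feature_score(tp, fp, fn, tn):
    """Illustrative one-sided score of a feature for the positive class.
    tp: positive-class documents containing the feature
    fp: non-positive documents containing the feature
    fn: positive-class documents lacking the feature
    tn: non-positive documents lacking the feature
    Combining the four proportions by product is an assumption; the
    paper's exact combination is not given in the abstract.
    """
    # Proportion of positive-class documents that contain the feature.
    p1 = tp / (tp + fn) if tp + fn else 0.0
    # Proportion of documents containing the feature that are positive.
    p2 = tp / (tp + fp) if tp + fp else 0.0
    # Proportion of documents lacking the feature that are non-positive.
    p3 = tn / (tn + fn) if tn + fn else 0.0
    # Proportion of non-positive documents that lack the feature.
    p4 = tn / (tn + fp) if tn + fp else 0.0
    return p1 * p2 * p3 * p4

def select_features(counts, k):
    """Sort features by score in descending order and keep the top k.
    `counts` maps feature name -> (tp, fp, fn, tn)."""
    ranked = sorted(counts,
                    key=lambda f: one_sided_feature_score(*counts[f]),
                    reverse=True)
    return ranked[:k]
```

A feature concentrated in the positive class (high tp and tn, low fp and fn) scores near 1, while a feature spread evenly across classes scores low, matching the ranking-then-truncation selection step described above.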
Year of publication:
1398 (Iranian calendar; 2019 CE)
Journal title:
Signal and Data Processing
PDF file:
7755288
Link to this document: