Keywords:
author gender identification, random forest, naïve Bayes tree, text mining, classification
Abstract (translated from Persian):
Today, users' heavy use of virtual environments and their communication through social networks such as Facebook and Twitter have made it more necessary than before to examine the content available online. Because most information exchange in cyberspace takes place through text, identifying user attributes such as age, gender, and religious and political beliefs from Internet texts is of great importance. Gender identification can be useful in the fields of security and marketing. This paper addresses identifying the gender of blog authors, using syntactic features, word-based features, character-based features, and function words. In addition, the results show that character n-gram features are very effective in improving performance. For the classification step, a new method called Bayesian Random Forest is proposed. Experimental results show that this method gives better results than algorithms such as Naïve Bayes, Naïve Bayes Tree, and Random Forest, and raises classification accuracy to 89.5%.
Abstract (English):
The heavy use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it more necessary than ever to examine the content shared in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users, in fields ranging from personalized advertising to law enforcement and reputation management. Text posts make up a large portion of user-generated content and contain information that can help uncover undisclosed user attributes or check the honesty of self-reported age and gender. Because most information exchange takes place as text, identifying author attributes such as age, gender, and political and religious opinions from this content is increasingly important. Gender identification, which can be useful in security and marketing, addresses the following question: given a short text document, can we identify whether the author is male or female? This question is motivated by recent events in which people faked their gender on the Internet. This paper investigates author gender identification on blog data. Four groups of features are employed: syntactic features, word-based features, character-based features, and function words. In addition, character n-gram features are used to improve classification accuracy. To evaluate the proposed method, 3212 texts were collected from Technorati.com and blogger.com. Experimental results demonstrate that these types of features are practical. Furthermore, a new classification method called "Bayesian Random Forest" is introduced, in which each tree is a Bayes tree. The experimental results show that this method outperforms other classification algorithms such as Naïve Bayes, Naïve Bayes Tree, and Random Forest, and that it increases the accuracy of gender identification to 89.5%.
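
As a rough illustration of two ideas mentioned in the abstract, character n-gram features and an ensemble of Bayesian classifiers, the following minimal sketch may help. It is not the paper's Bayesian Random Forest: the abstract states that each tree is a Bayes tree, whereas this sketch substitutes plain multinomial naive Bayes models trained on bootstrap resamples and combined by majority vote. All names (`char_ngrams`, `NaiveBayes`, `bagged_predict`), the trigram length, and the smoothing choice are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


class NaiveBayes:
    """Multinomial naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, docs, labels, n=3):
        self.n = n
        self.counts = {}                    # label -> Counter of n-gram counts
        self.doc_counts = Counter(labels)   # label -> number of documents
        self.vocab = set()
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        query = char_ngrams(doc, self.n)
        best_label, best_score = None, -math.inf
        for label in sorted(self.counts):
            grams = self.counts[label]
            denom = sum(grams.values()) + len(self.vocab)
            # log prior plus smoothed log likelihood of each query n-gram
            score = math.log(self.doc_counts[label] / total_docs)
            for gram, freq in query.items():
                score += freq * math.log((grams[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def bagged_predict(docs, labels, query, n_models=5, seed=0):
    """Majority vote of naive Bayes models fit on bootstrap resamples.

    Assumption: a simplified stand-in for an ensemble of Bayes trees.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        idx = [rng.randrange(len(docs)) for _ in docs]
        model = NaiveBayes().fit([docs[i] for i in idx],
                                 [labels[i] for i in idx])
        votes[model.predict(query)] += 1
    return votes.most_common(1)[0][0]
```

In this sketch, `bagged_predict(docs, labels, query)` takes a list of training texts, their gender labels, and a new text, and returns the label chosen by most of the resampled models. Syntactic, word-based, and function-word features, which the paper also uses, are left out here.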