Extracting Structural Features of Computer Files Based on Statistical Analysis and Evaluation

Title in another language

Feature Extraction of Computer Files Structure by Statistical Analysis

Authors

Vafaei Jahan, Majid; Islamic Azad University of Mashhad - Computer Department

Number of pages

From page

To page

Keywords

Computer files, n-gram model, word clustering, autocorrelation coefficient, TF-IDF, entropy rate, Canberra distance, fractal dimension

Persian abstract

Files are the most important source for presenting information in various forms such as text, audio, images, web pages, and so on. Analyzing files in order to identify and examine their unique features and characteristics is one of the most important problems in privacy, information security, file-type identification, structural analysis of code, and related areas. In this paper, the various features and characteristics of a file are examined through statistical analysis of the binary content of files, based on the n-gram model. In addition, word clustering is used to reduce the computation and memory required by the n-gram model, and the content of each file is analyzed in two modes: complete and blocked. In the complete mode, features such as entropy, frequency, TF-IDF, and autocorrelation are examined; in the blocked mode, features such as entropy rate, fractal dimension, and distance are examined. The results show that the features extracted by the first method reflect the unique characteristics of jpg, mp3, swf, and html files well. The features extracted by the second method likewise reflect the characteristics of doc, html, and pdf files well.

Latin abstract

Files are the most important source of information, presented in various formats such as text, audio, video, images, and web pages. Analyzing files to recognize and investigate their unique properties is one of the most significant issues in privacy, information security, file-type identification, code structure analysis, and related areas. In this paper, statistical analysis of the binary content of files, based on the n-gram model, is used to investigate the different characteristics of a file. Moreover, word clustering is used to reduce the computation and memory required by the n-gram model. The content of each file is analyzed in two modes, "full" and "blocked". In the full mode, characteristics such as chi-square, autocorrelation, term frequency-inverse document frequency (TF-IDF), and fractal dimension are studied; in the blocked mode, properties such as the entropy rate and distance are examined. The results indicate that the characteristics extracted by the first method reflect the unique properties of jpg, mp3, swf, and html files well, and that those extracted by the second method reflect the properties of doc, html, and pdf files well.
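
The full-file and per-block analyses described in the abstract could be sketched roughly as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the 256-byte block size, and the use of raw n-gram counts as the vectors for the Canberra distance are all assumptions made here for the example.

```python
import math
from collections import Counter


def ngram_counts(data: bytes, n: int = 1) -> Counter:
    """Count overlapping byte n-grams in `data` (hypothetical helper)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def canberra_distance(p: Counter, q: Counter) -> float:
    """Canberra distance between two count vectors over the union of keys."""
    dist = 0.0
    for key in p.keys() | q.keys():
        a, b = p[key], q[key]  # a missing key counts as 0
        if a + b > 0:
            dist += abs(a - b) / (a + b)
    return dist


def block_entropies(data: bytes, block_size: int = 256, n: int = 1) -> list[float]:
    """Entropy of each fixed-size block (block size is an assumed parameter)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [shannon_entropy(ngram_counts(block, n)) for block in blocks]
```

In this sketch, a file read with `open(path, "rb").read()` would be passed to `ngram_counts` for the full mode, while `block_entropies` gives one value per block for the blocked mode; comparing the count vectors of two blocks or two files with `canberra_distance` is one possible reading of the "distance" feature.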

Publication year

1395

Journal title

Signal and Data Processing

PDF file

7329071

Link to this record

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=997139