Title :
Mining biomedical data from hypertext documents
Author :
Salahuddin, Sazia ; Rahman, Rashedur M.
Author_Institution :
Dept. of Electr. Eng. & Comput. Sci., North South Univ., Dhaka, Bangladesh
Abstract :
Data mining is the process of discovering useful information in a database and analyzing the extracted information. Text mining uses many data mining techniques, but it deals primarily with unstructured data. Web mining is an extension of text mining since it also deals with unstructured data. Data mining finds data in “static databases” that contain “structured” data, whereas web mining works with data that are “dynamic” and “unstructured”. In this paper, our goal is to mine biomedical data from hypertext documents (e.g., mining data from web contents) using text mining techniques with the help of a “biomedical ontology”. Hypertext documents serve as the web data repositories. The text in hypertext documents is unstructured and is mixed with Hypertext Markup Language (HTML) tags, scripting languages, images, audio, video, URLs, etc. We collect a number of documents using the Google crawler, preprocess the hypertext documents, and extract the text data. Next, we identify whether a word is a biomedical entity by looking it up in a biomedical database, the “Unified Medical Language System (UMLS) Metathesaurus”. Biomedical entities are mapped from the Metathesaurus based on a keyword query. We then apply the result to re-rank the web documents and find the most relevant ones. We conclude that the more often a biomedical entity occurs in a page, the more relevant the page is, and thus we can re-rank the documents to find the most relevant ones using text mining techniques.
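The pipeline described in the abstract (strip markup, extract text, flag biomedical entities, re-rank by how often they occur) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `BIOMEDICAL_TERMS`, `TextExtractor`, `extract_text`, `count_entities`, and `rerank` are hypothetical, and the small term set is an assumed stand-in for a real UMLS Metathesaurus lookup.

```python
from html.parser import HTMLParser
import re

# Assumption: a tiny hand-picked term set standing in for a UMLS Metathesaurus lookup.
BIOMEDICAL_TERMS = {"insulin", "diabetes", "glucose"}


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Preprocessing step: return the text of an HTML page without tags or scripts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def count_entities(text, terms=BIOMEDICAL_TERMS):
    """Count how many words in the text match a known biomedical term."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in terms)


def rerank(pages, terms=BIOMEDICAL_TERMS):
    """Order page URLs by biomedical entity count, highest first.

    pages maps a URL to its raw HTML.
    """
    scores = {url: count_entities(extract_text(html), terms)
              for url, html in pages.items()}
    return sorted(scores, key=lambda url: scores[url], reverse=True)
```

Under these assumptions, a page that mentions the listed terms four times ranks above a page that mentions them once, and a page whose only match sits inside a script block ranks last, since script contents are discarded before counting.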
Keywords :
Internet; data mining; database management systems; hypermedia markup languages; medical information systems; ontologies (artificial intelligence); query processing; text analysis; Google crawler; URL; Web data repositories; Web documents; Web mining; biomedical data mining; biomedical ontology; extracted information analysis; hypertext documents; hypertext markup language tags; keyword query; scripting languages; static databases; text mining; unified medical language system metathesaurus; Communities; Engines; Unified modeling language; Biomedical ontology; Datamining; classification; document clustering; performance analysis;
Conference_Title :
Computer and Information Technology (ICCIT), 2011 14th International Conference on
Conference_Location :
Dhaka
Print_ISBN :
978-1-61284-907-2
DOI :
10.1109/ICCITechn.2011.6164825