• DocumentCode
    1717295
  • Title
    DOM tree based approach for Web content extraction
  • Author
    Mehta, Bhavdeep ; Narvekar, Meera
  • Author_Institution
    Dept. of Comput. Eng., D.J. Sanghvi Coll. of Eng., Mumbai, India
  • fYear
    2015
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The World Wide Web plays an important role in searching for information across the data network, and users are constantly exposed to an ever-growing flood of information. Our approach will help in finding the exact user-relevant content from multiple search engines, thus making the search more efficient and reliable. Our framework will extract the relevant result records using two approaches: a stored URL list and a run-time generated URL list. Finally, the unique set of records is displayed in a common framework's search result page. The extraction is performed using the concepts of the Document Object Model (DOM) tree. The paper introduces a threshold concept and data filters to detect and remove irrelevant and redundant data from the web page. The data filters will also be used to further improve the similarity check of data records. Our system will be able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages such as blogs, forums, and articles in a dynamic environment. Our approach shows significant advantages in both precision and recall.
  • Keywords
    Internet; Web sites; application program interfaces; content-based retrieval; information filters; DOM tree; Document Object Model; Web content extraction; Web page; data filters; information searching; run time generated URL list; stored URL list; threshold concept; Accuracy; Data mining; HTML; Information filtering; Noise measurement; Uniform resource locators; Web pages; Content extraction techniques; Information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communication, Information & Computing Technology (ICCICT), 2015 International Conference on
  • Conference_Location
    Mumbai
  • Print_ISBN
    978-1-4799-5521-3
  • Type
    conf
  • DOI
    10.1109/ICCICT.2015.7045706
  • Filename
    7045706