  • DocumentCode
    2789311
  • Title
    An improved random forest approach for detection of hidden web search interfaces
  • Author
    Deng, Xiao-bai ; Ye, Yun-ming ; Li, Hong-bo ; Huang, Joshua Zhexue
  • Author_Institution
    Shenzhen Grad. Sch., Dept. of Comput. Sci., Harbin Inst. of Technol., Harbin
  • Volume
    3
  • fYear
    2008
  • fDate
    12-15 July 2008
  • Firstpage
    1586
  • Lastpage
    1591
  • Abstract
    Search interface detection is an essential technique for extracting information from the hidden Web. The challenge is that search interface data is represented by high-dimensional, sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to this problem. We extend the random forest algorithm with a weighted feature selection method for building the individual classifiers. With this improved random forest algorithm (IRFA), each classifier is learnt from a weighted subset of the feature space, so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We compare the ensemble approach with other well-known classification algorithms, such as SVM and C4.5. The experimental results show that our method is more effective at detecting search interfaces of the hidden Web.
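The abstract does not say how the feature weights are computed or how the weighted subset is drawn, so the following is only a minimal sketch of one plausible reading: each tree receives a subset of feature indices sampled without replacement, with each draw proportional to a per-feature weight. The function name `sample_weighted_subspace`, the example weights, and the uniform fallback when all remaining weights are zero are assumptions made for illustration, not details taken from the paper.

```python
import random


def sample_weighted_subspace(weights, k, rng):
    """Draw k distinct feature indices; each draw is proportional to the
    (non-negative) weights of the features not yet chosen.

    Hypothetical helper: the paper's actual weighting scheme is not given
    in the abstract.
    """
    if k > len(weights):
        raise ValueError("k cannot exceed the number of features")
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        current = [weights[i] for i in remaining]
        if sum(current) <= 0:
            # Assumption: once only zero-weight features are left,
            # fall back to a uniform draw among them.
            pick = rng.choice(remaining)
        else:
            pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


if __name__ == "__main__":
    rng = random.Random(42)
    # Illustrative relevance scores for six form features (made-up values).
    feature_weights = [0.9, 0.1, 0.0, 0.6, 0.3, 0.0]
    # One feature subspace per tree in a small ensemble.
    for tree_id in range(3):
        subspace = sample_weighted_subspace(feature_weights, 3, rng)
        print(f"tree {tree_id}: features {sorted(subspace)}")
```

In a full ensemble, each subspace would be paired with a bootstrap sample of the training forms and handed to a decision tree learner, with the trees' votes combined at prediction time. Those steps are omitted here because the record gives no detail on them.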
  • Keywords
    Internet; information retrieval; random processes; classification algorithm; hidden Web search interface; improved random forest algorithm; information extraction; multiclassifier ensemble; search interface detection; weighted feature selection; Classification algorithms; Classification tree analysis; Cybernetics; Data mining; Decision trees; Feature extraction; HTML; Machine learning; Support vector machines; Web search; Search interface detection; form classification; hidden Web; random forest
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Machine Learning and Cybernetics, 2008 International Conference on
  • Conference_Location
    Kunming
  • Print_ISBN
    978-1-4244-2095-7
  • Electronic_ISBN
    978-1-4244-2096-4
  • Type
    conf
  • DOI
    10.1109/ICMLC.2008.4620659
  • Filename
    4620659