• DocumentCode
    2465012
  • Title
    Web information extraction based on hidden Markov model
  • Author
    Lai, Jianbing ; Liu, Qiang ; Liu, Yi
  • Author_Institution
    School of Software, Tsinghua University, Beijing, China
  • fYear
    2010
  • fDate
    14-16 April 2010
  • Firstpage
    234
  • Lastpage
    238
  • Abstract
    This paper proposes a hidden Markov model based on semantic blocks. Semantic blocks are segmented from the information extracted from various websites, based on their semi-structured nature. The model uses the semantic block, rather than the word, as the basic element of an observation sequence, in order to improve the accuracy and efficiency of the transition matrix. It also optimizes the observation probability distribution and the estimation accuracy of the state transition sequence by adopting a “voting strategy” and modifying the Viterbi algorithm. Experimental results show that the new model and algorithms achieve satisfactory recall and precision for web information extraction.
  • Keywords
    Algorithm design and analysis; Collaborative work; Data mining; Dictionaries; Hidden Markov models; Internet; Probability distribution; State estimation; Viterbi algorithm; Voting; hidden Markov model; semantic block; semi-structure; voting strategy;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Supported Cooperative Work in Design (CSCWD), 2010 14th International Conference on
  • Conference_Location
    Shanghai, China
  • Print_ISBN
    978-1-4244-6763-1
  • Type
    conf
  • DOI
    10.1109/CSCWD.2010.5471969
  • Filename
    5471969