• DocumentCode
    2352544
  • Title
    Near-Duplicates Detection for Vietnamese Documents in Large Database
  • Author
    Cong Thanh Truong ; The Duy Bui ; Bao Son Pham
  • Author_Institution
    Vietnam Nat. Univ., Hanoi
  • fYear
    2008
  • fDate
    23-25 July 2008
  • Firstpage
    70
  • Lastpage
    75
  • Abstract
    Near-duplicate documents exacerbate the problem of information overload. Detecting near-duplicates has attracted considerable attention from both industry and academia. In this paper, we address this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most current algorithms were designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose combining Charikar's algorithm (Alexander Andoni et al., 2006) with a "weighting scheme" and Vietnamese-specific features to address the intricacies of the language. Experimental results indicate that our scheme is effective for detecting near-duplicates in a corpus of Vietnamese documents.
  • Keywords
    database management systems; document handling; natural language processing; Charikar algorithm; Vietnamese documents; large database; monosyllabic language; near-duplicates detection; Algorithm design and analysis; Clustering algorithms; Data mining; Databases; Information technology; Natural languages; Search engines; Web pages; Web search; White spaces; Charikar; LSH; hash scheme; near-duplicate Vietnamese detection; weighting scheme;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Language Processing and Web Information Technology, 2008. ALPIT '08. International Conference on
  • Conference_Location
    Dalian, Liaoning
  • Print_ISBN
    978-0-7695-3273-8
  • Type
    conf
  • DOI
    10.1109/ALPIT.2008.76
  • Filename
    4584344