• DocumentCode
    2733160
  • Title

    An Intelligent Web Agent to Mine Bilingual Parallel Pages via Automatic Discovery of URL Pairing Patterns

  • Author

    Kit, Chunyu ; Ng, Jessica Yee Ha

  • Author_Institution
    City Univ. of Hong Kong, Hong Kong
  • fYear
    2007
  • fDate
    5-12 Nov. 2007
  • Firstpage
    526
  • Lastpage
    529
  • Abstract
    This paper describes an intelligent agent to facilitate bi-text mining from the Web via automatic discovery of URL pairing patterns (or keys) for retrieving parallel Web pages. The linking power of a key, defined as the number of URL pairs that it can match, is used as the objective function in the search for the best set of keys, that is, the set that finds the greatest number of Web page pairs within a bilingual Web site. Our experiments show that a best-first search that approximates this optimization with an empirical threshold can recognize 98.1% of the true parallel Web pages and discover many irregular pairing patterns that are unlikely to be discovered by other approaches. It does so with no prior knowledge such as ad hoc heuristics, no labelled data for training, and no similarity analysis of Web page structure and content, all of which are commonly involved in existing approaches.
  • Keywords
    Web sites; data mining; natural language processing; software agents; text analysis; World Wide Web; automatic URL pairing pattern discovery; bi-text mining; bilingual Web site; bilingual parallel page mining; intelligent Web agent; parallel Web page retrieval; Conferences; Intelligent agent; Joining processes; Natural language processing; Pattern analysis; Pattern matching; Pattern recognition; Uniform resource locators; Web mining; Web pages; parallel web pages; URL pairing pattern; bitext mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology Workshops, 2007 IEEE/WIC/ACM International Conferences on
  • Conference_Location
    Silicon Valley, CA
  • Print_ISBN
    0-7695-3028-1
  • Type

    conf

  • DOI
    10.1109/WI-IATW.2007.107
  • Filename
    4427643
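
  • Illustrative_Sketch
    The abstract's idea of ranking keys by linking power and selecting them best-first can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not give the paper's exact key definition or search procedure, so here a key is assumed to be the pair of differing substrings left after stripping the common prefix and suffix of two URLs, keys with an empty side are skipped for simplicity, and the function names and the threshold value of 2 are hypothetical.

```python
from itertools import combinations


def candidate_key(u, v):
    """Assumed key form: the differing middles of two URLs once their
    common prefix and common suffix have been stripped."""
    limit = min(len(u), len(v))
    p = 0
    while p < limit and u[p] == v[p]:
        p += 1
    s = 0
    while s < limit - p and u[-1 - s] == v[-1 - s]:
        s += 1
    return u[p:len(u) - s], v[p:len(v) - s]


def linking_power(key, urls):
    """Return the set of URL pairs (u, v) in `urls` that the key matches,
    i.e. replacing the first occurrence of key[0] in u with key[1] gives v.
    The linking power is the size of this set."""
    a, b = key
    pairs = set()
    for u in urls:
        if a in u:
            v = u.replace(a, b, 1)
            if v != u and v in urls:
                pairs.add((u, v))
    return pairs


def discover_keys(urls, threshold=2):
    """Greedy best-first selection: repeatedly take the key with the highest
    linking power over still-unpaired URLs, stopping once the best remaining
    key falls below the (hypothetical) threshold."""
    unpaired = set(urls)
    candidates = {candidate_key(u, v) for u, v in combinations(sorted(unpaired), 2)}
    # Simplification: ignore keys with an empty side (insertion-style patterns).
    candidates = {k for k in candidates if k[0] and k[1]}
    selected = []
    while candidates:
        scored = {k: linking_power(k, unpaired) for k in candidates}
        best = max(scored, key=lambda k: (len(scored[k]), k))
        if len(scored[best]) < threshold:
            break
        selected.append((best, len(scored[best])))
        for u, v in scored[best]:
            unpaired.discard(u)
            unpaired.discard(v)
        candidates.discard(best)
    return selected
```

    On a toy site with three pages under /en/ and the same three under /zh/, the key ("en", "zh") matches all three pairs and is selected first, after which no unpaired URLs remain. Candidate generation here compares every pair of URLs, which is only practical for small sites.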