• DocumentCode
    3322330
  • Title

    Automatic Extraction of Useful Facet Hierarchies from Text Databases

  • Author

    Dakka, Wisam ; Ipeirotis, Panagiotis G.

  • Author_Institution
    Dept. of Comput. Sci., Columbia Univ., New York, NY
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    466
  • Lastpage
    475
  • Abstract
    Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a new powerful paradigm that proved to be a successful complement to keyword searching. Thus far, the identification of the facets was either a manual procedure, or relied on a priori knowledge of the facets that can potentially appear in the underlying collection. In this paper, we present an unsupervised technique for automatic extraction of facets useful for browsing text databases. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful facet terms. For this, we first identify important phrases in each document. Then, we expand each phrase with "context" phrases using external resources, such as WordNet and Wikipedia, causing facet terms to appear in the expanded database. Finally, we compare the term distributions in the original database and the expanded database to identify the terms that can be used to construct browsing facets. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.
  • Keywords
    information retrieval; text analysis; unsupervised learning; very large databases; Amazon mechanical turk service; automatic facet extraction; electronic form; keyword searching; text database; text document; unsupervised technique; Computer science; Concrete; Data mining; Image databases; Information management; Keyword search; Motion pictures; TV; Taxonomy; YouTube;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-1836-7
  • Electronic_ISBN
    978-1-4244-1837-4
  • Type

    conf

  • DOI
    10.1109/ICDE.2008.4497455
  • Filename
    4497455