• DocumentCode
    2201456
  • Title

    MDL hierarchical clustering with incomplete data

  • Author

    Lai, Po-Hsiang ; O'Sullivan, Joseph A.

  • Author_Institution
    Electr. & Syst. Eng., Washington Univ. in St. Louis, St. Louis, MO, USA
  • fYear
    2010
  • fDate
    Jan. 31 2010-Feb. 5 2010
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    The goal of stemmatology, a crucial part of textual criticism, is to reconstruct a family tree of the different variants of a text that result from imperfect copying. In reality, historians often have incomplete data because some variants have not yet been discovered and portions of the available variants are missing due to physical damage. Stemmatology is similar to molecular phylogenetics, where biologists aim to reconstruct the evolutionary history of species based on genetic or protein sequences. Adoption of phylogenetic methods has led to encouraging results in automatic stemmatology. We discuss and demonstrate the potential application of minimum description length (MDL) concepts to stemmatology. Our method is applied to a realistic dataset and outperforms major existing methods.
  • Keywords
    pattern clustering; text analysis; MDL hierarchical clustering; genetic sequence; minimum description length; molecular phylogenetics; protein sequence; stemmatology; textual criticism; Bifurcation; Data engineering; Evolution (biology); Genetic mutations; History; Phylogeny; Printing; Proteins; Sequences; Systems engineering and theory;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Theory and Applications Workshop (ITA), 2010
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-1-4244-7012-9
  • Electronic_ISBN
    978-1-4244-7014-3
  • Type

    conf

  • DOI
    10.1109/ITA.2010.5454099
  • Filename
    5454099