• DocumentCode
    3323178
  • Title
    Efficient Information Extraction over Evolving Text Data
  • Author
    Chen, Fei ; Doan, AnHai ; Yang, Jun ; Ramakrishnan, Raghu
  • Author_Institution
    Univ. of Wisconsin-Madison, Madison, WI
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    943
  • Lastpage
    952
  • Abstract
    Most current information extraction (IE) approaches have considered only static text corpora, over which we typically have to apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and to keep extracted information up to date, we often must apply IE repeatedly to consecutive corpus snapshots. We describe Cyclex, an approach that efficiently executes such repeated IE by recycling previous IE efforts. Specifically, given a current corpus snapshot U, Cyclex identifies text portions of U that also appear in the previous corpus snapshot V. Since Cyclex has already executed IE over V, it can recycle the IE results for these portions, combining them with the results of executing IE over the remaining parts of U to produce the complete IE results for U. Realizing Cyclex raises many challenges, including modeling information extractors, exploring the trade-off between runtime and completeness in identifying overlapping text, and making informed, cost-based decisions between redoing IE from scratch and recycling previous IE results. We describe initial solutions to these challenges, and experiments over two real-world data sets that demonstrate the utility of our approach.
  • Keywords
    information retrieval; text analysis; Cyclex; evolving text data; information extraction; real-world text corpora; static text corpora; Computer integrated manufacturing; Data mining; Databases; Debugging; Electronic mail; Information retrieval; Recycling; Runtime; Uniform resource locators; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-1836-7
  • Electronic_ISBN
    978-1-4244-1837-4
  • Type
    conf
  • DOI
    10.1109/ICDE.2008.4497503
  • Filename
    4497503
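
The recycling idea summarized in the abstract (reuse IE results for text that the current snapshot U shares with the previous snapshot V, and run IE only on the rest) can be illustrated with a minimal sketch. This is not the Cyclex algorithm itself: the record does not describe how Cyclex matches text or models extractors. The sketch assumes line-level matching via Python's standard `difflib`, a toy email extractor, and an extractor whose output for a line depends only on that line. All function names (`extract_emails`, `extract_snapshot`, `recycle`) are hypothetical.

```python
import difflib
import re


def extract_emails(line):
    """Toy extractor (an assumption): (start, end, text) per email-like mention in one line."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"[\w.]+@[\w.]+\w", line)]


def extract_snapshot(lines, extractor):
    """Run IE from scratch over a snapshot; results are keyed by line index."""
    return {i: extractor(line) for i, line in enumerate(lines)}


def recycle(prev_lines, prev_results, cur_lines, extractor):
    """Reuse results for lines of cur_lines that match prev_lines; extract the rest.

    Returns the results for cur_lines and the number of lines re-extracted.
    Valid only under the stated assumption that a line's mentions depend
    on that line alone.
    """
    results = {}
    reextracted = 0
    matcher = difflib.SequenceMatcher(a=prev_lines, b=cur_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged region: copy the previous results to the new line indices.
            for offset in range(i2 - i1):
                results[j1 + offset] = prev_results[i1 + offset]
        else:
            # Inserted or replaced lines: run the extractor again.
            # For deletions, j1 == j2, so nothing is extracted.
            for j in range(j1, j2):
                results[j] = extractor(cur_lines[j])
                reextracted += 1
    return results, reextracted


if __name__ == "__main__":
    v = ["Contact alice@example.org for details.",
         "No mentions here.",
         "Write to bob@example.com."]
    u = ["Contact alice@example.org for details.",
         "New line: carol@example.net",
         "Write to bob@example.com."]
    prev = extract_snapshot(v, extract_emails)
    cur, n = recycle(v, prev, u, extract_emails)
    print(n, cur)
```

In this sketch only the changed line is passed to the extractor again. A real system would also have to weigh the cost of matching against the cost of re-extraction, which is the cost-based decision the abstract mentions.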