Title :
Parsing without a grammar: making sense of unknown file formats
Author :
Lloyd, Levon ; Skiena, Steven
Author_Institution :
Dept. of Comput. Sci., State Univ. of New York, Stony Brook, NY, USA
Abstract :
The thousands of specialized structured file formats in use today present a substantial barrier to freely exchanging information between application programs. We consider the problem of deducing such basic features as the whitespace characters, bracketing delimiter symbols, and self-delimiter characters of a given file format from one or more example files. We demonstrate that for sufficiently large example files, we can typically identify the basic features of interest.
Keywords :
hypermedia markup languages; symbol manipulation; text analysis; word processing; application program; bracketing delimiter symbol; self delimiter character; structured file format; whitespace character; Application software; Computer science; Data mining; HTML; Markup languages; Page description languages; Poles and towers; Spatial databases; Text processing; XML
Conference_Title :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
DOI :
10.1109/ICDM.2003.1250920