Title : 
Inferring lexical and grammatical structure from sequences
         
        
            Author : 
Manning, Craig G Nevill ; Witten, Ian H.
         
        
            Author_Institution : 
Dept. of Biochem., Stanford Univ., CA, USA
         
        
        
        
        
        
            Abstract : 
In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word `the´ many parts of speech (such as verbs) are necessarily excluded. This paper shows how these kinds of structure can be inferred automatically from sequences. We begin with an example that both illustrates the utility of inferring the kinds of structure we seek and shows what our techniques can do. Next we present an efficient and non-obvious algorithm for identifying exact repetitions-including nested repetitions-in time which is linear with the length of the sequence. Then we describe a very simple algorithm for identifying interactions between sequence elements. The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone. We show how they combine to generate the kind of structure sought in the original motivating example. Although the two methods work well together on many simple examples, the results frequently conflict with intuition in the inference of branching structure. The minimum description length principle seems to provide the only satisfactory general approach
         
        
            Keywords : 
grammars; inference mechanisms; sequences; DNA; branching structure inference; computer programs; efficient algorithm; exact repetitions; grammatical structure; lexical structure; minimum description length principle; music; nested repetitions; sequences; text; Automata; Biochemistry; Computer science; DNA computing; Fractals; Frequency; Large-scale systems; Sequences; Speech; Switches;
         
        
        
        
            Conference_Titel : 
Compression and Complexity of Sequences 1997. Proceedings
         
        
            Conference_Location : 
Salerno
         
        
            Print_ISBN : 
0-8186-8132-2
         
        
        
            DOI : 
10.1109/SEQUEN.1997.666921