Author_Institution :
Center for Signal & Image Process., Georgia Inst. of Technol., Atlanta, GA
Abstract :
With an increasing amount of audio and video materials made available on the web, information extraction from multimedia documents is becoming a key area of growing business and technology interest. Research opportunities range from traditional topics, such as multimedia signal representation, processing, coding, modeling, authentication, and recognition, to emerging subjects, such as language modeling, semantic concept decoding, media data mining, and knowledge discovery. Conventional multimedia processing often focuses on techniques developed for an individual medium. However, for multimedia pattern recognition purposes, a number of algorithms are well-positioned and applicable to many cross-media applications. We present three families of such algorithms. The first, derived from speech and image coding, is unsupervised tokenization of multimedia patterns into symbols from a finite alphabet through segment or block quantization. Acoustic and visual lexicons can then be constructed. The second, derived from information retrieval, is a vector space representation of multimedia documents via extraction of high-dimensional salient feature vectors using co-occurrence statistics of acoustic and visual words. This can be accomplished through a feature extraction and feature reduction framework, known as latent semantic analysis (LSA), serving as a unified representation of multimedia patterns. This allows us to convert heterogeneous multimedia patterns into uniform text-like documents. Finally, we discuss decision-feedback discriminative learning, derived from automatic speech and speaker recognition, for document classification, such as text categorization (TC) or topic identification. Machine learning techniques have been extensively used in the TC community to design discriminative classifiers. We present a recently developed maximal figure-of-merit (MFoM) learning framework for TC. It attempts to optimize parameters for any classifier with any feature representation on any desired performance metric, and was shown to outperform other well-known machine learning algorithms, such as the support vector machine (SVM), especially for topics with very few training documents. The mathematical formulation of the above three sets of techniques will be described in detail first, followed by their applications to text categorization, automatic image annotation, video story segmentation, audio fingerprinting, and automatic language identification. The three frameworks, all derived from the speech and language processing community, provide a natural linkage to language characterization and concept modeling of multimedia documents and seem to serve as an ideal combination of tools for bridging the gap from conventional, low-level, content-based signal processing to high-level, concept-based processing of multimedia patterns.
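As a rough illustration of the first two families described in the abstract (tokenization by block quantization, then a vector space representation via LSA), the following is a minimal sketch and not the authors' implementation. It assumes plain k-means as the quantizer, raw token counts as the co-occurrence statistics, and a truncated SVD for the feature reduction; the function names, the toy data, and the codebook size of 8 are all hypothetical choices made for the example.

```python
import numpy as np

def quantize(blocks, codebook):
    """Map each feature block to the index of its nearest codeword (its 'word')."""
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(blocks, k, iters=20, seed=0):
    """Plain k-means (an assumed quantizer): learn k codewords from the blocks."""
    rng = np.random.default_rng(seed)
    codebook = blocks[rng.choice(len(blocks), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = quantize(blocks, codebook)
        for j in range(k):
            members = blocks[labels == j]
            if len(members):  # keep the old codeword if its cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

def term_document_matrix(docs_tokens, k):
    """Count how often each of the k tokens occurs in each document."""
    counts = np.zeros((k, len(docs_tokens)))
    for j, tokens in enumerate(docs_tokens):
        counts[:, j] = np.bincount(tokens, minlength=k)
    return counts

def lsa(counts, rank):
    """Truncated SVD: return one rank-dimensional vector per document."""
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    return (s[:rank, None] * vt[:rank]).T

# Toy "documents": each is a set of 30 feature blocks of dimension 4,
# standing in for acoustic segments or image blocks (synthetic data).
rng = np.random.default_rng(0)
docs = [rng.normal(loc=c, size=(30, 4)) for c in (0.0, 0.0, 5.0, 5.0)]

codebook = build_codebook(np.vstack(docs), k=8)   # the learned lexicon
tokens = [quantize(d, codebook) for d in docs]    # text-like token sequences
counts = term_document_matrix(tokens, k=8)        # token-by-document counts
doc_vectors = lsa(counts, rank=2)                 # reduced document vectors
```

Each row of `doc_vectors` is a low-dimensional representation of one document that a classifier could take as input; in this toy setup one would expect documents drawn from the same source to land closer together than documents from different sources.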
Keywords :
information retrieval; learning (artificial intelligence); multimedia systems; pattern classification; support vector machines; decision-feedback discriminative learning; discriminative classifier learning; document classification; feature extraction; feature reduction; feature representation; image coding; information extraction; information retrieval; knowledge discovery; language modeling; language processing; latent semantic analysis; machine learning; maximal figure-of-merit; media data mining; multimedia documents; multimedia pattern recognition; multimedia signal authentication; multimedia signal coding; multimedia signal modeling; multimedia signal processing; multimedia signal recognition; multimedia signal representation; semantic concept decoding; speech coding; support vector machine; text categorization; tokenization; topic identification; vector space representation; Data mining; Feature extraction; Image segmentation; Machine learning; Natural languages; Signal representations; Speech; Support vector machine classification; Support vector machines; Text categorization;