Abstract
The classical (ad hoc) document retrieval problem has traditionally been approached through ranking with heuristically developed functions (such as tf.idf or BM25) or through generative language modeling, which requires explicit assumptions about term distributions. Discriminative approaches (classification, machine learning, statistical forecasting, etc.), although popular today, have largely been set aside for this task despite their success in the related task of text categorization. In this paper, we studied whether a classifier trained solely on labeled examples can generalize to new (previously unseen) queries and deliver performance comparable with popular heuristic or language models. Our SVM-based classifier learns from the relevance judgments available with standard test collections and transfers to new, previously unseen queries its ability to compare and rank documents with respect to a given query. To accomplish this, we designed a representation scheme based on a discretized form of high-level statistics of query term occurrences (such as tf, df, and document length) rather than on individual terms. Using the standard metric of average precision and both large and small standard test collections, we confirmed that our machine learning approach can achieve performance comparable with or better than that of current state-of-the-art models.
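As a rough illustration of the representation described above, the sketch below discretizes the tf, df, and document-length statistics of a query-document pair into bins and trains a linear SVM on relevance labels, then ranks unseen documents by the SVM decision value. This is only a minimal sketch under our own assumptions: the bin edges, the scikit-learn API usage, and the toy data are illustrative and are not the paper's actual discretization or implementation.

# Hypothetical sketch: discretized (tf, df, document length) features for an SVM ranker.
import numpy as np
from sklearn.svm import SVC

def discretize(value, edges):
    # Map a raw statistic to the index of its bin.
    return int(np.digitize(value, edges))

# Illustrative bin edges, not the paper's actual discretization.
TF_EDGES  = [1, 2, 4, 8, 16]
DF_EDGES  = [10, 100, 1000, 10000]
LEN_EDGES = [100, 300, 1000, 3000]

def features(tf, df, doc_len):
    # One feature vector per (query term, document) statistics triple.
    return [discretize(tf, TF_EDGES),
            discretize(df, DF_EDGES),
            discretize(doc_len, LEN_EDGES)]

# Toy training data: rows are (tf, df, doc_len); labels are relevance judgments.
X = np.array([features(5, 120, 800), features(0, 5000, 250),
              features(9, 80, 1200), features(1, 9000, 400)])
y = np.array([1, 0, 1, 0])

clf = SVC(kernel="linear")  # linear SVM trained on labeled query-document examples
clf.fit(X, y)

# Rank unseen documents for a new query by the SVM decision value (higher = ranked higher).
scores = clf.decision_function(np.array([features(3, 200, 900),
                                         features(0, 7000, 300)]))
print(scores)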