• DocumentCode
    2040966
  • Title
    Statistical learning and analyses of Chinese ancient books for information retrieval
  • Author
    Zhang, Min ; Ma, Sha Ping ; Jiang, Zhe ; Huang, Ke
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • Volume
    2
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    869
  • Abstract
    Full-text retrieval for modern Chinese has been studied for a long time, but the same cannot be said for ancient Chinese books, especially in China. This paper tries to identify the characteristics of ancient Chinese books that can be used for information retrieval. Statistical analysis was carried out on a collection of ancient Chinese books totaling over 35,000,000 words, including most of the works in common use. Based on these experiments, some characteristics of ancient Chinese works are analyzed and compared with modern Chinese, including the basic unit of ancient works, the proportion of double-character words, sentence length, and the field dependency of ancient Chinese works. We then draw conclusions about ancient Chinese that are useful for information retrieval, especially when building inverted indexes and selecting the index unit. Based on these conclusions, a full-text retrieval system for ancient Chinese books has been designed and implemented. The results show that statistical learning and analysis are a great help in ancient Chinese information retrieval.
  • Keywords
    full-text databases; information retrieval; statistical analysis; ancient Chinese books; double character words; field dependency; full text information retrieval; index unit; inverted index; sentence length; statistical analyses; statistical learning; Books; Continents; Frequency; History; Information analysis; Information retrieval; Modems; Natural languages; Statistical analysis; Statistical learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man, and Cybernetics, 2001 IEEE International Conference on
  • Conference_Location
    Tucson, AZ
  • ISSN
    1062-922X
  • Print_ISBN
    0-7803-7087-2
  • Type
    conf
  • DOI
    10.1109/ICSMC.2001.973025
  • Filename
    973025