INFORMATION TECHNOLOGY IN INDUSTRY

Syntactic Indexes for Text Retrieval

Ioan Badarinza, Adrian Ioan Sterca, Maria Ionescu

Abstract

In this paper, we present three techniques for incorporating syntactic metadata into a text retrieval system. The first technique requires only a syntactic analysis of the query: it assigns a different weight to each query term depending on the term's grammatical category in the query phrase, and these weights are then used in the retrieval process. The second technique is a storage optimization of the system's inverted index: the index stores only those terms that are subjects or predicates in the documents in which they appear. Finally, the third technique builds a full syntactic index: for each term in the collection, the inverted index stores, besides the term frequency and the inverse document frequency, the grammatical category of the term for each of its occurrences in a document.
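
The three techniques can be illustrated with a minimal sketch. It assumes that every term already arrives tagged with a grammatical role by some external parser; the role names, the weight values in ROLE_WEIGHTS, the function names, and the smoothed tf-idf scoring are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict

# Hypothetical per-role query weights (technique 1); the values are illustrative only.
ROLE_WEIGHTS = {"subject": 1.0, "predicate": 0.8, "object": 0.6, "other": 0.3}


def weight_query(tagged_query):
    """Technique 1: weight each query term by its grammatical role in the query.

    tagged_query is a list of (term, role) pairs. If a term occurs twice,
    the weight of its last occurrence is kept.
    """
    return {
        term: ROLE_WEIGHTS.get(role, ROLE_WEIGHTS["other"])
        for term, role in tagged_query
    }


def build_index(docs, keep_roles=None):
    """Build an inverted index: term -> doc_id -> list of roles, one per occurrence.

    keep_roles=None stores every occurrence with its role (technique 3,
    the full syntactic index). Passing a set such as {"subject", "predicate"}
    stores only occurrences with those roles (technique 2, the reduced index).
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tagged_terms in docs.items():
        for term, role in tagged_terms:
            if keep_roles is None or role in keep_roles:
                index[term][doc_id].append(role)
    return index


def score(index, n_docs, query_weights):
    """Rank documents by a role-weighted tf-idf sum (an assumed scoring rule)."""
    scores = defaultdict(float)
    for term, weight in query_weights.items():
        postings = index.get(term)
        if not postings:
            continue
        # Smoothed idf so that a term present in every document still counts.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, roles in postings.items():
            scores[doc_id] += weight * len(roles) * idf
    return dict(scores)


# Toy collection: each document is a list of (term, role) pairs.
docs = {
    "d1": [("cat", "subject"), ("chases", "predicate"), ("mouse", "object")],
    "d2": [("mouse", "subject"), ("eats", "predicate"), ("cheese", "object")],
}
full_index = build_index(docs)
reduced_index = build_index(docs, keep_roles={"subject", "predicate"})
query_weights = weight_query([("mouse", "subject"), ("cheese", "object")])
ranking = score(full_index, len(docs), query_weights)
```

In this toy collection the reduced index drops "cheese" entirely and keeps "mouse" only for d2, where it is the subject, while the full index keeps the role of every occurrence so that a scoring function could take it into account.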

Keywords

Textual Search; Syntactic Metadata; Query Term; Natural Language Processing
