Intelligent information retrieval for future GIS

Pavel Praks
Dept. of Mathematics
VSB-TU Ostrava, 17. listopadu 15, 708 33 Ostrava - Poruba, Czech Republic
tel.: +420 59 732 4181, fax: +420 59 691 9597
E-mail: pavel.praks@vsb.cz

Abstract

Numerical linear algebra, especially the Singular Value Decomposition (SVD), is the basis of the information retrieval strategy called Latent Semantic Indexing (LSI), see [1]. LSI was originally used as an efficient tool for the semantic analysis of large collections of text documents. The main reason is that more conventional retrieval strategies (such as the vector space, probabilistic and extended Boolean models) are not very effective on real data: they retrieve information solely on the basis of keywords, so polysemy (words having multiple meanings) and synonymy (multiple words having the same meaning) are not correctly detected. LSI can be viewed as a variant of the vector space model that uses a low-rank approximation of the original data matrix, computed via the SVD or other numerical methods [2]. The "classical" LSI algorithm for information retrieval has the following basic steps: i) The Singular Value Decomposition of the term matrix is computed using numerical linear algebra. The SVD is used to identify and remove redundant noise information from the data. ii) The similarity coefficients between the transformed data vectors are computed, which reveals some hidden (latent) structures of the data. Numerical experiments have shown that applying some kind of dimension reduction to the original data brings two main advantages to information retrieval: (i) automatic noise filtering and (ii) natural clustering of data with "similar" semantics, see Fig. 1.
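
The two steps above can be illustrated with a minimal Python sketch. It is not the implementation used in the cited work: the function names `lsi_query` and `cosine`, the toy term-document matrix, the rank k = 2 and the use of NumPy are all assumptions made for illustration only.

```python
import numpy as np

def cosine(v, M):
    """Cosine similarity between vector v and every column of matrix M."""
    norms = np.linalg.norm(v) * np.linalg.norm(M, axis=0)
    # Guard against division by zero for all-zero vectors.
    return (v @ M) / np.where(norms == 0.0, 1.0, norms)

def lsi_query(A, q, k):
    """Similarity of query q to each document (column of A) in a rank-k latent space."""
    # Step i: SVD of the term-document matrix, A = U @ diag(s) @ Vt.
    # Keeping only the k largest singular triplets discards the "noise" directions.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    docs_k = sk[:, None] * Vtk   # documents in latent coordinates, shape (k, n_docs)
    q_k = Uk.T @ q               # query projected into the same latent space
    # Step ii: similarity coefficients between the transformed vectors.
    return cosine(q_k, docs_k)

# Hypothetical toy data: rows are terms, columns are documents.
# Terms: car, automobile, engine, flower, petal
A = np.array([
    [1.0, 0.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0, 0.0],   # automobile
    [1.0, 1.0, 0.0, 0.0],   # engine
    [0.0, 0.0, 1.0, 1.0],   # flower
    [0.0, 0.0, 1.0, 1.0],   # petal
])
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # query containing only the term "car"

raw = cosine(q, A)          # plain keyword matching in the original term space
lsi = lsi_query(A, q, k=2)  # matching in the rank-2 latent space
print("keyword:", np.round(raw, 3))
print("LSI    :", np.round(lsi, 3))
```

In this toy example the second document contains "automobile" but not "car", so plain keyword matching gives it a similarity of zero. In the rank-2 latent space it becomes nearly identical to the first document, because both terms co-occur with "engine", while the two "flower" documents stay unrelated to the query.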

Fig. 1: An example of LSI image retrieval results [3]. Images are automatically sorted by their content using a partial eigenproblem.
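
One way to read "partial eigenproblem" is that the latent document coordinates are obtained from the k largest eigenpairs of the Gram matrix A^T A instead of a full SVD of A. The following Python sketch shows that idea under this assumption; it is not taken from [3], and the function name `lsi_similarity_eig` and the toy matrix are hypothetical. For brevity it calls `numpy.linalg.eigh`, which computes all eigenpairs; a large-scale implementation would more likely use an iterative solver that returns only the k largest ones.

```python
import numpy as np

def lsi_similarity_eig(A, k):
    """Document-document cosine similarities in a rank-k latent space,
    computed from the eigenpairs of A^T A rather than from an SVD of A."""
    G = A.T @ A                        # Gram matrix, shape (n_docs, n_docs)
    w, V = np.linalg.eigh(G)           # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]      # indices of the k largest eigenvalues
    lam = np.clip(w[idx], 0.0, None)   # clip tiny negative rounding errors
    Vk = V[:, idx]
    # Eigenvalues of A^T A are squared singular values of A, so
    # sqrt(lam) * Vk^T equals Sigma_k @ Vk^T from the SVD (up to sign).
    docs_k = np.sqrt(lam)[:, None] * Vk.T
    norms = np.linalg.norm(docs_k, axis=0)
    norms[norms == 0.0] = 1.0          # avoid division by zero
    unit = docs_k / norms
    return unit.T @ unit               # cosine similarity matrix

# Hypothetical toy data: each column is the feature vector of one document or image.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
S = lsi_similarity_eig(A, k=2)
print(np.round(S, 3))
```

Sorting the entries of one row of `S` in decreasing order gives a ranking of all items by similarity to the item of that row, which is the kind of content-based ordering the figure illustrates.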

Acknowledgement

The research has been partially supported by the Ministry of Education, Youth and Sports of the Czech Republic under the research project CEZ:6198910007.

References

  1. Grossman D. and Frieder O.: Information retrieval: Algorithms and heuristics. Kluwer Academic Publishers, Second edition, 2000.
  2. Berry M. W., Drmač Z., Jessup E. R.: Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):336--362, 1999.
  3. Praks P., Dvorský J., Snášel V.: Latent Semantic Indexing for Image Retrieval Systems. SIAM Conference on Applied Linear Algebra, July 15-19, 2003, The College of William and Mary, Williamsburg, U.S.A. Published by SIAM (8 pages), http://www.siam.org/meetings/la03/proceedings/Dvorsky.pdf
  4. Praks P., Dvorský J., Snášel V., Černohorský J.: On SVD-free Latent Semantic Indexing for Image Retrieval for application in a hard industrial environment. IEEE International Conference on Industrial Technology – ICIT 2003; Session RS4_3: Industrial Applications. Hotel Habakuk, Maribor, Slovenia, December 10-12, 2003. Published by IEEE, pg. 466-471, ISBN 0-7803-7853-9
  5. Praks P., Machala L., Snášel V.: Iris Recognition Using the SVD-Free Latent Semantic Indexing. MDM/KDD2004 - Fifth International Workshop on Multi-media Data Mining "Mining Integrated Media and Complex Data" in conjunction with KDD'2004 - The 10th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Section 2. Multimedia Data Mining: Techniques and Applications, August 22, 2004, Seattle, WA, USA.
  6. Svátek V., Labský M., Praks P., Šváb O.: Information extraction from HTML product catalogues: coupling quantitative and knowledge-based approaches. In Dagstuhl Seminar on Machine Learning for the Semantic Web. Ed. N. Kushmerick, F. Ciravegna, A. Doan, C. Knoblock and S. Staab, Wadern, Germany, Feb. 13–18 2005, pg. 1-5. Also available at http://www.smi.ucd.ie/Dagstuhl-MLSW/proceedings/labsky-svatek-praks-svab.pdf