Intelligent information retrieval for future GIS
Pavel Praks
Dept. of Mathematics
VSB-TU Ostrava, 17. listopadu 15, 708 33 Ostrava - Poruba,
Czech Republic
tel.: +420 59 732 4181, fax: +420 59 691 9597
E-mail: pavel.praks@vsb.cz
Abstract
Numerical linear algebra, especially the Singular Value Decomposition
(SVD), is the basis of the information retrieval strategy called
Latent Semantic Indexing (LSI), see [1]. Originally, LSI was used as an
efficient tool for the semantic analysis of large collections of text
documents. The main reason is that more conventional retrieval
strategies (such as vector space, probabilistic and extended Boolean)
are not very effective on real data: they retrieve information solely
on the basis of keywords, so polysemy (one word having multiple
meanings) and synonymy (multiple words having the same meaning) are not
correctly detected. LSI can be viewed as a variant of the vector space
model that uses a low-rank approximation of the original data matrix,
computed via the SVD or other numerical methods [2]. The "classical"
LSI algorithm for information retrieval has the following basic steps:
(i) The SVD of the term-document matrix is computed using numerical
linear algebra; it is used to identify and remove redundant information
and noise from the data. (ii) Similarity coefficients between the
transformed data vectors are computed, which reveals some hidden
(latent) structure in the data. Numerical experiments indicate that
dimension reduction applied to the original data brings two main
advantages to information retrieval: (i) automatic noise filtering and
(ii) natural clustering of data with "similar" semantics, see Fig. 1.
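
The two steps above can be sketched in a few lines of NumPy. This is
only a minimal illustration, not the implementation behind the reported
experiments: the function name lsi_similarities, the toy term-document
matrix and the choice to compare the query and the documents inside the
k-dimensional latent space are all assumptions made for the example.

```python
import numpy as np

def lsi_similarities(A, query, k):
    """Cosine similarity of a query to every document under a rank-k LSI.

    A     : (terms x documents) term-document matrix (hypothetical data)
    query : (terms,) query vector in the same term space
    k     : number of singular triplets to keep
    """
    # Step (i): SVD of the term-document matrix, truncated to rank k.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Coordinates of each document in the latent space: rows of (S_k V_k^T)^T.
    docs_k = (sk[:, None] * Vtk).T
    # Projection of the query onto the same latent basis.
    q_k = Uk.T @ query

    # Step (ii): similarity coefficients between the transformed vectors.
    num = docs_k @ q_k
    den = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return num / np.where(den == 0, 1.0, den)

# Toy data. Rows: car, automobile, engine, flower, petal.
# Columns: three short documents.
A = np.array([
    [1.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0],   # automobile
    [1.0, 1.0, 0.0],   # engine
    [0.0, 0.0, 1.0],   # flower
    [0.0, 0.0, 1.0],   # petal
])
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # the single keyword "car"

sims = lsi_similarities(A, query, k=2)
print(np.round(sims, 3))   # approximately [1, 1, 0]
```

In this toy case the second document does not contain the keyword
"car", so a plain keyword match would give it a score of zero. After
the rank-2 reduction it scores as high as the first document, because
"car" and "automobile" both co-occur with "engine".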
Fig. 1: An example of LSI image retrieval results [3]. Images are
automatically sorted by their content using the partial eigenproblem.
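
One possible reading of "partial eigenproblem" is that only the few
largest eigenpairs of the small documents-by-documents matrix A^T A are
computed instead of a full SVD of A. The sketch below follows that
assumption; it is not taken from [3], and the function names and the
toy matrix are hypothetical. For brevity it calls a dense symmetric
eigensolver and keeps the top k pairs; for large collections an
iterative solver such as scipy.sparse.linalg.eigsh, which computes only
the requested pairs, would be the natural choice.

```python
import numpy as np

def latent_document_coords(A, k):
    """Return (documents x k) latent coordinates without forming U.

    Uses the k largest eigenpairs of the Gram matrix A^T A, whose
    eigenvalues are the squared singular values of A.
    """
    gram = A.T @ A                              # symmetric, documents x documents
    eigvals, eigvecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    sigma = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * sigma              # row j = coordinates of document j

def cosine_to_all(coords, j):
    """Cosine similarity between document j and every document."""
    norms = np.linalg.norm(coords, axis=1)
    unit = coords / np.where(norms == 0, 1.0, norms)[:, None]
    return unit @ unit[j]

# Each column stands for one document (for images: one flattened image).
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

coords = latent_document_coords(A, k=2)
sim_to_first = cosine_to_all(coords, 0)
ranking = np.argsort(-sim_to_first)             # most similar documents first
```

Sorting the documents by sim_to_first gives the kind of
content-ordered result list that Fig. 1 shows for images.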
Acknowledgement
The research has been partially supported by the Ministry of Education,
Youth and Sports of the Czech Republic under the research project
CEZ:6198910007.
References
[1] Grossman D. and Frieder O.: Information retrieval: Algorithms and
heuristics. Kluwer Academic Publishers, Second edition, 2000.
[2] Berry M. W., Drmač Z., and Jessup E. R.: Matrices, vector spaces,
and information retrieval. SIAM Review, 41(2):335-362, 1999.
[3] Praks P., Dvorský J., Snášel V.: Latent Semantic Indexing for
Image Retrieval Systems. SIAM Conference on Applied Linear Algebra,
July 15-19, 2003, The College of William and Mary, Williamsburg, U.S.A.
Published by SIAM (8 pages),
http://www.siam.org/meetings/la03/proceedings/Dvorsky.pdf
[4] Praks P., Dvorský J., Snášel V., Černohorský J.: On SVD-free
Latent Semantic Indexing for Image Retrieval for application in a hard
industrial environment. IEEE International Conference on Industrial
Technology – ICIT 2003; Session RS4_3: Industrial Applications. Hotel
Habakuk, Maribor, Slovenia, December 10-12, 2003. Published by IEEE,
pp. 466-471, ISBN 0-7803-7853-9.
[5] Praks P., Machala L., Snášel V.: Iris Recognition Using the
SVD-Free Latent Semantic Indexing. MDM/KDD2004 - Fifth International
Workshop on Multi-media Data Mining "Mining Integrated Media and
Complex Data" in conjunction with KDD'2004 - The 10th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining;
Section 2. Multimedia Data Mining: Techniques and Applications, August
22, 2004, Seattle, WA, USA.
[6] Svátek V., Labský M., Praks P., Šváb O.: Information extraction
from HTML product catalogues: coupling quantitative and knowledge-based
approaches. In Dagstuhl Seminar on Machine Learning for the Semantic
Web. Ed. N. Kushmerick, F. Ciravegna, A. Doan, C. Knoblock and S.
Staab, Wadern, Germany, Feb. 13–18, 2005, pp. 1-5. Also available at
http://www.smi.ucd.ie/Dagstuhl-MLSW/proceedings/labsky-svatek-praks-svab.pdf