Abstract: The aim of the paper is to present the assumptions and architecture of
a system for similarity search in string sets. The research analysed all the required steps
of text document processing: text extraction, pruning,
stemming and lemmatization. Models for describing text documents and
a method for creating a feature vector were also developed. This vector consists,
inter alia, of selected words and their numbers of occurrences. Text
analysis is supported by a set of dictionaries: Stop-words, Domain and
Lemma dictionaries, all of which were considered in the context of the Polish language.
Because the Lemma dictionary is expected to contain many entries, an efficient
method of optimising access to it was developed. Various measures for calculating
the degree of similarity between text documents were also studied. Moreover, a method for
determining how well text documents match user queries was proposed.
The system was implemented in accordance with the idea of multi-agent systems.
Its functionality is provided by a set of agents running in separate threads.
The efficiency of the system was also tested.
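
The feature vector and similarity calculation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny STOP_WORDS and LEMMAS dictionaries, the function names, and the choice of cosine similarity as the measure are all assumptions made for illustration, since the abstract does not name the measures that were studied.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny dictionaries for illustration only; the paper's
# Stop-words and Lemma dictionaries for Polish are far larger.
STOP_WORDS = {"i", "w", "na", "z"}
LEMMAS = {"koty": "kot", "kota": "kot", "psy": "pies", "psa": "pies"}


def feature_vector(text):
    """Map a text to {lemma: number of occurrences}."""
    tokens = re.findall(r"\w+", text.lower())
    # Pruning step: drop stop-words.
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization step: dictionary lookup, falling back to the raw token.
    return Counter(LEMMAS.get(t, t) for t in kept)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


query = feature_vector("Koty i psy")
document = feature_vector("kota psa")
score = cosine_similarity(query, document)
```

In this sketch a plain Python dict stands in for the Lemma dictionary; the access optimisation mentioned in the abstract is not reproduced here.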
Keywords: agent systems, text similarity search
ACM Classification Keywords: I.7 Document And Text Processing
Link:
MULTI-AGENT SYSTEM FOR SIMILARITY SEARCH IN STRING SETS
Katarzyna Harężlak, Michał Sala
http://www.foibg.com/ibs_isc/ibs-26/ibs-26-p09.pdf