Abstract: The traditional approach to evaluating the quality of technologies that identify knowledge in text
collections, namely comparison with a "reference" result, is poorly applicable because a reference answer must
be created for each specific set of electronic documents. In this paper we show that the integral quantitative
coefficients of recall, precision and F-measure can be used to assess the effectiveness of linguistic technologies
for knowledge identification in texts. To justify using the test collections method for the experimental validation
of the obtained efficiency coefficients, we propose an approach based on the methods of mathematical
statistics. We review procedures for using the sample proportion of an indicator to estimate the proportion of
relevant documents in the general population. We argue that, in practically important cases of text collection
samples, the asymmetry of the confidence interval for the binomial distribution can be overcome by an
approximate transition to the normal distribution. We also propose methods for determining the confidence
interval for the indicator proportion based on the Wilson approach, as well as a method for determining the
required size of the sample for a specified error and confidence probability.
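The abstract does not give the formulas themselves, so the following Python sketch is only an illustration of the two statistical steps it names. It assumes the textbook Wilson score interval for a binomial proportion and the usual normal-approximation formula for sample size, n = z^2 p(1 - p) / e^2. The function names (`z_value`, `wilson_interval`, `required_sample_size`) and the default worst-case proportion p = 0.5 are assumptions made for this example, not taken from the paper.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal quantile for a confidence probability."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def wilson_interval(relevant, n, confidence=0.95):
    """Wilson score interval for the proportion of relevant documents.

    relevant: number of relevant documents found in the sample
    n:        sample size (must be positive)
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    z = z_value(confidence)
    p = relevant / n  # sample proportion of the indicator
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def required_sample_size(error, confidence=0.95, p=0.5):
    """Sample size for a given absolute error under the normal approximation.

    p = 0.5 is the most conservative choice (an assumption of this sketch).
    """
    z = z_value(confidence)
    return math.ceil(z * z * p * (1.0 - p) / (error * error))


if __name__ == "__main__":
    low, high = wilson_interval(80, 100)
    print(f"Wilson 95% interval for 80/100: ({low:.4f}, {high:.4f})")
    print("Sample size for error 0.05:", required_sample_size(0.05))
```

Under these assumptions the interval for 80 relevant documents out of 100 is not centred on 0.8, which reflects the asymmetry of the binomial case that the abstract mentions.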
Keywords: evaluation of effectiveness, semistructured text information, test collections method, sample size
ACM Classification Keywords: H.3.3 Information Search and Retrieval; I.2.4 Knowledge Representation
Formalisms and Methods; G.3 Probability and Statistics – Statistical computing
Link:
Solution of the Problem of Formal Evaluation of Effectiveness of the Technology Knowledge
Identification in Semistructured Text Information
Nina Khairova, Nataliya Sharonova, Dmytro Uzlov
http://www.foibg.com/ijicp/vol01/ijicp01-03-p03.pdf