Abstract: Most existing spam filtering techniques suffer from serious disadvantages. Some of
them produce many false positives. Others are suitable only for email filtering and cannot be used in IM and
social networks. Content-based methods therefore seem to be more efficient. One of them is based on signature
retrieval. However, it is not resistant to changes in the message. There are enhancements (e.g. checksums), but
they are extremely time- and resource-consuming. That is why the main objective of this research is to develop a
method for detecting transformed messages. To this end we have compared spam in various languages, namely English, French,
Russian and Italian. For each language, about 1000 messages (both spam and non-spam) were examined,
and 135 quantitative features were extracted. Almost all of these features are independent of the
language. They underlie the first step of the algorithm, which is based on a support vector machine (SVM). The next
stage tests the obtained results using a trigram approach. The proposed phishing detection technique is also based on SVM.
Quantitative characteristics, message structure and keywords are used as features. The results obtained
indicate the efficiency of the suggested approach.
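
The two ingredients named in the abstract, language-independent quantitative features and a trigram check, can be illustrated with a minimal Python sketch. It is not the author's implementation: the five features below are hypothetical stand-ins for the 135 features of the paper, the trigrams are assumed to be character trigrams compared by cosine similarity (the abstract does not specify either choice), and the SVM training step is omitted; the feature dictionaries would be passed to an SVM implementation of the reader's choice.

```python
import re
from collections import Counter
from math import sqrt

# Assumption: a simple pattern for http(s) links; the paper's own rules are not given.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def quantitative_features(message: str) -> dict:
    """Return a few hypothetical language-independent counts and ratios.

    These are illustrative only and are not the 135 features of the paper.
    """
    n = len(message)
    letters = [c for c in message if c.isalpha()]
    return {
        "length": n,
        "url_count": len(URL_RE.findall(message)),
        "digit_ratio": sum(c.isdigit() for c in message) / n if n else 0.0,
        "upper_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        "exclamation_count": message.count("!"),
    }


def char_trigrams(text: str) -> Counter:
    """Count character trigrams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    known_spam = "WIN a FREE prize now!!! Visit http://example.com/win"
    variant = "W1N a FREE prize n0w!!! Visit http://example.com/w1n"
    unrelated = "Minutes of the Tuesday meeting are attached."

    # Stage 1 input: a feature vector that an SVM could be trained on.
    print(quantitative_features(variant))
    # Stage 2: a slightly altered message stays close to the known spam.
    print(cosine_similarity(char_trigrams(known_spam), char_trigrams(variant)))
    print(cosine_similarity(char_trigrams(known_spam), char_trigrams(unrelated)))
```

Because a trigram profile changes only locally when a few characters are substituted, the altered message scores closer to the known spam than the unrelated text does, which is the kind of change resistance that exact signatures lack.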
Keywords: spam, corpus linguistics, phishing, filtering, text categorization.
ACM Classification Keywords: I.2.7 Text analysis
Link:
SPAM AND PHISHING DETECTION IN VARIOUS LANGUAGES
Liana Ermakova
http://www.foibg.com/ijitk/ijitk-vol04/ijitk04-3-p02.pdf