Machine learning versus spam

20 Jan 2017

minute read

Authors

Andrey But

Machine learning methods are often presented by developers of security solutions as a silver bullet, or a magic catch-all technology that will protect users from a huge range of threats. But just how justified are these claims? Unless explanations are provided as to where and how exactly these technologies are used, these assertions appear to be little more than a marketing ploy.

For many years, machine learning technology has been a working component of Kaspersky Lab’s security products, and our firm belief is that they must not be seen as a super technology capable of combating all threats. Yes, they are a highly effective protection tool, but just one tool among many. My colleague Alexey Malanov even made the point of writing an article on the Myths about machine learning in cybersecurity.

At Kaspersky Lab, machine learning can be found in a number of different areas, especially when dealing with the interesting task of spam detection. This particular task is in fact much more challenging than it appears to be at first glance. A spam filter’s job is not only to detect and filter out all messages with undesired content but, more importantly, it has to ensure all legitimate messages are delivered to the recipient. In other words, type I errors, or so-called false positives, need to be kept to a minimum.

Another aspect that should not be forgotten is that the spam detection system needs to respond quickly. It must work pretty much instantaneously; otherwise, it will hinder the normal exchange of email traffic.

A graphic representation can be provided in a project management triangle, only in our case the three corners represent speed, absence of false positives, and the quality of spam detection; no compromise is possible on any of these three. If we were to go to extremes, for example, spam could be filtered manually – this would provide 100% effectiveness, but minimal speed. In another extreme case, very rigid rules could be imposed, so no email messages whatsoever would pass – the recipient would receive no spam and no legitimate messages. Yet another approach would be to filter out only known spam; in that case, some spam messages would still reach the recipient. To find the right balance inside the triangle, we use machine learning technologies, part of which is an algorithm enabling the classifier to pass prompt and error-free verdicts for every email message.

How is this algorithm built? Obviously, it requires data as input. However, before data is fed into the classifier, is must be cleansed of any ‘noise’, which is yet another problem that needs to be solved. The greatest challenge about spam filtration is that different people may have different criteria for deciding which messages are valid, and which are spam. One user may see sales promotion messages as outright spam, while another may consider them potentially useful. A message of this kind creates noise and thus complicates the process of building a quality machine learning algorithm. Using the language of statistics, there may be so-called outlier values in the dataset, i.e., values that are dramatically different from the rest of the data. To address this problem, we implemented automatic outlier filtration, based on the Isolation Forest algorithm customized for this purpose. Naturally, this removes only some of the noise data, but has already made life much easier for our algorithms.

After this, we obtain data that is practically ‘clean’. The next task is to convert the data into a format that the classifier can understand, i.e., into a set of identifiers, or features. Three of the main types of features used in our classifier are:

Text features – fragments of text that often occur in spam messages. After preprocessing, these can be used as fairly stable features.
Expert features – features based on expert knowledge accumulated over many years in our databases. They may be related to domains, the frequency of headers, etc.
Raw features. Perhaps the most difficult to understand. We use parts of the message in their raw form to identify features that we have not yet factored in. The message text is either transformed using word embedding or reduced to the Bag-of-Words model (i.e., formed into a multiset of words which does not account for grammar and word order), and then passed to the classifier, which autonomously identifies features.

All these features and their combinations will help us in the final stage – the launch of the classifier.

What we eventually want to see is a system that produces a minimum of false positives, works fast and achieves its principal aim – filtering out spam. To do this, we build a complex of classifiers, and it is unique for each set of features. For example, the best results for expert features were demonstrated by gradient boosting – the sequential building up of a composition of machine learning algorithms, in which each subsequent algorithm aims to compensate for the shortcomings of all previous algorithms. Unsurprisingly, boosting has demonstrated good results in solving a broad range of problems involving numerical and category features. As a result, the verdicts of all classifiers are integrated, and the system produces a final verdict.

Our technologies also take into account potential problems such as over-training, i.e., a situation when an algorithm works well with a training data sample, but is ineffective with a test sample. To preclude this sort of problem from occurring, the parameters of classification algorithms are selected automatically, with the help of a Random Search algorithm.

This is a general overview of how we use machine learning to combat spam. To see how effective this method is, it is best to view the results of independent testing.

Authors

Andrey But

Machine learning versus spam

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Redin

Posted on January 21, 2017. 4:05 pm

How many more years will it take to realize that filtering-based models do not work or ever work? 40 years is not enough?
The P2T method returns this control to the user and removes all motivation to the spammer to continue. In this way, the burden of managing all mail produced goes to the sender’s side.

Reply

Latest Posts

Latest Webinars

Reports

According to Kaspersky, Librarian Ghouls APT continues its series of attacks on Russian entities. A detailed analysis of a malicious campaign utilizing RAR archives and BAT scripts.

Kaspersky GReAT experts uncovered a new campaign by Lazarus APT that exploits vulnerabilities in South Korean software products and uses a watering hole approach.

MysterySnail RAT attributed to IronHusky APT group hasn’t been reported since 2021. Recently, Kaspersky GReAT detected new versions of this implant in government organizations in Mongolia and Russia.

Kaspersky researchers analyze GOFFEE’s campaign in H2 2024: the updated infection scheme, new PowerModul implant, switch to a binary Mythic agent.

Machine learning versus spam

GReAT Ideas. Balalaika Edition

GReAT Ideas. Green Tea Edition

GReAT Ideas. Powered by SAS: malware attribution and next-gen IoT honeypots

GReAT Ideas. Powered by SAS: threat actors advance on new fronts

GReAT Ideas. Powered by SAS: threat hunting and new techniques

How much security is enough?

TOP 10 unattributed APT mysteries

The future of cyberconflicts

Researchers call for a determined path to cybersecurity

What does it take to become a good reverse engineer?

Latest Posts

Sleep with one eye open: how Librarian Ghouls steal data by night

Analysis of the latest Mirai wave exploiting TBK DVR devices with CVE-2024-3721

IT threat evolution in Q1 2025. Non-mobile statistics

IT threat evolution in Q1 2025. Mobile statistics

Latest Webinars

Unmasking email dangers: Detecting and defending against mail threats

Kaspersky Scan Engine: Built to Integrate, Engineered to Protect

Kaspersky’s way of cloud workload protection

In-depth analysis of cyberattacks: key findings from Kaspersky’s Incident Response report

Reports

Sleep with one eye open: how Librarian Ghouls steal data by night

Operation SyncHole: Lazarus APT goes back to the well

IronHusky updates the forgotten MysterySnail RAT to target Russia and Mongolia

GOFFEE continues to attack organizations in Russia

Subscribe to our weekly e-mails