Machine learning versus spam

20 Jan 2017

Authors

Andrey But

Machine learning methods are often presented by developers of security solutions as a silver bullet, or a magic catch-all technology that will protect users from a huge range of threats. But just how justified are these claims? Unless explanations are provided as to where and how exactly these technologies are used, these assertions appear to be little more than a marketing ploy.

For many years, machine learning has been a working component of Kaspersky Lab’s security products, and our firm belief is that it must not be seen as a super technology capable of combating all threats. Yes, it is a highly effective protection tool, but it is just one tool among many. My colleague Alexey Malanov even wrote an article on the subject, Myths about machine learning in cybersecurity.

At Kaspersky Lab, machine learning can be found in a number of different areas, especially when dealing with the interesting task of spam detection. This particular task is in fact much more challenging than it appears to be at first glance. A spam filter’s job is not only to detect and filter out all messages with undesired content but, more importantly, it has to ensure all legitimate messages are delivered to the recipient. In other words, type I errors, or so-called false positives, need to be kept to a minimum.
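To make the two competing goals concrete, here is a minimal sketch of how the false positive rate and the spam detection rate could be computed from a filter's verdicts. The function name `error_rates` and the sample data are illustrative assumptions, not part of any Kaspersky Lab product.

```python
def error_rates(labels, verdicts):
    """Return (false_positive_rate, detection_rate).

    labels and verdicts are parallel lists of booleans; True means "spam".
    """
    pairs = list(zip(labels, verdicts))
    fp = sum(1 for y, v in pairs if not y and v)      # legitimate mail blocked
    tn = sum(1 for y, v in pairs if not y and not v)  # legitimate mail delivered
    tp = sum(1 for y, v in pairs if y and v)          # spam blocked
    fn = sum(1 for y, v in pairs if y and not v)      # spam delivered
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    return false_positive_rate, detection_rate


# Hypothetical ground truth and filter verdicts for eight messages.
labels = [True, True, True, False, False, False, False, False]
verdicts = [True, True, False, False, False, True, False, False]
fpr, detection = error_rates(labels, verdicts)
```

In this toy sample one of five legitimate messages is blocked and two of three spam messages are caught; a filter is judged on both numbers, with the first weighted far more heavily.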

Another aspect that should not be forgotten is that the spam detection system needs to respond quickly. It must work pretty much instantaneously; otherwise, it will hinder the normal exchange of email traffic.

The task can be pictured as a project management triangle, except that in our case the three corners represent speed, absence of false positives, and the quality of spam detection; none of the three can be sacrificed. If we were to go to extremes, for example, spam could be filtered manually – this would provide 100% effectiveness, but minimal speed. In another extreme case, very rigid rules could be imposed, so no email messages whatsoever would pass – the recipient would receive no spam and no legitimate messages. Yet another approach would be to filter out only known spam; in that case, some spam messages would still reach the recipient. To find the right balance inside the triangle, we use machine learning technologies, part of which is an algorithm that enables the classifier to deliver a prompt and accurate verdict for every email message.

How is this algorithm built? Obviously, it requires data as input. However, before data is fed into the classifier, it must be cleansed of any ‘noise’, which is yet another problem that needs to be solved. The greatest challenge in spam filtering is that different people may have different criteria for deciding which messages are valid, and which are spam. One user may see sales promotion messages as outright spam, while another may consider them potentially useful. A message of this kind creates noise and thus complicates the process of building a quality machine learning algorithm. In the language of statistics, the dataset may contain so-called outliers, i.e., values that are dramatically different from the rest of the data. To address this problem, we implemented automatic outlier filtration, based on the Isolation Forest algorithm customized for this purpose. Naturally, this removes only some of the noise, but it has already made life much easier for our algorithms.
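The idea behind Isolation Forest can be sketched in a few dozen lines: points that can be separated from the rest with very few random splits receive a high anomaly score. The code below is a textbook-style illustration only, not the customized version mentioned above; the two features (number of links, share of uppercase characters) and all function names are assumptions made for the example.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Split on a random feature at a random threshold until points are isolated."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    low, high = min(values), max(values)
    if low == high:
        return {"size": len(points)}
    threshold = rng.uniform(low, high)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }


def path_length(point, node, depth=0):
    """Number of splits needed to reach a leaf, corrected for unsplit leaf contents."""
    if "feature" not in node:
        return depth + average_path_length(node["size"])
    side = "left" if point[node["feature"]] < node["threshold"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Score each point in (0, 1]; values closer to 1 mean easier to isolate."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / (n_trees * norm))
            for p in points]


# Hypothetical feature vectors: (number of links, share of uppercase characters).
data_rng = random.Random(1)
messages = [(data_rng.gauss(2.0, 1.0), data_rng.gauss(0.10, 0.03)) for _ in range(60)]
messages.append((40.0, 0.90))  # one message that looks nothing like the rest
scores = anomaly_scores(messages)
most_anomalous = max(range(len(messages)), key=lambda i: scores[i])
```

Here `most_anomalous` points at the appended message, which would be the first candidate for removal from the training set. How many of the highest-scoring points to drop is a tuning decision that the sketch leaves open.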

After this, we obtain data that is practically ‘clean’. The next task is to convert the data into a format that the classifier can understand, i.e., into a set of identifiers, or features. Three of the main types of features used in our classifier are:

Text features – fragments of text that often occur in spam messages. After preprocessing, these can be used as fairly stable features.
Expert features – features based on expert knowledge accumulated over many years in our databases. They may be related to domains, the frequency of headers, etc.
Raw features – perhaps the most difficult to understand. We use parts of the message in their raw form to identify features that we have not yet factored in. The message text is either transformed using word embedding or reduced to the Bag-of-Words model (i.e., formed into a multiset of words which does not account for grammar and word order), and then passed to the classifier, which autonomously identifies features.
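The Bag-of-Words reduction mentioned in the last item is easy to illustrate. The following is a minimal sketch using only the Python standard library; the tokenization rule and the sample vocabulary are assumptions chosen for brevity, and a production pipeline would normally do far more preprocessing.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a text to a multiset of lowercase tokens, ignoring order and grammar."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


def to_vector(bag, vocabulary):
    """Turn a bag into a fixed-length count vector that a classifier can consume."""
    return [bag[word] for word in vocabulary]


first = bag_of_words("Win money now! Win big, win fast.")
second = bag_of_words("fast win big win now money win")

# Hypothetical vocabulary; a missing word simply yields a zero count.
vocabulary = ["win", "money", "invoice"]
vector = to_vector(first, vocabulary)
```

The two sample texts produce identical bags even though their word order differs, which is exactly the information the model discards.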

All these features and their combinations will help us in the final stage – the launch of the classifier.

What we eventually want to see is a system that produces a minimum of false positives, works fast and achieves its principal aim – filtering out spam. To do this, we build an ensemble of classifiers, with a separate classifier for each set of features. For example, the best results for expert features were demonstrated by gradient boosting – the sequential building up of a composition of machine learning algorithms, in which each subsequent algorithm aims to compensate for the shortcomings of all previous algorithms. Unsurprisingly, boosting has demonstrated good results in solving a broad range of problems involving numerical and categorical features. Finally, the verdicts of all classifiers are integrated, and the system produces a final verdict.
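The phrase "each subsequent algorithm aims to compensate for the shortcomings of all previous algorithms" can be shown with a deliberately small sketch: gradient boosting with squared loss, where every new one-split tree (a stump) is fitted to the residual errors of the ensemble so far. The single feature (number of recipients), the data and the function names are all assumptions for illustration; real systems use many features and a mature library.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps sequentially, each one to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps, history = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stump = (threshold, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(stump)
        predictions = [p + (stump[1] if x <= threshold else stump[2])
                       for x, p in zip(xs, predictions)]
        # Track the training error after each round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return (base, stumps), history


def predict(model, x):
    """Sum the base value and the contribution of every stump."""
    base, stumps = model
    return base + sum(left if x <= threshold else right
                      for threshold, left, right in stumps)


# Hypothetical expert feature: number of recipients; label 1.0 means spam.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
model, history = boost(xs, ys)
```

The `history` list records the training error after each round; it never increases, because each stump removes part of the error the earlier stumps left behind. The sketch requires at least two distinct feature values and says nothing about how the verdicts of several classifiers are combined.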

Our technologies also take into account potential problems such as overfitting (over-training), i.e., a situation where an algorithm works well with a training data sample, but is ineffective with a test sample. To prevent this sort of problem, the parameters of the classification algorithms are selected automatically, with the help of a Random Search algorithm.
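A minimal sketch of random search follows, under the assumption that candidate parameters are scored on a held-out validation set rather than on the training data (the article does not describe the exact procedure). The weighted k-nearest-neighbours model, the parameter ranges and the synthetic data are stand-ins chosen to keep the example short.

```python
import random


def knn_predict(train, x, k, power):
    """Distance-weighted vote of the k nearest training points on one numeric feature."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    vote = sum((1.0 if label else -1.0) / (abs(fx - x) + 1e-9) ** power
               for fx, label in neighbours)
    return vote > 0


def accuracy(train, held_out, k, power):
    """Share of held-out points whose label the model predicts correctly."""
    hits = sum(knn_predict(train, x, k, power) == label for x, label in held_out)
    return hits / len(held_out)


def random_search(train, validation, n_trials=30, seed=0):
    """Sample parameter combinations at random and keep the best one on validation data."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(n_trials):
        params = {"k": rng.randint(1, 15), "power": rng.uniform(0.0, 2.0)}
        score = accuracy(train, validation, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def make_sample(rng, size):
    """Synthetic data: spam tends to have a larger feature value than legitimate mail."""
    sample = []
    for _ in range(size):
        is_spam = rng.random() < 0.5
        sample.append((rng.gauss(6.0 if is_spam else 3.0, 2.0), is_spam))
    return sample


data_rng = random.Random(42)
train = make_sample(data_rng, 60)
validation = make_sample(data_rng, 30)
best_params, best_score = random_search(train, validation)
```

Because every candidate is judged on messages the model has not seen during fitting, a combination that merely memorizes the training sample gains no advantage. A separate test sample would still be needed for an unbiased final estimate.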

This is a general overview of how we use machine learning to combat spam. To see how effective this method is, it is best to view the results of independent testing.

Redin

Posted on January 21, 2017. 4:05 pm

How many more years will it take to realize that filtering-based models do not work or ever work? 40 years is not enough?
The P2T method returns this control to the user and removes all motivation to the spammer to continue. In this way, the burden of managing all mail produced goes to the sender’s side.
