On the way to better testing

Have you ever found a false positive when uploading a file to a website like VirusTotal? Sometimes not just one scanner detects the file, but several. This leads to an absurd situation where every product that doesn’t detect the file automatically looks bad to users who don’t understand that these detections are false positives.

Sadly, you will find the same situation in a lot of AV tests, especially in static on-demand tests, where sometimes hundreds of thousands of samples are scanned. Naturally, validating such a huge number of samples requires a lot of resources. That’s why most testers can only verify a subset of the files they use. What about the rest? The only way for them to classify the rest of their files is to use a combination of source reputation and multi-scanning. This means that, as in the VirusTotal example above, every company that doesn’t detect samples that are detected by other companies will look bad – even if the samples are corrupted or absolutely clean.
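To make the effect concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the scanner names A to D, the three sample files and the threshold of three detections are invented for illustration and do not describe any real tester's procedure. It compares the detection rate a scanner gets when samples are labelled by multi-scanner consensus with the rate it gets when samples are labelled by validated ground truth.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Sample:
    name: str
    truly_malicious: bool      # ground truth; a tester only knows this after validation
    detected_by: frozenset     # hypothetical scanners that flag the file


def consensus_label(sample: Sample, threshold: int) -> bool:
    """Treat a sample as malware if at least `threshold` scanners flag it."""
    return len(sample.detected_by) >= threshold


def detection_rate(scanner: str, samples: list[Sample],
                   is_malware: Callable[[Sample], bool]) -> float:
    """Share of the samples labelled as malware that `scanner` detects."""
    counted = [s for s in samples if is_malware(s)]
    if not counted:
        return 0.0
    hits = sum(1 for s in counted if scanner in s.detected_by)
    return hits / len(counted)


# Invented test set: one real Trojan, one clean file and one corrupted file.
# Scanners A, B and C flag all three; scanner D flags only the real Trojan.
SAMPLES = [
    Sample("trojan.bin", True, frozenset({"A", "B", "C", "D"})),
    Sample("clean_tool.exe", False, frozenset({"A", "B", "C"})),
    Sample("corrupt.dat", False, frozenset({"A", "B", "C"})),
]

by_consensus = detection_rate("D", SAMPLES, lambda s: consensus_label(s, 3))
by_validation = detection_rate("D", SAMPLES, lambda s: s.truly_malicious)

print(f"D scored against consensus labels: {by_consensus:.0%}")
print(f"D scored against validated labels: {by_validation:.0%}")
```

In this toy set, scanner D is the only product without a false positive, yet it scores worst when the labels come from consensus, while the scanners that flag the clean and corrupted files score 100%.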

Since good test results are a key factor for AV companies, this has led to the rise of multi-scanner based detection. Naturally, AV vendors, including us, have been scanning suspicious files with each other’s scanners for years now. Obviously, knowing what verdicts other AV vendors produce is useful. For instance, if 10 AV vendors detect a suspicious file as a Trojan downloader, this helps you know where to start. But this is certainly different from what we’re seeing now: driven by the need for good test results, the use of multi-scanner based detection has increased a lot over the last few years. Of course, no one really likes this situation – in the end our task is to protect our users, not to hack test methodologies.

This is why a German computer magazine conducted an experiment, the results of which were presented at a security conference last October: they created a clean file, asked us to add a false detection for it and finally uploaded it to VirusTotal. Some months later this file was detected by more than 20 scanners on VirusTotal. After the presentation, representatives from several AV vendors at the event agreed that a solution should be found. However, multi-scanner based detection is just the symptom – the root of the problem is the test methodology itself. Unfortunately, there isn’t much AV companies can do about it, because in the end it’s magazines that order tests – and if they can choose between a cheap static on-demand test using an impressive-sounding 1 million samples (some of which are several months old) or an expensive dynamic test with fewer, but validated, zero-day samples, most magazines will choose the first option.

As I’ve mentioned above, AV companies as well as most testers are aware of this problem, and they aren’t too happy about it. Improving test methodologies was also one of the reasons why, two years ago, a number of AV companies (including us), independent researchers and testers founded AMTSO (Anti-Malware Testing Standards Organization). But in the end it’s the journalists who play the key role. This is why we decided to illustrate the problem during our recent press tour in Moscow, where we welcomed journalists from all around the world. Naturally, the goal was not to discredit any AV companies (you could also find examples where we detected a file because of the multi-scanner’s influence), but to highlight the negative effect of cheap static on-demand tests.

What we did pretty much replicated what the German computer magazine did last year, only with more samples. We created 20 clean files and added a fake detection for 10 of them. Over the next few days we re-uploaded all 20 files to VirusTotal to see what would happen. After ten days, all of our detected (but not actually malicious) files were detected by up to 14 other AV companies – in some cases the false detection was probably the result of aggressive heuristics, but multi-scanning obviously influenced some of the results. We handed out all the samples used to the journalists so they could test them for themselves. We were aware this might be a risky step: since our presentation also covered the question of intellectual property, there was a risk that journalists might focus on who copies from whom, rather than on the main issue (multi-scanning being the symptom, not the root cause). But at the end of the day, it’s the journalists who have it in their power to order better tests, so we had to start somewhere.

So where should we go from here? The good news is that in the last few months, some testers have already started to work on new test methodologies. Instead of static on-demand scanning, they try to test the whole chain of detection components: anti-spam module -> in-the-cloud protection -> signature-based detection -> emulation -> behavior-based real-time analysis, etc. But ultimately, it’s up to the magazines to order this type of test and to abandon approaches that are simply outdated.
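The difference between the two kinds of test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real product or test lab: the layer names follow the chain above, while the sample flags (such as `malicious_at_runtime`) and the blocking rules are invented. A whole-chain test counts a sample as blocked if any layer stops it, whereas a static on-demand test effectively exercises only the signature layer.

```python
from __future__ import annotations

from typing import Callable, Optional

# A layer is a function that returns True if it blocks the sample.
Layer = Callable[[dict], bool]

# Hypothetical chain; each rule simply reads a made-up flag from the sample.
CHAIN: list[tuple[str, Layer]] = [
    ("anti-spam module", lambda s: s.get("arrived_via_spam", False)),
    ("in-the-cloud protection", lambda s: s.get("known_to_cloud", False)),
    ("signature-based detection", lambda s: s.get("has_signature", False)),
    ("emulation", lambda s: s.get("malicious_in_emulator", False)),
    ("behavior-based analysis", lambda s: s.get("malicious_at_runtime", False)),
]


def first_blocking_layer(sample: dict,
                         layers: list[tuple[str, Layer]]) -> Optional[str]:
    """Return the name of the first layer that blocks the sample, or None."""
    for name, blocks in layers:
        if blocks(sample):
            return name
    return None


# Invented zero-day sample: no signature exists yet, but it misbehaves when run.
zero_day = {"malicious_at_runtime": True}

static_only = [layer for layer in CHAIN if layer[0] == "signature-based detection"]

print("whole chain:", first_blocking_layer(zero_day, CHAIN))
print("static on-demand only:", first_blocking_layer(zero_day, static_only))
```

In this sketch the same product blocks the zero-day sample in the whole-chain test but is scored as a miss in the static-only test, because the layer that would have stopped the sample never runs.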

If we get rid of static on-demand tests with their mass of unvalidated samples, the copying of classifications will at least be significantly reduced, test results will correspond more closely to reality (even if that means saying goodbye to 99.x% detection rates) and in the end everyone will benefit: the press, the users and of course us as well.