Recently, there has been a dustup around a process we developed last July to facilitate the testing of malware prevention products.
A key part of this process is providing users of the methodology with new, never-before-seen pieces of malware. This matters because many antivirus (AV) engines simply aggregate hash results in their cloud, meaning that most malware available from public sources is already “known” malware.
The whole point of an anti-malware product is to protect you against malware that has never been seen before, so to run a fair and honest test, we “mutate” the malware. Generally, this can be achieved with a simple, off-the-shelf packer like MPRESS or VMProtect.
We follow the processes outlined in our Test for Yourself methodology. Anyone who wishes can repeat this process, turning known malware into unknown malware.
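To make that concrete, here is a minimal sketch of the mutation step in Python. It assumes the MPRESS command-line packer is installed on your PATH and packs a file in place, and the sample paths are purely hypothetical; the point is simply that the packed file’s hash no longer matches anything a cloud hash lookup already knows.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mutate(sample: Path, workdir: Path) -> Path:
    """Copy a known sample and run it through an off-the-shelf packer.

    Assumption: the MPRESS command-line packer is on PATH and packs the
    file in place. Substitute whatever packer you actually use.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    mutated = workdir / sample.name
    shutil.copy2(sample, mutated)
    subprocess.run(["mpress", str(mutated)], check=True)
    return mutated


if __name__ == "__main__":
    original = Path("samples/known_malware.exe")  # hypothetical known sample
    packed = mutate(original, Path("mutated"))
    print("original:", sha256(original))
    print("mutated: ", sha256(packed))
    # The digests differ, so an engine relying on cloud hash lookups
    # now sees the mutated file as never-before-seen.
```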
This process is not new, nor do we claim to have invented it. What we are proposing is standard practice in the malware industry. Many attackers go far beyond these simple mutations, completely recompiling their malware with polymorphic engines.
We chose simplicity because it was easier for people to perform for themselves, and because it laid bare the minimum level of skill required to cause a $5-billion-a-year industry to fail.
Understanding Testing Methodologies
We’ve long held the belief that testing firms need to consider altering their samples prior to testing. By using samples from public sources, or samples sent in by the vendors themselves, they bias tests in favor of products built on a multitude of basic signatures, which realistically have no predictive power against malware that hasn’t been written yet.
Without testing unknown samples, or running a reliable, real-time “real-world” test, testing firms are just repeating the same basic exercise of testing with known samples. Often, when a test claims “real-world” status, it is simply using malware from the wild, collected once again from a known-bad list.
This just extends the nonsense testing in one more way, as these “known-bad lists” are generally built from existing convictions by AV vendors.
This “mutation” pipeline has one unintended side effect. While we are very certain that the malware inputs are 100% malware when we mutate them, sometimes the mutation process breaks them.
This happens periodically when a malware sample is already obfuscated or wrapped, such as inside a self-extracting zip archive or an installer. Packing one of these changes the internals of the program.
Generally, this is exactly the intended consequence, but in some situations the malware (or the shell the attacker put around it) carries consistency checks known as checksums. Since we run a simple mutation pipeline, we did not go to the effort of always finding and fixing these checksums, and sometimes the process produced valid (correctly formatted) but unusable malware.
In other words, it simply didn’t run. When we notice this, we do our best to rectify the process.
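For anyone repeating the process, a quick sanity check catches the most obvious breakage. The sketch below uses the open-source pefile library and only verifies the PE optional-header checksum; it cannot see the internal CRCs of installers or self-extracting archives, so detonating the mutated sample remains the real test.

```python
# Sanity check on mutated samples using the pefile library.
import pefile


def header_checksum_ok(path: str) -> bool:
    """Return True if the PE optional-header checksum is unset or still valid."""
    pe = pefile.PE(path)
    stored = pe.OPTIONAL_HEADER.CheckSum
    if stored == 0:
        # Many binaries never set the header checksum; nothing to verify.
        return True
    return stored == pe.generate_checksum()


if __name__ == "__main__":
    sample = "mutated/known_malware.exe"  # hypothetical path from the step above
    print(sample, "ok" if header_checksum_ok(sample) else "checksum broken by packing")
```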
As part of all of this, a particular testing vendor decided to test us using signatures written to identify specific mutations. This goes against everything we stand for, but it does raise an interesting point. What this tester discovered was our use of centroids.
Introducing Centroids
A “centroid” is a term we use at Cylance for a specific clustering approach we’ve developed. Like our model, a centroid works over the vast array of data points (or features) we currently understand. By training a centroid, we can refine the scores our models produce.
Model training is intensive and requires significant resources, and changing a model’s behavior after training is no simple feat. We use centroids to address areas where we’ve identified misclassifications in our models.
When we train a model, we use thousands of curated sets of both malware and good files to ensure that we have the broadest possible understanding of the world of malware.
For each set, we select a part of that set to train on, and a part of that set to validate our training. In some cases, we underperform (too many false positives, or too many false negatives) on a set, and we can adjust that by creating a centroid. In others, we end up retraining the model entirely.
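As a rough illustration of that per-set evaluation (a toy sketch with stand-in features and a stand-in classifier, not our production pipeline), the snippet below holds out part of a curated set and flags the set when false-positive or false-negative rates exceed assumed budgets:

```python
# Toy sketch of per-set evaluation: hold out part of a curated set,
# train a stand-in classifier, and flag the set if it misses the
# (assumed) false-positive / false-negative budgets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate_set(features, labels, fp_budget=0.01, fn_budget=0.05):
    """Train on part of a curated set, validate on the rest, and report
    whether the set needs a correction (e.g. a centroid)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        features, labels, test_size=0.3, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_val)
    fp_rate = np.mean((pred == 1) & (y_val == 0))  # benign flagged as malware
    fn_rate = np.mean((pred == 0) & (y_val == 1))  # malware missed
    needs_centroid = fp_rate > fp_budget or fn_rate > fn_budget
    return model, fp_rate, fn_rate, needs_centroid
```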
We maintain internal curated sets of MPRESS- and VMProtect-packed files (two very common packers) that we use to ensure we can identify malware whether it is packed or not. These, among several other sets, have centroids built for them to ensure broad classification correctness before we ship a model.
We let the original training do its work and use clusters to even out the results. This process is unique to us, but similar approaches appear wherever supervised and unsupervised learning are combined in a single system.
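To make the centroid idea concrete, here is a minimal sketch assuming generic numeric feature vectors; the radius rule, the score adjustment, and the helper names are illustrative rather than our internal implementation:

```python
# Minimal centroid-style correction: remember the mean and spread of a
# set the model gets wrong, and nudge the score of new samples that
# land inside that cluster. All names and thresholds are illustrative.
import numpy as np


class Centroid:
    def __init__(self, vectors: np.ndarray, adjustment: float):
        self.center = vectors.mean(axis=0)
        # Radius: mean distance of the cluster's own members to its center.
        self.radius = np.linalg.norm(vectors - self.center, axis=1).mean()
        self.adjustment = adjustment  # how far to push the model's score

    def matches(self, features: np.ndarray) -> bool:
        return np.linalg.norm(features - self.center) <= self.radius


def corrected_score(model_score: float, features: np.ndarray, centroids) -> float:
    """Combine the trained model's score with any centroid corrections."""
    for c in centroids:
        if c.matches(features):
            model_score += c.adjustment
    return float(np.clip(model_score, 0.0, 1.0))


# Usage sketch: build a centroid from packed samples the model misses
# (extract_features is a hypothetical feature extractor), then score.
# missed = np.vstack([extract_features(p) for p in missed_packed_samples])
# packer_centroid = Centroid(missed, adjustment=+0.4)
# score = corrected_score(raw_model_score, extract_features(sample), [packer_centroid])
```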
Rebooting the Testing Industry
We at Cylance have long been leading the industry charge to improve how the efficacy of endpoint anti-malware security products is evaluated.
We have been disrupting the endpoint security market by advocating for systemic change to the pay-to-play testing industry, much to the dismay of vendors and testing firms who have a financial stake in maintaining the status quo.
We thank Ars Technica for broaching this subject once more in this article, which corroborates our assertions about the glaring flaws in testing methodologies and serves as a further call to action.
Testing needs to change - we've been saying it for years, and we will continue to call out bad testing practices.
We believe public testing of anti-malware products is fundamentally flawed, and thus we have encouraged everyone to Test for Yourself. Don’t trust the vendor, and don’t trust the testing community. Trust only yourself. Your environment is unique.
We also believe it's important to test with malware samples that no one has seen before. Choosing malware sets from public malware repositories is testing for the past, not the future.
It DOES matter that your endpoint security product can predict and prevent future malware, aka zero-days. To do this, you need to test against malware samples that NO ONE has seen before.
Using never-before-seen malware is the only way to truly test how products would perform in the face of an active attacker.
Ask yourself: Why do some AV vendors and testing firms refuse to add elements of the unknown to their testing methods? Their efficacy rates would plummet in such real-world scenarios.
We are pushing for reforms that will result in fair testing methods, and true independent testing that will ultimately benefit the users of these products.
The testing industry is governed by a body of members whose preferred methodologies have remained static for years. Having vendors and testers collaborate on testing methods is like having pharmaceutical companies coach the FDA on how to test drugs.
Independent tests that don't use real-world methodologies are useless; a test with no way to emulate the real world tells you nothing. AV-Test and NSS Labs are working to make their tests more akin to the real world, and we support them in those efforts. While no testing house is perfect, they are making changes that serve the world at large.
Testing is a for-profit industry, and implementing unbiased testing methodologies makes it harder for unethical players to produce biased reports in return for fees that run into the hundreds of thousands of dollars.
We appreciate resources like testmyav.com, which takes the initiative in providing testing methods and malware for users to test products for themselves, and we will continue to support such initiatives.
We encourage you to Test for Yourself. Res Ipsa Loquitur! (“The thing speaks for itself!”)