Comprehensive malware research can be a difficult task. Before reversing and constructing the timeline, the reverser needs a significant set of samples of the malware from multiple stages of its development. Finding similar samples can be quite difficult, as comparing files at scale is computationally expensive and often unfruitful. Naturally, having a problem with scale and malware, we brought out the big guns: machine learning.
First things first - I am not a Data Scientist. I am a Security Researcher slowly learning basic machine learning principles. I found it imperative to start learning how to leverage machine learning in my research after seeing the seemingly boundless possibilities of machine learning from the talented Data Scientists we have here at Cylance.
For a Security Researcher such as myself, I often find one of my more challenging problems is searching for similar samples of malware I already have so I may fully analyze a malware family over time. Using YARA signatures can be useful, but they are only as good as the signature created. Using fuzzy hashes like ssDeep can be helpful, but they are extremely limited due to scaling issues despite recent optimizations (https://www.virusbtn.com/virusbulletin/archive/2015/11/vb201511-ssDeep).
Spring Dragon APT
Palo Alto Networks released a report on the Spring Dragon threat actor earlier this year (http://researchcenter.paloaltonetworks.com/2015/06/operation-lotus-blossom/). The report provided great detail on the Lotus Elise malware (this malware is not targeted at motor vehicles, just named after them), which was used as a backdoor. Interestingly enough, the report only briefly mentions the Lotus Evora malware, but provides few details and indicators of compromise (IOCs).
This Evora malware seemed like a great candidate for a write up, as very few details were publicly available, and there were few IOCs. After obtaining some initial Evora samples, I started my analysis, working towards dumping configuration details from samples, as well as determining functionality (I will cover these in a future blog post).
After my initial analysis, I came to the conclusion that I had too few samples. Even after doing some YARA-based searches, it was clear there had to be more than the samples I already possessed. At this point, I consulted with a number of our machine learning gurus, who suggested using a centroid to find more samples.
Centroids and Distances
At this point, I’ll take a step back to describe the concepts behind a centroid and why this would be useful.
During our normal processing of a file to determine whether it is benign or malicious, we extract millions of “features” for every file. Features are a set of attributes which describe a file, and can range from the number of resources to the entropy in a particular section. When we extract all the features from a file we are processing, we end up with a “vector”, or an array of numbers representing all the features from that file.
A centroid, in this case, is represented by a vector that describes the central point of a set of vectors derived from files. A radius which represents the maximum distance from the centroid that a similar sample will fall into can then be defined.
Now one may be asking themselves, “Why is he going on about distance?” Consider the following:
To determine the distance between two points on a 2D plane, we use the Pythagorean theorem (https://en.wikipedia.org/wiki/Pythagorean_theorem). It is highly probable you learned this in grade school:
Essentially, if we square the difference between the points on the horizontal dimension and add that to the square of the difference between the points on the vertical dimension, we have the distance squared. Take the square root, and you have your distance.
If we treat each of our features as dimensions, we should be able to find the distance between vectors to determine if they fall within our radius. The high dimensional distance calculation we will use for this is referred to as Euclidean distance (https://en.wikipedia.org/wiki/Euclidean_distance), which is remarkably similar to Pythagorean’s Theorem.
To search within the centroid, we calculate the distance between our centroid and each file we have processed on the relevant features, and consider them similar to our original files if the distance is less than our radius.
That is how we use a centroid to search, but now we need to generate the centroid, which can be quite difficult. I am dedicated to updating my knowledge of machine learning any time I am able to do so, therefore, I took this opportunity to throw myself into the deep end.
To build this centroid, I first tried to find a set of features that would allow me to search with a radius of 0. This means an exact match of feature values across all relevant features. This was easy enough to determine, as I simply looked for any features that were the same across all Evora samples. Luckily for me, I was able to find a significant number of features that were an exact match across my Evora samples.
After testing this centroid against a set of unrelated samples locally, the centroid was ready to run against our collection of samples. Our expert data scientist loaded the centroid and returned a surprising set of hashes.
Finding Evora and Elise in the Results
After an initial inspection of the files that came back, I was surprised. Not only were Evora samples returned, but also a number of Elise samples! The centroid-based search was not only viable for determining other Evora samples, but additionally, other malware from the same threat actor. The efficacy of machine learning, even when used by an amateur such as myself, can be quite high.
I now had 28 Evora samples (including my original set) and 39 Elise samples (all found with machine learning).
Evora and Elise
I plan on writing a follow-up blog post detailing both Evora and Elise.
While I am not a Data Scientist, I continually find that machine learning is a powerful tool. It can have a steep learning curve, but is more than worth the effort to learn. Information security has no lack of problems to solve, and I have no doubt that machine learning is one of the tools necessary to solve the more difficult ones.