One question we are often asked is how exactly Infinity determines what is a threat. I usually have two answers ready, the first being “magic” (it’s not really magic, but it gives me an excuse to wave my hands around), and the second being “how much time do you have?”
The concept for Infinity and its mathematical underpinnings draw heavily from the field of artificial intelligence, and in particular the subfield of machine learning. The overall principles of machine learning are relatively easy to grasp. Consider the case where you want to train a machine to distinguish between photos of cats and dogs. First, you provide the machine with pictures of cats, and inform the machine that these are in fact photos of cats. Then, you provide a second group of pictures of dogs, and inform the machine that these are dog photos. Ideally, once the machine has seen photos of cats and dogs, it should then be able to look at new photos and determine if the photo is of a cat or dog.
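To make the train-then-predict loop concrete, here is a minimal sketch in Python. It is not how Infinity works internally. It uses a nearest-centroid classifier, chosen only because it fits in a few lines, and the two numeric features per animal (weight in kilograms and ear length in centimeters) and all of the values are made up for illustration.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the new example."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical labeled examples: ([weight_kg, ear_length_cm], label).
training_data = [
    ([4.0, 6.5], "cat"),
    ([3.5, 6.0], "cat"),
    ([5.0, 7.0], "cat"),
    ([20.0, 10.0], "dog"),
    ([30.0, 12.0], "dog"),
    ([12.0, 9.0], "dog"),
]

model = train(training_data)
small_animal = predict(model, [4.2, 6.4])
large_animal = predict(model, [25.0, 11.0])
```

The machine is never told a rule such as "cats are lighter than dogs". It only sees labeled examples and derives its own decision from them, which is the principle the cat and dog example describes.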
While the idea is simple at a high level, an enormous number of variables influence how effectively machine learning techniques solve problems. For example, how does the machine process data? How much data is enough? How long does it take to process all of the data? These (among many other issues) are nontrivial points that we had to consider when architecting Infinity. To provide further insight into how Infinity learns to detect malware, let’s dive into some of these challenges.
Data Representation
In the case of a machine learning system designed to determine if a photo is of a cat or dog, one critical question that must be resolved is how to represent a photo to the learning system. One simple approach would be to inform the learning system of the color of every pixel in the image. While this would provide data for a learning system, one can imagine more informative representations of the data. Perhaps we could extract different shapes from the photo, or find the locations of eyes or tails, or many other interesting features.
This type of extraction from raw data is commonly referred to as feature extraction in machine learning nomenclature, and is a part of our Infinity pipeline. Machine learning systems rely heavily on proper feature extraction. One familiar adage of machine learning is “garbage in, garbage out” - meaning if your data representation is poor, then your machine learning system will perform poorly. In the example of cat and dog photos, one can imagine which representations would fall on the poor side and which would fall on the rich side.
Of course, in our case, instead of dealing with photos of cats and dogs, we deal with malicious and non-malicious files. To determine what we want to extract from files, we have leveraged the expertise of our reverse engineers and data scientists to develop a feature extraction component of Infinity that provides an incredibly rich representation of data for Infinity to digest. In terms of raw data, Infinity can extract well over 1 million different data points that are used to define a sample file. While a human analyst looking at 1 million data points would simply be unable to deal with the volume of information, Infinity is well equipped to handle enormous representations of data. This well-designed feature extraction component of Infinity forms the basis of what we feed as training data to our machine learning system.
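As a rough illustration of what turning a file into data points can look like, the sketch below converts raw bytes into a fixed-length numeric vector. The features chosen here (file size, Shannon entropy of the bytes, and a normalized byte histogram) are generic examples assumed for illustration only. They are not Infinity's actual feature set, and the function name extract_features is hypothetical.

```python
import math
from collections import Counter


def extract_features(data: bytes) -> list:
    """Turn raw file bytes into a fixed-length vector of 258 numbers."""
    size = len(data)
    counts = Counter(data)
    # Fraction of the file made up of each possible byte value (0..255).
    histogram = [counts.get(b, 0) / size if size else 0.0 for b in range(256)]
    # Shannon entropy in bits per byte: 0 for constant data, 8 for uniform data.
    entropy = -sum(p * math.log2(p) for p in histogram if p > 0)
    return [float(size), entropy] + histogram


# Two toy inputs: highly repetitive bytes versus every byte value exactly once.
repetitive = extract_features(b"\x00" * 16)
uniform = extract_features(bytes(range(256)))
```

Every file, whatever its length, comes out as a vector of the same width, which is what lets a learning system compare one sample with another.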
Data Volume
Another issue that is often discussed in machine learning is exactly how much data is ideal for a learning system to train on before it is considered mature enough to start making decisions. And whatever the determined amount of data is, typically there is a better answer, and that answer is “more”. With that in mind, we designed Infinity to be able to constantly bring in new data. In fact, Infinity can bring in well over 3 million samples a day, and can easily scale to handle considerably larger amounts of data. This constant stream of new data not only provides Infinity with more data to learn from, actively improving the ability of Infinity to detect threats - it also allows Infinity to identify new trends and anomalies occurring in the real world at a real-time pace.
Crunching Numbers
Now that we have an idea of the volume of data Infinity can handle on a day-to-day basis, the next question is how the machine learning component processes such huge volumes of data to develop the mathematical models used to identify malware. For those not familiar with the implementation details of machine learning algorithms, many of these techniques require computationally expensive operations. Consider the case where we want to have a machine learning component train on 3 million samples. Assume that each sample generates 1 million data points, and say for the sake of simplicity that each data point is represented by one byte. If we were to construct a matrix containing this data, the matrix itself would be 3,000,000,000,000 bytes (3 TB) of data. And that is just the representation of the data. One also has to factor in the mathematics required to construct models, and one could easily see the need for a total number of calculations on the order of 10,000,000,000,000,000+ (more than ten QUADRILLION). To put this into perspective, the world’s leading supercomputers are measured in petaFLOPS, where one petaFLOPS is one quadrillion floating point operations per second.
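The arithmetic above is easy to check. The short Python sketch below reproduces the matrix size and estimates how long ten quadrillion operations would take at two different speeds. The 10 gigaFLOPS figure for a single machine is an assumption chosen for illustration, not a measurement of any real system.

```python
# Figures taken from the text above.
samples = 3_000_000
data_points_per_sample = 1_000_000
bytes_per_data_point = 1

matrix_bytes = samples * data_points_per_sample * bytes_per_data_point
matrix_terabytes = matrix_bytes / 10**12  # decimal terabytes

# Lower bound on the calculations mentioned in the text.
operations = 10**16

# One petaFLOPS is 10**15 floating point operations per second.
one_petaflops = 10**15
seconds_at_one_petaflops = operations / one_petaflops

# Assumed speed of a single ordinary machine (illustrative only).
assumed_single_machine_flops = 10**10
seconds_on_single_machine = operations / assumed_single_machine_flops
days_on_single_machine = seconds_on_single_machine / 86_400
```

Under these assumptions the matrix is 3 TB, a one petaFLOPS system would need about 10 seconds, and the single machine would need roughly 11.6 days, which is why the work has to be spread across many machines.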
In order to deal with the large amount of data and required CPU cycles needed to build effective machine learning models, we architected Infinity from the ground up with heavy parallelization at its core. This allows us to easily run enormous calculations quickly and effectively over hundreds to thousands of machines at once, providing tremendous value to our ability to train and evaluate new machine learning models.
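The general pattern behind this kind of parallelization is to split the data into chunks, let each worker compute a partial result, and then combine the partial results. The sketch below shows that pattern on a toy problem: computing the mean of each column of a data matrix. It is an illustrative assumption, not Infinity's implementation. It uses threads on one machine to stay self-contained, whereas CPU-heavy work in practice would run in separate processes or on separate machines, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk):
    """Return the column sums and row count for one chunk of rows."""
    totals = [0.0] * len(chunk[0])
    for row in chunk:
        for i, value in enumerate(row):
            totals[i] += value
    return totals, len(chunk)


def parallel_column_means(rows, n_workers=4):
    """Split rows into chunks, sum each chunk in a worker, then combine."""
    if not rows:
        raise ValueError("rows must not be empty")
    chunk_size = -(-len(rows) // n_workers)  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(partial_sums, chunks))
    width = len(rows[0])
    totals = [sum(part[0][i] for part in results) for i in range(width)]
    count = sum(part[1] for part in results)
    return [total / count for total in totals]


# Toy data matrix: 1,000 rows with 2 columns each.
rows = [[float(i), float(2 * i)] for i in range(1000)]
means = parallel_column_means(rows)
```

Because each chunk is processed independently and only small partial results are combined at the end, the same structure extends from four workers to hundreds or thousands of machines.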
Designing Infinity with these three considerations in mind has allowed us to develop a world-class machine learning infrastructure. And as Infinity continues to grow, learn, and improve, our underlying architecture provides the means to keep scaling to meet the demands of ever-expanding data. Now we can revisit the original question in this context: How does Infinity determine what is a threat? Infinity has learned by processing massive amounts of malicious and non-malicious data, more data than any human could possibly examine on their own. Armed with this extensive knowledge of what is malicious and non-malicious, it can examine the characteristics of a single file and, ultimately, infer whether that file has malicious intentions.
We're just getting started. Stay tuned for more.
- The Cylance Infinity Team