The 21st century marks the rise of artificial intelligence (AI) and machine learning (ML) capabilities for mass consumption. Five years ago, there were virtually no cybersecurity vendors leveraging AI, but nowadays everyone seems to be using these 'magic' words.
According to Gartner Research, the total market for all security will exceed $120B in 2019. However, the current hype around AI and ML has created a certain amount of confusion in the marketplace. Here is my take on what matters when evaluating an AI solution for cybersecurity.
The Two Types of AI:
- Strong artificial intelligence describes a mindset of AI development whose goal is a machine with intellectual capability functionally equal to a human's. The cybersecurity industry is not there (yet?). Whether strong AI can ever exist remains an open question.
- Weak artificial intelligence is a form of AI designed to focus on a narrow task and to seem very intelligent at it. It contrasts with strong AI, in which an AI is capable of any and all cognitive functions that a human may have and is in essence no different from a real human mind.
Today, most successful research and advancements have come from the sub-discipline of AI called machine learning, which focuses on teaching machines to learn by applying algorithms to data. Often, the terms AI and ML are used interchangeably, generating additional confusion in the marketplace.
What to Consider and Ask for Clarification On
Features: Features are critical to any ML model because they determine what and how information is exposed. Besides the important question of what information to include, how the information is encoded also matters.
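To make the encoding point concrete, here is a minimal sketch of two common ways a file attribute might be exposed to a model. The section names and sizes are invented for illustration; they are not features of any particular product.

```python
# Hypothetical illustration: two ways to encode the same file attributes
# for an ML model. The vocabulary and values are made up for the example.

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def min_max(value, lo, hi):
    """Scale a numeric value into [0, 1] given known bounds."""
    return (value - lo) / (hi - lo)

SECTION_NAMES = [".text", ".data", ".rsrc"]  # assumed vocabulary

# A fictional sample: an executable section's name and its size in bytes.
features = one_hot(".data", SECTION_NAMES) + [min_max(4096, 0, 65536)]
print(features)  # [0.0, 1.0, 0.0, 0.0625]
```

The same underlying information (a section name, a size) can produce very different model behavior depending on whether it is encoded as a category, a raw number, or a scaled value — which is why the encoding question matters as much as the selection question.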
Data Sets: Effective machine learning requires vast quantities of data to ‘train’ on, a hurdle 21st-century innovation has already overcome: Big Data and the Internet of Things (IoT) now produce data at an unprecedented scale. The data used to train and evaluate an ML model fundamentally shapes its performance. If the training data are not representative of the real world, the model will fail to do well in the field. I deliberately left out data hygiene, unbalanced data, labeling, label noise, etc. If you would like to learn more about the importance of features and data sets, you can read the 173 pages of pure science produced by the Cylance Data Science Team, Introduction to Artificial Intelligence for Security Professionals, along with the example code for the book on GitHub.
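One of the dataset issues alluded to above — class imbalance — is easy to demonstrate. The sketch below uses made-up labels to show why raw accuracy on a skewed dataset says little about real-world performance:

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of each label in a dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Fictional training labels: heavily skewed toward benign samples.
train_labels = ["benign"] * 90 + ["malicious"] * 10
balance = class_balance(train_labels)
print(balance)  # {'benign': 0.9, 'malicious': 0.1}

# A naive "always predict benign" model scores 90% accuracy here,
# which says nothing about how it performs against real malware.
majority_accuracy = max(balance.values())
```

This is why representative, well-balanced datasets (and evaluation metrics beyond accuracy) matter so much when vendors quote headline numbers.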
Computing Power: Where is the Training Done?
Machine learning requires a massive amount of data to process, and it needs equally massive compute resources. Knowing where the training is done can be a good pre-indicator of the model’s robustness and goodness of fit, even before testing the solution.
Which Technology is Used and to What End?
Is ML used to automate human tasks such as generating signatures faster, or is it used to develop an entirely new approach? How long has the model been trained for? A solution that does not change the existing paradigm will most likely not protect you against the ever-increasing sophistication of adversaries.
In addition, rushing a model to market to jump on the AI bandwagon may expose companies to training data that has not been thoroughly scrubbed of anomalous data points, resulting in failures in the field.
How Are ML Engines Trained?
To give an example from my own experience, here at Cylance, we have created an Endpoint Protection Platform (EPP) solution that utilizes ML to prevent malware from executing on your system. The algorithm deployed on your endpoints allows you to defend them before malware has a chance to execute.
To create our first ML engine, our data scientists initially fed the engine a large number of samples (≃ 500 million). Half of those samples were malicious, and the other half were non-malicious. The initial algorithms produced to prevent the malicious samples from executing were moderately successful. At this point it was obvious that the engine needed more training, and more importantly, a larger set of data to train from.
We thus continued to feed the ML engine with more and more files (both malicious and non-malicious), training it over time to recognize the difference between a good file (that would be allowed to execute) and a bad file (that would be prevented from executing), by analyzing and understanding the intrinsic nature and intentions of each file at the binary level. Most importantly, the engine needed to be able to tell the difference before the file was allowed to run (pre-execution).
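The training process described above can be sketched in miniature. The toy below represents each "file" as a byte histogram and classifies with nearest centroid — an assumed, deliberately simplified stand-in for the real engine, which uses millions of features and far larger models. It only illustrates the idea of classifying from static file content, pre-execution:

```python
# Toy pre-execution classifier: byte histograms + nearest centroid.
# The "training data" here is fabricated for illustration only.

def byte_histogram(data: bytes, bins: int = 16) -> list:
    """Summarize raw bytes as a normalized histogram of byte values."""
    hist = [0] * bins
    for b in data:
        hist[b * bins // 256] += 1
    total = len(data) or 1
    return [h / total for h in hist]

def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Fictional training set: "benign" files full of ASCII text,
# "malicious" files full of uniform, packed-looking bytes.
benign = [byte_histogram(b"hello world, plain text" * 10)]
malicious = [byte_histogram(bytes(range(256)) * 2)]
centroids = {"benign": centroid(benign), "malicious": centroid(malicious)}

def classify(data: bytes) -> str:
    """Label a file from its static content, before it ever runs."""
    h = byte_histogram(data)
    return min(centroids, key=lambda label: distance(h, centroids[label]))

print(classify(b"another plain ascii document"))  # benign
```

The key property the sketch shares with the real approach is that the decision is made entirely from the file's bytes, before execution — no sandbox, no runtime behavior.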
Over time, the efficacy of our ML engine continued to increase. Our engine was and is still learning, growing and becoming more and more accurate over time.
Today, after many years of intensive training, the algorithms our engine produces are at around 99% efficacy. When we dug deeper into the algorithms, we noticed that the engine was now looking at well over six million features of a binary; considering that the average human malware reverse engineer looks at hundreds, this is a huge improvement in accuracy and scope.
Below is a diagram showing how we are developing our ML engine:
Which Generation of Machine Learning are We Talking About?
Cybersecurity ML generations are distinguished from one another according to five primary factors, which reflect the intersection of data science and cybersecurity platforms. Each generation builds on the last. Early generations can block "basic" attacks but will struggle or fail to block the most advanced ones. The dataset size and the number of features grow substantially with each generation.
- Runtime: Where does the ML training and prediction occur (e.g. in the cloud, or locally on the endpoint)?
- Features: How many features are generated? How are they pre-processed and evaluated?
- Datasets: How is trust handled in the process of data curation? How are labels generated, sourced, and validated?
- Human Interaction: How do people understand the model’s decisions and provide feedback? How are models overseen and monitored?
- Quality of Fit: How well does the model reflect the datasets? How often does it need to be updated?
The following table summarizes the characteristics of each generation according to the five factors just described:
Third-Generation Machine Learning: Deep Learning
The cloud model complements and protects the local model. Decisions are explained by the model in a way that reflects its decision process. Models are evaluated and designed to be hardened against attacks. Concept drift is mitigated by strong generalization. Deep learning reduces the amount of human time needed.
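Concept drift — the tendency of real-world data to shift away from what the model was trained on — can be monitored with very simple statistics. The sketch below is an assumed, generic approach (not a description of any vendor's product): compare a score distribution between a training-time reference window and a recent field window, and flag drift when they diverge.

```python
# Minimal concept-drift check: compare mean scores between a
# reference window (training time) and a recent window (in the field).
# Scores and threshold are fabricated for the example.

def mean(xs):
    return sum(xs) / len(xs)

def drift_detected(reference, recent, threshold=0.2):
    """Flag drift when the mean score shifts beyond the threshold."""
    return abs(mean(recent) - mean(reference)) > threshold

reference_scores = [0.10, 0.15, 0.12, 0.09, 0.11]  # at training time
recent_scores = [0.45, 0.50, 0.48, 0.52, 0.47]     # observed in the field

print(drift_detected(reference_scores, recent_scores))  # True
```

Production systems use more robust tests than a mean comparison, but the principle is the same: a model that generalizes well drifts slowly, and monitoring tells you when retraining is due.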
Fourth-Generation Machine Learning: Adaptive Learning
These models learn from local data without needing to upload observations. Features are designed by strategic interactions between humans and models. New features and models are constantly evaluated by ongoing experiments. Humans can provide feedback to correct and guide the model. Most models are robust to well-known ML attacks.
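The "learn locally, don't upload observations" idea can be sketched in the spirit of federated averaging — a hedged illustration of the general technique, not of any specific product. Each endpoint fits a parameter on its own data and shares only that parameter, never the raw samples:

```python
# Federated-averaging sketch: endpoints share model parameters,
# not data. Values are fabricated for the example.

def local_update(samples):
    """Each endpoint computes a model parameter (here, just a mean)
    plus its sample count; the raw samples never leave the endpoint."""
    return sum(samples) / len(samples), len(samples)

def federated_average(updates):
    """Server combines parameters weighted by sample count."""
    total = sum(n for _, n in updates)
    return sum(param * n for param, n in updates) / total

endpoint_a = [1.0, 2.0, 3.0]   # stays on endpoint A
endpoint_b = [10.0, 20.0]      # stays on endpoint B
updates = [local_update(endpoint_a), local_update(endpoint_b)]
print(federated_average(updates))  # (2.0*3 + 15.0*2) / 5 = 7.2
```

The privacy property is the point: the server reconstructs a global model from per-endpoint summaries, without ever seeing the underlying observations.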
Fifth-Generation Machine Learning:
Supervision becomes optional. Models learn in a distributed, semi-supervised environment. Human analysis is guided by model-provided insights. Models can be monitored and audited for tampering, and support deception capabilities for detecting ML attacks.
If you wish to learn more about all the generations of ML, you can read our Generation of ML whitepaper here.
Obviously, Ask to Test for Yourself
Ideally, test in a production environment to see how the solution performs in the field. Testing should be transparent, and you should be able to exercise the solution in any way you want, without restrictions imposed by the vendor.
Understand the Limitations of the Solution
ML cannot entirely replace humans. It can assist humans, change a paradigm, automate multiple tasks etc., but in the end, no solution can nor will protect you 100%. Human-machine teams are key to solving the most complex cybersecurity challenges.
The Malicious AI Report states that as AI capabilities become more powerful and widespread, attackers will introduce new threats, exploiting vulnerabilities in AI systems used by defenders (“adversarial ML”) and increasing the effectiveness of their existing attacks through, for example, automation.
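A toy illustration of what adversarial evasion looks like, using an assumed linear classifier (the weights and threshold are invented, not taken from any real detector): the attacker nudges a feature the model weights negatively until the score crosses the decision threshold.

```python
# Toy adversarial evasion against a linear classifier.
# Weights, threshold, and feature values are hypothetical.

WEIGHTS = [2.0, -1.5]   # assumed learned weights
THRESHOLD = 0.0         # score > 0 => flagged malicious

def score(features):
    """Linear decision score: weighted sum of feature values."""
    return sum(w * f for w, f in zip(WEIGHTS, features))

sample = [0.8, 0.2]      # original sample: score = 1.3, detected
evasive = [0.8, 1.2]     # attacker inflates a benign-looking feature

print(score(sample) > THRESHOLD)   # True  (detected)
print(score(evasive) > THRESHOLD)  # False (evades detection)
```

Real attacks target far more complex models, but the mechanism is the same: small, deliberate input changes that exploit what the model has learned to weight.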
Furthermore, attackers are expected to use the ability of AI to learn from experience to craft attacks that current technical systems and IT professionals may be ill-prepared to deal with. This further emphasizes the need to educate and train employees to avoid potential cybersecurity disasters.
In conclusion, it is important to understand that AI is not a 'magic tool' that solves all problems. ML solutions can only help address some of today's cybersecurity problems and give your business an advantage when facing a cyberattack.
Not all ML solutions are created or developed equal, and it is of prime importance to understand the strengths and limitations of each one. Knowing what a solution can and cannot do will help you better build and manage your SOC and lower your overall risk profile.