Introduction to AI for Security

NEWS / 10.31.17 / The BlackBerry Cylance Data Science and Machine Learning Team

Artificial Intelligence (AI) technologies are rapidly moving beyond the realms of academia and speculative fiction to enter the commercial mainstream, with innovative products utilizing AI transforming how we access and leverage information.

Cybersecurity AI is also becoming strategically important to national defense and in securing our critical financial, energy, intelligence, and communications infrastructures against state-sponsored cyber-attacks.

According to an October 2016 report issued by the federal government’s National Science and Technology Council Committee on Technology (NSTCC), “AI has important applications in cybersecurity, and is expected to play an increasing role for both defensive and offensive cyber measures.” Based on this projection, the NSTCC has issued a National Artificial Intelligence Research and Development Strategic Plan to guide federally-funded research and development.

The era of AI has most definitely arrived, but many still don’t understand the basics of this important advancement, or how it could be applied to the cybersecurity industry.

AI: Perception vs. Reality

The field of AI encompasses three distinct areas of research: Artificial Superintelligence (ASI), the kind popularized in speculative fiction and movies; Artificial General Intelligence (AGI), in which machines are as intelligent as a human and equally capable of learning and reasoning; and Artificial Narrow Intelligence (ANI), which exploits a computer’s superior ability to process vast quantities of data and detect patterns and relationships. ANI approaches are the ones we’ll be focusing on exclusively in this article.

In recent years, most of the fruitful research and advancements have come from the sub-discipline of AI called Machine Learning (ML), which focuses on teaching machines to learn by applying algorithms to data.

Machine Learning and the Security Domain

Context is critical in the security domain. Fortunately, the security domain generates huge quantities of data from logs, network sensors, and endpoint agents, as well as from distributed directory and human resource systems that indicate which user activities are permissible and which are not.

Collectively, this mass of data can provide the contextual clues we need to identify and ameliorate threats, but only if we have tools capable of teasing them out. This is precisely the kind of processing in which ML excels.

By acquiring a broad understanding of the activity surrounding the assets under their control, ML systems make it possible for analysts to discern how events widely dispersed in time and across disparate hosts, users, and networks are related. Properly applied, ML can provide the context we need to reduce the risks of a breach while significantly increasing the “cost of attack.”

Clustering

The purpose of cluster analysis is to segregate data into a set of discrete groups or clusters based on similarities among their key features or attributes. Within a given cluster, data items will be more similar to one another than they are to data items within a different cluster.

In the network security domain, cluster analysis typically proceeds through a well-defined series of data preparation and analysis operations.

First, we typically apply statistical sampling techniques that allow us to create a more manageable subset of the data for our analysis. The sample should reflect the characteristics of the total dataset as closely as possible, or the accuracy of our results may be compromised.
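One way to keep a sample representative is stratified sampling, which draws the same fraction from each group in the data. The sketch below is only an illustration: the log records, the "protocol" field, and the 10% fraction are all assumptions, not details of any particular dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for members in groups.values():
        # Keep at least one record per group so rare categories are not lost.
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample

# Hypothetical connection log: 900 TCP records and 100 UDP records.
log = [{"id": i, "protocol": "tcp" if i < 900 else "udp"} for i in range(1000)]
subset = stratified_sample(log, key="protocol", fraction=0.1)
proportions = Counter(record["protocol"] for record in subset)
print(len(subset), proportions["tcp"], proportions["udp"])  # 100 90 10
```

The 9-to-1 ratio of TCP to UDP records in the full log is preserved in the sample, which is the property the paragraph above asks for.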

Next, we decide which data elements within our samples should be extracted and subjected to analysis. In machine learning, we refer to these data elements as “features,” i.e., attributes or properties of the data that can be analyzed to produce useful insights.

In the security domain, the relevant features might include the percentage of ports that are open, closed, or filtered, the application running on each of these ports, and the application version numbers. If we’re investigating the possibility of data exfiltration, we might want to include features for bandwidth utilization and login times.
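As a rough illustration of feature extraction, the following sketch turns one host's port scan into the fraction of ports in each state. The record layout and field names ("port", "state", "service", "version") are hypothetical and chosen only for this example.

```python
from collections import Counter

STATES = ("open", "closed", "filtered")

def port_state_features(scan):
    """Return the fraction of scanned ports that are open, closed, or filtered."""
    total = len(scan)
    if total == 0:
        return {f"frac_{state}": 0.0 for state in STATES}
    counts = Counter(entry["state"] for entry in scan)
    return {f"frac_{state}": counts[state] / total for state in STATES}

# Hypothetical scan result for a single host.
scan = [
    {"port": 22, "state": "open", "service": "ssh", "version": "8.9"},
    {"port": 80, "state": "open", "service": "http", "version": "2.4"},
    {"port": 23, "state": "closed", "service": None, "version": None},
    {"port": 445, "state": "filtered", "service": None, "version": None},
]
features = port_state_features(scan)
print(features)  # {'frac_open': 0.5, 'frac_closed': 0.25, 'frac_filtered': 0.25}
```

Each key in the resulting dictionary becomes one dimension of the data that a clustering or classification algorithm would consume.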

Cluster Analysis

Cluster analysis introduces the concept of a “feature space” that can contain thousands of dimensions, one for each feature in our sample set. At the conclusion of every clustering procedure, we’re presented with a solution consisting of a set of clusters.

After completing this cluster analysis, we would expect to see the vast majority of the resulting data grouped into a set of well-defined clusters that reflect normal operational patterns, and a smaller number of very sparse clusters or “noise points” that indicate anomalous user and network activity.
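The article does not name a specific clustering algorithm here. As one possible illustration, the sketch below implements a minimal version of DBSCAN, a density-based method that labels isolated samples as noise points. The two-dimensional feature values, the eps radius, and the min_pts threshold are all assumptions made for the example.

```python
import math

NOISE = -1

def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be reclaimed later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels

# Hypothetical, already-scaled features for nine hosts.
points = [
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),  # dense group A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # dense group B
    (9.0, 1.0),                                      # isolated host
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two dense groups play the role of normal operational patterns, while the single point labeled -1 is the kind of anomaly an analyst would investigate further.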

For security applications, we could then probe these anomalies further by grepping through our log data to match this suspect activity to possible bad actors.

Categorization

Categorization enables us to make generalizations about objects and actions we already know about in order to predict the properties of objects and actions that are entirely new to us.

In machine learning, classification refers to a set of computational methods for predicting the likelihood that a given sample belongs to a predefined class, like whether a piece of email belongs to the class “spam” or a network connection is benign or associated with a botnet. These are examples of a binary classification problem, that is, one with only two output classes: “spam” and “not spam,” or “botnet” and “benign.”

The algorithms used to perform classification are referred to as “classifiers.” There are numerous classifiers available to solve classification problems, each with its own strengths and weaknesses.

Supervised Vs. Unsupervised Learning

Classification is an example of supervised learning, in which an analyst builds a model with samples that have already been identified—or labeled—with respect to the property under investigation.

In contrast, clustering is an example of unsupervised learning, in which the properties that distinguish one group of samples from another must be discovered. It’s not uncommon to use unsupervised and supervised methods in combination.

To produce an accurate model, analysts need to secure a sufficient quantity of data that has been correctly sampled and categorized. This data is then typically divided into two or three distinct sets for training, validation, and testing. As a rule of thumb, the larger the training set, the more likely the classifier is to produce an accurate model.
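A minimal sketch of the split might look like the following. The 60/20/20 proportions and the fixed random seed are assumptions for illustration; the right ratios depend on how much labeled data is available.

```python
import random

def split_dataset(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle the samples, then cut them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )

# Hypothetical labeled samples, represented here only by their ids.
samples = list(range(100))
train_set, validation_set, test_set = split_dataset(samples)
print(len(train_set), len(validation_set), len(test_set))  # 60 20 20
```

Shuffling before cutting matters because log data is often ordered by time or by host, and an unshuffled split could put all samples of one kind into a single set.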

A classification session typically proceeds through four phases:

1. A training or “learning” phase in which the analyst constructs a model by applying a classifier to a set of training data

2. A validation phase in which the analyst applies the validation data to the model in order to assess its accuracy

3. A testing phase to assess the model’s accuracy with test data that was withheld from the training and validation processes

4. A deployment phase, in which the model is applied to predict the class membership of new, unlabeled data

In practice, an analyst may train and test multiple models using different algorithms and hyperparameter settings, then compare them and choose the one that offers the best accuracy.
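The four phases and the comparison of candidate models can be sketched in a few lines. This example uses a k-nearest-neighbors classifier purely because it is short to write; it is not a method the article prescribes. The single feature (outbound connections per minute), the data values, and the candidate values of the hyperparameter k are all hypothetical.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training samples closest to x."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of (feature, label) pairs in data that the model gets right."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Hypothetical data: (outbound connections per minute, label).
train = [(2, "benign"), (3, "benign"), (4, "benign"), (5, "benign"),
         (6, "botnet"),  # a noisy sample close to the benign group
         (30, "botnet"), (35, "botnet"), (40, "botnet")]
validation = [(1, "benign"), (7, "benign"), (33, "botnet"), (45, "botnet")]
test = [(3, "benign"), (38, "botnet")]

# Phases 1 and 2: build one candidate model per k and score it on validation data.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

# Phase 3: estimate accuracy on data withheld from training and validation.
test_accuracy = accuracy(train, test, best_k)

# Phase 4: classify a new, unlabeled observation.
print(best_k, test_accuracy, knn_predict(train, 50, best_k))  # 3 1.0 botnet
```

With k set to 1, the noisy training sample causes one validation error, so a larger k is selected. The test set is touched only once, after the choice has been made.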

Classification via Decision Trees

Decision tree algorithms determine whether a data point belongs to one class or another by defining a sequence of “if-then-else” decision rules that terminate in a class prediction. Decision trees are aptly named since they utilize roots, branches and leaves to produce class predictions.
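To make the “if-then-else” structure concrete, here is a small hand-written tree. In practice the rules would be learned from training data; the feature names and thresholds below are invented for illustration only.

```python
def classify_connection(conn):
    """Walk a small hand-written tree from the root to a leaf."""
    if conn["dest_port"] == 6667:              # root: first decision rule
        if conn["connections_per_min"] > 20:   # branch: second decision rule
            return "botnet"                    # leaf: class prediction
        return "benign"
    if conn["bytes_out"] > 1_000_000:          # other branch from the root
        return "botnet"
    return "benign"

# Hypothetical connection records.
chatty = {"dest_port": 6667, "connections_per_min": 45, "bytes_out": 2_000}
quiet = {"dest_port": 443, "connections_per_min": 3, "bytes_out": 15_000}
print(classify_connection(chatty), classify_connection(quiet))  # botnet benign
```

Every path through the function ends in a return statement, which corresponds to a leaf of the tree.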

If a tree is allowed to grow until it fits the training data too closely, the resulting model will appear to provide a high degree of accuracy during training. When applied to test data, however, its accuracy scores will be much lower. Analysts refer to this as overfitting, or a failure to generalize.

The DT algorithm intrinsically generates a probability score for every class prediction in every leaf based on the proportion of positive and negative samples it contains. This is computed by dividing the number of samples of either class by the total number of samples in that leaf.
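The leaf calculation described above is simple enough to show directly. The labels in the example leaf are hypothetical.

```python
from collections import Counter

def leaf_probabilities(labels):
    """Proportion of each class among the training samples that reached a leaf."""
    total = len(labels)
    return {cls: count / total for cls, count in Counter(labels).items()}

# Hypothetical leaf holding 8 botnet samples and 2 benign samples.
leaf = ["botnet"] * 8 + ["benign"] * 2
print(leaf_probabilities(leaf))  # {'botnet': 0.8, 'benign': 0.2}
```

A new sample that lands in this leaf would be predicted as “botnet” with a probability score of 0.8.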

Once the DT model has been built, it’s subjected to the same validation and testing procedures we described earlier for classification in general. Once the model has been sufficiently validated, it can be deployed to classify new, unlabeled data.

Deep Learning and Neural Networks

Deep learning is based on a fundamentally different approach that incorporates layers of processing with each layer performing a different kind of calculation. Samples are processed layer-by-layer in stepwise fashion with the output of each layer providing the input for the next. At least one of these processing layers will be “hidden.” It is this multi-layered approach, employing hidden layers, that distinguishes deep learning from all other machine learning methods.
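The layer-by-layer flow can be sketched as a forward pass through a tiny network with one hidden layer. The weights, biases, input values, and the choice of a sigmoid activation are arbitrary assumptions for this example; a real network would learn its weights during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs for each unit, passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activation = inputs
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation

# Hypothetical network: 3 input features -> 2 hidden units -> 1 output score.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                             # output layer
]
score = forward([0.9, 0.1, 0.4], layers)
print(score)  # a single value between 0 and 1
```

The hidden layer's two activations are never seen directly by the caller; only the output layer's score is returned.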

The term deep learning encompasses a wide range of unsupervised, semi-supervised, supervised and reinforcement learning methods primarily based on the use of neural networks, a class of algorithms so named because they simulate the ways densely interconnected networks of neurons interact in the brain.

Neural networks are extremely flexible, general-purpose algorithms that can solve a myriad of problems in a myriad of ways. For example, unlike many other algorithms, a neural network model can be defined by millions or even billions of parameters.

After each training cycle, a loss function compares the classification decision assigned at the output layer to the class labels in the training set to determine how the weights in all of the hidden layers should be modified to produce a more accurate result.
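The idea of using a loss function to adjust weights can be shown on a deliberately simplified case: a single sigmoid unit with no hidden layers, trained by gradient descent on a cross-entropy loss. Real networks propagate the same kind of error signal backward through every hidden layer. The data, learning rate, and number of training cycles below are assumptions chosen for illustration.

```python
import math

def predict(weights, bias, x):
    """Output of one sigmoid unit for the feature vector x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(weights, bias, data):
    """Average loss between predicted probabilities and the 0/1 class labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def train_step(weights, bias, data, learning_rate):
    """One training cycle: nudge each weight against the gradient of the loss."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in data:
        error = predict(weights, bias, x) - y  # gradient of the loss w.r.t. z
        for i, xi in enumerate(x):
            grad_w[i] += error * xi
        grad_b += error
    n = len(data)
    new_weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b / n

# Hypothetical labeled samples: two scaled features, label 1 = malicious.
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]
weights, bias = [0.0, 0.0], 0.0
initial_loss = cross_entropy(weights, bias, data)
for _ in range(200):
    weights, bias = train_step(weights, bias, data, learning_rate=0.5)
final_loss = cross_entropy(weights, bias, data)
print(initial_loss > final_loss)  # True
```

Each pass through the loop is one training cycle, and the falling loss shows the weights moving toward values that match the labels more closely.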

This process repeats as many times as required before a set of candidate models can proceed to the validation and testing phases.

Conclusion

Like every important new technology, AI has occasioned both excitement and apprehension among industry experts and the popular media. We read about computers that beat Chess and Go masters, about the imminent superiority of self-driving cars, and about concerns by some ethicists that machines could one day take over and make humans obsolete.

We believe that some of these fears are overstated and that AI will play a positive role in our lives as long as AI research and development is guided by sound ethical principles that ensure the systems we build now and in the future are fully transparent and accountable to humans.

In the near term, however, we think it’s important for security professionals to gain a practical understanding of what AI is, what it can do, and why it’s becoming increasingly important to our careers and the ways we approach real-world security problems.

It’s this conviction that motivated us to write Introduction to Artificial Intelligence for Security Professionals. In this book, we cover machine learning techniques in practical situations to improve your ability to thrive in a data-driven world. A free copy is available for download.

About The BlackBerry Cylance Data Science and Machine Learning Team

The BlackBerry Cylance Data Science and Machine Learning research team consists of experts in a variety of fields. With machine learning at the heart of all of BlackBerry’s cybersecurity products, the Data Science Research team is a critical, highly visible, and high-impact team within the company. The team brings together experts from machine learning, statistics, computer science, computer security, and various applied sciences, with backgrounds including deep learning, Bayesian statistics, time-series modeling, generative modeling, topology, scalable data processing, and software engineering.

What we do:

Invent novel machine-learning techniques to tackle important problems in computer security
Write code that scales to very large datasets, often with millions of dimensions and billions of attributes
Discover ways to strengthen machine learning models against adversarial attacks
Publish papers and present research at conferences
