There has been a great deal of talk lately in the media about machine learning (ML). We’ve all seen the news clips of chess-playing computers, self-driving cars, and emerging technologies like facial recognition, but what exactly is ML, and how does it work? As machines take on a greater share of control over our lives, it is important to understand what machine learning actually is and, more importantly, what it isn’t.
So, What is Machine Learning?
Machine learning is a branch of artificial intelligence (AI), and AI is a branch of computer science. A simple way to describe how ML works is as follows:
In traditional programming, you give the computer an input - let’s say 1+1. The computer would run an algorithm created by a human to calculate the answer and return the output. In this case, the output would be 2 (I guess it could also say "Hello World!" but for the sake of argument, let’s stick with numbers).
Here’s the crucial difference. In machine learning, you would instead provide the computer with the input AND the output (1+1=2). You’d then let the computer create an algorithm by itself that would generate the output from the input.
In essence, you’re giving the computer all the information it needs to learn for itself how to extrapolate an output from the input. In classrooms, it’s often stated that the goal of education is not so much to give a growing child all the answers, but to teach them to think for themselves.
This is precisely how machine learning works.
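To make the contrast concrete, here is a minimal sketch in Python. The function names, the example data, and the choice of a two-weight linear model are all assumptions made for illustration; real ML systems are far larger, but the idea is the same: the program is shown inputs together with outputs and adjusts itself until it reproduces them.

```python
# Traditional programming: a human writes the rule.
def add_by_hand(a, b):
    return a + b


# Machine learning: the program sees inputs AND outputs and has to
# find the rule itself. Here the "rule" is just a pair of weights
# (an illustrative assumption, not how any particular product works).
def learn_to_add(examples, epochs=500, learning_rate=0.01):
    w1, w2 = 0.0, 0.0  # the model starts out knowing nothing
    for _ in range(epochs):
        for (a, b), target in examples:
            prediction = w1 * a + w2 * b
            error = prediction - target
            # Nudge each weight in the direction that shrinks the error.
            w1 -= learning_rate * error * a
            w2 -= learning_rate * error * b
    return w1, w2


# Inputs paired with the correct outputs, e.g. 1 + 1 = 2.
examples = [((1, 1), 2), ((2, 3), 5), ((4, 1), 5), ((0, 2), 2), ((3, 3), 6)]
w1, w2 = learn_to_add(examples)

# Try the learned rule on inputs it never saw during training.
learned_sum = w1 * 7 + w2 * 5
print(round(learned_sum, 3), add_by_hand(7, 5))
```

Nobody told the second function that the answer is "a plus b". It arrived at weights very close to 1 and 1 purely from the examples.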
Seems Simple Enough, Right?
More advanced branches of machine learning work in far more sophisticated ways. But essentially, the only fundamental difference between traditional computing and machine learning is that with ML, the computer generates the algorithm rather than a human.
Let’s take a peek at the origins of machine learning to provide some extra clarity.
ML’s Humble Beginnings
Machine learning traces back to the 1950s and an IBM employee named Arthur Samuel, who coined the term in 1959. Arthur wanted to teach a computer to play Checkers, so he programmed the computer to play against him. The problem Arthur ran into time and again was that he would always win.
The computer only knew how to make legal moves; it would always play by the book rather than strategically, as a human player would. It couldn’t come up with a plan to 'think ahead' and win. So Arthur created a small program that allowed the computer to play against itself. After allowing it some time to practice, Arthur came back, played the computer, and was both amazed and delighted when he lost. He played again… and lost again. The program he had created collected data and built on that knowledge over time to create a predictive engine. The computer learned by itself how to look at the whole board and use that information to choose moves that improved its chances of winning.
This was the birth of machine learning.
If Machine Learning Was Born in the 1950s, Why Am I Only Hearing About It Now?
You may just have started hearing the term in popular usage, but you already use ML technology on a daily basis. When you turn on the Pandora radio service and start clicking ‘thumbs up’ or ‘thumbs down’ on songs, the computer is generating an ML-based algorithm just for you so it can predict what song you want to hear next. It may not be a perfect guess at first, but the more you ‘train’ it by either accepting or rejecting its choice, the better it gets in learning your taste in music. And when you’re looking at the stock market and you’re presented with a prediction of where a certain stock will go, that’s also machine learning in action.
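The thumbs-up/thumbs-down loop can be sketched in a few lines of Python. This is only an illustration of learning from feedback, not a description of how Pandora actually works; the feature names (`jazz`, `slow`, and so on), the step size, and both function names are hypothetical.

```python
def score(taste, song):
    """Higher score means the listener is more likely to enjoy the song."""
    return sum(taste.get(feature, 0.0) * value for feature, value in song.items())


def give_feedback(taste, song, liked, step=0.5):
    """Shift the taste profile toward liked songs and away from disliked ones."""
    direction = 1.0 if liked else -1.0
    for feature, value in song.items():
        taste[feature] = taste.get(feature, 0.0) + direction * step * value


taste = {}  # nothing is known about the listener yet

# Each song is described by made-up features for the sake of the example.
give_feedback(taste, {"jazz": 1.0, "slow": 1.0}, liked=True)    # thumbs up
give_feedback(taste, {"metal": 1.0, "fast": 1.0}, liked=False)  # thumbs down

# Two songs the listener has never rated.
new_jazz = {"jazz": 1.0, "slow": 0.8}
new_metal = {"metal": 1.0, "fast": 1.0}
next_song = max([new_jazz, new_metal], key=lambda song: score(taste, song))
```

After just two ratings, the sketch already prefers the unrated jazz track, and every further rating refines the profile.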
Machine learning is everywhere, but is it magically making the world a better place? “It’s not magic,” says Greg Corrado, a Senior Research Scientist at Google. “It’s just a tool. But it’s a really important tool.”
So, How Does ML Create a Better Antivirus?
Here at Cylance, we’ve created an antivirus (AV) product that utilizes ML to prevent malware from executing on your system - even those types of malware that are unknown and never-before-seen, such as in the case of zero-days. This is where ML really excels. The algorithm allows you to defend your endpoints before malware has a chance to execute, rather than having to constantly react to attacks after they’ve already caused damage. The phrase 'post-execution cleanup' sounds impressive, but in the real world, it is akin to closing the stable door after the thief has already made off with your prized racehorse.
To create our ML engine, we fed it a large number of file samples (about 500 million, in the beginning). Half of those samples were malicious and the other half were non-malicious. The initial algorithms it produced to catch the bad samples were moderately successful, but the engine needed more training to become a go-to-market product. From there, we continued to feed the ML engine with more and more files (both malicious and non-malicious), training it over time to recognize the difference between a good file and a bad file by analyzing and understanding the intrinsic nature and intentions of each file at the DNA level.
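For readers who want to see the shape of this kind of training, here is a toy sketch in Python. It is not our engine: the three features, the eight hand-made samples, and the use of plain logistic regression are all assumptions chosen to keep the example small. What it does share with the process described above is a balanced set of labeled files and a model that is then asked about files it has never seen.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(model, features):
    """Return the estimated probability that a file is malicious."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(samples, labels, epochs=2000, learning_rate=0.1):
    """Fit a logistic regression with stochastic gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = predict((weights, bias), features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical features per file:
# (fraction of packed code, share of suspicious API calls, is digitally signed)
malicious = [[0.9, 0.8, 0], [0.8, 0.9, 0], [0.7, 0.7, 0], [0.95, 0.6, 0]]
benign = [[0.2, 0.1, 1], [0.3, 0.2, 1], [0.1, 0.3, 1], [0.25, 0.15, 0]]

# Half malicious (label 1), half non-malicious (label 0).
model = train(malicious + benign, [1] * 4 + [0] * 4)

# Files the model was never trained on.
unseen_bad = [0.85, 0.85, 0]
unseen_good = [0.15, 0.2, 1]
print(predict(model, unseen_bad) > 0.5, predict(model, unseen_good) > 0.5)
```

The model never stores a list of known bad files. It keeps only the weights it learned, which is why it can give an answer for a file it has never encountered.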
As time went by, the efficacy of our engine continued to increase. Our little engine was learning and growing. Today, after many years of intensive training, the algorithms our engine produces are just north of 99% efficacy. When we started to dig deeper into the algorithms, we noticed that the engine was looking at well over 6 million features of a binary - considering that the average human malware reverse engineer is looking in the hundreds, this was a HUGE improvement in accuracy and scope.
What this means is that our ML technology builds on what the human brain is capable of and outdoes it by millions of additional data points, in terms of recognizing malware. It cuts out the human error component and conducts its analysis in a fraction of a second. This is huge.
What’s the Difference Between Cylance’s ML and the Antivirus Product I’m Using Today?
This goes without saying, but ML is not a signature, nor does it use signatures to operate. A signature is a set of instructions written by a human. A machine can also write signatures, but only after being given a set of rules that were also created by a human. A signature tells the AV product that if a new file exactly matches a set pattern (typically the hash of the file), then it is bad. That is the sum total of its abilities. Legacy AV products use signatures as their front line of defense against malicious binaries. But a signature can’t strategize, generalize, or make decisions that lie outside of its set ‘rule book,’ and that’s why the cybercriminals have a shot at winning.
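A hash-based signature check is simple enough to sketch in Python. The byte strings and the names `signature_db` and `signature_scan` are made up for illustration, and real AV products use richer signature formats than a bare hash, but the weakness is the same: change the file even slightly and the match fails.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Stand-in for a file that has already been reported as malware.
known_malware = b"pretend these bytes are a known malicious file"

# The signature database: hashes of files already seen and flagged.
signature_db = {sha256_of(known_malware)}


def signature_scan(data: bytes) -> bool:
    """Flag a file only if its hash exactly matches a known signature."""
    return sha256_of(data) in signature_db


# The attacker appends a single byte; the behavior is unchanged.
variant = known_malware + b"\x00"

print(signature_scan(known_malware))  # True: exact match
print(signature_scan(variant))        # False: the hash no longer matches
```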
Let's imagine you wake up in the night to find a stranger trying to break into your house. Now, you as a human being only need to run through a very short list of basic data points to recognize that there is a high probability that this stranger is up to no good – that he or she likely has bad intentions:
- You don’t need to flick through a picture book showing all the faces of everyone on the planet to ‘know’ that this person is a stranger
- You don’t need to ask to see their photo ID to check that you don't know them
- You don't need to ask them why they are trying to get into your house
- You don’t need to go online and ask all your friends if they have ever seen this person trying to gain access to their houses before
All you need to know are the following data points (input):
- I don’t know this person (A)
- This person (A) is in an unexpected place (B) at an unexpected time (C)
- He or she is wearing a mask (X), has a gun in their belt (Y), and is trying to break my window (Z)
- So the new input of A + B + C (including the known risk factors X, Y and Z) = output (robber/unsafe)
An ML engine would make the above calculation in a fraction of a second. The more data points known to the engine and the longer it has been in existence to learn, the more accurately it can ‘predict’ the future (you are about to be robbed) and take immediate action to stop the robber before they get in, all in an instant, without conducting an extensive analysis of the situation or calling for backup.
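The A + B + C calculation from the burglar example can be written as a tiny scoring function in Python. The weights and the threshold below are hand-picked assumptions; in a trained model they would be learned from many labeled examples rather than typed in by a person.

```python
# Illustrative weights for each data point (assumed, not learned).
RISK_WEIGHTS = {
    "unknown_person": 1.0,    # A
    "unexpected_place": 1.0,  # B
    "unexpected_time": 1.0,   # C
    "wearing_mask": 2.0,      # X
    "carrying_gun": 3.0,      # Y
    "breaking_window": 3.0,   # Z
}
THRESHOLD = 4.0  # assumed cut-off between "safe" and "unsafe"


def risk_score(observations):
    """Add up the weights of every data point that was observed."""
    return sum(RISK_WEIGHTS.get(name, 0.0) for name in observations)


def classify(observations):
    return "unsafe" if risk_score(observations) >= THRESHOLD else "safe"


burglar = {"unknown_person", "unexpected_place", "unexpected_time",
           "wearing_mask", "carrying_gun", "breaking_window"}
courier = {"unknown_person"}  # a stranger, but at the front door at noon

print(classify(burglar), classify(courier))
```

No single data point decides the outcome. A stranger alone is not alarming, but the combination of factors pushes the score well past the threshold.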
Therein lies the Achilles’ heel of legacy AV: a product that relies on signatures doesn’t recognize malware that has not yet been developed or that isn't already known and reported as malware. With signatures, a sacrificial lamb, a ‘Patient Zero’, must first get infected by the new malware in order for that malware to be discovered. Once the malware is discovered, more time elapses as a signature is generated (akin to a mugshot being uploaded to a police database following the first break-in), and this information is sent to the AV product’s knowledge base in order to protect other customers of that AV product.
But why wait until that one customer has been infected, or that first house has been robbed? With millions of new variants of malware released each and every year, why not get ahead in the game and stop malware before it can execute?
What About Daily Antivirus Updates? Does ML Reduce or Eliminate Those?
Because our algorithm is trained to tell malicious files from non-malicious ones without using signatures, we only update it once every six to nine months. This eliminates the daily or weekly downtime employees previously faced while their computers were bogged down running updates and scanning to check whether malware that was discovered yesterday is on the machine today.
Using predictive ML technology, we prevent attacks before they happen by stopping malware dead, pre-execution, rather than reacting to them after the fact, which lessens the burden on your IT/security teams and your employees. It’s a win-win situation.
Prove It To Me
Would you buy a car without driving it first? Of course you wouldn’t. We don’t expect you to simply take our word for it, so take our endpoint protection product CylancePROTECT® for a test drive and compare it to your existing AV solution. Throw your best (and worst) malware at it and compare it to whatever security solution you’re currently running. TEST FOR YOURSELF. The results you get and see with your own - human - eyes are the only ones that matter.