In 2005, Chris "Weldpond" Wysopal and I were colleagues working with cutting-edge tools to spot vulnerabilities in software. Back in the 1970s and 1980s, both of our fathers had worked in quality control for jet engine manufacturers. Just as our dads used dye-penetrant testing to find cracks in turbine blades, decades later we were using static analysis and fuzzing to find vulnerabilities in various software packages.
So when I saw that Sarah Zatko was giving a presentation at BlackHat in Las Vegas last month, titled “Quantifying Risk in Consumer Software at Scale”, it caught my eye. I recalled this video from 2016, where Sarah and her husband Peiter “Mudge” Zatko talked about co-founding the non-profit Cyber Independent Testing Lab (CITL). I also remembered that Mudge and Weldpond had testified to Congress about security risks on the Internet nearly twenty years ago.
Given their brilliant insights and prior contributions to the security community, I felt compelled to learn more.
The Problem in 2017
Software development tools and processes have vastly improved over the last decade, and software has been eating the world. Yet consumers still lack objective, meaningful data for comparison-shopping based on the relative security and build quality of the software in the products they purchase.
Consumers should be able to get quantifiable answers to a variety of important questions. Some are very basic: what’s the safest operating system? Or, what’s the safest browser within a given operating system?
It’s difficult to get unbiased data to answer these questions. Vendors spin their own narratives, pay off reviewers or create fake reviews. And user communities are filled with tribalism which blinds them to their side’s weaknesses.
Getting a good handle on more sophisticated questions is even harder. For example, which auxiliary parts of an OS or a browser might be best removed to reduce the risk of being attacked? Sarah and Mudge set out to provide better answers to these types of questions, using a data-driven approach.
Reporting on Consumer Software Quality
CITL has embarked on a mission to essentially become a Consumer Reports for software safety. Towards this mission, they focus on end results, i.e. evidence about the safety of the end products, more than any dev team’s process, training, or certifications.
For this reason, and for a better legal ground to stand on, their tools and techniques examine the object-code of executables vs. the source code. The source may represent the best intentions of the programmers, yet the object code is the ground truth and it’s what actually runs.
CITL also takes the “Independent” part of their name seriously. Like Consumer Reports, they do not accept any money from vendors. To date, they’ve examined well over 100,000 binaries comprising popular consumer applications, IoT devices, and even smart TVs.
Static Analysis Metrics
CITL’s static analysis of software binaries measures three different categories:
1. Code complexity,
2. Application armoring, and
3. What they have termed “developer hygiene.”
These are quantitative measurements of products, showing the degree to which development teams leverage modern tools and techniques for developing secure software.
These three static analysis metrics are somewhat akin to Nutrition Facts on food labeling, or like real-world observations about a car’s seat belts, air bags, and anti-lock brakes.
Complexity measures include code size, branch density, stack adjustments and cyclomatic complexity. The NASA/JPL Laboratory for Reliable Software and others have shown that code complexity correlates with the presence of bugs, exploitable business logic flaws, and security vulnerabilities.
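To make one of these measures concrete, here is a minimal sketch of cyclomatic complexity using McCabe's standard formula M = E - N + 2P over a control-flow graph. The graph below is a made-up example, and this is not a description of how CITL extracts control flow from object code.

```python
def cyclomatic_complexity(edges, num_components=1):
    """McCabe's metric M = E - N + 2P for a control-flow graph.

    edges: iterable of (src, dst) pairs naming basic blocks.
    num_components: connected components (1 for a single function).
    """
    edges = set(edges)
    nodes = {block for edge in edges for block in edge}
    return len(edges) - len(nodes) + 2 * num_components


# Hypothetical control-flow graph: one if/else followed by one loop.
example_cfg = [
    ("entry", "if"),
    ("if", "then"),
    ("if", "else"),
    ("then", "loop"),
    ("else", "loop"),
    ("loop", "body"),
    ("body", "loop"),
    ("loop", "exit"),
]

complexity = cyclomatic_complexity(example_cfg)
```

Each extra decision point (a branch or a loop test) adds one to the result, which is why this metric tends to track branch density.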
Armoring comprises three sub-categories:
1. Compiler armoring includes stack guards to help detect buffer overflows at runtime, function fortification to replace certain function calls with safer alternatives, and control-flow/code-pointer integrity to thwart ROP attacks.
2. Linker armoring includes techniques such as address-space layout randomization (ASLR) to make it harder for attackers to replicate their attacks across multiple processes or machines.
3. Loader armoring includes code signing and verification steps, as well as protections such as marking certain memory segments as non-executable.
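Some compiler armoring leaves visible traces in a binary's imported symbols. The sketch below is only a rough heuristic under stated assumptions: it expects symbol names as they appear on Linux with GCC or Clang and glibc (for example, as listed by a tool such as nm), where stack guards reference __stack_chk_fail and fortified calls are rewritten to names like __memcpy_chk. The function name armoring_hints is hypothetical, and the source does not say how CITL performs its own checks. Linker and loader armoring such as ASLR are not visible this way.

```python
def armoring_hints(imported_symbols):
    """Look for traces of compiler armoring in a list of symbol names.

    Assumes Linux/glibc naming; other platforms decorate names differently.
    """
    symbols = set(imported_symbols)
    return {
        # Stack-protector instrumentation calls this routine on corruption.
        "stack_guard": "__stack_chk_fail" in symbols,
        # Fortified builds call checked variants such as __strcpy_chk.
        "fortified_calls": sorted(
            s for s in symbols if s.startswith("__") and s.endswith("_chk")
        ),
    }


hints = armoring_hints(["__stack_chk_fail", "__memcpy_chk", "puts"])
```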
CITL’s developer hygiene metrics are related to roughly 500 POSIX/ANSI C functions. The very worst functions are playfully labeled “Ick”. For example, the most notoriously icky function is gets(). Decades ago, that C function became the poster child for a bad security practice. It’s an open invitation to hackers because there’s simply no way to use it securely. When attackers have any control over its input, they can cause a buffer overflow. This, in turn, might allow them to execute arbitrary code or disrupt a service by causing a crash.
CITL labels other functions “bad” or “risky”, such as copying functions like strcpy() or memcpy() that do not check the size of the destination buffer and are often at the heart of buffer overflow vulnerabilities.
Other more modern functions were designed explicitly with security in mind, like strlcat() or strlcpy(). These are labeled “good” by virtue of being judged harder to use incorrectly and reducing the odds of attackers overflowing buffers.
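A hygiene tally along these lines could be sketched as follows. The table contains only the functions named above; the real list of roughly 500 functions and its exact labels are not reproduced here, so the lumped "bad/risky" bucket and the function name hygiene_tally are assumptions made for illustration.

```python
from collections import Counter

# Illustrative labels only, limited to the functions mentioned in this post.
HYGIENE_LABELS = {
    "gets": "ick",
    "strcpy": "bad/risky",
    "memcpy": "bad/risky",
    "strlcpy": "good",
    "strlcat": "good",
}


def hygiene_tally(called_functions):
    """Count how many calls fall into each hygiene bucket.

    Functions without a label (e.g. printf) are ignored.
    """
    return Counter(
        HYGIENE_LABELS[name] for name in called_functions if name in HYGIENE_LABELS
    )


tally = hygiene_tally(["gets", "strcpy", "strcpy", "strlcpy", "printf"])
```

Comparing the ratio of good calls to ick or bad/risky calls across releases is one way such counts could surface a trend.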
Dynamic Analysis Metrics
CITL’s dynamic analysis of binaries generates metrics for exploitability and what they term “disruptability.” These tests involve running the code and examining its runtime behavior, much as the EPA determines miles-per-gallon ratings. They also aim to determine tolerances across a wide variety of inputs, like crash-testing a car or stress-testing a mechanical part.
For this purpose, CITL has leveraged American Fuzzy Lop (AFL, or afl-fuzz). This open-source fuzzer allows them to instrument code and automate some of their fuzz testing. CITL found that about one-third of over 200 binaries crashed when fuzzed with AFL.
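For readers new to fuzzing, the basic loop can be sketched in a few lines. This is a toy mutation fuzzer, not AFL: AFL is coverage-guided and works on instrumented binaries, while this sketch blindly mutates a seed input and records anything that makes a hypothetical, deliberately buggy Python function raise an exception. All names here (toy_parser, mutate, fuzz) are invented for the example.

```python
import random


def toy_parser(data: bytes) -> int:
    """Hypothetical target: a length-prefixed record parser with a bug."""
    if not data:
        return 0
    declared_len = data[0]
    payload = data[1:]
    # Bug: trusts the declared length instead of checking the payload size,
    # so an oversized length raises IndexError (our stand-in for a crash).
    return sum(payload[i] for i in range(declared_len))


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite, insert, or delete one randomly chosen byte."""
    data = bytearray(seed)
    choice = rng.randrange(3)
    if choice == 0 and data:
        data[rng.randrange(len(data))] = rng.randrange(256)
    elif choice == 1:
        data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    elif data:
        del data[rng.randrange(len(data))]
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int = 1000, rng_seed: int = 0):
    """Run the target on mutated inputs and collect those that raise."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except Exception:
            crashes.append(candidate)
    return crashes


crashing_inputs = fuzz(toy_parser, b"\x03abc", iterations=200)
```

The fraction of targets for which such a list is non-empty is the kind of figure reported above, although real fuzzers detect crashes at the process level rather than through exceptions.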
Also, their own fuzzer, CITL-fuzz, crashed about half of the binaries it targeted. Since fuzzing can be very CPU-intensive, CITL uses Bayesian analysis and linear regression to extrapolate and estimate some of their results with a given confidence. This is analogous to the EPA estimating MPG figures vs. testing all possible car configurations until the last drop of fuel.
Analyzing runtime/algorithmic complexity has also become an important security concern. An attacker might be able to submit specific types of input to cause the system to process inordinately more data. Or, an attacker might force an algorithm to devolve to its worst-case runtime performance.
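A classic illustration of the worst-case scenario is a quicksort that always picks the first element as its pivot. The sketch below is a stand-in example, not something taken from the talk: it only counts the comparisons such a sort would make during partitioning, showing how an attacker who supplies already-sorted input pushes the work from roughly n log n to n(n-1)/2.

```python
import random


def quicksort_comparisons(values):
    """Count partition comparisons for a first-element-pivot quicksort.

    Uses an explicit stack instead of recursion so that adversarial
    inputs cannot hit Python's recursion limit.
    """
    comparisons = 0
    stack = [list(values)]
    while stack:
        chunk = stack.pop()
        if len(chunk) <= 1:
            continue
        pivot, rest = chunk[0], chunk[1:]
        comparisons += len(rest)  # each remaining element is compared to the pivot
        stack.append([v for v in rest if v < pivot])
        stack.append([v for v in rest if v >= pivot])
    return comparisons


benign = list(range(200))
random.Random(1).shuffle(benign)   # typical, unordered input
adversarial = list(range(200))     # already sorted: worst case for this pivot rule

benign_cost = quicksort_comparisons(benign)
adversarial_cost = quicksort_comparisons(adversarial)
```

With 200 elements the sorted input costs 200 × 199 / 2 = 19,900 comparisons, several times more than the shuffled input, and the gap widens as inputs grow.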
Related: another talk at DefCon Vegas this year by a Netflix engineer showed why large cloud-hosted service providers need to defend against algorithmic complexity attacks.
Here are some fascinating takeaways from Sarah and Mudge’s work to date:
1. Although it had a greater percentage of code using stack guards, Firefox initially scored lower than Chrome or Safari on OS X because it didn’t leverage ASLR. Once made aware, the Firefox dev team addressed this issue by enabling ASLR in their Mac builds. Also, Chrome scored higher than Safari due to a greater percentage of modules with a non-executable heap.
2. Developers working on older codebases that use risky functions can (and should) flip certain switches on their compiler or linker that improve security in theory. Compilers then use heuristics to determine which replacements can be made programmatically without introducing new problems.
However, the degree to which this helps remove risk will be different across every codebase. Thus, measuring is key to understanding the risks. And when such compiler tools aren’t helping as much as development teams might have guessed, the source code itself needs to change.
For example, CITL’s analysis shows that function fortification replacements are lagging on macOS compared to Linux.
3. Through data analysis, CITL can infer differences among development teams creating consumer software. For example, on macOS, they have compared Google Chrome to Microsoft Excel.
a. The Google team shipped 64-bit binaries and even went an extra step to manually make the heap non-executable – something the 64-bit compiler itself couldn’t do.
b. Microsoft Excel consisted of 32-bit binaries and was missing some of the application armoring features. These features are on by default when using a modern build chain. Thus, CITL concludes that either the Microsoft Office for Mac dev team wasn’t using modern tools or security-related compiler/linker flags were explicitly disabled.
c. On the other hand, the Microsoft Excel team did not use risky C functions and the Google Chrome team did.
4. Lots of Fortune 500 companies use Anaconda, a DARPA-funded freemium pre-packaged roll-up of Python and R interpreters and big-data analytics libraries and packages. CITL determined that its 600 binaries were built with an old tool chain (a 2008 compiler running on a 2005 version of Linux). Thus, close to a decade’s worth of security and safety improvements in compiler and linker technology was not being leveraged.
5. Apple has made strides in Sierra vs. El Capitan, specifically with a decrease in the number of unfortified binaries. CITL saw significantly more good functions and fewer bad, risky or ick functions in Sierra vs. El Capitan. This indicates that Apple is maturing their Security Development Lifecycle.
6. Relative to application hardening techniques, Samsung was doing a much better job compared to LG in some of the smart TVs that CITL analyzed.
7. The Microsoft Office for Mac team apparently made some vast improvements in Office 2016 compared to Office 2011. In Figure 1 (below), Sarah presented a detailed histogram that rolls up an analysis of over 12,000 files in macOS overlaid with results from Microsoft Office binaries. From this, it’s clear Microsoft has made significant improvements making Office 2016 for Mac a much harder target for attackers vs. the low-hanging fruit in Office 2011 for Mac.
a. This is especially true in Office 2011’s AutoUpdate component, one of the softest targets on many Macs for the last several years. (Recommendation: if you use Office 2011 for Mac, consider upgrading.)
8. In analyzing the market value for new zero-day exploits, CITL found that its ranking of browser safety correlated with the 2017 cash value of exploits for those browsers. This is further evidence that CITL’s methods hold up in the real world.
The Future of Software Security for Consumers
CITL is also part of a new entity, The Digital Standard. Their mission is to “create a digital privacy and security standard to help guide the future design of consumer software, digital platforms and services, and Internet-connected products.” This is encouraging news and I’m excited to see how it evolves.
At Cylance, we are also committed to helping shield consumers from risks inherent in the software they use. Towards that end, we recently launched the world’s first AI-powered next-generation consumer security product, CylancePROTECT® Home Edition.
This product proactively blocks threats, whether seen before or not, without requiring any signature updates. It uses the same technology from our Enterprise class product which has been known to block every known piece of ransomware, BEFORE that ransomware was even released. It’s fast, lean, super-effective, is a fraction of the size of legacy consumer product suites and it uses fewer system resources.
As both a software security professional and as a consumer, I would like to be able to quantify the relative security metrics discussed above. Specifically, I believe the static analysis, dynamic analysis, and complexity metrics that CITL has pioneered should be applied across all consumer antimalware offerings in the market.
Cylance – the Safest, Most Effective and Performant Solution
The codebase of CylancePROTECT Home Edition was built over the last five years using modern, state-of-the-art software development techniques. I believe this will lead to significantly higher CITL scores than competing legacy products. Plus, unlike the legacy players in this space, Cylance doesn’t have multiple decades’ worth of older C/C++ code within its binaries.
Prior BlackHat and DefCon talks (e.g. this one or this one) have shown how legacy software security products themselves have actually made systems less secure. That’s because legacy agents can introduce vulnerabilities into the very systems they were purporting to protect.
Stuart McClure and Ryan Permeh started Cylance to revolutionize the industry by leveraging artificial intelligence and machine learning to predict and prevent attacks. And they already achieved this goal. It’s become clear to hundreds of corporate customers and industry observers that Cylance’s approach is superior in thwarting attackers. Better yet, Cylance agents were designed and built with the highest degrees of safety, self-protection, armoring and resiliency.
Thus, I look forward to independent testing by CITL and others. Beyond greater efficacy and performance preventing attacks, such testing will quantify how much safer Cylance’s products are to use vs. the competition. This will highlight the careful attention to product security and secure software development practices that have been ingrained in Cylance culture since day one.
I have a strong hunch that Cylance’s competitors are not nearly as excited as I am about this kind of bake-off. Well, like we always say - let’s do some real, ground-level testing and find out the truth!
Chief Architect | Consumer Products at Cylance