A malicious program disrupts computer operations, gains access to private computational resources, or collects sensitive information. In February 2012, nearly 300 million malicious programs were detected, according to a report compiled by SECURELIST. To help organizations protect against malware, I and other researchers at the SEI have focused our efforts on trying to determine the origin of the malware. In particular, I’ve recently worked with my colleagues—Arie Gurfinkel, who works with me in the SEI’s Research, Technology, & System Solutions Program, and Cory Cohen, a malware analyst with the CERT Program—to use the semantics of programming languages to determine the origin of malware. This blog post describes our exploratory research to derive precise and timely actionable intelligence to understand and respond to malware.
In a previous blog post, I described our efforts to use classification (a form of machine learning) to detect provenance similarities in binaries. Broadly, two functions are provenance similar if they have been compiled from similar source code using similar compilers. Malware programmers often draw upon similar source code with minor differences, different compilers (such as various versions of Microsoft Visual C++), or different levels of optimization. In creating a training set to learn (or train) a classifier to predict the similarity of binaries, we realized several positive results including high accuracy and automation in performing classification. We also recognized, however, the following limitations:
- Machine learning is not 100 percent precise. In some instances the classifier reported that two binaries were similar that weren’t or vice versa.
- In a lab environment, machine learning is hard to validate at scale due to the substantial time commitment of researchers who must manually check thousands of samples.
- Our approach was limited to pairwise comparison (detecting whether any given pair of malware/functions are similar), which impedes scalability to full production-size data sets since detecting similarity among a set of size N requires O(N2) comparisons.
Another approach that we researched, computing and comparing attribute vectors, also yielded low rates of accuracy and did not capture semantics.
Our New Approach
Our new approach is based on extending traditional syntactic clustering techniques that have been used to classify malware in the past with execution semantics. This approach involves two tasks:
We summarize each approach below.
Function Clustering Based on Semantic Hashes
The first task of our work—function clustering based on semantic hashes—is based on semantic reasoning about programs, which is also called static analysis. We chose this approach because semantic summaries significantly reduce the number of clusters compared to syntactic hash-based clustering (For more information on the Pithos signature, please read the article by Cory Cohen and Jeffrey Havilla on page 28 of the 2009 CERT Research Annual Report) while maintaining the quality of the cluster.
Our approach involves using static analysis to determine the effect of a function (i.e., some relationship between its inputs and outputs) and then mapping functions that have the same behavior to the same equivalence classes. We thus determine that two functions are similar if we arrive at a similar result when statically computing the effect of each function.
For example, the following functions can be identified as similar
- two functions that modify the same variable
- two functions that output the same value for the same input
- close outputs for close inputs (If a final input/output is 5 and another input output is 4, we’ll say that’s “close.” If one output is 10 and one is 20, we’ll say “far.”)
Another aspect of this first task involves validating the static analysis and constructing a semantic hash or equivalence class. The validation involves constructing a benchmark from the CERT malware database and then performing clustering on that benchmark. There are many choices we can make, so we need to find the ones that actually work.
This approach allows us to avoid the inefficient O(N2) pair-wise comparison problem. In particular, equivalence classes are determined by hashing every binary/function individually and determining which ones have the same hash, not by comparing every pair of binaries/functions.
The second task in our approach focused on solving the pairing problem by semantically comparing two functions using a technique called regression verification, which converts the problem of comparing two functions to that of checking the equivalence of their input-output relationship expressed as logical formulas. This approach has been applied to other domains and problems including checking the equivalence of two programs available as source code. We wanted to broaden its scope by applying regression verification to binaries, which are low-level, executable code.
To accomplish this task, we relied upon a platform called ROSE, which is a framework for handling binaries, as well as a binary analysis platform. The ROSE framework provides disassembly, instruction semantics, and other static analysis capabilities for us to build our semantic function comparison technique.
Pairing our first task with semantic hashing allowed us to verify that our approach worked. In other words, we used the second technique to verify the first technique.
The application of regression verification eliminates the need for researchers to spend countless hours manually verifying the correctness of every pair. It also provides a mechanism by which we could be reasonably confident in an automated method for verifying the correctness of pairs.
Combining the Two Tasks
Semantic hashing provides a precise—but rough—idea of similarity among functions. For example, given a million functions, semantic hashing could identify 10,000 groups each with 10 functions. In a particular group with 10 functions, all functions may not be similar. Researchers can then do a pairwise comparison, however, because each group is small in size. With 10 functions, researchers would have 45 pairs to check. Researchers can then apply regression verification (the second task) to take a closer look at each pair and prune out and refine each equivalence class.
One challenge that we face is applying regression verification to binaries. While prior research has applied this technique to compare source code, applying it to binaries is non-trivial due to low-level details (e.g., bit-precise semantics that treat registers as 32-bit values instead of infinite-domain integers) and lack of structure (e.g., much harder to compute accurate control-flow graphs). Our approach to overcome this challenge consists of two parts: (i) logically partition system memory into three parts – the stack, the heap, and the region where parameters passed to functions are stored; and (ii) develop a semantics of executable code in terms of its effect on these three partitions of system memory.
One risk we’ve identified is that the application of regression verification might not work well on very small functions since many have no inherent behavior (e.g., they wrap more complex functions), and they are trivially equivalent if we just consider their bodies. To mitigate that risk, we identified and pruned small functions from summarization. Our expectation is that this will have a marginal effect on analysis since small functions are not very helpful to analysts in general. Note that obfuscations that split large complex functions into many smaller ones are beyond the scope of this project; we assume that there are orthogonal deobfuscation techniques that deal with such issues.
A second risk that we’ve identified is that the cost of computing precise semantic summaries may be expensive. To mitigate this risk, our approach favored scalability over precision. We reasoned that it is better to create a scalable solution that yields precise summaries for 80 percent of functions than a non-scalable approach that could yield precise summaries for 99 percent of functions but only works on 10 percent of functions in a reasonable time.
Impact to DoD
Through this research we hope to reduce the amount of manual malware analysis that is required by the organizations seeking to protect valuable information system assets. Reducing this analysis will lead to more cost-effective identifications and more timely responses to intrusions. This approach will also provide increased visibility into intruder behavior, leading to more effective defense against future intrusions, which are of increasing concern.
Initial Results/Future Work
We have implemented a tool (on top of ROSE) that computes two different types of semantic hashes, and empirically evaluated them on a benchmark derived from the CERT artifact catalog. Our paper, “Binary Function Clustering using Semantic Hashes,” describing this research will appear in the Proceedings of the 11th International Conference on Machine Learning and Applications.
To read the paper Binary Function Clustering Using Semantic Hashes, Wesley Jin, Sagar Chaki, Cory Cohen, Arie Gurfinkel, Jeffrey Havrilla, Charles Hines, Priya Narasimhan, Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA), December 12 to 15, 2012.
To read the paper Supervised learning for provenance-similarity of binaries by Sagar Chaki, Cory Cohen, & Arie Gurfinkel, please visit
To read the paper Regression verification by Benny Godlin & Ofer Strichman, please visit
To read about the research on Function Hashing for Malicious Code Analysis by CERT analysts Cory Cohen and Jeffrey Havilla, please see page 28 of the CERT Research Annual Report, which can be read at www.cert.org/research/2009research-report.pdf