Good Stuff Bucket

Z3Py – guide

A gentle introduction to abstract interpretation

What is static analysis?

Get Big Fast vs Organic Growth (Joel Spolsky’s talk at Startup School 2012, @12:50)

Advertisements

A Survey of Call Graph and Points-to Algorithms in Java

Disclaimer: I am not an expert on the subject, this post is a collection of my reading notes from a bunch of papers. There may be factual errors and the content is subject to change. Just because I can.

In object-oriented languages, constructing a call graph is complicated by the presence of virtual method calls. This is a problem known as the dynamic dispatch problem, and is common in many programming languages and paradigms. The problem can be formulated as follows for Java: if m is a virtual method (i.e. not private or static), then the method that o.m() resolves to depends on the run-time type of o (as opposed to its declaration type). Determining the run-time type of o requires points-to information (what object may o point to?), which in turn requires a call graph to be computed (because what values o may point to depend on what methods get called). This mutual recursion is a fundamental problem, that is not quite trivial to tackle.

A number of static analyses have been proposed to do the job, but since the problem is undecidable no algorithm can return both a precise and correct answer. As a result, we have to decide if we want fast or precise analyses, since they can not be both. Some of the characteristics that affect the precision and cost of the algorithms are:

  • flow sensitivity: does the order of statements matter?
  • context sensitivity: how to distinguish between two calls of the same procedure in different contexts?
  • aggregate structure modeling: how are arrays, lists, and hash tables abstracted?
  • heap modeling: how to differentiate the potentially unbounded number of objects that may be created at a given allocation site?

Below are some notes about flow sensitivity and context sensitivity, followed by a discussion of some of the classic algorithms for call graph construction.

Flow sensitivity

  • Flow-insensitive analyses are only concerned with what statements are present in the program, not with the order or the reachability of statements. They return properties of variables valid in the whole program.
  • Flow-sensitive analyses compute what values may flow to variables by considering valid paths in the program. They return properties of variables valid at certain program points. Therefore, they are more precise but more expensive to compute.

Example (courtesy of smcc): a parity analysis for the following code snippet yields different results depending on the flow sensitivity of the analysis. A flow-insensitive analysis would return that x could be anything, while a flow-sensitive analysis would return that x is even at line 3 and odd at line 4.

void someMethod(int y) {
   int x;
   x = 2 * y;
   x = x + 1;
}

A common way to obtain a form of flow sensitivity is to transform the program in SSA form (thereby encoding control dependencies in variable names) and then to apply flow-insensitive analyses. Algorithms for efficient translation to SSA form are given in Appel, Modern Compiler Implementation in Java. For an implementation that transforms Java bytecode to SSA form, see Soot. In SSA form, the above example would become:

void someMethod(int y) {
    int x_1 = 2 * y;
    int x_2 = x_1 + 1;
}

A flow-insensitive analysis would then be sufficient to return that x_1 is even and x_2 is odd.
Although instances of the standard dataflow frameworks (Kildall, A Unified Approach to Global Program Optimization, POPL 1973) lend themselves particularly well to flow-sensitive analyses, Nielson et al. in Principles of Program Analysis give an example of a flow-insensitive instance, by simply considering empty kill sets in the dataflow equations (i.e., it loses information about which variables are overwritten).

Context sensitivity

Context sensitivity matters for inter-procedural analyses, where a single method may be invoked in different contexts. So, “what is a context?”, you may ask. Well, it is a parameter of the analysis, so the answer is “it depends”. The most common abstractions seem to be:

  • call sites, so that the same method invoked from different line numbers would be distinct. However, if a method is invoked multiple times from the same program point (in a loop for example), then all these invocations would look exactly the same to the analysis, because they are all abstracted to come from the same place.
  • receiver objects, so that the same method applied to two different (abstract) objects would be distinct. Objects are commonly abstracted by their allocation site (i.e., the place where they are new’ed). The same restriction as above applies, multiple allocations in a loop will be merged. The technique is described in Parameterized Object Sensitivity for Points-to Analysis for Java (ISSTA 2002, 129 citations) by Milanova, Rounev and Ryder.

Call site sensitivity example (from Lhoták’s PhD thesis): A context-insensitive points-to analysis would analyse id() once, and return that both c and d may point to b and a. A context-sensitive analysis using call sites as the context abstraction, on the other hand, would analyze id() in 2 different contexts and would properly return that c points to a and d points to b.

Object id(Object o) {
    return o;
}

void f() {
    Object a = new Object();
    Object b = new Object();
    Object c = id(a);
    Object d = id(b);
}

Object sensitivity example (from Chord’s PLDI tutorial): in the following example, we want to examine whether floors.elems and events.elems may point to the same elements. An object-insensitive points-to analysis would return that they could point to the same objects, because it does not distinguish between different instances of List. An object-sensitive analysis would abstract events and floors using two different allocation sites, and would maintain separate points-to information for evens.elems and floors.elems. Therefore, it would be able to determine that these two arrays have disjoint points-to sets.

class List {
	Object[] elems;

	List() {
		elems = new Object[42];
	}
}

class Bldg {
	List events, floors;

	Bldg() {
		this.events = new List();
		this.floors = new List();
		for (int i=0; i < events.elems.length; i++)
			events.elems[i] = new Object();
		for (int i=0; i < floors.elems.length; i++)
			floors.elems[i] = new Object();
	}
}

In each case (object sensitivity and call site sensitivity), we could get further precision by examining not only the last call site/receiver object, but the list of call sites/receiver objects that lead to the current context (i.e., the call stack is the context). In general, these lists may be infinite, so a common abstraction is to bound them by a parameter k. Hence, we refer to k-object-sensitivity or k-call-site-sensitivity. However, if m in the number of abstract context abstractions (allocation sites or call sites), there are m^k potential abstract contexts to consider. It quickly becomes impractical for values of k greater than 2 or 3.

The added precision of context-sensitivity depends on the client analysis. For instance, context-sensitivity does not add much to the precision of call graph construction and virtual method resolution. In Java, 1-object-sensitivity seems to provide the best tradeoff between performance and precision for most client analyses.

Algorithms

Example:

abstract class Animal {
	public abstract void saySomething();
}

class Cat extends Animal {
	public void saySomething() {
		System.out.println("purr");
	}
}

class Dog extends Animal {
	public void saySomething() {
		System.out.println("woof");
	}
}

class Fish extends Animal {
	public void saySomething() {
		System.out.println("...");
	}
}

class Car {  // not an Animal
	public void saySomething() {
		System.out.println("honk!");
	}
}

public class Main {
	static Animal neverCalled() {
		return new Fish();
	}

	static Animal selectAnimal() {
		return new Cat();
	}

	public static void main(String[] args) {
		Animal animal = selectAnimal();
		animal.saySomething();
	}
}

There are 4 methods called saySomething in this application, and the problem is to find which one(s) may be called at run-time (the correct answer is Cat.saySomething()). A dumb and incorrect analysis would just report that any method called saySomething could be invoked, including the one in the Car class. Many algorithms have been proposed to solve this problem, here are a few examples (they are all flow-insensitive, and n stands for the program size):

  • CHA (Class Hierarchy Analysis)
    • by Dean, Grove and Chambers in Optimization of object-oriented programs using static class hierarchy analysis (ECOOP 1995, 490 citations)
    • complexity: O(n)?
    • it does the only trivial thing to do: it first builds the class hierarchy (i.e., it figures that Cat, Fish and Dog are subclasses of Animal) and determines that saySomething() could be applied to any subclass of Animal, thereby eliminating Car.
  • RTA (Rapid Type Analysis)
    • by Bacon and Sweeney in Fast Static Analysis of C++ Virtual Function Calls (OOPSLA 1996, 387 citations)
    • complexity: O(n)?
    • simply put, it prunes from the CHA call graph methods that can never be reached because their enclosing class is never instantiated. In our example, it would determine that saySomething() could be applied to a Fish or a Cat but not a Dog, since no instance of Dog is ever created. It is strictly more powerful than CHA and still very fast and simple.
  • VTA (Variable Type Analysis)
    • by Sundaresan et al. in Practical Virtual Method Call Resolution for Java (OOPSLA 2000, 206 citations)
    • complexity: O(n)?
    • it can be thought of as a refined version of RTA. When RTA collects all objects that can be created in the whole program and uses that information to prune the call graph, VTA does this for each variable instead, providing more precise information. In our example, VTA would correctly determine that saySomething() can not be applied to a Fish, because there is no corresponding edge in the type propagation graph.
  • k-CFA (Control Flow Analysis of order k):
    • by Shivers in Control-flow analysis of higher-order languages (1991 PhD thesis, 476 citations)
    • complexity: O(n3/log(n)) for 0CFA
    • initially formulated for the functional language Scheme, it has apparently evolved to mean different things in the functional and OO communities (I am still looking for a simple formulation for OO languages though). In Resolving and Exploiting the k-CFA Paradox (PLDI 2010), it is described as a “k-call-site-sensitive, field-sensitive points-to analysis algorithm with a context-sensitive heap and with on-the-fly call-graph construction”. It seems that any analysis that uses k-call-strings as contexts could be called a k-CFA analysis.
    • in the PLDI paper, Might et al. also note that “[in the OO setting] 1- and 2-CFA analysis is considered heavy but certainly possible”.
    • in our example, 0CFA would return that saySomething() at line 40 has to be applied to a Cat, because only a Cat can flow to the animal variable.
  • ZCWL
    • by Zhu and Calman in Symbolic pointer analysis revisited (PLDI 2004, 69 citations) and Whaley and Lam in Cloning-based context-sensitive pointer alias analysis using binary decision diagrams (PLDI 2004, 327 citations)
    • complexity: massive
    • essentially a k-CFA analysis where k is the depth of the call graph (plus some fancy-schmancy operations on strongly connected components of the call graph). Requires a context-insensitive call-graph to boot, which it then makes context-sensitive.

Based on this simple example, we see that CHA and RTA return an imprecise call graph, while VTA, 0CFA and ZCWL return a precise call graph. VTA is by far the simplest and most efficient in our case, because we only want a rather coarse-grained call graph, not variable-level points-to information that k-CFA and ZCWL provide.

Further reading

  • for a detailed discussion of the different sensitivities/abstractions: Lhoták, Program analysis using binary decision diagrams (PhD thesis, 2006)
  • for a theoretical comparison of the relative precision of different algorithms: Grove et al., Call Graph Construction in Object-Oriented Languages (ACM SIGPLAN 1997, 209 citations)
  • for an empirical comparison of the relative precision and performance of different algorithms: Lhoták and Hendren, Context-sensitive points-to analysis: is it worth it? (CC 2006, 75 citations)

A healthy dose of sanity

It is not that often that you learn something and get a good laugh from a Ph.D. thesis. Hence, Shivers’ Control-flow analysis of higher-order languages has been a pleasant surprise.

He is at the origin of k-CFA, a class of static analyses that tackle the dynamic dispatch problem with context-sensitive data-flow analysis. Even though the analysis is targeted at functional languages, it also applies to object-oriented languages with virtual function calls (see Matt Might for a gentle introduction). From Shivers’ thesis:

So, if we wish to have a control-flow graph for a piece of Scheme code, we need to answer the following question: for every procedure call in the program, what are the possible lambda expressions that call could be a jump to? But this is a flow analysis question! So with regard to flow analysis in a [higher-order language], we are faced with the following unfortunate situation:

  • In order to do flow analysis, we need a control-flow graph.
  • In order to determine control-flow graphs, we need to do flow analysis.

The problem could be rephrased for Java as: “if we wish to have an [inter-procedural] control flow graph for a Java application, we need to answer the following question: for every instance method call, what classes could the object be an instance of?”

Generic Iterative Dataflow Analysis in Python

Here is an example of a remarkably simple yet powerful dataflow analysis technique. The algorithm is very generic and can be used to implement a number of forward and backward analyses such as constant propagation, reaching definitions, value-set analysis, or in my case type inference.

The algorithm, adapted from the maximal fixedpoint algorithm in the dragon book, takes a control flow graph as input and outputs IN and OUT (maps from basic block to abstract state at their entry and exit (an abstract state maps variables to abstract values)). It is parametric, you must supply it with a few functions that will determine the output of your analysis:

  • analysis.meet, takes two abstract states and returns an abstract state (see lattices). To guarantee that the algorithm terminates, this function should be monotone (and your lattice of abstract values of finite height).
  • analysis.step_forward (resp. analysis.step_backward), a function that takes an instruction and an abstract state at its entry (resp. exit) and “executes” it, transforming the abstract state. They are used to automatically compute the transfer function for each basic block in the cfg.
It looks like this:
def forward_transfer_function(analysis, bb, IN_bb):
    OUT_bb = IN_bb.copy()
    for insn in bb:
        analysis.step_forward(insn, OUT_bb)
    return OUT_bb

def backward_transfer_function(analysis, bb, OUT_bb):
    IN_bb = OUT_bb.copy()
    for insn in reversed(bb):
        analysis.step_backward(insn, IN_bb)
    return IN_bb

def update(env, bb, newval, todo_set, todo_candidates):
    if newval != env[bb]:
        print '{0} has changed, adding {1}'.format(bb, todo_candidates)
        env[bb] = newval
        todo_set |= todo_candidates

def maximal_fixed_point(analysis, cfg, init={}):
    # state at the entry and exit of each basic block
    IN, OUT = {}, {}
    for bb in cfg.nodes:
        IN[bb] = {}
        OUT[bb] = {}
    IN[cfg.entry_point] = init

    # first make a pass over each basic block
    todo_forward = cfg.nodes
    todo_backward = cfg.nodes

    while todo_backward or todo_forward:
        while todo_forward:
            bb = todo_forward.pop()

            ####
            # compute the environment at the entry of this BB
            new_IN = reduce(analysis.meet, map(OUT.get, cfg.pred[bb]), IN[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

            ####
            # propagate information for this basic block
            new_OUT = forward_transfer_function(analysis, bb, IN[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

        while todo_backward:
            bb = todo_backward.pop()

            ####
            # compute the environment at the exit of this BB
            new_OUT = reduce(analysis.meet, map(IN.get, succ[bb]), OUT[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

            ####
            # propagate information for this basic block (backwards)
            new_IN = backward_transfer_function(analysis, bb, OUT[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

    ####
    # IN and OUT have converged
    return IN, OUT
Ideally, to propagate dataflow information in one pass, you would like to have visited every predecessors of a basic block B for a forward pass before analyzing B. Unfortunately, due to irreducible flow graphs you are not guaranteed to be able to do this. Instead, this algorithm
  1. starts with an empty state at some arbitrary basic block
  2. makes a forward pass and a backward pass over each basic block, adding the successors/predecessors to a worklist when changes are detected
  3. continues until the worklist is empty.
The meet function is here to “combine” information from multiple paths, for instance if B2 is reachable from B0 and B1, then IN(B2) = meet(OUT(B1), OUT(B2)). If you wanted to collect value set information and you had:
  • OUT(B0) = [a->{1}]
  • OUT(B1) = [a-> {-1}]
  • then meet could output IN(B2) = [a -> {1, -1}]
Depending on how meet is defined, it can look for information true for all paths coming to a basic block, or for information from at least one path.
Now, some sample code to implement a simple constant propagation analysis. It is forward only for simplicity, but the algorithm works for bidirectional analyses such as type inference.

def meet_val(lhs, rhs):
    result = None

    if lhs == 'NAC' or rhs == 'NAC':
        result = 'NAC'

    elif lhs == 'UNDEF' or rhs == 'UNDEF':
        result = 'UNDEF'

    else:
        result = 'CONST'

    return result

def meet_env(lhs, rhs):
    lhs_keys = set(lhs.keys())
    rhs_keys = set(rhs.keys())
    result = {}

    for var in lhs_keys - rhs_keys:
        result[var] = lhs[var]

    for var in rhs_keys - lhs_keys:
        result[var] = rhs[var]

    for var in lhs_keys & rhs_keys:
        result[var] = meet_val(lhs[var], rhs[var])

    return result

def abstract_value(env, expr):
    if expr.isdigit():
        return 'CONST'

    try:
        return env[expr]
    except KeyError:
        return 'UNDEF'

def step_forward(insn, env_in):
    if type(insn) == str:
        return 

    var, op, expr = insn

    # insn is var = c
    if len(expr) == 1:
        env_in[var] = abstract_value(env_in, expr)

    else:
        e1, op, e2 = expr
        val1 = abstract_value(env_in, e1)
        val2 = abstract_value(env_in, e2)
        env_in[var] = meet_val(val1, val2)

def step_backward(insn, env_in):
    pass

The function step_forward defines the abstract semantics for the statements or instructions of the language you want to analyze and for the analysis you want to implement. For instance here we only collect if a variable at some program point is constant, undefined, or not a constant (NAC). To do the actual propagation, we could also collect the allocation site of the constant.

the example program

Let’s consider a super simple language, where variables are numbers that can only be affected to or added together. The function meet_val computes the meet for two abstract values, according to this table:

        UNDEF  CONST  NAC
       -----------------
UNDEF | UNDEF  UNDEF  NAC
CONST | UNDEF  CONST  NAC
NAC   | NAC    NAC    NAC

Let’s consider a simple program in this “language” where we don’t specify the constructs for the control flow. The algorithm just assumes that every edge in the CFG is reachable. This is obviously not the case in practice, but that only means that we are going to miss some patterns (the analysis is sound but imprecise in order to terminate).

Now, we want to find if a and ret are constants. Here is the code necessary to setup and run the example (you need networkx to run it):

import networkx as nx    

class SomeObject:
    pass

def instructionify(somestr):
    toks = somestr.split()
    if '+' in somestr:
        return (toks[0], toks[1], (toks[2], toks[3], toks[4]))
    return tuple(somestr.split())

# setup the program's cfg
prog = nx.DiGraph()
s0 = ('entry'),
s1 = instructionify('b = x'),
s2 = instructionify('c = 2'),
s3 = instructionify('a = 40 + c'),
s4 = instructionify('ret = a + x'),
prog.add_edge(s0, s1)
prog.add_edge(s1, s2)
prog.add_edge(s2, s1)
prog.add_edge(s1, s3)
prog.add_edge(s3, s3)
prog.add_edge(s3, s4)

# initialize pred and succ
pred, succ = {}, {}
for bb in prog:
    pred[bb] = set(prog.predecessors(bb))
    succ[bb] = set(prog.successors(bb))

cfg             = SomeObject()
cfg.nodes       = set(prog.nodes())
cfg.pred        = pred
cfg.succ        = succ
cfg.entry_point = s0

analysis               = SomeObject()
analysis.meet          = meet_env
analysis.step_forward  = step_forward
analysis.step_backward = step_backward

# run the whole thing
IN, OUT = maximal_fixed_point(analysis, cfg)
print 'a   at program point s3 is', OUT[s3]['a']
print 'ret at program point s4 is', OUT[s4]['ret']

And the output is:

a   at program point s3 is CONST
ret at program point s4 is UNDEF

As a final note: it is possible to speed things up a bit by choosing a better ordering for basic blocks than just going randomly at first (because we initially fail to propagate lots of information). This might end up in another blog post. Cheers!

How old is Dynamic Binary Instrumentation?

So I was recently reading TAOCP (note for later: find a way to not be smug about it), and stumbled on this gem:

1.4.3.2. Trace Routines. When a machine is being simulated on itself (…), we have the special case of a simulator called a trace or monitor routine. Such programs are occasionally used  to help in debugging, since they print out a step-by-step account of how the simulated program behaves.

This is about as exciting as it gets for somebody versed in tracing and dynamic binary instrumentation, especially since this volume of TAOCP was published in 1968. The algorithm that follows looks exactly like what you would find in today’s dynamic binary rewriters (saving and restoring application registers, decoding and executing the instruction while retaining control on jumps). There is also a mention regarding what we now call transparency:

[the program being traced] must not store anything into the locations used by the trace program

The bit that achieved to kill me was exercise 6, “design a trace routine capable of tracing itself” (I happen to be very sensitive when somebody mentions self-applying things). I spent a few years working on this, how come it never came up before? I looked again at references in the few standard papers about DBI, namely:

  • Luk et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, 2005
  • Bruening, Efficient, Transparent, and Comprehensive Runtime Code Manipulation, 2004
  • Bala et al., Dynamo: a transparent dynamic optimization system, 2000
  • Nethercote and Seward, Valgrind: a framework for heavyweight dynamic binary instrumentation, 2007
  • Hunt and Brubacher, Detours: Binary Interception of Win32 Functions, 1999

There is no mention of Knuth, and with one exception, the oldest references date from the 90s. The exception comes from the Detours paper, with an incredible reference to Gill, The Diagnosis of Mistakes in Programmes on the EDSAC, 1951:

The second technique will be referred to as the ‘step-by-step’ technique.  In this, the  control unit  of  the  machine never  obeys  any  of  the  orders of  the  original programme directly.  The  machine remains under  the  control  of  the  checking routine, which is  so  constructed that  the  orders of  the  original programme are examined one by one, and carried out in exactly the same manner and sequence as if  they  were being  obeyed  directly.  If  this  were all  that  the  checking routine accomplished, it  would be merely a means of  carrying out machine operations in slow motion-slowed  down, in fact, by a factor of the order of  10. The reason for adopting this type of operation is that it is now open to the author of the checking routine to insert additional orders into it,  causing it  to print useful information as it proceeds. This information may be chosen so as to  assist in the investigation of either order or numerical failures.

So to answer the original question in this post: software-based tracing seems to be just as old as software itself.

The Halting Problem for Reverse Engineers

I often have the feeling that technically savvy people don’t have a very high opinion of academia, and this is particularly true of security people. They have to deal with low-level details such as hardware architecture, operating systems internals (more or less documented), proprietary protocols and data structures, all of which require very specialized knowledge. Logic, theorems and algorithms don’t have a predominant place in that picture.

For the past few years I have been working on these subjects, and found some fundamental theorems to be actually *useful* in understanding the security properties of computer architectures. One of them is, of course, the undecidability of the halting problem. In essence, it says that we can not know if the computation of a program on some input will ever terminate. So what? Who cares if a program terminates, what we want is to find vulnerabilities and unpack malware samples, right?

The importance of the undecidability of the halting problem lies in its generality. In particular we can see Rice’s theorem as a generalization of this result, and to put it very simply, it says that whatever properties of programs you’re interested in, no program can tell if this property holds for every program (i.e. it is undecidable).

This is very bad news for all of us, since basically everything about programs is undecidable. Say that you are interested in finding functions that do out-of-bounds memory writes (or as I said, any other property), Rice’s theorem says that there is no program that will give you a correct answer all the time. You must accept that your program sometimes fails or infinitely loops.

I want to emphasize how bad this is. Do not let the terminology used in the theorems confuse you. In particular, the notions of input, output, and function computed by a program do not map nicely to binaries. An output is anything that gets modified by your program — any register or memory location, as soon as it is touched by an instruction, is an output. And basically, everything about outputs is undecidable. As a consequence, simple tasks such as disassembling are undecidable.

For instance, take this seemingly innocent indirect jump:

jmp [eax]

If eax is an output of instructions before it, no luck, its value is undecidable. You can run the program, write the value down, and assume it will not change, but you have no guarantee that it will not change at a given date. Undecidable. You could argue that eax can only take a finite number of values, and hence disassembling is still possible, just very intractable. But that would be without counting on self-modifying code. SMC (think packers) is the scourge of disassemblers because it gives the ability to transfer control to an output. Since I can’t decide the value of the output, I can’t disassemble.

To sum things up, here are a few direct consequences of the undecidability of the halting problem:

  1. you can’t decide the target of indirect jumps, reads and writes
  2. you can not decide if a particular memory address is code, data, or both
  3. you can’t decide values written in memory
  4. you can’t decide the numbers of occurrences of loops
  5. you can’t decide if control flow can reach a given instruction
  6. whatever you see in a given run can change arbitrarily in another run
  7. disassembling is undecidable
  8. unpacking is undecidable

I will leave how all this led to the bad habit of relying on “heuristics” to a further post. Stay classy!