A gentle introduction to abstract interpretation
Disclaimer: I am not an expert on the subject, this post is a collection of my reading notes from a bunch of papers. There may be factual errors and the content is subject to change. Just because I can.
In object-oriented languages, constructing a call graph is complicated by the presence of virtual method calls. This is known as the dynamic dispatch problem, and it is common to many programming languages and paradigms. The problem can be formulated as follows for Java: if m is a virtual method (i.e. not private or static), then the method that o.m() resolves to depends on the run-time type of o (as opposed to its declared type). Determining the run-time type of o requires points-to information (what objects may o point to?), which in turn requires a call graph to be computed (because the values o may point to depend on which methods get called). This mutual recursion is a fundamental problem that is not trivial to tackle.
A number of static analyses have been proposed to do the job, but since the problem is undecidable, no algorithm can return an answer that is both precise and correct in all cases. As a result, we have to decide whether we want fast or precise analyses: we cannot have both. The main characteristics that affect the precision and cost of these algorithms are flow sensitivity and context sensitivity.
Below are some notes about flow sensitivity and context sensitivity, followed by a discussion of some of the classic algorithms for call graph construction.
Example (courtesy of smcc): a parity analysis for the following code snippet yields different results depending on the flow sensitivity of the analysis. A flow-insensitive analysis would return that x could be anything, while a flow-sensitive analysis would return that x is even at line 3 and odd at line 4.
void someMethod(int y) {
    int x;
    x = 2 * y;    // line 3: x is even here
    x = x + 1;    // line 4: x is odd here
}
A common way to obtain a form of flow sensitivity is to transform the program into SSA form (thereby encoding control dependencies in variable names) and then to apply flow-insensitive analyses. Algorithms for efficient translation to SSA form are given in Appel, Modern Compiler Implementation in Java. For an implementation that transforms Java bytecode to SSA form, see Soot. In SSA form, the above example would become:
void someMethod(int y) {
    int x_1 = 2 * y;
    int x_2 = x_1 + 1;
}
A flow-insensitive analysis would then be sufficient to return that x_1 is even and x_2 is odd.
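To make this concrete, here is a minimal sketch of such a flow-insensitive parity analysis over the SSA form (my own toy Python, not code from any of the works cited here). Since every SSA variable has exactly one definition, a single pass over the definitions is enough.

EVEN, ODD, TOP = 'EVEN', 'ODD', 'TOP'

def parity_mul(a, b):
    # anything multiplied by an even value is even
    if EVEN in (a, b):
        return EVEN
    if a == ODD and b == ODD:
        return ODD
    return TOP

def parity_add(a, b):
    if TOP in (a, b):
        return TOP
    return EVEN if a == b else ODD

env = {'y': TOP}                            # we know nothing about y
env['x_1'] = parity_mul(env['y'], EVEN)     # x_1 = 2 * y, and 2 is even
env['x_2'] = parity_add(env['x_1'], ODD)    # x_2 = x_1 + 1, and 1 is odd
print(env)                                  # y: TOP, x_1: EVEN, x_2: ODD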
Although instances of the standard dataflow frameworks (Kildall, A Unified Approach to Global Program Optimization, POPL 1973) lend themselves particularly well to flow-sensitive analyses, Nielson et al. in Principles of Program Analysis give an example of a flow-insensitive instance, by simply considering empty kill sets in the dataflow equations (i.e., it loses information about which variables are overwritten).
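As a toy illustration of the empty-kill-set trick (my own example, not one from the book): in a reaching-definitions instance, an empty kill set means a definition is never removed once generated, so the fact that x@1 is overwritten by x@2 is lost.

# Two consecutive blocks, B1 defines x (definition x@1), B2 redefines x (x@2).
# The usual equation is: out[B] = gen[B] | (in[B] - kill[B])
gen              = {'B1': {'x@1'}, 'B2': {'x@2'}}
kill_sensitive   = {'B1': {'x@2'}, 'B2': {'x@1'}}   # normal kill sets
kill_insensitive = {'B1': set(),   'B2': set()}     # flow-insensitive variant

def out(block, in_set, kill):
    return gen[block] | (in_set - kill[block])

print(out('B2', out('B1', set(), kill_sensitive), kill_sensitive))
# {'x@2'}: the overwritten definition x@1 is killed
print(out('B2', out('B1', set(), kill_insensitive), kill_insensitive))
# {'x@1', 'x@2'}: with empty kill sets, both definitions "reach" the exit of B2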
Context sensitivity matters for inter-procedural analyses, where a single method may be invoked in different contexts. So, “what is a context?”, you may ask. Well, it is a parameter of the analysis, so the answer is “it depends”. The most common abstractions seem to be call sites (call-site sensitivity) and receiver objects (object sensitivity).
Call site sensitivity example (from Lhoták’s PhD thesis): A context-insensitive points-to analysis would analyze id() once, and return that both c and d may point to a and b. A context-sensitive analysis using call sites as the context abstraction, on the other hand, would analyze id() in 2 different contexts and would properly return that c points to a and d points to b.
Object id(Object o) {
    return o;
}

void f() {
    Object a = new Object();
    Object b = new Object();
    Object c = id(a);
    Object d = id(b);
}
Object sensitivity example (from Chord’s PLDI tutorial): in the following example, we want to determine whether floors.elems and events.elems may point to the same elements. An object-insensitive points-to analysis would return that they could point to the same objects, because it does not distinguish between different instances of List. An object-sensitive analysis would abstract events and floors by their two different allocation sites, and would maintain separate points-to information for events.elems and floors.elems. Therefore, it would be able to determine that these two arrays have disjoint points-to sets.
class List {
    Object[] elems;
    List() { elems = new Object[42]; }
}

class Bldg {
    List events, floors;
    Bldg() {
        this.events = new List();
        this.floors = new List();
        for (int i = 0; i < events.elems.length; i++)
            events.elems[i] = new Object();
        for (int i = 0; i < floors.elems.length; i++)
            floors.elems[i] = new Object();
    }
}
In each case (object sensitivity and call-site sensitivity), we could get further precision by examining not only the last call site/receiver object, but the list of call sites/receiver objects that leads to the current context (i.e., the call stack is the context). In general, these lists are unbounded, so a common abstraction is to limit them to length k; hence we speak of k-object-sensitivity or k-call-site-sensitivity. However, if m is the number of context elements (allocation sites or call sites), there are up to m^k potential abstract contexts to consider, which quickly becomes impractical for values of k greater than 2 or 3.
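A tiny illustration of the k-limiting idea (my own toy code, with made-up call-site labels): the context is the string of the k most recent call sites, and older ones are dropped, which is what bounds the number of contexts to m^k.

def extend_context(ctx, call_site, k):
    return (ctx + (call_site,))[-k:]

ctx = ()
for site in ('f:12', 'g:34', 'h:56'):
    ctx = extend_context(ctx, site, k=2)
    print(ctx)
# ('f:12',)
# ('f:12', 'g:34')
# ('g:34', 'h:56')   <- the oldest call site has been dropped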
The added precision of context-sensitivity depends on the client analysis. For instance, context-sensitivity does not add much to the precision of call graph construction and virtual method resolution. In Java, 1-object-sensitivity seems to provide the best tradeoff between performance and precision for most client analyses.
Example:
abstract class Animal {
    public abstract void saySomething();
}

class Cat extends Animal {
    public void saySomething() { System.out.println("purr"); }
}

class Dog extends Animal {
    public void saySomething() { System.out.println("woof"); }
}

class Fish extends Animal {
    public void saySomething() { System.out.println("..."); }
}

class Car { // not an Animal
    public void saySomething() { System.out.println("honk!"); }
}

public class Main {
    static Animal neverCalled() { return new Fish(); }
    static Animal selectAnimal() { return new Cat(); }

    public static void main(String[] args) {
        Animal animal = selectAnimal();
        animal.saySomething();
    }
}
There are 4 methods called saySomething in this application, and the problem is to find which one(s) may be called at run-time (the correct answer is Cat.saySomething()). A dumb analysis would just report that any method called saySomething could be invoked, including the one in the Car class. Many algorithms have been proposed to do better; classic examples include CHA, RTA, VTA, k-CFA (0CFA in particular) and ZCWL, all of which are flow-insensitive. A quick sketch of CHA on this example follows.
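Here is a rough sketch of what CHA computes for the call animal.saySomething() (my own toy Python over a hand-written model of the class hierarchy, not an analysis of real Java): it keeps every implementation declared in a subtype of the static receiver type, whether or not that subtype is ever instantiated.

# hand-written model of the example's class hierarchy
subclasses = {'Animal': ['Cat', 'Dog', 'Fish'],
              'Cat': [], 'Dog': [], 'Fish': [], 'Car': []}
implementations = {'saySomething': {'Cat', 'Dog', 'Fish', 'Car'}}

def cha_resolve(declared_type, method):
    # every subtype of the declared type that provides the method is a target
    worklist, targets = [declared_type], set()
    while worklist:
        cls = worklist.pop()
        if cls in implementations[method]:
            targets.add(cls + '.' + method)
        worklist.extend(subclasses.get(cls, []))
    return targets

print(cha_resolve('Animal', 'saySomething'))
# {'Cat.saySomething', 'Dog.saySomething', 'Fish.saySomething'}: Car is ruled
# out, but Dog and Fish are kept even though they are never instantiated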
Based on this simple example, we see that CHA and RTA return an imprecise call graph, while VTA, 0CFA and ZCWL return a precise call graph. VTA is by far the simplest and most efficient in our case, because we only want a rather coarse-grained call graph, not variable-level points-to information that k-CFA and ZCWL provide.
It is not that often that you learn something and get a good laugh from a Ph.D. thesis. Hence, Shivers’ Control-flow analysis of higher-order languages has been a pleasant surprise.
Shivers originated k-CFA, a class of static analyses that tackle the dynamic dispatch problem with context-sensitive data-flow analysis. Even though the analysis is targeted at functional languages, it also applies to object-oriented languages with virtual function calls (see Matt Might for a gentle introduction). From Shivers’ thesis:
So, if we wish to have a control-flow graph for a piece of Scheme code, we need to answer the following question: for every procedure call in the program, what are the possible lambda expressions that call could be a jump to? But this is a flow analysis question! So with regard to flow analysis in a [higher-order language], we are faced with the following unfortunate situation:
- In order to do flow analysis, we need a control-flow graph.
- In order to determine control-flow graphs, we need to do flow analysis.
The problem could be rephrased for Java as: “if we wish to have an [inter-procedural] control flow graph for a Java application, we need to answer the following question: for every instance method call, what classes could the object be an instance of?”
Here is an example of a remarkably simple yet powerful dataflow analysis technique. The algorithm is very generic and can be used to implement a number of forward and backward analyses such as constant propagation, reaching definitions, value-set analysis, or in my case type inference.
The algorithm, adapted from the maximal fixedpoint algorithm in the dragon book, takes a control flow graph as input and outputs IN and OUT, maps from each basic block to the abstract state at its entry and exit (an abstract state maps variables to abstract values). It is parametric: you must supply a meet function over abstract states, and the per-instruction transfer functions step_forward and step_backward that determine the output of your analysis:
def forward_transfer_function(analysis, bb, IN_bb):
    OUT_bb = IN_bb.copy()
    for insn in bb:
        analysis.step_forward(insn, OUT_bb)
    return OUT_bb

def backward_transfer_function(analysis, bb, OUT_bb):
    IN_bb = OUT_bb.copy()
    for insn in reversed(bb):
        analysis.step_backward(insn, IN_bb)
    return IN_bb

def update(env, bb, newval, todo_set, todo_candidates):
    if newval != env[bb]:
        print '{0} has changed, adding {1}'.format(bb, todo_candidates)
        env[bb] = newval
        todo_set |= todo_candidates

def maximal_fixed_point(analysis, cfg, init={}):
    # state at the entry and exit of each basic block
    IN, OUT = {}, {}
    for bb in cfg.nodes:
        IN[bb] = {}
        OUT[bb] = {}
    IN[cfg.entry_point] = init

    # first make a pass over each basic block
    todo_forward = cfg.nodes
    todo_backward = cfg.nodes

    while todo_backward or todo_forward:
        while todo_forward:
            bb = todo_forward.pop()

            ####
            # compute the environment at the entry of this BB
            new_IN = reduce(analysis.meet, map(OUT.get, cfg.pred[bb]), IN[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

            ####
            # propagate information for this basic block
            new_OUT = forward_transfer_function(analysis, bb, IN[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

        while todo_backward:
            bb = todo_backward.pop()

            ####
            # compute the environment at the exit of this BB
            new_OUT = reduce(analysis.meet, map(IN.get, cfg.succ[bb]), OUT[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

            ####
            # propagate information for this basic block (backwards)
            new_IN = backward_transfer_function(analysis, bb, OUT[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

    ####
    # IN and OUT have converged
    return IN, OUT
def meet_val(lhs, rhs):
    result = None
    if lhs == 'NAC' or rhs == 'NAC':
        result = 'NAC'
    elif lhs == 'UNDEF' or rhs == 'UNDEF':
        result = 'UNDEF'
    else:
        result = 'CONST'
    return result

def meet_env(lhs, rhs):
    lhs_keys = set(lhs.keys())
    rhs_keys = set(rhs.keys())
    result = {}
    for var in lhs_keys - rhs_keys:
        result[var] = lhs[var]
    for var in rhs_keys - lhs_keys:
        result[var] = rhs[var]
    for var in lhs_keys & rhs_keys:
        result[var] = meet_val(lhs[var], rhs[var])
    return result

def abstract_value(env, expr):
    if expr.isdigit():
        return 'CONST'
    try:
        return env[expr]
    except KeyError:
        return 'UNDEF'

def step_forward(insn, env_in):
    # the 'entry' pseudo-node is a plain string, nothing to interpret
    if type(insn) == str:
        return
    var, op, expr = insn

    # insn is var = c
    if not isinstance(expr, tuple):
        env_in[var] = abstract_value(env_in, expr)
    # insn is var = e1 + e2
    else:
        e1, op, e2 = expr
        val1 = abstract_value(env_in, e1)
        val2 = abstract_value(env_in, e2)
        env_in[var] = meet_val(val1, val2)

def step_backward(insn, env_in):
    pass
The function step_forward defines the abstract semantics of the statements or instructions of the language you want to analyze, for the analysis you want to implement. For instance, here we only record whether a variable at some program point is a constant, undefined, or not a constant (NAC). To do the actual propagation, we could also collect the allocation site of the constant.
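As a quick sanity check of these abstract semantics (assuming meet_val, abstract_value and step_forward from the snippet above are in scope), here is step_forward applied by hand to a few instruction tuples:

env = {}
step_forward(('c', '=', '2'), env)               # constant assignment
step_forward(('a', '=', ('4', '+', 'c')), env)   # constant + constant
step_forward(('r', '=', ('a', '+', 'x')), env)   # x was never defined
print(env)                                       # c: CONST, a: CONST, r: UNDEF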
Let’s consider a super simple language, where variables hold numbers that can only be assigned to or added together. The function meet_val computes the meet of two abstract values, according to this table:
        UNDEF  CONST  NAC
       -------------------
UNDEF | UNDEF  UNDEF  NAC
CONST | UNDEF  CONST  NAC
NAC   | NAC    NAC    NAC
Let’s consider a simple program in this “language”, where we don’t specify the constructs for the control flow. The algorithm just assumes that every edge in the CFG is reachable. This is obviously not always the case in practice, but it only means that the result is less precise than it could be (the analysis is sound but imprecise, in order to terminate).
import networkx as nx

class SomeObject:
    pass

def instructionify(somestr):
    toks = somestr.split()
    if '+' in somestr:
        return (toks[0], toks[1], (toks[2], toks[3], toks[4]))
    return tuple(somestr.split())

# setup the program's cfg
prog = nx.DiGraph()

s0 = 'entry'
s1 = instructionify('b = x')
s2 = instructionify('c = 2')
s3 = instructionify('a = 40 + c')
s4 = instructionify('ret = a + x')

prog.add_edge(s0, s1)
prog.add_edge(s1, s2)
prog.add_edge(s2, s1)
prog.add_edge(s1, s3)
prog.add_edge(s3, s3)
prog.add_edge(s3, s4)

# initialize pred and succ
pred, succ = {}, {}
for bb in prog:
    pred[bb] = set(prog.predecessors(bb))
    succ[bb] = set(prog.successors(bb))

cfg = SomeObject()
cfg.nodes = set(prog.nodes())
cfg.pred = pred
cfg.succ = succ
cfg.entry_point = s0

analysis = SomeObject()
analysis.meet = meet_env
analysis.step_forward = step_forward
analysis.step_backward = step_backward

# run the whole thing
IN, OUT = maximal_fixed_point(analysis, cfg)
print 'a at program point s3 is', OUT[s3]['a']
print 'ret at program point s4 is', OUT[s4]['ret']
And the output is:
a at program point s3 is CONST
ret at program point s4 is UNDEF
As a final note: it is possible to speed things up a bit by choosing a better ordering for the basic blocks rather than just going at them randomly (at first, we fail to propagate lots of information); a rough sketch of the idea is below, and it might end up in another blog post. Cheers!
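A sketch of the ordering idea (my own, assuming the prog graph and the s0 node from the example above are still around): visiting blocks in reverse postorder means most predecessors are processed before their successors on the first pass. Note that the worklists in maximal_fixed_point would have to become ordered containers (e.g. a deque) instead of sets for this to pay off.

import networkx as nx
rpo = list(nx.dfs_postorder_nodes(prog, source=s0))[::-1]
print(rpo)   # a better initial processing order for the forward worklist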
So I was recently reading TAOCP (note for later: find a way to not be smug about it), and stumbled on this gem:
1.4.3.2. Trace Routines. When a machine is being simulated on itself (…), we have the special case of a simulator called a trace or monitor routine. Such programs are occasionally used to help in debugging, since they print out a step-by-step account of how the simulated program behaves.
This is about as exciting as it gets for somebody versed in tracing and dynamic binary instrumentation, especially since this volume of TAOCP was published in 1968. The algorithm that follows looks exactly like what you would find in today’s dynamic binary rewriters (saving and restoring application registers, decoding and executing the instruction while retaining control on jumps). There is also a mention regarding what we now call transparency:
[the program being traced] must not store anything into the locations used by the trace program
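Just to make the flavour of such a trace routine concrete, here is a toy step-by-step tracer for a made-up accumulator machine (my own sketch, nothing to do with MIX or with real DBI frameworks): the simulated program never receives control directly, jumps included, and the tracer's own state is kept apart from the traced program's state.

def trace(program, pc=0, acc=0):
    # the "checking routine": fetch, print, then carry out each order ourselves
    while 0 <= pc < len(program):
        op, arg = program[pc]
        print('pc={} acc={} about to execute {} {}'.format(pc, acc, op, arg))
        if op == 'ADD':
            acc, pc = acc + arg, pc + 1
        elif op == 'JMPNZ':            # control over jumps stays with the tracer
            pc = arg if acc != 0 else pc + 1
        elif op == 'HALT':
            break
    return acc

print(trace([('ADD', 2), ('ADD', -1), ('JMPNZ', 1), ('HALT', 0)]))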
The bit that really finished me off was exercise 6, “design a trace routine capable of tracing itself” (I happen to be very sensitive when somebody mentions self-applying things). I spent a few years working on this; how come it never came up before? I looked again at the references in the few standard papers about DBI.
There is no mention of Knuth, and with one exception, the oldest references date from the 90s. The exception comes from the Detours paper, with an incredible reference to Gill, The Diagnosis of Mistakes in Programmes on the EDSAC, 1951:
The second technique will be referred to as the ‘step-by-step’ technique. In this, the control unit of the machine never obeys any of the orders of the original programme directly. The machine remains under the control of the checking routine, which is so constructed that the orders of the original programme are examined one by one, and carried out in exactly the same manner and sequence as if they were being obeyed directly. If this were all that the checking routine accomplished, it would be merely a means of carrying out machine operations in slow motion, slowed down, in fact, by a factor of the order of 10. The reason for adopting this type of operation is that it is now open to the author of the checking routine to insert additional orders into it, causing it to print useful information as it proceeds. This information may be chosen so as to assist in the investigation of either order or numerical failures.
So to answer the original question in this post: software-based tracing seems to be just as old as software itself.
I often have the feeling that technically savvy people don’t have a very high opinion of academia, and this is particularly true of security people. They have to deal with low-level details such as hardware architecture, operating systems internals (more or less documented), proprietary protocols and data structures, all of which require very specialized knowledge. Logic, theorems and algorithms don’t have a predominant place in that picture.
For the past few years I have been working on these subjects, and found some fundamental theorems to be actually *useful* in understanding the security properties of computer architectures. One of them is, of course, the undecidability of the halting problem. In essence, it says that we cannot know, in general, whether the computation of a program on some input will ever terminate. So what? Who cares if a program terminates, what we want is to find vulnerabilities and unpack malware samples, right?
The importance of the undecidability of the halting problem lies in its generality. In particular, we can see Rice’s theorem as a generalization of this result: put very simply, whatever (non-trivial) property of program behaviour you are interested in, no program can tell, for an arbitrary program, whether that property holds (i.e. it is undecidable).
This is very bad news for all of us, since basically everything about programs is undecidable. Say that you are interested in finding functions that do out-of-bounds memory writes (or, as I said, any other property): Rice’s theorem says that there is no program that will give you a correct answer all the time. You must accept that your program sometimes fails or loops forever.
I want to emphasize how bad this is. Do not let the terminology used in the theorems confuse you. In particular, the notions of input, output, and function computed by a program do not map nicely to binaries. An output is anything that gets modified by your program — any register or memory location, as soon as it is touched by an instruction, is an output. And basically, everything about outputs is undecidable. As a consequence, simple tasks such as disassembling are undecidable.
For instance, take this seemingly innocent indirect jump:
jmp [eax]
If eax is an output of the instructions before it, no luck: its value is undecidable. You can run the program, write the value down, and assume it will not change, but you have no guarantee that it will not change at some point. Undecidable. You could argue that eax can only take a finite number of values, and hence disassembling is still possible, just wildly intractable. But that would be forgetting self-modifying code. SMC (think packers) is the scourge of disassemblers because it gives the program the ability to transfer control to an output. Since I can’t decide the value of the output, I can’t disassemble.
To sum things up: the values of outputs, and with them tasks as basic as disassembling or unpacking, are undecidable in general, all direct consequences of the undecidability of the halting problem.
How all this led to the bad habit of relying on “heuristics” will be left for a future post. Stay classy!
This is spectacularly well stated:
Due to the unsolvability of the halting problem (and nearly any other question concerning program behaviour), no analysis that always terminates can be exact. Therefore we only have three alternatives:
- Consider systems with a finite number of finite behaviours (e.g. programs without loops) […]
- Ask interactively for help in case of doubt […]
- Accept approximate but correct information.
Neil D. Jones and Flemming Nielson, Abstract interpretation: a semantics-based tool for program analysis [PS]
Pym’s is an x86 disassembler packaged as a web application. It now includes a new feature that I like to call “shellcode disassembly for really lazy dudes”: just paste a C char* in the textbox and let the magic occur.
For instance, try with this shellcode (chosen completely arbitrarily), and you should get something like this:
[0x00000000] jmp near 0x0000001e
[0x00000002] pop edx
[0x00000003] mov esi edx
[0x00000005] mov eax [dx]
[0x00000007] cmp ax 0x7dca
[0x0000000b] jnz 0x00000013
[0x0000000d] add ax 0x303
[0x00000011] mov [dx] eax
[0x00000013] inc dl
[0x00000015] cmp eax 0x41414141
[0x0000001a] jnz 0x00000005
[0x0000001c] jmp esi
[0x0000001e] call 0x00000002
[0x00000023] xor edx edx
[0x00000025] push 0xb
[0x00000027] pop eax
[0x00000028] cdq
[0x00000029] push edx
[0x0000002a] push 0x68732f2f
[0x0000002f] push 0x6e69622f
[0x00000034] mov ebx esp
[0x00000036] push edx
[0x00000037] push ebx
[0x00000038] mov ecx esp
[0x0000003a] retf 0x417d
JPC is a full PC emulator (à la Bochs, but in Java), including the BIOS, VGA BIOS, floppy drive and other hardware components. Of particular interest to me is the way x86 instructions are modeled and executed. It stays at the binary level (no fancy disassembly) and actually compiles x86 instructions to a simpler microcode language representing “atomic” instructions. This microcode language is then straightforward to execute, although it is a bit more complex than similar micro-languages (such as VEX, REIL or BIL).
The core of the x86 semantics is contained in the x86-to-microcode compiler, found in org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUDecoder.decodeOpcode(). This method takes a binary x86 instruction and decodes its prefixes, opcode, modrm, sib, displacement and immediate parameters. Then it delegates the translation to microcode to this sequence of methods:
writeInputOperands(prefices, opcode, modrm, sib, displacement, immediate);
writeOperation(prefices, opcode, modrm);
writeOutputOperands(prefices, opcode, modrm, sib, displacement);
writeFlags(prefices, opcode, modrm);
For instance, if we take the binary instruction 04 42 (add al, 0x42), it is decoded with opcode = 0x04 and immediate = 0x42. Then based on these values, the instruction is translated to the following microcode sequence:
// writeInputOperands:
LOAD0_AL
LOAD1_IB 0x42
// writeOperation:
ADD
// writeOutputOperands:
STORE0_AL
// writeFlags:
ADD_O8_FLAGS
Now, understanding the semantics of an x86 instruction reduces to understanding the semantics of the microcode language. For this, we need the microcode interpreter, org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUBlock.execute(). It is a relatively simple language to execute, with 5 general-purpose registers, but it has roughly 750 opcodes. The execution of the above microcodes translates to this Java sequence:
reg0 = cpu.eax & 0xff;
reg1 = 0x42 & 0xff;
reg2 = reg0;
reg0 = reg2 + reg1;
cpu.eax = (cpu.eax & ~0xff) | (reg0 & 0xff);
cpu.setZeroFlag((byte)reg0);
cpu.setParityFlag(reg0);
cpu.setSignFlag((byte)reg0);
cpu.setCarryFlag(reg0, Processor.CY_TWIDDLE_FF);
cpu.setAuxiliaryCarryFlag(reg2, reg1, result, Processor.AC_XOR);
cpu.setOverflowFlag(reg0, reg2, reg1, Processor.OF_ADD_BYTE);
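To make the correspondence concrete, here is a toy interpreter for just this handful of microcodes (my own Python sketch to illustrate the idea, not JPC code; it only models a zero flag and ignores all other flags):

def execute(ucodes, cpu):
    reg0 = reg1 = reg2 = 0
    i = 0
    while i < len(ucodes):
        op = ucodes[i]
        if op == 'LOAD0_AL':
            reg0 = cpu['eax'] & 0xff
        elif op == 'LOAD1_IB':           # immediate byte follows in the stream
            i += 1
            reg1 = ucodes[i] & 0xff
        elif op == 'ADD':
            reg2 = reg0
            reg0 = reg2 + reg1
        elif op == 'STORE0_AL':
            cpu['eax'] = (cpu['eax'] & ~0xff) | (reg0 & 0xff)
        elif op == 'ADD_O8_FLAGS':       # toy version: only the zero flag
            cpu['zf'] = (reg0 & 0xff) == 0
        i += 1

cpu = {'eax': 0x13, 'zf': False}
execute(['LOAD0_AL', 'LOAD1_IB', 0x42, 'ADD', 'STORE0_AL', 'ADD_O8_FLAGS'], cpu)
print(hex(cpu['eax']))   # 0x55, i.e. 0x13 + 0x42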