Generic Iterative Dataflow Analysis in Python

Here is an example of a remarkably simple yet powerful dataflow analysis technique. The algorithm is very generic and can be used to implement a number of forward and backward analyses such as constant propagation, reaching definitions, value-set analysis, or in my case type inference.

The algorithm, adapted from the maximal fixedpoint algorithm in the dragon book, takes a control flow graph as input and outputs IN and OUT (maps from basic block to abstract state at their entry and exit (an abstract state maps variables to abstract values)). It is parametric, you must supply it with a few functions that will determine the output of your analysis:

  •, takes two abstract states and returns an abstract state (see lattices). To guarantee that the algorithm terminates, this function should be monotone (and your lattice of abstract values of finite height).
  • analysis.step_forward (resp. analysis.step_backward), a function that takes an instruction and an abstract state at its entry (resp. exit) and “executes” it, transforming the abstract state. They are used to automatically compute the transfer function for each basic block in the cfg.
It looks like this:
def forward_transfer_function(analysis, bb, IN_bb):
    OUT_bb = IN_bb.copy()
    for insn in bb:
        analysis.step_forward(insn, OUT_bb)
    return OUT_bb

def backward_transfer_function(analysis, bb, OUT_bb):
    IN_bb = OUT_bb.copy()
    for insn in reversed(bb):
        analysis.step_backward(insn, IN_bb)
    return IN_bb

def update(env, bb, newval, todo_set, todo_candidates):
    if newval != env[bb]:
        print '{0} has changed, adding {1}'.format(bb, todo_candidates)
        env[bb] = newval
        todo_set |= todo_candidates

def maximal_fixed_point(analysis, cfg, init={}):
    # state at the entry and exit of each basic block
    IN, OUT = {}, {}
    for bb in cfg.nodes:
        IN[bb] = {}
        OUT[bb] = {}
    IN[cfg.entry_point] = init

    # first make a pass over each basic block
    todo_forward = cfg.nodes
    todo_backward = cfg.nodes

    while todo_backward or todo_forward:
        while todo_forward:
            bb = todo_forward.pop()

            # compute the environment at the entry of this BB
            new_IN = reduce(, map(OUT.get, cfg.pred[bb]), IN[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

            # propagate information for this basic block
            new_OUT = forward_transfer_function(analysis, bb, IN[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

        while todo_backward:
            bb = todo_backward.pop()

            # compute the environment at the exit of this BB
            new_OUT = reduce(, map(IN.get, succ[bb]), OUT[bb])
            update(OUT, bb, new_OUT, todo_forward, cfg.succ[bb])

            # propagate information for this basic block (backwards)
            new_IN = backward_transfer_function(analysis, bb, OUT[bb])
            update(IN, bb, new_IN, todo_backward, cfg.pred[bb])

    # IN and OUT have converged
    return IN, OUT
Ideally, to propagate dataflow information in one pass, you would like to have visited every predecessors of a basic block B for a forward pass before analyzing B. Unfortunately, due to irreducible flow graphs you are not guaranteed to be able to do this. Instead, this algorithm
  1. starts with an empty state at some arbitrary basic block
  2. makes a forward pass and a backward pass over each basic block, adding the successors/predecessors to a worklist when changes are detected
  3. continues until the worklist is empty.
The meet function is here to “combine” information from multiple paths, for instance if B2 is reachable from B0 and B1, then IN(B2) = meet(OUT(B1), OUT(B2)). If you wanted to collect value set information and you had:
  • OUT(B0) = [a->{1}]
  • OUT(B1) = [a-> {-1}]
  • then meet could output IN(B2) = [a -> {1, -1}]
Depending on how meet is defined, it can look for information true for all paths coming to a basic block, or for information from at least one path.
Now, some sample code to implement a simple constant propagation analysis. It is forward only for simplicity, but the algorithm works for bidirectional analyses such as type inference.

def meet_val(lhs, rhs):
    result = None

    if lhs == 'NAC' or rhs == 'NAC':
        result = 'NAC'

    elif lhs == 'UNDEF' or rhs == 'UNDEF':
        result = 'UNDEF'

        result = 'CONST'

    return result

def meet_env(lhs, rhs):
    lhs_keys = set(lhs.keys())
    rhs_keys = set(rhs.keys())
    result = {}

    for var in lhs_keys - rhs_keys:
        result[var] = lhs[var]

    for var in rhs_keys - lhs_keys:
        result[var] = rhs[var]

    for var in lhs_keys & rhs_keys:
        result[var] = meet_val(lhs[var], rhs[var])

    return result

def abstract_value(env, expr):
    if expr.isdigit():
        return 'CONST'

        return env[expr]
    except KeyError:
        return 'UNDEF'

def step_forward(insn, env_in):
    if type(insn) == str:

    var, op, expr = insn

    # insn is var = c
    if len(expr) == 1:
        env_in[var] = abstract_value(env_in, expr)

        e1, op, e2 = expr
        val1 = abstract_value(env_in, e1)
        val2 = abstract_value(env_in, e2)
        env_in[var] = meet_val(val1, val2)

def step_backward(insn, env_in):

The function step_forward defines the abstract semantics for the statements or instructions of the language you want to analyze and for the analysis you want to implement. For instance here we only collect if a variable at some program point is constant, undefined, or not a constant (NAC). To do the actual propagation, we could also collect the allocation site of the constant.

the example program

Let’s consider a super simple language, where variables are numbers that can only be affected to or added together. The function meet_val computes the meet for two abstract values, according to this table:

NAC   | NAC    NAC    NAC

Let’s consider a simple program in this “language” where we don’t specify the constructs for the control flow. The algorithm just assumes that every edge in the CFG is reachable. This is obviously not the case in practice, but that only means that we are going to miss some patterns (the analysis is sound but imprecise in order to terminate).

Now, we want to find if a and ret are constants. Here is the code necessary to setup and run the example (you need networkx to run it):

import networkx as nx    

class SomeObject:

def instructionify(somestr):
    toks = somestr.split()
    if '+' in somestr:
        return (toks[0], toks[1], (toks[2], toks[3], toks[4]))
    return tuple(somestr.split())

# setup the program's cfg
prog = nx.DiGraph()
s0 = ('entry'),
s1 = instructionify('b = x'),
s2 = instructionify('c = 2'),
s3 = instructionify('a = 40 + c'),
s4 = instructionify('ret = a + x'),
prog.add_edge(s0, s1)
prog.add_edge(s1, s2)
prog.add_edge(s2, s1)
prog.add_edge(s1, s3)
prog.add_edge(s3, s3)
prog.add_edge(s3, s4)

# initialize pred and succ
pred, succ = {}, {}
for bb in prog:
    pred[bb] = set(prog.predecessors(bb))
    succ[bb] = set(prog.successors(bb))

cfg             = SomeObject()
cfg.nodes       = set(prog.nodes())
cfg.pred        = pred
cfg.succ        = succ
cfg.entry_point = s0

analysis               = SomeObject()          = meet_env
analysis.step_forward  = step_forward
analysis.step_backward = step_backward

# run the whole thing
IN, OUT = maximal_fixed_point(analysis, cfg)
print 'a   at program point s3 is', OUT[s3]['a']
print 'ret at program point s4 is', OUT[s4]['ret']

And the output is:

a   at program point s3 is CONST
ret at program point s4 is UNDEF

As a final note: it is possible to speed things up a bit by choosing a better ordering for basic blocks than just going randomly at first (because we initially fail to propagate lots of information). This might end up in another blog post. Cheers!

How old is Dynamic Binary Instrumentation?

So I was recently reading TAOCP (note for later: find a way to not be smug about it), and stumbled on this gem: Trace Routines. When a machine is being simulated on itself (…), we have the special case of a simulator called a trace or monitor routine. Such programs are occasionally used  to help in debugging, since they print out a step-by-step account of how the simulated program behaves.

This is about as exciting as it gets for somebody versed in tracing and dynamic binary instrumentation, especially since this volume of TAOCP was published in 1968. The algorithm that follows looks exactly like what you would find in today’s dynamic binary rewriters (saving and restoring application registers, decoding and executing the instruction while retaining control on jumps). There is also a mention regarding what we now call transparency:

[the program being traced] must not store anything into the locations used by the trace program

The bit that achieved to kill me was exercise 6, “design a trace routine capable of tracing itself” (I happen to be very sensitive when somebody mentions self-applying things). I spent a few years working on this, how come it never came up before? I looked again at references in the few standard papers about DBI, namely:

  • Luk et al., Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, 2005
  • Bruening, Efficient, Transparent, and Comprehensive Runtime Code Manipulation, 2004
  • Bala et al., Dynamo: a transparent dynamic optimization system, 2000
  • Nethercote and Seward, Valgrind: a framework for heavyweight dynamic binary instrumentation, 2007
  • Hunt and Brubacher, Detours: Binary Interception of Win32 Functions, 1999

There is no mention of Knuth, and with one exception, the oldest references date from the 90s. The exception comes from the Detours paper, with an incredible reference to Gill, The Diagnosis of Mistakes in Programmes on the EDSAC, 1951:

The second technique will be referred to as the ‘step-by-step’ technique.  In this, the  control unit  of  the  machine never  obeys  any  of  the  orders of  the  original programme directly.  The  machine remains under  the  control  of  the  checking routine, which is  so  constructed that  the  orders of  the  original programme are examined one by one, and carried out in exactly the same manner and sequence as if  they  were being  obeyed  directly.  If  this  were all  that  the  checking routine accomplished, it  would be merely a means of  carrying out machine operations in slow motion-slowed  down, in fact, by a factor of the order of  10. The reason for adopting this type of operation is that it is now open to the author of the checking routine to insert additional orders into it,  causing it  to print useful information as it proceeds. This information may be chosen so as to  assist in the investigation of either order or numerical failures.

So to answer the original question in this post: software-based tracing seems to be just as old as software itself.

Do you really have to throw up to be a researcher?

Through me you pass into the city of woe:
Through me you pass into eternal pain:
Through me among the people lost for aye.

— Dante Alighieri

I recently listened to a talk about somebody’s experience as a computer science researcher. The argument was that research is ultra-competitive, because you fight against people that are smarter than you and work harder than you (think nights and weekends) for very few available positions. Therefore, you have to consider yourself as a professional athlete, i.e. choosing a lifestyle that serves your only purpose in life: succeeding. Also, professional athletes throw up when they train really hard but then they train even harder. So much for elegance.

Needless to say, this vision of life is disgusting from every point of view. Ok, in academic research the hours are long and the pay is low, but what about the joys one could find in science? There was nothing about the beauty of a proof, the elegance of a piece of code or the fun of technology. Apparently there is nothing more to life than getting a better position, and to me this is about as un-inspirational as it gets.

Obfuscation Patterns in Academic Papers

These techniques seem to work, since I stumble quite frequently upon them in papers accepted at important venues:

  • misleading claim — come up with your own definition of a widely accepted term, preferably in a way inconsistent with the intuition and the general understanding of the term. Then claim spectacular results that happen to be true only with respect to your definition.
  • marketing pitch — use data irrelevant to the scientific discussion. Bonus: cite a dubious, hard-to-verify source with a non-existent methodology. Ex: “Problem X costs more than Y gazillion $$$ per year (hence our work is of epic importance)”.
  • over-sophistication —  use overly sophisticated terms for techniques or ideas that would otherwise be simple to understand (especially if it makes them look better). Ex: “heuristics” [1] sounds a lot better than “hack”.
  • no limitations — there are no limitations to your work, only things that are either out-of-scope or left for future work.
  • outsourcing problems — outsource large portions of the hard problems in your area by “leveraging” (see over-sophistication) existing work from a different domain with a complex literature, just to make sure that nobody can grasp all the dependencies of your paper.
  • biased evaluation — perform experimental evaluation designed to make others look bad. Even better if the evaluation is not reproducible, because you fail to disclose your source code, detailed algorithm or input set.

[1] heuristics (noun — hyu’-ris-tik): algorithm that does not work.