Following Flow Dependences with Metasm

Metasm is an LGPL’ed assembly manipulation framework written in Ruby. It is capable of following flow dependences along multiple paths, a feature called backtracking in metasm jargon and close to the notion of program slicing [].

Definition: a node j is flow-dependent on a node i if (1) a variable x is referenced at j, (2) x is defined at i and (3) there exists a path from i to j without intervening definitions of x. Source []

We are going to use the following ruby script to test the backtracker with different source programs, it takes as input a program (sourcecode), a node (labeled ‘sliceme’) and a variable (eax) and returns the flow-dependences of eax and in some cases, the actual values that eax will take.

require 'metasm'
include Metasm

def sliceit(asmsource, label, reg)
    # encode the shellcode
    sc = Shellcode.assemble(, asmsource)
    dasm = sc.disassemble(0)

    # get the address of the label
    offset = sc.encoded.export[label]

    reg = reg.to_sym

    # backtrace
    log = []
    dasm.backtrace(reg, offset, :log => log)

    # return the trace

sourcecode = <<EOS
; asm program here

slice = sliceit(sourcecode, 'sliceme', 'eax')
# slice est un tableau qui contient le log du backtrace
slice.each { |ev, *args|
    if ev == :di    # decodedinstruction
        after = args[0]
        before = args[1]
        instr = args[2]
        puts "#{instr} [#{before} -> #{after}]"

Let’s take some simple code samples. If there is a single execution path, metasm can compute the value contained in the variable, for instance:single_path

mov eax, 42h
inc eax
sliceme: jmp eax

-> metasm returns:
5 inc eax [eax -> eax+1]
0 mov eax, 42h [eax+1 -> 43h]

We can see that useless definitions of the variable are not taken into account:

mov eax, 0h
mov eax, 42h
inc eax
sliceme: jmp eax

-> metasm returns:
0ah inc eax [eax -> eax+1]
5 mov eax, 42h [eax+1 -> 43h]

Definitions of other variables are also ignored (if eax does not depend on them):

mov eax, 42h
mov ebx, 0
sliceme: jmp eax

-> metasm returns:
0 mov eax, 42h [eax -> 42h]

But they are included if eax depends on them:

mov ebx, 1h
mov eax, 42h
add eax, ebx
sliceme: jmp eax

-> metasm returns:
0ah add eax, ebx [eax -> eax+ebx]
5 mov eax, 42h [eax+ebx -> ebx+42h]
0 mov ebx, 1 [ebx+42h -> 43h]

If there are multiple acyclic paths, metasm will be able to compute the value of the sliced variable along each path, for instance:multiple

mov eax, 42h
cmp ebx, 0
jnz pouet
dec eax
jmp sliceme
pouet: inc eax
sliceme: jmp eax

-> metasm returns:
0dh inc eax [eax -> eax+1]
0 mov eax, 42h [eax+1 -> 43h]
0ah dec eax [eax -> eax-1]
0 mov eax, 42h [eax-1 -> 41h]

Metasm states that at the end, eax will be either 41h or 43h but it doesn’t know which one. It doesn’t state it explicitly, but the value of eax is control-dependent on ebx.

Finally in the worst case scenario, if there are cyclic paths (such as loops), metasm will go through each path once but will be unable to compute the value after the cycle. Note that in the general case, this is undecidable.cycle

mov eax, 0
entry: cmp ebx, 0
jnz sliceme

dec ebx
inc eax
jmp entry

sliceme: jmp eax

-> metasm returns:
0bh inc eax [eax -> eax+1]
0 mov eax, entrypoint_0 [eax -> 0]

A Quick Survey on Intermediate Representations for Program Analysis

This is mostly a note to myself, but I guess people interested in automating reverse engineering will be interested at some point in IR suitable for low-level abstractions. I consider top-down IR used by optimizing compilers and bottom-up IR used by decompilers and other reversing tools.

Intermediate Representations for Reverse Engineering

REIL. Used in BinNavi, the Reverse Engineering Intermediate Language defines a very simple RISC architecture (17 instructions), with the nice property that each instruction has at most one side-effect. Thomas Dullien and Sebastian Porst recently presented at CanSecWest an abstract interpretation framework for REIL (paper, slides). It is clearly possible to easily write analyses and transformation passes for REIL without getting into the complexity of the whole x86 architecture, given x86 -> REIL and REIL -> x86 translators.

Here are some sample REIL instructions :

1006E4B00: str edi, , edi
1006E4D00: sub esp, 4, esp
1006E4D01: and esp, 4294967295, esp
1006E4D02: stm ebp, , esp

Language Reference

Hex Rays Microcode. Presented at Black Hat USA 2008 by Ilfak Guilfanov (paper, slides), it is an IR used during decompilation. From the paper: “The microcode language is very detailed and precisely represents how each instruction modifies the memory, registers, and processor condition codes. Typically one CPU instruction is converted into 5-15 microinstructions”. According to the REIL paper, REIL and the microcode language are significantly different, for instance the microinstructions can have a variable number of operands and perform multiple side effects.

Sample microcode:

mov esi.4, eoff.4
mov ds.2, seg.2
add eoff.4, #4.4, eoff.4
ldx seg.2, eoff.4, et1.4
mov et1.4, eax.4

I couldn’t find the language reference.

ELIR. Part of the ERESI project, the goal of ELIR is to simplify static analysis by providing a platform independent abstraction. An overview was presented at Ekoparty08 (slides) and some ideas appeared in Phrack 64, but 30s of Googling didn’t get me to the language reference or a code sample, so that’s all I will say about ELIR for the moment.

Pin Inspection API. PIN, Intel’s Dynamic Binary Instrumentation framework provides a very handy instruction inspection API. This is not an IR but provides the same type of information about complex instructions without having to make giant switch statements. For instance, this is the way to log memory writes with PIN given an instruction:

VOID RecordMemWrite(VOID * addr, UINT32 size) {
    fprintf(trace,",%dW%p", size, addr);

// this function is called each time an instruction is encountered
VOID Instruction(INS ins, VOID *v) {
    // isn't that a nice API ?
    if (WRITES && INS_IsMemoryWrite(ins)) {
            ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,

API Documentation

Valgrind IR. On my todo list.

FermaT Transformation System. I’ll have to write something about it someday. Oh lucky you, a wikipedia entry and a bunch of papers!

Optimizing Compilers Intermediate Representations

LLVM Bitcode. This language uses low-level RISC-like instructions in SSA form with type information. It is clean and well defined, and is a very suitable target for platform-independent analysis and optimization. It is designed to convey high-level information in lower level operations, so converting machine code to LLVM bitcode probably requires some intensive work.

Here is the hello world example :

; Declare the string constant as a global constant...
@.LC0 = internal constant [13 x i8] c"hello worldA0"          

; External declaration of the puts function
declare i32 @puts(i8 *)                                           

; Definition of main function
define i32 @main() {
        ; Convert [13 x i8]* to i8  *...
        %cast210 = getelementptr [13 x i8]* @.LC0, i64 0, i64 0 ; i8 *

        ; Call puts function to write out the string to stdout...
        call i32 @puts(i8 * %cast210)
        ret i32 0

Language reference

Register Transfer Language. One of the IR used in GCC, it is an architecture-neutral assembly language that represents instructions in a LISP-like form (d’oh), like this:

(insn 2 49 3 test.c:3 (set (mem/c/i:SI (plus:DI (reg/f:DI 6 bp)
                (const_int -20 [0xffffffffffffffec])) [0 argc+0 S4 A32])
        (reg:SI 5 di [ argc ])) 47 {*movsi_1} (nil))

It feels a bit old-fashioned and less clean than LLVM bitcode, but this is just a gut feeling. Use gcc -fdump-rtl-all to see what it looks like.

Side note: the idea of dumping RTL to a file, performing transformations on it and giving this back to GCC is quite common, but RMS qualifies it as “not feasible”, even though the creator of RTL says it is not only feasible but quite useful actually.