On the unsolvability of program behaviour

This is spectacularly well stated:

Due to the unsolvability of the halting problem (and nearly any other question concerning program behaviour), no analysis that always terminates can be exact. Therefore we only have three alternatives:

  • Consider systems with a finite number of finite behaviours (e.g. programs without loops) […]
  • Ask interactively for help in case of doubt […]
  • Accept approximate but correct information.

Neil D. Jones and Flemming Nielson, Abstract interpretation: a semantics-based tool for program analysis [PS]

Better shellcode disassembly in Pym’s

Pym’s is an x86 disassembler packaged as a web application. It now includes a new feature, that I like to call “shellcode disassembly for really lazy dudes”. Just paste a C char* in the textbox and let the magic occur.

For instance, try with this shellcode (chosen completely arbitrarily), and you should get something like this:

[0x00000000] 	jmp near 0x0000001e
[0x00000002] 	pop edx
[0x00000003] 	mov esi edx
[0x00000005] 	mov eax [dx]
[0x00000007] 	cmp ax 0x7dca
[0x0000000b] 	jnz 0x00000013
[0x0000000d] 	add ax 0x303
[0x00000011] 	mov [dx] eax
[0x00000013] 	inc dl
[0x00000015] 	cmp eax 0x41414141
[0x0000001a] 	jnz 0x00000005
[0x0000001c] 	jmp esi
[0x0000001e] 	call 0x00000002
[0x00000023] 	xor edx edx
[0x00000025] 	push 0xb
[0x00000027] 	pop eax
[0x00000028] 	cdq
[0x00000029] 	push edx
[0x0000002a] 	push 0x68732f2f
[0x0000002f] 	push 0x6e69622f
[0x00000034] 	mov ebx esp
[0x00000036] 	push edx
[0x00000037] 	push ebx
[0x00000038] 	mov ecx esp
[0x0000003a] 	retf 0x417d

A note on the x86 semantics modeling in JPC

JPC is a full PC emulator (à la Bochs but in Java), including the BIOS, VGA BIOS, floppy drive and other hardware components. Of particular interest to me, is the way x86 instructions are modeled and executed. It works and stays at the binary level (no fancy disassembly), and actually compiles x86 instructions to a simpler microcode language representing “atomic” instructions. This microcode language is then straightforward to execute, although a bit more complex than similar micro-languages (such as VEX, REIL or BIL).

The core of the x86 semantics is contained in the x86 to microcode compiler, found in org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUDecoder.decodeOpcode(). This method takes a binary x86 instruction and decodes its prefices, opcode, modrm, sib, displacement and immediate parameters. Then it delegates the translation of the microcode to this sequence of methods:


writeInputOperands(prefices, opcode, modrm, sib, displacement, immediate);
writeOperation(prefices, opcode, modrm);
writeOutputOperands(prefices, opcode, modrm, sib, displacement);
writeFlags(prefices, opcode, modrm);
<span style="color: #000000;">

For instance, if we take the binary instruction 04 42 (add al, 0x42), it is decoded with opcode = 0x04 and immediate = 0x42. Then based on these values, the instruction is translated to the following microcode sequence:


// writeInputOperands:
LOAD0_AL
LOAD1_IB 0x42
// writeOperation:
ADD
// writeOutputOperands:
STORE0_AL
// writeFlags:
ADD_O8_FLAGS<span style="color: #ffffff;">
</span>

Now, understanding the semantics of an x86 instruction reduces to understanding the semantics of the microcode language. For this, we need the microcode interpreter, which is org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUBlock.execute(). It is a relatively simple execution language (execution-wise), with 5 general-purpose registers but with roughly 750 opcodes. The execution of the above microcodes translates to this Java sequence:


reg0 = cpu.eax & 0xff;
reg1 = 0x42 & 0xff;
reg2 = reg0; reg0 = reg2 + reg1;
cpu.eax = (cpu.eax & ~0xff) | (reg0 & 0xff);
cpu.setZeroFlag((byte)reg0);
cpu.setParityFlag(reg0);
cpu.setSignFlag((byte)reg0);
cpu.setCarryFlag(reg0, Processor.CY_TWIDDLE_FF);
cpu.setAuxiliaryCarryFlag(reg2, reg1, result, Processor.AC_XOR);
cpu.setOverflowFlag(reg0, reg2, reg1, Processor.OF_ADD_BYTE);

A super quick and super dirty shellcode emulator

Using JPC, it gives something like this:


import java.io.IOException;
import org.jpc.emulator.PC;
import org.jpc.emulator.processor.Processor;
import org.jpc.j2se.VirtualClock;

public class EmulateShellcode {
    public static void main(String[] args) throws IOException {
	VirtualClock clock = new VirtualClock();
	PC pc = new PC(clock, args);
	Processor cpu = pc.getProcessor();
	cpu.cs.setSelector(0);
	cpu.eip = 0;

	if(args.length == 0) {
	    System.out.println("Usage: java EmulateShellcode ");
	    System.out.println("*** examples:");
	    System.out.println("* add al, 0x42:\n\tjava EmulateShellcode 04 42");
	    System.out.println("* nop:\n\tjava EmulateShellcode 0x90");
	    System.out.println("* inc edx:\n\tjava EmulateShellcode 42h");
	    System.exit(1);
	}

	System.out.println("Loading shellcode in memory...");
	for(int i=0; i<args.length; i++) {
	    String str = args[i];
	    if(args[i].startsWith("0x"))
		str = args[i].substring(2);
	    else if(args[i].startsWith("\\x"))
		str = args[i].substring(2);
	    else if(args[i].endsWith("h"))
		str = args[i].substring(0, args[i].length()-1);

            <strong>// the byte at address i is the argument at position i cpu.linearMemory.setByte(i, (byte)Integer.parseInt(str, 16));</strong>
	}

	System.out.println("Before execution:");
	cpu.printState();
        <strong>// execute starting at address 0 cpu.linearMemory.executeProtected(cpu, 0);</strong>
	System.out.println("\nAfter execution:");
	cpu.printState();
    }
}

Creating a toy virtual machine with PyPy

Here, you can use “virtual machine” as in “Java Virtual Machine”, not as in virtualization. We will play with the virtual machine described in the paper Tracing the Meta-Level: PyPy’s Tracing JIT Compiler by C.F. Bolz, A. Cuni, M. Fijalkowski and A. Rigo (it’s a great read by the way).

PyPy is a fascinating project, too complex to describe here. Among other things, PyPy can take any interpreter written in a subset of Python, translate it to C, and automatically generate a JIT compiler for this language. Does it sound too good to be true? Let’s try this.

  • grab PyPy source code
  • create the interpreter in pypy/translator/goal/target-toy.py with the following code:
</pre>
<pre style="text-align: justify;">import os, sys
import autopath
import py

# these are the opcodes for the interpreted language
JUMP_IF_A  = 1
MOV_A_R    = 2
MOV_R_A    = 3
ADD_R_TO_A = 4
DECR_A     = 5
RETURN_A   = 6

<em>from pypy.rlib.jit import JitDriver tlrjitdriver = JitDriver(greens = ['pc', 'bytecode'], reds = ['a', 'regs'])</em>

# the main interpreter loop
def interpret(bytecode, a):
   regs = [0] * 256
   pc = 0
   while True:
<em> tlrjitdriver.jit_merge_point(bytecode=bytecode, pc=pc, a=a, regs=regs) </em>       opcode = bytecode[pc]
       pc += 1
       if opcode == JUMP_IF_A:
           target = bytecode[pc]
           pc += 1
           if a:
<em> if target<pc: tlrjitdriver.can_enter_jit(bytecode=bytecode, pc=target, a=a, regs=regs) </em>               pc = target
       elif opcode == MOV_A_R:
           n = bytecode[pc]
           pc += 1
           regs[n] = a
       elif opcode == MOV_R_A:
           n = bytecode[pc]
           pc += 1
           a = regs[n]
       elif opcode == ADD_R_TO_A:
           n = bytecode[pc]
           pc += 1
           a += regs[n]
       elif opcode == DECR_A:
           a -= 1
       elif opcode == RETURN_A:
           return a

# __________  Entry point  __________
def entry_point(argv):
    # the program we want to interpret
    # it computes the square of its argument
    bytecode = [
        MOV_A_R,    0, # i = a
        MOV_A_R,    1, # copy of ’a’
        # 4:
        MOV_R_A,    0, # i--
        DECR_A,
        MOV_A_R,    0,
        MOV_R_A,    2, # res += a
        ADD_R_TO_A, 1,
        MOV_A_R,    2,
        MOV_R_A,    0, # if i!=0: goto 4
        JUMP_IF_A,  4,
        MOV_R_A,    2,
        RETURN_A
    ]
    result = interpret(bytecode, int(argv[1]))
    print result
    return 0

def jitpolicy(driver):
    from pypy.jit.metainterp.policy import JitPolicy
    return JitPolicy()

# _____ Define and setup target ___
def target(*args):
    return entry_point, None

# main function, if this script is called from the command line
if __name__ == '__main__':
    entry_point(sys.argv)</pre>
<div><span style="color: #000000;">

  • the lines in italic are the annotations for the JIT compiler. We need to give PyPy some insight on the interpreted language by declaring what is the instruction pointer (the green variables), the beginning of the dispatch loop and the backward branches (see the paper for full details).
  • check that you can execute this script correctly by running python target-toy.py 12, the output should be 144
  • PyPy can translate this script in C. For this, first install the dependencies and then run the following command: python translate.py target-toy.py
  • this should give you an executable target-toy-c, rename it target-toy-native and check that ./target-toy-native 12 yields 144
  • now we can ask PyPy to translate target-toy.py in C and generate a JIT compiler for it. For this, we just run python translate.py --opt=jit target-toy.py
    • note: the 64-bit backend of PyPy is not implemented yet, so if you are on a 64-bit system, you will have to struggle a bit. You will have to use a 32-bit Python interpreter (see my former post), create an alias for gcc -m32 (let’s call it gcc32) and then pass the option –cc=gcc32 to translate.py.
  • this should give you another target-toy-c executable, rename it to target-toy-jit and check that ./target-toy-jit 12 yields 144

Ok, everything is working, so let’s now see how all this performs by computing large squares:

~/pypy-trunk/pypy/translator/goal$ time python target-toy.py 1000000
1000000000000
real 0m18.637s

~/pypy-trunk/pypy/translator/goal$ time ./target-toy-native 1000000
-727379968
real 0m0.024s

~/pypy-trunk/pypy/translator/goal$ time ./target-toy-jit 1000000
-727379968
[...]
real 0m0.005s

The first run is the square program interpreted by our program, itself interpreted by the Python interpreter. Double interpretation is slow.

The second run is the square program interpreted by a native version of our interpret function. Interpretation by native code is ok.

The third run is the square program interpreted and JIT’ed on the fly. It’s super awesome :)

Final note: I must thank everybody from #pypy on freenode, for their help and resilience to stupid questions. Thanks guys!