Newsflash: clever tricks have never been clever
“DOS applications were rarely well behaved in terms of using existing system resources since they typically ran one at time. Developers often used tricks or even existing bugs in the OS or processors in clever ways to get more memory or prevent running of unlicensed copies. Windows was intended to enable multiple applications to run at the same time, so it was difficult to support running existing applications while isolating them so their clever tricks would not corrupt other applications.”
– Tandy Trower [technologizer.com]
And that was 25 years ago. Ouch.
US Domination over Security Conferences
(UPDATE: now with NDSS data)
Looking at the program committees of a few top-ranking security conferences, you have to stop and think “wow, there’s a lot of American people in there” (I am referring to people working in the US, not necessarily their nationality). Just to name a few:
- ACM CCS 2010: 39 out of 54 PC members in the US (72.2%)
- Internet Society NDSS 2010: 31 out of 37 (83.8%)
- IEEE S&P 2010: 42 out of 49 (85.7%)
- Usenix Security 2010: 26 out of 28 (92.9%)
And an approximation of the similarity of program committees (only based on last names):
ndss & s_p : ['Feamster', 'Traynor', 'Shacham', 'Ateniese', 'Kirda',
'Smith', 'Stavrou', 'Barth', 'Backes', 'Kruegel', 'Borisov', 'Kohno',
'Brumley', 'Sion', 'Locasto', 'Reiter']
similarity: 33.33%
ndss & usenix : ['Traynor', 'Shacham', 'Provos', 'Barth', 'Danezis',
'Borisov', 'Ganapathy', 'Garfinkel', 'Locasto']
similarity: 24.32%
ndss & ccs : ['Christodorescu', 'Kirda', 'Stavrou', 'Barth',
'Danezis', 'Borisov', 'Kohno', 'Ganapathy', 'Locasto', 'Cui']
similarity: 18.52%
s_p & usenix : ['Jha', 'Traynor', 'Giffin', 'Halderman', 'Shacham',
'Barth', 'Borisov', 'Locasto']
similarity: 16.67%
s_p & ccs : ['Halderman', 'Molnar', 'Waters', 'XWang', 'Stavrou',
'Barth', 'Gligor', 'Borisov', 'Kohno', 'Locasto', 'Yu', 'Kirda']
similarity: 22.22%
usenix & ccs : ['Sekar', 'Halderman', 'Gribble', 'Sailer', 'Barth',
'Danezis', 'Borisov', 'Ganapathy', 'Locasto', 'HWang']
similarity: 18.52%
.
I’m not sure how to interpret these numbers, they can mean lots of different things:
- they might reflect the proportion of US/RestOfWorld academics working on computer security
- or they might reflect the proportion of US/RestOfWorld academics working on software engineering (and therefore there is nothing specific to computer security)
- causation 1: security conferences are top ranking when they have mostly US PC members
- causation 2: top ranking conferences attract US PC members
- what else?
Better shellcode disassembly in Pym’s
Pym’s is an x86 disassembler packaged as a web application. It now includes a new feature, that I like to call “shellcode disassembly for really lazy dudes”. Just paste a C char* in the textbox and let the magic occur.
For instance, try with this shellcode (chosen completely arbitrarily), and you should get something like this:
[0x00000000] jmp near 0x0000001e [0x00000002] pop edx [0x00000003] mov esi edx [0x00000005] mov eax [dx] [0x00000007] cmp ax 0x7dca [0x0000000b] jnz 0x00000013 [0x0000000d] add ax 0x303 [0x00000011] mov [dx] eax [0x00000013] inc dl [0x00000015] cmp eax 0x41414141 [0x0000001a] jnz 0x00000005 [0x0000001c] jmp esi [0x0000001e] call 0x00000002 [0x00000023] xor edx edx [0x00000025] push 0xb [0x00000027] pop eax [0x00000028] cdq [0x00000029] push edx [0x0000002a] push 0x68732f2f [0x0000002f] push 0x6e69622f [0x00000034] mov ebx esp [0x00000036] push edx [0x00000037] push ebx [0x00000038] mov ecx esp [0x0000003a] retf 0x417d
A note on the x86 semantics modeling in JPC
JPC is a full PC emulator (à la Bochs but in Java), including the BIOS, VGA BIOS, floppy drive and other hardware components. Of particular interest to me, is the way x86 instructions are modeled and executed. It works and stays at the binary level (no fancy disassembly), and actually compiles x86 instructions to a simpler microcode language representing “atomic” instructions. This microcode language is then straightforward to execute, although a bit more complex than similar micro-languages (such as VEX, REIL or BIL).
The core of the x86 semantics is contained in the x86 to microcode compiler, found in org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUDecoder.decodeOpcode(). This method takes a binary x86 instruction and decodes its prefices, opcode, modrm, sib, displacement and immediate parameters. Then it delegates the translation of the microcode to this sequence of methods:
writeInputOperands(prefices, opcode, modrm, sib, displacement, immediate); writeOperation(prefices, opcode, modrm); writeOutputOperands(prefices, opcode, modrm, sib, displacement); writeFlags(prefices, opcode, modrm); <span style="color: #000000;">
For instance, if we take the binary instruction 04 42 (add al, 0x42), it is decoded with opcode = 0x04 and immediate = 0x42. Then based on these values, the instruction is translated to the following microcode sequence:
// writeInputOperands: LOAD0_AL LOAD1_IB 0x42 // writeOperation: ADD // writeOutputOperands: STORE0_AL // writeFlags: ADD_O8_FLAGS<span style="color: #ffffff;"> </span>
Now, understanding the semantics of an x86 instruction reduces to understanding the semantics of the microcode language. For this, we need the microcode interpreter, which is org.jpc.emulator.memory.codeblock.optimised.ProtectedModeUBlock.execute(). It is a relatively simple execution language (execution-wise), with 5 general-purpose registers but with roughly 750 opcodes. The execution of the above microcodes translates to this Java sequence:
reg0 = cpu.eax & 0xff; reg1 = 0x42 & 0xff; reg2 = reg0; reg0 = reg2 + reg1; cpu.eax = (cpu.eax & ~0xff) | (reg0 & 0xff); cpu.setZeroFlag((byte)reg0); cpu.setParityFlag(reg0); cpu.setSignFlag((byte)reg0); cpu.setCarryFlag(reg0, Processor.CY_TWIDDLE_FF); cpu.setAuxiliaryCarryFlag(reg2, reg1, result, Processor.AC_XOR); cpu.setOverflowFlag(reg0, reg2, reg1, Processor.OF_ADD_BYTE);
A super quick and super dirty shellcode emulator
Using JPC, it gives something like this:
import java.io.IOException;
import org.jpc.emulator.PC;
import org.jpc.emulator.processor.Processor;
import org.jpc.j2se.VirtualClock;
public class EmulateShellcode {
public static void main(String[] args) throws IOException {
VirtualClock clock = new VirtualClock();
PC pc = new PC(clock, args);
Processor cpu = pc.getProcessor();
cpu.cs.setSelector(0);
cpu.eip = 0;
if(args.length == 0) {
System.out.println("Usage: java EmulateShellcode ");
System.out.println("*** examples:");
System.out.println("* add al, 0x42:\n\tjava EmulateShellcode 04 42");
System.out.println("* nop:\n\tjava EmulateShellcode 0x90");
System.out.println("* inc edx:\n\tjava EmulateShellcode 42h");
System.exit(1);
}
System.out.println("Loading shellcode in memory...");
for(int i=0; i<args.length; i++) {
String str = args[i];
if(args[i].startsWith("0x"))
str = args[i].substring(2);
else if(args[i].startsWith("\\x"))
str = args[i].substring(2);
else if(args[i].endsWith("h"))
str = args[i].substring(0, args[i].length()-1);
<strong>// the byte at address i is the argument at position i cpu.linearMemory.setByte(i, (byte)Integer.parseInt(str, 16));</strong>
}
System.out.println("Before execution:");
cpu.printState();
<strong>// execute starting at address 0 cpu.linearMemory.executeProtected(cpu, 0);</strong>
System.out.println("\nAfter execution:");
cpu.printState();
}
}
Creating a toy virtual machine with PyPy
Here, you can use “virtual machine” as in “Java Virtual Machine”, not as in virtualization. We will play with the virtual machine described in the paper Tracing the Meta-Level: PyPy’s Tracing JIT Compiler by C.F. Bolz, A. Cuni, M. Fijalkowski and A. Rigo (it’s a great read by the way).
PyPy is a fascinating project, too complex to describe here. Among other things, PyPy can take any interpreter written in a subset of Python, translate it to C, and automatically generate a JIT compiler for this language. Does it sound too good to be true? Let’s try this.
- grab PyPy source code
- create the interpreter in pypy/translator/goal/target-toy.py with the following code:
</pre>
<pre style="text-align: justify;">import os, sys
import autopath
import py
# these are the opcodes for the interpreted language
JUMP_IF_A = 1
MOV_A_R = 2
MOV_R_A = 3
ADD_R_TO_A = 4
DECR_A = 5
RETURN_A = 6
<em>from pypy.rlib.jit import JitDriver tlrjitdriver = JitDriver(greens = ['pc', 'bytecode'], reds = ['a', 'regs'])</em>
# the main interpreter loop
def interpret(bytecode, a):
regs = [0] * 256
pc = 0
while True:
<em> tlrjitdriver.jit_merge_point(bytecode=bytecode, pc=pc, a=a, regs=regs) </em> opcode = bytecode[pc]
pc += 1
if opcode == JUMP_IF_A:
target = bytecode[pc]
pc += 1
if a:
<em> if target<pc: tlrjitdriver.can_enter_jit(bytecode=bytecode, pc=target, a=a, regs=regs) </em> pc = target
elif opcode == MOV_A_R:
n = bytecode[pc]
pc += 1
regs[n] = a
elif opcode == MOV_R_A:
n = bytecode[pc]
pc += 1
a = regs[n]
elif opcode == ADD_R_TO_A:
n = bytecode[pc]
pc += 1
a += regs[n]
elif opcode == DECR_A:
a -= 1
elif opcode == RETURN_A:
return a
# __________ Entry point __________
def entry_point(argv):
# the program we want to interpret
# it computes the square of its argument
bytecode = [
MOV_A_R, 0, # i = a
MOV_A_R, 1, # copy of ’a’
# 4:
MOV_R_A, 0, # i--
DECR_A,
MOV_A_R, 0,
MOV_R_A, 2, # res += a
ADD_R_TO_A, 1,
MOV_A_R, 2,
MOV_R_A, 0, # if i!=0: goto 4
JUMP_IF_A, 4,
MOV_R_A, 2,
RETURN_A
]
result = interpret(bytecode, int(argv[1]))
print result
return 0
def jitpolicy(driver):
from pypy.jit.metainterp.policy import JitPolicy
return JitPolicy()
# _____ Define and setup target ___
def target(*args):
return entry_point, None
# main function, if this script is called from the command line
if __name__ == '__main__':
entry_point(sys.argv)</pre>
<div><span style="color: #000000;">
- the lines in italic are the annotations for the JIT compiler. We need to give PyPy some insight on the interpreted language by declaring what is the instruction pointer (the green variables), the beginning of the dispatch loop and the backward branches (see the paper for full details).
- check that you can execute this script correctly by running
python target-toy.py 12, the output should be 144 - PyPy can translate this script in C. For this, first install the dependencies and then run the following command:
python translate.py target-toy.py - this should give you an executable target-toy-c, rename it target-toy-native and check that
./target-toy-native 12yields 144 - now we can ask PyPy to translate target-toy.py in C and generate a JIT compiler for it. For this, we just run
python translate.py --opt=jit target-toy.py- note: the 64-bit backend of PyPy is not implemented yet, so if you are on a 64-bit system, you will have to struggle a bit. You will have to use a 32-bit Python interpreter (see my former post), create an alias for gcc -m32 (let’s call it gcc32) and then pass the option –cc=gcc32 to translate.py.
- this should give you another target-toy-c executable, rename it to target-toy-jit and check that
./target-toy-jit 12 yields 144
Ok, everything is working, so let’s now see how all this performs by computing large squares:
~/pypy-trunk/pypy/translator/goal$ time python target-toy.py 1000000
1000000000000
real 0m18.637s
~/pypy-trunk/pypy/translator/goal$ time ./target-toy-native 1000000
-727379968
real 0m0.024s
~/pypy-trunk/pypy/translator/goal$ time ./target-toy-jit 1000000
-727379968
[...]
real 0m0.005s
The first run is the square program interpreted by our program, itself interpreted by the Python interpreter. Double interpretation is slow.
The second run is the square program interpreted by a native version of our interpret function. Interpretation by native code is ok.
The third run is the square program interpreted and JIT’ed on the fly. It’s super awesome :)
Final note: I must thank everybody from #pypy on freenode, for their help and resilience to stupid questions. Thanks guys!

