This is mostly a note to myself, but I guess people interested in automating reverse engineering will be interested at some point in IR suitable for low-level abstractions. I consider top-down IR used by optimizing compilers and bottom-up IR used by decompilers and other reversing tools.
Intermediate Representations for Reverse Engineering
REIL. Used in BinNavi, the Reverse Engineering Intermediate Language defines a very simple RISC architecture (17 instructions), with the nice property that each instruction has at most one side-effect. Thomas Dullien and Sebastian Porst recently presented at CanSecWest an abstract interpretation framework for REIL (paper, slides). It is clearly possible to easily write analyses and transformation passes for REIL without getting into the complexity of the whole x86 architecture, given x86 -> REIL and REIL -> x86 translators.
Here are some sample REIL instructions :
1006E4B00: str edi, , edi 1006E4D00: sub esp, 4, esp 1006E4D01: and esp, 4294967295, esp 1006E4D02: stm ebp, , esp
Hex Rays Microcode. Presented at Black Hat USA 2008 by Ilfak Guilfanov (paper, slides), it is an IR used during decompilation. From the paper: “The microcode language is very detailed and precisely represents how each instruction modifies the memory, registers, and processor condition codes. Typically one CPU instruction is converted into 5-15 microinstructions”. According to the REIL paper, REIL and the microcode language are significantly different, for instance the microinstructions can have a variable number of operands and perform multiple side effects.
Sample microcode:
mov esi.4, eoff.4 mov ds.2, seg.2 add eoff.4, #4.4, eoff.4 ldx seg.2, eoff.4, et1.4 mov et1.4, eax.4
I couldn’t find the language reference.
ELIR. Part of the ERESI project, the goal of ELIR is to simplify static analysis by providing a platform independent abstraction. An overview was presented at Ekoparty08 (slides) and some ideas appeared in Phrack 64, but 30s of Googling didn’t get me to the language reference or a code sample, so that’s all I will say about ELIR for the moment.
Pin Inspection API. PIN, Intel’s Dynamic Binary Instrumentation framework provides a very handy instruction inspection API. This is not an IR but provides the same type of information about complex instructions without having to make giant switch statements. For instance, this is the way to log memory writes with PIN given an instruction:
VOID RecordMemWrite(VOID * addr, UINT32 size) { fprintf(trace,",%dW%p", size, addr); } // this function is called each time an instruction is encountered VOID Instruction(INS ins, VOID *v) { // isn't that a nice API ? if (WRITES && INS_IsMemoryWrite(ins)) { INS_InsertPredicatedCall( ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE, IARG_END); } }
Valgrind IR. On my todo list.
FermaT Transformation System. I’ll have to write something about it someday. Oh lucky you, a wikipedia entry and a bunch of papers!
Optimizing Compilers Intermediate Representations
LLVM Bitcode. This language uses low-level RISC-like instructions in SSA form with type information. It is clean and well defined, and is a very suitable target for platform-independent analysis and optimization. It is designed to convey high-level information in lower level operations, so converting machine code to LLVM bitcode probably requires some intensive work.
Here is the hello world example :
; Declare the string constant as a global constant... @.LC0 = internal constant [13 x i8] c"hello worldA0" ; External declaration of the puts function declare i32 @puts(i8 *) ; Definition of main function define i32 @main() { ; Convert [13 x i8]* to i8 *... %cast210 = getelementptr [13 x i8]* @.LC0, i64 0, i64 0 ; i8 * ; Call puts function to write out the string to stdout... call i32 @puts(i8 * %cast210) ret i32 0 }
Register Transfer Language. One of the IR used in GCC, it is an architecture-neutral assembly language that represents instructions in a LISP-like form (d’oh), like this:
(insn 2 49 3 test.c:3 (set (mem/c/i:SI (plus:DI (reg/f:DI 6 bp) (const_int -20 [0xffffffffffffffec])) [0 argc+0 S4 A32]) (reg:SI 5 di [ argc ])) 47 {*movsi_1} (nil))
It feels a bit old-fashioned and less clean than LLVM bitcode, but this is just a gut feeling. Use gcc -fdump-rtl-all to see what it looks like.
Side note: the idea of dumping RTL to a file, performing transformations on it and giving this back to GCC is quite common, but RMS qualifies it as “not feasible”, even though the creator of RTL says it is not only feasible but quite useful actually.