The Berkeley Overmind
Some postdocs at my university get paid to play StarCraft. My life sucks.
Obfuscation Patterns in Academic Papers
These techniques seem to work, since I stumble quite frequently upon them in papers accepted at important venues:
- misleading claim — come up with your own definition of a widely accepted term, preferably in a way inconsistent with the intuition and the general understanding of the term. Then claim spectacular results that happen to be true only with respect to your definition.
- marketing pitch — use data irrelevant to the scientific discussion. Bonus: cite a dubious, hard-to-verify source with a non-existent methodology. Ex: “Problem X costs more than Y gazillion $$$ per year (hence our work is of epic importance)”.
- over-sophistication — use overly sophisticated terms for techniques or ideas that would otherwise be simple to understand (especially if it makes them look better). Ex: “heuristics” [1] sounds a lot better than “hack”.
- no limitations — there are no limitations to your work, only things that are either out-of-scope or left for future work.
- outsourcing problems — outsource large portions of the hard problems in your area by “leveraging” (see over-sophistication) existing work from a different domain with a complex literature, just to make sure that nobody can grasp all the dependencies of your paper.
- biased evaluation — perform experimental evaluation designed to make others look bad. Even better if the evaluation is not reproducible, because you fail to disclose your source code, detailed algorithm or input set.
[1] heuristics (noun — hyu’-ris-tik): algorithm that does not work.
The Halting Problem for Reverse Engineers
I often have the feeling that technically savvy people don’t have a very high opinion of academia, and this is particularly true of security people. They have to deal with low-level details such as hardware architecture, operating systems internals (more or less documented), proprietary protocols and data structures, all of which require very specialized knowledge. Logic, theorems and algorithms don’t have a predominant place in that picture.
For the past few years I have been working on these subjects, and found some fundamental theorems to be actually *useful* in understanding the security properties of computer architectures. One of them is, of course, the undecidability of the halting problem. In essence, it says that we can not know if the computation of a program on some input will ever terminate. So what? Who cares if a program terminates, what we want is to find vulnerabilities and unpack malware samples, right?
The importance of the undecidability of the halting problem lies in its generality. In particular we can see Rice’s theorem as a generalization of this result, and to put it very simply, it says that whatever properties of programs you’re interested in, no program can tell if this property holds for every program (i.e. it is undecidable).
This is very bad news for all of us, since basically everything about programs is undecidable. Say that you are interested in finding functions that do out-of-bounds memory writes (or as I said, any other property), Rice’s theorem says that there is no program that will give you a correct answer all the time. You must accept that your program sometimes fails or infinitely loops.
I want to emphasize how bad this is. Do not let the terminology used in the theorems confuse you. In particular, the notions of input, output, and function computed by a program do not map nicely to binaries. An output is anything that gets modified by your program — any register or memory location, as soon as it is touched by an instruction, is an output. And basically, everything about outputs is undecidable. As a consequence, simple tasks such as disassembling are undecidable.
For instance, take this seemingly innocent indirect jump:
jmp [eax]
If eax is an output of instructions before it, no luck, its value is undecidable. You can run the program, write the value down, and assume it will not change, but you have no guarantee that it will not change at a given date. Undecidable. You could argue that eax can only take a finite number of values, and hence disassembling is still possible, just very intractable. But that would be without counting on self-modifying code. SMC (think packers) is the scourge of disassemblers because it gives the ability to transfer control to an output. Since I can’t decide the value of the output, I can’t disassemble.
To sum things up, here are a few direct consequences of the undecidability of the halting problem:
- you can’t decide the target of indirect jumps, reads and writes
- you can not decide if a particular memory address is code, data, or both
- you can’t decide values written in memory
- you can’t decide the numbers of occurrences of loops
- you can’t decide if control flow can reach a given instruction
- whatever you see in a given run can change arbitrarily in another run
- disassembling is undecidable
- unpacking is undecidable
I will leave how all this led to the bad habit of relying on “heuristics” to a further post. Stay classy!
Headin’ Out Californee Way
Well, I have not been very active on this blog lately. Since my last post I got married, defended my PhD (in front of a superlatively awesome committee, really) and left the snowy plains of Lorraine for California, among other things. I started my new job as a research staff member in Dawn Song‘s group in Berkeley, so I will probably keep complaining about malware, vulnerabilities, the price of housing and pretty much everything else on this blog.
Also, I might actually start training to become a cage fighter and learn emacs — not sure which one is less painful.
On the unsolvability of program behaviour
This is spectacularly well stated:
Due to the unsolvability of the halting problem (and nearly any other question concerning program behaviour), no analysis that always terminates can be exact. Therefore we only have three alternatives:
- Consider systems with a finite number of finite behaviours (e.g. programs without loops) [...]
- Ask interactively for help in case of doubt [...]
- Accept approximate but correct information.
Neil D. Jones and Flemming Nielson, Abstract interpretation: a semantics-based tool for program analysis [PS]
Aboriginal Art
Pictures from an exhibition on aboriginal art at Vogüé Castle, France. Shot with a Nexus One.
Extended Self-Advertisement
The deadline of the Malware 2010 conference is extended to July 17, 2010. Now you have no excuse.
Fun with diagonalization
No more loops, embrace the future with recursive self-modifying code and diagonalization. In the following code snippet, we show how to replace loops like ‘while E do C’ with only ‘if’ and ‘eval’ statements:
// expects stmt == 'if(expr) { command }'
function Z(stmt) {
start = stmt.indexOf('(')+1;
end = stmt.indexOf(')');
expr = stmt.substr(start, end-start);
start = stmt.indexOf('{')+1;
end = stmt.lastIndexOf('}');
command = stmt.substr(start, end-start);
out = 'if ('+expr+') {'+command+'; stmt=\'if('+expr+') {'+command+'}\'; eval(Z(stmt))}';
return out;
}
i = 3;
// the following line is equivalent to while(i>=0) { alert(i); i-- }:
eval(Z('if(i>=0) { alert(i); i-- }'))
Read more about diagonalization and its use by Cantor and Gödel here.
UPDATE: a nicer code snippet by Matt Kaczmarek below.
special = "/**/"
function Z (stmt){
stmt = stmt.replace (special, "\'" + stmt + "\'");
return stmt;
}
i = 3;
// the following line is equivalent to while(i>=0) { alert(i); i-- }:
eval(Z('if(i>=0) {alert(i); i-- ; eval(Z(/**/))}'));



