(disclaimer: the author of Savarin, Matthieu Kaczmarek, is a colleague working in the office next door and a friend of mine)
Savarin is a free online binary classification service (you can think of it as automatic diff’ing against large databases of programs). It is in beta, not fully polished yet, but you can still squeeze some interesting results out of it. Here is your daily shot of binary analysis, freshly brewed.
You will need:
- 2 different malware samples in the same malware family. We are going to use Sasser.A (already in Savarin’s database) and an unpacked Sasser.G (md5 b973853d0863070aca89ce00d4ee0fb9 [offensivecomputing.net])
- IDA with IDAPython for the actual diff’ing (I have IDA 5.5, I don’t know if this works with the free version)
Let’s go:
- open Savarin
- in “Classification against custom database”, choose SasserA
- upload the Sasser.G sample
- in the results page, click More to see the similarity with other binaries in the Sasser family
- you can see that the sample is 41.95% similar to a sample with md5 edc66a4031f5a41f9ddf08595a1d4c92
At this point, you have a classification of a sample against a (small) database of programs. You can therefore see the distance between this sample and other samples. If you ask me, it’s a lot better to see that unknownsample.exe is 80% similar to badguy.exe and 90% similar to badguy2.0.exe than just “infected” or “not infected”.
For the actual diff’ing, follow these steps:
- open the Sasser.G sample in IDA
- download the IDAPython analysis report on Savarin’s analysis page (this report contains all the data needed to visualize the binary differences in IDA)
- execute the IDAPython analysis report
- right now, the situation is pretty anticlimactic since you should see no change apart from a few lines in the console. Wait until next step for the interesting stuff. Yes, you had nothing to do in this step, so what?
- type SavColor('md5.edc66a4031f5a41f9ddf08595a1d4c92', 0x0088ff) in the IDAPython console (the md5 value is that of the Sasser.A sample)
- type SavComment('md5.edc66a4031f5a41f9ddf08595a1d4c92') in the IDAPython console
- this is it: now you can browse the Sasser.G sample, and the parts it has in common with Sasser.A will be colored. Additionally, for each pair of matching instructions, you will see the corresponding address in the Sasser.A sample.
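I have not looked at how SavColor and SavComment are implemented, so take the following as a minimal sketch of the general idea only, not as Savarin's code. It assumes a hypothetical mapping from addresses in the opened sample to matching addresses in the reference sample, and takes the coloring and commenting functions as parameters so that it can be dry-run outside IDA. The names apply_matches, matches, set_color and set_comment are all made up for the illustration.

```python
def apply_matches(matches, set_color, set_comment, color=0x0088FF):
    """Color and annotate every matched address.

    matches     -- hypothetical dict {address_in_this_sample: address_in_reference}
    set_color   -- callback taking (address, color)
    set_comment -- callback taking (address, text)
    color       -- IDA expects colors as 0xBBGGRR, so 0x0088FF is an orange tone
    Returns the number of addresses processed.
    """
    for ea, ref_ea in sorted(matches.items()):
        set_color(ea, color)
        set_comment(ea, "matches 0x%08X in reference" % ref_ea)
    return len(matches)


if __name__ == "__main__":
    # Dry run outside IDA: record the calls in plain dictionaries.
    colored, comments = {}, {}
    demo = {0x401000: 0x402A10, 0x401005: 0x402A15}  # made-up addresses
    count = apply_matches(demo, colored.__setitem__, comments.__setitem__)
    print(count, comments[0x401000])
```

Inside IDA, the two callbacks could wrap the IDAPython calls of that era, for example lambda ea, c: idc.SetColor(ea, idc.CIC_ITEM, c) and idc.MakeComm (newer IDA versions renamed these functions, so check your version).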
The Fine Screenshots:
Hi. I agree that Savarin is a great service for malware analysis. I am curious about the similarity function you demonstrated, where there was a similarity of 41% etc. Is this based on the largest common subgraph? The papers that Savarin is based on, to my knowledge, work on fast isomorphism and maximum common subgraph testing. So am I presuming correctly that there is some similarity function in the vein of s=|maximum_common_subgraph(a, b)|/max(|a|, |b|)? I can’t recall this function being described in the Savarin papers.
A quick additional comment, because you say that Savarin identifies the common code between samples, as shown in the screenshots. This does not appear to be largest common subgraph identification. I’d be interested in having this explained to me, because it seems I must not have understood the original papers if this is the case.
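To make the shape of the formula in the question concrete, here is a minimal sketch. It does not compute a maximum common subgraph and it is not Savarin's method; as a stand-in assumption, each program is reduced to a set of hypothetical structure fingerprints, and the common part is simply the set intersection. The function name similarity and the fingerprint values are made up for the illustration.

```python
def similarity(a, b):
    """s = |common(a, b)| / max(|a|, |b|) over two sets of fingerprints.

    a, b -- sets of hashable fingerprints (hypothetical stand-ins for
            the structures extracted from each program)
    Returns a value between 0.0 and 1.0.
    """
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return len(a & b) / float(max(len(a), len(b)))


if __name__ == "__main__":
    sample_a = {"f1", "f2", "f3", "f4"}  # made-up fingerprints
    sample_b = {"f3", "f4", "f5"}
    print("%.2f%%" % (100 * similarity(sample_a, sample_b)))
```

Dividing by the larger of the two sizes means a small program fully contained in a big one still gets a low score, which is one reason other normalizations (for example, dividing by the size of the union) are sometimes preferred.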
I’ll let the Savarin guru answer in person if you don’t mind ^^
Hi Silvio,
We don’t do largest common subgraph but common subgraphs. The best paper on this approach (I think, but yours have not appeared yet ;)) is the one by Kruegel on worm detection http://www.cs.ucsb.edu/~seclab/projects/polyworms/index.html. I should add a link on Savarin.
The details of the method are not explained in the papers provided on Savarin; perhaps they will be in a future publication or in a patent. But the technology used to do it is the one explained in the papers.
Moreover, Savarin doesn’t identify common code but common structures, that is, common sub-CFGs.
That said, with some work you should be able to do largest common subgraph with automata techniques ;).
I’m on holiday, between two trains… do not hesitate to ask for details, but I may be a bit slow to answer.
—
Matthieu
Hi. It’s interesting, because I have also based some new work on that paper. You may have seen a Twitter post of mine recently which linked that paper, as I had also done an implementation of it, except that I don’t do vertex colouring. I was hoping to publish later in the year. I also have a working system and I think the results are good: it’s efficient and effective. I have this vision that we might both be doing the same thing, because it seems a natural progression from the existing literature, especially when applied to malware, because the Kruegel paper is primarily about the simpler case of worm detection and doesn’t continue on with the traditional approach *wink*. The 2 papers I am publishing currently take a different approach and are not related to the Kruegel paper.
I think in my case a patent would be hard to obtain because of the existing work in the Kruegel paper. I actually have 2 variations: 1 is somewhat different, while the other is very similar. But I still think there is a prior art issue.
If you publish, give me a warning if possible before you submit, so I have the option of submitting to another conference or journal independently. I can also tell you when I am intending to submit. I was hoping, however, to postpone writing a paper for a while, until I finish my thesis.
The research world on this problem is really small it seems.
No problem, I’ll keep you updated via e-mail. But don’t worry, I think you’ll be quicker. I’m a bit tired of publications, reviewers and stupid bibliometric indices. Moreover, I’ll have to find a job, and publications don’t necessarily help.
But that’s another matter ^^.