I spend a lot of my time building and evaluating implementations for research papers. Over time, I’ve built a strategy for running evaluations. I just uploaded a project to github with a skeleton for performing evaluations how I do. The description is available here, and the github project is here.
Coming to a BAP Near You
There’s going to be a lot of surprises in BAP 0.8. Here’s a little preview:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 | |
Typechecking Fail
If you spend enough time around me, you will probably hear me extol the virtues of statically typed languages such as ML or Haskell. Statically typed languages are great, because they find so many bugs before you even run your program. In fact, a good friend of mine jokes about OCaml program that “If it compiles it is correct!” This is not true, of course. OCaml does not check that your algorithm’s logic is correct, for instance.
Yesterday I encountered what I would consider a “type bug” in OCaml. There was human error involved, but I was still surprised that such a bug is possible in a type-checked program.
In BAP, we have a utility function, Util.timeout, which performs a
computation, and if it takes too long, aborts it and raises an
exception. The timeout function has type: int -> ('a -> 'b) -> 'a
-> 'b, or in plain english, takes the number of seconds, a function
f to execute, and a value x such that f x is computed.
The problem we observed was that timeouts were simply not working: the
code would run for a very long time. The timeout function was being
called, though, and at first glance I thought it looked fine:
Util.timeout 3 Asmir_disasm.vsa_at p addr vsa_at takes two
arguments, so this makes some intuitive sense. But a closer look
reveals that we are psssing 4 arguments to Util.timeout, and it
only takes 3 — How is this possible?
The problem is that timeout is polymorphic, and the type inference
engine decided that 'a = Asmir.asmprogram and 'b = Type.addr ->
Cfg.AST.G.t * unit, or in plain english: we want to partially
evaluate the vsa_at function on only one argument, and timeout if
that takes too long. Unfortunately, partial evaluation is usually
very fast. The slowdown occurs when the extra argument, addr is
then consumed by this partially evaluated function. Another way of
viewing this is that OCaml thinks we wanted to do (Util.timeout 3
Asmir_disasm.vsa_at p) addr.
The solution is simply to force the partial evaluation to occur
earlier, i.e., Util.timeout 3 (Asmir_disasm.vsa_at p) addr. I think
that such a bug should at least trigger a warning, or maybe even
require explicit parenthesization in such cases to denote currying is
intentionally being applied. However, the consensus at lunch was that
I am wrong, and the programmer should not make such mistakes. But
isn’t the point of static type checking to find dumb mistakes?
BAP for Everyone
The use of OCaml in BAP has both positives and negatives. OCaml’s pattern matching is a wonderful fit for binary analysis. However, very few people know OCaml, and thus few people can understand or modify BAP’s source code. Although you can do some nifty things without digging into BAP’s source code, this only touches the surface of BAP’s capabilities.
Today I added a feature which I hope will bring BAP to the masses — well, at least to the masses of security people who want to use BAP but haven’t because they do not use OCaml. This feature should allow users to easily read and analyze the BAP IL in the comfort of their favorite programming language, whatever that might be.
BAP has always had a robust pretty printer and parsing mechanism which could, in theory, be parsed by other languages. But honestly — who wants to build a parser to parse the BAP IL? It’s annoying, and I doubt anyone has gone through the trouble of doing it. The new feature I added gives users the ability to serialize the BAP IL to a number of formats, including protobuf, XML, JSON, and Piqi. If your programming language doesn’t have libraries to parse one of these formats, it probably isn’t worth using.
Let’s take a look at some examples. Here’s some BAP IL:
1 2 3 | |
Here’s the same IL represented in JSON:
1
| |
and in XML:
1 2 | |
and in protobuf:
1 2 3 4 5 6 7 8 9 | |
Hopefully this will encourage some new people to use and contribute to BAP. Adding support for new instructions isn’t that hard, even for people that don’t know OCaml! This serialization will be in BAP 0.7, which will be released in a few days.
Wiley Q
Q is the automated return oriented programming system (ROP) that I developed with my colleague, Thanassis Avgerinos. The basic idea behind ROP is to find small code fragments that do something useful in a binary that you are attacking. These code fragments are called gadgets. (Do you get the reference to Q now?) Later, the gadgets are strung together to accomplish something even more useful. (Hint: It can be Turing-complete.) ROP can be used to bypass program defenses such as Data Execution Prevention.
I’m not doing much personal work with Q these days, but it has become fairly popular as a tool for evaluating proposed ROP defenses. I just added a new option for such an evaluation. The new option forces Q to only consider gadgets at specified addresses. Naturally, to make sure it works, I specified a random gadget address, 0x80492d9, and checked the output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | |
The code at this address is mov %ebx,%eax; pop %ebx; pop %edi; ret.
Interestingly, Q figured out that it could load into %eax by simply
executing the gadget twice. This allows you to put a value into
%ebx during the first execution, and then move that value to %eax
during the second execution. This is fairly cute, and I don’t know if
I would have thought of the same trick myself.
Maybe it’s just me, but I am always surprised when automated systems out-smart us. With respect to security, this means that we really need provable and measurable defensive capabilities. The “I don’t see any attacks” defense strategy is simply not good enough against a sophisticated attacker. Just because we (as limited humans) cannot see an attack against our system does not mean that an attacker, who can create a sophisticated program to search for attacks, will not.
Pcmpistri Revisited: Aggregation Modes
Warning: This blog post is not going to be useful to you unless you are trying to understand the pcmpistri instruction. It is not self-contained. See Section 4.1 of the Intel Manual for reference.
A while back I wrote a blog post about how bad the
documentation is for the x86 instruction pcmpistri. I wrote the previous blog post on pcmpistri as I was trying to implement the behavior of
pcmpistri only for the control byte 0xc.
Now I’m trying to add a more general implementation to BAP. This
involves reading and understanding the documentation, specifically focusing on the various aggregation modes that effectively control the behavior of pcmpistri. The problem
with the documentation is that it’s a literal translation of what the
processor is doing, i.e., a description of the circuit that implements
the instruction, rather than an abstract description of what’s going
on. A more abstract view is helpful when not implementing
hardware. In this blog post, I give pseudocode that computes the intres1 intermediate result variable for each of
the four aggregation modes.
I hope these are helpful in understanding the various aggregation modes. Even though each one of these cases is simple, it was difficult for me to get here from the documentation. However, you should take these pseudocode algorithms with a grain of salt. I will update them as I find any problems, but I haven’t tested these yet. Each pseudocode listing is a direct translation from my notes of the Intel manual. I have also yet to formally demonstrate that these are equivalent to the specifications in the documentation. I hope to write a blogpost about how you can use BAP’s verification condition algorithms to show such an equivalence.
Notation
r is the register operand, rm is the register or memory operand,
and u is the upper bound on the number of elements inside each
operand (i.e., 16 for byte-size elements, and 8 for word-size).
Pseudocode
EqualAny
EqualAny sets intres1[j] if rm[j] is a character listed in r.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
Ranges
Ranges sets intres1[j] if r[i] <= rm[j] <= r[i+1] for some even
index i. r is a list of ranges.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |
EqualEach
EqualEach sets intres1[j] iff rm[j] = r[j].
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
EqualOrdered
EqualOrdered sets intres1[j] iff the substring in r is present at
rm[j].
1 2 3 4 5 6 7 8 9 10 11 12 | |
Musings
Ideally we’d be able to compute each intres1[j] independently, but
it doesn’t seem to be the case. break can be thought of as using
implied state, i.e., whether we executed break or not.
Familiar Problems
I’ve often noticed that hard problems are often found in many forms in seemingly irrelevant projects. I ran into an interesting one the other week. I am running some experiments that test verification conditions (VCs), which are algorithms that produce a formula that can be tested with a SMT solver to demonstrate some forms of program correctness. VCs are a very useful technique right now because there are some really great SMT solvers out there – we often use STP, Yices, Boolector, Z3, and CVC3.
You might be familiar with a very common VC, Forward Symbolic Execution (FSE). Dijkstra also created his own VC, called Weakest Precondition (WP), which is generally defined on a language called the Guarded Command Language (GCL). The specifics of the GCL do not matter for this blog post. What matters is that it is a structured programming language; that is, it contains no jumps or goto statements. The only control flow operators are the sequence operator, and a conditional operator. Many followup algorithms for WP have followed in the footsteps of Dijkstra and defined their algorithms to work on the GCL.
The problem is that the GCL is not a real programming language, and so
we must first convert programs to GCL to run these algorithms on
them. Of course, for me, this means converting BAP programs to GCL.
It turns out that my adviser solved this problem while working on his
PhD. His dissertation gives an algorithm that visits a program in CFG
form in reverse topological order. When visiting a conditional branch
node n that has
two successors s1 and s2, the GCL programs gcl1 and gcl2 are
computed from s1 and s2 (don’t worry about the details). gcl1
and gcl2 are split into three parts: a common suffix, and a prefix
for gcl1 and gcl2 that is not shared. This is then converted to
(if (e) gclprefix1 else gclprefix2); suffix. The idea is to avoid
duplicating nodes in the GCL as much as possible.
The problem I ran into was when I started looking at large programs, the GCL size was growing exponentially. After looking at some of the examples, I realized this was a problem I had seen before! The above algorithm is actually performing structural analysis: it is taking an unstructured program (think of a CFG) and trying to give it structure in the form of if-then-else or sequences.
A few months ago I was working on a more general structural analysis algorithm. One interesting thing we found is that the short-circuit behavior from high-level languages causes unstructured CFGs at the binary level. For instance, the program:
1 2 3 4 | |
creates the following CFG:
Notice that BB_6 and BB_7 are reachable from BB_5 but not dominated by it; this is not good for structural analysis!
The high-level point of this post is that many hard problems are related in some way. While working on the VCs, I had already concluded that algorithms that avoid the GCL conversion process entirely are preferable. I probably would not have taken the time to figure out precisely why the GCL conversion had so much code duplication in it, had I not remembered seeing a similar problem before. It always pays to try to relate seemingly new hard problems back to similar ones that you’ve solved or seen solved.
Success?
BAP now has a new framework for adding disassembly engines. For example, here is the implementation of recursive descent:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | |
Right now, the engine is asked what the successors are for an unresolved edge. All of the yucky details for maintaing the CFG and such are hidden from the resolving code. There is also a VSA-based engine, but it’s inefficient: VSA must be re-run from scratch every time an edge must be resolved. I’m going to fix this, of course.
VSA-based lifting is neat because it allows lifting us weird malware types of obfuscation, for instance:
1 2 3 4 | |
can be resolved:
Finally, here is an example of a switch being resolved:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | |
The interesting thing here is that VSA resolved the indirect jump to more than six destinations. Because the destination addresses are not aligned, the strided interval representation is forced to overapproximate. This is bad, but not a big deal in this case.
A more pressing issue is that (our implementation of) VSA’s abstract
stores (memory) is extremely slow. Balakrishnan says he uses
applicative dictionaries in his dissertation for this because it is
space efficient. Conveniently, OCaml’s Map module provides
applicative dictionaries. (Side note: I <3 OCaml.) However, abstract
interpretation often fails to prove termination of a loop. When this
occurs, any stack pointer being increased in the loop has its upper
bound widened to about 2^31. As soon as data is written to this
pointer, 2^31 maps are added to the abstract store. This takes a
while (minutes on a very fast computer), but as Balakrishnan notes, it
is space efficient. Unfortunately, even more slowdown occurs whenever
the meet operation takes place on two abstract stores: every
individual element (a value set) in both stores must be traversed.
There is no easy way to get the differences between two maps in the
OCaml Map module. (There is Set.difference, though.)
I see two possible solutions here.
Create an efficient abstract store data structure. This structure would be strided-interval aware, e.g., it would not store
2^31entries when writing0to[0, 2^31]. It would also be able to identify differences in two structures very quickly. We might want something like an interval tree.Whenever a pointer value set gets “too big”, widen it to top. This has the disadvantage of wiping out memory on that region. In the first option, we would only wipe out any memory locations above the lower bound of the pointer.
Thoughts on Coq and Isabelle/HOL
Coq and Isabelle/HOL are both “proof assistants” or “interactive theorem provers”. Both systems allows one to model a theorem, and then prove the theorem true. The resulting theorem is then checked, typically by reducing the proof to a program that can be type checked. Type checkers are relatively simple, and so we are confident that these machine checked proofs are correct.
I’ve been reading Software Foundations, which is an introduction to program semantics and verification in which everything is modeled and proved in Coq. Remember doing those dreaded structural induction program semantics proofs by hand? It’s a little bit more satisfying when done in Coq, and sometimes is easier too. More recently, I have been learning Isabelle/HOL. Technically, Isabelle is a proof assistant that allows multiple logics to be built on top of it, including HOL, or High Order Logic.
Unlike Isabelle, Coq only supports High Order Logic natively. Coq has
support for \forall and --> baked in to it. --> is used to
represent both entailment and logical implication. As a result, it’s
very easy to write down theorems, because there is generally only one
way to write them! Isabelle/HOL, being split up into Isabelle’s
meta-logic and HOL’s logic, is different. Isabelle has a notion of a
universal quantifier /\ and entailment ==> baked in. HOL then
models the \forall quantifier, and logical implication -->. Of
course, the two levels are closely related, and it is possible to
switch between them. However, when writing a theorem, it is often not
clear in which level quantification and entailment or implication
should be written.
In my first proofs, I errored on the side of HOL instead of the
metalogic, only to find that some of Isabelle’s utilities only work on
the metalogic level. For instance, there is no quick way to
instantiate p and q in \forall p q. P p q. There is a syntax,
however, for instantiating p and q in the meta-logic quantified
version. The ambiguity between logics is a very inelegant property of
Isabelle/HOL in my opinion.
Isabelle does have some nice features, though. It has a structured proof language called Isar. The idea is to allow Isabelle proofs to look like real, human proofs. (In contrast, non-structural mechanized proofs are just a series of tactics, which do not explain why a theorem is true.) Here’s a simple example:
1 2 3 4 5 6 7 8 9 | |
I bet you can figure out what’s going on, even though you don’t know the syntax.
My last observation about Isabelle/HOL and Coq has to do with their unspoken strategies. Isabelle/HOL seems to be designed to make extensive use of simplification rules. Indeed the tutorial essentially says to prove something, you should write your theorem, try to prove it automatically, and if that fails, add a simplification lemma and repeat! This is nice when it works. However, manipulating rules manually is difficult in my opinion, although that could be because I’m a novice user. Manually manipulating rules in Coq is much easier; the design of tactics just seems to allow me to do what I want more naturally. On the other hand, I’ve been using Coq for longer.
In conclusion, Coq and Isabelle/HOL are both interesting tools, and both can probably accomplish what you want. The differences are subtle, and the best way to get a feel for them is to learn them! I recommend Software Foundations for Coq, and the tutorial for Isabelle/HOL.
Undocumented Features of VSA
As I’ve written about recently, I’ve been hacking on VSA. I’ve been trying to get to the point where the example in Balakrishnan’s dissertation works. I chose this example because I don’t know of any others! Below is a snapshot of the example that I started with.

And here are the results that VSA is supposed to be able to infer:

There are several interesting things about these results. First, note that %edx is bounded, even though it is incremented in a loop. In contrast, %eax grows to its maximum value. Although %eax approaches the maximum positive integer, VSA infers that it does not overflow. That sounds kind of weird, doesn’t it?
My VSA implementation does not yield the same results. It notices
that %eax is incremented without a direct bound on line 9, and widens
%eax to 2^31 - 8 at line 7. On line 9, the computation %eax + 8
overflows, and represents any value on the stack. The next loop
iteration, we see weak updates to the whole stack at L1 and line 8,
since %eax represents any stack address.
Somewhere there’s a hidden assumption in addition of value-sets that says adding values in non-global regions cannot overflow. This kind of makes sense, if regions are separated, since the program might crash. I do not consider this to be a sound assumption, however. (Programs can handle exceptions…) Perhaps that is why this assumption is not explicitly mentioned, or if it is, is so buried that I could not find it. I expected it to be in the definition of addition for value-sets, but it’s not there.
With this assumption/hack, we get the same results as the example above.