QINSB is not a Software Blog
Monday, December 12, 2011
Stack crawling in DrMemory without frame pointers or debug info
If you remember your early computer systems classes, you'll recall those exam questions about unwinding the stack by chasing the base pointer. The reality is that for years we've been stuck with x86_32, which has a measly 8 general purpose registers, and compilers have been itching to grab that tender little %ebp register. So, they took it, and now the world is much more complicated because walking the stack is a whole lot harder.
To unwind the stack, we need to answer the question: given the stack pointer (SP) and program counter (PC), what is the offset from the SP to the next return address? With that, we can repeat the question until we hit main (or NULL or whatever).
Maintaining a huge table mapping any arbitrary module offset to a stack offset would waste a lot of space, so most systems use some form of state machine encoding that can compute the value on the fly, should you need to walk the stack. (Win64 SEH uses an interesting mechanism involving carefully constructed prologues which I'd like to learn more about.) If stack walking is rare, such as for an "exceptional" exception or when in the debugger, then the state machine (DWARF CFI) works great. But if you need to do it all the time and you need to do it on multiple platforms, it will be slow, you will need platform-specific code, or you will need to depend on some other library. Pulling arbitrary code into a tool that runs in the same as the address space as the application can often break isolation between the tool and the app, so we don't like to do it very often.
However, we're in dynamic binary instrumentation land, so we get to see every instruction as it's executed. The next most obvious solution for unwinding is what we call the "shadow stack" solution. Every time you see a call or ret instruction, you just push or pop the PC on to or off of the current thread's shadow stack. If you need to record lots of stack traces, then this may be your best bet, especially since you can imagine logging these updates and then recording a trace is as easy as logging a timestamp. I believe that this is what ThreadSanitizer does, because they need to produce a stack trace not just for the current memory access, but for the other memory access that races with the current thread.
For a tool like DrMemory, we only need to store stack traces for every heap routine invocation, which is not nearly quite as often. We also don't want to unnecessarily slow down an app that performs many calls in a loop without calling any heap routines or creating any memory bug reports. You also have to worry about a lot of corner cases, such as an app which clobbers the return addr to go somewhere else, or longjmp, or exception handling.
The solution we use is I think quite elegant. First, if there are frame pointers, we use them, because it's fast (like iterating a linked list that's in cache :). While the compiler may get a 10% perf boost from an extra register, we get a bigger boost by not doing a "scan", which is what I am about to describe.
If we don't have frame pointers, we start with the stack pointer, and scan each pointer-aligned cell for a value that looks like a pointer into executable memory. This part is obvious, and you can get Windbg to do the same thing and symbolize it to boot (anyone know how to do this in gdb?).
The problem is that most code rarely initializes its entire stack frame. In any given frame, there is a lot of spill space for temporaries in unexecuted code, or arrays that are not filled. This means that stale return addresses from prior frames are still on the stack. The clever bit is realizing that we're already shadowing every memory write to track definedness information to find uninitialized reads, so we actually *know* which bytes are initialized and which are not. That means we can tell what is a probable return address and what is stale.
Still, what if the app stores a function pointer to the stack? That will look like a code pointer and show up as an extra frame. The next bit of cleverness is where we take the candidate return address and look backwards in the code for likely encodings of a call instruction. This is pretty straightforward, as there are only a handful of encodings for direct and indirect calls, and we can just enumerate the opcodes along with their offsets. If it's a function pointer, this test should fail, unless we're really unlucky.
And with that, we have our (mostly) reliable, (reasonably) performant, and (mostly) portable stack crawler! :) To my knowledge, Memcheck doesn't attempt any of this craziness, and encourages you to recompile with FPs if it gets a weird looking stack.
Edit: Comments on G+ are preferred, since I'll actually respond.
Thursday, August 11, 2011
Early injection of DBI frameworks on Linux
Early injection is the ability of a binary instrumentation framework to interpret an application from the very first instruction. On Linux, applications do not start life in userspace at main. They start life at _start, and that usually hops into the dynamic loader to bring in the shared libraries such as libc. Usually this code is very similar for every application, and you don't need to instrument it to find uses of uninitialized memory in your application. However, its possible you have an app that doesn't use the dynamic loader, or there really are bugs there you want to see. A related problem is how to load your own dynamic libraries without conflicting with the loader being used by the application. Typically the DBI framework will load two copies of libc: one for the application, and one for the framework and its libraries. DynamoRIO, Valgrind, and Pin all have different ways of loading and injecting with different tradeoffs.
DynamoRIO is the easiest, because we don't support early injection on Linux. (Windows is a different story which I can't speak to.) We simply set LD_LIBRARY_PATH and LD_PRELOAD in a shell script that wraps the application invocation. This way, the loader will pull in one of our shared libraries, and call _init, at which point we take control. Instead of returning back to the loader, we start interpreting from where we would have returned. For DynamoRIO's dynamic libraries, the most important of which is the actual tool that you want to run, ie your race detector or memory debugger, we have our own private loader which keeps its own libraries on the side and does all the memory mapping and relocations itself. This is a pain, since it practically reimplements all of the system loader, but we do have the ability to place restrictions on the kinds of libraries we load to make things easier.
Valgrind is the next easiest, since it's basically the inverse of the above strategy. The command "valgrind" is a proper binary, so the system loader kicks in and loads all of Valgrind's libraries for it. It can use dlopen to bring in the tools it wants. To load libraries for the application, Valgrind has a private loader similar to our own, except that it loads the application's libraries instead of Valgrind's.
Finally, Pin has the most interesting strategy. Pin does not have it's own loader. Pin is also a proper binary. They start by loading all the libraries they ever want, including the tool you want to run, using the system loader (dlopen, etc). Then they fork a child process. The *parent*, not the child, sets up a call to exec to match the invocation of the application. Somehow, probably using ptrace or some other API, the child is able to pause the parent application before the first instruction and attach. The child then copies its memory map into the parent. If there are conflicts, it errors out. The child then redirects execution to Pin's entry point, which eventually starts interpreting the application from _start.
Pin's approach allows early injection without the maintenance burden of a private loader. However, it can cause conflicts with the application binary and requires heuristics to set preferred addresses which won't conflict. It also slows down startup, due to the extra forking and memory mapping and copying. Personally, I think this is pretty cool, even if it has some transparency issues.
Thursday, May 5, 2011
FLAGS dependencies on x86
Wednesday, April 6, 2011
HipHop talk
if (is_int(v)) { /* large int code */ }
else { /* massive general code */ }if ($a) {
function foo() { return true; }
} else {
function foo() { return false; }
}function foo() { return true; }
function bar() { return false; }
foo = bar; // No can do.function foo() { return true; }
$func = "foo";
$bar = $func();$which = $a ? "foo" : "bar";
$$which = "string";Another feature of HipHop is that, once you forbid certain features like eval or using include to pull in generated or uploaded source from the filesystem, you get to do whole program compilation. One of the major optimization blockers even for C and C++ is that the compiler only gets to see a translation unit (TU) at a time. HipHop gets to see the whole program, so it makes it possible to do the checks I mentioned earlier, like making sure that there is only a single definition of a function 'foo' at top-level which is always present, so you can statically bind it at the call-site even if it's in a different source file.
Wednesday, March 30, 2011
Building spec benchmarks with newer libstdc++
I was having issues building the 483.xalancbmk benchmark with a not-so-new libstdc++ (4.3.4). There's a particular source file (FormatterToHTML.cpp) that uses memset without including string.h or cstring. This was the error message:
FormatterToHTML.cpp:139: error: 'memset' was not declared in this scopeIt depended on an include of 'vector' to pull in 'cstring' as a transitive dep. That seems to have changed. Since the SPEC benchmarks go so far as to checksum the benchmark sources before letting you run them, it was easier to add '#include <cstring>' to /usr/include/c++/4.3.4/vector than fix the application. =(
Saturday, March 26, 2011
Unladen Swallow Retrospective
I wrote this while I was at PyCon, but I kept revising it. Anyway, here goes. :)
As is apparent by now, no one has really been doing any work directly on Unladen Swallow or in porting it to py3k. Why not?
Lack of Sponsor Interest
The primary reason is that we weren't able to generate enough internal customers at Google. There are a few reasons for that:
- Most Python code at Google isn't performance critical. It's used mainly for tools and prototyping, and most user-facing applications are written in Java and C++.
- For those internal customers who were interested, deployment was too difficult: being a replacement was not enough. Any new Python implementation had to be turn-key. We thought building on CPython instead of starting fresh would sidestep this problem because C extensions and SWIG code would just work. However, simply changing from the previous version of CPython to a 2.6-based Python was too difficult.
- Our potential customers eventually found other ways of solving their performance problems that they felt more comfortable deploying.
After my internship was over, I tried to make Unladen the focus of my Master's thesis at MIT, but my advisor felt that the gains so far were insufficient to have big impact and the techniques I wanted to implement are all no longer considered novel. Most feedback techniques were implemented for Smalltalk by Urs Hölzle and tracing for Java by Andreas Gal. Not to say there aren't novel techniques to be discovered, but I didn't have any ideas at the time.
Lack of Personal Interest
Most of this was decided around Q1 of 2010. We still could have chosen to pursue it in our spare time, but at that point things looked a little different.
First of all, it's a lot less fun to work on a project by yourself than with other people, especially if it's unclear if you'll even have users.
Secondly, a large part of the motiviation for the project was that we felt like PyPy would never try to support CPython C extension modules or SWIG wrapped code. We were very surprised to see PyPy take steps in that direction. That somewhat obviated the need to build a bolt-on JIT for CPython. Also, when the project was launched, PyPy didn't have x86_64 support, but in the meantime they have added it.
Finally, the signals we were getting from python-dev were not good. There was an assumption that if Unladen Swallow were landed in py3k, Google would be there to maintain it, which was no longer the case. If the merge were to have gone through, it is likely that it would have been disabled by default and ripped out a year later after bitrot. Only a few developers seemed excited about the new JIT. We never finished the merge, but our hope was that if we had, we could entice CPython developers do hack on the JIT.
So, with all that said for why none of us are working on Unladen anymore, what have we learned?
Lessons About LLVM
First of all, we explored a lot of pros and cons of using LLVM for the JIT code generator. The initial choice to use LLVM was made because at the time none of us had significant experience with x86 assmebly, and we really wanted to support x86 and x86_64 and potentially ARM down the road. There were also some investigations of beefing up psyco, which I beleive were frusturated by the need for a good understanding of x86.
Unfortunately, LLVM in its current state is really designed as a static compiler optimizer and back end. LLVM code generation and optimization is good but expensive. The optimizations are all designed to work on IR generated by static C-like languages. Most of the important optimizations for optimizing Python require high-level knowledge of how the program executed on previous iterations, and LLVM didn't help us do that.
An example of needing to apply high-level knowledge to code generation is optimizing the Python stack access. LLVM will not fold loads from the Python stack across calls to external functions (ie the CPython runtime, so all the time). We eventually wrote an alias analysis to solve this problem, but it's an example of what you have to do if you don't roll your own code generator.
LLVM also comes with other constraints. For example, LLVM doesn't really support back-patching, which PyPy uses for fixing up their guard side exits. It's a fairly large dependency with high memory usage, but I would argue that based on the work Steven Noonan did for his GSOC that it could be reduced, especially considering that PyPy's memory usage had been higher.
I also spent the summer adding an interface between LLVM's JIT and gdb. This wasn't necessary, but it was a nice tool. I'm not sure what the state of debugging PyPy is, but we may be able to take some of the lessons from that experience and apply it to PyPy.
Take Aways
Personally, before working on this project, I had taken a compiler class and OS class, but this experience really brought together a lot of systems programming skills for me. I'm now quite experienced using gdb, having hacked on it and run it under itself. I also know a lot more about x86, compiler optimization techniques, and JIT tricks, which I'm using extensively in my Master's thesis work.
I'm also proud of our macro-benchmark suite of real world Python applications which lives on and PyPy uses it for speed.pypy.org. In all the performance work I've done before and after Unladen, I have to say that our macro benchmark suite was the most useful. Every performance change was easy to check with a before and after text snippet.
We also did a fair amount of good work contributing to LLVM, which other LLVM JIT projects, such as Parrot and Rubinius, can benefit from. For example, there used to be a 16 MB limitation on JITed code size, which I helped to fix. LLVM's JIT also has a gdb interface. Jeff also did a lot of work towards being able to inline C runtime functions into the JITed code, as well as fixing memory leaks and adding the TypeBuilder template for building C types in LLVM.
So, while I wish there were more resources and the project could live on, it was a great experience for me, and we did make some material contributions to LLVM and the benchmark suite that live on.