Wed, Mar. 18th, 2009, 11:17 pm Moved
My site has moved! Well, sort
of. The URL is the same, but I've switched to WordPress instead
of LiveJournal for content management. Frankly, I was embarrassed by
the old aesthetics and structure and I think a facelift was long
overdue.
I will no longer update my LiveJournal account. If you wish to update your
subscription, you can add this RSS feed to your reader or friend
chadaustin_rss
on LiveJournal.
I will resume the series on IMVU's crash reporting
shortly. I also can't wait to blog about a new open source project, but one thing at a time.
Thanks to Ben McGraw, Mike Rooney, and Steven Peterson for all of their help. I'm perpetually a noob at this stuff.
p.s. Why the ads on the new site? I don't expect to make any real
money, but this is a great opportunity for me to learn AdSense and
SEO. Maybe the ads will fund one super
burrito per month.
I'm finally updating my web site for this side of the millennium! That means my LiveJournal RSS feed will stop updating. If you're subscribed, please point your blog readers at http://feeds2.feedburner.com/chadaustin. Thanks!
This post has moved.
Last time, we talked about including contextual information to help us
actually fix crashes that happen in the field. Minidumps are a great
way to easily save a snapshot of the most important parts of a running
(or crashed) process, but it's often useful to understand the
low-level mechanics of a C++ call stack (on x86). Given some basic
principles about function calls, we will derive the implementation
of code to walk a call stack.
C++ function call stack entries are stored on the x86 stack, which
grows downward in memory. That is, pushing on the stack subtracts
from the stack pointer. The ESP register points to the
most-recently-written item on the stack; thus, push eax
is equivalent to:
sub esp, 4
mov [esp], eax
Let's say we're calling a function:
int __stdcall foo(int x, int y)
The __stdcall
calling convention pushes arguments onto the stack from right to left
and returns the result in the EAX register, so calling
foo(1, 2) generates this code:
push 2
push 1
call foo
; result in eax
If you aren't familiar with assembly, I know this is a lot to absorb,
but bear with me; we're almost there. We haven't seen the
call instruction before. It pushes the value of the EIP
register, which by then points at the instruction after the call (the
return address), onto the stack and then jumps to the target function.
If we didn't store the instruction pointer, the called function would
not know where to return when it was done.
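If the assembly is hard to picture, here is the same bookkeeping as a toy C++ model. It's only a sketch under my own assumptions: Machine, push, call, and call_foo_1_2 are names made up for illustration, memory is a small array instead of a real stack, and the code addresses 0x1000 and 0x2008 are invented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy model of the x86 stack: 32-bit slots, growing downward.
struct Machine {
    std::vector<std::uint32_t> memory; // one element per 4-byte slot
    std::uint32_t esp;                 // byte address of the most recently pushed item
    std::uint32_t eip = 0;             // address of the next instruction to execute

    explicit Machine(std::uint32_t stack_bytes)
        : memory(stack_bytes / 4, 0), esp(stack_bytes) {}

    // Access the 4-byte slot at a (4-byte aligned) byte address.
    std::uint32_t& at(std::uint32_t address) {
        assert(address % 4 == 0 && address / 4 < memory.size());
        return memory[address / 4];
    }

    // push value  ==  sub esp, 4 ; mov [esp], value
    void push(std::uint32_t value) {
        esp -= 4;
        at(esp) = value;
    }

    // call target  ==  push <address of the next instruction> ; jmp target
    void call(std::uint32_t target, std::uint32_t next_instruction) {
        push(next_instruction);
        eip = target;
    }
};

// Simulates the caller's side of foo(1, 2) with made-up code addresses.
Machine call_foo_1_2() {
    Machine m(64);
    m.push(2);               // y, pushed first (right to left)
    m.push(1);               // x
    m.call(0x1000, 0x2008);  // pretend foo lives at 0x1000 and we return to 0x2008
    return m;
}
```

After call_foo_1_2() runs, [esp] holds the return address, [esp+4] holds x, and [esp+8] holds y, with esp twelve bytes lower than where it started.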
The final piece of information we need to construct a C++ call stack is
that functions live in memory, functions have names, and thus sections
of memory have names. If we can get access to a mapping of memory
addresses to function names (say, with the /MAP
linker option), and we can read instruction pointers up the call
stack, we can generate a symbolic stack trace.
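As a rough illustration of that mapping, here is a small C++ sketch. It assumes the MAP file has already been parsed into a table of function start addresses sorted in ascending order; Symbol, symbolize, example_symbols, and every address in the table are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

struct Symbol {
    std::uint32_t start;  // address of the function's first instruction
    std::string name;
};

// Looks up the function containing `address` in a table sorted by start
// address. Returns "<unknown>" if the address precedes every symbol.
std::string symbolize(const std::vector<Symbol>& symbols, std::uint32_t address) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
        [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));

    // Find the first symbol that starts after the address...
    auto it = std::upper_bound(symbols.begin(), symbols.end(), address,
        [](std::uint32_t addr, const Symbol& s) { return addr < s.start; });
    if (it == symbols.begin()) {
        return "<unknown>";
    }
    // ...so the one just before it is the function we are inside.
    return std::prev(it)->name;
}

// Hypothetical table; real entries would be parsed from the linker's MAP file.
std::vector<Symbol> example_symbols() {
    return {
        {0x00401000, "main"},
        {0x00401080, "foo"},
        {0x00401100, "bar"},
    };
}
```

With this table, symbolize(example_symbols(), 0x0040109c) returns foo, because that address falls between the start of foo and the start of bar. The sketch attributes any address past the last symbol to the last function; a more careful version would also track where each function ends.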
How do we read the instruction pointers up the call stack?
Unfortunately, just knowing the return address from the current
function is not enough. How do you know the location of the caller's
caller? Without extra information, you don't. Fortunately, most
functions have that information in the form of a function prologue:
push ebp
mov ebp, esp
and epilogue:
mov esp, ebp
pop ebp
These bits of code appear at the beginning and end of most functions (optimizing
compilers may omit them), allowing you to use the EBP register as the "current stack frame".
Function arguments are always accessed at positive offsets from EBP,
and locals at negative offsets:
; int foo(int x, int y)
; ...
[EBP+12] = y argument
[EBP+8] = x argument
[EBP+4] = return address (set by call instruction)
[EBP] = previous stack frame
[EBP-4] = local variable 1
[EBP-8] = local variable 2
; ...
Look! For any stack frame EBP, the caller's address is
at [EBP+4] and the previous stack frame is at [EBP].
By dereferencing EBP, we can walk
the call stack, all the way to the top!
#include <windows.h> // IsBadReadPtr
#include <vector>
struct stack_frame {
    stack_frame* previous;
    unsigned long return_address;
};
std::vector<unsigned long> get_call_stack() {
    std::vector<unsigned long> call_stack;
    stack_frame* current_frame;
    // Inline assembly as supported by the 32-bit Microsoft compiler.
    __asm mov current_frame, ebp
    // Stop as soon as the next frame pointer is not readable memory.
    while (!IsBadReadPtr(current_frame, sizeof(stack_frame))) {
        call_stack.push_back(current_frame->return_address);
        current_frame = current_frame->previous;
    }
    return call_stack;
}
// Convert the array of addresses to names with the aforementioned MAP file.
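The same walk can be tried without touching real registers. The following C++ sketch runs the loop over a simulated stack; walk_frames and example_stack are hypothetical names, the addresses are invented, and the convention that a previous-frame value of 0 ends the chain is an assumption of the sketch, not something a real stack guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Walks a chain of frames stored in a simulated stack of 4-byte slots.
// For each frame pointer: [ebp] = previous frame pointer, [ebp+4] = return address.
// A previous frame pointer of 0 marks the outermost frame (sketch-only convention).
std::vector<std::uint32_t> walk_frames(const std::vector<std::uint32_t>& stack,
                                       std::uint32_t ebp) {
    std::vector<std::uint32_t> return_addresses;
    while (ebp != 0) {
        assert(ebp % 4 == 0 && ebp / 4 + 1 < stack.size());
        return_addresses.push_back(stack[ebp / 4 + 1]); // [ebp+4]
        ebp = stack[ebp / 4];                            // [ebp]
    }
    return return_addresses;
}

// Builds a made-up call chain, main -> foo -> bar, and reports bar's frame
// pointer through innermost_ebp.
std::vector<std::uint32_t> example_stack(std::uint32_t& innermost_ebp) {
    std::vector<std::uint32_t> stack(16, 0);

    // main's frame at byte address 48: no previous frame.
    stack[48 / 4] = 0;
    stack[48 / 4 + 1] = 0x00400f00; // returns into the startup code

    // foo's frame at byte address 32: previous frame is main's.
    stack[32 / 4] = 48;
    stack[32 / 4 + 1] = 0x00401020; // returns into main

    // bar's frame at byte address 16: previous frame is foo's.
    stack[16 / 4] = 32;
    stack[16 / 4 + 1] = 0x004010a0; // returns into foo

    innermost_ebp = 16;
    return stack;
}
```

Walking from bar's frame yields the three return addresses from the innermost call outward, which is exactly the order a stack trace is usually printed in.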
Yay, now we know how to grab a stack trace from any location in the
code. This implementation is not robust, but the concepts are
correct: functions have names, functions live in memory, and we can
determine which memory addresses are on the call stack. Now that you
know how to manually grab a call stack, let Microsoft do the heavy
lifting with the StackWalk64
function.
Next time, we'll talk about setting up your very own Microsoft Symbol Server so you can
grab accurate function names from every version of your software.
This post has moved.
So far, we've implemented reporting for Python exceptions that bubble
out of the main loop, C++ exceptions that bubble into Python (and then
out of the main loop), and structured exceptions that bubble into
Python (and then out of the main loop). This is a fairly
comprehensive set of failure conditions, but there's still a big piece
missing from our reporting.
Imagine that you implement this error reporting and have customers try
the new version of your software. You'll soon have a collection of
crash reports, and one thing will stand out clearly. Without the
context in which crashes happened (call stacks, variable values,
perhaps log files), it's very hard to determine their cause(s). And
without determining their cause(s), it's very hard to fix them.
Reporting log files is easy enough. Just attach them to the error
report. You may need to deal with privacy concerns or limit the size
of the log files that get uploaded, but those are straightforward
problems.
Because Python has batteries
included, grabbing the call stack from a Python exception is
trivial. Just take a quick look at the traceback
module.
Structured exceptions are a little harder. The structure of a call
stack on x86 is machine- and sometimes compiler-dependent.
Fortunately, Microsoft provides an API to dump the relevant process
state to a file such that it can be opened in Visual
Studio or WinDbg,
which will let you view the stack trace and select other data. These
files are called minidumps, and they're pretty small. Just call MiniDumpWriteDump
with the context of the exception and submit the generated file with your crash
report.
Grabbing a call stack from C++ exceptions is even harder, and maybe
not desired. If you regularly use C++ exceptions for communicating
errors from C++ to Python, it's probably too expensive to grab a call
stack or write a minidump every single time. However, if you want to
do it anyway, here's one way.
C++ exceptions are implemented on top of the Windows kernel's
structured exception machinery. Using the try and
catch statements in your C++ code causes the compiler to
generate SEH code behind the scenes. However, by the time your C++
catch clauses run, the stack has already been unwound. Remember
that SEH has three passes: first it runs filter expressions until it
finds one that can handle the exception; then it unwinds the stack
(destroying any objects allocated on the stack); finally it runs the
actual exception handler. Your C++ exception handler runs as the last stage,
which means the stack has already been unwound, which means you can't
get an accurate call stack from the exception handler. However, we
can use SEH to grab a call stack at the point where the exception was
thrown, before we handle it...
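You can see the ordering problem in portable C++, no SEH required. This sketch (the names events, Local, thrower, and run_demo are mine, invented for illustration) records what happens when an exception propagates: the thrower's stack objects are destroyed before the body of the catch block starts running.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Records the order of events so it can be inspected afterwards.
std::vector<std::string> events;

struct Local {
    ~Local() { events.push_back("destructor of stack object"); }
};

void thrower() {
    Local on_the_stack;  // lives in thrower's stack frame
    events.push_back("throw");
    throw std::runtime_error("hi");
}

// Runs the experiment and returns the recorded order of events.
std::vector<std::string> run_demo() {
    events.clear();
    try {
        thrower();
    } catch (const std::exception&) {
        // By now thrower's frame is gone: its locals were already destroyed.
        assert(events.back() == "destructor of stack object");
        events.push_back("catch handler");
    }
    return events;
}
```

The recorded order is throw, then the destructor, then the catch handler, which is why a call stack captured inside the handler no longer includes thrower.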
First, let's determine the SEH exception code of C++ exceptions
(WARNING, this code is compiler-dependent):
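#include <windows.h>
#include <cstdio>
#include <exception>
int main() {
    DWORD code;
    __try {
        throw std::exception();
    }
    __except (code = GetExceptionCode(), EXCEPTION_EXECUTE_HANDLER) {
        // DWORD is an unsigned long, hence %lX.
        printf("%lX\n", code);
    }
}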
Once we have that, we can write our exception-catching function like
this:
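// The code printed by the program above. With Microsoft's compiler it is
// 0xE06D7363; verify this against your own toolchain.
const DWORD CPP_EXCEPTION_CODE = 0xE06D7363;
void throw_cpp_exception() {
    throw std::runtime_error("hi");
}
bool writeMiniDump(const EXCEPTION_POINTERS* ep) {
    // ...
    return true;
}
void catch_seh_exception() {
    __try {
        throw_cpp_exception();
    }
    // The filter runs before the stack is unwound, so the minidump still
    // sees the throw site. We then decline to handle the exception.
    __except (
        (CPP_EXCEPTION_CODE == GetExceptionCode()) && writeMiniDump(GetExceptionInformation()),
        EXCEPTION_CONTINUE_SEARCH
    ) {
    }
}
int main() {
    try {
        catch_seh_exception();
    }
    catch (const std::exception& e) {
        printf("%s\n", e.what());
    }
}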
Now we've got call stacks and program state for C++, SEH, and Python
exceptions, which makes fixing reported crashes dramatically easier.
Next time I'll go into more detail about how C++ stack traces work,
and we'll see if we can grab them more efficiently.
Tue, Feb. 24th, 2009, 09:22 pm RSI
Bloodpact has been wreaking havoc on my arms, so I am taking a break today. I am diagnosed with tendinitis caused by RSI. RSIGuard and Dvorak have helped a lot, but I need to resume lifting weights and stretching. Perhaps I will start using Dragon again.
This post has moved.
Previously, we discussed the implementation of automated reporting of
unhandled C++ exceptions. However, if you've ever programmed in C++,
you know that C++ exceptions are not the only way your code can fail.
In fact, the most common failures probably aren't C++ exceptions at
all. You know what I'm referring to: the dreaded access violation
(sometimes called segmentation fault).
How do we detect and report access violations? First, let's talk
about what an access violation actually is.
Your processor has a mechanism for detecting loads and stores from
invalid memory addresses. When this happens, it raises an interrupt,
which Windows exposes to the program via Structured Exception Handling
(SEH). Matt Pietrek has written an excellent article on how
SEH works, including a description of C++ exceptions implemented
on top of SEH. The gist is that there is a linked list of stack
frames that can possibly handle the exception. When an exception
occurs, that list is walked, and if an entry claims it can handle it,
it does. Otherwise, if no entry can handle the exception, the program
is halted and the familiar crash dialog box is displayed to the user.
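Here is a toy model of that walk in portable C++. It is a simulation, not the real SEH data structures: Handler, dispatch, and example_chain are hypothetical, and the handler names are invented. The two exception codes are the values Windows uses for access violations and integer division by zero.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One registered handler; can_handle plays the role of an SEH filter.
struct Handler {
    std::string name;
    std::function<bool(unsigned long code)> can_handle;
};

// Walks the chain from the innermost (most recently registered) handler
// outward and returns the name of the first one that claims the exception.
// An empty result stands in for "nobody handled it, show the crash dialog".
std::string dispatch(const std::vector<Handler>& chain, unsigned long code) {
    for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
        assert(it->can_handle); // every handler must have a filter
        if (it->can_handle(code)) {
            return it->name;
        }
    }
    return "";
}

const unsigned long ACCESS_VIOLATION = 0xC0000005; // EXCEPTION_ACCESS_VIOLATION
const unsigned long DIVIDE_BY_ZERO   = 0xC0000094; // EXCEPTION_INT_DIVIDE_BY_ZERO

// Hypothetical chain: outermost handler first, innermost last.
std::vector<Handler> example_chain() {
    return {
        {"main loop", [](unsigned long code) { return code == ACCESS_VIOLATION; }},
        {"math code", [](unsigned long code) { return code == DIVIDE_BY_ZERO; }},
    };
}
```

In this model an access violation skips the inner math handler and is claimed by the main loop, while a code that no filter accepts falls off the end of the chain.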
OK, so access violations can be detected with SEH. In fact, with the
same mechanism, we can detect all other types of structured
exceptions, including division by zero and stack overflow. What does
the code look like? It's approximately:
bool handle_exception_impl_seh(function f) {
    __try {
        // This is the previously-described C++ exception handler.
        // For various reasons, they need to be in different functions.
        // C++ exceptions are implemented in terms of SEH, so the C++
        // exception handling must be deeper in the call stack than
        // the structured exception handling.
        return handle_exception_impl_cpp(f);
    }
    // catch all structured exceptions here
    __except (EXCEPTION_EXECUTE_HANDLER) {
        PyErr_SetString(PyExc_RuntimeError, "Structured exception in C++ function");
        return true; // an error occurred
    }
}
Note the __try and __except keywords. This means we're using
structured exception handling, not C++ exception handling. The filter
expression in the __except statement evaluates to
EXCEPTION_EXECUTE_HANDLER, indicating that we always want to handle
structured exceptions. From the filter expression, you can optionally
use the GetExceptionCode
and GetExceptionInformation
intrinsics to access information about the actual error.
Now, if you write some code like:
Object* o = 0;
o->method(); // oops!
The error will be converted to a Python exception, and reported
with our existing mechanism. Good enough for now! However, there are
real problems with this approach. Can you think of them?
Soon, I'll show the full implementation of the structured
exception handler.
This post has moved.
A year ago, I explained
how the IMVU client automatically reports unexpected Python exceptions
(crashes) to us. I intended that post to be the first of a long
series that covered all of the tricks we use to detect and report
abnormal situations. Clearly, my intentions have not played out yet,
so I am going to pick up that series by describing how we catch
exceptions that occur in our C++ code. Without further ado,
Reporting C++ Exceptions
As discussed earlier, IMVU's error handling system can handle any
Python exception that bubbles out of the client's main loop and
automatically report the failure back to us so that we can fix it for
the next release. However, our application is a
combination of Python and C++, so what happens if our C++ code has a
bug and raises an uncaught C++ exception, such as std::bad_alloc
or std::out_of_range?
Most of our C++ code is exposed to Python via the excellent
Boost.Python library, which automatically catches C++ exceptions at
the boundary and translates them to Python exceptions. The
translation layer looks something like this:
bool handle_errors(function fn) {
    try {
        fn();
        return false; // no error
    }
    catch (const std::runtime_error& e) {
        // raise RuntimeError into Python
        PyErr_SetString(PyExc_RuntimeError, e.what());
    }
    catch (const std::bad_alloc&) {
        // raise MemoryError into Python
        PyErr_SetString(PyExc_MemoryError, "out of memory");
    }
    catch (const std::exception& e) {
        // raise Exception into Python
        PyErr_SetString(PyExc_Exception, e.what());
    }
    catch (...) {
        PyErr_SetString(PyExc_Exception, "Unknown C++ exception");
    }
    return true;
}
Thus, any C++ exception that's thrown by the C++ function is
caught by Boost.Python and reraised as the appropriate Python
exception, which will already be handled by the previously-discussed
crash reporting system.
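To experiment with this pattern without Python or Boost.Python installed, here is a self-contained C++ sketch. TranslatedError and translate_errors are hypothetical stand-ins I made up: instead of calling PyErr_SetString, the sketch just records which Python exception type would have been raised.

```cpp
#include <cassert>
#include <functional>
#include <new>
#include <stdexcept>
#include <string>

// Stand-in for the Python error state that PyErr_SetString would set.
struct TranslatedError {
    std::string python_type;  // name of the Python exception class
    std::string message;
};

// Runs fn and reports any C++ exception through `out`.
// Returns true if an error occurred, mirroring handle_errors above.
bool translate_errors(const std::function<void()>& fn, TranslatedError& out) {
    try {
        fn();
        return false; // no error
    }
    // Most-derived types first: a handler for std::exception placed earlier
    // would also swallow runtime_error and bad_alloc.
    catch (const std::runtime_error& e) {
        out = {"RuntimeError", e.what()};
    }
    catch (const std::bad_alloc&) {
        out = {"MemoryError", "out of memory"};
    }
    catch (const std::exception& e) {
        out = {"Exception", e.what()};
    }
    catch (...) {
        out = {"Exception", "Unknown C++ exception"};
    }
    return true;
}
```

For example, passing a function that throws std::out_of_range records the generic Exception type, because out_of_range derives from std::logic_error rather than std::runtime_error.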
Let's take another look at the client's main loop:
def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()
def main():
    try:
        mainLoop()
    except:
        # includes exception type, exception value, and python stack trace
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)
If the C++ functions called from updateAnimations() or redrawWindows()
raise a C++ exception, it will be caught by the Python error-handling
code and reported to us the same way Python
exceptions are.
Great! But is this a complete solution to the problem? Exercise
for the reader: what else could go wrong here? (Hint: we use Visual
Studio 2005 and there was a bug in catch (...) that Microsoft fixed in
Visual Studio 2008...)
I have a problem: I am very lucky. Stuff just seems to work out, without any great effort on my part. I believe this has something to do with decisions my subconscious is making, and how they guide me through life. I place great trust in my instincts, though I'll be damned if I could articulate my thought processes to you. This has been a great source of contention between me and the people who manage and work with me. Even my father recently said to me "I get frustrated that you don't back up your political viewpoints with data." Nonetheless, I trust that my instincts see the big picture.
Similarly, I know several extremely talented people whose advice I implicitly trust, even if I always play the devil's advocate with them. If I ask them to explain their advice, they always fail. Maybe they will manage to articulate some of their thought processes, but I get the feeling they're lying anyway, or at least not telling the whole story. Good advice always has hidden virtues. Their instincts know this, but that information never bubbles into the foreground.
Not everyone has good instincts, however. Some people think of the most insane and inappropriate solutions when confronted with a problem. Others propose things that seem like great ideas on the surface, but often fail to work. Further, there are plenty of examples where I will make a proposal by instinct, and it turns out to be totally wrong. On the other hand, there are examples where someone will have a choice of solutions to a problem and I will say "solution B is better than A and C", which is true, but only in hindsight down the road. What to do?
I've been wrestling with this issue since I was born, because I'm especially bad at articulating why I feel certain ways. Recently, I started reading the book Blink: The Power of Thinking Without Thinking by Malcolm Gladwell, and I hope it will provide tools for coping in a world where great decisions come from the mind's subconscious and without any explanation why.
So, actually, I have two problems. Dear reader, please help me answer these questions:
- How do you accept advice from people if they can't explain why?
- How do you give advice if you can't explain why?
The answer may lie in experience and history. If you haven't built up a lot of trust, you may need to say "Please, just give me a chance, and let the results speak for themselves." If you're experienced, you can let previous successes be your explanation. Discussing this feels weird because we live in a "why why why" world and I work at a company that prides itself on being data-driven and explicitly not opinion-driven.
Perhaps one reason agile software development is successful is that it enables individuals to use their raw talents to work towards an understood goal, without an overemphasis on ceremony, explanation, and conscious thought. Humans are amazing pattern recognition engines, and harnessing that power is surely beneficial. Put humans in teams, and you get to harness the benefits of our social machinery too. Ask these same humans to clearly articulate every decision they make before they're allowed to make it, and I guarantee you'll see a tangible reduction in their success rate.
I don't know the answers, and this is a new train of thought to me. I would love to hear your thoughts.
Thanks.
This post has moved.
Several people have personally requested that I give a brief
introduction to modern x86 (sometimes called IA32) assembly language.
For simplicity's sake, I'll stick with the 32-bit version with a flat
memory model. AMD64
(sometimes called x64) just isn't as popular as x86 yet, so this seems safe.
For some reason, there's a mythos around assembly language. People associate it with bearded gurus, assuming only ninjas can program in it, when, in principle, assembly language
is one of the simplest programming languages there is. Any complexity
stems from a particular architecture's oddities, and even though x86 is one of the
oddest of them all, I'll show you that it can be easy to read and write.
First, I'll describe the basic architecture. When programming in assembly,
there are three main concepts:
Instructions are the individual commands that tell the
computer to perform an operation. These include instructions for
adding, multiplying, comparing, copying, performing bit-wise operations,
accessing memory, and communicating with external devices. The
computer executes instructions sequentially.
Registers are where temporary values go. There is a
small, fixed set of registers available for use. Since there aren't many registers, nothing stays in
them for very long, as they are soon needed for other purposes.
Memory is where longer-lived data goes. It's a
giant, flat array of bytes (8-bit quantities). It's much slower to
access than registers, but there's a lot of it.
Before I get into some examples, let me describe the registers
available on x86. There are only 8 general-purpose registers, each of
which is 32 bits wide. They are:
EAX
EBX
ECX
EDX
ESI
EDI
EBP - used when accessing local variables or function arguments
ESP - used when calling functions
On x86, most instructions have two operands, a destination and a
source. For example, let's add two and three:
mov eax, 2 ; eax = 2
mov ebx, 3 ; ebx = 3
add eax, ebx ; eax = 2 + 3 = 5
add eax, ebx adds the values in registers eax and ebx, and stores
the result back in eax. (BTW, this is one of the oddities of x86.
Other modern architectures differentiate between destination and
source operands, which would look like add eax, ebx, ecx
meaning eax = ebx + ecx. On x86, the first operand is read and written in the same instruction.)
mov is the data movement instruction. It copies values
from one register to another, or from a constant to a register, or
from memory to a register, or from a register to memory.
Speaking of memory, let's say we want to add 2 and 3, storing the
result at address 32. Since the result of the addition is 32 bits, the result will
actually use addresses 32, 33, 34, and 35. Remember, memory is
indexed in bytes.
mov eax, 2
mov ebx, 3
add eax, ebx
mov edi, 32
mov [edi], eax ; copies 5 to address 32 in memory
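To see why the result occupies addresses 32 through 35, here is a small C++ sketch of a 32-bit store into byte-addressed memory. Memory, store32, and load32 are hypothetical helpers; the byte order shown is little-endian, which is what x86 uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A tiny stand-in for byte-addressed memory.
using Memory = std::array<std::uint8_t, 64>;

// Stores a 32-bit value at `address` the way x86 does: least significant
// byte first (little-endian), occupying address .. address + 3.
void store32(Memory& memory, std::size_t address, std::uint32_t value) {
    assert(address + 4 <= memory.size());
    for (std::size_t i = 0; i < 4; ++i) {
        memory[address + i] = static_cast<std::uint8_t>(value >> (8 * i));
    }
}

// Loads the 32-bit value stored at address .. address + 3.
std::uint32_t load32(const Memory& memory, std::size_t address) {
    assert(address + 4 <= memory.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(memory[address + i]) << (8 * i);
    }
    return value;
}
```

Calling store32(memory, 32, 5) on zeroed memory sets the byte at address 32 to 5 and writes zeros to addresses 33, 34, and 35; the neighboring bytes at 31 and 36 are untouched.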
What about loading data from memory? (Reads from memory are called
loads. Writes are called stores.) Let's write a program that copies
1000 4-byte quantities (4000 bytes) from address 10000 to address
20000.
    mov esi, 10000 ; by convention, esi is often used as the 'source' pointer
    mov edi, 20000 ; similarly, edi often means 'destination' pointer
    mov ecx, 1000 ; let's copy 1000 32-bit items
begin_loop:
    mov eax, [esi] ; load from source
    mov [edi], eax ; store to destination
    add esi, 4
    add edi, 4
    sub ecx, 1 ; ecx -= 1
    cmp ecx, 0 ; is ecx 0?
    ; if ecx does not equal 0, jump to the beginning of the loop
    jne begin_loop
    ; otherwise, we're done
This is essentially how the C memcpy function works (real implementations
are more heavily optimized). Not so bad, is
it? For reference, this is what our x86 code would look like in C:
int* src = (int*)10000;
int* dest = (int*)20000;
int count = 1000;
while (count--) {
    *dest++ = *src++;
}
From here, all it takes is a good instruction
reference, some memorization, and a bit of practice. x86 is full
of arcane details (it's 30 years old!), but once you've got the basic
concepts down, you can mostly ignore them. I hope I've shown you that writing x86
is easy. Perhaps more importantly, I hope you won't be intimidated the next time Visual Studio
shows you the assembly for your program. Understanding how the machine is executing your code
can be invaluable when debugging.
Fri, Feb. 20th, 2009, 02:24 am Leverage
This post has moved.
Every day, we prioritize how we spend our time. We can write some mundane code to solve a user's problem, create a presentation on effective use of a programming language, study a technical topic, or eat delicious cheeseburgers. Generally, we prioritize by return on investment (ROI), the value of a particular activity divided by its cost. So far, this is all obvious. (I must say, a cheeseburger sounds amazing right now...)
For a typical software development project, the investment component of ROI is usually measured in time, but sometimes there are other costs involved: monetary expenses, team morale, or perhaps even a negative impact to your customers' loyalty. It's harder to quantifiably predict the return from a particular project, but upsides can include increased revenue, decreased expenses, increased customer engagement and enjoyment, or new strategic opportunities.
There's an important component of ROI that will sound obvious as I explain it, but I've seen people fail to account for it, time and time again, in their own prioritizations. I'm referring to leverage.
Leverage: n. A force compounded by means of a lever rotating around a pivot.
If I'm an engineer and I write a bit of code that's unlikely to ever change, is rarely called, and has a minor use in the application, then it doesn't make any sense for me to spend a great deal of time refactoring and optimizing that code. By definition, work done on that code has low leverage.
However, if I write a bit of code that is unlikely to change, but will _often_ be called, then optimizing it makes more sense. The benefits of my work are multiplied by the number of times it's used. However, since it won't change much, refactoring it might not be valuable.
OK, so what about code that's a critical component of the application, will change every time the customer asks for something new, and runs in 90% of the product's use cases? Work done on this code is extremely high-leverage, and not just for the above reasons. Since this code will be changing a lot, many engineers will cut their teeth on it. Thus, its style will influence the team's approach to future work. This is the code that sets the standard by which your application is built. (If you're a software engineer, you know what code in your application I'm referring to.) It's well worth your time and your team's time to pound this code into tip-top shape.
Now, how does your team develop software? If you can educate, motivate, and inspire such that everyone around you becomes more effective, you've just applied leverage at a much larger scale. This is why effective leaders are worth their weight in gold. Leaders are success amplifiers.
Further, how does your team _improve_ at developing software? Are you the kind of leader that can create an empowering, self-improving culture, where everyone around you is able to become a leader, all aligned around the same goals? If so, then you're applying exponentially more leverage, and influencing the lives of people several degrees outward from you. In return, the people around you will enrich your life, making you even more successful. These types of leaders are at the core of any successful organization.
Remember that applying leverage requires that you see the big picture, the world in which you work. Pay attention to the people around you, because their success is intrinsically tied to yours. When you next look through your todo list, ask yourself "which task is the pivot around which I can apply myself most effectively?"