Home

Fri, Feb. 27th, 2009, 12:35 am
Reporting Crashes in IMVU: C++ Call Stacks

This post has moved.

Last time, we talked about including contextual information to help us actually fix crashes that happen in the field. Minidumps are a great way to easily save a snapshot of the most important parts of a running (or crashed) process, but it's often useful to understand the low-level mechanics of a C++ call stack (on x86). Given some basic principles about function calls, we will derive the implementation of code to walk a call stack.

C++ function call stack entries are stored on the x86 stack, which grows downward in memory. That is, pushing on the stack subtracts from the stack pointer. The ESP register points to the most-recently-written item on the stack; thus, push eax is equivalent to:

sub esp, 4
mov [esp], eax

Let's say we're calling a function:

int __stdcall foo(int x, int y)

The __stdcall calling convention pushes arguments onto the stack from right to left and returns the result in the EAX register, so calling foo(1, 2) generates this code:

push 2
push 1
call foo
; result in eax

If you aren't familiar with assembly, I know this is a lot to absorb, but bear with me; we're almost there. We haven't seen the call instruction before. It pushes the EIP register, which is the return address from the called function onto the stack and then jumps to the target function. If we didn't store the instruction pointer, the called function would not know where to return when it was done.

The final piece of information we need to construct a C++ call stack is that functions live in memory, functions have names, and thus sections of memory have names. If we can get access to a mapping of memory addresses to function names (say, with the /MAP linker option), and we can read instruction pointers up the call stack, we can generate a symbolic stack trace.

How do we read the instruction pointers up the call stack? Unfortunately, just knowing the return address from the current function is not enough. How do you know the location of the caller's caller? Without extra information, you don't. Fortunately, most functions have that information in the form of a function prologue:

push ebp
mov ebp, esp

and epilogue:

mov esp, ebp
pop ebp

These bits of code appear at the beginning and end of every function, allowing you to use the EBP register as the "current stack frame". Function arguments are always accessed at positive offsets from EBP, and locals at negative offsets:

; int foo(int x, int y)
; ...
[EBP+12] = y argument
[EBP+8]  = x argument
[EBP+4]  = return address (set by call instruction)
[EBP]    = previous stack frame
[EBP-4]  = local variable 1
[EBP-8]  = local variable 2
; ...

Look! For any stack frame EBP, the caller's address is at [EBP+4] and the previous stack frame is at [EBP]. By dereferencing EBP, we can walk the call stack, all the way to the top!

struct stack_frame {
    stack_frame*  previous;
    unsigned long return_address;
};

std::vector<unsigned long> get_call_stack() {
    std::vector<unsigned long> call_stack;

    stack_frame* current_frame;
    __asm mov current_frame, ebp

    while (!IsBadReadPtr(current_frame, sizeof(stack_frame))) {
        call_stack.push_back(current_frame->return_address);
        current_frame = current_frame->previous;
    }
    return call_stack;
}

// Convert the array of addresses to names with the aforementioned MAP file.

Yay, now we know how to grab a stack trace from any location in the code. This implementation is not robust, but the concepts are correct: functions have names, functions live in memory, and we can determine which memory addresses are on the call stack. Now that you know how to manually grab a call stack, let Microsoft do the heavy lifting with the StackWalk64 function.

Next time, we'll talk about setting up your very own Microsoft Symbol Server so you can grab accurate function names from every version of your software.

Thu, Feb. 26th, 2009, 12:29 am
Reporting Crashes in IMVU: Call Stacks and Minidumps

This post has moved.

So far, we've implemented reporting for Python exceptions that bubble out of the main loop, C++ exceptions that bubble into Python (and then out of the main loop), and structured exceptions that bubble into Python (and then out of the main loop.) This is a fairly comprehensive set of failure conditions, but there's still a big piece missing from our reporting.

Imagine that you implement this error reporting and have customers try the new version of your software. You'll soon have a collection of crash reports, and one thing will stand out clearly. Without the context in which crashes happened (call stacks, variable values, perhaps log files), it's very hard to determine their cause(s). And without determining their cause(s), it's very hard to fix them.

Reporting log files are easy enough. Just attach them to the error report. You may need to deal with privacy concerns or limit the size of the log files that get uploaded, but those are straightforward problems.

Because Python has batteries included, grabbing the call stack from a Python exception is trivial. Just take a quick look at the traceback module.

Structured exceptions are a little harder. The structure of a call stack on x86 is machine- and sometimes compiler-dependent. Fortunately, Microsoft provides an API to dump the relevant process state to a file such that it can be opened in Visual Studio or WinDbg, which will let you view the stack trace and select other data. These files are called minidumps, and they're pretty small. Just call MiniDumpWriteDump with the context of the exception and submit the generated file with your crash report.

Grabbing a call stack from C++ exceptions is even harder, and maybe not desired. If you regularly use C++ exceptions for communicating errors from C++ to Python, it's probably too expensive to grab a call stack or write a minidump every single time. However, if you want to do it anyway, here's one way.

C++ exceptions are implemented on top of the Windows kernel's structured exception machinery. Using the try and catch statements in your C++ code causes the compiler to generate SEH code behind the scenes. However, by the time your C++ catch clauses run, the stack has already been unwound. Remember that SEH has three passes: first it runs filter expressions until it finds one that can handle the exception; then it unwinds the stack (destroying any objects allocated on the stack); finally it runs the actual exception handler. Your C++ exception handler runs as the last stage, which means the stack has already been unwound, which means you can't get an accurate call stack from the exception handler. However, we can use SEH to grab a call stack at the point where the exception was thrown, before we handle it...

First, let's determine the SEH exception code of C++ exceptions (WARNING, this code is compiler-dependent):

int main() {
    DWORD code;
    __try {
        throw std::exception();
    }
    __except (code = GetExceptionCode(), EXCEPTION_EXECUTE_HANDLER) {
        printf("%X\n", code);
    }
}

Once we have that, we can write our exception-catching function like this:

void throw_cpp_exception() {
    throw std::runtime_error("hi");
}

bool writeMiniDump(const EXCEPTION_POINTERS* ep) {
    // ...
    return true;
}

void catch_seh_exception() {
    __try {
        throw_cpp_exception();
    }
    __except (
        (CPP_EXCEPTION_CODE == GetExceptionCode()) && writeMiniDump(GetExceptionInformation()),
        EXCEPTION_CONTINUE_SEARCH
    ) {
    }
}

int main() {
    try {
        catch_seh_exception();
    }
    catch (const std::exception& e) {
        printf("%s\n", e.what());
    }
}

Now we've got call stacks and program state for C++, SEH, and Python exceptions, which makes fixing reported crashes dramatically easier.

Next time I'll go into more detail about how C++ stack traces work, and we'll see if we can grab them more efficiently.

Tue, Feb. 24th, 2009, 09:22 pm
RSI

Bloodpact has been wreaking havoc on my arms, so I am taking a break today. I am diagnosed with tendinitis caused by RSI. RSIGuard and Dvorak have helped a lot, but I need to resume lifting weights and stretching. Perhaps I will start using Dragon again.

Mon, Feb. 23rd, 2009, 09:24 pm
Reporting Crashes in IMVU: Structured Exceptions

This post has moved.

Previously, we discussed the implementation of automated reporting of unhandled C++ exceptions. However, if you've ever programmed in C++, you know that C++ exceptions are not the only way your code can fail. In fact, the most common failures probably aren't C++ exceptions at all. You know what I'm referring to: the dreaded access violation (sometimes called segmentation fault).

Access Violation

How do we detect and report access violations? First, let's talk about what an access violation actually is.

Your processor has a mechanism for detecting loads and stores from invalid memory addresses. When this happens, it raises an interrupt, which Windows exposes to the program via Structured Exception Handling (SEH). Matt Pietrek has written an excellent article on how SEH works, including a description of C++ exceptions implemented on top of SEH. The gist is that there is a linked list of stack frames that can possibly handle the exception. When an exception occurs, that list is walked, and if an entry claims it can handle it, it does. Otherwise, if no entry can handle the exception, the program is halted and the familiar crash dialog box is displayed to the user.

OK, so access violations can be detected with SEH. In fact, with the same mechanism, we can detect all other types of structured exceptions, including division by zero and stack overflow. What does the code look like? It's approximately:

bool handle_exception_impl_seh(function f) {
    __try {
        // This is the previously-described C++ exception handler.
        // For various reasons, they need to be in different functions.
        // C++ exceptions are implemented in terms of SEH, so the C++
        // exception handling must be deeper in the call stack than
        // the structured exception handling.
        return handle_exception_impl_cpp(f);
    }
    // catch all structured exceptions here
    __except (EXCEPTION_EXECUTE_HANDLER) {
        PyErr_SetString(PyExc_RuntimeError, "Structured exception in C++ function");
        return true; // an error occurred
    }
}

Note the __try and __except keywords. This means we're using structured exception handling, not C++ exception handling. The filter expression in the __except statement evaluates to EXCEPTION_EXECUTE_HANDLER, indicating that we always want to handle structured exceptions. From the filter expression, you can optionally use the GetExceptionCode and GetExceptionInformation intrinsics to access information about the actual error.

Now, if you write some code like:

Object* o = 0;
o->method(); // oops!

The error will be converted to a Python exception, and reported with our existing mechanism. Good enough for now! However, there are real problems with this approach. Can you think of them?

Soon, I'll show the full implementation of the structured exception handler.

Sun, Feb. 22nd, 2009, 09:52 pm
Reporting Crashes in IMVU, Part II: C++ Exceptions

This post has moved.

A year ago, I explained how the IMVU client automatically reports unexpected Python exceptions (crashes) to us. I intended that post to be the first of a long series that covered all of the tricks we use to detect and report abnormal situations. Clearly, my intentions have not played out yet, so I am going to pick up that series by describing how we catch exceptions that occur in our C++ code. Without further ado,

Reporting C++ Exceptions

As discussed earlier, IMVU's error handling system can handle any Python exception that bubbles out of the client's main loop and automatically report the failure back to us so that we can fix it for the next release. However, our application is a combination of Python and C++, so what happens if our C++ code has a bug and raises an uncaught C++ exception, such as std::bad_alloc or std::out_of_range?

Most of our C++ code is exposed to Python via the excellent Boost.Python library, which automatically catches C++ exceptions at the boundary and translates them to Python exceptions. The translation layer looks something like this:

bool handle_errors(function fn) {
    try {
        fn();
        return false; // no error
    }
    catch (const std::runtime_error& e) {
        // raise RuntimeError into Python
        PyErr_SetString(PyExc_RuntimeError, e.what());
    }
    catch (const std::bad_alloc&) {
        // raise MemoryError into Python
        PyErr_SetString(PyExc_MemoryError, "out of memory");
    }
    catch (const std::exception& e) {
        // raise Exception into Python
        PyErr_SetString(PyExc_Exception, e.what());
    }
    catch (...) {
        PyErr_SetString(PyExc_Exception, "Unknown C++ exception");
    }
    return true;
}

Thus, any C++ exception that's thrown by the C++ function is caught by Boost.Python and reraised as the appropriate Python exception, which will already be handled by the previously-discussed crash reporting system.

Let's take another look at the client's main loop:

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        # includes exception type, exception value, and python stack trace
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

If the C++ functions called from updateAnimations() or redrawWindows() raise a C++ exception, it will be caught by the Python error-handling code and reported to us the same way Python exceptions are.

Great! But is this a complete solution to the problem? Exercise for the reader: what else could go wrong here? (Hint: we use Visual Studio 2005 and there was a bug in catch (...) that Microsoft fixed in Visual Studio 2008...)

Sun, Feb. 22nd, 2009, 02:05 am
Inarticulate Instincts

I have a problem: I am very lucky. Stuff just seems to work out, without any great effort on my part. I believe this has something to do with decisions my subconscious is making, and how they guide me through life. I place great trust in my instincts, though I'll be damned if I could articulate my thought processes to you. This has been a great source of contention between me and the people who manage and work with me. Even my father recently said to me "I get frustrated that you don't back up your political viewpoints with data." Nonetheless, I trust that my instincts see the big picture.

Similarly, I know several extremely talented people whose advice I implicitly trust, even if I always play the devil's advocate with them. If I ask them to explain their advice, they always fail. Maybe they will manage to articulate some of their thought processes, but I get the feeling they're lying anyway, or at least not telling the whole story. Good advice always has hidden virtues. Their instincts know this, but that information never bubbles into the foreground.

Not everyone has good instincts, however. Some people think of the most insane and inappropriate solutions when confronted with a problem. Others propose things that seem like great ideas on the surface, but often fail to work. Further, there are plenty of examples where I will make a proposal by instinct, and it turns out to be totally wrong. On the other hand, there are examples where someone will have a choice of solutions to a problem and I will say "solution B is better than A and C", which is true, but only in hindsight down the road. What to do?

I've been wrestling with this issue since I was born, because I'm especially bad at articulating why I feel certain ways. Recently, I started reading the book Blink: The Power of Thinking Without Thinking by Malcolm Gladwell, and I hope it will provide tools for coping in a world where great decisions come from the mind's subconscious and without any explanation why.

So, actually, I have two problems. Dear reader, please help me answer these questions:

  1. How do you accept advice from people if they can't explain why?
  2. How do you give advice if you can't explain why?

The answer may lie in experience and history. If you haven't built up a lot of trust, you may need to say "Please, just give me a chance, and let the results speak for themselves." If you're experienced, you can let previous successes be your explanation. Discussing this feels weird because we live in a "why why why" world and I work at a company that prides itself on being data-driven and explicitly not opinion-driven.

Perhaps one reason agile software development is successful is because it enables individuals to use their raw talents to work towards an understood goal, without an overemphasis on ceremony, explanation, and conscious thought. Humans are amazing pattern recognition engines, and harnessing that power is surely beneficial. Put humans in teams, and you get to harness the benefits of our social machinery too. Ask these same humans to clearly articulate every decision they make before they're allowed to make it, and I guarantee you'll see a tangible reduction in their success rate.

I don't know the answers, and this is a new train of thought to me. I would love to hear your thoughts.

Thanks.

Sat, Feb. 21st, 2009, 01:34 am
A brief introduction to modern x86 assembly language

This post has moved.

Several people have personally requested that I give a brief introduction to modern x86 (sometimes called IA32) assembly language. For simplicity's sake, I'll stick with the 32-bit version with a flat memory model. AMD64 (sometimes called x64) just isn't as popular as x86 yet, so this seems safe.

For some reason, there's a mythos around assembly language. People associate it with bearded gurus, assuming only ninjas can program in it, when, in principle, assembly language is one of the simplest programming languages there is. Any complexity stems from a particular architecture's oddities, and even though x86 is one of the oddest of them all, I'll show you that it can be easy to read and write.

First, I'll describe the basic architecture. When programming in assembly, there are three main concepts:

Instructions are the individual commands that tell the computer to perform an operation. These include instructions for adding, multiplying, comparing, copying, performing bit-wise operations, accessing memory, and communicating with external devices. The computer executes instructions sequentially.

Registers are where temporary values go. There is a small, fixed set of registers available for use. Since there aren't many registers, nothing stays in them for very long, as they ar soon needed for other purposes.

Memory is where longer-lived data goes. It's a giant, flat array of bytes (8-bit quantities). It's much slower to access than registers, but there's a lot of it.

Before I get into some examples, let me describe the registers available on x86. There are only 8 general-purpose registers, each of which is 32 bits wide. They are:

  • EAX
  • EBX
  • ECX
  • EDX
  • ESI
  • EDI
  • EBP - used when accessing local variables or function arguments
  • ESP - used when calling functions

On x86, most instructions have two operands, a destination and a source. For example, let's add two and three:

mov eax, 2   ; eax = 2
mov ebx, 3   ; ebx = 3
add eax, ebx ; eax = 2 + 3 = 5

add eax, ebx adds the values in registers eax and ebx, and stores the result back in eax. (BTW, this is one of the oddities of x86. Other modern architectures differentiate between destination and source operands, which would look like add eax, ebx, ecx meaning eax = ebx + ecx. On x86, the first operand is read and written in the same instruction.)

mov is the data movement instruction. It copies values from one register to another, or from a constant to a register, or from memory to a register, or from a register to memory.

Speaking of memory, let's say we want to add 2 and 3, storing the result at address 32. Since the result of the addition is 32 bits, the result will actually use addresses 32, 33, 34, and 35. Remember, memory is indexed in bytes.

mov eax, 2
mov ebx, 3
add eax, ebx
mov edi, 32
mov [edi], eax ; copies 5 to address 32 in memory

What about loading data from memory? (Reads from memory are called loads. Writes are called stores.) Let's write a program that copies 1000 4-byte quantities (4000 bytes) from address 10000 to address 20000.

mov esi, 10000 ; by convention, esi is often used as the 'source' pointer
mov edi, 20000 ; similarly, edi often means 'destination' pointer
mov ecx, 1000 ; let's copy 1000 32-bit items

begin_loop:
mov eax, [esi] ; load from source
mov [edi], eax ; store to destination
add esi, 4
add edi, 4

sub ecx, 1 ; ecx -= 1
cmp ecx, 0 ; is ecx 0?

; if ecx does not equal 0, jump to the beginning of the loop
jne begin_loop
; otherwise, we're done

This is how the C memcpy function works. Not so bad, is it? For reference, this is what our x86 code would look like in C:

int* src = (int*)10000;
int* dest = (int*)20000;
int count = 1000;
while (count--) {
    *dest++ = *src++;
}

From here, all it takes is a good instruction reference, some memorization, and a bit of practice. x86 is full of arcane details (it's 30 years old!), but once you've got the basic concepts down, you can mostly ignore them. I hope I've shown you that writing x86 is easy. Perhaps more importantly, I hope you won't be intimidated the next time Visual Studio shows you the assembly for your program. Understanding how the machine is executing your code can be invaluable when debugging.

Fri, Feb. 20th, 2009, 02:24 am
Leverage

This post has moved.

Every day, we prioritize how we spend our time. We can write some mundane code to solve a user's problem, create a presentation on effective use of a programming language, study a technical topic, or eat delicious cheeseburgers. Generally, we prioritize by return on investment (ROI), the value of a particular activity divided by its cost. So far, this is all obvious. (I must say, a cheeseburger sounds amazing right now...)

For a typical software development project, the investment component of ROI is typically measured in time, but sometimes there are other costs involved: monetary expenses, team morale, or perhaps even a negative impact to your customers' loyalty. It's harder to quantifiably predict the return from a particular project, but upsides can include increased revenue, decreased expenses, increased customer engagement and enjoyment, or new strategic opportunities.

There's an important component of ROI that will sound obvious as I explain it, but I've seen people fail to account for it, time and time again, in their own prioritizations. I'm referring to leverage.

Leverage: n. A force compounded by means of a lever rotating around a pivot.

If I'm an engineer and I write a bit of code that's unlikely to ever change, is rarely called, and has a minor use in the application, then it doesn't make any sense for me to spend a great deal of time refactoring and optimizing that code. By definition, work done on that code has low leverage.

However, if I write a bit of code that is unlikely to change, but will _often_ be called, then optimizing it makes more sense. The benefits of my work are multiplied by the number of times its used. However, since it won't change much, refactoring it might not be valuable.

OK, so what about code that's a critical component of the application, will change every time the customer asks for something new, and runs in 90% of the product's use cases? Work done on this code is extremely high-leverage, and not just for the above reasons. Since this code will be changing a lot, many engineers will cut their teeth on it. Thus, its style will influence the team's approach to future work. This is the code that sets the standard by which your application is built. (If you're a software engineer, you know what code in your application I'm referring to.) It's well worth your time and your team's time to pound this code into tip-top shape.

Now, how does your team develop software? If you can educate, motivate, and inspire such that everyone around you becomes more effective, you've just applied leverage at a much larger scale. This is why effective leaders are worth their weight in gold. Leaders are success amplifiers.

Further, how does your team _improve_ at developing software? Are you the kind of leader that can create an empowering, self-improving culture, where everyone around you is able to become a leader, all aligned around the same goals? If so, then you're applying exponentially more leverage, and influencing the lives of people several degrees outward from you. In return, the people around you will enrich your life, making you even more successful. These types of leaders are at the core of any successful organization.

Remember that applying leverage requires that you see the big picture, the world in which you work. Pay attention to the people around you, because their success is intrinsically tied to yours. When you next look through your todo list, ask yourself "which task is the pivot around which I can apply myself most effectively?"

Tue, Feb. 17th, 2009, 11:41 pm
Evaluating JavaScript in an Embedded XULRunner/Gecko Window

This post has moved.

I intended to write something with more substance tonight, but I'm exhausted from wrasslin' with Gecko/XULRunner/SpiderMonkey in a days-long marathon debugging session. None of you will understand this entry, because its intent is to contain enough keywords and content that others don't have to go through the pain that I did.

If you're embedding Gecko/XULRunner/SpiderMonkey into your application, and you want to evaluate some JavaScript in the context of an nsIDOMWindow or nsIWebBrowser, you'd think you'd have many approaches. You could call JS_EvaluateScript or JS_EvaluateUCScript directly, getting the JSContext from the nsIScriptContext and the JSObject* global from the nsIScriptGlobalObject... However, I simply could not get this to work: I kept running into crazy errors inside of JS_InitArrayClass. I still don't understand those errors.

People suggested using EvaluateString and EvaluateStringWithValue on nsIScriptContext, but that failed in an empty window (I define empty as not having called nsIWebNavigation::LoadURI) because it did not have a security principal (nsIPrincipal). Eventually I learned that you can grab the system principal from the nsIScriptSecurityManager service and pass that directly to EvaluateStringWithValue. With a few more minor details, this approach worked in all cases that we care about so far!

Here is the final magic incantation:

typedef std::map<jsval, boost::python::object> ReferenceMap;

boost::python::object GeckoWindow::evalJavaScript(const std::wstring& js) {
    nsresult rv;

    nsCOMPtr<nsIPrincipal> principal;
    nsCOMPtr<nsIScriptSecurityManager> secMan = do_GetService(
        NS_SCRIPTSECURITYMANAGER_CONTRACTID);
    rv = secMan->GetSystemPrincipal(getter_AddRefs(principal));
    if (NS_FAILED(rv)) {
        throw GeckoError("Failed to get system principal");
    }

    nsCOMPtr<nsIScriptGlobalObject> sgo = do_GetInterface(webBrowser);
    nsCOMPtr<nsIScriptContext> ctx = sgo->GetContext();

    JSContext* cx = reinterpret_cast<JSContext*>(ctx->GetNativeContext());
    Assert(cx);
    uint32 previous = JS_SetOptions(
        cx,
        JS_GetOptions(cx) | JSOPTION_DONT_REPORT_UNCAUGHT);

    jsval out;
    rv = ctx->EvaluateStringWithValue(
        nsString(js.data(), js.size()),
        sgo->GetGlobalJSObject(),
        principal,
        "mozembed",
        0,
        nsnull,
        &out,
        nsnull);

    JS_SetOptions(cx, previous);

    JSAutoRequest ar(cx);
    JSAutoLocalRootScope alrs(cx);

    maybeThrowPythonExceptionFromJsContext(cx);

    if (NS_SUCCEEDED(rv)) {
        ReferenceMap references;
        return buildPythonObjectFromJsval(references, cx, out);
    } else {
        throw GeckoEvalUnknownError("eval failed with no exception set");
    }
}

void GeckoWindow::maybeThrowPythonExceptionFromJsContext(JSContext* cx) {
    jsval exception;
    if (JS_GetPendingException(cx, &exception)) {
        JS_ClearPendingException(cx);
        ReferenceMap references;
        boost::python::object py_exc_value(buildPythonObjectFromJsval(
            references,
            cx,
            exception));
        throw GeckoEvalError(py_exc_value.ptr());
    }
}

boost::python::object GeckoWindow::buildPythonObjectFromJsval(
    ReferenceMap& references,
    JSContext* cx,
    const jsval v
) {
    using namespace boost::python;

    if (v == JSVAL_TRUE) {
        return object(handle<>(Py_True));
    } else if (v == JSVAL_FALSE) {
        return object(handle<>(Py_False));
    } else if (v == JSVAL_NULL) {
        return object(handle<>(Py_None));
    } else if (v == JSVAL_VOID) {
        return object(handle<>(Py_None));
    } else if (JSVAL_IS_INT(v)) {
        return object(handle<>(PyInt_FromLong(JSVAL_TO_INT(v))));
    } else if (JSVAL_IS_NUMBER(v)) {
        return object(handle<>(PyFloat_FromDouble(*JSVAL_TO_DOUBLE(v))));
    // } else if (JSVAL_IS_STRING(v)) {
    //
    } else if (JSVAL_IS_OBJECT(v)) {
        JSObject* obj = JSVAL_TO_OBJECT(v);

        if (references.count(v)) {
            return references[v];
        }

        if (JS_IsArrayObject(cx, obj)) {
            list rv;
            references[v] = rv;
            jsuint length;
            if (JS_GetArrayLength(cx, obj, &length)) {
                jsval element;
                for (jsuint i = 0; i < length; ++i) {
                    if (JS_GetElement(cx, obj, i, &element)) {
                        rv.append(buildPythonObjectFromJsval(references, cx, element));
                    }
                }
            }
            return rv;
        } else {
            dict rv;
            references[v] = rv;

            JSObject* iterator = JS_NewPropertyIterator(cx, obj);
            if (!iterator) {
                throw GeckoEvalUnknownError("Error creating object property iterator while marshalling");
            }
            for (;;) {
                jsid propertyName;
                if (!JS_NextProperty(cx, iterator, &propertyName)) {
                    throw GeckoEvalUnknownError("Error enumerating property list of object while marshalling");
                }

                if (propertyName == JSVAL_VOID) {
                    break;
                }

                jsval propertyNameValue;
                jsval propertyValue;
                object k;

                if (!JS_IdToValue(cx, propertyName, &propertyNameValue)) {
                    throw GeckoEvalUnknownError("Error converting property name to jsval while marshalling");
                }
                if (JSVAL_IS_INT(propertyNameValue)) {
                    jsint propertyIndex = JSVAL_TO_INT(propertyNameValue);
                    k = long_(propertyIndex);

                    if (!JS_LookupElement(cx, obj, propertyIndex, &propertyValue)) {
                        throw GeckoEvalUnknownError("Error looking up property value by index");
                    }
                } else if (JSVAL_IS_STRING(propertyNameValue)) {
                    JSString* kjsstr = JSVAL_TO_STRING(propertyNameValue);
                    std::wstring kstr(JS_GetStringChars(kjsstr), JS_GetStringLength(kjsstr));
                    k = object(kstr);

                    if (!JS_LookupUCProperty(cx, obj, kstr.c_str(), kstr.size(), &propertyValue)) {
                        throw GeckoEvalUnknownError("Error looking up property value by name");
                    }
                } else {
                    throw GeckoEvalUnknownError("Unknown property name type while marshalling");
                }

                rv[k] = buildPythonObjectFromJsval(references, cx, propertyValue);
            }
            return rv;
        }
    } else {
        // We don't know what type it is, or we can't marshal it,
        // so convert it to a string and hope for the best...
        JSString* string = JS_ValueToString(cx, v);
        return str(std::wstring(JS_GetStringChars(string), JS_GetStringLength(string)));
    }
}

Hope that helps, and Godspeed.

Fri, Feb. 13th, 2009, 10:57 pm
Latency vs. Throughput

This post has moved.

This is my last post about processors and performance, I swear! Plus, my wrists are starting to hurt from this bloodpact thing (as I'm diagnosed with RSI), so I think this will be a light one.

As I've discussed previously, modern desktop processors work really hard to exploit the inherent parallelism in your programs. This is called instruction-level parallelism, and is one of the techniques processors use to get more performance out of slower clock rates (along with data-level parallelism (SIMD) or multiple cores (MIMD)*). Previously, I waved my hands a bit and said "The processor makes independent operations run in parallel." Now I'm going to teach you how to count cycles in the presence of latency and parallelism.

Traditionally, when analyzing the cost of an algorithm, you would simply count the operations involved, sum their costs in cycles, and call it a day. These days, it's not that easy. Instructions have two costs: dependency chain latency and reciprocal throughput.

Reciprocal throughput is simply the reciprocal of the maximum throughput of a particular instruction. Throughput is measured in instructions/cycle, so reciprocal throughput is cycles/instruction.

OK, that sounds like the way we've always measured performance. So what's dependency chain latency? When the results of a previous calculation are needed for another calculation, you have a dependency chain. In a dependency chain, you measure the cost of an instruction by its latency, not its reciprocal throughput. Remember that our processors are working really hard to exploit parallelism in our code. When there is no instruction-level parallelism available, we get penalized.

Let's go back to our sum 10000 numbers example, but unroll it a bit:

float array[10000];
float sum = 0.0f;
for (int i = 0; i < 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;
In x86:
xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:
addss xmm0, [esi+0]
addss xmm0, [esi+4]
addss xmm0, [esi+8]
addss xmm0, [esi+12]
addss xmm0, [esi+16]
addss xmm0, [esi+20]
addss xmm0, [esi+24]
addss xmm0, [esi+28]

add esi, 32
add ecx, 1
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; xmm0 = total sum

Since each addition to sum in the loop depends on the previous addition, these instructions are a dependency chain. On a modern processor, let's say the reciprocal throughput of addss is 1 cycle. However, the minimum latency is 4 cycles. Since every instruction depends on the previous, each addition costs 4 cycles.

As before, let's try summing with four temporary sums:

xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

; top = sum0

begin:
addss xmm0, [esi+0]
addss xmm1, [esi+4]
addss xmm2, [esi+8]
addss xmm3, [esi+12]
addss xmm0, [esi+16]
addss xmm1, [esi+20]
addss xmm2, [esi+24]
addss xmm3, [esi+28]

add esi, 32
add ecx, 1
cmp ecx, 10000
jl begin ; if ecx < 10000, goto begin

; accumulate sums
addss xmm0, xmm1
addss xmm2, xmm3 ; this instruction happens in parallel with the one above
addss xmm0, xmm2

Here, the additions in the loop that depend on each other are 4 cycles apart, meaning the minimum latency is no longer a problem. This lets us hit the maximum addition rate of one per cycle.

Removing dependency chains is a critical part of optimizing on today's processors. The Core 2 processor has six execution units, three of which are fully 128-bit SIMD ALUs. If you can restructure your algorithm so calculations happen independently, you can take advantage of all of them. (And if you can pull off making full use of the Core 2's ALU capacity, you win.)

* BTW, it's sort of unrelated, but I couldn't help but link this article. Greg Pfister has an interesting comparison and history of SIMD vs. MIMD here. Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.

Thu, Feb. 12th, 2009, 11:16 pm
Running Time -> Algebra -> Hardware

This post has moved.

I'm going to talk about something which should be obvious, but I continue to see people optimizing code in the wrong order (*cough* including myself *cough*). So, here's a reminder. When you're optimizing a bit of code...

FIRST

Make sure your algorithmic running time is right. O(1) is almost always faster than O(N), and O(N^2) is right out. Often these optimization exercises involve changing some O(N) to O(M), where M is smaller than N.

I'll give an example. Drawing a frame of 3D graphics in IMVU is O(N) where N is the sum of all vertices from all products on all objects loaded into a scene. We recently implemented view frustum culling, which skips drawing objects that are known to be off-screen. This reduces the rendering time from O(N) to O(M) where M<N and M is the number of vertices from products that are visible. If we implemented View Independent Progressive Meshes, we could reduce the time to O(P) where P is the number of vertices that contribute to the visible detail of the scene, and P<M<N.

However, make sure to avoid algorithms with good running times but huge constants. This is why, when CPUs got fast and random memory accesses got slow, searching an O(N) array (or std::vector) is often faster than searching an O(log N) tree (or std::map). The tree will miss cache far more often.

SECOND

Then, use all of the algebra, set theory, and logic you know to reduce the number of operations required, in order of operation cost.

Let's say we're going to calculate the diffuse reflectance on a surface: N dot L, where N and L are three-vectors, N is the normal of the surface, and L is the direction to the light.

The naive normalize(N) dot normalize(L) is...

float lengthN = sqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
float lengthL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
float dot =
    (N.x / lengthN) * (L.x / lengthL) +
    (N.y / lengthN) * (L.y / lengthL) +
    (N.z / lengthN) * (L.z / lengthL);

... which turns out to be 6 additions, 9 multiplications, 6 divisions, and 2 square roots. Let's say additions and multiplications are 2 cycles, and divisions and square roots are 40 cycles. This gives us a total of 6*2 + 9*2 + 6*40 + 2*40 = 350 cycles.

Instead, let's do a bit of algebra:

  normalize(N) dot normalize(L)
= N/|N| dot L/|L|
= (N dot L) / (|N||L|)
= (N dot L) / sqrt((N dot N) * (L dot L))

The new calculation is...

float lengthSquaredN = N.x*N.x + N.y*N.y + N.z*N.z;
float lengthSquaredL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
float NdotL          = N.x*L.x + N.y*L.y + N.z*L.z;
float dot =          NdotL / sqrtf(lengthSquaredN * lengthSquaredL);

... 6 additions, 10 multiplications, 1 division, and 1 sqrt: 6*2 + 10*2 + 1*40 + 1*40 = 112 cycles. Huge improvement just by applying basic math.

THIRD

Once you're done optimizing algebraically, read your processor manuals and take full advantage of the hardware. If you've got SSE4, you can do the dot products in one instruction (DPPS), and an approximate reciprocal square root in another (RSQRTSS), which can give another huge improvement.

The reason you want to optimize in this order is that algorithmic improvements reduce the amount of work you have to do, making it less important to make that work fast. A hardware-optimized O(N^2) algorithm can be easily beaten by an unoptimized O(N log N) algorithm. Remember, Chad, the next time you schedule optimization projects, consider downstream effects such as these.

Wed, Feb. 11th, 2009, 11:09 pm
A Simple Introduction to Superscalar, Out-of-Order Processors

This post has moved.

Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors. I'd heard these terms a million times, but didn't know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips (Practitioners). (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)

Superscalar basically means "greater than 1", implying that a superscalar processor can run code faster than its clock speed would suggest. Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle, which is 12 billion integer additions per second!

Out-of-order means just that - the processor looks at your code at runtime and executes it in parallel if it can. For example, imagine this code:

// three independent, non-null pointers
int* p; int* q; int* r;
const int flag1, flag2, flag3;

if (*p & flag1) {
    if (*q & flag2) {
        if (*r & flag3) {
            do_work();
        }
    }
}

The processor can't assume that q is a valid pointer until it checks p, and the same for r and q. Accessing main memory costs ~200 cycles, so if none of the pointers point to cached memory, you just spent 600 cycles determining whether to do_work(). This is called a "dependency chain", where the result of a later calculation depends on the previous. But what if you know that p, q, and r will all be valid pointers? You can rewrite as:

const int x = *p;
const int y = *q;
const int z = *r;
if ((x & flag1) && (y & flag2) && (z & flag3)) {
    do_work();
}

Now, the processor knows that all of those memory fetches are independent, so it runs them in parallel. Then, it runs the ANDs in parallel too, since they're independent. Your 600-cycle check just became 200 cycles.

Similarly, let's say you want to add 10,000 numbers.

int sum = 0;
for (int i = 0; i < 10000; ++i) {
    sum += array[i];
}
return sum;

Let's assume the loop overhead and memory access is free, and each addition takes one cycle. Since each addition depends on the previous value of sum, they must be executed serially, taking 10000 cycles. However, you know that addition is associative, you can sum with two variables:

int sum1 = 0;
int sum2 = 0;
for (int i = 0; i < 10000; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
return sum1 + sum2;

Now you have two independent additions, which can be executed in parallel! The loop takes 5000 cycles now. If you independently sum in sum1, sum2, sum3, and sum4, the loop will take 2500 cycles. And so on, until you've hit the IPC (instructions per cycle) limit on your processor. If you're making effective use of your SIMD units, you'd be surprised at how much work you can do in parallel...

And that's what an out-of-order, superscalar processor can do for you!

Wed, Feb. 11th, 2009, 12:20 am
Logic vs. Array Processing

This post has moved.

I've always been amused by the Java vs. C++ performance arguments:

  • "Java's faster than C++!"
  • "No it's not!"
  • "Yeah it is, look at this benchmark!"
  • "Well look how much longer the Java version of program takes to start!"

Back and forth and back and forth. The fact is, they're both right, and here's why. I mentally separate code into either of two categories, logic or array processing:

  1. 3D rasterization is obviously array processing.
  2. Video playback is also array processing.
  3. Calculating your tax refund is logic.
  4. Loading a PDF is definitely logic.

Often the line is blurry, but array processing involves running a relatively small set of rules over a lot of homogenous data. Computers are very, very good at this kind of computation, and specialized hardware such as a GPU can increase performance by orders of magnitude. Ignoring memory bandwidth, a desktop CPU can multiply billions of floating point numbers per second, and a fast GPU can multiply trillions.

At the other extreme, logic code tends to be full of branches, function calls, dependent memory accesses, and often it executes code that hasn't been run in minutes. Just think about the set of operations that happen when you open a file in Word. Computers aren't so good at these types of operations, and as Moore's Law continues, they tend not to improve as rapidly as array computation does.

Back to Java vs. C++. The synthetic benchmarks that compare Java and C++ performance tend to be tight loops, simply because accurate measurement requires it. This gives the JVM time to prime its JIT/prediction engines/what have you, so I'd expect a good result. Heck, I'd expect a good result from the modern JavaScript tracing engines.*

The lesson here is that, for array processing, it's very little work to make full use of the hardware at hand. Because the amount of code is limited (and the amount of data is large), time spent optimization has high leverage.

On the other hand, logic code is messy and spread out, often written by entire teams of people. Its performance is dominated by your programming language and the team's vocabulary of idioms. Truly optimizing this kind of code is hard or impossible. It can be done, but you often have to retrain your team to make sure the benefits stick.

This is a reason that the choice of programming language(s) and libraries has such a big effect on the responsiveness of a desktop application, and one of the reasons why people can "feel" the programming language in which a project was written. Typical desktop application usage patterns are dominated by random, temporally sparse actions, so code size, "directness", and working set are primary performance factors. (Anecdote: Andy's rewriting the IMVU client's windowing framework so it's a bajillion times simpler, and when he had the client running again, he exclaimed "Hey, resizing the 3D window is twice as responsive!")

Perhaps there's an argument here for the creation of more project-specific programming languages (GOAL, TreeHydra, DSLs), so that performance improvements can be applied universally across the codebase.

With disk and memory speeds improving so much more slowly than CPU speeds, the difference between a snappy desktop application and a sluggish application is a handful of page faults. When choosing a technology platform for a project, it's worth considering the impact to overall responsiveness down the road. And I'm pretty sure I just recommended writing your entire application in C++, which sounds insane, even to me. I'll leave it at that.

* By the way, I'm not picking on Java or promoting C++ in particular. You could make these same arguments between any "native" language and "managed" language. The blocking and tackling of loading applications, calling functions, and keeping memory footprint low are important.

Mon, Feb. 9th, 2009, 09:54 pm
A Global Approach to Optimization

This post has moved.

Since joining IMVU, I have had two people tell me "Profilers (especially VTune) suck. They don't help you optimize anything." That didn't make sense to me: how could a tool that gives such amazingly detailed data about the execution of your program be useless?

Now, these are two very smart people, so I knew there had to be some truth in what they were saying. I just had to look for it. At least a year later, after I'd dramatically improved the performance of some key subsystems in the IMVU client, I realized what they meant. They should have said: "Don't just run a profiler to find out which functions are taking the most time, and make them execute faster. Take a global approach to optimization. Prevent those functions from being called at all."

There you have it: take a global approach to optimization. But how does that work? First, let me ramble a bit about the benefits of performance.

There are two types of features:

  1. interactive
  2. not interactive (i.e. slow)

Searching on Google, opening a Word document, and firing a gun in Team Fortress 2 are all interactive.

Compressing large files, factoring large primes, and downloading HD movies are not.

We wish all features were interactive, but computers can't do everything instantly. Sometimes, however, a feature switches from non-interactive to interactive, to dramatic effect. Remember way back? Before YouTube? Downloading videos took forever, and probably wouldn't even play. YouTube made video consumption so fast and so easy that it changed the shape of the internet. Similarly, thanks to Google, it's faster to search the internet for something than it is to search your own hard drive in Windows XP.

Anyway, if you truly want to make something as fast as it can be, you need to think like this:

  • What's the starting state A?
  • What's the ending state B?
  • What's the minimal set of operations to get from A to B, and how do I execute them as fast as possible?

Optimizing your existing, naive code won't get you there. You'll have to build your application around these goals. There's plenty of room for out-of-the-box thinking here. Take MacOS X's resumption-from-hibernation feature:

  • The starting state: the computer is off, the memory is saved to disk.
  • The ending state: the computer is on and the user is productive.

MacOS X takes advantage of the fact that this is not purely a technology problem. The user has to remember what they were doing and become reattached to the computer. Thus, they show you a copy of what was on your screen last to remind you what was happening while the computer prepares for your actions. Opportunities for this kind of parallelism abound: why is it that operating system installers ask you questions, download packages, and install them serially? There is dramatic room for improvement there.

I don't claim that IMVU's website is the fastest website out there, but here's an example of a type of optimization that takes the whole picture into account: when you start loading a page on IMVU.com, it optimistically fetches hundreds of keys from our memcache servers, before even looking up your customer information. It's probable that you'll need many of those keys, and it's faster to get them all at once than to fetch them as you need them.

Someday, I hope to apply this global optimization approach to a software build system (a la Make, SCons, MSBuild). It's insane that we don't have a build system with all of the flexibility of SCons and instantaneous performance. Sure, the first build may need to scan for dependencies, but there's no reason that a second build couldn't reuse the information from the first build and start instantaneously. Just have a server process that watches for changes to files and updates the internal dependency graph. On large projects, I've seen SCons take minutes to figure out that it has to rebuild one file, which is simply crazy.

When optimizing a feature, take the user into consideration, and write down the minimum set of steps between the starting state and ending state. Execute those steps as fast as you can, and run them in parallel if it helps. If technology has advanced enough, maybe you have just transformed something non-interactive into an interactive feature.

Sun, Feb. 8th, 2009, 08:24 pm
The Real Benefit of Inlining Functions (or: Floating Point Calling Conventions)

This post has moved.

My mental model for the performance benefit of inlining a function call was:

  1. code size increases
  2. the overhead of the call, including argument and return value marshalling, is eliminated
  3. the compiler knows more information, so it can generate better code

I had dramatically underestimated the value of #3, so this entry is an attempt to give a concrete example of how inlining can help.

As alluded in my previous entry, you can't just leave the floating point state willy nilly across function calls. Every function should be able to make full use of the floating point register stack, which doesn't work if somebody has left stale values on it. In general, these rules are called calling conventions. Agner Fog has excellent coverage of the topic, as usual.

Anyway, back to inlining. The specifics aren't that important, but we had a really simple function in the IMVU client which continued to show up in the profiles. It looked something like this:

std::vector<float> array;

float function() {
    float sum = 0.0f;
    for (size_t i = 0; i < array.size(); ++i) {
        sum += array[i];
    }
    return sum;
}

This function never operated on very large lists, and it also wasn't called very often, so why was it consistently in the profiles? A peek at the assembly showed (again, something like):

fldz
fstp dword ptr [sum] ; sum = 0.0

xor ecx, ecx ; i = 0
jmp cmp

loop:

push ecx
call array.operator[]

fadd [sum] ; return value of operator[] in ST(0)
fstp [sum] ; why the load and the store??

add ecx, 1

cmp:

call array.size()
cmp ecx, eax
jb loop ; continue if i < return value

fld [sum] ; return value

First of all, why all of the function calls? Shouldn't std::vector be inlined? But more importantly, why does the compiler keep spilling sum out to the stack? Surely it could keep the sum in a floating point register for the entire calculation.

This is when I realized: due to the calling convention requirements on function calls, the floating point stack must be empty upon entry into the function. The stack is in L1 cache, but still, that's three cycles per access, plus a bunch of pointless load and store uops.

Now, I actually know why std::vector isn't inlined. For faster bug detection, we compile and ship with bounds checking enabled on STL containers and iterators. But in this particular situation, the bounds checking isn't helpful, since we're iterating over the entire container. I rewrote the function as:

std::vector<float> array;

float function() {
    const float* p = &array[0];
    size_t count = array.size();
    float sum = 0.0f;
    while (count--) {
        sum += *p++;
    }
    return sum;
}

And the compiler generated the much more reasonable:

call array.size()
mov ecx, eax ; ecx = count

push 0
call array.operator[]
mov esi, eax ; esi = p

fldz ; ST(0) = sum

jmp cmp
loop:

fadd [esi] ; sum += *p

add esi, 4 ; p++
sub ecx, 1 ; count--

cmp:
cmp ecx, 0
jne loop

; return ST(0)

This is the real benefit of inlining. Modern compilers are awesome at making nearly-optimal use of the CPU, but only when they have enough information. Inlining functions gives them that information.

p.s. I apologize if my pseudo-assembly had mistakes. I wrote it from memory.

Sat, Feb. 7th, 2009, 06:16 pm
#IND and #QNaN with /fp:fast

This post has moved.

The other day Timothy and I were optimizing some floating-point-intensive lighting code. Looking at the generated code, I realized we weren't compiling with /fp:fast. Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for the sake of consistent results.

See, the problem is that the floating point registers on x86 are 80 bits wide, so if you compile "float x, y, z, w; w = (x + y) * z" as...
fld [x]  ; ST0 = x
fadd [y] ; ST0 = x + y
fmul [z] ; ST0 = (x + y) * z
fstp [w] ; w = (x + y) * z

... the temporary results are always stored in ST0 with 80 bits of precision. However, since floats only have 32 bits of precision, you can wind up with different results depending on compilers, optimization settings, register allocation, etc. We often had problems like this at VRAC. Some poor engineering student would send out a panicked e-mail at 9:00 p.m. asking why his program started producing different results in release mode than it did in debug mode.

Thus, Visual C++ takes a more cautious approach. By default, it stores float intermediates back to memory to truncate them to 32 bits of precision:

fld [x]
fadd [y]
fstp [temp] ; truncate precision
fld [temp]
fmul [z]
fstp [w]

Tiny differences in precision don't matter in IMVU, so enabling /fp:fast saved 50-100 CPU cycles per vertex in our vertex lighting loop. However, with this option turned on, our automated tests started failing with crazy #IND and #QNAN errors!

After some investigation, we discovered that our 4x4 matrix inversion routine (which calculates several 2x2 and 3x3 determinants) was using all 8 floating point registers with /fp:fast enabled. The x87 registers are stored in a "stack", where ST0 is the top of the stack and STi is the i'th entry. Load operations like fld, fld1, and fldz push entries on the stack. Arithmetic operations like fadd and fmul operate on the top of the stack with the value in memory, storing the result back on the stack.

But what if the x87 register stack overflows? In this case, an "indefinite" NAN is loaded instead of the value you requested, indicating that you have lost information. (The data at the bottom of the stack was lost.) Here's an example:

fldz  ; ST0 = 0
fld1  ; ST0 = 1, ST1 = 0
fldpi ; ST0 = pi, ST1 = 1, ST2 = 0
fldz
fldz
fldz
fldz
fldz  ; ST0-4 = 0, ST5 = pi, ST6 = 1, ST7 = 0
fldz  ; ST0 = IND!

Woops, there's a bug in your code! You shouldn't overflow the x87 register stack, so the processor has given you IND.

Indeed, this is what happened in our matrix inversion routine. But why?

Using a debugger, we determined that the x87 stack contained one value at the start of the function. Moreover, it contained a value at the start of the test! Something was fishy. Somebody was leaving the x87 stack dirty, and we needed to find out who.

void verify_x87_stack_empty() {
    unsigned z[8];
    __asm {
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fstp dword ptr [z+0x00]
        fstp dword ptr [z+0x04]
        fstp dword ptr [z+0x08]
        fstp dword ptr [z+0x0c]
        fstp dword ptr [z+0x10]
        fstp dword ptr [z+0x14]
        fstp dword ptr [z+0x18]
        fstp dword ptr [z+0x1c]
    }

    // Verify bit patterns. 0 = 0.0
    for (unsigned i = 0; i < 8; ++i) {
        CHECK_EQUAL(z[i], 0);
    }
}

The previous function, called before and after every test, discovered the culprit: we had a test that intentionally called printf() and frexp() with NaN values, which had the side effect of leaving the floating point stack in an unpredictable state.

Adding __asm emms to the end of the test fixed our problem: thereafter, /fp:fast worked wonderfully. Case closed.

Fri, Feb. 6th, 2009, 09:04 pm
Hello!

Timothy has somehow convinced a gaggle of us nerdy types to spend hours blogging, every night, for 30 days straight. I intend to use this as an opportunity to talk about some technical topics that interest me. Before I dive in, however, I'd like to introduce myself.

My name is Chad, and I love software.

I spent years of my childhood creating paper and pencil games, which I would foist upon my brother and cousins and schoolmates. When I later discovered that you could program computers to play games too, I knew exactly what my career would be.

My formal education in computer science gave me the rigor to solve larger problems, and my graduate education in human-computer interaction taught me that minor technical decisions can have a large impact on your customers.

I've always been blessed, and having the opportunity to be the first engineering hire at IMVU is no exception. Helping a startup grow from "Will this thing even work?" to "Wow, that's real revenue" to "My work affects millions of people" has been nothing short of amazing. On top of that, I've been able to work with and learn from the most talented people in Silicon Valley.

Hopefully I can spread some of their wisdom with this blog.

Sat, Oct. 25th, 2008, 05:29 pm
10 Pitfalls of Dirty Code

This post has moved.

10 Pitfalls of Dirty Code

(Disclaimer: These are my opinions and not the opinions of IMVU or its founders. I'm sure we all have different perspectives.)

A History of IMVU's Development Process

IMVU was started with a particular philosophy: We don't know what customers will like, so let's rapidly build a lot of different stuff and throw away what doesn't work. This was an effective approach to discovering a business by using a sequence of product prototypes to get early customer feedback. The first version of the 3D IMVU client took about six months to build, and as the founders iterated towards a compelling user experience, the user base grew monthly thereafter.

This development philosophy created a culture around rapid prototyping of features, followed by testing them against large numbers of actual customers. If a feature worked, we'd keep it. If it didn't, we'd trash it.

It would be hard to argue against this product development strategy, in general. However, hindsight indicates we forgot to do something important when developing IMVU: When the product changed, we did not update the code to reflect the new product, leaving us with piles of dirty code.

Dirty Code

What makes code dirty? Really, anything that prevents people from making changes to the system. Examples include code with:

  • unclear or too many responsibilities,
  • overly complicated or obscure control flow,
  • concepts that don't map to the domain,
  • too many dependencies,
  • global state,
  • or duplicated logic.

In short, if you hire someone who's clearly smart and they say "I can't make sense of this", then you probably have a dirty code problem.

You'll sometimes hear of a technical debt metaphor. Technical debt is a way to think about the cost of introducing dirty code as you'll need to maintain it in the future. Knowingly introducing dirty code lets you quickly test a hypothesis or learn some information, so you're not investing in code you won't need. For code you will need, however, technical debt compounds on itself and rapidly becomes more expensive than the original cost of fixing it.

Taking on technical debt can be the right decision, but it's important to remember that you're introducing work for someone down the road. Moreover, only programmers have a good grasp on the true cost of technical debt. The business will never decide to refactor over pursuing the next shiny project (they don't have enough information). Your engineering organization must be empowered to do what it thinks best for the business and technology platform. If the term "technical debt" ever shows up in a strategic plan, your engineering team has failed to communicate the true costs of their work.

We used the technical debt metaphor quite a bit at IMVU, and, in hindsight, it's obvious that we underestimated the long-term costs by at least an order of magnitude.

So what are the true costs of letting dirty code linger? Each of these are drawn from real examples at IMVU.

Team Costs

  1. Dirty code does not scale to larger teams.

In a code base where modules and objects have unclear responsibilities, programmers tend to implement features by modifying code all over the system. This causes conflicts as multiple people change the same files. Following the open-closed principle helps here. Every object should do one thing well and have a clear interface to the rest of the system. Ideally, each feature would fit into its own object or set of objects, and plug into the system in a standard way.

  1. Dirty code reduces team morale.

Most programmers I know get pleasure and validation by shipping code that makes people happy. Any frustration that gets in the way of that basic need reduces morale. If an improvement to the product seems like it should be a three-hour task but takes two days of investigation and pain, the programmer is unlikely to feel like they can make a difference.

  1. Dirty code makes programmers slower.

If there are two systems for doing X, and you want to make an improvement to the way X is done, you have to change both systems, increasing effort and the risk of regression.

If the concepts in the domain don't map to your objects, programmers have to struggle to find the right place for new code.

If A and B are unrelated aspects of the system and the logic for A and B are glommed together, changes to A involve understanding B too. The more aspects are coupled together, the cost of changing each of those aspects goes up.

  1. Dirty code inhibits the formation of an ownership culture.

When the code is too complicated for anyone to fit it in their heads, programmers will tend to blame the legacy code or architecture for any bugs or regressions that crop up. If they perceive it's too expensive to fix the architecture, they will not feel responsible if the product ends up being low-quality. To build a sustainable, high-quality product, the programmers ultimately need to feel responsible, and the feedback loop between customers and programmers needs to be closed.

Product Costs

  1. If product concepts are not reflected in the code, programmers might implement features in ways that don't make sense in the product.

To explain this, I'll give an example from the IMVU client: The business rules around product loading are complicated to begin with. Worse, the code does not directly reflect said rules. Because of this, our attempts to implement a better loading experience (including progress bar and object prioritization) have failed multiple times, and we still don't have it quite right.

  1. Dirty code incentivizes the business to invest in tangential revenue work rather than attacking core business problems.

For most startups*, the primary product or core competency should ultimately derive the most revenue. If management perceives it's too expensive to work on the core product, they will tend to fund tangential work such as new payment methods or bizdev deals. There's a point where that kind of work makes sense (and pays for itself), but the core offering should be the biggest lever, and the company should rally around that.

* During Web 2.0, your mileage may vary.

Quality Costs

  1. Even with good automated test coverage, dirty code increases the risk of introducing regressions.

If a module has too many dependencies or responsibilities, changes to it can have unintended consequences. Automated test coverage helps a great deal, but think of test coverage as approximating (number of tested conditions) / (number of possible conditions). The combinatorially increased states in dirty code effectively reduces the coverage of your automated (and manual!) tests, allowing regressions to slip through to customers. Threads, if statements, and nullable objects are all examples of ways to reduce test coverage from the code.

  1. Wide or unclear dependencies reduce the quality of tests.

In our experience, unit tests around dirty code tend to involve lots of mocks, partial mocks, and monkey patching, reducing the likelihood that tests will actually catch regressions. Worse, these tests are more likely to fail after benign refactorings. As we refactor our objects to better reflect the actual domain (updating the tests as we go), our confidence in the tests improve and they become dramatically easier understand.

  1. Dirty code hides real bugs.

If the code is too complicated for the programmers to understand and has too many possible states to effectively test, what makes you think you can know whether it reliably works? In our experience, every dirty module corresponds with the area of the product that our users report intermittently breaks. When refactoring those systems, we inevitably discover race conditions, bugs in edge cases, and performance problems.

  1. Dirty code gets dirtier.

Finally, if your team does not get used to improving dirty code, it will only get worse. Eventually, the programmers (and maybe even management) will start calling for a rewrite (a.k.a. resetting the shitty counter). Software design and refactoring are skills that take practice. Honing them will keep your business and technology nimble.

Final Thoughts

I've talked a lot about why dirty code is expensive, so you may be asking "Well, what can I do about it?" First, try to pay attention to your code. After you finish writing some, ask yourself "Could I make this clearer?" Then ask your neighbor the same question. Beyond that, here are some resources that will help you improve your design skills:

It's definitely possible to keep that feeling of 'newness' in your code. Hopefully I've convinced that you that the extra few hours or days to clean up after every project easily pays for itself. Code is the lifeblood of our industry - keep it clean!

Sun, Oct. 5th, 2008, 12:06 pm
Refactor first!

Test-driven development teaches the red-green-refactor mantra, which works wonderfully when starting a project from scratch. Chances are good, however, that you'll spend 80% of your programming career working in existing codebases. In existing code, I've found it easier to refactor first until the desired behavior change is a 30-minute task. By refactoring first, you'll...

  • better understand the code you're about to change.
  • better understand the consequences of your change.
  • make it easier for the person reviewing your change to understand its impact. (It's much easier to review several independent refactorings and one simple change than a nonobvious change followed by refactorings; or, worse, all of the above in one commit.)
  • reduce risk and save time by preventing unexpected bugs.

Red-green-refactor is great, but try refactoring first!

Sat, May. 24th, 2008, 02:05 pm
IMVU Client's Automated Crash Reporting System: Catching Python Exceptions

This post has moved.

For years now, I have been meaning to write a series of articles on the automated crash reporting system in the IMVU client. This first article will give a bit of background on the structure of the client and show how we handle Python exceptions.

At IMVU, we generally subscribe to the Fail Fast philosophy of handling errors: when the client encounters an unexpected error, we immediately crash the program and ask the user to submit a crash report. As part of the crash report, we send log files, stack traces, system information, and anything else that might help us debug the failure.

You might wonder why we crash the program whenever anything goes wrong rather than trying to catch the error and continue running. Counterintuitively, crashing the program forces us to act on crashes and immediately exposes bugs that might trigger unwanted behavior or lost data down the road.

Now let's talk a little bit about how the client is structured. The IMVU client is written primarily in Python, with time-critical components such as the 3D renderer written in C++. Since the client is a cross between a normal interactive Windows program and a real-time game, the main loop looks something like this:

def main():
    while running:
        pumpWindowsMessages() # for 1/30th of a second
        updateAnimations()
        redrawWindows()

This structure assumes that no exceptions bubble into or out of the main loop. Let's imagine that updateAnimations() has a bug and occasionally raises an uncaught exception. If running the client with a standard command-line python invocation, the program would print the exception and stack trace to the console window and exit. That's all great, but our users don't launch our client by invoking python from the command line: we use py2exe to build a standalone executable that users ultimately run. With an unmodified py2exe application, uncaught exceptions are printed to sys.stderr (as above), except there is no console window to display the error. Thus, the py2exe bootstrap code registers a handler so that errors are logged to a file, and when the program shuts down, a dialog box shows something like "An error has been logged. Please see IMVUClient.exe.log."

From a crash reporting standpoint, this is not good enough. We can't be asking our users to manually hunt down some log files on their hard drives and mail them to us. It's just too much work - they will simply stop using our product. (Unfortunately, most of the software out there asks users to do exactly this!) We need a way for the client to automatically handle errors and prompt the users to submit the reports back to us. So let's rejigger main() a bit:

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

This time, if a bug in updateAnimations() raises an exception, the top-level try: except: clause catches the error and handles it intelligently. In our current implementation, we post the error report to a Bugzilla instance, where we have built custom tools to analyze and prioritize the failures in the field.

This is the main gist of how the IMVU client automatically reports failures. The next post in this series will cover automatic detection of errors in our C++ libraries.

20 most recent