Home

Wed, Jul. 9th, 2008, 07:42 pm
More Python trivia

Given a set a, what's the difference between:

a |= set([1])

and:

a = a | set([1])

?

Sat, May. 24th, 2008, 02:05 pm
IMVU Client's Automated Crash Reporting System: Catching Python Exceptions

This post has moved.

For years now, I have been meaning to write a series of articles on the automated crash reporting system in the IMVU client. This first article will give a bit of background on the structure of the client and show how we handle Python exceptions.

At IMVU, we generally subscribe to the Fail Fast philosophy of handling errors: when the client encounters an unexpected error, we immediately crash the program and ask the user to submit a crash report. As part of the crash report, we send log files, stack traces, system information, and anything else that might help us debug the failure.

You might wonder why we crash the program whenever anything goes wrong rather than trying to catch the error and continue running. Counterintuitively, crashing the program forces us to act on crashes and immediately exposes bugs that might trigger unwanted behavior or lost data down the road.

Now let's talk a little bit about how the client is structured. The IMVU client is written primarily in Python, with time-critical components such as the 3D renderer written in C++. Since the client is a cross between a normal interactive Windows program and a real-time game, the main loop looks something like this:

def main():
    while running:
        pumpWindowsMessages() # for 1/30th of a second
        updateAnimations()
        redrawWindows()

This structure assumes that no exceptions bubble into or out of the main loop. Let's imagine that updateAnimations() has a bug and occasionally raises an uncaught exception. If running the client with a standard command-line python invocation, the program would print the exception and stack trace to the console window and exit. That's all great, but our users don't launch our client by invoking python from the command line: we use py2exe to build a standalone executable that users ultimately run. With an unmodified py2exe application, uncaught exceptions are printed to sys.stderr (as above), except there is no console window to display the error. Thus, the py2exe bootstrap code registers a handler so that errors are logged to a file, and when the program shuts down, a dialog box shows something like "An error has been logged. Please see IMVUClient.exe.log."

From a crash reporting standpoint, this is not good enough. We can't be asking our users to manually hunt down some log files on their hard drives and mail them to us. It's just too much work - they will simply stop using our product. (Unfortunately, most of the software out there asks users to do exactly this!) We need a way for the client to automatically handle errors and prompt the users to submit the reports back to us. So let's rejigger main() a bit:

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

This time, if a bug in updateAnimations() raises an exception, the top-level try: except: clause catches the error and handles it intelligently. In our current implementation, we post the error report to a Bugzilla instance, where we have built custom tools to analyze and prioritize the failures in the field.

This is the main gist of how the IMVU client automatically reports failures. The next post in this series will cover automatic detection of errors in our C++ libraries.

Wed, May. 21st, 2008, 10:13 pm
Open sourced our pstats viewer!

IMVU has benefited greatly from open source software. Large components of our client and website are built on top of various open source projects. In fact, you can see the list of software on our technology page.

Well, I'm happy to announce that we're beginning to contribute source code back! We have created a project on SourceForge and released our first standalone tool: pstats_viewer.py. pstats_viewer is a tool for browsing the data stored in a .pstats file generated by the Python profiler.

Thu, Apr. 3rd, 2008, 05:33 am
Tasks, object lifetimes, and implicit cancellation

Background: The IMVU client has a system that we call the 'task system' which works a lot like lightweight threads or coroutines. You can schedule up some work which gives you back a Future object on which you can either wait or request the actual result (if it has materialized, that is).

I just read Joe Duffy's IDisposable, finalization, and concurrency. The parallels between TPL and IMVU's task system surprised me.

When we built the task system, there were two ways to cancel execution of a task: explicit and implicit. You could either call future.cancel() or just release your reference to the future and wait for it to be collected by the GC. In either case, the scheduler would notice and stop running your task. I've often wondered if support for implicit task cancellation was a good design.

After reading Joe's article, I'm convinced.  If you believe that resource utilization is also a resource, and that implicit resource disposal (garbage collection) is the right default, then implicit cancellation is the only sane default.  In fact, we recently removed support for explicit cancellation, because we haven't even 'really' used it yet.  (Some of Dusty's code used it, but he said it was fine to take it out.)

We may have to implement explicit cancellation again someday, but now I'm happy to wait until there is a compelling use case.

Sun, May. 7th, 2006, 04:07 am
Postmortem of a crash.

We're getting to the point in our development at IMVU where we're starting to have to fix some of the harder, more obscure, crash bugs. I spent the last week on one, and I wanted to write about what it was and the process we used to find it.

First, IMVU has a very good automated bug reporting system. Crashes, unclean shutdowns, and a couple other types of errant behavior get automatically reported and entered into our bug database. Most crashes (read: exceptions) are reported via our normal crash handling logic, along with some basic system information and both Python and C++ stack traces. There are some scenarios, however, where our normal mechanisms do not get a chance to handle the error. In these cases, we've got a "last-chance handler" which writes the exception information to a file, which is then submitted as part of the next report. The biggest omission from our automated reporting is deadlock detection.

Dusty discovered a deadlock in Python's logging module:

import logging
logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        logger.info('C.__str__')
        return 'C'

logger.info('test')
logger.info(C())

Logging from an object that's being logged tries to acquire a non-recursive lock twice on the same thread. First rule of avoiding deadlocks: never call back into user code while a lock is held. In this simple case, a recursive lock would have sufficed. We decided to simply disable the logging.Handler lock; after all, garbled log file output isn't so bad, and it's better than deadlocking.

def nullfunc(*args): pass
logging.Handler.acquire = nullfunc
logging.Handler.release = nullfunc

That seemed to fix the problem. Except for the intermittent crash reports we started getting (to be specific, last chance crashes), inside of the CRT via Python's file functions. The exception was a null pointer access and the stack looks like:

PYTHON23!0x000baa00 -                         - PyFile_Type
PYTHON23!0x0003ac27 -                         - PyFile_SetEncoding
  MSVCRT!0x00030a06 -                         - clearerr
   ntdll!0x00001010 -                         - RtlEnterCriticalSection

Well that's weird. PyFile_Type isn't a function... and PyFile_SetEncoding doesn't call clearerr as far as I can tell. But a test program that calls clearerr(NULL) does indeed raise the same exception. But why would Python call clearerr on a null FILE* pointer?

It was around this time I noticed the crashes happened during log rollover. We use Python's RotatingFileHandler to rotate through a fixed set of fixed-size log files. Things started to click in my mind, and after a few lucky google searches, I found out that Python file objects (PyFileObject) are not thread-safe. There was some discussion on Python-dev, but no actionable conclusion was reached. The long and short is that if you have two threads working on the same file at the same time, the results are undefined. More specifically, if one thread closes a file while another tries to write, you can get into a situation where f->f_fp is NULL, and clearerr(NULL) raises an exception. (Ah ha!)

But why are two threads working on a file object at the same time? Oh right! We disabled the logging synchronization earlier... so if one thread decides it's time to rollover, it starts closing files and doing its internal shuffling, all while another thread is trying to output log statements. Oops! The "proper" fix involves reaching into the deadlock-prone logging module and replacing its nonrecursive lock with the recursive equivalent.

Lesson learned: it's very hard to understand the consequences of changes to synchronization code. Also, without our crash reporting system, we would have never found this problem.

Fri, Mar. 31st, 2006, 12:30 am
Guess the output of this code, answers.

My previous post showed a bit of code that demonstrates an actual bug I ran into this week. Here is the code again:

# Python
names = ['foo', 'bar', 'baz']
fns = [lambda: n for n in names]
print [f() for f in fns]

Here's the output:

['baz', 'baz', 'baz']

Let's try to figure out why. First factor out the list comprehensions:

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    fns.append(lambda: n)
result = []
for f in fns:
    result.append(f())
print result

Same result. Good. Now let's factor out the lambda (anonymous function) and give it an explicit name.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

Still the same result. The reason this program outputs what it does might be becoming clearer... Let's think about what a for loop does again. "for N in LS" is translated into something like "if LS has anything, assign N to the first item in LS and run the loop body. if there is another item in LS, assign N to that item and run the body again. repeat." That assignment to N happens in the current local scope.

Okay, getting closer... Inside a function created inside of a scope, variable lookups start in the current local scope; if the lookup fails, it continues to the next outer scope. And so on all the way up to the global scope. (If it's not there, you get a NameError.) This lets you create inner functions that use variables from outside, such as globals. The important thing to note here is that these lookups only happen when you actually try to use a variable. So let's see what's happening here. The program builds a list of functions which return a variable n. Since n is not in the functions' scopes, it is pulled from the outer scope (in this case, the globals). By the time the functions are called, however, n has changed value several times. Its most recent value is the one the functions return; in this case, 'baz'. Thus, all of the functions return 'baz'. For performance reasons, you wouldn't want to take a snapshot of the values of every variable your functions access, but I think mentally that's what we think is happening when we write the code.

Is there a way to take a snapshot of variables we know will change, though? Default argument initializers let us do just that. Replace the program with the following and it does exactly what we want.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(n=n): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

And translating back into the original form with lambdas and list comprehensions...

names = ['foo', 'bar', 'baz']
fns = [lambda n=n: n for n in names]
print [f() for f in fns]

Tada! Lesson: functions create scopes (or frames in Python terminology). loops don't. Therefore, if you create functions inside of loops where variables will change value, use default arguments to capture their values at function creation time.

Mon, Mar. 27th, 2006, 10:18 pm
Guess the output of this code.

# Python
names = ['foo', 'bar', 'baz']
fns = [lambda: n for n in names]
print [f() for f in fns]