Wednesday, June 24, 2009

Holy Cow, Galcon runs in Flash!

Phil Hassey, noted Python/Pygame enthusiast, has ported his award-winning Galcon game to Flash! Awesome!

I'd recommend limiting yourself to 1 match per day, else you'll be joining me at Galcon Anonymous meetings.

Monday, June 15, 2009

Stackless vs GIL: It's a draw, cont.

I decided to double check my assumptions about Stackless and the GIL on a multicore machine, using the same example provided by David Beazley in his slides, which highlights the GIL contention issue. It's probably a better example than my previous test.

import stackless
import time
import threading

def count(n):
    while n > 0:
        n -= 1

T = time.clock()
count(100000000)
count(100000000)
print time.clock() - T

>>> 17.37
A normal sequential execution takes about 17 seconds.
T = time.clock()
t1 = threading.Thread(target=count,args=(100000000,))
t1.start()
t2 = threading.Thread(target=count,args=(100000000,))
t2.start()
t1.join(); t2.join()
print time.clock() - T

>>> 53.22
This confirms David's experiment. It takes much longer due to threads fighting over the GIL.
T = time.clock()
stackless.tasklet(count)(100000000)
stackless.tasklet(count)(100000000)
while stackless.getruncount() > 1:
    task = stackless.run(100)
    if task:
        task.insert()
print time.clock() - T

>>> 25.34
Twice as fast as the threaded solution, and roughly 50% slower than the sequential solution. Cool.

We can get much better results by coarsening the scheduler's granularity. If we raise the tick count to 1000, the Stackless solution takes 17.71 seconds, and the threaded solution takes 22.25 seconds. This is interesting behavior, which I can't yet explain.
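The threaded timing suggests the interpreter's check interval is the knob at work here. Stackless takes its tick count as an argument to stackless.run(n); for CPython threads the analogous setting was sys.setcheckinterval in Python 2, and today's interpreters expose sys.setswitchinterval instead. A sketch in modern Python syntax, timing two counting threads under different switch intervals (the helper name and the counts are my own, and timings will vary by machine):

```python
import sys
import threading
import time

def count(n):
    while n > 0:
        n -= 1

def timed_run(switch_interval, n=2000000):
    # Time two CPU bound counting threads under a given GIL switch
    # interval (Python 3's analogue of the old check interval).
    old = sys.getswitchinterval()
    sys.setswitchinterval(switch_interval)
    try:
        start = time.time()
        threads = [threading.Thread(target=count, args=(n,)) for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start
    finally:
        sys.setswitchinterval(old)

print(timed_run(0.005))  # close to the default interval
print(timed_run(0.05))   # coarser switching, usually less GIL churn
```

The intuition matches the Stackless numbers above: fewer forced switches means less time wasted fighting over who runs next.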

Sunday, June 14, 2009

Stackless vs GIL: Benchmarks.

Following on from my last post, I decided I should check my assertion that "Stackless will outperform CPython, even with CPU bound tasks." I'm not saying the GIL is bad, I'm just pointing out how the same behavior can be achieved with Stackless. If these tasks called some random C function which blocked, but still yielded the GIL, it is likely that CPython would come out on top.
import time
from threading import Thread
import stackless

def factorial(c):
    T = 1
    for i in xrange(1, c):
        T *= i
    return T

#Benchmark stackless using 500 tasklets
T = time.clock()
for i in xrange(500):
    stackless.tasklet(factorial)(1024)
while stackless.getruncount() > 1:
    task = stackless.run(100)
    if task:
        task.insert()
print time.clock() - T

#Benchmark OS threads using 500 threads
T = time.clock()
threads = []
for i in xrange(500):
    thread = Thread(target=factorial, args=(1024,))
    thread.start()
    threads.append(thread)
for thread in threads: thread.join()
print time.clock() - T

>>> 0.5
>>> 0.77
On my dual core machine, Stackless performs around 30%-40% faster than regular threads. This is not a surprise; we all know that CPU bound threads come with a penalty in CPython. However, these Stackless tasklets are being pre-empted in the same way that the GIL works. This is something my Fibra framework, and other similar generator-based frameworks, can never achieve.
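For contrast, here is the kind of switching a generator-based framework has to rely on. This is a minimal round-robin scheduler written for this post in modern Python syntax, not Fibra's actual API. The crucial difference from Stackless: the task must yield explicitly on every iteration, so a loop without a yield can never be pre-empted.

```python
from collections import deque

def factorial(c):
    # Cooperative version of the benchmark task: it must yield
    # explicitly each iteration, or it hogs the scheduler forever.
    T = 1
    for i in range(1, c):
        T *= i
        yield

def run(tasks):
    # Round-robin over generators: resume each task until its next
    # yield, then push it to the back of the queue.
    queue = deque(tasks)
    switches = 0
    while queue:
        task = queue.popleft()
        try:
            next(task)
            switches += 1
            queue.append(task)
        except StopIteration:
            pass
    return switches

print(run([factorial(512), factorial(1024)]))  # one switch per yield
```

Every single switch has to be written into the task body, which is exactly what Stackless's tick-based pre-emption makes unnecessary.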

Very Interesting!

Update: Fixed factorial function and timings.

Stackless vs GIL: It's a draw.

After thinking some more about David Beazley's research into the GIL, I realized the gravity of his comment that the GIL provides "basically a kind of 'cooperative' multitasking". I've spent 10 minutes knocking up some code in Stackless that demonstrates something similar to what David describes. I doubt many people knew Stackless has this capability.
import stackless

#A contrived CPU bound task
def factorial(c):
    T = 1
    for i in xrange(1, c):
        T *= i
    return T

#create two tasklets
stackless.tasklet(factorial)(512)
stackless.tasklet(factorial)(1024)

#used to track how many task switches happen
switches = {}

#while there is more than the main task running...
while stackless.getruncount() > 1:
    #run the scheduler for 100 ticks
    task = stackless.run(100)
    #if we have a pre-empted task
    if task:
        #increment its switch counter
        switches[task] = switches.get(task, 0) + 1
        #insert it at the end of the schedule
        task.insert()

print switches.values()

>>> [13, 26]

What is happening here?

Firstly, notice the factorial function. It is a contrived example of a CPU bound task that does not yield to the Stackless scheduler (by calling stackless.schedule) anywhere in its body.

Next, two tasklets are installed into the schedule. One computes the factorial of 512, the other computes the factorial of 1024. If we call stackless.run() at this point, both tasks would run to completion, one after the other. This is because the factorial function never yields control back to the Stackless scheduler.

However, in this example, when we call stackless.run, we pass it an argument of 100. This asks Stackless to run each tasklet for a maximum of 100 'ticks'. If the tasklet exceeds this, it is interrupted and control returns to the main tasklet. This forces our factorial tasklet to yield control back to our main loop, at which point we can decide what to do with it. In this case, the best thing to do is insert it at the back of the run queue to give other tasklets a chance to execute.

In the example above, the tasklets were forced to switch 13 and 26 times respectively. These few lines of code show how Stackless can almost achieve the same kind of threading that CPython provides, without using OS threads. I think that is pretty neat.

Why do I say "almost"? The Stackless solution does not provide everything the GIL does. For example, if a tasklet makes some C library call which blocks for a long time, all tasklets will block and wait, whereas in CPython other threads will get some CPU time. Still, the task switching in Stackless is much faster than OS thread switches, so if you're using mostly Python code, Stackless will outperform CPython, even with CPU bound tasks.

Saturday, June 13, 2009

CPython Threading is Fundamentally Broken.

Most Python coders know that the Global Interpreter Lock (GIL) in CPython effectively serializes execution of threaded programs, even on systems with multiple CPU cores.

Despite this, Python threads are still useful in some situations, especially when working with blocking IO. Or so we thought.

The GIL problem is much, much worse than I expected.

David Beazley has published some slides summarizing his research into the GIL, which demonstrate how the implementation is fundamentally broken. The slides are clear and easy to understand, and definitely worth taking the time to read and digest.

To summarize: the implementation of the GIL causes threads (on multicore machines) to compete with each other as they attempt to acquire the GIL. The scary part is that as you add more cores and more threads, the contention gets worse, causing a massive degradation in performance compared to serial execution. This isn't the only issue. Read the slides if you ever intend to use threads in CPython.

Saturday, June 06, 2009

Is Python's select.poll unreliable?

I've been building a DHT using Stackless Python, and a nonblocking IO layer which provides IO event notification using epoll, poll, or select, depending on which module is available on the operating system.

I'm developing on OSX, so my module cannot use epoll, and downgrades to the regular select.poll object. This is the first time I've used a select.poll object instead of select.

I'm coming up against some nasty problems. Every now and then, select.poll does weird things: e.g., returning a file descriptor I never registered, which might be 1 or some other random number; sometimes it even returns negative numbers. Recently, under some circumstances, it fails to return a write event when it clearly should. And of course, these bugs are intermittent and hard to reproduce.

So... I switched the IO layer to use select.select instead of select.poll. Voila, everything works perfectly.

Is select.poll known to be buggy? It's hard to find much example Python code which uses it, so I wonder whether it is well tested across a range of platforms.
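For anyone who wants to poke at this, here is a stripped-down version of how I drive select.poll, written in modern Python syntax and using a socketpair in place of my actual DHT sockets so the event masks are easy to inspect:

```python
import select
import socket

# A connected pair of sockets stands in for real network peers.
a, b = socket.socketpair()

poller = select.poll()
poller.register(a.fileno(), select.POLLIN | select.POLLOUT)

# Nothing to read yet, so only POLLOUT should be reported for a.
for fd, mask in poller.poll(0):
    print(fd == a.fileno(), bool(mask & select.POLLOUT))

b.send(b"ping")

# After the peer writes, POLLIN should be reported as well.
for fd, mask in poller.poll(1000):
    print(bool(mask & select.POLLIN))

a.close()
b.close()
```

On my machine this simple case behaves; the weirdness only shows up under load with many registered descriptors, which is exactly what makes it so hard to reduce to a bug report.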

Friday, June 05, 2009

Working with Subversive Games.

I'm now working with Subversive Games, helping establish the Perth Studio, and at the same time delivering an awesome project to a large client! Subversive Games helped sponsor the Global Game Jam here in Perth.

We've started a blog, so we can talk about some of our game tech, and perhaps reveal some of our Python and Unity 3D tips and tricks. I'm currently preparing a demo to show how we've created hundreds of kilometers of train tracks across a vast landscape.

Stay tuned!

Thursday, June 04, 2009

A Room Heater, in Python.

It is about 8 degrees C this morning. So cold, especially when last week we had high twenties.

To help solve the problem, a friend suggested I write a heater application to help warm me up. Not one to refuse a challenge, and always eager to show that Python can solve all your problems, I came up with a Room Heater, in Python, in 15 lines of code.

import time
import sys

try:
    D = int(sys.argv[1])
except IndexError:
    D = 10
except ValueError:
    print "Usage: %s MINUTES"%sys.argv[0]
    raise SystemExit

print "Starting Heater..."
start = time.time()
while (time.time() - start) < (D*60): pass
print "Heater is stopping."

It works very well, especially if you are using a laptop, on your lap of course! It takes about a minute or so for your CPU to get hot enough to be felt, and if you are using a MacBook Pro, like me, you'll really feel the heat!
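Of course, the script above only heats one core. On a multicore machine you can multiply the joke with multiprocessing. A sketch in modern Python syntax (the helper names are my own, and the duration is kept short here; serious heating calls for minutes):

```python
import multiprocessing
import time

def burn(seconds):
    # Busy-wait: no useful work, pure heat.
    end = time.time() + seconds
    while time.time() < end:
        pass

def heat_all_cores(seconds):
    # One busy process per core, so every core contributes warmth.
    workers = [multiprocessing.Process(target=burn, args=(seconds,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    print("Starting heater...")
    heat_all_cores(1.0)
    print("Heater is stopping.")
```

Since each worker is a separate process with its own interpreter, the GIL is no obstacle to warming every core at once.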

Popular Posts