Tag Archives: python

Python to Cython and benchmarking

Recently I have been coding with Cython for my project PyCAF. Obviously, I am doing this to make my code run faster. My approach was to first write my code in Python and make sure it works correctly. I wrote unit-tests for each function and class, using nosetests --with-coverage to look for test cases I might have missed. Then I profiled the code, finding out hot-spots, and eliminating the obvious issues.

Next up was to write some performance micro-benchmarks for important bits of code. These benchmarks all have the same interface: they take an “input size” N as a parameter. What N means is specific to the function. For example, for the function event_one_one(N) it means “create N pairs of tasklets, and make each pair exchange one event”, but for the function event_one_many(N) it means “create one sender tasklet and N recipient tasklets waiting for the sender to send an event”. You get the drift. The micro-benchmarks are tested as well, by the simple expedient of writing a unit-test case per micro-benchmark that calls it with a small input.
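
To make this concrete, here is a minimal sketch of what one such micro-benchmark and its smoke test might look like; create_tasklet is a hypothetical stand-in for the real PyCAF call:

# a hypothetical micro-benchmark following the N-as-input-size convention
import time

def bench_tasklet_new(N):
    start = time.time()
    for _ in xrange(N):
        create_tasklet()    # hypothetical stand-in for the PyCAF tasklet API
    return time.time() - start

def test_bench_tasklet_new():
    # the unit-test for the benchmark itself: call it with a small input
    assert bench_tasklet_new(10) >= 0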

Digression: I spent some time looking for a tool that could store the results of my benchmarks per code commit and later let me browse how a particular benchmark varied over commits, but found nothing. Does anyone have any ideas? If not, I might write such a tool in the future.

Anyway, the benefit of micro-benchmarks is that you can see how the performance scales as the input size grows. For example, here is the output of my benchmark tool:

Test Name      N       Ratio  Time (s)           Ratio          K-PyStones
tasklet_new    100     1      0.000518083572388  1              0.0364847586189
tasklet_new    1000    10     0.00419187545776   8.09111826967  0.295202497026
tasklet_new    10000   10     0.0460090637207    10.9757706746  3.24007490991
tasklet_new    100000  10     0.516650915146     11.2293290357  36.3838672638
tasklet_yield  100     1      0.000921964645386  1              0.0649270877032

Some things to note here: I convert the time taken by a test to kilo-pystones and record that as well as the time taken. What’s a Pystone? Well, it’s the output of python -c "from test import pystone; pystone.main()". For my machine:
Pystone(1.1) time for 50000 passes = 0.7
This machine benchmarks at 71428.6 pystones/second

So basically pystones are a (somewhat) machine-independent measure of how long a test took to run. A test runs faster on a fast machine and slower on a slow machine, but once you convert the time to pystones, the number should come out roughly the same.
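
Here, roughly, is how a measured time can be converted to kilo-pystones; test.pystone.pystones() is in the standard library and returns the benchmark time and the pystones/second rate of the current machine:

from test import pystone

# pystones() returns (benchtime, stones): stones is pystones per second
_, stones_per_sec = pystone.pystones()

def kpystones(seconds):
    # seconds -> (roughly) machine-independent kilo-pystones
    return (seconds * stones_per_sec) / 1000.0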

Now the interesting thing to note is how the time for a test increases when the input grows larger. To help me, my benchmark prints the ratio between successive input sizes and times taken. If I increase the input size by, say, 10 times, and the time taken increases 100 times, then I might have a problem. Of course, behind every test are a bunch of functions and algorithms. I have a general expectation of what the complexity of my function should be, and look at the benchmark to confirm my expectations. I can then replace algorithms with high complexity with better ones. A good design and a well-chosen algorithm or data structure give more speedup than mindlessly twiddling around with code.
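
The ratio columns above need only a few lines to compute; this sketch reuses the hypothetical bench_tasklet_new from earlier:

def ratios(values):
    # ratio of each value to its predecessor
    return [float(b) / a for a, b in zip(values, values[1:])]

sizes = [100, 1000, 10000, 100000]
times = [bench_tasklet_new(n) for n in sizes]
# a ~10x time ratio for a 10x size ratio suggests O(n); ~100x suggests O(n^2)
print zip(ratios(sizes), ratios(times))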

Digression: My benchmark does not report the space complexity of the tests (i.e., how much memory is being consumed). Tarek Ziade mentions some tools in his excellent book “Expert Python Programming”, but I have not understood the details of those tools well enough to incorporate them into my testing just yet.

Once I find a benchmark that I want to improve, I will first profile it, to find where the function or algorithm is spending most of its time. This is obvious, and there are enough resources on the web about it (just look for “cProfile”). What I was wondering about was how I would profile code that I had converted to Cython, since Cythonized code becomes a binary. Well, the Cython wiki gives the answer: http://wiki.cython.org/Profiling. So that’s what I will be looking into very soon.
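
For the plain-Python side, the cProfile part of the workflow looks something like this (all standard library, run against the hypothetical benchmark from earlier):

import cProfile, pstats

# profile one benchmark run and dump the stats to a file
cProfile.run('bench_tasklet_new(10000)', 'bench.prof')
# show the 10 most expensive calls by cumulative time
pstats.Stats('bench.prof').sort_stats('cumulative').print_stats(10)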

SQLAlchemy: one class, two tables

SQLAlchemy is a wonderful ORM for Python. While it allows the normal “class per table” semantics, one can do some more interesting stuff with it.

We often start with a single table for some entity. For example, we might start off our app with a User class mapping to a user table, which has columns for an id, a name, and an address. For illustration, the list of columns is small; in a real application, this table would probably contain many more columns.

Here is some boilerplate code that initiates our database connection, etc.:

from sqlalchemy import *
from sqlalchemy import sql
from sqlalchemy.orm import mapper, relation, sessionmaker, create_session

dburl = "mysql://testuser:testpasswd@localhost/test"
engine = create_engine(dburl)
meta = MetaData(bind=engine)

Let’s define a table and a class mapping to the table.

t_users = Table(
        'users', meta,
        Column('id', Integer, primary_key=True),
        Column('name', String(40), nullable=False),
        Column('address', String(200))
        )
class User(object):
    def __init__(self, name, address=None):
        self.name = name
        self.address = address
    def __repr__(self):
        return "<User(%s,%s,%s)>" % (self.id, self.name, self.address)
mapper(User, t_users)

To create the database, we might do something like the below, or roll our own database schema management system:

meta.create_all()

We store and retrieve User objects from the database like this:

# create a session
session = create_session(bind=engine)
# for testing only... ensure table is empty to start with
session.execute(t_users.delete())
# create and store a record
u = User('Beowulf', 'Denmark')
session.save(u)
session.flush()

# retrieve records from the db
session.clear()
u = session.query(User).filter_by(name='Beowulf').one()
print u

The output of the above snippet would be:

<User(2,Beowulf,Denmark)>

It might make sense in the beginning to put everything related to a user in one table. After a while, as the application grows, one starts seeing some patterns. Imagine that the address field above was very large (or imagine that there was a ‘photo’ field containing the user’s photograph). However, the address field is used only in one corner of the application, and only the name is used in the vast majority of the application code. What’s more, the large binary/string field is slowing down other queries. One of the ways to solve this problem is to split the user table into two:

t_users = Table(
    'users', meta,
    Column('id', Integer, primary_key=True),
    Column('name', String(40), nullable=False),
    )
t_addresses = Table(
    'addresses', meta,
    Column('user_id', Integer,
           ForeignKey('users.id'), primary_key=True),
    Column('address', String(255)),
    )

meta.create_all()

Now, wherever we are interested in only the user’s name, we’d use the User class. Wherever we are interested in the address, we’d use an Address class.

class User(object):
    pass
class Address(object):
    pass
mapper(User, t_users)
mapper(Address, t_addresses, properties={
        'user': relation(User, backref='address',
                         primaryjoin=t_users.c.id==t_addresses.c.user_id)
        })

session = create_session(bind=engine, transactional=False)
session.execute(t_addresses.delete())
session.execute(t_users.delete())

u1 = User(); u1.name = "Hagar"
session.save(u1)
a1 = Address() ; a1.user = u1 ; a1.address = "Denmark"
session.save(a1)
session.flush()

Now, we are presumably saving a lot of time and memory on our queries. But, what if we want the equivalent of the original User object, one which refers to both the name and address? Well, we can define a class mapped to a join and use it:

usersaddresses = sql.join(t_users, t_addresses,
                           t_users.c.id == t_addresses.c.user_id)
class UserAddress(object):
    def __repr__(self):
        return "<FullUser(%s,%s,%s)>" % (self.id, self.name, self.address)
mapper(UserAddress, usersaddresses, properties={
        'id': [t_users.c.id, t_addresses.c.user_id],
        })
f = session.query(UserAddress).filter_by(name='Hagar').one()
print f

Note the “id” column in the “properties”: we’ve told SQLAlchemy that the “id” attribute of our UserAddress class is actually the same as the “users.id” and “addresses.user_id” columns, which always hold the same value. Thus SQLAlchemy will NOT produce redundant “id” and “user_id” attributes on our class.

We can even use this class to change and save attributes, and the attributes will go to their respective table!

f.name = "Hagar the horrible"
f.address = "Copenhagen"
session.flush()

for (id, name, user_id, address) in session.execute(usersaddresses.select()):
    print id, name, user_id, address

We get:

11 Hagar the horrible 11 Copenhagen

Python packaging: custom scripts

In the post Python Packaging: setuptools and eggs, I described how to use setuptools to create a distributable egg.  Installing the egg would provide:

  1. a set of python packages and modules, usable as a library
  2. a set of “scripts”: small programs that live in the user’s bin directory, wherever that is; these, for the most part, invoke functions within the packages

While setuptools can generate these scripts for us, sometimes that does not cut it. Maybe we really do want to create scripts ourselves and get them installed in the user’s `bin` directory.

Writing scripts yourself

In the last post we got setuptools to generate three scripts for us: rundog, rungendibal, runbjarne. Let us hand-craft a script that does the work of any of these, depending on a command line argument. Let’s call our script ‘runany’: it will take one argument—“dog”, “gendibal”, or “bjarne”—and run the appropriate script for the argument. Here is our script:


#!python
# scripts/runany
import sys, os
which = sys.argv[1]
prog = "run" + which
os.execlp(prog, prog)   # replace this process with the chosen run* script

And we tell setuptools (actually, the underlying distutils) about it:

# setup.py
setup(...
scripts=['scripts/runany'],
...
)

Now, we can build our egg. When the user installs the egg, he would have the following scripts available:

  • rundog, rungendibal, runbjarne: these are auto-generated by setuptools
  • runany: this is a wrapper script that sets up Python’s path and calls our own script above.

The user can run our runany script like this:
$ runany dog
Bow, wow!
$ runany bjarne
Hello, C++ World!

Wrapper code

Let’s see what setuptools did. First, it put a runany script in the user’s path. What’s inside?


$ cat `which runany`
#!/home/parijat/.virtualenvs/test/bin/python
# EASY-INSTALL-SCRIPT: 'Speaker==0.2dev','runany'
__requires__ = 'Speaker==0.2dev'
import pkg_resources
pkg_resources.run_script('Speaker==0.2dev', 'runany')

If you were expecting to see the code we wrote above and are surprised, join the club. This script uses pkg_resources.run_script to call our code. The rest is just ensuring that the proper version of the egg is being referenced.

Why does it do so? So that the user can have multiple versions of our package installed and be able to use them.
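
For example, with several versions of the egg installed, code (or a wrapper script) can pin a specific one before importing; the version string here is just for illustration:

import pkg_resources
pkg_resources.require('Speaker==0.1')   # activate the 0.1 egg on sys.path
import speaker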

Where is our script anyway? It is here:

$ ls <install_dir>/Speaker-0.2dev-py2.5-linux-x86_64.egg/EGG-INFO/scripts/
runany

where <install_dir> refers to wherever the user installed our egg.

In conclusion, we wrote some custom code, told setuptools where to find it (using `scripts=[…]` directive in setup.py), created an egg, installed it, and everything works.

Non-Python scripts

You might have noticed above that the custom script we wrote was in Python. This is not an accident: the distutils/setuptools directive `scripts=…` that we used to get our stuff packaged and installed expects to be given Python scripts. Actually, distutils itself can handle non-Python scripts: it copies them verbatim, without wrapping or adjusting them in any way. Setuptools, however, always wraps scripts, in a way that breaks non-Python scripts. This can be considered a bug in setuptools. So, when using setuptools, non-Python scripts are out. This is not so bad: why would you want to write a shell script (or some other kind of script) for something that can be done in Python?

Non-package Data Files

What if you wanted an init script to go into ‘/etc/init.d/’, or some config file to go into ‘/etc/’? setuptools won’t do that for you either; it will keep your files bundled inside the egg. But you can provide the user with a script that extracts the files and copies them to arbitrary locations, at their discretion. See http://peak.telecommunity.com/DevCenter/setuptools#non-package-data-files.
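
As a sketch of that approach, the extraction script could use pkg_resources to pull a file out of the egg and copy it into place; the 'conf/speaker.init' resource name is hypothetical:

import pkg_resources, shutil

# resource_filename() extracts the resource (unzipping the egg if needed)
# and returns a real filesystem path we can copy from
src = pkg_resources.resource_filename('speaker', 'conf/speaker.init')
shutil.copy(src, '/etc/init.d/speaker')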

Unkind programmers…

The other day, while going through a Twisted tutorial, I stumbled upon this:


user, status = line.split(':', 1)
user = user.strip()
status = status.strip()

Heh, we programmers don’t take the user seriously, do we?

Python packaging: setuptools and eggs

Developing Packages

This document describes how to create a buildable, distributable package out of
Python source code. We’ll look at the popular ‘egg’ distribution format.

Tools we need: setuptools (which also provides easy_install).

Also highly recommended: virtualenv.

What is an egg?

  • Eggs are basically directories that are added to Python’s path.
  • The directories may be zipped.
  • Eggs have some meta-data
    • Dependencies
    • Entry-points
  • May be distributed as source
  • Can be discovered from PyPI

What is easy_install?

A tool to find, download, compile (if needed), and install python packages. It
can install eggs, or even source tarballs, as long as the tarball uses the
standard python setup.py method of building itself.

Egg Terminology

  • Distribution
    • a term used by Python distutils;
    • anything which can be ‘distributed’, really;
    • most common: tarballs, eggs.
  • Source distribution:
    • A distribution that contains only source files
  • Binary distribution:
    • A distribution that contains compiled ‘.pyc’ files and C extensions
    • E.g., RPMs and eggs
  • Egg:
    • A kind of binary distribution
  • Platform dependent eggs:
    • Eggs which contain built C extension modules and are thus tied to an OS
  • ‘develop eggs’ and ‘develop egg links’:
    • develop egg links are special files that allow a source directory to be
      treated as if it were an installed egg. (That is, an egg that you are
      ‘developing’!)
  • Index server and link servers:
    • easy_install will automatically download distributions from the
      Internet. When looking for distributions, it will look at zero or more
      link servers for links to distributions. It will also look on a single
      index server, typically (always) http://www.python.org/pypi. Index servers
      are required to provide a specific web interface.

Example Project

Our sample project consists of this code:

  • package ‘speaker’
    • module dog:
      • class Dog
      • function DogMain
    • module gendibal
      • class Gendibal
      • function GendibalMain
    • module bjarne
      • class Bjarne
      • function BjarneMain
  • package ‘tests’
    • module dog_test
      • class DogTest
    • module gendibal_test
      • class GendibalTest
    • module bjarne_test
      • class BjarneTest

The classes Dog, Gendibal, and Bjarne are “speakers”: they all have the
method greeting() which takes no arguments and returns a string containing
something they said. The Dog speaker will, of course, say “Bow, wow!”.
Gendibal is a mathematician and therefore uses prime numbers in his
greetings. Bjarne likes to talk about C++.

For every module and class, there is a corresponding test module and class.

We shall also have three scripts (i.e., programs that live in a bin
directory somewhere) that are intended to be launched from the command
line. The programs will be:

  • rundog: runs speaker.dog:DogMain
  • rungendibal: runs speaker.gendibal:GendibalMain
  • runbjarne: runs speaker.bjarne:BjarneMain

Directory Structure

This is the intended directory structure.

Speaker/
|-- README.txt
|-- setup.cfg
|-- setup.py
|-- speaker
|   |-- __init__.py
|   |-- bjarne.cpp
|   |-- dog.py
|   `-- gendibal.pyx
`-- tests
    |-- __init__.py
    |-- bjarne_test.py
    |-- dog_test.py
    `-- gendibal_test.py
  • Speaker is the name of the project, and it will also be the name of our
    package (Speaker-0.1.tar.gz, for example);
  • Our project contains a package named speaker, where we will put our
    classes; we can add more packages inside later;
  • setup.py and setup.cfg contain information to build our egg.
  • The tests package will contain test code.

Version 0.0: setting up the package

Let’s create some dirs and files:

Speaker/
|-- setup.cfg
|-- setup.py
|-- speaker
|   |-- __init__.py
`-- tests
    |-- __init__.py

Where:

  • setup.py
    from setuptools import setup, find_packages
    setup(name='Speaker',
          packages=find_packages(),
          )

The find_packages function will automatically discover your Python packages
and modules and include them in the distribution.

  • setup.cfg
    [egg_info]
    tag_build = dev

The tag_build option appends a tag of our choice to the generated
filename. We’ll see it in action in a second.

  • speaker/__init__.py and tests/__init__.py are empty files.

Now we can build our package:

$ cd Speaker
$ python setup.py sdist
$ python setup.py bdist_egg
$ ls dist/
Speaker-0.0dev-py2.5.egg  Speaker-0.0dev.tar.gz

We have just created a source distribution and a platform-independent egg, even
though we don’t have a single line of useful code yet.

Note the ‘dev’ in the filename: we’ve told setuptools that our package
is an in-development package by specifying the tag ‘dev’ in setup.cfg. This
actually matters when easy_install is figuring out which of several
versions of a package it should download and install. More on this later.

Version 0.1: making a releasable package

Let’s update our setup.py:

# setup.py
from setuptools import setup, find_packages
import sys, os

version = '0.1'

setup(name='Speaker',
      version=version,
      description="Demo Package",
      packages=find_packages(exclude=['ez_setup', 'examples', 'tests']),
      include_package_data=True,
      zip_safe=False,
      )

Notes:

  • Look at the find_packages call: some packages and modules are not going
    to be part of your distribution, because we want the tests and examples packages, and the ez_setup.py module, to be available only to people checking out the code, not to those downloading a built egg. (We haven’t written any examples yet, but you were going to, right? ;-))
  • zip_safe=True would mean the egg is not unzipped on install: code runs
    right out of the zipped file. This is normally not useful, so we set it to False.
  • We always want to set include_package_data to True.

Our first bit of code

We create our first speaker:

# speaker/dog.py
class Dog(object):
    def greeting(self):
        return "Bow, wow!"

and write a test:

# tests/dog_test.py

import unittest
from speaker import dog

class DogTest(unittest.TestCase):
    def test_greeting(self):
        d = dog.Dog()
        self.assert_(d.greeting() == "Bow, wow!")

if __name__ == "__main__":
    unittest.main()

Oops! Python does not know where to find our packages yet. So we ‘install’ our
egg as a ‘develop egg’:

$ python setup.py develop

This will create the necessary symbolic links for python to find
our packages. Now our code will behave just as if it was
installed, while letting us keep coding away.

$ python tests/dog_test.py
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Automatic test discovery and running

We specified a collection of tests above (dog_test.py). But we
will be writing a lot of tests, and we want to be able to run all
of them in one shot. We are going to use the ‘nose’ test
discovery and execution tool to find and run our tests.

# setup.py
setup(...
     test_suite="nose.collector",
     tests_require="nose",
     )

The tests_require line will make easy_install download and put nose in
the current directory if nose is not already installed.

$ python setup.py test
... <downloads nose>

...
test_greeting (tests.dog_test.DogTest) ... ok
...

(If it fails the first time, just run python setup.py test again.)

The main function

We have a speaker library, but we don’t have a “main” script
yet. You often have to create a separate file just for the
“main” script, which is (should be) just a wrapper script that
imports some module and calls a function in it. In fact, for a
large package, we may have many “main” scripts, each doing
nothing more than importing the required packages and modules and
calling some function in there.

We can use the setuptools ‘Entry points’ mechanism for this. An
‘entry point’ is the name of some functionality of the
package/application; entry points come in groups; two groups are
pre-defined: “console_scripts” and “gui_scripts”. Setuptools can
auto-generate wrapper scripts for our entry points.

Here is how we can tell setuptools to generate a console script
that does something useful:

# setup.py
setup(...
      entry_points={
        'console_scripts': [
            'rundog = speaker.dog:DogMain',
            ],
        },
     ...
     )

Now, when we do a python setup.py develop, or a user installs
our egg, a script called ‘rundog’ will be generated and
automatically put somewhere in the path. The script will call
the DogMain function in the speaker.dog module with no
arguments, and the return value of the function will be the
exit status of the script.

What would the DogMain function be like?

# speaker/dog.py
...
def DogMain():
    d = Dog()
    print d.greeting()
    return 0

Now, when we run ‘develop’ again, setuptools will generate the rundog script
for us:

$ python setup.py develop
...
Installing rundog script to .../bin
...
$ rundog
Bow, wow!

We should keep a minimum amount of code in DogMain and put most
of it in discrete, well tested functions. This helps make code
more robust and re-usable.

About version numbers

Until now, other projects using our Speaker package have been
checking out code from our code repository and using it directly.
Now it is time to make an ‘official’ release. We shall release
v0.1 (the version we have been working on, and the one specified
in setup.py) of our package (and remove ‘dev’ from the release
name). For easy_install:

0.1dev < 0.1a < 0.1b … < 0.1 < 0.1-1 < 0.1-2 …
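
You can check this ordering directly with pkg_resources, which is what easy_install uses under the hood:

from pkg_resources import parse_version

assert parse_version('0.1dev') < parse_version('0.1')
assert parse_version('0.1') < parse_version('0.1-1') < parse_version('0.1-2')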

Steps:

  1. We create a release branch
  2. On the release branch, we edit setup.cfg. Currently, it probably says:
    [egg_info]
    tag_build = dev
  3. We change it to:
    [egg_info]
    tag_build =
  4. Now we can generate a ‘release’ version and copy it to some download page.
    $ python setup.py sdist bdist_egg
    $ ls dist/
    Speaker-0.1-py2.5.egg  Speaker-0.1.tar.gz
  5. Back on the main branch, we prepare to work on the next version by
    changing the version number in setup.py to 0.2.
  6. Our main branch releases are now ‘0.2dev’:
    $ python setup.py sdist bdist_egg
    $ ls dist/
    Speaker-0.2dev-py2.5.egg  Speaker-0.2dev.tar.gz

Post releases

So, we have released v0.1 of Speaker. However, there is a bug:
there is no README.txt! This bug has just been fixed on the
trunk. The trunk is not going to be stable until the next
release, which is a month away, and we have to release
a bugfix NOW!

Steps:

  1. We checkout the release branch;
  2. We cherry pick the desired commit from trunk to our release branch;
  3. We edit setup.cfg in the release branch and add a post-release tag:
    [egg_info]
    tag_build = -1
    tag_svn_revision = false
  4. We make a new release:
    $ python setup.py sdist bdist_egg
    $ ls dist/
    Speaker-0.1_1-py2.5.egg  Speaker-0.1-1.tar.gz
  5. And tag the new release (we always tag outgoing stuff)

And now we have a bugfix update to our 0.1 release. If we upload
it to the distribution dir, easy_install will pick it in
preference to the older 0.1 release.

Defining dependencies

We are probably going to be using a bunch of libraries when
developing our project. We can define a dependency requirement
like this:

# setup.py
setup(...
     install_requires=["SQLAlchemy"],
     ...
     )

Now, when we do a python setup.py develop, or a user installs
our egg, easy_install will find the latest version of SQLAlchemy
from PyPI, download it, and install it.

Other projects can depend on our Speaker project in the same way.

Restricting dependency versions

Let’s say we know that SQLAlchemy has a stable 0.4 branch, and 0.5 beta in
progress. We don’t want 0.5 beta versions. How do we tell
setuptools to install the highest 0.4 version, but not any 0.5
version?

First, we have to find out what the smallest version on the 0.5
branch is. Then we need to change our requirement to:

SQLAlchemy < 0.5.0a

Where “0.5.0a” refers to the first possible version of the 0.5
branch. (This version does not have to exist; it just has to be
smaller than the smallest version you want to exclude.) One needs
to be quite careful about choosing the right version number.
Saying only 0.5, or only 0.5.0, would not have worked, because 0.5.0rc1
is “smaller” than 0.5.0 or 0.5!
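
The gotcha is easy to demonstrate with pkg_resources:

from pkg_resources import parse_version

# an rc of 0.5.0 already satisfies "< 0.5.0"...
assert parse_version('0.5.0rc1') < parse_version('0.5.0')
# ...while our artificial lower bound sits below every real 0.5 release
assert parse_version('0.5.0a') < parse_version('0.5.0rc1')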

Let’s say we know that our stuff works with 0.4.3 and higher
versions of SQLAlchemy, but does not work with 0.4.2 or below.
Our requirement can look like this:

SQLAlchemy >= 0.4.3, < 0.5.0a

Also, note that:

  • If a version of SQLAlchemy that satisfies the dependency version
    requirement is already installed system-wide, easy_install will use
    that version rather than download a new one. Hence, we should
    avoid polluting the system Python site-packages.
  • Easy_install will not upgrade your dependency automatically
    when you run it later, even if a newer version of the
    dependency is available, as long as the installed version
    satisfies your dependency version requirement.

Dependencies not on PyPI

What if the library we need is not on PyPI? What if it is
actually developed and packaged by another group in our company,
and available only from an internal release page?

We can get dependencies like these by telling setuptools to look
at a particular URL.

# setup.py
setup(...
      install_requires=[
        "SQLAlchemy >0.4.3, <0.5.0a", # On PyPI
        "hello", # An Affle package
      ],
      dependency_links = [
        "file:///home/parijat/Python/" # find Affle packages here

        ],
     )

Now setuptools will look first in /home/parijat/Python. If the
hello and SQLAlchemy eggs are there, it will use them. If one of
the eggs is not found there, it will go to PyPI.

More than one dependency link can be specified.

Developing binary eggs (C extensions)

Now we come to the interesting bit: binary packages. We can use
the Python C API to write extension modules, and let distutils
build them. But there are easier ways.

Version 0.2: Pyrex extensions

Pyrex is “a Language for Writing Python Extension Modules”. The
greatest benefit is that Pyrex makes it easy to convert types
between Python and C.

Writing extensions in Pyrex

We’ll demonstrate this with a new speaker class, and we shall choose Gendibal
for this task. Here is the interface to Gendibal:

# tests/gendibal_test.py
import unittest

from speaker import gendibal

class GendibalTest(unittest.TestCase):
    def test_greeting(self):
        g = gendibal.Gendibal()
        self.assert_(g.greeting() == "Hello 29")

Gendibal is a mathematical speaker, and happens to like the 10th
prime number a lot. Now we only have to define the Gendibal class:

# speaker/gendibal.pyx
...
class Gendibal(object):
    def greeting(self):
        return "Hello %s" % primes(10)[-1]

def GendibalMain():
    g = Gendibal()
    print g.greeting()
    return 0

and add a new entry point:

# setup.py
setup(...
      entry_points={
        'console_scripts': [
            'rundog = speaker.dog:DogMain',
            'rungendibal = speaker.gendibal:GendibalMain',
            ],
        },
     ...
     )

Note:

  • the definition of this class is in a file with the .pyx suffix, indicating that this is a Pyrex file, not a Python file.
  • the definition is a Python definition. Pyrex code can contain normal Python code.

We have not defined the primes function yet. Here is the definition of the
primes function, in the same .pyx file:

# speaker/gendibal.pyx
...
def primes(int kmax):
  cdef int n, k, i

  cdef int p[1000]
  result = []
  if kmax > 1000:
    kmax = 1000
  k = 0
  n = 2
  while k < kmax:
    i = 0
    while i < k and n % p[i] <> 0:
      i = i + 1
    if i == k:
      p[k] = n
      k = k + 1
      result.append(n)
    n = n + 1
  return result

This is Pyrex code. It looks very much like Python, with some type annotations.

Building Pyrex extensions

Setuptools can build Pyrex files “out of the box”, as long as the
Pyrex compiler is somewhere on the path. Let’s get Pyrex:

$ easy_install pyrex

We need to tell setuptools about our extension, though:

# setup.py
from setuptools import setup, find_packages, Extension
...
setup(...
      ext_modules=[
        Extension('speaker.gendibal', ['speaker/gendibal.pyx']),
        ],
     ...
     )

And that’s it! We can build the egg:

$ python setup.py bdist_egg
...
running build_ext
pyrexc speaker/gendibal.pyx --> speaker/gendibal.c
...
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.5 -c speaker/gendibal.c -o build/temp.linux-i686-2.5/speaker/gendibal.o
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions
build/temp.linux-i686-2.5/speaker/gendibal.o -o
build/lib.linux-i686-2.5/speaker/gendibal.so
...
creating stub loader for speaker/gendibal.so
byte-compiling build/bdist.linux-i686/egg/speaker/gendibal.py to
gendibal.pyc
...

Note:

  • The Pyrex code gendibal.pyx was converted to C code gendibal.c by the
    Pyrex compiler;
  • The extension gendibal.so was compiled;
  • A wrapper python script gendibal.py to load the extension was automagically
    created for us.

Now there are two tests:

$ python setup.py test
...
test_greeting (tests.dog_test.DogTest) ... ok
test_greeting (tests.gendibal_test.GendibalTest) ... ok
...

Wasn’t it handy that we are using ‘nose’? Our new test is discovered
and run for us without our having to register it anywhere.

We can run our new ‘main’ script:

$ rungendibal
Hello 29

Pyrex can not only be used to convert Python code to C, but it can
also help us interface with existing C code/libraries.

Version 0.3: Boost.Python extensions

What about libraries/code in C++? Pyrex does not help there, and
wrapping C++ code with the Python C API can be tricky.
Boost.Python to the rescue.

Writing extensions in Boost.Python

Let’s say we have the following C++ library:

// speaker/bjarne.cpp
#include <string>
#include <iostream>

namespace { // Avoid cluttering the global namespace
    class BjarneCPP {
    public:
        std::string greet() const { return "Hello, C++ World!"; }
    };

    int BjarneCPPMain() {
        BjarneCPP b = BjarneCPP();
        std::cout << b.greet() << std::endl;
        return 0;   // declared int, so return a status
    }
}

As can be seen, there is a class named BjarneCPP with an
interface very similar to our speaker interface, except that it
has a greet method, instead of our usual greeting method.
There is also a BjarneCPPMain function, that looks like a good
candidate to be a main function in our application. This looks like a useful
library. How do we access it in Python?

We can wrap it in Python like this:

// speaker/bjarne.cpp
...
#include <boost/python.hpp>
using namespace boost::python;

BOOST_PYTHON_MODULE(bjarne) {
    class_<BjarneCPP>("Bjarne", init<>())
        .def("greeting", &BjarneCPP::greet)
        ;
    def("BjarneMain", BjarneCPPMain, "The main function for 'bjarne'' module");
}

(For convenience and brevity, we’ve added our code in the same file.
Realistically, the code to be wrapped would be in a library, and
we would link against that library at build time.)

As usual, we do not forget to write our tests:

# tests/bjarne_test.py
import unittest
from speaker import bjarne

class BjarneTest(unittest.TestCase):
    def test_greeting(self):
        b = bjarne.Bjarne()
        self.assert_(b.greeting() == "Hello, C++ World!")

and define an entry point:

# setup.py
...
setup(...
      entry_points={
        'console_scripts': [
            ...
            'runbjarne = speaker.bjarne:BjarneMain',
            ],
      ...
      )

Building Boost.Python extensions

Now we need to tell setuptools about the new extension:

# setup.py
...
setup(...
     ext_modules=[
     ...
        Extension('speaker.bjarne',
                  ['speaker/bjarne.cpp'],
                  libraries=['boost_python']),
        ],
     ...
     )

And that’s it. We can create an egg:

$ python setup.py bdist_egg
...
building 'speaker.bjarne' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.5 -c speaker/bjarne.cpp -o build/temp.linux-i686-2.5/speaker/bjarne.o
...
g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions build/temp.linux-i686-2.5/speaker/bjarne.o -lboost_python -o build/lib.linux-i686-2.5/speaker/bjarne.so
...
creating stub loader for speaker/bjarne.so
...
byte-compiling build/bdist.linux-i686/egg/speaker/bjarne.py to bjarne.pyc
...

Again, setuptools has compiled our extension module, linked it
against the libraries specified (boost_python), and generated a
wrapper (‘bjarne.py’) for us.

We can run our tests, and our new test will appear:

$ python setup.py test
...
test_greeting (tests.bjarne_test.BjarneTest) ... ok
test_greeting (tests.dog_test.DogTest) ... ok
test_greeting (tests.gendibal_test.GendibalTest) ... ok
...

And our new entry point works too:

$ runbjarne
Hello, C++ World!

easy_install annoyances

  • easy_install does not upgrade dependencies when upgrading a
    package;
  • easy_install does not, by itself, have a way of specifying
    exact versions of all dependencies of a package;
  • it is possible to force easy_install to not download anything
    from the Internet but to install everything from a given
    location; this can be used to mitigate unexpected versions of
    dependencies being installed;
  • easy_install, by itself, will install packages in the
    system-wide python site-packages directory; this can be a big
    annoyance. It is highly recommended to use virtualenv.

Credits

Egg jargon/terminology taken from: http://grok.zope.org/documentation/tutorial/introduction-to-zc.buildout.

Tesla = Pylons + SQLAlchemy + Elixir

Tesla is a Paster template that creates Pylons applications using the SQLAlchemy/Elixir ORM. It adds some simple database paster commands and includes the following features:

  1. Create model classes
  2. Simple database commands (create/drop tables)
  3. Migrations (using SoC migrate library)
  4. Create and run batch scripts
  5. Handles SQLAlchemy setup and session refresh

Now, will Tesla/Elixir allow me to have composite primary keys of my choosing? Can I have, for most models, an Elixir definition, and for some, a more flexible SQLAlchemy mapper? How?

The reason I’d like to do that: ActiveRecord does a poor job of structuring database tables for performance.

What a bunch of libraries! Sometimes you wonder if plain PHP with embedded SQL isn’t better after all. No, no, that was heresy…

Nose: db setup and teardown

In the last post I noted some documentation related to nose and ORM. Well, it did not work for me because my setup was not exactly like others’. Here is what worked for me. In tests/__init__.py:

  • I added from pylons import config to get at the SQLAlchemy engine embedded in Pylons’s config variable
  • I added import quickwiki.model as model so that I could get hold of my models and metadata
  • I created a class TestModel inheriting from TestCase to hold the setup and teardown code
  • In the tearDown method, I do model.metadata.drop_all(bind=engine) to destroy all tables
  • In the setUp method, I call tearDown to destroy any tables that have not already been cleaned up, and then call model.metadata.create_all(bind=engine) to create the tables.

Here is the final code:


"""Pylons application test package

When the test runner finds and executes tests within this directory,
this file will be loaded to setup the test environment.

It registers the root directory of the project in sys.path and
pkg_resources, in case the project hasn't been installed with
setuptools. It also initializes the application via websetup (paster
setup-app) with the project's test.ini configuration file.
"""

import os
import sys
from unittest import TestCase

import pkg_resources
import paste.fixture
import paste.script.appinstall
from paste.deploy import loadapp
from routes import url_for

__all__ = ['url_for', 'TestController']

here_dir = os.path.dirname(os.path.abspath(__file__))
conf_dir = os.path.dirname(os.path.dirname(here_dir))

sys.path.insert(0, conf_dir)
pkg_resources.working_set.add_entry(conf_dir)
pkg_resources.require('Paste')
pkg_resources.require('PasteScript')

test_file = os.path.join(conf_dir, 'test.ini')
cmd = paste.script.appinstall.SetupCommand('setup-app')
cmd.run([test_file])

from pylons import config
import quickwiki.model as model

class TestModel(TestCase):
    """
    We want the database to be created from scratch before each test and
    dropped after each test (thus making them unit tests).
    """

    def setUp(self):
        self.tearDown()
        engine = config['pylons.g'].sa_engine
        model.metadata.create_all(bind=engine)

        page = model.Page()
        page.title = 'FrontPage'
        page.content = 'Welcome to the QuickWiki front page'
        model.Session.save(page)
        model.Session.commit()

    def tearDown(self):
        engine = config['pylons.g'].sa_engine
        model.metadata.drop_all(bind=engine)


class TestController(TestModel):

    def __init__(self, *args, **kwargs):
        wsgiapp = loadapp('config:test.ini', relative_to=conf_dir)
        self.app = paste.fixture.TestApp(wsgiapp)
        TestCase.__init__(self, *args, **kwargs)

Well, that almost worked. I fell foul of the ‘setup-app’ command and ‘setup-config’ in websetup.py. As you can see, the ‘tests/__init__.py’ file loads and executes the paster ‘setup-app’ command. Stands to reason: the app should be ‘set up’ before I run tests.

My setup-app is responsible for creating the DB and populating it with some initial data. But now I can’t repeat my tests, because the first time I run the tests, the db is created and initial data put in, and the next time I run the tests… poof:


IntegrityError: (IntegrityError) column title is not unique
u'INSERT INTO pages (title, content) VALUES (?, ?)' ['FrontPage',
'Welcome to the QuickWiki front page']

Well, I thought, I would just drop everything in the db in the tearDown and ensure that tearDown is called before setUp is run. No go. For some reason, it seems, the tearDown() is not working.

Well, it seems I must have at least one test, for the fixtures to be run. So I created a dummy test, and all was well.

Pylons, Paste, Nose and ORMs

Trying to do some unit tests in Pylons. Pylons uses nose. However, the Pylons Unit Testing guide is a little short on describing how to setup and teardown the database before each test. Here is all the docco to hand:

Reactor vs Proactor

I found a comparison of the Reactor and the Proactor pattern here. Both patterns address issues that crop up when building a concurrent network server, and both are alternatives (or complements) to thread-based concurrency.
Both revolve around the concepts of an IO demultiplexer, event sources, and event handlers. The driver program registers some event sources (e.g., sockets) with an IO demultiplexer (e.g., select() or poll()). When an event occurs on a socket, a corresponding event handler is called. Of course, there must be some map from events on an event source to event handlers.
I found that these patterns are more or less embodied in the Python asyncore and asynchat modules, and I want to discuss how those modules implement them.

The Basics

We’ll first compare the terminology of the patterns with that of the Python modules.

  • blocking IO: this would translate to a read()/write() on a blocking socket. The call would block until there was some data available to read or the socket was closed. The thread making the call cannot do anything else.
  • non-blocking, synchronous IO: this would translate to a read()/write() on a non-blocking socket. The call would return immediately, either with the data read/written, or with a signal that the IO operation could not complete (e.g., read() returns -1 with errno set to EWOULDBLOCK/EAGAIN). It is then the caller’s responsibility to keep retrying until the operation succeeds (see the sketch after this list).
  • non-blocking, asynchronous IO: this would translate to the Unix SIGIO mechanism (unfortunately, I am not familiar with it), or the POSIX aio_* functions (not familiar with these either). Essentially, these IO calls return immediately, the OS carries out the operation in a separate (kernel-level) thread, and when the operation completes, the user code gets some notification.
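
Here is a small sketch of the second flavour, non-blocking synchronous IO, in plain Python 2 socket code (the server address is hypothetical):

import socket, errno

s = socket.socket()
s.connect(('localhost', 8000))
s.setblocking(0)                    # subsequent calls return immediately
try:
    data = s.recv(1024)             # either data, or a "would block" error
except socket.error, e:
    if e.args[0] in (errno.EAGAIN, errno.EWOULDBLOCK):
        pass                        # nothing ready yet; the caller retries later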

The Reactor Pattern: asyncore

According to the authors, here is how the Reactor pattern, which usually would use non-blocking synchronous IO, would work:

Here’s a read in Reactor:

  1. An event handler declares interest in I/O events that indicate readiness for read on a particular socket
  2. The event de-multiplexer waits for events
  3. An event comes in and wakes up the demultiplexor, and the demultiplexor calls the appropriate handler
  4. The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher

How does this work in Python? It’s done using the asyncore module.

  1. The IO demux is the asyncore.loop() function; it listens for events on sockets using either the select() or poll() OS call. It uses a global or user supplied dictionary to map sockets to event handlers (see below). Event handlers are instances of asyncore.dispatcher (or its subclasses). A dispatcher contains a socket and registers itself in the global map, letting loop() know that its methods should be called in response to events on its sockets. It also, through its readable() and writable() methods, lets loop() know what events it is interested in handling.
  2. loop() uses select() or poll() to wait for events on the sockets it knows about.
  3. select()/poll() returns; loop() goes through each socket that has an event, finds the corresponding dispatcher object, determines the type of event, and calls a method corresponding to the event on the dispatcher object. In fact, loop() translates raw readable/writable events on sockets to slightly higher-level events using state information about the socket.
  4. The dispatcher object’s method is supposed to perform the actual IO: for example, in handle_read() we would read() the data off the socket and process it. Control then returns to loop(). Of course, one problem is that we should not do lengthy tasks in our handler, because then our server would not behave very concurrently and would be unable to process other events in time. But what if we did need to do time-taking tasks in response to the event? That’s a subject for another post. For now we assume that our handlers return quickly enough that, as a whole, the server behaves pretty concurrently. A minimal sketch of this arrangement follows.
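
Put minimally in code, the pieces fit together like this (a sketch, not a complete server):

import asyncore

class EchoDispatcher(asyncore.dispatcher):
    # readable()/writable() tell loop() which events we want; both default
    # to True, so loop() will call our handlers as events arrive
    def handle_read(self):
        data = self.recv(1024)   # step 4: the handler performs the actual read
        if data:
            self.send(data)      # do only quick work, then return to loop()

# asyncore.loop() is the IO demux: select()/poll() plus dispatch.
# (A real server would create one EchoDispatcher per accepted connection.)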

The Proactor pattern: a pseudo-implementation in asynchat

According to the authors, here is how the Proactor pattern, which would usually use true asynchronous IO operations provided by the OS, would work:

Here is a read operation in Proactor (true async):

  1. A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.
  2. The event demultiplexor waits until the operation is completed
  3. While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete
  4. The event demultiplexor calls the appropriate handler;
  5. The event handler handles the data from user defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.

How does this work in Python? Using the asynchat module.

  1. Event handlers are instances of asynchat.async_chat (or rather, its subclasses). Taking read as an example, the handler would register interest in reading data by providing a readable() method that returns True.
  2. loop() would then use it to wait on its socket until the socket was readable. When the socket becomes readable, instead of calling some OS function to read the data, async_chat.handle_read() is called.
  3. This method will slurp up all available data.
  4. Then, handle_read() would call the collect_incoming_data() method of the subclass. From the subclass’s point of view, someone else has done the job of doing the actual IO, and it is being signaled that the IO operation is complete.
  5. collect_incoming_data() processes the data, and by returning, implicitly starts a new async IO cycle.

The similarity between asynchat and Proactor is that, from the application writer’s point of view, one only has to write collect_incoming_data(). The difference is that with asynchat, user-level code is doing the IO, instead of true async facilities provided by the OS. The difference is greater when considering write operations. In a true Proactor, the event handler would initiate the write, and the event demultiplexer would wait for the completion event. In asynchat, however, the event handler (the subclass of async_chat) does not initiate the write per se: it creates the data and pushes it onto a FIFO, and loop(), indirectly through async_chat, writes it to the socket using synchronous non-blocking IO.

A Unified API

Basically, Python’s asynchat is providing an emulated Proactor interface to application writers. It would be good if asynchat could be redone to use true async IO operations on OSes that support them, falling back to synchronous non-blocking IO where they are not available.

Python’s asynchat module

Introduction

As mentioned in the previous post, I was going to look at how to write a network server using Python’s asynchat module. To utilize the asynchat module’s capabilities, I had to change the semantics of the echo server a little bit. The echo server using asyncore would echo back the data as soon as it got it. The echo server using asynchat will echo data back line by line, where each line should be terminated by the string “\r\n”.

The async_chat interface for server writers

asynchat provides a higher level interface than asyncore. In this interface, to write a server, you subclass asynchat.async_chat, and override two methods:

collect_incoming_data(data)

Unlike asyncore, you don’t have to bother with the handle_read() event. The framework will read the data for you and call this method with the data. You probably want to save this data somewhere, in preparation for processing it later.

found_terminator()

The framework calls this method when it detects that a ‘terminator’ has been found in the incoming data stream. The framework decides this based on information you give to the framework using the set_terminator() method.

Getting started

So how do you use this module to write a server? Just as with asyncore, you write a server class and instantiate an object; this object’s socket is the server socket; you handle the handle_accept() event and create objects of class async_chat (or, rather, a subclass of async_chat that you created) to handle the client connection. The only difference between asyncore and asynchat, so far, is the class of object that you instantiate to handle the client connection.

Let’s get started. First we look at the driver code:

server_main.py:


import asynchat_echo_server
...
server = module.EchoServer((interface, port))
server.serve_forever()

The server class

Our ‘EchoServer’ class looks pretty much like before:

asynchat_echo_server.py


class EchoServer(asyncore.dispatcher):

    allow_reuse_address = False
    request_queue_size = 5
    address_family = socket.AF_INET
    socket_type = socket.SOCK_STREAM

    def __init__(self, address, handlerClass=EchoHandler):
        self.address = address
        self.handlerClass = handlerClass

        asyncore.dispatcher.__init__(self)
        self.create_socket(self.address_family,
                           self.socket_type)

        if self.allow_reuse_address:
            self.set_reuse_addr()

        self.server_bind()
        self.server_activate()

    def server_bind(self):
        self.bind(self.address)
        log.debug("bind: address=%s:%s" % (self.address[0], self.address[1]))

    def server_activate(self):
        self.listen(self.request_queue_size)
        log.debug("listen: backlog=%d" % self.request_queue_size)

    def fileno(self):
        return self.socket.fileno()

    def serve_forever(self):
        asyncore.loop()
        # TODO: try to implement handle_request()

    # Internal use
    def handle_accept(self):
        (conn_sock, client_address) = self.accept()
        if self.verify_request(conn_sock, client_address):
            self.process_request(conn_sock, client_address)

    def verify_request(self, conn_sock, client_address):
        return True

    def process_request(self, conn_sock, client_address):
        log.info("conn_made: client_address=%s:%s" % \
                 (client_address[0],
                  client_address[1]))
        self.handlerClass(conn_sock, client_address, self)

    def handle_close(self):
        self.close()

The difference is in the handlerClass, which is defined to be EchoHandler as before, but is coded differently. When we instantiate this object, it gets added to the global map of sockets that loop() is monitoring, and now loop() will monitor events on the client socket as well as the server socket. There can be any number of sockets. This behaviour is the same as that of asyncore.

Handling per-client connections

Here is how we start our new EchoHandler:


class EchoHandler(asynchat.async_chat):

    LINE_TERMINATOR = "\r\n"

    def __init__(self, conn_sock, client_address, server):
        asynchat.async_chat.__init__(self, conn_sock)
        self.server = server
        self.client_address = client_address
        self.ibuffer = []

        self.set_terminator(self.LINE_TERMINATOR)

As can be seen, the init method calls async_chat’s set_terminator() method with a string argument. The string argument tells async_chat that a message or record is terminated when it encounters that string in the data. Now, loop() will wait on this client socket and call async_chat’s handle_read() method. async_chat’s handle_read() will read the data, look at it, and call the collect_incoming_data() method that you define:


    def collect_incoming_data(self, data):
        log.debug("collect_incoming_data: [%s]" % data)
        self.ibuffer.append(data)

As you can see, we just buffer the data here for later processing.

Now, in the handle_read() method, async_chat will look for the string set by set_terminator(). If it finds it, then it will call the found_terminator() method that we define:


    def found_terminator(self):
        log.debug("found_terminator")
        self.send_data()

When we find that we have a complete line (because it was terminated by “\r\n”) we just send the data back. After all, we are writing an echo server.

Sending data

Sending data back to peers is a common task. Using asyncore, we would create the data to be sent back and put it in a buffer. Then we’d wait for handle_write() events, writing as much data from the buffer to the socket as possible in each event.

asynchat makes this easier. We create the data, put it in a so called ‘producer’ object, and push the producer object to a FIFO. async_chat will then call each producer in turn, get data from it, send it out over the socket, piece by piece, until the producer is exhausted; it will then move on to the next producer.

If it encounters a None object in place of a producer, async_chat will close the connection.

All this can be accomplished with:


    def send_data(self):
        data = "".join(self.ibuffer)
        log.debug("sending: [%s]" % data)
        self.push(data + self.LINE_TERMINATOR)
        self.ibuffer = []

As you can see, putting the data in a producer object and pushing it onto the FIFO takes just one line of code: self.push(...). We don’t have to define a producer class in the normal case, because async_chat provides a simple_producer class for us, and the push() method creates an object of that class, populates it with whatever we supply, and then pushes it onto the FIFO. This behaviour can be overridden using the async_chat API, but we will look at that in another installment.

We have not bothered to push a None onto the FIFO, because we depend on the client closing the connection. We might have set a timer and, when it expired, closed the connection ourselves, to handle clients that go away without properly closing the connection.

Here is the full code:


import logging
import asyncore
import asynchat
import socket

logging.basicConfig(level=logging.DEBUG, format="%(created)-15s %(levelname)8s %(thread)d %(name)s %(message)s")
log = logging.getLogger(__name__)

BACKLOG = 5
SIZE = 1024

class EchoHandler(asynchat.async_chat):

    LINE_TERMINATOR = "\r\n"

    def __init__(self, conn_sock, client_address, server):
        asynchat.async_chat.__init__(self, conn_sock)
        self.server = server
        self.client_address = client_address
        self.ibuffer = []

        self.set_terminator(self.LINE_TERMINATOR)

    def collect_incoming_data(self, data):
        log.debug("collect_incoming_data: [%s]" % data)
        self.ibuffer.append(data)

    def found_terminator(self):
        log.debug("found_terminator")
        self.send_data()

    def send_data(self):
        data = "".join(self.ibuffer)
        log.debug("sending: [%s]" % data)
        self.push(data + self.LINE_TERMINATOR)
        self.ibuffer = []

    def handle_close(self):
        log.info("conn_closed: client_address=%s:%s" % \
                 (self.client_address[0],
                  self.client_address[1]))

        asynchat.async_chat.handle_close(self)

class EchoServer(asyncore.dispatcher):

    allow_reuse_address = False
    request_queue_size = 5
    address_family = socket.AF_INET
    socket_type = socket.SOCK_STREAM

    def __init__(self, address, handlerClass=EchoHandler):
        self.address = address
        self.handlerClass = handlerClass

        asyncore.dispatcher.__init__(self)
        self.create_socket(self.address_family,
                           self.socket_type)

        if self.allow_reuse_address:
            self.set_reuse_addr()

        self.server_bind()
        self.server_activate()

    def server_bind(self):
        self.bind(self.address)
        log.debug("bind: address=%s:%s" % (self.address[0], self.address[1]))

    def server_activate(self):
        self.listen(self.request_queue_size)
        log.debug("listen: backlog=%d" % self.request_queue_size)

    def fileno(self):
        return self.socket.fileno()

    def serve_forever(self):
        asyncore.loop()
        # TODO: try to implement handle_request()

    # Internal use
    def handle_accept(self):
        (conn_sock, client_address) = self.accept()
        if self.verify_request(conn_sock, client_address):
            self.process_request(conn_sock, client_address)

    def verify_request(self, conn_sock, client_address):
        return True

    def process_request(self, conn_sock, client_address):
        log.info("conn_made: client_address=%s:%s" % \
                 (client_address[0],
                  client_address[1]))
        self.handlerClass(conn_sock, client_address, self)

    def handle_close(self):
        self.close()