Python to Cython and benchmarking
July 24, 2009
Recently I have been coding with Cython for my project PyCAF. Obviously, I am doing this to make my code run faster. My approach was to first write my code in Python and make sure it works correctly. I wrote unit-tests for each function and class, using nosetests --with-coverage to look for test cases I might have missed. Then I profiled the code, finding out hot-spots, and eliminating the obvious issues.
Next up was to write some performance micro-benchmarks for important bits of code. These benchmarks have the same interface: they take an “input size” N as a parameter. What N means is specific to the function. For e.g., for the function event_one_one(N) it means “create N pairs of tasklets, and make each pair exchange one event” but for the function event_one_many(N) it means “create one sender tasklet and N recipient tasklets waiting for the sender to send an event”. You get the drift. The micro-benchmarks are tested as well, by the simple expedient of writing a unit-test case per micro-benchmark that calls it with a small input.
Digression: I spent some time looking for some tool where I could store results of my benchmarks per code commit and then later browse how a particular benchmark varied over code commits, but found nothing…. does anyone have any ideas? If not, I might write a tool for this in the future.
Anyway, the benefit of micro-benchmarks is that you can see how the performance scales when the input sizes grow. For e.g., here is the output of my benchmark tool:
| Test Name | N | Ratio | Time (s) | Ratio | K-PyStones |
|---|---|---|---|---|---|
| tasklet_new | 100 | 1 | 0.000518083572388 | 1 | 0.0364847586189 |
| tasklet_new | 1000 | 10 | 0.00419187545776 | 8.09111826967 | 0.295202497026 |
| tasklet_new | 10000 | 10 | 0.0460090637207 | 10.9757706746 | 3.24007490991 |
| tasklet_new | 100000 | 10 | 0.516650915146 | 11.2293290357 | 36.3838672638 |
| tasklet_yield | 100 | 1 | 0.000921964645386 | 1 | 0.0649270877032 |
Some things to note here: I convert the time taken by a test to kilo-pystones and record that as well as the time taken. What’s a Pystone? Well, its the output of python -c "from test import pystone; pystone.main()". For my machine:
Pystone(1.1) time for 50000 passes = 0.7
This machine benchmarks at 71428.6 pystones/second
So basically pystones is a (somewhat) machine-independent measure of how long a test took to run. While a test runs faster on a fast machine and slower on a slow machine, when you convert the time to pystones, it should be the same.
Now the interesting thing to note is how the time for a test increases when the input grows larges. To help me, my benchmark prints the ratio between successive input sizes and times taken. If I increase the input size by, say 10 times, and the time taken increases 100 times, then I might have a problem. Of course, behind every test are a bunch of functions and algorithms. I have a general expectation of what the complexity of my function should be, and look at the benchmark to confirm my expectations. I can then replace algorithms with high complexity with better ones. A good design and a well chosen algorithm or data structure gives more speedup than mindlessly twiddling around with code.
Digression: My benchmark is not reporting the space-complexity of the tests (aka, how much memory is being consumed). Tarek Ziade mentions some tools in his excellent book “Expert Python Programming”, but I have not understood the details of the tools well enough to incorporate them into my testing, just yet.
Once I find a benchmark that I want to improve, I will first profile it, to find where the function or algorithm is spending most time. This is obvious and there are enough resources on the web about it (just look for “cProfile”). What I was wondering about was how would I profile code that I had converted to Cython, since Cythonized code becomes a binary? Well, the Cython wiki gives the answer: http://wiki.cython.org/Profiling. So thats what I will be looking into very soon.
Things to get right about MySQL from day #1
July 14, 2009
There is an excellent article in Linux Magazine titled “MySQL Performance from the Start”. Its by Jeremy Jawodny, so it comes with some serious authority backing it. He talks about some mistakes people make when developing the typical database backed web app and how keeping some things correct from the start will save oodles of pain later. I confess, this touched a nerve. I have experienced all of these mistakes and the pain that comes along with them.
Here is what Jeremy says, in a nutshell:
- What’s MySQL There For? MySQL is there for stuff that you do relational queries on. Don’t store BLOBs (images, videos, excel sheets, word docs, etc.) in the db; store metadata about them, for sure, but store the binary data that you can’t really query in a distributed file system. And don’t log to the database, either. Logs aren’t usually queried often by the front-end and tend to grow unbounded, slowing down the database and adding to admin’s headaches when they become too big.
- Plan for a caching layer between the app and the DB; memcached is a good solution, but there are more. You can do this, for e.g., by putting all DB access code in a library initially when you don’t have caching; later, when it is time to add memcached, you only have one place to modify.
- OLTP vs OLAP. Front end database is transactional in nature: a user session reads/updates many different tables but using some index (the user ID, for e.g.). Then there is reporting. It typically wants to look at all data (“Count users who did XXX in last YYY days”). From day 1, have separate front-end and reporting databases with different schemas, the formed optimized for indexed queries and perhaps denormalized a little to optimize SELECTs, and the latter tuned for aggregate queries with complex joins. If you don’t have a reporting db yet, at least don’t convolute the front-end database with reporting-like schema: keep it lean. Also see the next poi
Here I want to note down what I was thinking when I made these mistakes, so I can present them as arguments (for myself and others) when someone is about to make the same mistake again.
Storing BLOBs in the DB
When I started out, I did a little research on the performance of retrieving images from the db vs the file-system. I ended up choosing the DB because:
- Storing images in the DB automatically gave me a distributed repository; at that time I was not aware of any good distributed file-systems, and not having another component in the stack meant there was one less thing to manage
- I thought that there was no significant performance impact of storing BLOBs in the db
Boy, was I wrong.
First, storing BLOBs in db meant that some tables have become VERY large in physical size. It is a pain to query those tables, and back up and trim the db, now. If I have to remove old, unused BLOBs now, the DELETE query takes very long and is threatening to bring the DB to its knees. If I had only kept the metadata in the DB and stored the files in the file-system, I could have very quickly removed just the metadata from the db, freeing it up for its normal duties, and removed the files themselves at leisure. So, all in all, its worth spending the time to find, implement and maintain the distribtued file-system solution in the long run.
Secondly, I can’t take advantage of the highly optimized file-serving capabilities of web servers. When I started out, I was serving out everything from scripts: dynamic data as well as static images. When the load was low, things were fine. When the load went up, I painfully realized that more than 50% of the hits to the app were for static images that could have been served off the file-system by the web-server itself.
Thirdly, the table storing BLOBS should have stored just the BLOB and an identifier, nothing else. You don’t want to make InnoDB crawl through kilobytes of binary data looking for that 4-byte integer you stashed away there. Thankfully, this is a mistake we did not make. Or, rather, cannot make. Because by now we know that keeping anything in that table that a SELECT mught use in a JOIN or WHERE clause will make for a query that will never finish and bring the db down to boot.
Storing Logs in the DB
We needed reports on how often certain pages were being hit and certain actions triggered. We could have logged it to the filesystem and written a separate script to grok the logs, create summaries, and bundle it off to the reporting db. But that was too much work. It seemed just so easy to simply stuff raw log data into the front-end db. We would, we reasoned, one day write a script that would grok the table, create summaries, bundle it off to the reporting db, and trim the log table. (Notice that this is slightly more work.)
Jeremy is right: you never quite get around to write that “trim table” cron job and the log table just keeps growing. The management report is generated off that table directly. One day, it will finally become too slow.
Caching Layer
We do have a caching layer. The front-end code makes heavy use of memcached, and all db access is through functions in one or two files. Guess how this happened? Yeah, because of a major code-refactoring and speed-up sprint when we realized the servers were about to go kaput.
Multiple Languages, ORMs and Caching
There is another problem: memcached is good for storing objects, not tuples or records. But if you have programs in multiple languages, and they all use different ORMs, or worse, no ORM at all, then the utility of memcached becomes limited to one part of the system. Component A uses and cached some data, but component B, which needs the same data cannot use the cache because its repesentation of the object is different.
I don’t know if there is a lesson here. One might be tempted to say “use one language and ORM for everything”, but then one is also tempted to say “use the best language/library/framework for the task at hand”.
Multi-language systems are a reality. The best I can think of is to keep the number of different languages and frameworks small. And use the same ORM for all components written in a particular language.
OLTP vs OLAP
The big problem with not planning for a separate reporting db from day 1 is that
- you keep adding raw logs to the front-end db and they never get trimmed;
- you are afraid of deleting historical information from the db; you don’t know what kind of report you might want in the future, so you just keep all the information there, indefinitetely.
Better to:
- Create a reporting db and reporting system from day 1, with a very simple report that sumarizes some information;
- Write tools to create summary data for the report and trim the front-end db
- Have an “archive” db separate from the front end db; write tools to move historical stuff to the archive db and remove it from the main db
Make it clear to management that:
- Any report they request that needs new summary information (that you are not already generating) will not be able to report stuff from the past;
- Or, if they want arbitrary reports in the future, you will have to keep ALL data around, and they better be ready to issue checks to buy bigger and bigger servers for the DB.