Tag Archives: performance

MySQL and Index Woes

I have often noticed that altering indexes on large tables in MySQL is a very time consuming process. I’d chalked it up to “just how things are” and account for that when doing database migrations. Possibly because I’d usually *add* an index on the table not drop it.

But apparently there there is a design flaw in MySQL 5.0 and before with respect to how it handles indexes that caused dropping indexes on tables very slow, and also adding a new index on the table slower than it needed to be. Gory details here: http://bugs.mysql.com/bug.php?id=2364

But in 5.1, at least dropping table ought to be near instantaneous. Adding an index? Umm… it seemed the old behaviour was:

Given the same MyISAM table T having four indexes (ndx1,ndx2,ndx3,ndx4)
and you want to ‘alter table T add index ndx5 (…);’ here is exactly
what happens (from: http://lists.mysql.com/mysql/202489)

1) MySQL copies T.MYD to a temp table, i.e., S.MYD and a zero byte S.MYI.
2) MySQL does ‘alter table S add index ndx1 (…);
3) MySQL does ‘alter table S add index ndx2 (…);
4) MySQL does ‘alter table S add index ndx3 (…);
5) MySQL does ‘alter table S add index ndx4 (…);
6) MySQL does ‘alter table S add index ndx5 (…);
7) MySQL deletes T.MYD and deletes T.MYI
8) MySQL renames S.MYD to T.MYD, and renames S.MYI to T.MYI

Note that each of the “alter table” step involves copying the table to a temp table with no indexes, applying all existing indexes one by one, and then applying the new index…. so in the above case:

In fact, for table T with no indexes and you want to add N indexes
MySQL will copy the MYD N times
MySQL will copy the MYI N times
MySQL will run ‘alter table add index’ N(N+1)/2 times if adding an index
MySQL will run ‘alter table drop index’ N(N-1)/2 times if dropping an index

So here is a better way to add indexes to a table:

1) create table T1 like T;
This creates an empty table T1 with indexes ndx1,ndx2,ndx3 and ndx4.
2) alter table T1 drop index ndx3;
This drops index ndx3 on the empty T1, which should be instantaneous.
3) insert into T1 select * from T;
This will populate table T and load all three(3) indexes for T1 in one pass.
4) drop table table T;
5) alter table T1 rename to T;

CPU consumption by idle JVM and how to reduce it


Monitoring, knowledge and teamwork

Jeremy Zawodny, MySQL guru, tells a story about:

One of the frustrating things about building and running backend infrastructure services is that things break sometimes — often in unexpected ways. And when they do, you can spend untold hours chasing down various ideas, most of which are dead ends, before eventually solving the problem.

He starts with a curious MySQL login failure and ends up with IRQs and boot time kernel options.


Read the story here: part I, part 2.

Python to Cython and benchmarking

Recently I have been coding with Cython for my project PyCAF. Obviously, I am doing this to make my code run faster. My approach was to first write my code in Python and make sure it works correctly. I wrote unit-tests for each function and class, using nosetests --with-coverage to look for test cases I might have missed. Then I profiled the code, finding out hot-spots, and eliminating the obvious issues.

Next up was to write some performance micro-benchmarks for important bits of code. These benchmarks have the same interface: they take an “input size” N as a parameter. What N means is specific to the function. For e.g., for the function event_one_one(N) it means “create N pairs of tasklets, and make each pair exchange one event” but for the function event_one_many(N) it means “create one sender tasklet and N recipient tasklets waiting for the sender to send an event”. You get the drift. The micro-benchmarks are tested as well, by the simple expedient of writing a unit-test case per micro-benchmark that calls it with a small input.

Digression: I spent some time looking for some tool where I could store results of my benchmarks per code commit and then later browse how a particular benchmark varied over code commits, but found nothing…. does anyone have any ideas? If not, I might write a tool for this in the future.

Anyway, the benefit of micro-benchmarks is that you can see how the performance scales when the input sizes grow. For e.g., here is the output of my benchmark tool:

Test Name N Ratio Time (s) Ratio K-PyStones
tasklet_new 100 1 0.000518083572388 1 0.0364847586189
tasklet_new 1000 10 0.00419187545776 8.09111826967 0.295202497026
tasklet_new 10000 10 0.0460090637207 10.9757706746 3.24007490991
tasklet_new 100000 10 0.516650915146 11.2293290357 36.3838672638
tasklet_yield 100 1 0.000921964645386 1 0.0649270877032

Some things to note here: I convert the time taken by a test to kilo-pystones and record that as well as the time taken. What’s a Pystone? Well, its the output of python -c "from test import pystone; pystone.main()". For my machine:
Pystone(1.1) time for 50000 passes = 0.7
This machine benchmarks at 71428.6 pystones/second

So basically pystones is a (somewhat) machine-independent measure of how long a test took to run. While a test runs faster on a fast machine and slower on a slow machine, when you convert the time to pystones, it should be the same.

Now the interesting thing to note is how the time for a test increases when the input grows larges. To help me, my benchmark prints the ratio between successive input sizes and times taken. If I increase the input size by, say 10 times, and the time taken increases 100 times, then I might have a problem. Of course, behind every test are a bunch of functions and algorithms. I have a general expectation of what the complexity of my function should be, and look at the benchmark to confirm my expectations. I can then replace algorithms with high complexity with better ones. A good design and a well chosen algorithm or data structure gives more speedup than mindlessly twiddling around with code.

Digression: My benchmark is not reporting the space-complexity of the tests (aka, how much memory is being consumed). Tarek Ziade mentions some tools in his excellent book “Expert Python Programming”, but I have not understood the details of the tools well enough to incorporate them into my testing, just yet.

Once I find a benchmark that I want to improve, I will first profile it, to find where the function or algorithm is spending most time. This is obvious and there are enough resources on the web about it (just look for “cProfile”). What I was wondering about was how would I profile code that I had converted to Cython, since Cythonized code becomes a binary? Well, the Cython wiki gives the answer: http://wiki.cython.org/Profiling. So thats what I will be looking into very soon.

Facebook uses Erlang

Here is proof: http://www.joyent.com/?gclid=CLeIy4La-pQCFRMJewoddC7c3A

For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as well as an epoll-driven web server (in Erlang) that holds online users’ conversations in-memory and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for reliability and efficient failover. Why Erlang? In short, because the problem domain fits Erlang like a glove. Erlang is a functional concurrency-oriented language with extremely low-weight user-space “processes”, share-nothing message-passing semantics, built-in distribution, and a “crash and recover” philosophy proven by two decades of deployment on large soft-realtime production systems.

Reactor vs Proactor

I found a comparison of the Reactor and the Proactor pattern here. Both patterns talk about isses that crop up when building a concurrent network server. Both are related alternatives to thread based concurrency (or could work as a complement to thread based concurrency).
Both revolve around the concept of an IO De-multiplexer, event sources and event handlers. The driver program registers some event sources (e.g., sockets) with an IO de-multiplexer (e.g., select() or poll()). When an event occurs on a socket, a corresponding event handler is called. Of course, there must be some map between events from an event source to event handlers.
I found that these patterns are more or less embodied in the Python asyncore and asynchat modules, and want to discuss how the modules implement these patterns.

The Basics

We’ll first compare the terminology of the patterns with that of the Python modules.

  • blocking IO: this would translate to a read()/write() on a blocking socket. The call would block until there was some data available to read or the socket was closed. The thread making the call cannot do anything else.
  • non-blocking, synchronous IO: this would translate to a read()/write() on a non-blocking socket. The call would return immediately, either with the data read/written, or with a signal that the IO operation could not complete (e.g., read() returns with -1, and errno set to EWOULBLOCK/EAGAIN. It is then the caller’s responsibility to keep calling repeatedly until the operation succeeds.
  • non-blocking, asynchronous IO: this would translate to Unix SIGIO mechanisms (unfortunately, I am not familiar with this), or posix aio_* functions (not familiar with these either). Essentially, these IO calls return immediately, and the OS starts doing the operation in a separate (kernel level) thread; when the operation is ready, the user code is given some notification.

The Reactor Pattern: asyncore

According to the authors, here is how the Reactor pattern, which usually would use non-blocking synchronous IO, would work:

Here’s a read in Reactor:

  1. An event handler declares interest in I/O events that indicate readiness for read on a particular socket
  2. The event de-multiplexer waits for events
  3. An event comes in and wakes-up the demultiplexor, and the demultiplexor calls the appropriate handler
  4. The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher

How does this work in Python? Its done using the asyncore module.

  1. The IO demux is the asyncore.loop() function; it listens for events on sockets using either the select() or poll() OS call. It uses a global or user supplied dictionary to map sockets to event handlers (see below). Event handlers are instances of asyncore.dispatcher (or its subclasses). A dispatcher contains a socket and registers itself in the global map, letting loop() know that its methods should be called in response to events on its sockets. It also, through its readable() and writable() methods, lets loop() know what events it is interested in handling.
  2. loop() uses select() or poll() to wait for events on the sockets it knows about.
  3. select()/poll() returns; loop() goes through each socket that has an event, find the corresponding dispatcher object, determines the type of event, and calls a method corresponding to the event on the dispatcher object. In fact, loop() translates raw readable/writable events on sockets to slightly higher-level events using state information about the socket.
  4. The dispatcher object’s method is supposed to perform the actual IO: for example, in handle_read() we would read() the data off the socket and process it. Control then returns to loop(). Of course, one problem is that we should not do lengthy tasks in our handler, because then our server would not behave very concurrently and be unable to process other events in time. But what if we did need to do time-taking tasks in response to the event? Thats a subject for another post. For now we assume that our handlers can return quickly enough that as a whole the server behaves pretty concurrently.

The Proactor pattern: a psuedo-implementation in asynchat

According to the authors, here is how the Proactor pattern, which would usually use true asynchronous IO operations provided by the OS, would work:

Here is a read operation in Proactor (true async):

  1. A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.
  2. The event demultiplexor waits until the operation is completed
  3. While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete
  4. The event demultiplexor calls the appropriate handler;
  5. The event handler handles the data from user defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.

How does this work in Python? Using the asynchat module.

  1. Event handlers are instances of asynchat.async_chat (or rather, its subclasses). Taking read as an example, the handler would register interest in reading data by providing a readable() method that returns True.
  2. loop() would then use it to wait on its socket until the socket was readable. When the socket become readable, instead of calling some OS function to read the data, async_chat.handle_read() is called.
  3. This method will slurp up all available data.
  4. Then, handle_read() would call the collect_incoming_data() method of the subclass. From the subclass’s point of view, someone else has done the job of doing the actual IO, and it is being signaled that the IO operation is complete.
  5. collect_incoming_data() processes the data, and by returning, implicitly starts a new async IO cycle.

The similarity between asynchat and Proactor is that from the application writer’s point of view, he only has to write code to collect_incoming_data(). The difference is that, with asynchat, user level code is doing the IO, instead of true async facilities provided by the OS. The difference is greater when considering write operations. In a true Proactor, the event handler would initiate the write, and the event demultiplexer would wait for the completion event. However, in asynchat, the event handler (the subclass of async_chat) does not initiate the write per-se: it creates the data and pushes it onto a fifo, and loop(), indirectly through async_chat, writes it to the socket using synchronous non-blocking IO.

A Unified API

Basically, Python’s asynchat is providing an emulated Proactor interface to application writers. It would be good if asynchat could be redone so that it could use true async IO operations on OSes that support them, and fall back to synchronous IO when it is not available.

Concurrency: Erlang vs Java

Its fashionable to extol the high-performance of Erlang concurrency these days. Programming Erlang, Section 8.11, has this problem:

Write a ring benchmark. Create N processes in a ring. Send a message
round the ring M times so that a total of N * M messages get
sent. Time how long this takes for different values of N and M.

Well, I copied off the Java program from here: http://www.sics.se/~joe/ericsson/du98024.html

And then I wrote this Erlang program:


timeit() ->
register(mainproc, self()),
io:format(" Procs, Mesgs, SpawnTotal, SpawnProc, RunTotal, RunMesg~n", []),
timeit_aux(1000, 1000, 10000),

timeit_aux(NProcs, NMsgs, NProcsMax) ->
if NProcs =
main([NProcs, NMsgs]),
timeit_aux(NProcs+1000, NMsgs, NProcsMax);
true -> void

main([N, M]) ->
{_, _W0} = statistics(wall_clock),
FirstPid = spawn(fun loop0/0),
% io:format("First: ~p~n", [FirstPid]),
LastPid = setup(N-1, FirstPid),
FirstPid ! LastPid,
{_, W1} = statistics(wall_clock),
LastPid ! M,
stop ->
{_, W2} = statistics(wall_clock),
io:format("~6B, ~6B, ~10g, ~10g, ~10g, ~10g~n", [N, M, W1*1.0, 1.0*W1/N, W2*1.0, W2*1000.0/(M*N+N)]).

setup(0, PrevPid) ->
setup(HowMany, PrevPid) ->
Pid = spawn(fun() -> loop(PrevPid) end),
% io:format("~p: Linking to: ~p~n", [Pid, PrevPid]),
setup(HowMany - 1, Pid).

loop0() ->
LinkedPid when is_pid(LinkedPid) ->
% io:format("First: ~p: Linking to ~p~n", [self(), LinkedPid]),
X ->
erlang:error({unknownMessageError, X, self()})

loop1(LinkedPid) ->
M when is_integer(M), M > 0 ->
% io:format("~p: Received: ~p~n", [self(), M]),
LinkedPid ! (M-1),
M when is_integer(M), M =:= 0 ->
% io:format("~p: Received: ~p, Terminating~n", [self(), M]),
mainproc ! stop,
X ->
erlang:error({unknownMessageError, X, self()})

loop(LinkedPid) ->
M when is_integer(M), M > 0 ->
% io:format("~p: Received: ~p~n", [self(), M]),
LinkedPid ! M,
M when is_integer(M), M =:= 0 ->
% io:format("~p: Received: ~p, Terminating~n", [self(), M]),
LinkedPid ! M,
X ->
erlang:error({unknownMessageError, X, self()})

The performance is something like this (all times are in milliseconds, except the last one, which is in microseconds):

 Procs,  Mesgs, SpawnTotal,  SpawnProc,   RunTotal,    RunMesg
1000, 1000, 0.00000e+0, 0.00000e+0, 266.000, 0.265734
2000, 1000, 16.0000, 8.00000e-3, 562.000, 0.280719
3000, 1000, 0.00000e+0, 0.00000e+0, 969.000, 0.322677
4000, 1000, 0.00000e+0, 0.00000e+0, 1672.00, 0.417582
5000, 1000, 15.0000, 3.00000e-3, 2469.00, 0.493307
6000, 1000, 16.0000, 2.66667e-3, 3234.00, 0.538462
7000, 1000, 16.0000, 2.28571e-3, 4015.00, 0.572998
8000, 1000, 16.0000, 2.00000e-3, 4750.00, 0.593157
9000, 1000, 31.0000, 3.44444e-3, 5469.00, 0.607060
10000, 1000, 31.0000, 3.10000e-3, 6375.00, 0.636863

The numbers in the RunMesg column are the time it takes to send one message from one process to another. Why is it increasing? Does the runtime have to do more work to find the target process?

Here is the Java output (program below):

 Procs,  Mesgs, SpawnTotal, SpawnProcs,   RunTotal,   RunMesg
100, 1000, 16.000, 0.160, 15531.000, 15.531
200, 1000, 62.000, 0.310, 27047.000, 27.047
300, 1000, 94.000, 0.313, 37828.000, 37.828
400, 1000, 125.000, 0.313, 46859.000, 46.859
500, 1000, 141.000, 0.282, 54578.000, 54.578
600, 1000, 203.000, 0.338, 60594.000, 60.594
700, 1000, 234.000, 0.334, 65282.000, 65.282
800, 1000, 406.000, 0.508, 68156.000, 68.156
900, 1000, 484.000, 0.538, 69782.000, 69.782
1000, 1000, 735.000, 0.735, 70015.000, 70.015

The last column is in milliseconds whereas for Erlang it was in microseconds, AND the number of procs is 100-1000, whereas for Erlang they were 1000-10,000.

When unix processes are too much

I want to ssh into a server, run vmstat for a while, then get back the output of vmstat. All automatically, via a Python program. There are many problems. Here is a simulation of the problems through the command line.

Take 1

The first thought was to ssh in, run vmstat and redirect the output of vmstat to a file on the remove server, like so. Then later, I would kill ssh, and use scp to retrieve the logfile from the remote server. Here is the command line equivalent:

$ ssh ‘couffable@asgard’ ‘vmstat 1 2>&1 >/home/couffable/remote-vmstat.log’ &
[2] 10405

On remote machine:
[couffable@asgard ~]$ ps ax | grep vmstat
20921 ? Ss 0:00 bash -c vmstat 1 2>&1 >/home/couffable/remote-vmstat.log
20938 ? S 0:00 vmstat 1

This starts things nicely and gives me the PID of the ssh process. The trouble is when I want to stop the process:

On local machine
$ kill 10405
$ fg # to get bash to retrive ssh’s status and remove it from zombie list

But on the remote machine, vmstat is still running!
[couffable@asgard ~]$ ps ax | grep vmstat
20921 ? Ss 0:00 bash -c vmstat 1 2>&1 >/home/couffable/remote-vmstat.log
20938 ? S 0:00 vmstat 1

This won’t do at all.

Take 2

I noticed that if I run vmstat though ssh without redirecting vmstat’s output on the remote machine, ssh shows vmstat’s output on my terminal. So I said, why not redirect ssh’s output?. Here is what I tried:

On local machine:
$ ssh ‘couffable@asgard’ ‘vmstat 1’ 2>&1 >/home/couffable/local-vmstat.log &
[2] 10569

On the remote machine:
[couffable@asgard ~]$ ps ax | grep vmstat
21239 ? Ss 0:00 vmstat 1

Bingo! Now vmstat’s output is being stored in a local file, not on the remote machine. Or is it? I notice that on the local machine:
$ # press enter
[2]+ Stopped ssh ‘couffable@asgard’ ‘vmstat 1’ 2>&1 >/home/couffable/local-vmstat.log

That is, vmstat/ssh’s output is not getting logged because ssh stops when backgrounded. Puff!

Take 3

I decided that I won’t background ssh.

On local machine:
$ ssh ‘couffable@asgard’ ‘vmstat 1’ 2>&1 >/home/couffable/local-vmstat.log

On remote machine:
[couffable@asgard ~]$ ps ax | grep vmstat
21624 ? Ss 0:00 vmstat 1

After some time, again on local machine (in another terminal):
$ ps ax | grep vmstat
10880 pts/1 S+ 0:00 ssh couffable@asgard vmstat 1
$ kill 10880

And yes, the vmstat on remote machine dies too. Now, will this work in Python?

Take 4

The trouble with running the results of Take 3 in Python is that I can’t directly invoke ssh: the redirection won’t happen. For redirection, I need a shell. So I could do something like:

>>> pid = os.spawnlp(os.P_NOWAIT, “sh”, “sh”, “-c”, “ssh couffable@asgard vmstat 1”, “2>&1”, “>vmstat.log”)
>>> procs ———–memory———- —swap– —–io—- –system– —-cpu—-
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 144 82808 231972 1293708 0 0 2 2 0 2 1 0 99 0

What? No redirection? Not good. After killing ssh processes, I try again:
>>> pid = os.spawnlp(os.P_NOWAIT, “sh”, “sh”, “-c”, “ssh couffable@asgard vmstat 1 2>&1 >vmstat.log”)
>>> pid

Redirection is working now. I kill the process:
>>> os.kill(11935,9)
>>> pid, status = os.waitpid(11935, 0)
>>> status

Ok, so far so good. But the ssh and vmstat processes are still running! (Killing ssh does stop both ssh and vmstat, like Take 2 and unlike Take 1.) What gives? Killing sh does not stop ssh since sh spawned ssh. Hmm…. I don’t really want sh to spawn ssh. I want sh to redirect standard output and error and then exec ssh.

Take 5

>>> pid = os.spawnlp(os.P_NOWAIT, “sh”, “sh”, “-c”, “exec ssh couffable@asgard vmstat 1 2>&1 >vmstat.log”)
>>> pid

On local macine:
$ ps ax | grep vmstat
12080 pts/4 S+ 0:00 ssh couffable@asgard vmstat 1

On remote machine:
[couffable@asgard ~]$ ps ax | grep vmstat
24081 ? Ss 0:00 vmstat 1

Well and good. Now let’s kill the ssh process.
On local machine:
>>> os.kill(12080, 9)
>>> pid, status = os.waitpid(12080, 0)
>>> status

On local machine:
$ ps ax | grep vmstat

On remote machine:
[couffable@asgard ~]$ ps ax | grep vmstat


The question is: all this mucking around with Unix plumbing is wonderful, but could I have have done it faster if I’d written my own fork(), redirect, and exec mini-program instead of trying to invoke os.spawnlp(…) using sh magic?

Jeff Darcy’s notes on really high performance servers

In his article “High-Performance Server Architecture” (http://pl.atyp.us/content/tech/servers.html)

Jeff Darcy talks about what kills performance. He mentions the following:

  • data copies
  • context switches
  • memory allocation
  • lock contention

as the biggest four reasons for poor performance, especially at high concurrency. I did not quite understand his suggestions on reducing lock contention. But perhaps it will become clearer by putting his idea to work on a real problem.

He also mentions:

  • How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
  • How efficient is the network protocol you’re using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
  • Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains.
  • What’s your page size? What’s your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
  • Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have “thundering herd” problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?

I am now itching to try a few things. However, how does one begin in a dynamic language like Python? I can’t do much about memory allocation. Umm… also, Python does its own reference counting, so data copies should not be a huge problem: I just have to ensure that my own code does not make unnecessary copies. Context switches and lock contention look like the primary and secondary targets to focus on for performance and design.

This statement is rather interesting:

It’s very important to use a “symmetric” approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Now, how does one go about doing that and still have clean, understandable, and maintainable code?

powered by performancing firefox