Jeff Darcy’s notes on really high performance servers

In his article “High-Performance Server Architecture” (

Jeff Darcy talks about what kills performance. He mentions the following:

  • data copies
  • context switches
  • memory allocation
  • lock contention

as the biggest four reasons for poor performance, especially at high concurrency. I did not quite understand his suggestions on reducing lock contention. But perhaps it will become clearer by putting his idea to work on a real problem.

He also mentions:

  • How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
  • How efficient is the network protocol you’re using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
  • Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains.
  • What’s your page size? What’s your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
  • Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have “thundering herd” problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?

I am now itching to try a few things. However, how does one begin in a dynamic language like Python? I can’t do much about memory allocation. Umm… also, Python does its own reference counting, so data copies should not be a huge problem: I just have to ensure that my own code does not make unnecessary copies. Context switches and lock contention look like the primary and secondary targets to focus on for performance and design.

This statement is rather interesting:

It’s very important to use a “symmetric” approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Now, how does one go about doing that and still have clean, understandable, and maintainable code?

powered by performancing firefox

Post a comment or leave a trackback: Trackback URL.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: