There is an excellent article in Linux Magazine titled “MySQL Performance from the Start”. Its by Jeremy Jawodny, so it comes with some serious authority backing it. He talks about some mistakes people make when developing the typical database backed web app and how keeping some things correct from the start will save oodles of pain later. I confess, this touched a nerve. I have experienced all of these mistakes and the pain that comes along with them.

Here is what Jeremy says, in a nutshell:

  1. What’s MySQL There For? MySQL is there for stuff that you do relational queries on. Don’t store BLOBs (images, videos, excel sheets, word docs, etc.) in the db; store metadata about them, for sure, but store the binary data that you can’t really query in a distributed file system. And don’t log to the database, either. Logs aren’t usually queried often by the front-end and tend to grow unbounded, slowing down the database and adding to admin’s headaches when they become too big.
  2. Plan for a caching layer between the app and the DB; memcached is a good solution, but there are more. You can do this, for e.g., by putting all DB access code in a library initially when you don’t have caching; later, when it is time to add memcached, you only have one place to modify.
  3. OLTP vs OLAP. Front end database is transactional in nature: a user session reads/updates many different tables but using some index (the user ID, for e.g.). Then there is reporting. It typically wants to look at all data (“Count users who did XXX in last YYY days”). From day 1, have separate front-end and reporting databases with different schemas, the formed optimized for indexed queries and perhaps denormalized a little to optimize SELECTs, and the latter tuned for aggregate queries with complex joins. If you don’t have a reporting db yet, at least don’t convolute the front-end database with reporting-like schema: keep it lean. Also see the next poi

Here I want to note down what I was thinking when I made these mistakes, so I can present them as arguments (for myself and others) when someone is about to make the same mistake again.

Storing BLOBs in the DB

When I started out, I did a little research on the performance of retrieving images from the db vs the file-system. I ended up choosing the DB because:

  • Storing images in the DB automatically gave me a distributed repository; at that time I was not aware of any good distributed file-systems, and not having another component in the stack meant there was one less thing to manage
  • I thought that there was no significant performance impact of storing BLOBs in the db

Boy, was I wrong.

First, storing BLOBs in db meant that some tables have become VERY large in physical size. It is a pain to query those tables, and back up and trim the db, now. If I have to remove old, unused BLOBs now, the DELETE query takes very long and is threatening to bring the DB to its knees. If I had only kept the metadata in the DB and stored the files in the file-system, I could have very quickly removed just the metadata from the db, freeing it up for its normal duties, and removed the files themselves at leisure. So, all in all, its worth spending the time to find, implement and maintain the distribtued file-system solution in the long run.

Secondly, I can’t take advantage of the highly optimized file-serving capabilities of web servers. When I started out, I was serving out everything from scripts: dynamic data as well as static images. When the load was low, things were fine. When the load went up, I painfully realized that more than 50% of the hits to the app were for static images that could have been served off the file-system by the web-server itself.

Thirdly, the table storing BLOBS should have stored just the BLOB and an identifier, nothing else. You don’t want to make InnoDB crawl through kilobytes of binary data looking for that 4-byte integer you stashed away there. Thankfully, this is a mistake we did not make. Or, rather, cannot make. Because by now we know that keeping anything in that table that a SELECT mught use in a JOIN or WHERE clause will make for a query that will never finish and bring the db down to boot.

Storing Logs in the DB

We needed reports on how often certain pages were being hit and certain actions triggered. We could have logged it to the filesystem and written a separate script to grok the logs, create summaries, and bundle it off to the reporting db. But that was too much work. It seemed just so easy to simply stuff raw log data into the front-end db. We would, we reasoned, one day write a script that would grok the table, create summaries, bundle it off to the reporting db, and trim the log table. (Notice that this is slightly more work.)

Jeremy is right: you never quite get around to write that “trim table” cron job and the log table just keeps growing. The management report is generated off that table directly. One day, it will finally become too slow.

Caching Layer

We do have a caching layer. The front-end code makes heavy use of memcached, and all db access is through functions in one or two files. Guess how this happened? Yeah, because of a major code-refactoring and speed-up sprint when we realized the servers were about to go kaput.

Multiple Languages, ORMs and Caching

There is another problem: memcached is good for storing objects, not tuples or records. But if you have programs in multiple languages, and they all use different ORMs, or worse, no ORM at all, then the utility of memcached becomes limited to one part of the system. Component A uses and cached some data, but component B, which needs the same data cannot use the cache because its repesentation of the object is different.

I don’t know if there is a lesson here. One might be tempted to say “use one language and ORM for everything”, but then one is also tempted to say “use the best language/library/framework for the task at hand”.

Multi-language systems are a reality. The best I can think of is to keep the number of different languages and frameworks small. And use the same ORM for all components written in a particular language.

OLTP vs OLAP

The big problem with not planning for a separate reporting db from day 1 is that

  • you keep adding raw logs to the front-end db and they never get trimmed;
  • you are afraid of deleting historical information from the db; you don’t know what kind of report you might want in the future, so you just keep all the information there, indefinitetely.

Better to:

  • Create a reporting db and reporting system from day 1, with a very simple report that sumarizes some information;
  • Write tools to create summary data for the report and trim the front-end db
  • Have an “archive” db separate from the front end db; write tools to move historical stuff to the archive db and remove it from the main db

Make it clear to management that:

  • Any report they request that needs new summary information (that you are not already generating) will not be able to report stuff from the past;
  • Or, if they want arbitrary reports in the future, you will have to keep ALL data around, and they better be ready to issue checks to buy bigger and bigger servers for the DB.

In the Lucknow round of Miss India 2009:

Beauty queens are encouraged to think of themselves as role models so it was easy to ask Manisha what she would do when she was one. What would she advise people who were short or dark? Very seriously she replied, “Not everyone can be beautiful but they should try.” Manisha clearly equated short and dark with ugliness.

When will we look on *all* Indians as equal claimants to the right to be called beautiful? I don’t think anyone in Lucknow would every allow for a person with Mongoloid features (as some tribals from the north east of India) to be called beautiful without qualifying it in some sense.


I have a problem: I am trying to test a demo server designed to deal with LOTS of simultaneous connections. And my tests fail at a small number because of socket.error: (24, ‘Too many open files’). I need to make the system let my test program open lots of sockets.

What are the current limits?

  • To obtain the current system-wide maximum number of file-descriptors-per-process, use 'cat /proc/sys/fs/file-max' or 'sysctl -a | grep fs.file-max'.
  • There is also a limit on how many file-descriptors a user may create: 'ulimit -n'

First we shall increase the system-wide limit: edit /etc/sysctl.conf and set the value of fs.file-max to the desired maximum. Run sysctl -p to effect the changes.

Now, let’s increase per-user limits. Edit /etc/security/limits.conf and put something like this at the bottom:

parijat hard nofile 400000

where ‘parijat’ is my login on this system. Logout and login again. ulimit -n should print 400000 now.

I code in Emacs. Emacs has good support for version control tool interaction in the vc-mode package. I want to use git and have vc-mode work with git. I also want access to special git commands that the vc-mode does not export directly. The notes below are for Ubuntu Hardy, which is what I use all the time.

Install Git

First I made sure I had git installed:

$ sudo aptitude install git-core

(I use aptitude instead of apt when I want ‘recommended’ packages to be installed automatically. Basically, I am letting aptitude choose and install other packages which are not strictly needed to install git-core, but which Ubuntu/Debian packagers think might be useful for me ;-)).

Create a template python package

$ cd workspace
$ paster create hello

I am using Python paste, paste script and paste deploy tools. The documentation for these sucks. Basically, I just use paste script to create the template for a project that can be packaged with setuptools. I am also doing this in a virtualenv. More on them some other time.

I end up with this structure:

$ cd ~/workspace/hello
$ tree
.
|-- build
|   `-- lib
|       `-- hello
|           `-- __init__.py
|-- hello
|   `-- __init__.py
|-- hello.egg-info
|   |-- PKG-INFO
|   |-- SOURCES.txt
|   |-- dependency_links.txt
|   |-- entry_points.txt
|   |-- not-zip-safe
|   `-- top_level.txt
|-- setup.cfg
`-- setup.py

5 directories, 10 files

Now I can build a tarball or egg with:

$ python setup.py bdist_egg
$ python setup.py bdist

Users can take the output and install this with ‘python setup.py install’.

Initial git commit

$ git init

$ git add -a .

$ git commit # write a suitable commit message

$ git log

$ git status

Emacs+git

On ubuntu, the git-core package contains emacs support files. I just add this to my .emacs
;; -----------------------------------------------------------------------------
;; Git support
;; -----------------------------------------------------------------------------
(load "/usr/share/doc/git-core/contrib/emacs/git.el")
(load "/usr/share/doc/git-core/contrib/emacs/git-blame.el")
(load "/usr/share/doc/git-core/contrib/emacs/vc-git.el")
(add-to-list 'vc-handled-backends 'GIT)

Now, basic vc-mode works:

C-x C-f ~/workspace/hello/setup.py
C-x v l # vc-print-log
C-x v g # vc-annotate

Direct git interaction

M-x git-status

git-status-mode is the major mode for interacting with git directly. Typing h in this mode bring up a menu of commands and keybindings.

Adding a file

I add two files: hello/speaker.py and hello/main.py.  Then:

M-x git-status

Navigate to speaker.py and main.py and type ‘a’ (for git-add-file) on both.

Type ‘c’ (for git-commit-file). Type in a log message and C-c C-c to finish the log message and commit.

Adding files to gitignore

I create a .gitignore file in the top level directory of my project and put this in it:

*.egg
*.pyc
*~

Now I go back to my git-status window and press g (git-status-refresh), and et voila, the clutter disappears.

Committing changes

However, I notice that the hello/main.py is marked as ‘Added’ but not Uptodate. What gives? It seems git-commit-file commits only one file. Oops.  What I should have done is what I do now:

  • ‘M’ to mark all files (.gitignore and hello/main.py)
  • c to commit all files. (Type in log message)
  • g to refresh. (Shows ‘No changes’ and no files are listed)
  • t u (git-toggle-show-uptodate; now lists all files, even those which are uptodate)
  • q (git-status-quit)

Diffing changes

I edit main.py and speaker.py. Going to git-status, and pressing ‘=’ over a file gives me the diff against the committed version.

I can also mark multiple files with ‘m’, or all files with ‘M’. Now pressing ‘=’ diffs all those files.

Committing selected changes

Let’s say I’ve added a bunch of code to some files that I want to commit. I’ve also toggled a gloabl DEBUG variable to aid in my own testing, and I don’t want to commit that particular change. In subversion or cvs I’d have to ensure that the flag was set to its proper state and then commit the whole thing together. However in git, I can commit some changes (patches) and ignore others. This cannot be done from within emacs though. I have to switch to the command line and do a ‘git add –interactive’, which gives me a menu system to add selective ‘hunks’.

Back to the command line, I do a git diff –cached to see what I am going to commit. I don’t like it. I want to start choosing the changes I want to commit again:

  • git reset –mixed (reset the index)
  • git add -i (choose what to add, again)
  • git diff –cached (see what I am gonna be committing0
  • git diff (see what else is there in my WC that I have not scheduled to be committed)
  • git commit (type log and commit)

Note:

Working of git-reset and git-diff

Working of git-reset and git-diff

Homeopathy

August 2, 2008

Via this blog belonging to the magician James Randi (a debunker of extraordinary claims, most famously those of Uri Geller), I learnt that on 25th July the BBC ran a story on the 20th anniversary of a scientific paper published in the journal <em>Nature</em>.

According to a charismatic French scientist named Jacques Benveniste, pure water could somehow remember what it had previously containe. …  John Maddox, editor of Nature, realised that Benveniste’s research would be controversial, so it was accompanied by a disclaimer similar to one that had been run when he published research about Uri Geller’s supposed supernatural powers…. Unfortunately for Benveniste, the investigators soon discovered that the results in his laboratory were unreliable.

How were they unreliable? From someone who was there:

The test procedure involved counting – via a reticule in a microscope – the number of exploded basophils. This would seem to be a straightforward procedure: select an area, count the burst cells, and record that number, but that wasn’t quite sufficient. Ideally – and such an experiment can’t properly be done with less than optimum care being applied – each microscope slide should have been randomly coded, so that double-blind conditions were in place. That was not done; on one occasion, we saw a lab worker perform the count, record it, and then erase the number when it was realized that the slide that had been scanned was a “control,” not a randomly-selected sample. The lab worker replaced the slide in the holder, moved it about, and settled on an area that gave a count more in line with what had been expected. That is not the way science is done.

Ok, so the original experiment was flawed. So, file and and forget. But no. This experiment is probably the cornerstone of explanations of how homeopathy could work, assuming it does. So people persist in claiming that the phenomenon exists and that proper experiments would prove it. Here is what the BBC says in the same article:

For example, the BBC science series Horizon attempted to test Benveniste’s claims in 2002, and the conclusion was announced by Professor Martin Bland, of St George’s Hospital Medical School.

He said: “There’s absolutely no evidence at all to say that there is any difference between the solution that started off as pure water and the solution that started off with the histamine [an allergen].”

Looks like some people are giving up on homeopathy.

McCain vs Obama

June 29, 2008

From CNN:

Muslim and Jewish groups on Monday sharply criticized Sen. John McCain’s comments that he would prefer a Christian president to lead the United States.

The Arizona Republican’s remarks came in an interview with Beliefnet, a Web site that covers religious issues and affairs.

“I just have to say in all candor that since this nation was founded primarily on Christian principles, personally, I prefer someone who has a grounding in my faith,” the GOP presidential hopeful told the Web site in an interview published Saturday.

Its hilarious that other religious groups are attacking him for emphasizing his religion and excluding theirs, and not because this is blatantly false.

On the other hand, Barack Obama seems to take a more reasonable stand:

Whatever we once were, we’re no longer just a Christian nation; we are also a Jewish nation, a Muslim nation, a Buddhist nation, a Hindu nation, and a nation of non-believers. (http://www.cbn.com/CBNnews/204017.aspx)

Here is another quote:

Democracy demands that the religiously motivated translate their concerns into universal, rather than religion-specific, values. It requires that their proposals be subject to argument, and amenable to reason. (from http://obama.senate.gov/speech/060628-call_to_renewal/)

http://kerneltrap.org/mailarchive/git/2007/8/21/255431: this should be kept in mind/verified when using git-svn.

For some time now, have wanted to use git and git-svn to track the subversion repo at work. That repo has lots of branches, and individual changesets are merged very often between branches, making it a nightmare to find out what has already been merged, and what not.
Used to keep track of all this in a text file. Its becoming tedious. Have been trying to use git-svn to ease the pain.
On the first try, git-svn gave up halfway during the clone operation with an error message about protocol/network error. This git-svn was the one in ubuntu repositories.
Since then, installed git from source.
First I tried this way. It works, but then gitk will only show me one branch: the one I have currently checked out.
Now, I am trying this other way. Hopefully this will work better.

Code merging

April 6, 2008

Wanting to merge two branches of code, both of which have had concurrent commits going on them for some time. Code is kept in svn, so the branches actually live in separate directories. Most changesets have been merged back and forth, yet residual changes remain.

I get diff to tell the files which are different between the branches.

diff -q -r -x '\.svn' -x '*.pyx' -x '*~' branch1 branch2

This produces lines like:

Files zeroc-ice-python_3.2.1/config/Make.rules and zeroc-ice-python-3.2.1-new/config/Make.rules differOnly in zeroc-ice-python-3.2.1-new/config: Make.rules.GNUOnly in zeroc-ice-python-3.2.1-new/config: Make.rules.GNU_kFreeBSD

I need to munge the output of diff. I want to have 4 columns: directory (sans the branchname), filename, link to the file in branch1 if it exists there, link to file in branch2 if it exists there. I want this ordered/sorted by directory, and I want the output in emacs org mode format. I filter the output of the above diff command through this perl script:

while (<>) {    if (/^Only in/) {        m/Only in (.*): (.*)$/; $dirname=$1; $fname=$2; # sep dirpath and filename        ($branch = $dirname) =~ s/([^\/]+).*/$1/; # the branch name is the first part of the dirpath (e.g. branch-xy/dir/path/file)        $dirname =~ s/$branch\/?(.*)/$1/; # remove the 'branch-xy' from dirpath to get the remaining dirpath        $_ = "| $dirname/$fname | [[file:~/workspace/couffable/branches/$branch/$dirname/$fname][Only in: $branch]] |";    }

    if (/^Files /) {        m/Files (.*) and (.*) differ/; $left=$1;        $left =~ m/release-3-11\/(.*\/)?([^\/]+)/;        if (defined($1)) { $dirname=$1; } else { $dirname = ""; }        $left=$2; # remove branch name, sep dir and file        $_ = "| $dirname$left | [[file:~/workspace/couffable/branches/release-3-11/$dirname$left][r311]] | [[file:~/workspace/couffable/branches/release-3-11-airtel/$dirname$left][r311-airtel]] |";    }    print $_ . "\n";}