Things to get right about MySQL from day #1

There is an excellent article in Linux Magazine titled “MySQL Performance from the Start”. Its by Jeremy Jawodny, so it comes with some serious authority backing it. He talks about some mistakes people make when developing the typical database backed web app and how keeping some things correct from the start will save oodles of pain later. I confess, this touched a nerve. I have experienced all of these mistakes and the pain that comes along with them.

Here is what Jeremy says, in a nutshell:

  1. What’s MySQL There For? MySQL is there for stuff that you do relational queries on. Don’t store BLOBs (images, videos, excel sheets, word docs, etc.) in the db; store metadata about them, for sure, but store the binary data that you can’t really query in a distributed file system. And don’t log to the database, either. Logs aren’t usually queried often by the front-end and tend to grow unbounded, slowing down the database and adding to admin’s headaches when they become too big.
  2. Plan for a caching layer between the app and the DB; memcached is a good solution, but there are more. You can do this, for e.g., by putting all DB access code in a library initially when you don’t have caching; later, when it is time to add memcached, you only have one place to modify.
  3. OLTP vs OLAP. Front end database is transactional in nature: a user session reads/updates many different tables but using some index (the user ID, for e.g.). Then there is reporting. It typically wants to look at all data (“Count users who did XXX in last YYY days”). From day 1, have separate front-end and reporting databases with different schemas, the formed optimized for indexed queries and perhaps denormalized a little to optimize SELECTs, and the latter tuned for aggregate queries with complex joins. If you don’t have a reporting db yet, at least don’t convolute the front-end database with reporting-like schema: keep it lean. Also see the next poi

Here I want to note down what I was thinking when I made these mistakes, so I can present them as arguments (for myself and others) when someone is about to make the same mistake again.

Storing BLOBs in the DB

When I started out, I did a little research on the performance of retrieving images from the db vs the file-system. I ended up choosing the DB because:

  • Storing images in the DB automatically gave me a distributed repository; at that time I was not aware of any good distributed file-systems, and not having another component in the stack meant there was one less thing to manage
  • I thought that there was no significant performance impact of storing BLOBs in the db

Boy, was I wrong.

First, storing BLOBs in db meant that some tables have become VERY large in physical size. It is a pain to query those tables, and back up and trim the db, now. If I have to remove old, unused BLOBs now, the DELETE query takes very long and is threatening to bring the DB to its knees. If I had only kept the metadata in the DB and stored the files in the file-system, I could have very quickly removed just the metadata from the db, freeing it up for its normal duties, and removed the files themselves at leisure. So, all in all, its worth spending the time to find, implement and maintain the distribtued file-system solution in the long run.

Secondly, I can’t take advantage of the highly optimized file-serving capabilities of web servers. When I started out, I was serving out everything from scripts: dynamic data as well as static images. When the load was low, things were fine. When the load went up, I painfully realized that more than 50% of the hits to the app were for static images that could have been served off the file-system by the web-server itself.

Thirdly, the table storing BLOBS should have stored just the BLOB and an identifier, nothing else. You don’t want to make InnoDB crawl through kilobytes of binary data looking for that 4-byte integer you stashed away there. Thankfully, this is a mistake we did not make. Or, rather, cannot make. Because by now we know that keeping anything in that table that a SELECT mught use in a JOIN or WHERE clause will make for a query that will never finish and bring the db down to boot.

Storing Logs in the DB

We needed reports on how often certain pages were being hit and certain actions triggered. We could have logged it to the filesystem and written a separate script to grok the logs, create summaries, and bundle it off to the reporting db. But that was too much work. It seemed just so easy to simply stuff raw log data into the front-end db. We would, we reasoned, one day write a script that would grok the table, create summaries, bundle it off to the reporting db, and trim the log table. (Notice that this is slightly more work.)

Jeremy is right: you never quite get around to write that “trim table” cron job and the log table just keeps growing. The management report is generated off that table directly. One day, it will finally become too slow.

Caching Layer

We do have a caching layer. The front-end code makes heavy use of memcached, and all db access is through functions in one or two files. Guess how this happened? Yeah, because of a major code-refactoring and speed-up sprint when we realized the servers were about to go kaput.

Multiple Languages, ORMs and Caching

There is another problem: memcached is good for storing objects, not tuples or records. But if you have programs in multiple languages, and they all use different ORMs, or worse, no ORM at all, then the utility of memcached becomes limited to one part of the system. Component A uses and cached some data, but component B, which needs the same data cannot use the cache because its repesentation of the object is different.

I don’t know if there is a lesson here. One might be tempted to say “use one language and ORM for everything”, but then one is also tempted to say “use the best language/library/framework for the task at hand”.

Multi-language systems are a reality. The best I can think of is to keep the number of different languages and frameworks small. And use the same ORM for all components written in a particular language.

OLTP vs OLAP

The big problem with not planning for a separate reporting db from day 1 is that

  • you keep adding raw logs to the front-end db and they never get trimmed;
  • you are afraid of deleting historical information from the db; you don’t know what kind of report you might want in the future, so you just keep all the information there, indefinitetely.

Better to:

  • Create a reporting db and reporting system from day 1, with a very simple report that sumarizes some information;
  • Write tools to create summary data for the report and trim the front-end db
  • Have an “archive” db separate from the front end db; write tools to move historical stuff to the archive db and remove it from the main db

Make it clear to management that:

  • Any report they request that needs new summary information (that you are not already generating) will not be able to report stuff from the past;
  • Or, if they want arbitrary reports in the future, you will have to keep ALL data around, and they better be ready to issue checks to buy bigger and bigger servers for the DB.
Advertisements
Post a comment or leave a trackback: Trackback URL.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: