Caching Heavy Queries - Good Or Bad?

I am going to be talking about Drupal here, but the same logic applies to development on any platform.

So there's a certain block or piece of data you want on a page. You've developed a query that needs a lot of joins and resorts to all kinds of nastiness in MySQL (temporary tables, filesorts, large numbers of rows returned, etc.). One of the common cures is to cache the results of that query. This can be a life saver, but care should be taken to do it right.

Let's take a look at a common way of handling this:

function mymodule_nastyquery() {
  $cache = cache_get('mydata');
  // cache_get() returns FALSE on a miss, so check before touching ->data.
  if ($cache && !empty($cache->data)) {
    return $cache->data;
  }
  // Perform the nasty query here and store the result in $object.
  cache_set('mydata', $object);
  return $object;
}

This is rather common code. It's looking for the cache and if it exists then it returns it. If not, then it regenerates the data and stores it again.

Now if that query takes 2-3 seconds, you can end up with a couple of problems. First off is the unlucky user who has to wait for the page to load while the query executes. They have no idea that the query is running, so they start hitting the refresh button, shooting off new requests to your server, and since the cache is still empty, each request runs the query again. This can quickly result in requests backing up on your server and the inevitable "site off-line" message.

Another situation involves a race condition. If that query takes 2-3 seconds, what happens if user 1 hits your site at 09:00:00 when the cache needs to be refreshed, then another user hits at 09:00:01? The cache still isn't there, so user 2 performs that same query too. On really busy sites, this can be a problem.

So what's the solution? My good old friend cron and a little custom caching.

First up - the custom caching.

You might be asking yourself "why would I use a custom cache?". Well, Drupal clears its caches from time to time, and if you are running something like Memcache or Cache Router, more of the cache gets cleared than you would like.

You have a couple of options here. If your data is going to be shown on a majority of your pages (ie: in a block) and is rather small in size, you could always just store it in a variable. If, however, you are dealing with larger data and/or data that is only going to be used on something like the front page, then you are better off creating your own custom cache table.

A very simple table schema consisting of cid, created and data is all you need (avoid naming the first column key, since that's a reserved word in MySQL). Make cid your primary key and you are good to go. Of course you have to write your own custom setters and getters for this new cache, but the fact is that Drupal won't ever flush it. This is something I commonly do for clients with larger sites and throw into a common module. I actually have a snippet saved, so all I have to do is a quick copy and paste.
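As a rough sketch, that table could be declared with Drupal 6's Schema API in your module's .install file. The table and column names here are just illustrative:

```php
/**
 * Implementation of hook_schema().
 * Hypothetical schema for the custom cache table described above.
 */
function mymodule_schema() {
  $schema['mycustom_cache'] = array(
    'fields' => array(
      // The cache key. 'cid' avoids MySQL's reserved word 'key'.
      'cid' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE, 'default' => ''),
      // Unix timestamp of when the entry was written.
      'created' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
      // The serialized payload.
      'data' => array('type' => 'text', 'size' => 'big'),
    ),
    'primary key' => array('cid'),
  );
  return $schema;
}
```

Since this table is yours and not registered as a cache table, nothing in core will ever truncate it.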

Now you need to know when to invalidate the cache. Say your nasty query has a condition set for a certain node type. In this example I want nodes that are a type of custom_type. Using nodeapi you can quickly invalidate the cache:

function mymodule_nodeapi(&$node, $op, $teaser = NULL, $page = NULL) {
  switch ($op) {
    case 'insert':
    case 'update':
      if ($node->type == 'custom_type' && $node->status == 1) {
        // Flag the cache as stale; cron will rebuild it.
        variable_set('mymodule_refresh_nastyquery_data', 1);
      }
      break;
  }
}

You are simply setting a boolean variable to indicate that the cache is invalid and needs to be refreshed on the next cron run. So what you are looking at is more or less complete code:

function mymodule_generate_data() {
  // Perform the nasty query here and store the result in $object.
  mymodule_cache_set('nastyquery_data', $object);
  variable_set('mymodule_refresh_nastyquery_data', 0);
  return $object;
}

function mymodule_cron() {
  if (variable_get('mymodule_refresh_nastyquery_data', 0)) {
    mymodule_generate_data();
  }
}

function mymodule_load_data() {
  $data = mymodule_cache_get('nastyquery_data');
  if (!$data) {
    // Cache miss: fall back to generating the data on the spot.
    $data = mymodule_generate_data();
  }
  return $data;
}

function mymodule_cache_set($key, $value) {
  $value = serialize($value);
  if (db_result(db_query("SELECT 1 FROM {mycustom_cache} WHERE cid = '%s'", $key))) {
    db_query("UPDATE {mycustom_cache} SET data = '%s', created = %d WHERE cid = '%s'", $value, time(), $key);
  }
  else {
    db_query("INSERT INTO {mycustom_cache} (cid, created, data) VALUES ('%s', %d, '%s')", $key, time(), $value);
  }
}

function mymodule_cache_get($key) {
  // A static cache prevents hitting the database more than once per request.
  static $cache = array();
  if (!isset($cache[$key])) {
    $result = db_result(db_query("SELECT data FROM {mycustom_cache} WHERE cid = '%s'", $key));
    $cache[$key] = $result ? unserialize($result) : FALSE;
  }
  return $cache[$key];
}
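To tie it all together, anything that displays the data just calls the loader and never knows about cron or the cache underneath. A minimal Drupal 6 hook_block() sketch, where the delta and theming are purely hypothetical:

```php
/**
 * Implementation of hook_block().
 * Hypothetical block that displays the cached query results.
 */
function mymodule_block($op = 'list', $delta = 0, $edit = array()) {
  if ($op == 'list') {
    return array(0 => array('info' => t('Nasty query results')));
  }
  if ($op == 'view' && $delta == 0) {
    // Returns the cached data, or rebuilds it on a cache miss.
    $data = mymodule_load_data();
    return array(
      'subject' => t('Results'),
      'content' => theme('item_list', (array) $data),
    );
  }
}
```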

So by setting a simple flag and adding a few more lines of code, you have successfully avoided those nasty race conditions that harsh queries can present. Of course your cron run will take a little longer when it has to regenerate the cache, but that is still a much better alternative than making your visitors wait an extra 2-3 seconds for that query to execute.

Of course this is still a stopgap, and the query times will only get longer as your dataset grows. So treat it as a temporary fix and start working on a better solution that improves the performance of the query itself.