Friday, April 10, 2009

FragmentCache part deux: Cleaning

I realized that I forgot to address an additional, critical issue with fragment caching: cleaning out expired fragments.  A fragment cache implemented as I described in my last post will update expired fragments when they are requested, but what about expired fragments that aren't requested?  

If you externalize the cache itself, it will likely support LRU or LFU eviction policies.  But if you're using an in-memory cache in the Java heap, that's a memory leak.  If you cache a lot of fragments, failing to clean the cache is a recipe for maxing out the old generation space, causing Full GCs to run back-to-back, each collecting just enough space to limp along until the application eventually becomes non-responsive.  So, an additional requirement for an in-memory fragment cache is that you monitor usage and have a cleaner task which runs, say, once an hour, to remove fragments that have expired and are not currently locked for update.  Doing this efficiently with a large cache requires a data structure that supports thread-safe concurrent access, so you can avoid locking the whole cache while cleaning and maintain high throughput.  
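A cleaner along those lines might look like the following sketch.  The class and field names here are hypothetical, not the actual J2Free source; the point is that ConcurrentHashMap allows the cleaner to iterate and remove entries without a global lock, and a ScheduledExecutorService handles the hourly schedule.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class FragmentCache {

    // Hypothetical cache entry; the lock is held by whichever thread is refreshing.
    static class Fragment {
        final ReentrantLock lock = new ReentrantLock();
        volatile String html;
        volatile long expiresAt; // epoch millis

        boolean isExpired() {
            return System.currentTimeMillis() >= expiresAt;
        }
    }

    // ConcurrentHashMap supports concurrent reads and removals without
    // locking the whole map, so the cleaner never blocks request threads.
    private final ConcurrentHashMap<String, Fragment> cache =
            new ConcurrentHashMap<String, Fragment>();

    private final ScheduledExecutorService cleaner =
            Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "fragment-cache-cleaner");
                    t.setDaemon(true); // don't keep the JVM alive just for cleaning
                    return t;
                }
            });

    public FragmentCache() {
        // Once an hour, remove fragments that have expired and are not
        // currently locked for update.
        cleaner.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                for (Map.Entry<String, Fragment> e : cache.entrySet()) {
                    Fragment f = e.getValue();
                    if (f.isExpired() && !f.lock.isLocked()) {
                        cache.remove(e.getKey(), f); // remove only if still mapped to f
                    }
                }
            }
        }, 1, 1, TimeUnit.HOURS);
    }

    public void put(String key, String html, long timeoutMillis) {
        Fragment f = new Fragment();
        f.html = html;
        f.expiresAt = System.currentTimeMillis() + timeoutMillis;
        cache.put(key, f);
    }

    public String get(String key) {
        Fragment f = cache.get(key);
        return (f == null) ? null : f.html;
    }
}
```

Note the two-argument `cache.remove(key, value)`: it only removes the entry if it is still mapped to the same Fragment, so a concurrent `put` of a freshly refreshed fragment is never clobbered by the cleaner.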

Sunday, April 5, 2009

The JamLegend Fragment Cache

Abstract
The benefits of fragment caching are pretty well-known.  If you haven't come across it before, the concept is simple:  cache portions of the page to lighten database load by reducing or eliminating redundant queries.  Often, this can be the "most bang for your buck" type of cache. How long a fragment should be cached is a function of the expense of generation, frequency of changes, and the frequency of access.  Ideal fragments for caching are the most accessed and most expensive to generate.

The fragment cache I'm going to discuss is the implementation of the FragmentCacheTag JSP custom tag, in J2Free, the one-day-to-be-open-source framework on which JamLegend is built (you can find more info on J2Free here, and in a few of my previous blogs).  

JamLegend Examples
The "Popular Today" songs lists on JamLegend are a little expensive to generate, shown on all the highest trafficked pages, and don't change too quickly.  As such, we cache that list for 5 minutes, saving us a lot of queries but keeping the list rather dynamic.  On the other extreme, comments change constantly and need to be much more dynamic.  We cache comment threads for 30 seconds.  When you have a high number of concurrent users, even a 30 second cache can save you a lot of queries without sacrificing liveliness.

The Code Samples

<cache:fragment timeout="${30 * 1000}">
Simple cache based on a timeout of 30 seconds.
</cache:fragment>

<cache:fragment timeout="-1" condition="${size(items)}">
Simple cache with no timeout, using instead an additional attribute to refresh the cache when the size of the collection "items" has changed since the cache was last updated. This works well in situations where data changes rarely, and you want it to be available immediately upon change.
</cache:fragment>

<cache:fragment timeout="${30*1000}" disable="${devMode}">
Simple cache based on a timeout of 30 seconds, but with an attribute to disable the cache if devMode is flagged. Here, devMode is an application-scope attribute set when the server starts up. It's nice to be able to selectively disable certain caches for testing, and have them enabled live, without having to change any code.
</cache:fragment>

The JamLegend FragmentCacheTag also supports an additional application scope attribute to globally disable all caching.  One of these days, I'm going to get around to changing the timeout to be specified in seconds, instead of milliseconds; just hasn't happened yet... 
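Putting those attributes together, the tag's refresh decision might reduce to logic along these lines.  This is a sketch based only on the examples above; the method and field names are my assumptions, not the actual FragmentCacheTag source.

```java
// Hypothetical decision logic for a fragment cache tag. The attribute
// semantics (timeout in millis, condition, disable/globalDisable) follow
// the tag examples above; everything else is assumed for illustration.
public class CacheDecision {

    static class Entry {
        String html;
        long cachedAt;    // epoch millis when the fragment was last generated
        String condition; // value of the condition attribute at last refresh
    }

    /**
     * Returns true if the fragment must be (re)generated rather than
     * served from cache. A timeout < 0 means "no timeout": refresh only
     * when the condition attribute's value changes.
     */
    static boolean needsRefresh(Entry entry, long timeoutMillis, String condition,
                                boolean disable, boolean globalDisable, long now) {
        if (disable || globalDisable) return true;   // caching is off, always regenerate
        if (entry == null) return true;              // nothing cached yet
        if (timeoutMillis >= 0 && now - entry.cachedAt >= timeoutMillis)
            return true;                             // timeout expired
        if (condition != null && !condition.equals(entry.condition))
            return true;                             // condition changed since last refresh
        return false;                                // serve the cached fragment
    }
}
```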

Fragment Caching "Gotchas"
Two of the first problems you'll run into with fragment caching are:
  1. logged in vs. logged out users
  2. concurrent cache refreshing
The first is a problem you're going to encounter simply by using a fragment cache.  The second, on the other hand, you'll only run into if you're building your own fragment cache; if you use the J2Free FragmentCacheTag described here, it will be taken care of for you.

Adding Dynamic Capabilities to Cached HTML 
Sometimes ideal candidates for fragment caching have display differences when users are logged in.  For example, caching comment threads:  logged in users see a "report" link, whereas logged out users do not.  Since a fragment cache holds the generated HTML, as opposed to an object cache or query cache, it isn't easy or efficient to modify the fragment based on logged-in status (though, it's not impossible and I'd appreciate any elegant approaches).  Fortunately, JavaScript comes to the rescue.  

On JamLegend, a "report" link is included in the generated HTML for comment threads whether the user is logged in or not, with style="display:none;" to hide it initially.  Then, if the user is logged in, we use a little JavaScript on page load to show all the report links, otherwise they stay hidden.  This is the same technique used by 37Signals (see JavaScript Makes Relative Times Compatible with Caching).  Since we also check server-side that a user is logged in before allowing them to mark a comment as spam, there isn't a significant downside to having the hidden spam links for users who are not logged in.

Preventing Concurrent Refreshes
If you're caching a fragment because it's expensive to generate, the last thing you want is a few hundred users concurrently refreshing the fragment triggering a few hundred concurrent expensive queries.  So, how do you prevent this?  The first question you have to answer is this:
When user A has triggered a cache refresh, what should happen to other users who hit the same fragment?
Should they:
(a) Head over to user A's thread, wait for user A's thread to complete the query, and then ask for the result?  or,
(b) Should they just see that user A is refreshing the cache and so grab the last cached fragment and return immediately?
Both are acceptable answers, but which you choose depends on how stale you can allow your fragment to be.  In case (a), you can serve the most recent fragment to an unlimited number of callers while still only generating the fragment once.  Since user A triggered the refresh first, that thread is necessarily ahead of the other threads, so the others will still return in less time than if each were the sole thread refreshing the cache.  In case (b), the subsequent threads can return immediately, serving the last version of the fragment, while assuming that user A's thread will refresh the cache.

For J2Free's FragmentCacheTag, I chose scenario B on the reasoning that the fragment is cached to begin with because it is not crucial that it be 100% up-to-date.  Ergo, what's one more minute with the old fragment?  At some point, I want to add this as an optional attribute, something to the effect of:

<cache:fragment timeout="600000" waitOnRefresh="true">
... where waitOnRefresh enabled would indicate that the cache should not serve the last fragment while it is being refreshed, but rather attach itself to the thread conducting the refresh and get the result when complete.
</cache:fragment>
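In Java, the difference between the two policies can be sketched with a ReentrantLock per fragment: a non-blocking tryLock() gives scenario B (serve stale, let the lock holder refresh), while a blocking lock() gives the proposed waitOnRefresh behavior.  The names below are hypothetical, not the FragmentCacheTag internals.

```java
import java.util.concurrent.locks.ReentrantLock;

public class RefreshPolicy {

    static class Fragment {
        final ReentrantLock refreshLock = new ReentrantLock();
        volatile String html;
        volatile long expiresAt; // epoch millis
    }

    // Stands in for evaluating the tag body (the expensive generation).
    interface Generator {
        String generate();
    }

    static String getFragment(Fragment f, Generator gen,
                              boolean waitOnRefresh, long timeoutMillis) {
        if (System.currentTimeMillis() < f.expiresAt) {
            return f.html;                       // still fresh: serve from cache
        }
        if (waitOnRefresh) {
            f.refreshLock.lock();                // scenario A: queue behind the refresher
        } else if (!f.refreshLock.tryLock()) {
            return f.html;                       // scenario B: someone else is refreshing,
        }                                        // serve the stale copy immediately
        try {
            // Re-check after acquiring the lock: the thread ahead of us
            // may have refreshed the fragment while we waited.
            if (System.currentTimeMillis() >= f.expiresAt) {
                f.html = gen.generate();         // we hold the lock: regenerate exactly once
                f.expiresAt = System.currentTimeMillis() + timeoutMillis;
            }
            return f.html;
        } finally {
            f.refreshLock.unlock();
        }
    }
}
```

The re-check inside the lock is what keeps a few hundred waiting threads from each running the expensive query in turn: only the first thread through finds the fragment still expired.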

Obviously, the above only applies to multi-threaded servers.  There is an additional case, though, that applies to single and multi-threaded servers:  what happens in a cluster?  How do you prevent concurrent refreshes when you're running multiple servers?  A few options are: 
  1. Store fragments in a distributed cache, like Ehcache, JBoss Cache, or memcached, using replication instead of invalidation.  However, there is still the possibility of instance A triggering a refresh and then instance B triggering a refresh before it has received the updated fragment from instance A.  This window could be narrowed by replicating a lock-for-update on a fragment, rather than waiting for the complete fragment, but it cannot be closed entirely.
  2. Use a central, synchronized cache.  This removes the race condition from option #1, but at the cost of decreased throughput.  Depending on the location of, and network access to, such a cache, it may or may not be worth it.  Ping times between instances within an EC2 zone are only 1-2 ms, so it might be worth adding a few milliseconds to each request to avoid concurrently executing an expensive query.
  3. Use a distributed cache, as in option #1, but only allow one instance to refresh the cache.  This may or may not work depending on how well you split requests between instances.  You certainly wouldn't want users on instance B to continue to receive stale data because no user on instance A has triggered a cache refresh.  This strategy may work well for pages that are guaranteed to be accessed on each instance.
I welcome comments on any additional solutions to preventing concurrent cache refreshing in a cluster.