Caching strategies for Magnolia 3.6
The stable release of Magnolia 3.6 will soon be out, and it will ship with a brand new cache system implemented from scratch. The intention was not to reinvent the wheel; instead, the new cache system decouples Magnolia's presentation layer from the underlying cache mechanisms and allows any cache engine of choice to be plugged in. Magnolia 3.6 will be distributed with EHCache as the default cache engine, but anyone willing to implement a custom cache wrapper can run Magnolia with any other cache engine from the upcoming 3.6 release onwards.
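To make the idea of a pluggable cache engine concrete, here is a minimal sketch of what such an abstraction could look like. The interface and class names below are illustrative assumptions for this post, not Magnolia's actual API:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a pluggable cache abstraction; these names are
// illustrative assumptions, not Magnolia's actual API.
interface Cache {
    Object get(Object key);
    void put(Object key, Object value);
    void remove(Object key);
}

// A trivial in-memory engine behind the same interface; an EHCache-backed
// wrapper would implement Cache the same way, so the presentation layer
// never needs to know which engine is in use.
class MapCache implements Cache {
    private final ConcurrentHashMap<Object, Object> store = new ConcurrentHashMap<>();

    public Object get(Object key) { return store.get(key); }
    public void put(Object key, Object value) { store.put(key, value); }
    public void remove(Object key) { store.remove(key); }
}
```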
With the new cache system in place, we, the Magnolia developers, started to look into which advanced cache strategies should be included: strategies that allow server administrators to quickly achieve the best cache configuration for their particular use case with minimal effort. The first of these strategies will be available in Magnolia 3.6 Enterprise Edition, with more added in subsequent versions of the Enterprise Edition (EE). We have a number of ideas for strategies based on custom solutions implemented for clients in the past. Some of them are more difficult to implement than others; which ones we deliver with Magnolia EE 3.6 depends on how many obstacles we hit while implementing them and how well we can test them before the release.
The picture below shows what ended up on the whiteboard while we were brainstorming possible advanced strategies to be delivered with Magnolia 3.6.
One option is a more granular cache that minimizes server load by re-using the previously generated output of single page elements. Instead of re-creating a page's complete HTML by retrieving all content elements every time the page is requested, the granular cache would fetch each content piece only once and keep it as a gzipped cache entry, assembling a complete page by re-using unchanged cached content and re-building only the modified pieces.
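As a rough illustration of the granular idea, the hypothetical fragment cache below stores each rendered page element as a gzipped entry, so that a page could later be assembled from the cached pieces. All names are made up for this sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative fragment cache: rendered page elements are kept as gzipped
// byte arrays, keyed by fragment name, so a full page can be assembled
// from cached pieces. Not Magnolia's actual implementation.
class FragmentCache {
    private final Map<String, byte[]> fragments = new ConcurrentHashMap<>();

    void put(String key, String html) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(html.getBytes(StandardCharsets.UTF_8));
        }
        fragments.put(key, buf.toByteArray());
    }

    String get(String key) throws IOException {
        byte[] compressed = fragments.get(key);
        if (compressed == null) return null;
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

On a request, only fragments missing from such a cache would need to be re-rendered; everything else would be decompressed and stitched into the page.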
Implementing the granular cache mechanism is actually quite tricky, because Magnolia allows content to appear as a page element in various other places of a website for dynamic content reuse. For example, the title of a page in one part of the site can be used in the navigation menu of an only remotely related part of the same site, simply to provide a link to it. Tracking all the places where content elements appear is nearly impossible without constructing a complete content usage graph within Magnolia. Hence, a basic cache flushing strategy would be to simply flush all cached pages of a website whenever content has been updated.
The main problem with such an approach is that the user who visits the website immediately after the cache has been flushed is the one who suffers the performance penalty, since the server now needs time to generate all the new cache entries. While this approach to caching is quite safe, it is also quite heavyweight. Rather than relying on this brute-force solution, Magnolia EE 3.6 will be delivered with different re-caching strategies that are more appropriate for different scenarios.
Another typical issue is that many concurrent requests hit frequently accessed entries. (Isn't that why they are called frequently accessed, after all? ;) ) It is absolutely correct to trigger the generation of a cache entry upon the first request, since it does not exist yet. But more requests might hit the server while the requested cache entry is still being generated. In this case it would be a waste of computing power to trigger the creation of the same entry again for those other requests. In fact, requests issued after the first one won't get their results any faster than the first request can be served. Additionally, if they each triggered the retrieval of a complete representation of a page composed from all the pieces in the repository, they would increase the load on the server and slow it down. Mind you, we are talking about millisecond differences here, but it still matters. For this reason, the cache system included in Magnolia 3.6 blocks all subsequent requests to the cache until a cache entry is ready, and then Magnolia serves it to the visitors.
Then again, you don't want your server to hang forever in case such an entry can't be generated, whether due to a page that doesn't exist, content that is corrupted, or even a broken template. Since we implement scenarios for Magnolia EE where access to the cache is distributed over multiple modules and comes from multiple threads, we made an extra effort to ensure the new cache system never ends up in a deadlock.
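The blocking behaviour described above can be sketched as follows: the first request for a key computes the entry, concurrent requests for the same key wait on the same in-flight computation instead of recomputing it, and a wait timeout guarantees no request hangs forever if generation fails or stalls. This is a hypothetical sketch, not Magnolia's actual implementation:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch of a blocking cache: only the first request generates an entry;
// concurrent requests for the same key block on the same FutureTask, and
// a timeout prevents waiting forever on a failed or hanging generation.
class BlockingCache<K, V> {
    private final ConcurrentHashMap<K, FutureTask<V>> entries = new ConcurrentHashMap<>();

    V get(K key, Supplier<V> generator, long timeoutMillis) throws Exception {
        FutureTask<V> task = new FutureTask<>(generator::get);
        FutureTask<V> existing = entries.putIfAbsent(key, task);
        if (existing == null) {
            task.run();      // first request generates the entry
        } else {
            task = existing; // later requests wait for the in-flight result
        }
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException | TimeoutException e) {
            entries.remove(key, task); // don't cache failures; allow a retry
            throw e;
        }
    }
}
```

Removing the failed task on error is what keeps such a scheme deadlock-free: a broken template or corrupted content fails the waiters quickly instead of blocking them indefinitely.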
As for the new cache strategies considered for inclusion, let's look at a few examples of what we consider implementing.
Serve Old Content While Re-caching Strategy: Instead of flushing the cache on update, old content is kept and served until the updated cache entry is created. This way, the first request for content after an update triggers the generation of the new entry, while all subsequent requests for the same entry are served the old content until it has been re-cached.
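A minimal sketch of this strategy, assuming a simple in-memory store and hypothetical names: an invalidated entry is marked stale rather than removed, and the next request serves the stale copy while a single background thread regenerates it:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative "serve old content while re-caching": invalidation marks an
// entry stale instead of removing it; stale content keeps being served
// while one background thread rebuilds the entry.
class StaleWhileRecache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Set<K> stale = ConcurrentHashMap.newKeySet();
    private final Set<K> refreshing = ConcurrentHashMap.newKeySet();

    void invalidate(K key) { stale.add(key); }

    V get(K key, Supplier<V> generator) {
        V current = entries.get(key);
        if (current == null) {
            // no old copy at all: generate synchronously on the first request
            V fresh = generator.get();
            entries.put(key, fresh);
            return fresh;
        }
        if (stale.contains(key) && refreshing.add(key)) {
            // regenerate in the background, at most once per stale entry
            Thread refresher = new Thread(() -> {
                entries.put(key, generator.get());
                stale.remove(key);
                refreshing.remove(key);
            });
            refresher.setDaemon(true);
            refresher.start();
        }
        return current; // old content is served until re-caching finishes
    }
}
```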
Eager Re-cache Strategy: This strategy serves already cached content to website visitors while re-caching the most accessed pages (e.g. the top 100 high-traffic pages) in the background before flushing the cache. The idea is to have highly demanded content served by Magnolia at the highest possible speed at all times. With the Eager Re-cache, visitors will not have to wait for changed content to be cached, because this has already been done in the background before they accessed the page. The ability to configure the number of entries to be re-cached makes it possible for server administrators to balance the trade-off between time-to-publishing and the performance penalty for users.
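One way to sketch this strategy is to track hit counts per page and, on a content update, re-render the top N pages before flushing the rest. The code below is an illustrative assumption, not the EE implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative eager re-cache: hit counters identify the most requested
// pages; on a content update the top N are re-rendered into the cache
// before everything else is flushed.
class EagerRecache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> hits = new ConcurrentHashMap<>();

    String get(String page, Function<String, String> renderer) {
        hits.computeIfAbsent(page, p -> new LongAdder()).increment();
        return cache.computeIfAbsent(page, renderer);
    }

    void onContentUpdate(int topN, Function<String, String> renderer) {
        List<String> hottest = hits.entrySet().stream()
                .sorted((a, b) -> Long.compare(b.getValue().sum(), a.getValue().sum()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        Map<String, String> fresh = new ConcurrentHashMap<>();
        for (String page : hottest) {
            fresh.put(page, renderer.apply(page)); // hot pages re-rendered first
        }
        cache.clear();       // flush everything else
        cache.putAll(fresh); // hot pages are available immediately
    }
}
```

The `topN` parameter is the administrator-tunable knob mentioned above: a larger value shifts the balance towards visitor performance, a smaller one towards faster time-to-publishing.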
Another example is what we call the Content Driven Timeout Cache Strategy, which allows the author of a page to mark it as expiring after a time of the author's choosing. Such a scenario can already be implemented today, but it requires collaboration between a Magnolia template developer and the page author instead of leaving the decision-making power completely in the hands of the author.
Why would you use such a timeout cache strategy? Just think of a page that includes the results of a news feed: the content editor wants to make sure it is updated at least once a day, even though there is no actual content update in Magnolia, since the data is fetched from a third-party RSS feed. Or imagine the opposite scenario, which can be handled with the same strategy: a page that triggers data mining mechanisms which are quite expensive in terms of server load, while having no dependency on any other content. You would be able to mark such a page as expiring only at specified intervals, ignoring all other content updates, knowing that the page is not affected by authors' modifications at all.
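A per-entry timeout of the kind described could be sketched like this, with each cached page carrying an author-chosen expiry and being regenerated only once that interval has elapsed (hypothetical names throughout):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative content-driven timeout cache: each entry expires after an
// author-chosen interval and is regenerated on the next request after
// expiry, independently of any other content updates.
class TimeoutCache {
    private static final class Entry {
        final String html;
        final long expiresAt;
        Entry(String html, long expiresAt) { this.html = html; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    String get(String page, long ttlMillis, Supplier<String> renderer) {
        long now = System.currentTimeMillis();
        Entry e = entries.get(page);
        if (e == null || now >= e.expiresAt) {
            e = new Entry(renderer.get(), now + ttlMillis);
            entries.put(page, e); // (re)render once the author's interval has passed
        }
        return e.html;
    }
}
```

The RSS-feed page from the example above would use a `ttlMillis` of one day; the expensive data-mining page would use whatever interval its author chooses, and both would ignore ordinary content-update flushes.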
In sum, the new Magnolia cache implementation - apart from being more efficient at storing and serving content - also allows for more advanced handling of cached entries in the Magnolia Enterprise Edition.