
Do you execute ops in bulk or one by one?

Posted by rah003 on March 20, 2010 at 7:29 PM PDT

The last bunch of entries have been all about code. Today let's try something different, partly because at the time of writing I'm 34 thousand feet above the Atlantic and really don't feel like coding anything, but partly also because I have been thinking about the problem I'm going to describe for a while already.

I'm sure most of you have seen a similar issue in the past, or are fighting one right now. So please don't be put off by the fact that I describe it using Magnolia as an example; it exists in many other apps too.

What I want to talk about is activation and scalability.

Activation in Magnolia is a heterogeneous process that involves extracting content from the author instance and transferring it to one or more public instances, where it needs to be re-imported and re-ordered to match the order on the author instance exactly. And yes, you can already see the problem. The process consists of multiple steps that depend on each other, so for a single piece of content the easy thing to do is to run them one after another, which is exactly what the early activation code did. First, the content is extracted along with its ordering information. In the next step it is zipped and sent over HTTP to all subscribed public instances, where the ReceiveFilter picks up the incoming data, unzips it, locates the content that is being activated and replaces it with the incoming data. This is roughly what happens, without going into too many details such as the need to maintain a backup copy for transactional activation in case a rollback is invoked.
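To make the serial flow concrete, here is a minimal sketch of the steps for one piece of content. It is not Magnolia's code: the class, the two in-memory maps standing in for the author and public instances, and the method names are all made up for illustration, the HTTP transfer is replaced by a plain method call, and ordering information and the transactional backup are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SerialActivationSketch {
    // Hypothetical stand-ins for the author and one public instance: path -> content.
    public final Map<String, String> author = new HashMap<>();
    public final Map<String, String> publicInstance = new HashMap<>();

    // Author side: extract the content stored at the path and zip it.
    public byte[] extractAndZip(String path) {
        String content = author.get(path);
        if (content == null) {
            throw new IllegalArgumentException("Nothing to activate at " + path);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Public side: unzip the incoming data and replace whatever is stored at the path.
    public void receive(String path, byte[] zipped) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(zipped))) {
            String content = new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
            publicInstance.put(path, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each step waits for the previous one; the "transfer" is just a method call here.
    public void activate(String path) {
        byte[] zipped = extractAndZip(path);
        receive(path, zipped);
    }
}
```

Every step blocks on the one before it, which is fine for one page and is exactly the shape that stops scaling once there are hundreds.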

Now to scalability. There are multiple ways to scale the process up. The most obvious case is the one with multiple subscribers. Once you have collected the content and all its related information, there's no need to transfer it to one subscriber after another; you can just launch multiple threads and publish to all (or rather, reasonably many) subscribers at once. This effectively wipes out the difference between single-instance and multiple-instance activations. So far so good.
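Publishing to several subscribers at once could look roughly like the sketch below. The names are hypothetical and the `send` callback stands in for the real HTTP transfer; the fixed-size pool is my way of expressing "reasonably many" threads, not something taken from the Magnolia code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

public class ParallelPublishSketch {
    /**
     * Sends one already-extracted payload to every subscriber, with at most
     * maxThreads transfers running at the same time.
     * Returns the number of subscribers that received it without an error.
     */
    public static int publishToAll(byte[] payload, List<String> subscribers, int maxThreads,
                                   BiConsumer<String, byte[]> send) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<?>> transfers = new ArrayList<>();
            for (String subscriber : subscribers) {
                transfers.add(pool.submit(() -> send.accept(subscriber, payload)));
            }
            int succeeded = 0;
            for (Future<?> transfer : transfers) {
                try {
                    transfer.get(); // wait for this transfer to finish
                    succeeded++;
                } catch (ExecutionException e) {
                    // This subscriber failed; the others still get their copy.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }
}
```

The content is extracted once and shared by all transfers, so the total time is roughly that of the slowest subscriber rather than the sum of all of them.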

But this is just a single piece of content you have activated. The next scale-up scenario is the one in which you activate multiple pieces of content, and that's the one I would like to focus on today. The most simplistic approach would be to just create a list of the content to be activated and activate one piece after another. While this is technically possible with the current activation, the existing API makes it difficult to achieve. The most difficult part is that, due to locking, you can't activate a child while its parent is being activated, so you would need some sort of scheduling to make this work. While this is certainly not very nice, an average developer should be able to work around such a problem.

The other part of the issue with the current API is that it actually offers you an easy way out in the form of "recursive activation". During recursive activation, Magnolia itself traverses all the content in the tree and uses the provided rule to choose which pieces of the content need to be activated. And yes, it is an excellent feature that lets you activate multiple pieces of content with minimal effort. But ... there's always a but in any piece of code that seems useful at first, isn't there? The issue here is that it first traverses the tree and extracts all the content, and only once the extraction is finished does the transfer of the extracted content commence. Effectively, recursive activation requires you to keep a copy of all the content to be activated in memory at some point (really?) and, most importantly, it extends the original approach that worked fine for a single piece of content into bulk operations: first extract ALL, then transfer and import.

Once the extraction is finished, the next two phases are fine. But imagine extracting a few hundred pages: wouldn't it be more efficient to start activation immediately after extracting just the first piece of content? While not working miracles, it would allow the relatively inexpensive but slow process of transferring content over the network to begin as soon as possible, without taking many resources away from extraction.
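One way to overlap extraction and transfer is a plain producer/consumer pipeline with a bounded queue in between. The sketch below only illustrates that idea and is not how Magnolia does it: `extract` and `send` are hypothetical callbacks, content is modelled as a string, and failure handling is left out (if `send` throws, the sender thread dies and the extractor would block once the buffer fills up).

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;

public class PipelinedActivationSketch {
    // Unique marker object telling the sender that extraction is finished.
    private static final String DONE = new String("<done>");

    /**
     * Extracts the given paths one by one on the calling thread while a second
     * thread transfers whatever has been extracted so far.
     * At most maxBuffered extracted items are held in memory at any time.
     */
    public static void activate(List<String> paths, Function<String, String> extract,
                                Consumer<String> send, int maxBuffered) {
        BlockingQueue<String> extracted = new ArrayBlockingQueue<>(maxBuffered);
        Thread sender = new Thread(() -> {
            try {
                // Compare by identity so real content can never look like the marker.
                for (String item = extracted.take(); item != DONE; item = extracted.take()) {
                    send.accept(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        try {
            for (String path : paths) {
                extracted.put(extract.apply(path)); // blocks while the buffer is full
            }
            extracted.put(DONE);
            sender.join(); // wait until everything extracted has been sent
        } catch (InterruptedException e) {
            sender.interrupt();
            Thread.currentThread().interrupt();
        }
    }
}
```

The bounded queue also answers the memory question: instead of holding all extracted content at once, only `maxBuffered` items wait for transfer at any given moment.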

The last point with regard to scaling the process up is branching the transfer. Here again we have multiple strategies. One that is already implemented for concurrent activations lets the receiver accept incoming content before checking whether it can be written, and then retry multiple times with a predefined wait interval until there is no locked node in the hierarchy of the activated content. This could easily be extended to keep a queue of already-received but still-to-be-activated content without much extra effort, and it can easily be coupled with transferring multiple pieces of content at the same time and activating them out of order.
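The check-and-wait idea on the receiving end can be sketched as follows. This is an assumption-laden illustration rather than the actual receiver: locks are modelled as a set of path strings, "locked in the hierarchy" is taken to mean that the path itself, one of its ancestors or one of its descendants is locked, and all names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class RetryingReceiverSketch {
    // Paths currently being written; guarded by the synchronized methods below.
    private final Set<String> lockedPaths = new HashSet<>();

    // True if the two paths are equal or one is an ancestor of the other.
    static boolean overlaps(String a, String b) {
        return a.equals(b) || a.startsWith(b + "/") || b.startsWith(a + "/");
    }

    // Locks the path unless something in its hierarchy is already locked.
    public synchronized boolean tryLock(String path) {
        for (String locked : lockedPaths) {
            if (overlaps(locked, path)) {
                return false;
            }
        }
        lockedPaths.add(path);
        return true;
    }

    public synchronized void unlock(String path) {
        lockedPaths.remove(path);
    }

    /**
     * Tries to import already-received content, waiting waitMillis between
     * attempts. Returns false if the hierarchy was still locked after maxAttempts.
     */
    public boolean importWithRetry(String path, Runnable doImport, int maxAttempts, long waitMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock(path)) {
                try {
                    doImport.run();
                    return true;
                } finally {
                    unlock(path);
                }
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Replacing the sleep loop with a queue of pending imports that is drained whenever a lock is released would give the queued variant without changing the locking rule.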

The other solution is smart branching: knowing exactly which pieces of content are in independent subtrees and can be activated in parallel. While this sounds clever, the solution is a bit trickier to implement and still doesn't deal with concurrent activations triggered by multiple editors, so something similar to the existing check-and-wait or queue still needs to exist on the receiving end. Plus, while it is nice to have an intelligent sender, it is the receiving end's responsibility to sort things out properly.
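The sender-side part of smart branching boils down to splitting the paths to activate into groups that share no ancestor within the set. A rough sketch of that grouping, assuming paths are plain strings without a trailing slash (the class and method names are mine, not part of any API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SubtreeBranchingSketch {
    /**
     * Groups paths by their topmost ancestor within the given list.
     * Paths in the same group must be activated in order (parent first);
     * different groups are independent and could be activated in parallel.
     */
    public static Map<String, List<String>> groupByIndependentRoot(List<String> paths) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        // Natural string order puts every ancestor before its descendants.
        for (String path : new TreeSet<>(paths)) {
            String root = null;
            for (String candidate : groups.keySet()) {
                if (path.startsWith(candidate + "/")) {
                    root = candidate;
                    break;
                }
            }
            if (root == null) {
                root = path; // no ancestor in the set, so this starts a new subtree
                groups.put(root, new ArrayList<>());
            }
            groups.get(root).add(path);
        }
        return groups;
    }
}
```

Note that each path is checked against all roots found so far rather than only the previous one: in plain string order "/a-b" sorts between "/a" and "/a/c", so comparing with just the last root would wrongly split "/a/c" from its parent.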

And just as I said at the beginning, I don't really believe that this problem is unique to Magnolia. So I'm all ears to hear how you tackled similar problems.

BTW, what I wrote above is not entirely new or a surprise - there is a concept page about new activation that has been waiting to be implemented for a while already. But maybe there is yet another way that could or should be considered before anyone starts re-implementing activation.