Transactions and recoverability: what they mean to your applications
The main complaint I hear about using transactions is performance. To guarantee atomicity in the presence of failures, the coordinator must execute the two-phase commit protocol across all participants to achieve consensus. Assuming they all say "yes" during the first (preparation) phase, the coordinator must then make its decision to commit durable (persistent), so that if there's a failure it can pick up from where it left off.
When each participant receives the first-phase message (essentially asking "can you commit the work you've done?"), it must make any changes it is responsible for (e.g., table updates) durable, but in a provisional manner: the participant doesn't know the transaction outcome yet, so it had better not second-guess the coordinator (doing so can lead to what are known as heuristic outcomes). When the coordinator (eventually) sends the second-phase message (which is either "commit" or "rollback"), the participant can clear its durable log by making the provisional updates permanent (in the case of commit) or deleting them (in the case of rollback). This is obviously a simplified description of what goes on, but it should be sufficient for the purposes of this discussion.
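The prepare/commit behaviour just described can be sketched as a toy participant. This is illustrative only: the class and method names are invented, and a real participant would force its provisional log to disk during prepare rather than keep it in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a 2PC participant's two phases; names are illustrative,
// not taken from any real transaction manager.
class Participant {
    enum Vote { YES, NO }

    private final List<String> provisionalLog = new ArrayList<>(); // stands in for a durable log
    private final List<String> committedState = new ArrayList<>();

    // Phase 1: "can you commit the work you've done?"
    Vote prepare(String update) {
        provisionalLog.add(update); // in reality this must be flushed to disk before voting
        return Vote.YES;            // a promise: commit if (and only if) told to
    }

    // Phase 2: the coordinator's decision arrives.
    void commit() {
        committedState.addAll(provisionalLog); // make provisional updates permanent
        provisionalLog.clear();                // then clear the durable log
    }

    void rollback() {
        provisionalLog.clear(); // discard the provisional updates
    }

    List<String> state() {
        return committedState;
    }
}
```

The key point the sketch captures is that between `prepare` and the second-phase call the participant is in limbo: it must be able to go either way, which is exactly why the provisional state has to be durable in a real implementation.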
If everyone follows these rules, then you get guaranteed completion for all participants in a transaction, even if that completion is to completely undo the work: atomicity ensures that what one does, they all do, and durability (combined with a suitable failure-recovery component) ensures it happens even if there are failures.

However, the two-phase commit protocol, and the durability requirements on the coordinator and participants, obviously impose an overhead. If there are N participants in a transaction, then the coordinator sends 2N messages during 2PC in the commit case (there are optimisations, such as read-only and one-phase, that can help, but we'll consider the worst-case scenario). The coordinator must write a log record (and make sure it's flushed to disk rather than cached in the operating system buffer) and each participant must do likewise (again, ignoring optimisations). Disk performance has improved a lot over the years, but a forced write is still a physical process (e.g., moving the disk head to the correct place) and hence a major bottleneck.
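As a sanity check, the worst-case costs above can be written down directly. This is just back-of-the-envelope arithmetic for the commit case with no read-only or one-phase optimisations; the class and method names are mine.

```java
// Worst-case cost model for one 2PC round with N participants,
// matching the commit case described above (no optimisations applied).
class TwoPhaseCommitCost {
    // The coordinator sends N prepare messages, then N commit messages.
    static int coordinatorMessagesSent(int participants) {
        return 2 * participants;
    }

    // One forced log write at the coordinator (the commit decision),
    // plus one forced write per participant for its provisional updates.
    static int forcedLogWrites(int participants) {
        return 1 + participants;
    }
}
```

So a transaction with four participants costs eight coordinator messages and five forced disk writes before any of the actual application work is counted, which is why those forced writes dominate the overhead.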
But consider what you get for this overhead: guaranteed outcomes in the presence of failures and, if you've used distributed transaction support, this happens irrespective of the physical locality of your coordinator and its participants. Think about what you'd have to do if you wanted to do this yourself. Transactions are like an insurance policy for your critical applications: most of the time you don't see the benefit, but it's that "odd" occasion where you'll be really glad you had them.
So this is where the trade-off comes in. You trade off some performance for these guarantees. If you don't want the guarantees (and many applications simply don't need transactions), then you shouldn't be using transactions. If you do want the guarantees, then I wouldn't encourage you to build this yourself.
Now if we return to some of those optimisations I mentioned earlier, there are cases where durability isn't needed at all. Some transaction systems allow you to use transactions to control what are often termed "recoverable" entities: resources that need the consensus of 2PC but don't need failure recovery. If a crash happens, the data is lost, and that's just fine for these types of resources. You can mix recoverable and persistent (traditional) participants in the same transaction, and the coordinator should then write logs only for the persistent resources.
Unfortunately the JTA doesn't support recoverable participants, because it is tied to XA. That's not to say you couldn't write an XAResource implementation that was only recoverable; it's that the XAResource interface doesn't convey that information to the coordinator. So even if the coordinator could optimise its log writing for recoverable participants, it can't in a pure JTA environment, because it can't tell the difference via the participant interface. That means this optimisation isn't available in a portable manner.
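To see the problem concretely, here's a minimal, purely in-memory XAResource. It's a sketch against the standard javax.transaction.xa interface (the class names are mine): the resource is "recoverable-only", since nothing it does survives a crash, yet nothing in the interface lets it tell the coordinator to skip logging on its behalf.

```java
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.HashMap;
import java.util.Map;

// An in-memory, crash-volatile XAResource. A coordinator driving it through
// the standard interface has no way to know it needn't be logged.
class VolatileResource implements XAResource {
    private final Map<Xid, String> provisional = new HashMap<>();
    final Map<Xid, String> committed = new HashMap<>();
    private String pendingUpdate;

    void update(String value) { pendingUpdate = value; }

    @Override public void start(Xid xid, int flags) { }
    @Override public void end(Xid xid, int flags) { provisional.put(xid, pendingUpdate); }

    @Override public int prepare(Xid xid) {
        // Held in memory only: no forced disk write happens here,
        // but the return value looks identical to a durable resource's.
        return XA_OK;
    }

    @Override public void commit(Xid xid, boolean onePhase) {
        committed.put(xid, provisional.remove(xid));
    }

    @Override public void rollback(Xid xid) { provisional.remove(xid); }

    @Override public Xid[] recover(int flag) { return new Xid[0]; } // nothing survives a crash
    @Override public void forget(Xid xid) { }
    @Override public boolean isSameRM(XAResource other) { return other == this; }
    @Override public int getTransactionTimeout() { return 0; }
    @Override public boolean setTransactionTimeout(int seconds) { return false; }
}

// Trivial Xid, for illustration only.
class DemoXid implements Xid {
    private final int id;
    DemoXid(int id) { this.id = id; }
    @Override public int getFormatId() { return 1; }
    @Override public byte[] getGlobalTransactionId() { return new byte[] { (byte) id }; }
    @Override public byte[] getBranchQualifier() { return new byte[0]; }
}
```

From the coordinator's side of the interface, `prepare` returning `XA_OK` here is indistinguishable from a database having forced a redo log to disk, which is exactly why a pure JTA coordinator must log conservatively for every participant.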
That is about the only case where, in a production environment, you should consider using transactions without failure recovery. If you need transactions for your application and you're using a transaction system that doesn't support recovery, or doesn't have it enabled by default, then don't use that system. One argument I've heard from some for disabling recovery, or not even implementing it, is: "it's faster without recovery". Of course it's faster, because it's not doing anything! You've still got the 2PC overhead, of course, but without recovery (or recoverable participants), what's the point? It's like buying an insurance policy that doesn't pay out: you get the pleasure of the overhead, but you'll never see any benefit.
You could argue that for testing purposes you don't need recovery, and there may be a case there. However, if you're testing, performance shouldn't be an issue anyway. So what does a slightly slower response actually matter, when it means you get to test the entire system as it will work in a deployed environment?
Another argument I've heard is: "99% of applications don't need recovery". OK, that's fair (though I might not put it as high as that). But in that case, 99% of applications shouldn't be using transactions, because without recovery they don't get much benefit.
So after all of this, what I've tried to show is that if you want transactions then you need to be prepared to pay the cost. But the benefit, in the case where you do need them, can be huge. And in that case, make sure you get the whole transaction package from your implementation of choice, and be wary of those suppliers who try to tell you that recovery isn't important. If you're sure you don't need recovery, then you shouldn't be considering transactions either.