Jonas Bonér at Jfokus 2013 - Building Scalable, Highly Concurrent & Fault-Tolerant Systems: Lessons Learned
One of my favorite sessions at Jfokus 2013 was presented by Typesafe co-founder and Java Champion Jonas Bonér. I always enjoy discussions about how technology evolves over the decades, how we break away from concepts, then sometimes weave our way back via the latest and greatest thing, which can sometimes appear strikingly similar to something that was very much in vogue a decade or so earlier. So, when I saw the session title "Building Scalable, Highly Concurrent & Fault-Tolerant Systems: Lessons Learned" and read the description, I was pretty sure I'd find the session quite interesting.
I did not come away disappointed! You can download the slides from the Jfokus site. Just scrolling through the slides is interesting in itself.
Jonas describes the lessons as having been learned through "Agony and Pain, lots of Pain" - if you've worked for a long time on any type of high throughput system that has to run on a 24/7 schedule, you probably know what Jonas means in saying this. And to have learned these lessons, you have to have been working on those types of systems for a long time, and seen "best practices" of one type soar high, only to be discarded 18 months later once everyone recognizes problems that turn the best practice into a "worst practice" under certain conditions (like, for example, when you have to scale).
Jonas arranged his presentation into six sections. I'll briefly summarize each of the topics he covered.
It's All Trade-offs
Jonas starts off with the concept that, to accomplish one objective, you typically have to trade away some other possibilities. He highlights three cases of objectives that involve tradeoffs:
- Performance vs Scalability
- Latency vs Throughput
- Availability vs Consistency
Are these things actually opposed to one another? Maybe not on your desktop or smartphone app. But when you're trying to serve thousands or millions of users, the equations are quite different.
Serving lots of users means you're running lots of machines, lots of threads, and your load probably dramatically spikes at times... Jonas says: "Shared mutable state, together with threads, leads to code that is totally INDETERMINISTIC and the root of all EVIL. Please, avoid it at all cost" -- which we do by using immutable state.
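To make the immutable-state advice concrete, here is a minimal Java sketch of my own (it is not taken from Jonas's slides). The names `ImmutableStateDemo` and `Balance` are hypothetical. Each "update" builds a new immutable `Balance`, and threads publish it through an `AtomicReference` rather than mutating a shared object under a lock.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ImmutableStateDemo {
    /** Hypothetical immutable value: a "change" produces a new object, so no reader sees a half-updated one. */
    static final class Balance {
        private final long cents;

        Balance(long cents) { this.cents = cents; }

        long cents() { return cents; }

        Balance plus(long delta) { return new Balance(cents + delta); }
    }

    /** Starts `threads` workers that each add one cent `perThread` times; returns the final total. */
    static long run(int threads, int perThread) {
        AtomicReference<Balance> current = new AtomicReference<>(new Balance(0));
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    // Atomically swap in a fresh immutable snapshot; retried if another thread won the race.
                    current.updateAndGet(b -> b.plus(1));
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return current.get().cents();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000)); // 4 threads x 10,000 deposits = 40000
    }
}
```

No update is lost even though nothing is locked, because the only thing the threads share is a reference to a value that never changes after construction.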
On slide 20, Jonas goes into the problems locks entail (if you've written multithreaded software and worked with locks, these will be familiar to you). Locks can easily bring a system to a stand-still and, as Jonas notes, error recovery is hard.
Jonas describes Dataflow Concurrency as being a solution. Code developed in this manner is deterministic, declarative, data-driven, and there is no difference between concurrent code and sequential code. Examples of this include Akka and GPars.
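As a rough illustration of the dataflow idea, the sketch below approximates single-assignment dataflow variables with the JDK's `CompletableFuture`. This is my own stand-in and an assumption on my part. It is not the Akka or GPars dataflow API, and the class name `DataflowDemo` is hypothetical.

```java
import java.util.concurrent.CompletableFuture;

final class DataflowDemo {
    static int compute() {
        // Two "dataflow variables": unbound until complete() is called on them.
        CompletableFuture<Integer> x = new CompletableFuture<>();
        CompletableFuture<Integer> y = new CompletableFuture<>();

        // z is declared in terms of x and y; it is computed once both are bound.
        CompletableFuture<Integer> z = x.thenCombine(y, (a, b) -> a + b);

        // Bind the inputs from other threads, deliberately in "the wrong" order.
        CompletableFuture.runAsync(() -> y.complete(32));
        CompletableFuture.runAsync(() -> x.complete(10));

        return z.join(); // waiting here is only so the demo can return a value
    }

    public static void main(String[] args) {
        System.out.println(compute()); // 42, whichever binding happens first
    }
}
```

The definition of `z` reads like sequential code, and the result is the same whichever thread binds its input first, which is the deterministic, declarative quality described above.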
The point here is to design your software so it reacts when something happens. You don't have a thread idly waiting for something to happen, because that wastes the thread for as long as it waits. "Blocking kills scalability (and performance)." A better answer is to make your system asynchronous and event-driven.
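Here is a small sketch of what "react when something happens" can look like in plain Java. It is my own illustration, and names such as `ReactiveDemo` and the `"order-42"` event are made up. The caller registers a callback and carries on; the callback runs when the event arrives on another thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ReactiveDemo {
    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        // Stands in for a reply that has not arrived yet.
        CompletableFuture<String> reply = new CompletableFuture<>();

        // Register the reaction; nothing waits for the reply at this point.
        CompletableFuture<Void> handled = reply.thenAccept(r -> log.add("reacted to " + r));

        log.add("caller moved on");

        // The "event" arrives later, on a different thread, and triggers the callback.
        new Thread(() -> reply.complete("order-42")).start();

        handled.join(); // only so the demo can return the finished log
        return new ArrayList<>(log);
    }

    public static void main(String[] args) {
        System.out.println(run()); // [caller moved on, reacted to order-42]
    }
}
```

The only blocking call is the final `join()`, which exists purely so the demo can hand back its log. In a real event-driven system, that last step would itself be another callback.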
Jonas begins this section with a slide about the problem of failure recovery when you have a single controlling thread. Basically, the problem is: "If this thread blows up you are screwed." That's a pretty serious problem, no? Beyond that, if this happens, there's really no good way of finding out that something's failed, because you may have thousands of similar threads running across hundreds of servers. Realistically, even with a data center consisting of a handful of networked servers with a load balancer out in front, diagnosing intermittent problems can be very, very difficult -- "painful" as Jonas might say. I know this from experience!
The solution is isolated lightweight processes that are monitored by a supervisor process. You don't load up the threads with all kinds of error checking. Rather, you make the threads simple and independent of one another, and the supervisor process notes when a crash occurs. If you start getting an unexpectedly large number of crashed threads, then, indeed, it's time to try to figure out why, but otherwise you just let threads occasionally crash -- you blame it on the network (the threads generally succeed, so what else could it be?).
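The supervisor idea can be sketched in a few lines of plain Java. This is a toy of my own, not Akka's supervision API, and the `Supervisor` class, its `maxRestarts` limit, and the simulated glitch are all assumptions for illustration. The worker contains no error handling at all; the supervisor notices the crash, counts it, and restarts the worker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class SupervisorDemo {
    /** Hypothetical supervisor: runs a worker, notes each crash, and restarts it up to a limit. */
    static final class Supervisor {
        private final ExecutorService pool = Executors.newSingleThreadExecutor();
        private final int maxRestarts;
        private int crashes = 0;

        Supervisor(int maxRestarts) { this.maxRestarts = maxRestarts; }

        <T> T supervise(Callable<T> worker) {
            while (true) {
                try {
                    return pool.submit(worker).get(); // blocking wait keeps the demo short
                } catch (ExecutionException crash) {
                    crashes++; // the worker died; the supervisor just records it and retries
                    if (crashes > maxRestarts) {
                        throw new IllegalStateException("too many crashes, time to investigate", crash.getCause());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
        }

        int crashes() { return crashes; }

        void shutdown() { pool.shutdown(); }
    }

    static String run() {
        AtomicInteger attempts = new AtomicInteger();
        Supervisor supervisor = new Supervisor(5);
        try {
            // The worker has no error handling of its own: it either succeeds or blows up.
            String result = supervisor.supervise(() -> {
                if (attempts.incrementAndGet() < 3) {
                    throw new IllegalStateException("simulated network glitch");
                }
                return "ok";
            });
            return result + " after " + supervisor.crashes() + " crashes";
        } finally {
            supervisor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // ok after 2 crashes
    }
}
```

The crash counter is the hook for the "unexpectedly large number of crashes" case: below the limit the supervisor shrugs and restarts, above it the failure is escalated.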
Here, Jonas talks about the dilemma of Performance versus Scalability. Performance is judged by how your system works for a single user; scalability is judged by the difference between your system's performance for a single user and its performance for a great many users. The best way to go distributed, Jonas says, is to "Embrace the Network, Use Asynchronous Message Passing, and be done with it."
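For a feel of asynchronous message passing, here is a minimal actor-like sketch in plain Java. It is my own approximation rather than Akka code, and `CounterActor`, `tell`, and `ask` are hypothetical names chosen for this example. Senders drop messages into a mailbox and move on; the state is only ever touched by the mailbox's single thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class MessagePassingDemo {
    /** Hypothetical actor-like counter: its state is touched only by its own mailbox thread. */
    static final class CounterActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        private long count = 0;

        /** Fire-and-forget send: enqueues the message and returns immediately. */
        void tell(long delta) { mailbox.execute(() -> count += delta); }

        /** Request/reply without blocking the sender: the answer arrives as a future. */
        CompletableFuture<Long> ask() { return CompletableFuture.supplyAsync(() -> count, mailbox); }

        void stop() { mailbox.shutdown(); }
    }

    static long run(int senders, int perSender) {
        CounterActor counter = new CounterActor();
        Thread[] threads = new Thread[senders];
        for (int i = 0; i < senders; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perSender; j++) {
                    counter.tell(1); // no lock, no waiting for a reply
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // The query is queued behind every message sent above, so it sees all of them.
        long total = counter.ask().join();
        counter.stop();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1_000)); // 4000
    }
}
```

Because the senders never share the counter itself, the same design works whether the mailbox lives in the same process or, with a network transport in between, on another machine.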
Wherever possible, your system should avoid guarantees. Why? Because the cost of guaranteeing something is very high, and it's almost impossible to guarantee anything with 100% reliability in a networked environment. Consider: your software does SOMETHING; to guarantee that the SOMETHING happened, you have to add a new layer of processing on top of that first layer. What's the cost? Either you're going to decrease performance, throughput, scalability; or you'll increase latency. And... hmm... what if your guarantee monitor package returns a message "doesn't look like the SOMETHING I'm monitoring was completed." How do you know it's not the guarantee monitor software that messed up this time? The point is: if it's not critical to guarantee that SOMETHING happened, don't write software that tries to guarantee it.
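To put a rough number on the cost of a guarantee, the toy below compares fire-and-forget sending with an acknowledged send that retries until the acknowledgement comes back. Everything here is an assumption for illustration: `LossyLink` is a hypothetical stand-in that drops every fourth transmission, and real networks fail far less regularly.

```java
final class DeliveryCostDemo {
    /** Hypothetical unreliable link: deterministically drops every fourth transmission. */
    static final class LossyLink {
        private int sent = 0;

        /** Returns true if this transmission got through. */
        boolean transmit() {
            sent++;
            return sent % 4 != 0;
        }

        int transmissions() { return sent; }
    }

    /** No guarantee: one transmission per message, and the sender never learns the outcome. */
    static int fireAndForget(int messages) {
        LossyLink link = new LossyLink();
        for (int i = 0; i < messages; i++) {
            link.transmit();
        }
        return link.transmissions();
    }

    /** Guaranteed delivery: resend until both the message and its acknowledgement get through. */
    static int withAcks(int messages) {
        LossyLink link = new LossyLink();
        for (int i = 0; i < messages; i++) {
            boolean confirmed = false;
            while (!confirmed) {
                boolean delivered = link.transmit();      // the message itself
                confirmed = delivered && link.transmit(); // the ack travels over the same link
            }
        }
        return link.transmissions();
    }

    public static void main(String[] args) {
        System.out.println(fireAndForget(4)); // 4
        System.out.println(withAcks(4));      // 14
    }
}
```

In this particular run, every dropped transmission happens to be an acknowledgement, so each retry resends a message that had already arrived. That is the "was it the SOMETHING or the monitor that failed?" question in miniature, paid for with more than three times the traffic.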
With respect to Latency versus Throughput, Jonas says: "You should strive for maximal throughput with acceptable latency." What's acceptable in terms of latency depends on the purpose of your system. High latency is fine with me in my "Words with Friends" games that I play with your Java.net homepage manager (aka my wife, Dale) on Facebook, since she's so much better at Scrabble than I am.
In this section, among other things, Jonas describes the scalability problems relational databases can entail. Again, if you've worked on a system that's designed to serve even merely hundreds of users, and that system relies on an RDBMS, you are probably familiar with "it's a database issue" being said again and again as performance issues and other chaos ensue during usage spikes. As Jonas notes: "Scaling reads to a RDBMS is hard. Scaling writes to a RDBMS is impossible." Relational databases were designed, let's say, a while before the Internet took over the world. They're great when used in their proper role -- that is, where you need the ACID test passed (Atomic, Consistent, Isolated, Durable). But, do you need that if your system's purpose isn't financial transactions or something like that?
Here, Jonas talks about Availability versus Consistency. In a networked system, you can't have both of these at the 100% level. The network itself is inherently unreliable (if yours isn't, please let Jonas and me know what networking hardware and software you're using!). If your objective is availability, you can't be as concerned about consistency. Attaining consistency gets you into the whole issue of guarantees. If at every moment in time, you must have consistency, your system will have to halt until that consistency is verified. A halted system is a system with reduced availability. The only way to maximize availability is to reduce the requirements for consistency right now. If it's not the end of the world if User A responds without having seen what User B did 3 seconds ago, then you don't need 100% consistency, and you should probably utilize your available resources in a way that promotes greater availability.
Jonas sums up this situation by saying you want to have a system that's basically available, has a soft state, and is eventually consistent. Eventually User A will see what User B did, but the timing of their interactions in the physical world won't necessarily be reflected in real time in the virtual world of their networked interaction. They're not exchanging money, just words, so eventual consistency is fine. In this case, go for greater availability and reduced latency in designing your networked app. "When do you need ACID?" Jonas asks. Often, you don't, right?
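The "basically available, soft state, eventually consistent" idea can be shown with two toy replicas. This is a sketch of my own under simple assumptions: `Replica` is a hypothetical class, each replica only ever adds messages, and synchronization is a plain set union. Real systems have to deal with deletes, ordering, and conflicts as well.

```java
import java.util.Set;
import java.util.TreeSet;

final class EventualConsistencyDemo {
    /** Hypothetical replica: accepts writes locally and reconciles with its peers later. */
    static final class Replica {
        private final Set<String> messages = new TreeSet<>();

        /** Always succeeds locally, even if the other replica is unreachable (availability). */
        void post(String message) { messages.add(message); }

        /** May be stale: returns whatever this replica has seen so far (soft state). */
        Set<String> read() { return new TreeSet<>(messages); }

        /** Background sync: a union of both views, so replicas converge once they exchange state. */
        void mergeFrom(Replica other) { messages.addAll(other.messages); }
    }

    static String run() {
        Replica nearUserA = new Replica();
        Replica nearUserB = new Replica();

        nearUserB.post("B: played a word");
        nearUserA.post("A: still thinking"); // A responds without having seen B's move

        int staleSize = nearUserA.read().size(); // A's replica knows only A's own message

        // Some time later the replicas exchange state in both directions.
        nearUserA.mergeFrom(nearUserB);
        nearUserB.mergeFrom(nearUserA);

        boolean converged = nearUserA.read().equals(nearUserB.read());
        return "stale=" + staleSize + " converged=" + converged + " size=" + nearUserA.read().size();
    }

    public static void main(String[] args) {
        System.out.println(run()); // stale=1 converged=true size=2
    }
}
```

Neither user ever had to wait for the other replica before posting, which is the availability being bought; the price is the window in which User A's view is missing User B's move.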
I really enjoyed Jonas Bonér's Jfokus 2013 session. You should download the slides at minimum. If a video becomes available, I'll definitely be watching it!