JFrog's Bintray, Part 1: the Non-Trivial Problem of Developing an Infrastructure to Reliably Serve Software Binaries
I was very pleased to see the JFrog team take me up on my suggestion early this year that "Maybe the JFrog team will consider giving a presentation on how they put all this together at JavaOne this September". Unfortunately for me, the session (Building a Massively Scalable Cloud Service from the Ground Up) was already filled to capacity as I walked up to the door, so I was unable to attend.
If you're not familiar with Bintray, it's a distribution system for software binaries that's framed within a set of modern social networking tools. The social networking tools facilitate the building of communities centered on specific binaries, repositories, types of software, companies, and so on. Since the communities are self-forming, there's really no limit on the categories around which a community can form.
Getting all this to work wasn't trivial, according to the description of Yoav's session:
Serving developer binaries isn’t trivial. Such binaries are consumed by tools, and create massive request load. Add to that support for metadata, REST API, storage quotas, stats, repo indexes on demand and global HA distribution, and you’ve got yourself a pretty complicated system to run and manage. This talk will show you how Bintray, JFrog’s social binary distribution service, works. We will speak about how the system segmentation supports massive loads across data centers with stateless vertical scaling; how Grails applications scale and how we tie up different NoSQL technologies such as CouchDB, MongoDB, ElasticSearch & Redis; how we chose between physical and virtual servers and how we manage deployments without service interruption.
After missing the session, I spoke with Yoav and other members of the JFrog team about Bintray. At that time, Bintray had about 7,000 registered users and was serving about 72,000 packages from popular Java repositories (including Bintray's own repository). And here's where the scalability comes in: as of late September, Bintray had already served 1.2 billion requests for software package downloads.
This is a good place to pause and think a bit. A tweet is at most 140 characters of text, surrounded by a few more bytes of metadata (the identity of the tweeter, the date and time of the tweet, etc.). Now, compare that with the size of a software package. Serving a software binary means transmitting a substantial amount of data from the server to the requester. So, yes, you need significant processing power, but you also need the scalability and reliability to maintain long-lived connections (possibly minutes, depending on network conditions) between the server and the downloading client. That's a very different problem from processing a couple hundred bytes of data, after which it's basically OK if the connectivity temporarily disappears (Twitter's situation).
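A back-of-the-envelope calculation makes the contrast concrete. The figures below are illustrative assumptions (a 50 MB binary and a 1 MB/s effective connection), not measured Bintray numbers:

```python
# Rough comparison of the two workloads: a tweet-sized payload versus
# a software binary. All figures are illustrative assumptions.
TWEET_BYTES = 200                  # ~140 chars of text plus some metadata
BINARY_BYTES = 50 * 1024 * 1024    # a hypothetical 50 MB software binary
BANDWIDTH_BPS = 1 * 1024 * 1024    # assume a 1 MB/s effective connection

# How many tweet-sized payloads fit in one binary download?
ratio = BINARY_BYTES // TWEET_BYTES

# How long must the connection stay healthy to complete the download?
transfer_seconds = BINARY_BYTES / BANDWIDTH_BPS

print(ratio)             # 262144 -- one binary ~ a quarter-million tweets
print(transfer_seconds)  # 50.0   -- nearly a minute of sustained transfer
```

Even with generous bandwidth, a single download ties up a connection for orders of magnitude longer than delivering a tweet, which is exactly why long-lived connection reliability dominates the design.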
Twitter applied Java technology to address their need to process enormous onrushes of tweets. But consider what happens when a new version of a widely used software package comes out. Suddenly there is enormous demand from the user community: a very large number of simultaneous requests, each transferring hundreds of thousands of bytes or more, and every one of those bytes has to be delivered to each downloading client with zero data loss. Indeed, that's not trivial.
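That zero-data-loss requirement is typically enforced on the client side as well: Maven-style repositories publish a checksum file (e.g. a `.sha1` file) alongside each artifact, and build tools verify it after downloading. A minimal sketch of that check, using placeholder artifact bytes rather than a real download:

```python
import hashlib

def verify_download(data: bytes, expected_sha1: str) -> bool:
    """Compare the SHA-1 digest of downloaded bytes against the
    checksum published alongside the artifact."""
    return hashlib.sha1(data).hexdigest() == expected_sha1

# Simulate a download whose published checksum we already know.
artifact = b"example artifact bytes"          # stand-in for a real binary
published = hashlib.sha1(artifact).hexdigest()

print(verify_download(artifact, published))       # True: intact transfer
print(verify_download(artifact[:-1], published))  # False: truncation detected
```

A corrupted or truncated transfer produces a different digest, so the client can detect the loss and retry rather than silently installing a broken binary.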
More to come in Part 2...