How To Make Life Suck Less (While Making Scalable Systems)
As I write this, I’m sitting at my home office with a five-year-old desktop and its failing monitor. Surrounding me are protein bar wrappers, fingerprint-covered water glasses, and guitar amp vacuum tubes. It’s 7 in the morning.
If you don’t know me you might think I’m an early riser. Alas… I’ve actually been up all night, unable to sleep, babysitting my Hadoop cluster and worrying about getting this new platform into production.
The past three months have been hellish — some amazing accomplishments, but plenty of painful mistakes. “You guys are working hard but it seems so effortless,” I heard from another team lead. If he only knew. Developing a distributed, concurrent platform for web-scale data is far from effortless. A great deal of the time we spent ‘working hard’ was not new coding at all, but relentlessly beating on problems. I don’t always take my own advice, but I’m getting better. And so, I’ll try to keep the following in mind when planning what to do next:
Developing a platform for web-scale data analysis is hard work. Unlike well-worn areas like MVC on LAMP, big data (billions of records, intensive analysis) presents brand new challenges that need innovative engineering, testing strategies, and planning.
In recent months I’ve led my team at Visible Technologies through creating a new analytics platform based on Hadoop. This article lists a few lessons from our ordeal, which hopefully should help if you venture into similar territory. Here are my observations:
1. Big data is BIG
2. This is systems software, not an application (for now)
3. Learn the source, engage the community, contribute feedback
4. Scalable doesn’t imply cheap or easy. Just cheaper and easier.
5. It’s so much easier with smart, experienced people.
Let’s examine these in detail…
Big Data is BIG
“We’ve got two weeks to test our entire storage, indexing, and search workflow for up to one billion documents, with multi-lingual data”, I told my team. Our task seemed pretty easy, at the time. We had the workflow written up (in pieces at least), and had tested the system on a backup database of a few hundred million documents. Everything ran flawlessly — so far, so good.
Doubling the data size on a relatively untaxed computer cluster seemed like it should be painless. I’d run jobs to index hundreds of millions of documents before, no problem. When it came time to index a billion documents, though, it failed and kept failing. Iterative debugging processes become unsustainable when it takes 8 hours to run a test cycle. I found the answer, ten days later, in an obscure subroutine for sending a heartbeat back to a central server.
Scalable platforms may coordinate many machines to act as one, but that doesn’t mean you can engineer them the same way you normally would. Everything gets more difficult when multiple nodes are involved; more power, more problems.
The problem with big data is that it’s BIG. Things fail at scale, the same way they do on smaller projects. But with scale comes longer timeframes and more moving parts, and you need to plan testing strategies accordingly. The Hadoop/HBase world brings us back to the 70s concepts of batches and time-sharing. It’s scary to run jobs that take 8 hours to fail, and having the code of team members in various states of testing just adds to the nightmare.
If it takes a long time to fail, it’s going to take even longer to fix. Buy as much hardware as you can justify, and structure your test data to process a small amount before hitting bugs.
This is systems software, not an application (for now).
Plenty of MVC/LAMP or Rails programmers can write a good website, connect their markup through AJAX to a middle tier, and persist state to a DB. Most web-based software never gets more complex than this. It’s not always easy, especially with complex visualization and business logic thrown in — but it’s based on customizing well-delineated applications, and high-level shortcuts are plentiful.
Big data is not about applications yet. You’re not writing code to trigger APIs between standard modules, you’re working with the raw “stuff” of a distributed operating system, and digging around in the basement. Coding and debugging in this context requires a stronger handle on computer science theory than most web development demands.
As an example, here are a few problems we’ve stumbled upon:
1. Garbage collection pauses, and nodes lose heartbeat, when Java’s used memory heap exceeds 7GB.
2. Figuring out which garbage collection scheme to use (and which of the 10 associated parameters to tweak), requires a knowledge of the generational hypothesis of GC.
3. Microsecond race conditions can trip up processing on obscure data structures.
4. General headaches when dozens of threads need to share
5. Making everything an object can bog a cluster down. Instantiation is expensive and small objects are not cache-friendly.
6. It’s even important to understand the assumptions Java’s JIT Compiler makes about your code.
Note: Just because Hadoop requires you to work in Java, doesn’t mean it’s not systems code that you’re writing. Treat it with the same deference as a database written in C, which is something that’s likely to be very well architected from the beginning and highly optimized.
Learn the source, engage the community, contribute feedback
“Katta keeps failing when we try to deploy our 500M document index.”
“Why?”
“Not sure. Trying it again.”
An hour later, it failed again. So we made a configuration tweak. An hour later… more fail. For two days. Finally, we dove into the code. Two days might seem like a long wait to look deeper into a problem, but at scale it can take that long to eliminate all the random shots in the dark. We eventually traced the crash to a line of code that was timing out waiting for Zookeeper to give it some information. Once we were in the code, it took about 15 minutes to find the issue. There can be a mental barrier to examining the codebase of open source software (an ironic twist on not-invented-here syndrome), but mostly I’ve found them pretty easy to navigate.
This lesson of look fast, look early applies to pretty much any open source project, but doubly so for distributed ones because the time cost of incremental, scattershot debugging can be so high.
Along with studying OSS source code, you should engage the community around the platform you’re using. I’m extremely grateful to have the HBase team around on IRC and their mailing list. I usually get answers to even complex questions in a few minutes/hours. You can’t really engage the community without learning some of the source — you’ll be wasting everyone’s time. I’m trying to repay the HBase team for their kindness by submitting a few patches (although I wish I had the time to do much more, there’s a lot of hefty work to do).
Scalable doesn’t imply cheap or easy. Just cheaper and easier.
I was having a conversation with an associate about our new platform and the difficulties of developing in our current environment.
“Why do you need new machines? You have five. Isn’t that enough to prototype our new infrastructure? We only have two SQL servers,” he posited.
“The machines are processing so much data that they’re crashing because the CPU is unavailable waiting for I/O,” I replied.
“Where’s the proof?”
It’s difficult to set expectations in emerging fields such as commoditized distributed computing, especially around hardware costs and platform capabilities. Somewhere along the way, “commodity hardware” became interpreted as “$500 machines”.
Would you run a high-volume, production MySQL server on a 2 core, 2 GB RAM Box from 2005? Of course not. Why would you want to run Hadoop or HBase on a machine of the same specs? You can’t have a cluster that costs $5000 comprised of 10 boxes that cost $500. Nor do many of us want to pay $500,000 for Netezza hardware, or $5,000,000 to Oracle. A $50,000 cluster is cheap, and can handle a lot of data. Virtualization is not a panacea, either. If you need high-volume and high-avaliablity, you need to have control over the physical environment of the boxes. Running things in “the cloud” can have too much overhead, and shields you from important locality-awareness such as which rack you’re running on. It works for some, but you need to run the numbers to make sure you’re not losing too much performance or reliability.
Unlearning (and un-teaching) the habits of the Swiss Army RDBMS can be difficult. As mentioned in earlier articles, databases like Cassandra and HBase are not RDBMSs. You cannot just write SQL and expect everything to ‘just work’. Your storage and access paradigms need to be architected from the ground up to fit the data. Impedance mismatches between data structures and data storage are fatal at large scales, and lead to general engineering malaise — bad code, strange bugs, inconsistent performance, and much more. Not everything is a table, these days. It’s a tough sell to many of those who depended on an RDBMS to handle every aspect of data storage and manipulation.
As a friend of mine recently said: “To create something that is scalable, you have to add a certain level of complexity over and beyond the single node. Just like going to multi-threaded programming is so much more complex, so is going clustered.” But if you do it right, you’re not limited by data size any more.
It’s so much easier with smart and experienced people
I love my team dearly. Still, every day, I find myself wishing I knew someone who’d experienced the same gotchas I have; Java classpaths and Linux deployments, performance tuning, and general issues with highly concurrent software. There’s such a big difference between the problems of a typical 3-tier web app, and the problems that arise from distributed systems engineering.
Having said that, you don’t need to revamp your entire team just to move into this area. But having at least one person who’s been there before will save your whole operation weeks of Googling and downtime, especially if you can put them in a troubleshooter or coordinating role. And if you’re transitioning a former closed-source group to OSS distributed software, make sure everyone has their CS fundamentals down — that’s always a good start.












Hi, I'm Bradford. I write about scalability and the fringes of Computer Science.
September 9th, 2009 at 9:55 pm
[...] This post was mentioned on Twitter by Chris K Wensel, BradfordS, Jedi/Sector One, Josh Cutler and others. Chris K Wensel said: RT @LusciousPear: How To Make Life Suck Less (While Making Scalable Systems) http://bit.ly/6LrvI #nosql #hadoop #hbase #cassandra [...]
September 10th, 2009 at 6:54 am
[...] #scalable #hadoop http://www.roadtofailure.com/2009/09/09/how-to-make-life-suck-less-while-making-scalable-systems/ [...]
October 23rd, 2009 at 5:31 pm
What on earth do you mean by the CPU being overloaded because it is waiting for IO ?
IO starvation means the CPU is *underused* not *overused*.
October 23rd, 2009 at 5:49 pm
Semantically, true… but the CPU is still pretty useless at that point anyway :)
November 18th, 2009 at 8:06 am
Semantically looks like you have no idea what you’re talking about. Well… CPU overload because of IO-wait, funny.
November 18th, 2009 at 11:41 am
For the pedants in the audience, I changed the term “overloaded” to “unavailable”.
February 14th, 2010 at 9:39 am
[...] Programmer 97-thingsIt’s OK Not to Write Unit TestsPsychology and Security Resource PageHow To Make Life Suck Less (While Making Scalable Systems)The myth of “undesigned”The Problem with Design and ImplementationRelated Reading:Art of [...]