A New DB for 80% of Facebook, YouTube-scale Sites
Imagine you’re visiting several of your favorite massive-scale websites. Or go ahead and pull them up in your browser. Facebook, MySpace, LiveJournal, Bebo, Streamy, YouTube, anything that deals with huge amounts of data, or any sufficiently large Social Media site. While you’re at it, take a look at any data-driven website: a forum, your favorite blog, or a news site.
As you examine these sites, try to think about what kind of data they need to present to their users rapidly. There’s usually a selection of:
- Profiles (like users)
- Search results
- Sets of items by various criteria (all videos by one author, posts in a thread)
- Networks (“my friends”)
- Real-time updates “My status”
- Communications subsystems (messages to inboxes, e-mails, IMs)
Until recently, there’s been no other way to retrieve this data than with a classic RDBMS, which as we’ve seen earlier can be expensive and crippling to scale. Yet I’d estimate that a few use cases cover 80% of how data needs to be presented to these websites. The Swiss-Army RDBMS is costly overkill for what most sites *really* use it for. There needs to be a new type of database optimized for serving simple, web-scale data in real-time.
Let’s examine these operations, along with some examples. Then, I’d like to propose a scalable engine I’m architecting that’s actually optimized to drive 80% of websites, built on HBase.
THE OPERATIONS
There’s a handful of operations that can both drive websites and provide low-latency analytics on information. These are:
- GROUP BY
- FILTER
- SORT
- COUNT
- SUM
There’s a whole class of operations called “additive” and “semi-additive” that are basically simple aggregations. These are used in a majority of DB analytics use cases.
By proper structuring of data storage and denormalization, along with narrowly focusing on a few operations, we can enable new class of scalable database and analytical engines.
I can’t recall a term which properly describes what this class of database does:
- Data Warehouse? That doesn’t work. Too big. I imagine getting in a forklift, driving to the data I need, loading it on a pallet, driving it to a staging area, unpacking the pallet, and eventually getting at the pieces I want. Forklifts are pretty awesome, granted. (Sometimes I fantasize smashing our current slow-as-agriculture SQL boxes with a forklift).
- Data Mart? Smaller, but still slow. I have to grab a grocery cart, walk to the aisle which has my data, pick it off a shelf, and then take it to a cashier so I can do what I want with it.
- Operational Data Store? Getting closer. Wikipedia says it “integrates data from many sources for reporting, usually atomic pieces of data captured in ‘real’ time”
I’m going to call it a Data Station: a place where you drive up with a car, get only exactly what you need as quickly as possible, and then leave.
THE HBASE DATA STATION
Hadoop and HBase are certainly scalable, but no part of their architecture is low-latency. You can serve webpages from HBase rows, but you can’t do any aggregations there.
There’s a few scalable databases (HBase, Cassandra) out there that handle a subset of this: you can retrieve data in all of them, and Lucene does faceted search, which implements several of the operations (notably grouping and counting). Lucene is fast and easy, but it’s pretty search-and-document oriented.
I think HBase is an excellent candidate for a storage engine. It’s scalable, fast (with the latest release), column oriented, and has a fairly intuitive codebase. That’s all HBase is – distributed storage and retrieval. It’s very impressive, but it delegates the duty of doing anything to your data out to the application. This makes it hard to crunch through lots of data with low latency, because you have to move the data somewhere.
What if we implemented a package that sat on top of where the data is stored on each server, performed your operations as close to the data as possible, and then sent the result to a master node for final aggregation? It’d almost resemble a MapReduce job. Let’s look at a diagram:
It’s pretty simple, and it won’t scale to infinite data per node — but it’s not meant to. I’m calling this “HBase Analytics” for lack of a better name. This solution is meant primarily to drive websites with a low latency, especially when you can’t pre-aggregate data. Implementation is somewhat straightforward. Without going into too much detail on the HBase architecture, you can ‘piggyback’ on top of how rows are scanned, and do your aggregation there. The point of this is to move data aggregation away from the app layer, and keep it as close to the data as possible.
In addition, these basic operations are commutative and associative. What that boils down to is that the order in which you read the data and make your aggregates is unimportant. Thus, you can make this process multithreaded.
Approximate consistency is a big performance boost as well. You can pass options to “trim” your data, so only the most significant items are sent. This is important for scenarios such as “Find the Top 10 Authors for these documents”. It might be a little lossy, but if I can cut my latency in half and get data that’s 90% correct, I’ll call it in a win in quite a few scenarios.Scaling to concurrent user demand is pretty simple: increase data replication, and provide more “master” servers to perform aggregation. This is in addition to the typical approach of intelligent caching, latency profiling, etc.
THE END
This solution isn’t maximally performant of course — though secondary indices could help with that too. But it is simple and scalable, keeping a “Data Station” mentality. Taking a look at 80% of big-data-driven websites, you can implement all their functionality with GROUP, COUNT, FILTER, SUM, SORT, and a few other additive and semi-additive operations. If that can be done in a way that’s truly scalable, then building “big data” websites can be done in a few weeks by a handful of people, instead of a few months with custom, hacked-together software. Stay tuned for news on the evolution of HBase Analytics.












Hi, I'm Bradford. I write about scalability and the fringes of Computer Science.
August 7th, 2009 at 12:17 pm
I thought this was the concept of Hadoop data locality — the computations run on the machine where the data is stored. I thought you get this for free if you run Hadoop on the same servers running HBase, is that not so?
On a side note, you say the operations are “communicative” and associative where I think you mean “commutative”.
In all this is an excellent post and exactly what our organization is looking for. Can we have it yesterday?
August 7th, 2009 at 1:25 pm
Yes, we certainly do get data locality — it’s very important, which is why we’re doing our on-the-fly aggregation on the regionservers first, then sending it to the master node.
If you’d like to tell me more about what your company needs so I can envision the product better, feel free to drop me an e-mail! We can chat on AIM if you’d like.
August 12th, 2009 at 8:33 am
Lossy database operations is an interesting concept. One of the invisible but limiting rules of the existing paradigm is that all operations must be 100% complete and perfect. But if I’m searching Google for web pages and there are millions of matches, does it really matter that it’s “all the millions”? Maybe not.
August 14th, 2009 at 8:23 am
>> does it really matter that it’s “all the millions”? Maybe not.
Google uses a lossy operation, as anyone who has ever tried to use it to do any kind of deep specific search will tell you. Try going down to page 15.