Jul 28, 2009
The slides for my OSCON 2009 presentation, “Approaching Scalability: Building a Business on Hadoop, HBase, and Open Source Distributed Computing”, have been posted here:
http://www.slideshare.net/lusciouspear/building-a-business-on-hadoop-hbase-and-open-source-distributed-computing
PDF: Here
Video should be up in a day or so; I’m having internet problems at home :) If the video quality is too poor, I’ll upload the audio track.
The talk focuses on two things: how to approach scalability from a problem-solving standpoint, and how Visible Technologies used Hadoop, HBase, and Lucene to build a web-scale Business Intelligence platform.
When thinking about scalability, ask yourself a series of questions:
- What makes me special? (What part of our problem are we trying to scale? How can we use it to our advantage?)
- What can I sacrifice? (Compared to a traditional RDBMS, what are you willing to give up? Some sacrifices make scalability much easier.)
- How will my data be structured? (Not everything should be reduced to tables and rows. Do you have documents? Graphs? Videos?)
The slides and talk focus on how Visible answered those questions, and how they led us to our Hadoop stack architecture. In the future, I’ll be talking more about the process of thinking about scalability. Stay tuned!
Jul 10, 2009
(…due to Hadoop, HBase, and Hive)
If you wanted to generate intelligence or other OLAPish things from large amounts of data (TBs) three years ago, you faced a terrifying prospect: you bought a server (or several large servers) costing $x00,000, paid for software from Oracle, Sybase, or IBM, and then prayed to Data Jesus that it would meet your needs for a few years. You’d find all the data you could, in all the disparate formats your company used, and try to cram it into tables, have data analysts schedule SQL queries, and output your reports. This was by far the most common use case for “big data analysis”.
Then social media happened. Small businesses needed to process terabytes, web-scale companies needed to process petabytes, and neither could afford the millions of dollars it took to do it. With the advent of the Hadoop and HBase ecosystem, however, anyone will soon be able to scale their data warehousing in predictable, affordable ways.
Both the established and the new vendors in this space may become irrelevant to a decent portion of businesses. They recognize there’s a problem, but their solutions so far do nothing to address the fundamental issues. They won’t disappear, but the high-cost Enterprise Data Warehouse is no longer the only solution out there, and its significance will continue to dwindle.
Hadoop represents the first radical shift in a long time in an industry that was built around “pay big or go home”.