Decline of the Enterprise Data Warehouse

(…due to Hadoop, HBase, and Hive)

If you wanted to generate intelligence or other OLAPish things from large amounts of data (TBs) three years ago, you faced a terrifying prospect: you bought a server (or several large servers) costing $x00,000, paid for software from Oracle, Sybase, or IBM, and then prayed to Data Jesus that it would meet your needs for a few years. You’d find all the data you could in all the disparate formats your company used, try to cram it into tables, have data analysts schedule SQL queries, and output your reports. This was by far the most common use case for “big data analysis”.

Then Social Media happened. Small businesses needed to process terabytes, web-scale companies needed to process petabytes, and neither could afford the millions of dollars it took to do it. With the advent of the Hadoop and HBase ecosystem, however, anyone will soon be able to scale their Data Warehousing in predictable, affordable ways.

The established and new vendors in this space may become irrelevant to a decent portion of businesses. They recognize there’s a problem, but their solutions so far will do nothing to address the fundamental issues. They won’t disappear, but the high-cost Enterprise Data Warehouse is no longer the only solution out there, and its significance will continue to dwindle.

Hadoop represents the first radical shift in a long time in an industry that was built around “pay big or go home”.

The Big Data Cheap Problem

With the rise of Social Media and the decreasing cost of storage, very small companies have a need for processing massive quantities of data. Furthermore, it’s easier than ever to write software that generates this data, thanks to languages like Ruby, frameworks like Spring, and scalability best practices. A few days’ work and a handful of engineers can net you a bare-bones Twitter clone, or a crawler to get the link graph of the entire Internet. This data growth can be far from linear. For example, the more users you have tweeting to each other, the more connections they form, and so on.

You simply can’t analyze this much data in an RDBMS, but these small startups can’t spend millions on DWs, either. Since analysis is core to the business, hacked-together internal tools cobbled from SQL boxes often emerge. What results is a temporary solution, not a platform. With Hadoop, HBase, and Hive, there’s now a free, scalable Data Warehousing platform that makes it possible to migrate a considerable portion of DW analytics. On top of that, it’s getting easier every day for Data Analysts to just hop in and start working in a supported environment, thanks to companies like Cloudera.

Vendor Approaches

Upstart enterprise DW vendors like Greenplum, Vertica, Aster, and Netezza are really struggling here. Each has interesting approaches to indexing, scaling, and dealing with bottlenecks, but all of them require pricey hardware, from tens to hundreds of thousands of dollars. Not only are they fighting against a low-cost solution, but that solution is also more scalable and easier to tweak. A lot of established enterprises won’t care about this; they can pay money to purchase something that “just works”. Upstart Social Media companies can’t afford that, in cost or risk, and they often have problems that don’t fit classic SQL (like PageRank’s graph traversals). Several of these vendors claim to support “MapReduce”, but that’s a topic for a separate article.

Something I’ve frequently seen in vendor presentations is a claim of revolutionary speed increases for specific problems vs. Oracle: “This analysis took 12 hours to run on Oracle, but only 10 minutes on our product!” I have no doubt that it’s true, but how much tweaking did it take to achieve that? Does performance on other queries degrade? For many types of large data warehouse analytics, the difference between 20 minutes from a Vendor solution and 1-2 hours from Hadoop may not be enough to justify the extra expenditure.

The frequent comparisons to Oracle are one of the keys to understanding the state of the large DW vendors. Just making price and performance improvements of a few dozen percent over Oracle would have been a great selling point a few years ago. Now, they’re competing against something that is not just orders of magnitude cheaper but also more scalable. That’s tough.

The Killer Apps on the Cloud

Putting Hadoop on a Cloud is where it really shines: one can spawn clusters to run analytics on demand, instead of paying for large numbers of servers 24/7. There are quite a few success stories here. The dating service eHarmony uses Hadoop + Amazon Elastic MapReduce to create millions of “matches” between customer profiles. The cost for this turns out to be < $50 a day. A comparable vendor solution combined with hosting in a colo would be many times that amount.

Until HBase came along, you basically had no random access in Hadoop clusters. This was a harsh limit on ad-hoc and selective queries. HBase rectifies that, however. Instead of following a paradigm of “Index everything! Buy faster servers!”, you can denormalize, replicate, and sort your data on a primary key. If you have 7 columns in your data set, you may have to copy your data set 7 times if you want to slice and dice on every field — but disk space is cheap and scales extremely well.
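To make the copy-per-field idea concrete, here is a minimal Python sketch. It does not talk to HBase at all: plain in-memory sorted lists stand in for tables sorted by row key, and the function names (`build_sorted_copies`, `range_scan`) and the sample records are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

def build_sorted_copies(records, fields):
    """Build one full copy of the data set per field, each sorted by that field.

    Each copy is stored as (keys, rows): a sorted list of key values and the
    records in the same order, standing in for a table keyed on that field.
    """
    copies = {}
    for field in fields:
        ordered = sorted(records, key=lambda rec: rec[field])
        copies[field] = ([rec[field] for rec in ordered], ordered)
    return copies

def range_scan(copies, field, low, high):
    """Return records with low <= record[field] <= high using binary search."""
    keys, rows = copies[field]
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return rows[start:end]

# Hypothetical sample data: three fields, so three sorted copies.
records = [
    {"author": "alice", "day": 3, "links": 12},
    {"author": "bob", "day": 1, "links": 7},
    {"author": "carol", "day": 2, "links": 30},
]
copies = build_sorted_copies(records, ["author", "day", "links"])
recent = range_scan(copies, "day", 2, 3)
```

The trade is the one described above: storage grows with the number of fields you want to slice on, but every selective query becomes a short scan over data that is already sorted on the field in question, with no secondary index to maintain.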

Pairing well with HBase, Hive brings the flexibility of SQL to the scalability of MapReduce. Now, analysts can run queries just as they would on any other DW. This abstraction on top of MapReduce is incredibly powerful: the millions who know SQL can hop right into using a free, scalable Data Warehouse for the first time in history.

The Future: Real-time Analytics in HBase?

There’s an interesting question I’ve been pondering lately: just what *can* you do in real-time on HBase? There’s no Hadoop MapReduce overhead, so you can achieve low latencies, but how much data can you crunch through in a handful of seconds? This could be useful for driving a web UI with some charts and dashboards on an arbitrary dataset. A lot of these types of queries are built from the fundamentals of GROUP, SORT, FILTER, COUNT, and SUM. I think a partial MapReduce algorithm could run on each node, pulling fields and grouping results (“This author appeared 12 times in this data set on this node”), and then combining those results on the master node that made the request. Is this process infinitely scalable with any query? Of course not; you’re limited by how quickly you can get data to the master node, and how complex the operations are. It’s still applicable to a large variety of use cases, however.
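The per-node grouping and master-side combine described above can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions: each "node" is just a local list of rows, nothing here uses HBase or Hadoop, and the names `partial_count` and `combine` are made up for the example.

```python
from collections import Counter

def partial_count(rows, group_field, predicate=lambda row: True):
    """Runs on one node: FILTER local rows, then GROUP and COUNT them."""
    counts = Counter()
    for row in rows:
        if predicate(row):
            counts[row[group_field]] += 1
    return counts

def combine(partials, top_n=None):
    """Runs on the requesting node: merge per-node counts, then SORT by count."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts for matching keys
    return total.most_common(top_n)

# Hypothetical rows held by two nodes.
node_a = [
    {"author": "ann", "lang": "en"},
    {"author": "bob", "lang": "en"},
    {"author": "ann", "lang": "en"},
    {"author": "cy", "lang": "fr"},
]
node_b = [
    {"author": "ann", "lang": "en"},
    {"author": "cy", "lang": "en"},
    {"author": "bob", "lang": "en"},
]

def is_english(row):
    return row["lang"] == "en"

partials = [partial_count(rows, "author", is_english) for rows in (node_a, node_b)]
top_authors = combine(partials, top_n=2)
```

Only the small per-node summaries travel to the requesting node, which is why the amount of data shipped there, rather than the raw data size, becomes the limit. SUM can be handled the same way by adding a value instead of 1, since partial sums can be merged in any order.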

And Finally…

Data Warehousing is more than just an analytical tool for large businesses; it’s an integral component of many startups, especially in the social media and BI fields. This need is driving one of the largest revolutions in the DW industry. Anyone who knows Java or SQL can derive complex intelligence from massive datasets for a fraction of what it cost a few years ago. Enterprise DW vendors, used to competing with Oracle, now have to compete with something that is free, powerful, and highly scalable.

12 Responses to “Decline of the Enterprise Data Warehouse”

  • DanTheMan Says:

    I wish I had a nickel for every so called pundit who predicted the end of the data warehouse. They are all gone and the DW continues to grow in double digits.

    Hadoop is an embryonic technology with a limited track record and a lot of growing up to do. It has promise. No, you didn’t discover the next big thing that will destroy all that came before. But I notice everyone who claims it is free seems to skip over the extensive labor it takes to write queries and keep them running, the cost of extracting and loading data into Hadoop, and the fact that Hadoop and mapreduce can’t do 10% of what any average data warehouse or mart does today. You need to check your facts.

    Stonebraker published some good research showing Hadoop has serious shortcomings, performance being one of them. So don’t predict the end of one of the biggest trends in data processing for the last 30 years until Hadoop has 10 years of investment behind it. A handful of dot-coms in social media does not erase 100s of millions of dollars of investment by 20+ major corporations for 30 years. Get your facts right.

    Hadoop has great promise, eventually. While some productivity is possible today, Hadoop should grow some until it’s clear what it’s really good at. It does some cool things now, and I suspect it will be better in 5 years.
    Get your facts straight. Why not spend your blog accurately championing what Hadoop does right rather than pissing on a solid, upstanding technology like data warehousing? Why not explain how the data warehouse and Hadoop are complementary technologies that should coexist for the benefit of the user?

    BTW: Hadoop + Hive + HBase are NOT a data warehouse. Hive is a fraction of SQL and has a long way to go. HBase is more hierarchical than it is relational, and it’s a long way from a database. And Hadoop is a batch process environment. The definitions of a DW have been hammered out for 20+ years and are very clear. See Gartner and Inmon for a clean definition. Neither Hadoop nor HBase fit the definition in any way. Get your facts straight.

    I wish I had a nickel for every guy who predicted the end of data warehousing. They’re all gone and data warehousing continues to grow in double digits 30 years in a row. Hadoop is good stuff but it won’t grow by bad mouthing a solid incumbent technology. It will grow by producing differentiated value.

    Get your facts right.

  • Jeff Says:

    Hey Dan,

    While Hadoop can certainly complement existing data warehouses, Facebook has been using Hive in place of a data warehouse for over a year now. See http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/ for more. In fact, the scale of their data warehouse and the number and diversity of users submitting queries makes it quite clear that Hadoop is a mature (potentially “upstanding”?) technology that’s seeing serious use in enterprises and labs around the world.

    Of course, unlike other data warehousing products, you don’t have to just read about the technology. Go check it out for yourself and help us produce “differentiated value”: http://hadoop.apache.org.

  • Mike Says:

    Dan,

    Do you work for Oracle? Do you have a vested interest in the status-quo for Data Warehouses to continue? A traditional data warehouse consultant?

    The fact is that despite its substantial flaws (as you have pointed out) Hadoop SOMEHOW manages to get chosen for data warehousing by companies that seem to have SOME idea of what is going on.

    And as far as “getting my facts straight” goes I’ll simply say “Facebook, Last.fm, IBM, Sun, Amazon, Cloudera, Yahoo, Google, Hulu, ImageShack, Joost, Baidu, Rackspace” and others, all listed at http://wiki.apache.org/hadoop/PoweredBy

  • Soruja Says:

    I liek gigabytes. <3

  • Flow » Blog Archive » Daily Digest for July 12th - The zeitgeist daily Says:

    [...] Decline of the Enterprise Data Warehouse — 10:24am via Google [...]

  • ken farmer Says:

    > For many types of large data warehouse analytics, the difference between 20 minutes from a Vendor solution and 1-2 hours from Hadoop may not be enough to justify the extra expenditure.

    Actually, I’m doing real-time data warehousing now – and have peaks of 60,000+ queries running per day. And I expect over 99% of my queries to run in under a tenth of a second. And then there are those massive queries that take ten minutes.

    But 2 hours? My users aren’t happy if their queries are taking that long – especially since it often takes a half-dozen iterations to get them perfect.

  • Bradford Says:

    You know, that’s a really interesting use case. For a lot of the scenarios I’m talking about, there are few “ad-hoc” queries. I can see if you’re running that sort of scale data (and classic analysis), you’d like something like Oracle, Greenplum, etc. We’ve found for our needs, that’s way overkill.

    How large is your data size, how many concurrent requests do you have, what software are you using, and how much did everything cost?

    You can e-mail me if you want, I’m truly interested :)

  • Ali Sohani Says:

    There is no argument that MapReduce/Hadoop has already proven itself to be a highly scalable & fault-tolerant mechanism over the cloud for data-intensive operations, to compute and to aggregate; at the end of the day it’s the same old distributed computing via grid jobs that’s showing its magic.

    The whole NoSQL movement and everything related to it isn’t about shedding everything existing and going a new way; it’s about making people aware, letting people out of local maxima and helping them see the world beyond, which is to realize “There is more than one way to do it” (Perl mantra), and there always is.

    Always sticking to traditional approaches or systems to do some next-generation or different sort of job surely isn’t the way to go; we have to change the solution space when the problem space changes.

    I think the future is about hybrid technology, when it won’t even be required to call it hybrid anyway… We already see combinations of technology working complementarily with each other: Bradford’s [Hadoop + Hbase + Hive] is at work, Facebook’s [Hadoop + Cassandra + Hive] is at work, and Linkedin’s [Hadoop + Voldemort + RDBMS (Oracle, MySQL)] is at work. (Reference: http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html)

    The line between traditional RDBMS & noSQL systems is already blurring, as we see HadoopDB (http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf) combining the power of [Hadoop + RDBMS (postgresql) + Hive] to do the job. MongoDB and Yahoo Sherpa are both working to provide a scalable data storage system with as many friendly querying capabilities as possible. (Reference: http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html)

    Very soon, I believe, big vendors like Oracle are also going to introduce such parallel DBMSs, with a hybrid combination of some of these noSQL approaches in the backend, as the other closed-source data warehousing vendors Greenplum (www.greenplum.com/technology/mapreduce) and Aster Data (http://www.asterdata.com/product/mapreduce.php) did.
    (Reference: http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing)

    The future is of parallel DBMSs functioning as one package (Reference: http://www.computerworld.com/s/article/print/9131526/Researchers_Databases_still_beat_Google_s_MapReduce), where we don’t need to worry about integrating the components to make it work. Still, when that’s here, one solution of course won’t cater to all problems; problems will evolve along with our solutions, and we should adapt and continue picking the hat that (most closely) fits the head.

  • DanTheMan Says:

    Hadoop is great. MapReduce is great. I’ve read everything I can find on the web, pro and con, versus data warehouses and other techniques. It’s only the uninformed, immature coder who thinks Hadoop and MR will replace traditional data warehouses. I notice most of the people saying the DW is bad are using the most inefficient SQL database: MySQL.
    OK, I yield on price. Free is hard to beat with open source.

    But what most MR/Hadoop aficionados overlook is that the only way to use Hadoop is through a programmer. A million years ago, the SQL guys were like that. But today, the business user has cool tools to access the data, slice, dice, OLAP, pivot, etc. They don’t need no stinking programmers. The business user does amazing things and makes incredible discoveries when they interact with the data directly. Hadoop has a long way to go to get there. Hive and Pig are a good start, but quite young and immature.

    The other thing Hadoop folks ignore is performance. They say “well, the database takes 20 minutes and Hadoop takes 2 hours but given that Hadoop is free, it’s worth it.” What they leave out is the 1-8 hours of programmer labor needed to get the Hadoop MR job built, tested, and finally running. We can have endless debate over every labor cost and job needed to run (ETL, load times, programming, operations, etc.). At the end of the day, the data warehouse stuff is strong, works well, and is often already running.

    I suggest a better way to think of this. Hadoop does not destroy or eliminate the data warehouse. And there are many things the DW does that Hadoop/Hive/Pig cannot do and might never do. In contrast, Hadoop/Pig/Hive have a place in this world and it’s not fully decided what that is. We don’t need to piss on one another to win. Pissing contests always end up with two stinkers. Hadoop/MR is a tool. Use it wisely and it will yield amazing things. The Data Warehouse is a tool. It too produces fantastic things. And don’t give me that whining about cost; I’ve seen many multi-million dollar warehouses sold when the alternative was free or $20K. If the ROI is 800%, the cost of the warehouse means nothing. We only care that we have happy execs and users (emphasize users, not programmers) AND that the right tool for the job was used.

    Last, Hadoop/MR has NOT proven itself in the market. A few dozen sites are doing great things. If you read Geoffrey Moore’s Crossing the Chasm, Hadoop/MR is at the beginning. It’s called the lunatic fringe, the earliest adopters. Look at the Gartner Hype Cycles and Hadoop/MR hasn’t even hit the charts while cloud computing is roaring along. These things follow a consistent pattern. Hadoop/MR is not mainstream stuff and is primarily limited to a few dozen dot-coms. I’ve been doing this a long time and Hadoop/MR has all the earmarks of becoming mainstream in the next 5-7 years. Today, it’s still risky stuff, but marvelous in our eyes. Hadoop and EDW will sit side by side in the same site for the next 10-15 years. So why the fight? Use the right tool for the job.

  • Roddy Says:

    @DanTheMan: actually, here at Facebook, Hive is wildly popular, much more so than our RDBMS deployment; almost 30% of employees have used it. More than half of them are non-engineers. Check out the note at http://www.facebook.com/data.

  • Martyn Richard Jones Says:

    Hadoop, HBase, and Hive are interesting technologies, that don’t signal the end of Data Warehousing and Business Intelligence as we know it. Sorry, great article, and I’ll keep an eye on the progress, but this isn’t even a remote threat to the major DW players, never mind the niche players on the DB/DW adapter side.

    Just wait until correlation database technologies get their feet under the DW/BI table.

    Cheers, Martyn

  • sohani Says:

    [...] Decline of the Enterprise Data Warehosue Due to Hadoop, HBase …Article detailing the decline of the data warehouse industry due to social media, Hadoop, HBase, and [...]
