Social Media Kills the Database
(or: How the Greatest Tool of the 1980’s Crippled a Generation, and how Hadoop and HBase help)
postnote: This isn’t about a complete death of the RDBMS. Just the death of the idea that it’s a tool meant for all your structured data storage needs.
(Like this? Check out our data platform startup, Drawn to Scale — Fill out the contact form and we’ll drop you a line!
Imagine someone told you that there was a piece of software out there which could change the way you do everything – and I mean everything. This code is so perpetually practical; it’s a Swiss Army Knife! Surely, it can handle the simple data storage and analytics you want to do on a ‘Social Media’ scale. For example, you can:
- Store pieces of content for your website so it’s easy to manage
- Store content from EVERY website on the Internet
- Track your company’s financial transactions
- Track every credit card transaction for your multibillion-dollar banking conglomerate
- Store an address so it’s easy to find by a unique key
- Store every photo on Facebook so it’s accessible by a unique key
- Generate charts and graphs on how your business did in the last month
- Generate charts and graphs on key analytics for every piece of Social Media on the Internet.
- Analyze records of how people have accessed your website and attempt to simulate their behavior
- Analyze every click ever made on MySpace and optimize intrasite workflow.
…Well, in theory you can do the above.
Does it make sense that people can install the exact same software package and be able to do this, at all scales? Just throw bigger computers at the problem? It’s counter-intuitive to me, and it sure doesn’t work in reality. With the coming need to analyze things not on an enterprise-scale, but a Web Scale, social media is driving the final stake into the large analytical RDBMS (Relational Database Management System).
The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software.
IN THE BEGINNING…
Traditional relational databases (like those used in finance) evolved in the 70’s and 80’s as a glorified spreadsheet-and-file-cabinet. The metaphor prevails today. RDBMSs have made the modern business possible: being able to slice, dice, and analyze data in a neat, intuitive format is immensely powerful.
The problem is that the price of processing data in an RDBMS scales exponentially with the amount of data – getting twice the processing speed on twice the data rarely costs only twice as much. No matter what you do, something has to give. Eventually, you can try to shard out databases (like Oracle, Greenplum and Vertica), but their solutions still require expensive hardware, custom hash algorithms, compression, brittle analysis structures, and the licenses are enormously pricey. With those sorts of limitations, it’s difficult to tackle “Web-Scale” problems.
In addition, developers have become crippled by being able to only think of data in terms of Rows and Columns. There’s a multitude of database paradigms: Graphs, Trees, Objects, and so on. Furthermore, databases limit developers to SQL, which is great for certain kinds of set mathematics and not much else. In order to overcome fundamental limitations of DBs, things like Conditionals, Iteration, String Manipulation, and more have been hacked into what was at first an elegant set mathematics language.
Algorithms (such as MapReduce) which make data analysis scalable are highly unintuitive in SQL. The limitations in the environment lead to a limitation in engineers’ ability to solve “big data” problems.
People treat Databases as The Only Way To Store And Process Lots Of Data, and this leads them down the Road To Failure…
THE CRIPPLING
RDBMSs are really just a storage abstraction on top of some B-trees, bitmap indices, disk sectors, etc. With SQL on top, they enable some pretty powerful (if limited) analytical expressions.
However, there are a few fundamental problems:
- The row/column approach limits you to one static storage paradigm
- Updating indices does not scale linearly… the more data you have, the longer it takes to put it in. You can’t delete data if your DB gets too large in a production environment, so performance continually decays until The Datapocalypse occurs (or you throw more money at the problem).
- There’s an immense amount of overhead with transactions.
- The internals are totally opaque for most DB solutions. Optimization, required for any performant solutions, becomes a “black art”.
- Backups are quite difficult to do – real-time mirrored, or periodic deltas.
- Normalization is the only way to keep data sizes from exploding, but joins are incredibly expensive. It’s also a bitch to shard out indices.
- The initial low latency for complex analysis lulls the average developer into a false sense of security – “Oh, when we need a larger DB, we’ll just get a larger machine”.
- Sooner or later, you hit physical limits on how many electrons you can pull from your SAS drives, dozens of processor cores, and RAM.
The Social Media space is littered with the corpses of those who have attempted to scale their RDBMSs (Twitter has teetered on the precipice several times). But even the Big Boys have dealt with the horror of DB scalability:
- Google (and Yahoo) once ran off of 40,000 MySQL Boxes (later replaced with minty MapReduce/GFS/HDFS goodness)
- MySpace has about xx,000 Microsoft SQL Servers (with license fees) to serve their content
- Facebook was spending $1,000,000/month+ for specialized database hardware just to serve their photos
- A company uses FPGAs to make disk access faster for databases, since they’ve reached their scalability limit on hardware.
OUR PROBLEM
My team at Visible Technologies focuses on ways to perform quasi-OLAP on discrete document fields in near real-time. This is part of a Social Media metrics and analysis framework. Our corpus may be hundreds of millions of documents, and we may have 100s of fields to care about.
The current system is built on top of a multitude of SQL Server instances on $x0,000 boxes, each of which contain so much data that data aging is quite difficult. Not to mention the ever-increasing latency on queries which drive user experiences.
HOW WE’RE SOLVING IT
Like most people who try to scale RDBMSs, we don’t actually need most of the features a modern database provides. Most of them prevent proper “shared-nothing scale-out”, and slow things down.
Many of our analytics and back-end ingestion processes are being supplanted by elegant MapReduce algorithms on top of Hadoop clusters. We still have a need for low-latency calculations over massive metadata sets (TBs) which drive BI Dashboards, however.
(postnote a few months later): It turns out that HBase + Drawn to Scale’s distributed indexing and search technology works wonders — you get all of the benefits of a truly indexed DB, with very few drawbacks. Before we built it, though, we looked at sacrifices we were willing to make compared to a traditional RDBMS:
WHAT WE’RE SCRAPPING:
- Transactions. Our data is written in from a Hadoop cluster in large batches. If something fails, we’ll just grab the HDFS block and try again.
- Joins. Nothing is more evil than normalization when you need to shard data across multiple servers. If we need to search on 15 primary fields, we’re fine with copying our data set 15 times, or better yet — use our new distributed indexing.
- Backup and Complex Replication. All of our data is imported from HDFS. If high-availability is a must, we can simply use Zookeeper to keep track of what nodes die, and then bring up a new one and feed it the data needed in ~ 60 seconds. With scales of hundreds of millions of documents, no one will miss a few hundred thousand for that brief period of time.
- Consistency. If our users are analyzing millions of documents, they’re not going to care if there’s 15,000 unique Authors, or 15,001.
REQUIREMENTS:
- Fast GETs at the expense of PUTs.
- Support for SUM, COUNT, GROUP, and FILTER. That covers 95% of our use cases.
- Sharding that works and handles rebalancing and additions of new nodes elegantly
- We don’t want to allocate space for data we’re not storing. In addition, we need to rapidly be able to scan through ranges of single data values – like “AuthorName”. Perhaps a columnar or graph-oriented paradigm would best fit our needs.
- Any ‘reasonable’ query needs to return its results in 2 seconds or less.
- Plays nice with our Hadoop / HBase cluster.
THE END
The RDBMS as we know it is simply not a scalable solution – its jack-of-all-trades nature leads to a fundamental weakness with scalability. Over the next few years, I think a number of solutions to our problem will emerge. A large portion of data analytics can be performed with SUM, COUNT, GROUP, and FILTER on denormalized data.
I’ll keep you updated on our progress over the next few months.












Hi, I'm Bradford. I write about scalability and the fringes of Computer Science.
June 24th, 2009 at 5:45 pm
The future is cheap distributed everything… and it’s going to be big trouble for companies that rely on per-instance software licensing for the obvious reasons.
June 26th, 2009 at 11:02 am
[...] This post was Twitted by carljohnstone [...]
June 27th, 2009 at 6:17 pm
[...] a particularly strongly-argued blog entry called “Social Media Kills the Database” that declares social media is driving the final stake into the large analytical RDBMS (Relational [...]
June 28th, 2009 at 11:57 pm
[...] blog entry, “Social Media Kills the Database”, really struck a chord with me. Although I am an unabashed RDBMS fanboy, I may be an unusual one, [...]
June 30th, 2009 at 5:24 pm
[...] reminded of a quote from a blog entry (“Social Media Kills the Database”) that keeps bearing creative fruit for me: The internals are totally opaque for most [RDBMS] [...]
July 2nd, 2009 at 1:59 pm
You mention missing or inconsistent data as a possible side effect of your new database system. Your particular problem and data sets may be amenable to that sort of thing but I suspect it’s not acceptable to a lot of businesses for a variety of reasons.
Has the RDBMS been abused? Absolutely. Is it the solution to all problems? Of course not. For the sort of problem you and companies like Google seem to be tackling, MapReduce, Hadoop and the like are ideal solutions.
For other companies it would be completely unacceptable.
July 2nd, 2009 at 4:19 pm
Yes, exactly — we’re in an era where the RDBMS isn’t the ‘Swiss Army Knife’ anymore, and we can tailor our DB architecture the problem set we need to tackle.
July 11th, 2009 at 6:07 pm
Bradford
That’very insightful analysis – I’m going to add it to the links of references on my recent post No to SQL? Anti-database movement gains steam – My Take
Based on your requirements i was wondering if you also considered any of the In-Memory approaches? In my post i was specifically referring to a social network application that needed to find a network of 2500 friends of friends out of a database more x100,000 users in every query and was able to do that in just few msec.
Anyway thanks for your comment on the highscalability.com site that led me to your article.
July 11th, 2009 at 6:44 pm
Actually, we’ve thought of the idea, discarded it, and now resurrected it again. We think that in-memory Lucene indices may very well fit for what we need on the fields we need to aggregate in real-time. Good question!
July 16th, 2009 at 8:40 pm
Lots of intriguing ideas, and it sounds like promising work. I think most people will see the pattern emerge before long.
Today, non-sql systems are all the rage. Before that, column-oriented. Before that, relational. Before that, network and hierarchical, flat files, COBOL, etc.
All of these start as the “new” technology that’s going to solve the world’s problems. A good case emerges that provides the impetus to create the technology, and it’s a tremendous success in that case. Other people learn about it, and try to apply it to everything from putting rockets on the moon to making an omelette.
As the technology matures, it’s limitations become more bothersome, it’s applicability to the next problem less perfect. As the new technology of the week pops over the horizon the current one becomes relegated to solving the problems it was originally designed for, which it was perfect for from day one.
The moral of the story? It’s better to be a consultant implementing non-sql solutions than it is to be a consultant doing IMS. You’ll have a longer shelf life on new technology before it becomes ancient, and there are still IT managers that will believe it will cook the perfect omelette.
July 16th, 2009 at 9:07 pm
I really appreciate the time you took to make those points — all good.
In the end, what you described happened to SQLish RDBMSs, and it’ll happen again with these too. Everyone wants a ‘Swiss Army Knife’.
I think the difference here is that without the limitations of RDBMS, people are learning new ways to think about Big Data — and even if they don’t roll out Hadoop, noSQL solutions, they’ll at least have a chance of coming to a proper solution.
August 1st, 2009 at 7:12 am
“Transactional, RDBMS doesn’t scale” ? Huh? Yes it does, linearly, all the way. Just stop using bad RDBMS products, and stop employing bad developers and administrators.
August 1st, 2009 at 9:03 am
This is true within the walls of OSS. However Oracle scales up, out and whichever way you want to flip it.
Of course this comes at a cost of 6 digit costs per CPU. That and even though it can scale the way one may need it to, it has its issues. but what doesn’t? Then we go back to the cost issue. Scaling Oracle starts to get into the 7-8 digit cost range.
However the title of your article still holds true since Oracle probably counts for less than 1% of the total amount of installed RDBMS systems.
There is DB2 and SQL Server but DB2 has close to the same cost issue and is still not as good Oracle and SQL Server can scale well but still not as close to the ability of Oracle and the cost may be much less but it is still not as good.
That and these 2 involve vendor lock in. Scalability in an RDBMS world has unrealistic expenses and yet they still aren’t the best choice even if they do scale.
Just figured I’d add the bit of perspective to what has already been said.
You should do articles on how people would use/transfer the data storage needs to hash based systems and the likes with a link to this article for reference. That would definitely attract of interest.
August 1st, 2009 at 1:49 pm
Gregory — thanks for the feedback! I always appreciate good ideas. You should shoot me an e-mail — when you say hash-based systems, do you mean DHTs?
August 1st, 2009 at 6:11 pm
I just meant name/value pair/dictionary based data storage in general rather then relational when I said hash. Probably not the best wording to use.
I also specialize in high-performance computing and before we had things like tokyo tyrant or memcached I would initially pull as much data from the DB and store it in collections in memory to gain the performance I could not get with direct SQL calls.
Memcached, to me, is the best software architecture/development tool I could ever have in my arsenal. Persistent storage with the same method is a nice idea but it becomes irrelevant when you store your data in memory while still being able to use things like ORM when it comes time to populate memory. Although the write performance still suffers from the relational overhead and since you can only have one active master in either mysql or pgsql. So a single point of failure still exists and limits load balancing to reads only. Which, as stated, does nothing for me since I make apps read from memory.
I want see if I can get mysql proxy working to achieve active-active. After reading your article earlier I decided to give MemcacheDB and Tyrant some research but a sside to side comparison of how you would do a SQL and a no-db process would really help people in understanding no-db storage, including myself. I understand it but I just finished designing the architecture of an application which will be using mysql and figuring out how to redo it with no-db seems it will take me allot of time to design a new no-db architecture. However I am designing/development for an app that needs to be redeveloped ASAP due to the poor state of the current version so my time is short. So as you can see anything that helps one understand it better would be a big help in getting people off the ground running with no-db (not the app named this, the methodology used in the many similar ones).
I think taking the time to write a book in this would be a great resource and profitable for the writer.
If you feel like chatting more just shoot me an email to gregory@tensailabs.com. Thanks for the great article, I know it now has me giving the approach some serious thought.
Regards!
August 1st, 2009 at 11:02 pm
Yeah, I’ve looked at Tokyo Cabinet, Voldemort, Redis, Neo4j, and about a dozen of these databases. Was actually thinking of pitching a book based on how to use them or just think in the “no-sql” kind of way. Thanks for adding fuel to the fire :)
August 2nd, 2009 at 2:14 am
Database requirements of myspace are pretty tiny compared to the search giants.
According to the article below MySpace run off 500ish database servers. Much smaller scale than tens of thousands of the big search companies.
http://highscalability.com/myspace-architecture
August 2nd, 2009 at 8:27 am
That’s very interesting — one of my friends who went to TechEd was the source of the MSSQL Boxes fact. Now I’ll have to go double-check it’s legit :)
August 6th, 2009 at 7:50 pm
Interesting article! This is a thought-provoking discussion.
It sounds like the problem you’re trying to solve is more OLAP than OLTP, so some of your objections to DB transaction cost (for example) could be moot depending on the implementation. And your storage requirements (you mentioned hundreds of millions of docs), while large, don’t seem beyond the scope of what modern OLAP (or OLTP for that matter) systems can handle. You’re certainly right of course that there will be a substantial cost.
The Hadoop etc. solution doesn’t sound exactly simple or cheap either, though. Surely it’s cheaper than all your MS SQL Server boxes, but you still need a bunch of nodes, plus a fair amount of custom development to ingest data, perform the computations, and manage everything. How has the cost equation turned out? For example, how many nodes do you need in the new cluster to store everything? (And how many SQL Servers did you need before?) What’s the relative scale of the development effort — like, team of 3 for 3 months or team of 6 for 6 months etc.?
Anyway, thanks for sharing your experiences. It’s interesting to see the different ways people are trying to solve these problems.
August 6th, 2009 at 8:33 pm
Thanks for your comment! You’re right in that some of the OLAP/OLTP points are moot.
The cost and performance benefits are really nice so far. For an example, we have a $140,000 SQL Server + License, that is so full of data that it takes 24 hours to load and process a few GB of documents.
Yesterday, we loaded all 1TB of our data set into HBase, indexed it with Lucene, and deployed it to our search cluster in about 18 hours — and that’s with some random failures that set us back because we’re kinda new to Linux and things weren’t very efficient. This was done on a cluster of 19 nodes that cost $1700 each.
To do the same work on our SQL Server box took us over a week :)
Dev time wasn’t too bad — I knew Hadoop pretty well, but the team had to ramp up on learning Linux, HBase, Zookeeper, Lucene and the rest of the stack. It took our team of 5 people about 3 weeks for an end-to-end system prototype, which was pretty fragile. We’re going to spend several months to roll out a full “enterprise” solution, with proper monitoring, workflow, perf, etc. Which is really no different than any other large-scale application.
If you have more questions, feel free to e-mail me!
August 10th, 2009 at 7:38 am
Thanks very much for your reply! One thing that wasn’t clear to me is how exactly Lucene fits into this. Do you have Lucene using HFS as its directory storage and running directly on those 19 nodes, or are you extracting from HFS to a set of conventional (local file based) index shards?
August 10th, 2009 at 3:42 pm
We make the indexes with a MR job, and then Katta (distributed Lucene) pulls them out of HDFS onto a separate cluster optimized for serving searches.
August 21st, 2009 at 8:04 am
You are right on, key-value, Hadoop, Cassandra, etc. are a must. But they don’t replace RDBMS. They replace what RDBMS don’t do well. When Dr. Codd wrote the relational concepts in 1969, hierarchal database were the norm, and accounting systems were the primary systems using a database. ACID and structure were critical. The RDBMS is great for maintaining an ACID driven, absolute relationship. Key-value is great for finding text based relationships. Like any tool, use both wisely for what they do well, not as a Swiss army knife.
September 9th, 2009 at 11:51 am
[...] Social Media Kills the Database [...]
October 5th, 2009 at 7:29 am
[...] Social Media Kills the RDBMS by Bradford Stephens Slides from a NoSQL meetup [...]
March 4th, 2010 at 9:16 pm
[...] Social Media Kills the RDBMS by Bradford Stephens [...]
March 28th, 2010 at 9:36 pm
[...] Social Media kills the database: A new web start-up guy laments the relational databases and how companies of his generation are looking to open source software such as Hbase and Hadoop to break the tyranny of the relational database. It is a remarkably coherent and easy to understand essay and worth reading. [...]
March 28th, 2010 at 10:51 pm
[...] Social Media kills the database: A new web start-up guy laments the relational databases and how companies of his generation are looking to open source software such as Hbase and Hadoop to break the tyranny of the relational database. It is a remarkably coherent and easy to understand essay and worth reading. [...]
March 29th, 2010 at 2:49 am
Social comments and analytics for this post…
This post was mentioned on Twitter by carljohnstone: http://bit.ly/Jsdjr...
March 29th, 2010 at 11:02 pm
[...] Social Media Kills the Database: A new web startup guy laments the relational database and how companies of his generation are looking to open source software such as Hbase and Hadoop to break the tyranny of the relational database. It is a remarkably coherent and easy-to-understand essay and worth reading. [...]
April 8th, 2010 at 4:35 am
[...] NoSQL movement seems to be pretty active, advocating a move away from traditional relational databases: …developers have become crippled by being able to only think of data in terms of Rows and [...]