May 17 2010

Podcast: NoSQL, Cloud, Big Data, and More!

Hello lovelies,

I just did a podcast with David Linthicum about Big Data in the Cloud, along with some NoSQL stuff. Check it out! It’s good fun.


Apr 8 2010

NoSQL: Why it’s So Damn Sticky

We wouldn’t care as much about NoSQL if it wasn’t called NoSQL. Even the word itself is triggering existential crises among some very smart, seasoned people.

But why?

…because the term NoSQL encapsulates just about every phase of database innovation in the last decade.

The inspiration for this post came from reading Chip and Dan Heath’s book Made to Stick. Their theories, plus much buzzing from the anti-NoSQL camp about how “this is just a phase like XML DB” and “you obviously aren’t smart enough to understand the RDBMS”, have given me some fresh thoughts on why NoSQL is such a controversial thing.


According to the Heath brothers, a sticky idea is one that’s: simple, unexpected, concrete, credible, emotional, and a story. Now… the connection between psychology, computers, and social science can get pretty vague for someone untrained in two of those fields (me), but I hope this article will at least spark some interesting discussion.

So with all that in mind, let’s examine why NoSQL is such an awesome idea — and why it scares us.

(if you want a data platform that’s scalable, real-time, AND has a query language, check out Drawn to Scale)


Feb 22 2010

LinkedIn Talk: How To Make Life Suck Less (Writing Scalable Software)

Last December, I gave a talk at LinkedIn about scalability “best practices” from engineering, business, and operations perspectives. It’s a solid collection of tips and anecdotes from my experiences, as well as some from accomplished large-scale architects such as Ryan Rawson, James Hamilton, and Bradford Cross.

Enjoy!

I can’t embed the video for some reason, so here’s the link.

A big thanks to Noah Campbell at LinkedIn for inviting me — it was a blast!


Feb 15 2010

Announcing the Drawn-to-Scale Platform!

Whether it’s creative content or business records, data is piling up on us faster every day. Amazingly though, current database products are still being driven by technology standards from the pre-Internet era. As a result, data-centric businesses, and other businesses that find themselves pursuing data-centric opportunities, are demanding something more.

(Want to learn more? Fill out this simple form and we’ll drop you a line!)

Imagine, for the sake of illustration, that you’re part of a new startup company that aggregates web content and does a lot of value-adding.

The company busily integrates gigabytes of data from various sources, analyzes it, then makes it available via search. There are three engineers: two working on an app, and one on data (administered with the ‘help’ of MySQL). A standalone machine processes incoming data, then feeds it to the database.

But there’s a problem…

Time goes on, you ingest more data, add new sources, etc. Soon, your database response times are getting longer for no apparent reason. Downtime is more frequent. Two more engineers come aboard to tune up the DB and rewrite some code.

Uh oh. Users are complaining. Management is sweating, and suddenly they’re paying attention to the data infrastructure.

$100,000 later, a state-of-the-art server arrives, and after some migration panic things are back to normal.

The business gets more popular, users demand more features… In a few months, the database is crawling again. Even more engineers are brought in and the whole dirty cycle starts again.

But that’s just the way it goes, right?

The Drawn to Scale Platform is designed to give companies room to grow. We discovered that Internet-scale software can be lean, fast, and even easy to use — all vital factors in building better systems that last. We believe in providing simple interfaces and controls, not SQL black magic (though we may support SQL in the future), and our platform offers seamless scalability to handle any peaks or spikes. We even provide a custom toolset that makes management and migration of existing data easy.

Traditional database servers are mature. They’re well understood, and well supported. But they also have performance ceilings most people won’t hit early on. This makes choosing them conventional wisdom, and a safe bet. But hit that ceiling and quite literally no amount of money will save you. Twitter learned this the hard way in 2008, and had to start over again. Not all companies survive such a rewrite.

My cofounder and I started Drawn to Scale so companies can move fast, developers can build incredible apps, and so no-one has to stay up late kicking database servers into shape.

(Want to learn more? Fill out this simple form and we’ll drop you a line!)

The result of all this is that you can now focus on building your business and applications, instead of spending precious resources maintaining and building infrastructure.

You don’t have to fight with your tools any more.

…let’s learn more!

Jan 25 2010

Logging: Unsexy, Important, and now Usable.

Using logs is now as easy as producing them.

Logging is perhaps the least sexy part of writing software, yet it’s surprisingly important. There’s lots of boring stuff that causes heated arguments (like whitespace, build systems, and version control). But no-one has impassioned, snarky blogs or tweets about what format your logs should be in. Much like cell phones or the Internet, we don’t realize how much we need them until it’s too late.

We’re drowning in information from our servers, especially when dealing with distributed/Big Data systems, which are growing incredibly fast. Even when building small server clusters, my co-founder Nick and I found it hard to learn what was going on. So we asked ourselves, “Well, what’s the biggest step forward in web access in the past 15 years? Search!” In a few brief moments, we set about sketching up some ideas for a generalized “log search” engine, and ended up naming it “LogSearch” (we’re literalists).
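To make the “log search” idea a bit more concrete, here is a minimal sketch of an inverted index over log lines. It is an illustration only, not how LogSearch itself works: the `LogIndex` class, the tokenizer, and the sample log lines are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase runs of letters, digits, and underscores.
TOKEN = re.compile(r"[a-z0-9_]+")


class LogIndex:
    """Toy inverted index: token -> ids of log lines (hypothetical, for illustration)."""

    def __init__(self):
        self.lines = []                    # raw log lines, indexed by position
        self.postings = defaultdict(set)   # token -> ids of lines containing it

    def add(self, line):
        """Store a log line and index each of its tokens."""
        line_id = len(self.lines)
        self.lines.append(line)
        for token in TOKEN.findall(line.lower()):
            self.postings[token].add(line_id)
        return line_id

    def search(self, query):
        """Return lines containing every query token (AND), in insertion order."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.lines[i] for i in sorted(ids)]


# Made-up sample data.
index = LogIndex()
index.add("2010-01-25 10:00:01 node1 ERROR disk full on /data")
index.add("2010-01-25 10:00:02 node2 INFO heartbeat ok")
index.add("2010-01-25 10:00:03 node2 ERROR connection timeout")

for hit in index.search("error node2"):
    print(hit)
```

A real system would also need to parse timestamps, shard the index across machines, and age out old data; the sketch only shows why searching logs can be as cheap as a set intersection once they are indexed.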


Oct 29 2009

HBase vs. Cassandra: NoSQL Battle!

(Note: Check out the Drawn to Scale platform. Store, query, search, process, and serve *all* your data. To *all* your users. In real time).

Distributed, scalable databases are desperately needed these days. From building massive data warehouses at a social media startup, to protein folding analysis at a biotech company, “Big Data” is becoming more important every day. While Hadoop has emerged as the de facto standard for handling big data problems, there are still quite a few distributed databases out there, and each has its own strengths.

Two databases have garnered the most attention: HBase and Cassandra. The split between these equally ambitious projects can be categorized into Features (things missing that could be added at any time), and Architecture (fundamental differences that can’t be coded away). HBase is a near-clone of Google’s BigTable, whereas Cassandra purports to be a “BigTable/Dynamo hybrid”.

In my opinion, while Cassandra’s “writes-never-fail” emphasis has its advantages, HBase is the more robust database for a majority of use-cases. Cassandra relies mostly on Key-Value pairs for storage, with a table-like structure added to make more robust data structures possible. And it’s a fact that far more people are using HBase than Cassandra at this moment, despite both being similarly recent.

Let’s explore the differences between the two in more detail…


Sep 9 2009

How To Make Life Suck Less (While Making Scalable Systems)

As I write this, I’m sitting at my home office with a five-year-old desktop and its failing monitor. Surrounding me are protein bar wrappers, fingerprint-covered water glasses, and guitar amp vacuum tubes. It’s 7 in the morning.

If you don’t know me you might think I’m an early riser. Alas… I’ve actually been up all night, unable to sleep, babysitting my Hadoop cluster and worrying about getting this new platform into production.

The past three months have been hellish: some amazing accomplishments, but plenty of painful mistakes. “You guys are working hard but it seems so effortless,” I heard from another team lead. If he only knew. Developing a distributed, concurrent platform for web-scale data is far from effortless. A great deal of the time we spent ‘working hard’ was not new coding at all, but relentlessly beating on problems. I don’t always take my own advice, but I’m getting better. And so, I’ll try to keep the following in mind when planning what to do next:

Developing a platform for web-scale data analysis is hard work. Unlike well-worn areas like MVC on LAMP, big data (billions of records, intensive analysis) presents brand new challenges that need innovative engineering, testing strategies, and planning.

In recent months I’ve led my team at Visible Technologies through creating a new analytics platform based on Hadoop. This article lists a few lessons from our ordeal, which hopefully should help if you venture into similar territory. Here are my observations:

1. Big data is BIG
2. This is systems software, not an application (for now)
3. Learn the source, engage the community, contribute feedback
4. Scalable doesn’t imply cheap or easy. Just cheaper and easier.
5. It’s so much easier with smart, experienced people.

Let’s examine these in detail…

Aug 7 2009

A New DB for 80% of Facebook, YouTube-scale Sites

Imagine you’re visiting several of your favorite massive-scale websites. Or go ahead and pull them up in your browser. Facebook, MySpace, LiveJournal, Bebo, Streamy, YouTube, anything that deals with huge amounts of data, or any sufficiently large Social Media site. While you’re at it, take a look at any data-driven website: a forum, your favorite blog, or a news site.

As you examine these sites, try to think about what kind of data they need to present to their users rapidly. There’s usually a selection of:

  • Profiles (like users)
  • Search results
  • Sets of items by various criteria (all videos by one author, posts in a thread)
  • Networks (“my friends”)
  • Real-time updates (“my status”)
  • Communications subsystems (messages to inboxes, e-mails, IMs)

Until recently, there’s been no other way to retrieve this data than with a classic RDBMS, which as we’ve seen earlier can be expensive and crippling to scale. Yet I’d estimate that a few use cases cover 80% of how data needs to be presented on these websites. The Swiss-Army RDBMS is costly overkill for what most sites *really* use it for. There needs to be a new type of database optimized for serving simple, web-scale data in real-time.
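As a rough illustration of why a handful of access patterns can cover so much, the sketch below models two of them on a sorted key-value store: fetching one profile by key, and fetching a set of items that share a key prefix (all videos by one author). Everything here is an assumption for the sake of the example: `SortedKV`, the key layout `video:<author>:<id>`, and the sample data are hypothetical, and this is neither the HBase API nor the engine proposed in this post.

```python
import bisect


class SortedKV:
    """Toy sorted key-value store (hypothetical stand-in for a BigTable-style table)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order
        self._data = {}   # key -> value

    def put(self, key, value):
        """Insert or overwrite a value, keeping the key list sorted."""
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        """Point lookup, e.g. a single profile. Returns None if absent."""
        return self._data.get(key)

    def scan_prefix(self, prefix):
        """Range scan: all (key, value) pairs whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            results.append((key, self._data[key]))
        return results


# Made-up sample data using the assumed key layout.
store = SortedKV()
store.put("user:alice", {"name": "Alice"})
store.put("video:alice:0001", "Cat video")
store.put("video:alice:0002", "Guitar solo")
store.put("video:bob:0001", "Hadoop talk")

print(store.get("user:alice"))            # profile lookup
print(store.scan_prefix("video:alice:"))  # all videos by one author
```

The design point is that both queries are answered by key order alone, with no joins or secondary indexes, which is the kind of simplicity that makes horizontal scaling tractable.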

Let’s examine these operations, along with some examples. Then, I’d like to propose a scalable engine I’m architecting that’s actually optimized to drive 80% of websites, built on HBase.

Jul 28 2009

Approaching Scalability, and Building a Biz on Hadoop, HBase, and Open Source Distributed Computing

The slides for my OSCON 2009 presentation, “Approaching Scalability: Building a Business on Hadoop, HBase, and Open Source Distributed Computing”, have been posted here:

http://www.slideshare.net/lusciouspear/building-a-business-on-hadoop-hbase-and-open-source-distributed-computing

PDF: Here

Video should be up in a day or so. I’m having internet problems at home :) If the video quality is too poor, I’ll upload the audio track.

The talk focuses on two things: how to approach scalability from a problem solving standpoint, and how Visible Technologies used Hadoop, HBase, and Lucene to build a web-scale Business Intelligence platform.

When thinking about scalability, ask yourself a series of questions:

  1. What makes me special? (What part of our problem are we trying to scale? How can we use it to our advantage?)
  2. What can I sacrifice? (Compared to a traditional RDBMS, what are you willing to get rid of? Some sacrifices make scalability much easier).
  3. How will my data be structured? (Not everything should be reduced to tables and rows. Do you have documents? Graphs? Videos?)

The slides and talk focus on how Visible answered those questions, and how they led us to our Hadoop stack architecture. In the future, I’ll be talking more about the process of thinking about scalability. Stay tuned!


Jul 10 2009

Decline of the Enterprise Data Warehouse

(…due to Hadoop, HBase, and Hive)

If you wanted to generate intelligence or other OLAPish things from large amounts of data (TBs) three years ago, you faced a terrifying prospect: You bought a server (or several large servers) costing $x00,000, paid for software from Oracle, Sybase, or IBM, and then prayed to Data Jesus that it would meet your needs for a few years. You’d find all the data you could in all the disparate formats your company used and try and cram it into tables, have data analysts schedule SQL queries, and output your reports. This was by far the most common use case for “big data analysis”.

Then, Social Media happened, and small businesses needed to process TB, web-scale companies needed to process PB, and neither of them could afford the $millions to do it. With the advent of the Hadoop and HBase ecosystem, however, soon anyone can scale their Data Warehousing in predictable, affordable ways.

The established and new vendors in this space may become irrelevant to a decent portion of businesses. They recognize there’s a problem, but their solutions so far will do nothing to address the fundamental issues. They won’t disappear, but the high-cost Enterprise Data Warehouse is no longer the only solution out there, and its significance will continue to dwindle.

Hadoop represents the first radical shift in a long time in an industry that was built around “pay big or go home”.
