Logging: Unsexy, Important, and now Usable.
Using logs is now as easy as producing them.
Logging is perhaps the least sexy part of writing software, yet it’s surprisingly important. There’s lots of boring stuff that causes heated arguments (like whitespace, build systems, and version control). But no one writes impassioned, snarky blog posts or tweets about what format your logs should be in. As with cell phones or the Internet, we don’t realize how much we need logs until it’s too late.
We’re drowning in information from our servers, especially when dealing with distributed/Big Data systems, which are growing incredibly fast. Even when building small server clusters, my co-founder Nick and I found it hard to learn what was going on. So we chatted: “Well, what’s the biggest step forward in web access in the past 15 years? Search!” In a few brief moments, we set about sketching up some ideas for a generalized “log search” engine, and ended up naming it “LogSearch” (we’re literalists).
Logging is something engineers have always needed, because software (no matter how experienced you are) is notoriously unpredictable. When you’ve got more than 10 computers, it’s difficult to keep track of what’s going on day-to-day, and even worse when Something Goes Wrong. When the inevitable crisis comes, you don’t want to spend a few hours poking through log files on 30 different machines. You want to be a hero and find the problem in 5 minutes, then get on with your life. Or maybe you want to take a longer view of things — run some interesting statistics on your logs, and study usage patterns.
Let’s ponder the implications of that for a second. What if you could log every action taken in your application, and analyze it to see exactly how your application was used? Delicious.
All of us at Drawn to Scale have dealt with distributed systems for a while. We like to build stuff we would use. That’s why we built something that can:
- Be absurdly easy to use in the cloud or in a datacenter
- Easily ingest any amount of log data (hundreds of TBs)
- Find structured fields (like timestamps)
- Search these logs (we usually don’t know what we’re looking for when something goes wrong)
- Provide the actual text from the log itself
- Build trend charts (e.g., how many errors/day?)
- Send alerts when something looks “strange”
- Run Hadoop/MapReduce scripts to provide interesting analytics
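To make the trend-chart and alert items above concrete, here is a minimal Python sketch. It is not LogSearch code: the function names, the (day, level) record shape, and the "three times the average of earlier days" rule for "strange" are all assumptions made up for illustration.

```python
from collections import Counter
from statistics import mean


def errors_per_day(records):
    """Count ERROR records per day.

    `records` is an iterable of (day, level) pairs, e.g. ("2010-01-25", "ERROR").
    This record shape is an assumption for the sketch.
    """
    counts = Counter(day for day, level in records if level == "ERROR")
    return dict(sorted(counts.items()))


def strange_days(daily_counts, factor=3.0):
    """Flag days whose error count exceeds `factor` times the mean of all earlier days.

    The threshold rule is a hypothetical stand-in for whatever "strange" means to you.
    """
    days = sorted(daily_counts)
    flagged = []
    for i, day in enumerate(days):
        history = [daily_counts[d] for d in days[:i]]
        if history and daily_counts[day] > factor * mean(history):
            flagged.append(day)
    return flagged


records = (
    [("2010-01-23", "ERROR")] * 2
    + [("2010-01-24", "ERROR")] * 2
    + [("2010-01-24", "INFO")]
    + [("2010-01-25", "ERROR")] * 10
)
daily = errors_per_day(records)
alerts = strange_days(daily)
```

The daily counts are what a trend chart would plot, and the flagged days are where an alert would fire.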
If you had a log search engine, with various utilities built on top of it, you’d have an easy way to see what’s going on in the apps that drive your business. You could…
- Be aware of problems *before* they cause critical downtime
- Monitor trends so alerts can be sent for odd activities
- Examine long-term utilization patterns to optimize software
- Quickly diagnose when machines crash or become flaky
- Look smart and “in touch” with your systems!
- Save piles of money!
- Make piles of money!
Isn’t Monitoring Software Enough?
Engineers and ops people have a fantasy of coming into work, pulling up a dashboard on their web browser, and understanding everything that is going on. We want to be like Geordi in Star Trek, who can see everything on the Enterprise with a few finger presses. I wish there was such a thing! The closest we can get is ganglia, which monitors OS-level events (it can be hacked for apps). It’s pretty cool, but only so useful. Monitoring requires serious engineering effort – even if planned from the very beginning (it rarely is), there’s quite a bit of code that must be written and maintained. Monitoring services need to run on each box, and the central servers become a system of their own that needs to be reliable, redundant, and robust.
Logs, however, are a piece of cake. Here’s what usually happens.
- Put a line of code similar to Log.Error("Something horrible happened!"); in your application
- The message and a timestamp appear in a /logs directory on the disk
- Logs are cleaned up every few days by an automated process
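In Python, those three steps might look roughly like the sketch below, using the standard library's logging module. The logger name, the keep_days value, and the temporary directory are assumptions chosen for the example, not anything specific to our product.

```python
import logging
import logging.handlers
import os
import tempfile


def make_logger(log_dir, name="app", keep_days=3):
    """Return a logger that writes timestamped lines to <log_dir>/<name>.log."""
    os.makedirs(log_dir, exist_ok=True)
    # Rotate the file at midnight and keep at most `keep_days` old files;
    # older rotated files are deleted by the handler (the "automated cleanup").
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, name + ".log"),
        when="midnight",
        backupCount=keep_days,
    )
    # asctime supplies the timestamp that precedes every message.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# A throwaway directory stands in for /logs so the example runs anywhere.
log_dir = os.path.join(tempfile.mkdtemp(), "logs")
log = make_logger(log_dir)
log.error("Something horrible happened!")
```

One line at the call site, and the handler takes care of timestamps and cleanup.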
Since logs are so simple to implement, it’s easy for us lazy engineers to toss them into everything we do. The problem is that when they grow beyond a few hundred MB, it takes forever to search them for what you want, and it’s even harder if they are spread across multiple machines!
Why it’s Hard
So it’s easy to fill logs with lots of tasty info. But it’s hard to make sense of it all. It’s semi-structured text. This means in order to find what you want, you need to explore the data: you need to search it. The only tool most of us can use is grep, which is tedious and manual if you need to troubleshoot several dozen logs on each box. There’s also interesting information in your log that isn’t just raw text: dates, server addresses, error types, and more. So you have to build something to parse meaning out of it.
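As a small illustration of "parsing meaning out of it", here is a Python sketch that pulls a date, a server address, and an error type out of one line. The line layout (timestamp, host, level, message) is an assumption for the example; real logs vary wildly, which is exactly the problem.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS <host> <LEVEL> <free-text message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of structured fields, or None if the line doesn't match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    # Turn the timestamp text into a real datetime so it can be sorted and bucketed.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return fields


record = parse_line("2010-01-25 12:52:00 10.0.0.7 ERROR Something horrible happened!")
```

Lines that don't fit the pattern come back as None, so a caller can still keep them as plain searchable text.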
You end up having to design a distributed system just to parse and handle logs. There are commercial tools out there which claim to do this, but they fail at large scale.
Drawn to Scale has built LogSearch from the beginning to handle Big Data.
How We Do It
The Drawn to Scale BigSearch platform is built on a ton of distributed and NoSQL tools, combined with quite a bit of Secret Sauce. Our goal is to provide an end-to-end data storage, search, and serving platform. To do this, however, we need to know what the system does, and prove its capabilities. That’s why LogSearch is a great tool to put on top of BigSearch.
We run in the cloud or in your datacenter. To run our system in the cloud (specifically, AWS), it’s rather simple: spin up a few instances with our machine image, and install a script on each of your boxes to find the logs and push them onto the LogSearch cluster. Since every component of our infrastructure is scalable, it can handle as much data as is thrown at it.
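The per-box script is conceptually simple. The Python sketch below is not our actual script: the function names, the ".log" suffix, the record fields, and the send callable (a stand-in for whatever actually pushes data to the cluster) are all hypothetical.

```python
import os
import socket


def find_logs(root, suffix=".log"):
    """Yield the path of every file under `root` whose name ends with `suffix`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def ship(root, send, batch_size=1000):
    """Read every log under `root` and hand its lines to `send` in batches.

    `send` is any callable that accepts a list of records; in a real
    deployment it would push to the cluster (that part is not shown).
    Returns the number of lines shipped.
    """
    host = socket.gethostname()
    shipped = 0
    for path in find_logs(root):
        batch = []
        with open(path, errors="replace") as handle:
            for line in handle:
                batch.append(
                    {"host": host, "path": path, "line": line.rstrip("\n")}
                )
                if len(batch) >= batch_size:
                    send(batch)
                    shipped += len(batch)
                    batch = []
        if batch:  # flush the final partial batch for this file
            send(batch)
            shipped += len(batch)
    return shipped
```

Tagging each line with its host and path is what lets you later ask "which machine said this?" without logging in to 30 boxes.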
Everything is seamlessly processed with Hadoop, stored in HBase (a clone of Google’s BigTable), indexed, and made searchable in a few minutes. All you have to do is go to a webpage and search for what you want to find in the logs. In the near future, you will be able to upload MapReduce, Pig, or Hive scripts to provide more complex analytics.
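If "indexed and made searchable" sounds abstract, the toy Python sketch below shows the general idea of an inverted index on a single machine. It says nothing about how LogSearch, Hadoop, or HBase do it internally; the tokenizer and function names are assumptions for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(lines):
    """Map each token to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(line_no)
    return index


def search(index, lines, query):
    """Return the original text of every line containing all query tokens."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return [lines[i] for i in sorted(hits)]


lines = [
    "2010-01-25 web01 ERROR disk full",
    "2010-01-25 web02 INFO started",
    "2010-01-26 web01 ERROR timeout",
]
index = build_index(lines)
results = search(index, lines, "error web01")
```

A search looks up a few small sets instead of scanning every byte, which is why it stays fast where grep does not; the hard part at scale is spreading that index across many machines.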
Are you excited? (I am, I wrote a whole friggin’ article). Want to learn more? Follow me on twitter at @lusciouspear, or drop me a line at info@drawntoscalehq.com. We’d love to show you a demo, or get your ideas on what would make this tool Really Awesome and Indispensable.
Hi, I'm Bradford. I write about scalability and the fringes of Computer Science.
January 25th, 2010 at 12:52 pm
How do we integrate log data from remotely-operated monitoring services like Gomez and ScriptCanary?
ScriptCanary logs your Javascript errors to its own remote database but can’t write directly to your server logs.
Do we open up a logging service that can accept input from outside, so that the data is all in one place, or is it better to think of integrating outside data into our Ops dashboards?
Are there ways
January 25th, 2010 at 1:59 pm
Hey Dorian-
Thanks for making me aware of Gomez and ScriptCanary. I’ll see how we can integrate with them.
Monitoring and Logging are different. I think of Monitoring as a “real time” thing, to power dashboards and the like. Logging is on a “historical” / “20 minutes ago” timeframe.
If you were using our product, I’d suggest keeping Gomez and ScriptCanary to use them as dashboards, and seeing if you can write a service that takes information you send to them and *also* dumps it to a log file (or later, an API) in LogSearch.
January 25th, 2010 at 3:04 pm
Splunk’s been doing this for years!
January 25th, 2010 at 3:38 pm
I’ve found Splunk to be very good at this kind of thing
January 25th, 2010 at 3:46 pm
Yes, Splunk is very good with small to moderate amounts of data. Our tool focuses on scaling to massive amounts of data, and handles both search and analytics.
January 25th, 2010 at 5:22 pm
What do you guys consider huge amounts of data?
In a distributed environment, are you shared-nothing?
January 25th, 2010 at 5:25 pm
By “huge” amounts of data we mean several TB. We scale down to GB, of course. Not that huge these days, but it’s more than an RDBMS or several competitors can handle.
Yes, we’re shared nothing.
January 25th, 2010 at 8:40 pm
Splunk is a good introductory tool.
The NSA, Dept. of Treasury, IRS, tons of telcos, banks, insurance companies, health care companies use SenSage. Many customers gather 200GB to 500GB of log data daily from their data centers, aggregating them into 20-40 node clusters. And they keep this data around for forensic analysis for 2-7 years. At least one customer already has a petabyte under management.
http://www.sensage.com/customers/
January 26th, 2010 at 2:58 am
I agree, there is not enough discussion about logging. Some can be found in “Release It!” and I’ve blogged about logging in the past, e.g. here
http://codemonkeyism.com/7-more-good-tips-on-logging/
Cheers
Stephan
January 26th, 2010 at 1:15 pm
@Bradford… would you consider 7-10TB per day a lot of data… because Splunk has customers of that volume… and a lot less hardware necessary than Hadoop needs.
That being said.. i’m interested in your approach.
March 30th, 2010 at 2:25 pm
Great to see discussion about logging – not a hot topic everywhere but at end of day a key in my view, to insight into any application – and not just for debugging/recovering from errors. 1st question: Is this – LogSearch – hosted (eg cloud) or does one need to deploy it?
March 31st, 2010 at 7:27 am
I too would be interested in how you differ/are better than Splunk which I am about to evaluate. Is it mainly data volume?