Big Data: choosing the problem before choosing the solution

My company has started several important Big Data missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.

A major (and frequent) pitfall of Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad’s included) are pushing their own Big Data recipes, which will randomly involve:

However, the notion of “Big” data is very relative: cheap 1TB hard drives are now available at your nearest supermarket, and very few problems faced by companies, even very large ones, require more than 100GB of data to process.

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented.
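
To make that claim concrete, here is a minimal back-of-the-envelope sizing sketch in Python. The retailer volumes and the packed record layout are assumptions chosen for illustration, not figures from any actual dataset:

```python
# Back-of-the-envelope sizing sketch (hypothetical retailer, assumed volumes).
# Each sales line is packed as 4 integers: product id, store id, day number, quantity.

import struct

BYTES_PER_LINE = struct.calcsize("<IIHH")  # 4 + 4 + 2 + 2 = 12 bytes per line
lines_per_year = 50_000_000                # assumed order of magnitude
years_of_history = 3

total_bytes = BYTES_PER_LINE * lines_per_year * years_of_history
print(f"{total_bytes / 1e9:.1f} GB")       # ~1.8 GB, well within a smartphone
```

Even multiplying those assumed volumes by ten, the compact representation stays far below the capacity of a single commodity disk.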

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process 100PB of data over Hadoop. That’s massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data framework), one has to carefully estimate the friction costs involved. While Hadoop is certainly simpler than, say, MPI (Message Passing Interface), it’s still a complicated distributed framework that requires a lot of skill to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within an acceptable timeframe, then, in my experience, the dumb solution is going to be about 100x cheaper (*) and easier to run and maintain than the “distributed” variant.

(*) I am not referring to hardware costs, but to wetware costs (aka people), which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.
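
As a sketch of what the “dumb” single-machine solution can look like, the snippet below streams a flat sales file and aggregates it with a plain dictionary. The file name and column layout are hypothetical; the point is that a commodity disk streams at hundreds of MB/s, so tens of gigabytes pass through such a loop in minutes, with nothing to deploy or operate:

```python
# Single-machine aggregation sketch: total quantity sold per product,
# streamed line by line from a flat TSV file (hypothetical file and columns).

from collections import defaultdict

totals = defaultdict(int)

with open("sales_history.tsv", "r") as f:
    for line in f:
        product_id, store_id, day, quantity = line.rstrip("\n").split("\t")
        totals[product_id] += int(quantity)

print(f"{len(totals)} distinct products aggregated")
```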

The untold story about Hadoop (and its peers) is that it works if, and only if, the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers are no faster than a single machine.
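
The toy model below illustrates why partitioning dominates; it is not Hadoop itself, and the partition sizes are made up. A map/reduce pass finishes only when its slowest partition finishes, so one oversized partition makes the whole cluster wait on a single machine:

```python
# Toy model: elapsed time of a map/reduce pass is bounded by the largest partition.
# Partition sizes (in MB) are invented for illustration; both layouts total ~100 GB.

balanced = [100] * 1000            # 1000 partitions of 100 MB each
skewed   = [99_900] + [0.1] * 999  # same total, but one giant partition

def wall_clock(partitions, mb_per_second=100):
    # With one worker per partition, elapsed time equals the largest partition.
    return max(partitions) / mb_per_second

print(wall_clock(balanced))  # 1.0 second of crunching per worker
print(wall_clock(skewed))    # 999.0 seconds: the cluster waits on one machine
```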

Enterprise Big Data starts at 100MB

Facebook is facing petabytes of data, that is, millions of gigabytes, but is your company really facing that much data? Do you need to plug in that much data to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most enterprises, “Big Data” starts at 100MB when:

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of data crunching situations, I observe that it’s easier to remove the SQL database altogether than to improve the performance of the queries over the relational database.
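
If the SQL database does stay in place, the single-scan pattern sketched below avoids the re-scanning trap: pull the table once, then crunch in memory. The database file, table, and column names are assumptions for the sake of the example:

```python
# Sketch: scan the table once and crunch in memory, instead of issuing
# thousands of small SQL queries (database, table and columns are hypothetical).

import sqlite3
from collections import defaultdict

conn = sqlite3.connect("sales.db")
rows = conn.execute("SELECT product_id, quantity FROM sales")  # single full scan

totals = defaultdict(int)
for product_id, quantity in rows:
    totals[product_id] += quantity

conn.close()
```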

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblocks to delivering a “solution”. Those problems are typically:

Thus, before jumping to the next cool vendor solution, I urge you to start by facing the very uncool aspects of the problem. Frequently, the “solution” consists of removing an ingredient of the previous solution.


Reader Comments (1)

Nice post. Your sentiments can be applied to most situations that companies try to solve with technology, not only big data. Understanding what you are trying to achieve and how to measure success is where we should all be starting. Too many people who are experts in a particular technology want to use their hammer to hit the screw, whether that is the best option or not. Thanks for getting me thinking.

November 1, 2012 | Damana Madden