A few tips for Big Data projects

Floppy disk illustration At Lokad, we are routinely working on[Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain as one of the big buzzword of 2012, along with a big trail of failed projects. A while ago, I was offering tips for Web API design, today, let’s cover some Big Data lessons (learned the hard way, as always).

1. Small Data trump Big Data

There is one area that captures most of the community interest: web data (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced about every other verticals beside consumer internet.

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your noteboook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet,when facing Big Data problem, it just too temping to improve stuff, as opposed to challenge the problem in the first place.

For example at Lokad, as far inventory optimization was concerned, we have been pushing years of efforts at solving the wrong problem.  Worse, our competitors has been spending hundreds of man-years of efforts doing the same mistake …

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world have been defined more than 3 decades ago, at a time where any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people reading 1 milion numbers takes an army of clercs.

I have already posted that manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then solution won’t scale. Period.

Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention.

4. Too big for Excel, treats as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far practicalities go:

Then comes the question: how should handle Big Data? However, the answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won’t enter (here) the debate SQL vs NoSQL, instead let’s outline that whatever persistence approach is adopted, it won’t help:

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it’s the wicked parts of the problem that need the most attention.