A few tips for Big Data projects

Jun 25, 2012
software engineering

Floppy disk illustration At Lokad, we are routinely working on[Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain as one of the big buzzword of 2012, along with a big trail of failed projects. A while ago, I was offering tips for Web API design, today, let’s cover some Big Data lessons (learned the hard way, as always).

1. Small Data trump Big Data

There is one area that captures most of the community interest: web data (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced about every other verticals beside consumer internet.

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your noteboook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet,when facing Big Data problem, it just too temping to improve stuff, as opposed to challenge the problem in the first place.

For example at Lokad, as far inventory optimization was concerned, we have been pushing years of efforts at solving the wrong problem. Worse, our competitors has been spending hundreds of man-years of efforts doing the same mistake …

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world have been defined more than 3 decades ago, at a time where any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people reading 1 milion numbers takes an army of clercs.

I have already posted that manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then solution won’t scale. Period.

Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention.

4. Too big for Excel, treats as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far practicalities go:

Nobody is going to have a look at that many numbers.
Opportunities exist to solve a better problem.
Any non-quasi-linear algorithm will fail at processing data in a reasonable amount of time.
If data is poorly architectured / formatted, even sequential reading becomes a pain.

Then comes the question: how should handle Big Data? However, the answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won’t enter (here) the debate SQL vs NoSQL, instead let’s outline that whatever persistence approach is adopted, it won’t help:

figuring out if the problem is the proper one to be addressed,
assessing the usefulness of the analysis performed on the data,
blending Big Data outputs into user experience.