Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Wednesday, Oct 03, 2012

Big Data: choosing the problem before choosing the solution

My company has started several important Big Data missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.

A major (and frequent) pitfall of Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad included) are each pushing their own Big Data recipe, which may involve any combination of:

  • Hadoop
  • SAP HANA
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard drives are now available at your nearest supermarket, and very few problems faced by companies, even very large ones, require more than 100 GB of data to process.

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 
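To make the claim concrete, here is a back-of-envelope sketch in Python; the network size, history depth, density and encoding figures below are hypothetical assumptions for illustration, not actual Lokad numbers.

    # Hypothetical retail network: daily sales per SKU per store, compactly encoded.
    stores = 1_000            # assumed number of points of sale
    skus = 20_000             # assumed number of distinct products
    days = 3 * 365            # three years of daily history
    bytes_per_cell = 2        # sales quantity stored as a 16-bit integer

    dense_bytes = stores * skus * days * bytes_per_cell
    print(f"Dense encoding:  {dense_bytes / 1e9:.0f} GB")    # roughly 44 GB

    # Most (store, SKU, day) cells hold zero sales; keeping only non-zero records
    # (assumed 5% density, 8 bytes per record) shrinks the dataset further.
    sparse_bytes = int(stores * skus * days * 0.05 * 8)
    print(f"Sparse encoding: {sparse_bytes / 1e9:.1f} GB")   # roughly 9 GB

Under those assumptions, the whole sales history fits comfortably on a phone, let alone on a single commodity server.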

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data framework), one really has to estimate the friction costs involved. While Hadoop is certainly simpler than, say, MPI, it is still a complicated distributed framework which requires a lot of skill to be operated properly and efficiently.

If the very same goal can be achieved on a single machine within an acceptable timeframe, then, in my experience, the dumb solution is going to be about 100x cheaper (*) and easier to run and maintain compared to the "distributed" variant.

(*) I am not referring to hardware costs, but to wetware costs (aka people), which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works if, and only if, the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers is no faster than a single machine.
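As a toy illustration of the partitioning point (the record layout and worker count below are made up for the example), the sketch hashes records to workers by a partition key: a well-spread key balances the load, while a key shared by most records leaves one worker doing nearly all the work.

    from collections import Counter

    def worker_load(records, n_workers, key):
        """Count how many records each worker would receive under hash partitioning."""
        load = Counter()
        for record in records:
            load[hash(key(record)) % n_workers] += 1
        return load

    records = [{"order_id": i, "country": "FR" if i % 20 else "DE"}
               for i in range(100_000)]

    # Partitioning by order_id spreads records evenly across the 10 workers.
    print(worker_load(records, 10, key=lambda r: r["order_id"]))

    # Partitioning by country sends ~95% of the records to a single worker,
    # so 9 of the 10 workers sit idle and the cluster is no faster than one machine.
    print(worker_load(records, 10, key=lambda r: r["country"]))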

Enterprise Big Data starts at 100MB

Facebook is facing petabytes of data, that is, millions of gigabytes; but is your company really facing that much data? Do you need to plug in that much data to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most enterprises, "Big Data" starts at 100MB when:

  • Excel is no longer a solution.
  • SQL is no longer a solution (*)

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of data crunching situations, I observe that it's easier to remove the SQL database than to improve the performance of the queries over the relational database.
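For illustration, a single sequential pass over a flat extract looks like the sketch below; the file name and column names are hypothetical. The point is simply that every row is read exactly once, which is precisely what a per-item loop of SQL queries tends not to do.

    import csv
    from collections import defaultdict

    # Aggregate sales per SKU in one sequential scan of a flat file
    # (hypothetical extract with 'sku' and 'quantity' columns).
    totals = defaultdict(float)
    with open("sales_history.csv", newline="") as f:
        for row in csv.DictReader(f):
            totals[row["sku"]] += float(row["quantity"])

    print(f"{len(totals)} SKUs aggregated in a single pass over the file")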

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblocks to delivering a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems with collecting and servicing its data. The most obvious symptom is typically the lack of documentation concerning the data itself, and of all the nitty-gritty insights needed to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: There are so many parts of the business that could be improved through a smart exploitation of the data that it is extremely tempting to think that some (hyped) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even know for sure whether you're moving in the right direction. Frequently, defining the metric (that is, the problem to be solved) is harder than implementing the solution.

Thus, before jumping on the next cool vendor solution, I urge you to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.

Monday, Jun 25, 2012

A few tips for Big Data projects

At Lokad, we routinely work on Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain one of the big buzzwords of 2012, along with a big trail of failed projects. A while ago, I offered tips for Web API design; today, let's cover some Big Data lessons (learned the hard way, as always).

1. Small Data trumps Big Data

There is one area that captures most of the community's interest: web data (pages, clicks, images). Yet the web scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced by about every other vertical besides consumer internet.

For example, at Lokad, we have found that the largest datasets found in retail can still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, takes less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your notebook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet, when facing a Big Data problem, it is just too tempting to improve the solution, as opposed to challenging the problem in the first place.

For example, at Lokad, as far as inventory optimization was concerned, we spent years of effort solving the wrong problem. Worse, our competitors have been spending hundreds of man-years of effort making the same mistake...

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world were defined more than three decades ago, at a time when any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew through 1 million numbers is nothing. Having people read 1 million numbers takes an army of clerks.

I have already posted that the manpower requirements of Big Data solutions are the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers out of a Big Data solution, then the solution won't scale. Period.

Like anti-spam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered so that they can be solved without user attention.

4. Too big for Excel, treat as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far as practicalities go:

  • Nobody is going to have a look at that many numbers.
  • Opportunities exist to solve a better problem.
  • Any non-quasi-linear algorithm will fail at processing the data in a reasonable amount of time (see the back-of-envelope arithmetic after this list).
  • If the data is poorly architected / formatted, even sequential reading becomes a pain.
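To give an order of magnitude for the quasi-linearity point, here is a small sketch; the row count and the single-core throughput are assumptions made for the sake of the arithmetic.

    from math import log2

    n = 100_000_000                  # assumed row count, far beyond Excel territory
    ops_per_second = 1e9             # optimistic single-core throughput assumption

    quasi_linear_s = n * log2(n) / ops_per_second   # e.g. sorting the dataset
    quadratic_s = n ** 2 / ops_per_second           # e.g. naive pairwise comparisons

    print(f"O(n log n): about {quasi_linear_s:.0f} seconds")
    print(f"O(n^2):     about {quadratic_s / 86_400:.0f} days")

Under those assumptions, the quasi-linear pass finishes in seconds while the quadratic one takes months.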

Then comes the question: how should one handle Big Data? The answer, however, is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won't enter (here) into the SQL vs NoSQL debate; instead, let's outline that whatever persistence approach is adopted, it won't help with:

  • figuring out if the problem is the proper one to be addressed,
  • assessing the usefulness of the analysis performed on the data,
  • blending Big Data outputs into user experience.

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it's the wicked parts of the problem that need the most attention.

Monday, May 21, 2012

Happy talk detector

Over the last couple of months, I have been pushing a lot of content on my company website (Lokad.com), and proofreading a lot of texts produced by colleagues too. The more I write, the more I realize that fighting our innate instinct to produce happy talk is a tough battle.

Recently, I came up with a simple rule to detect most happy talk content:

If replacing a sentence by its negation yields a message that seems totally out of place, then odds are the original sentence was not carrying much of a message in the first place.

For example, it might be tempting to write on a company website "We strive for excellence"; however, if you consider the opposite, "We strive for mediocrity", it becomes clear that nobody would ever claim the latter. Hence, since the latter is obviously never claimed, the former is obvious too, and carries little information.

The trick is purely psychological, though. When producing an assertion of some kind, our mind, at least mine for sure, seems better at spotting oddities than at recognizing the obvious as such.

Monday, Mar 19, 2012

Bizarre pricing, does it matter? (B2B)

My company has just released its quantile forecasts upgrade. It's no less than a small revolution for us; however, unless you've got some inventory to manage, it's probably not too relevant to your business.

Another salient aspect is our new pricing for quantiles (the old pricing for classic forecasts remains untouched). Lokad is selling a monthly subscription, and if $q_i$ represents one of the actual quantile values retrieved by the client during the month, then the monthly cost $C$ is given by:

$$C = \$0.15 \times \left(\sum_{i=0}^n q_i^{2/3} \right)^{2/3}$$

We hesitated to round 0.15 to $\frac{\pi}{20}$ because formulas look better with Greek letters. Obviously, it's not simple, and most people would go as far as saying it's downright obscure; but is it really a good pricing, or just plain insanity?
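For the record, the formula is straightforward to evaluate. The sketch below is a direct transcription of it in Python; the sample quantile values are made up.

    def monthly_cost(quantiles, rate=0.15):
        """Monthly cost in USD, given the quantile values pulled during the month."""
        return rate * sum(q ** (2 / 3) for q in quantiles) ** (2 / 3)

    # Hypothetical client retrieving 1,000 quantile values of 50 units each.
    print(f"${monthly_cost([50] * 1000):.2f} per month")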

To understand a bit where Lokad is coming from, let's start with the fact that we are a B2B software company. About 95% of our competitors don't have any kind of public pricing: you can only ask for a quote, and then a talented sales guy will contact you to figure out your maximum budget, only to get back to you with a quote at 120% of the figure you gave him.

However, I strongly favor public pricing, not because it's more transparent, honest, fair, whatever, but because it's a massive time saver. At Lokad, we don't enter into time-consuming pricing negotiations except for the largest clients, where it does make sense to spend time negotiating.

The cardinal rule of software pricing is that it should capture the willingness to pay of the client, which, in B2B, is typically related to the economic gains generated by the usage of the product. In the case of demand forecasting, benefits can be accurately computed. However, turning this forecasting benefits formula into a pricing formula is insanely complex in the general case.

Hence, we decided to settle for heuristics that somehow mimic this theoretical willingness to pay, ran many simulations over our existing customer base, and finally figured out the formula. I do not claim that this pricing formula is optimal in any way: it is not. However, it does bring a very reasonable pricing for clients ranging from 1-man companies to 100,000+ employees companies.

Pros:

  • (As far as we can judge) It's aligned with the value Lokad creates for clients.
  • It's still simple enough to be memorized in 20s.
  • It does not create an incentive to game the pricing by excluding slow movers (i.e. products with low sales) from the forecasting process.
  • There is no threshold effect, where the pricing jumps to a much larger number just because the company has 1 more product than what the license would support.

Cons:

  • It certainly falls into the category of bizarre pricing.
  • The only way to know the real monthly cost for sure is to give it a try (1).
  • Some prospects try the pricing formula on their own, and get it wrong (2).

(1) This statement applies to most metered SaaS, even if the pricing is linear. For example, at Lokad we had very little clue about our exact bandwidth consumption until we migrated toward the cloud (with dedicated servers, bandwidth was part of the package).

(2) I believe this partly explains why 95% of our competitors don't put any public price on display. That, and the fact that a very expensive pricing is likely to scare away prospects before the vendor gets the chance to corner them into the sales process.

I would be interested to see if other B2B niches have designed their own bizarre pricing formulas. Don't hesitate to submit them in comments.

Wednesday, Feb 22, 2012

Cloud questions from Syracuse University, NY

A few days ago, I received a couple of questions from a student at Syracuse University, NY, who is writing a paper about cloud computing and virtualization. The questions are relatively broad, so I am taking the opportunity to post the answers directly here.

What was the actual technical and business impact of adopting cloud technology?

The technical impact was a complete rewrite of our codebase. It has been the largest upgrade ever undertaken by Lokad; it spanned over 18 months, more or less mobilizing the entire dev workforce during the transition.

As far as business is concerned, it meant that most of Lokad's business during 2010 (the peak of our cloud migration) was stalled for a year or so. For a young company, 1 year of delay is a very long time.

On the upside, before the migration to the cloud, Lokad was stuck with SMBs: serving any mid-to-large retail network was beyond our technical reach. With the cloud, processing super-large retail networks became feasible.

What, if any, negative experience did Lokad encounter in the course of migrating to the cloud?

Back in 2009, when we started to ramp up our cloud migration efforts, the primary problem was that none of us at Lokad had any in-depth experience of what the cloud implies as far as software architecture is concerned. Cloud computing is not just any kind of distributed computing; it comes with a rather specific mindset.

Hence, the first obstacle was to figure out by ourselves the patterns and practices for enterprise software on the cloud. It has been a tedious journey to end up with Lokad.CQRS, which is roughly our 2nd generation of native cloud apps. We rewrote everything for the cloud once, and then we did it again to get something simpler, leaner, more maintainable, etc.

Then, at the present time, most of our recurring cloud problems come from integrations with legacy, pre-Web enterprise software. For example, operating through VPNs from the cloud tends to be a huge pain. In contrast, modern apps that offer a REST API are a much more natural fit for cloud apps, but those are still rare in the enterprise.

From your current perspective, what, if anything, would you have done differently?

Tough question, especially for a data analytics company such as Lokad, where it can take 1 year to figure out the 100 magic lines of code that will let you outperform the competition. Obviously, if we had to rewrite Lokad from scratch again, it would take us much less time. However, that would be dismissing the fact that the bulk of the effort has been the R&D that made our forecasting technology cloud-native.

The two technical aspects where I feel we have been hesitating for too long were SQL and SOAP.

  • It took us too long to decide to ditch SQL entirely in favor of some native cloud storage (basically the Blob Storage offered by Windows Azure); a minimal sketch of the blob-centric approach follows after this list.
  • SOAP was a somewhat similar case. It took us a long time to give up on SOAP in favor of REST.
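As an illustration of the blob-centric approach (written against today's Python SDK for Azure Blob Storage rather than the .NET stack Lokad actually used, and with made-up container, blob and connection-string values), whole datasets are written and read as opaque blobs in a single sequential transfer each way, instead of being spread across relational tables:

    from azure.storage.blob import BlobServiceClient

    # Connection string, container and blob names are placeholders for illustration.
    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="forecasts", blob="client-123/history.bin")

    serialized_dataset = b"\x00" * 1024     # placeholder payload; any bytes will do

    blob.upload_blob(serialized_dataset, overwrite=True)     # one sequential write
    payload = blob.download_blob().readall()                 # one sequential read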

In both cases, the problem was that we (or maybe it was just me) had not fully accepted the extent of the implications of a migration toward the cloud. We remained stuck for months with older paradigms that caused a lot of unneeded friction. Giving up on those from Day 1 would have saved a lot of effort.
