I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)


A buyer’s guide for enterprise software

Through a series of Big Data consulting missions that were overlapping with the entire IT landscape, data being all over the place, I have observed software purchasing processes of many large companies. Being also an enterprise software vendor myself, I have been baffled countless times by broken buying processes that lead smart people routinely choose about the worst price-quality ratio that the market has to offer.

In this post, I am trying to gather a survival kit for buying enterprise software. By enterprise, I refer to companies with over 1,000 employees.

Get rid of Requests for Quotes (RFQ)

The rational goes like this: let’s write down all our requirements, then, we identify all suitable vendors, send them the RFQ, collect the quotes, review demos, and finally select the best option. Fairly reasonable, and yet, as far software is concerned, deeply dysfunctional.

Why? Because writing an accurate specification for software you need is harder than writing the software itself. It’s somehow the Heisenberg uncertainty principle transposed to software. Just try to write a RFQ for a webmail (aka GMail) if you’re not convinced. Specifying the fine print would already takes hundreds of pages, that’s only a webmail. Enterprise software tends to be a lot more complex than webmail.

As anecdotal evidence, writing and managing an RFQ is so time consuming and complex that many companies throw consultants at the problem, increasing costs even further.

Then, when the whole process is in place, guess what happens? All decent vendors walk away. Indeed, if you have a decent product that sells by itself (think Microsoft Excel), why would you bother paying an army of account managers to walk through broken RFQs? In the end, the only vendors left are either the ones so outrageously expensive that they can afford whatever it takes, or the ones selling crappy products that would never get sold without the active contribution of dysfunctional (yet widespread) processes.

My advice: Take your favorite search engine and forge yourself an opinion. It’s easier than you think, and in less than 3h, you're likely to have already a convincing shortlist of vendors. A few tips:

  • If all you find about a software is happy talk, software is worthless or vaporware.
  • If you can’t find screenshots of the software, then it means that the user interface, and probably the user experience as well, is abysmal.
  • If the web documentation is awful or absent, then whatever private documentation exists, it won’t be better.
  • If there is no public pricing, then, you will face somebody who’s professionally trained in reverse engineering the exact budget you have. That's going to be a time sink.
  • If the company does open source, then bonus points. They are not afraid that other people might have a look at the software code they produce.

Customizing Off-the-shelf Software is deadly

Joel Spolsky stated that it takes 10 years to write good software, and looking at the development curve of my own company I tend to agree. Well, it took us 5 years to realize that we were not even solving the correct problem (but this will be the subject for another post).

Software is (mostly) a take it or leave it business. Yes, you can request tiny adjustments, but asking for anything substantial is like swimming in molasses, mostly because of hidden costs. Generic upgrades won't work, support staff will be incompetent, design balance will be thrown out of whack, etc. As a result, good software companies, where people truly care and dedicate a good portion of their lifespan in carefully crafting truly valuable products, will actually decline such requests.

In contrast, vendors who will gladly accept customization requests are the one putting little value on the integrity of their product; which has already degenerated into some byzantine architecture through the acceptance of disparate customization requests over time.

My advice: choose your battles wisely. If a feature gives you an edge against the competition then internalize it and treat it as a core asset. Otherwise, it’s easier and cheaper to adjust your own organization to whatever software which reflects the dominant practice - as long the new practice offers a tangible improvement over the old one.

Bargaining over the price is against your interest

Bill Gates said that you don’t get what you deserve: you get what you negotiate. True enough but you should wisely choose what you want to negotiate. Negotiating over the software pricing is enormously expensive for the vending party. It takes a highly talented workforce, with employees both technically savvy and yet having all the smooth skills it takes to interact with large organizations. In most Western countries, the yearly total cost of a seasoned account manager is above $150k.

When you start bargaining with software companies, you also start an adverse selection process. Only the companies willing to afford an expensive sales force remains while others walk away. The vendors who stay are the ones where the business model is geared toward an ever increasing sales force. However, you want to put your money on a vendor that invest the bulk of her revenue in developing good software, not funding an army of people that specializes in reverse engineering budgets.

My advice: negotiating the price downward is usually a dead end. If it's out of budget, then discard the vendor and move to the next one.

Instead of focusing on price, you should try is to capture as much attention as you can from the core development team. Since you are an enterprise (hence a large prospect), you already got an edge here. For example, you can negotiate a small case study against a series of meetings with whoever are in charge of the core product development.

By making sure that the people in charge of the product are familiar with your business, you are vastly improving the odds that the future developments will be aligned to your company needs. Furthermore, you won’t even need to fund or think about those enhancements, the people you’ve met will do this free of charge; because it’s a natural thing engineers do when learning about client's problems.

More is less and stay clear of platforms

The inevitable corollary of RFQ is that the more advertised features the better. Furthermore, the impression is amplified by the vendors themselves who promote an ongoing stream of new features to sustain a form of recurring business (even for SaaS).

In the context of enterprise software, more features raise a very specific problem: soon enough supposedly distinct products start to overlap.

  • Both CRM and CMS (Content Management System) want to capture web leads.
  • Both the inventory management system and the accounting system want to manage suppliers.
  • Both BI and Web Analytics want to analyze the sales channels.
  • ...

Overlapping is deadly because data flows within the IT landscape start to look like the Tokyo subway map. Indeed, each time, on both ends of the overlap, divisions want to use their software, and consequently IT gets forced into moving the data around, struggling with inconsistent domain models.

Platforms are the worst offender here, because platforms are bound to overlap with practically everything else in the company. Worse, platforms create rampant dependencies making it quasi-impossible to get rid of the platform vendor later on.

My advice: favor highly focused app over jack-of-all trade apps. If you get it wrong, it will fail fast, and you will have ample possibility to try again. Most enterprises incorrectly think that managing many vendors is a problem, hence favoring Big Systems. However, dealing with hundreds of apps is quasi-painless - well, for a large company - as long as apps remain decoupled, and as long as you’re not bargaining with an army of vendors. In contrast, it only takes two overlapping platforms to create an IT integration nightmare.

Time is of the essence

In the world of enterprise software, vendor lock-in is all over the place; and yet, all enterprises I met had supposedly everything in place to avoid those lock-ins: contractual reversibility (to be able to revert to back to the previous system), contractual migration support (to be able to move forward to a new system), favorable termination clauses, etc. On the surface, extensive safeguards were in place against any vendor lock-in.

How could the reality be so different from the theory? It’s because time is the strongest and most widely underestimated vendor lock-in mechanism. What’s the point of being able to terminate a contract any day if it takes years to de-entangle operations from the vendor? As a rule of thumb, phasing out an enterprise vendor take about 3x as long as phasing the vendor in; and if it’s a platform, then you can reasonably consider your company locked on the platform forever.

Indeed, don’t expect vendors to be overly motivated by the prospect of having their products being phased out. At best, the legacy vendor will be slow but ultimately responsive to whatever problems which will necessarily arise during the transition. But that's the best case ...

My advice: Internet was a revolution. SaaS was a revolution. Cloud computing was a revolution. Mobile internet was a revolution. Software is a fast-paced industry, even more than consumer electronics. Would you buy a tablet and consider it’s an asset to be amortized over the next decade? Certainly not, it’s the same for enterprise software. No matter what software you are considering, if you can’t roll out in a matter of weeks, then move on to the next vendor. Otherwise, by the time your company is done with its acquisition, the software industry will have moved to its next revolution, you will be left with a freshly obsolete technology.

Your data is your DNA, take ownership

There is one area where COTS (commercial off-the-shelf software) works poorly for large enterprises: it’s the orchestration of the data. By orchestration the data, I am not referring to databases, ESB (Enterprise Service Bus) or equivalent low-level generic data system; I am referring to the layer on top that unify the IT landscape. This layer typically involves a mix of people and some middleware to carry on with the changes.

Indeed, COTS necessarily carry strong domain models, that it, the abstract software representation adopted to model the business itself. However, there is (almost) not a single chance that any of those predesigned models would fit a large company made of fusions, acquisitions, restructurings and possibly somewhat heterogeneous branches. Some software, notoriously SAP, can be made to fit practically the domain any large company, but there are so many developments involved, that it hardly counts as off-the-shelf software.

My advice: if there is one area where every large company should have a small team of software developers, it’s its own private data platform: an entity dedicated to the collection and the service all the data generated within the IT landscape. Here I strongly favor in-house software developers, because business data is always core business, no matter what your company is doing.

Furthermore, for most enterprises, a small tiger team of software developers in charge of the data would typically vastly reduce IT spendings compared to vast teams of recycled IT workforce. Indeed, great control over your data grants you the capability to swiftly phase software in and out, and speed is decisive. In practice, it does not take many people, but it takes talented people.


8 tips to turn your Big Data into Small Data

Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.

The Big Data projects I do, and the more I realize how usually scalability aspects for business projects are irrelevant to the point that the quasi-totality of the valuable data crunching processes could actually be run on a smartphone if the proper approaches are taken. Obviously there is no point in actually doing the analysis on a smartphone, this merely illustrating that really it does not take much computational power.

While all vendors are boasting being able to crunch terabytes of data, it turns out that it's very rare to even face dataset bigger than 100MB when properly represented in memory. The catch is that between a fine tuned data representation and a verbose representation - say XML or SQL; there is typically a factor 100x to 1000x involved as far the data footprint is concerned.

The simplest way to deal with Big Data is to turn it in to Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.

1. Get rid of everything that is not required

While this might seem obvious, whenever we tackle a Big Data, we typically start by ditching about 90% of the data that is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded for the analysis.

2. Turn dates in 16-bits integers when the time is not needed

A date-time is represented as an 8-bytes data structure in most languages. Yet, a single unsigned 16-bits integer gives you 65536 combinations, that is, enough to cover 179 years of daily increments, which proves to be usually sufficient. That's a 4x memory saving.  

3. Turn 8-bytes floating point values in 4-bytes or even 2-bytes values

Whenever money is involved, businesses rely on 8-bytes or even 16-bytes floating point values. However, from a statistical viewpoint, such a precision typically makes little sense, it's like computing everything in grams, to finally upper round the final result to the next ton. The 2-bytes precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods for example. That's a 4x memory saving.

4. Replace strings by keys with lookup tables

The lookup tables are extremely simple and fast data structure. Depending on the situation, you can typically use lookups to replace fields that contain strings but with many repeated occurrences. Your mileage may vary (YMMV) but lookups, when applicable frequently bring a 10x memory saving.

5. Get rid of objects, use value types instead

Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically of 16-bytes per object when working under 64-bits environments, that is, the default situation nowadays. To avoid those situations, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.

6. Use plain arrays not "smart" collections

Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV but arrays over collections frequently bring a 2x memory saving.

7. Use variable length encoding

The variable length encoding represents a simple compression pattern favoring small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign the identifiers based on their usage frequency; i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.

8. Vectorialize listing when possible

Many data represented as listings in their original relational representation can be vectorialized somehow. For example, if I am interested in the analysis of the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) can already provides boolean flags of visits for any given day over the last 6 months. When application, this typically grants a 10x memory saving.


Big Data: choosing the problem before choosing the solution

My company has started several important big data missions, and I am taking here the opportunity publish some insights are are relevant to all those initiatives.

A major (and frequent) pitfall of the Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad's included) are pushing their own Big Data recipe which will randomly involve:

  • Hadoop
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard-drives are now available at your nearest supermarket, and very very few problems faced by companies, even very large ones, do require require more than 100 GB of data to process. 

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks are mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data frameworks), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than say MPI, it's still a complicated distributed framework which do require a lot of skills to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience the dumb solution is going to be about 100x cheaper (*) and easier to run and to maintain compared to the "distributed" variant.

(*) I am not refering to hardware costs, but to wetware costs (aka people) which represents 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works only if, and only if, the data is very meticuluously organized to be made suitable for a processing through the framework. If the data is incorrectly partioned, then Hadoop plus thousands of servers are no faster than a single machine.

Enterprise Big Data start at 100MB

Facebook is facing Petabytes of data, that's millions of Gigabytes, but is really your company facing that much data? Do you need to plug that much data in to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most entreprises, "Big Data" starts at 100MB when:

  • Excel is no more a solution.
  • SQL is no more a solution (*)

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of the data crunching situations, I observe that it's easier to remove the SQL database, as opposed to improve the performance of the queries over the relational database.

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblock to deliver a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems on collecting and servicing the data. The most obvious symptom is typically the lack of documentation concerning the data itself, and all the nitty-gritty insights to need to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: They are so many parts of the business that could be improved through a smart exploitation of the data, that it is extremely tempting to think that some (hype) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even for sure you're moving in the right direction. Frequently, defining the metric - that is the problem to be solved - is harder than implementing the solution. 

Thus, before jumping to next cool vendor solution, I urge to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.


A few tips for Big Data projects

Floppy disk illustrationAt Lokad, we are routinely working on Big Data projects, primarily for retail, but with occasional missions in energy or biotech companies. Big Data is probably going to remain as one of the big buzzword of 2012, along with a big trail of failed projects. A while ago, I was offering tips for Web API design, today, let's cover some Big Data lessons (learned the hard way, as always).

1. Small Data trump Big Data

There is one area that captures most of the community interest: web data (pages, clicks, images). Yet, the web-scale, where you have to deal with petabytes of data, is completely unlike 99% of the real-world problems faced about every other verticals beside consumer internet

For example, at Lokad, we have found that the largest datasets found in retail could still be processed on a smartphone if the data is correctly represented. In short, for the overwhelming majority of problems, the relevant data, once properly partitioned, take less than 1GB.

With datasets smaller than 1GB, you can keep experimenting on your laptop. Map-reducing stuff on the cloud is cool, but compared to local experiments on your noteboook, cloud productivity is abysmal.

2. Smarter problems trump smarter solutions

Good developers love finding good solutions. Yet,when facing Big Data problem, it just too temping to improve stuff, as opposed to challenge the problem in the first place.

For example at Lokad, as far inventory optimization was concerned, we have been pushing years of efforts at solving the wrong problem.  Worse, our competitors has been spending hundreds of man-years of efforts doing the same mistake ...

Big Data means being capable of processing large quantities of data while keeping computing resource costs negligible. Yet, most problems faced in the real world have been defined more than 3 decades ago, at a time where any calculation (no matter how trivial) was a challenge to automate. Thus, those problems come with a strong bias toward solutions that were conceivable at the time.

Rethinking those problems is long overdue.

3. Being non-intrusive is scalability-critical

The scarcest resource of all is human time. Letting a CPU chew 1 million numbers is nothing. Having people reading 1 milion numbers takes an army of clercs. 

I have already posted that manpower requirements of Big Data solutions were the most frequent scalability bottleneck. Now, I believe that if any human has to read numbers from a Big Data solution, then solution won't scale. Period.

Like AntiSpam filters, Big Data solutions need to tackle problems from an angle that does not require any attention from anyone. In practice, it means that problems have to be engineered in a way so that they can be solved without user attention. 

4. Too big for Excel, treats as Big Data

While the community is frequently distracted by multi-terabyte datasets, anything that does not conveniently fit in Excel is Big Data as far practicalities go:

  • Nobody is going to have a look at that many numbers.
  • Opportunities exist to solve a better problem.
  • Any non-quasi-linear algorithm will fail at processing data in a reasonable amount of time.
  • If data is poorly architectured / formatted, even sequential reading becomes a pain.

Then comes the question: how should handle Big Data? However, the answer is typically very domain-specific, so I will leave that to a later post.

5. SQL is not part of the solution

I won't enter (here) the debate SQL vs NoSQL, instead let's outline that whatever persistence approach is adopted, it won't help: 

  • figuring out if the problem is the proper one to be addressed,
  • assessing the usefulness of the analysis performed on the data,
  • blending Big Data outputs into user experience.

Most of the discussions around Big Data end up distracted by persistence strategies. Persistence is a very solvable problem, so engineers love to think about it. Yet, in Big Data, it's the wicked parts of the problem that need the most attention.


Happy talk detector

Over the last couple of months, I have been pushing a lot of content on my company website (, and proofreading a lot of texts produced by colleagues too. The more I write, the more I realize that fighting our innate instinct to produce happy talk is a tough battle.

Recently, I came up with a simple rule to detect most happy talk content:

When by replacing a sentence by its negation, the resulting message seems totally out of place, then, odds are that the sentence was not carrying much of a message in the first place.

For example, it might be tempting write down on a company website We strive for excellency; however, if you think the opposite We strive for mediocrity, it becomes clear that nobody would claim the latter version. Hence, since the latter is obvious, the former has to be too.

The trick is purely psychological though. When producing an assertion of some kind, our mind - at least mine for sure - seems to better spot oddities rather than to recognize the obvious as such.