I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.


Bitcoin, more thoughts on an emerging currency

Two years ago, I published some first thoughts on Bitcoin. In the meantime, Bitcoin has grown tremendously, and I remain an enthusiastic observer of those developments. I had originally proposed a vision in 5 stages for the development of Bitcoin:

  1. Mining stage
  2. Trading stage
  3. End-user stage
  4. Merchant stage
  5. Enterprise stage

Back in 2011, I had written that mining was taken care of. Well, since that time, Bitcoin has witnessed an explosion of hashing power through the development of ASICs, that is, hardware dedicated to the sole purpose of mining Bitcoins. Mining has definitely emerged as an extremely specialized niche.

Bitcoin is now halfway through its trading stage. Two years ago, MtGox was so dominant that it was the closest thing to a single point of failure for Bitcoin. In the meantime, many other exchanges have emerged: Bitstamp, Kraken, Btcchina … I suspect that MtGox holds no more than 20% of the exchange market share. Are we done with exchanges yet? Well, not yet: Bitcoins remain convoluted to acquire – I will get back to this point.

Fade of interest, a fading danger but still the main danger

Price volatility, malevolent uses and adverse regulations are usually quoted as dangers faced by the emerging currency. I think that those threats have grown into non-issues for Bitcoin. Indeed, the very same criticisms can be made about most currencies and commodities anyway, and Bitcoin is now beyond the point where a roadblock could wipe out the initiative.

No, the one major risk for Bitcoin remains a fade of interest from the community. High-tech is a fast paced environment and few technologies survive a decade. However, considering the steady growth of Bitcoin in the public awareness, I am inclined to think that this risk, the one true danger for Bitcoin, is itself fading away.

Bitcoin, a poster child for antifragility

Over the last two years, Antifragile by Nassim Nicholas Taleb is the most notable book I have had the chance to read. In particular, I realized that antifragility is probably one of the greatest and most misunderstood qualities of Bitcoin. Bitcoin might seem complex, but it’s nothing but a protocol sitting on top of a shared ledger. Thanks to the present reach of Bitcoin, the ledger itself – technically the blockchain – is probably the dataset in existence that benefits from the greatest number of backups world-wide. That part is safe, arguably orders of magnitude safer than the ledger of any bank.
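
To make the idea of a shared ledger a bit more concrete, here is a minimal sketch of a hash-chained ledger in Python. It is a toy: the block layout, the function names and the sample transactions are all invented for the illustration, and it leaves out mining, signatures and everything else the real Bitcoin protocol does. It only shows why a tampered copy is easy to tell apart from the many honest copies.

```python
import hashlib
import json


def block_hash(index, prev_hash, transactions):
    """Hash a block's content together with the hash of its predecessor."""
    payload = json.dumps(
        {"index": index, "prev": prev_hash, "txs": transactions},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(batches):
    """Build a toy chain with one block per batch of transactions."""
    chain = []
    prev_hash = "0" * 64  # placeholder predecessor for the first block
    for index, transactions in enumerate(batches):
        digest = block_hash(index, prev_hash, transactions)
        chain.append({"index": index, "prev": prev_hash,
                      "txs": transactions, "hash": digest})
        prev_hash = digest
    return chain


def is_consistent(chain):
    """Check that each block matches its own hash and links to the previous one."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["prev"], block["txs"]):
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical transactions, purely for illustration.
chain = build_chain([["alice->bob:1"], ["bob->carol:1"], ["carol->dave:1"]])
honest_copy_ok = is_consistent(chain)

# Rewrite an old entry in one copy: its stored hash no longer matches its content.
tampered = [dict(block) for block in chain]
tampered[0] = dict(tampered[0], txs=["alice->mallory:1"])
tampered_copy_ok = is_consistent(tampered)
```

Any holder of an honest copy can run the same check, which is the sense in which a widely replicated ledger is hard to corrupt quietly.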

What about the protocol then? Well, the protocol can fail, like any piece of software. It certainly did in the past, and most likely, it will fail again in the future. Let’s push the case further, and imagine that instead of a simple glitch, someone manages to crack the protocol tomorrow: what would happen? Well, as it’s exactly what happened to Namecoin not too long ago, it’s not too hard to make a good guess. First, a corrupted blockchain would spread, wreaking havoc in the Bitcoin ecosystem. Exchange rates would drop by 90% overnight, and then exchanges would simply stop operating. Meanwhile, within hours after the emergence of the problem, community developers, possibly members of the Bitcoin Foundation, would start working on a fix.

Depending on the nature of the weakness found in Bitcoin, fixing the problem would take from a few hours to a few weeks. Considering the number of people involved, I fail to see why it would take much more than that. Indeed, Bitcoin is complex, but in the end, it’s not that complex, especially when compared to other popular open source projects such as Linux, Firefox or Open Office.

In the case of Namecoin, the terminal protocol bug was resolved in about 24h, and that’s Namecoin, an alt-coin with about 0.1% of the community traction of Bitcoin.

Then, once a solution is found, a new blockchain would be restarted from one of the many non-corrupted copies of the old blockchain still available. Depending on the depth of the problem, multiple and incompatible solutions might be proposed more or less at the same time by distinct developers. The market might even undergo a few competing solutions for a while, but then a “winner-takes-all” effect would quickly push into oblivion all solutions BUT the leading one. Within a few months (maybe less), the exchange rates would have returned to their previous levels.

It’s Bitcoin as a ledger that is truly antifragile. The other part, Bitcoin as a protocol, is fragile and is likely to be modified dozens of times over the next decade, each new version annihilating the previous version if the community consents to it.

If a massive protocol breach were to happen, many companies that are part of the Bitcoin ecosystem could go bust overnight: some exchanges might accumulate instant but terminal losses, a revised protocol could make former hardware designs incompatible, etc. The Bitcoin ledger itself is the only entity to be antifragile within the ecosystem, simply because many developers are personally vested in the preservation of this ledger.

Moreover, shocks do benefit Bitcoin:

  • Blockchain spam forced the community into making the protocol more resilient,
  • Major thefts, the rise and fall of Silk Road, helped Bitcoin to make the headlines,
  • The Cyprus crisis somewhat undermined trust in the euro, again in favor of Bitcoin,
  • Etc.

The next country printing its money into oblivion, the next bank failing with or without bail-out, the next country not to honor its debts … any of those events will further boost Bitcoin: not because Bitcoin will have succeeded at doing or succeeded at preventing anything, but merely because Bitcoin will have remained un-impacted.

In a way, betting on Bitcoin is betting on a degree of economic chaos for the years to come. A world of perfectly stable economies offering frictionless currencies does not need Bitcoin.

When an Unstoppable Force meets an Immovable Object

While many trading options have emerged for Bitcoin, exchanging national currencies for Bitcoin remains a convoluted exercise; and, I suspect that it will remain non-trivial for a while, possibly a long while.

Indeed, pretty much everything in the banking system has been built around the notion of reversible transactions: the money in the bank is yours, but only from a legal viewpoint. If a court decides that one of the transactions that originally funded your account was not legitimate, then the transaction can be reversed, and the money can change owner based on third-party interventions. With Bitcoin, ownership is a matter of knowledge. If you know the private key of a Bitcoin address, and if nobody else knows it, then you are the true owner of whatever Bitcoins this address has accumulated. It’s a very physical process, deeply uncaring of any legal considerations: no court order can recover a transaction made toward an address if the keys have been lost.

This aspect explains why it remains almost impossible to use a credit card to buy Bitcoins, and why considerable delays tend to be introduced by parties even when wire transfers are involved. Exchanging cash for Bitcoins feels like a more natural option though. A Bitcoin-to-Cash ATM is already available in Vancouver. However, I suspect that ATM owners are heading for friction. For any ATM model that takes off, bad guys will start buying ATMs for the sole purpose of reverse-engineering them, with ad-hoc counterfeit money printed for the sole purpose of fooling this specific type of ATM. Indeed, bad guys don’t need to produce quasi-perfect counterfeit bank notes, merely counterfeit notes good enough to fool this one machine – a much easier task.

Again, with regular ATMs, it’s a non-issue. If someone manages to stuff an ATM with counterfeit money, the bank will simply cancel the corresponding transaction later on when the misdeed is uncovered. The bank has full control over its ledger.

A store of value

When I discovered Bitcoin, I was inclined to think it would succeed because it made world-wide payment frictionless. Well, it’s certainly still part of the picture, but the more I observe the community, the more I believe it’s a positive but relatively marginal driving force.

Few people would dispute that the growth of Bitcoin has been essentially driven by speculative investments. Then, according to the Bitcoin community wisdom, many would also argue that the ecosystem will gradually transition from pure speculation to more mundane uses, hence justifying high anticipated conversion rates. However,

  • what if speculation stayed the dominant force not to be replaced by any other?
  • what if Bitcoin did not need any alternative force to maintain its value?

Indeed, a shared yet incorruptible ledger may offer a fantastic intrinsic value on its own, as it gives people the possibility to save value without trusting any designated third party – trusting instead the community as a whole.

Gold arguably offers the same benefit, but in practice, gold is an impractical medium to make any payment; and, as a result, any gold transaction starts by converting the gold back to a local currency.

Then, why should trusting a designated third party be a problem, one might ask? Well, most currencies are simply not managed in the interest of the currency holders. China, Brazil, Russia and Argentina probably come top of the list here because of their respective size, but they are far from being the worst offenders. Then, even dollars, euros and yen are hardly managed in the best interest of currency holders.

Here, Bitcoin benefits from an ancient social pattern called Gresham's law. According to Wikipedia:

Gresham's law is an economic principle that states: "When a government overvalues one type of money and undervalues another, the undervalued money will leave the country or disappear from circulation into hoards, while the overvalued money will flood into circulation." It is commonly stated as: "Bad money drives out good".

This law has been quoted many times about Bitcoin, but its consequences are usually misunderstood. Many detractors argue: "People are just hoarding Bitcoin, instead of spending them, which will be the downfall of Bitcoin." This observation is partial, and I believe that the conclusion is incorrect too. Note that Bitcoin can still fail, but not because of this (see above).

A more accurate observation would be: "Many, if not most, are hoarding Bitcoins until they have an actual need to spend them. Meanwhile, those people just keep spending whatever non-Bitcoin currency they have." This behavior exactly fits Gresham’s law, but what does it imply for Bitcoin?

First, merchants should not expect too many people rushing to spend their Bitcoins. Most people will keep spending their non-Bitcoin currency as long as they can. However, as accepting Bitcoins is an inexpensive option, there is little downside in accepting Bitcoins - especially if Bitcoins are immediately converted to the local currency. Second, the more people keep their coins, the more the exchange rate will rise, due to simple market mechanics; thus, actually preserving the value storage property of Bitcoin.

At this point, detractors would argue that if there are few exchanges through Bitcoin and if it’s only about hoarding something that has no real value, how could this something be worth anything? This brings me back to the ledger (i.e. the blockchain). The one distinctive innovation brought by Satoshi Nakamoto is to make the world realize that a fully decentralized and yet incorruptible ledger was possible. The Bitcoin ledger is unique and it is what gives Bitcoin its value.

What people really own when owning Bitcoins is a quantified amount of favors that could be given back by any member of the community, as long as community interest has not faded; and it can be a valuable privilege – hence, not needing further benefits to justify the value.

Alt-coins will drive the evolution of Bitcoin

As an asset, what is the value of the Bitcoin protocol? Well, zero. Anybody can fork the source code; almost 2000 already did. Anybody can restart an alt-coin variant; dozens already did. While Bitcoin can arguably be deemed invaluable to mankind, the protocol itself has zero market value: nobody makes money by selling the protocol.

The market value is in the ledger and only in the ledger, and this is why alt-coins are unlikely to gain any significant market value: they recycle the bulk of the Bitcoin protocol (the value-less part) while ditching the blockchain (the valuable part).

Namecoin is barely an alt-coin, because it addresses a very different problem; and that’s precisely because it does not compete with Bitcoin that it managed to gain traction.

Nevertheless, alt-coins represent an incredible opportunity for Bitcoin. Through experiments with alternative approaches, alt-coins are producing the knowledge that will make Bitcoin more secure, more usable, leaner, etc. Alt-coins, by being fragile experiments, directly help Bitcoin in becoming more antifragile.

For example, Zerocoin brings an unprecedented level of anonymity in transactions by introducing rocket-science zero-knowledge cryptography in the protocol. From the Bitcoin perspective, there is absolutely no need to rush to import Zerocoin into the protocol. After all, Bitcoin has been thriving without it so far. It’s much more reasonable to remain a passive observer for a (long) while, to let Zerocoin take all the bullets as bugs and flaws are uncovered, to let the Zerocoin community patiently address performance issues; and then, once Zerocoin has fully matured, to upgrade the Bitcoin protocol leveraging all this hard-won knowledge.

Thus, from a currency holder perspective, it means that alt-coins are doomed with high probability, because they won’t be able to preserve any technological advantage over time, bringing the competition back to a competition between ledgers where Bitcoin will only grow stronger over time.

Preserving Bitcoins

Since Bitcoin is about storing value, foolproof ways to secure Bitcoins are a critical ingredient. Two years ago, I was already indicating that this challenge was not specific to Bitcoin: it’s just incredibly convoluted to operate a computing environment that you can fully trust. Long story short: you need air gaps, but it’s harder than it looks.

Furthermore, the overall amount of trust that people should have in their computing devices - notebooks, phones, servers in the cloud – has rather gone downward since the Snowden revelations. Thus, I am inclined to think that many successful ventures of the end-user stage will be Bitcoin appliances, that is, hardware devices designed for the sole purpose of dealing with Bitcoins. The Bitcoin Card and Trezor are both promising appliances, and I suspect there is room for a lot more contenders in this market.

Indeed, as most people invest in Bitcoins, it’s fairly reasonable to assume that most of those people will be inclined to spend a bit more to secure their investment.

The widespread availability of Bitcoin appliances that have gained the trust of the community will be the sign that the end-user stage of Bitcoin is taken care of.

Annex: More technical considerations

Instant transactions are coming without much effort. It takes half a dozen blocks to gain absolute confidence in a Bitcoin transaction, which means about 1h of delay. Many people see this aspect as a design failure, which prevents most live payment scenarios. However, if one is OK with relaxing the constraint from absolute confidence to quasi-absolute, then instant transactions can be made very secure, arguably a lot more secure than credit card transactions (because of chargebacks). All it takes is an online service that aggressively spreads the transaction over the network while at the same time aggressively monitoring any double-spend attempt. Such a service does not exist yet, but it’s not the most pressing issue for Bitcoin either.
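
To put rough numbers on the "half a dozen blocks" figure, the sketch below reproduces the simplified race calculation from Section 11 of the original Bitcoin whitepaper: the probability that an attacker holding a share q of the hashing power ever catches up from z blocks behind. The 10% attacker share is an assumption picked for the illustration, and so is the 10-minute figure used for the delay, which is Bitcoin's target block interval rather than a guarantee.

```python
import math


def attacker_success_probability(q, z):
    """Chance that an attacker with hash-power share q catches up from z blocks behind."""
    p = 1.0 - q  # share of hashing power held by honest miners
    if q >= p:
        return 1.0  # a majority attacker always catches up eventually
    lam = z * (q / p)  # expected attacker progress while z honest blocks are found
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total


BLOCK_INTERVAL_MINUTES = 10  # target interval between blocks
confirmations = 6
delay_minutes = confirmations * BLOCK_INTERVAL_MINUTES  # 60 minutes

# Assumed attacker share of 10%, for depths 0 to 6.
risk_by_depth = {z: attacker_success_probability(0.10, z) for z in range(7)}
```

Under these assumptions the risk falls from certainty at zero confirmations to a few chances in ten thousand at six, which is why relaxing "absolute" to "quasi-absolute" buys so much time.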

Scalability is a very addressable concern. Scalability is frequently presented as a core design flaw, that is, if Bitcoin starts gaining traction, it will fail because it won’t be scalable enough. (Disclaimer: argument from authority) My own experience in teaching distributed computing and tackling Big Data projects for years indicates that scalability is never a terminal problem. Scalability problems are straightforward problems merely needing patience and dedication to be solved. Furthermore, many developers just love tackling scalability challenges well beyond market needs. That part of Bitcoin is probably very safe.


Thinking Big Data for commerce

As of October 2013, Google Trends indicates that the buzz around Big Data is still growing. Based on my observations of many service companies (mostly retailers though), I believe that it's not all hype, and that indeed Big Data is going to deeply transform those businesses. However, I also believe that Big Data is wildly misunderstood by most, and that most Big Data vendors should not be trusted.

Search volume on the term "Big Data" as reported by Google Trends

A mechanization of the mind based on data

Big Data, like many technological viewpoints, is both quite new and very ancient. It's a movement deeply rooted in the perspective of mechanization of the human mind, which started decades ago. The Big Data viewpoint states that data can be used as the source of knowledge to produce automated decisions.

This viewpoint is very different from Business Intelligence or Data Mining, because humans get kicked out of the loop altogether. Indeed, producing millions of numbers a day costs almost nothing; however, if you need people to read those numbers, the costs are fantastic. Big Data is mechanization: the numbers produced are decisions, and nobody is needed to interfere at the lowest level to get things done.

The archetype of the Big Data application is the spam filter. First, it's ambient software, taking important decisions all the time: deciding for me which message is not worth my time to read certainly is an important decision. Second, it has been built based on the analysis of a lot of data, mostly by establishing databases of messages labeled as spam or not-spam. Finally, it requires almost zero contribution from its end-user to deliver its benefits.
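
As an illustration of a decision process bootstrapped from labeled messages, here is a minimal spam filter sketch in Python. Naive Bayes is my choice for the example, not a claim about how any production filter works, and the four training messages are made up. The point is only that the output is a decision (spam or not) rather than a report for somebody to read.

```python
import math
from collections import Counter


def train(labeled_messages):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(message_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Prior: share of training messages carrying this label.
        score = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
training_set = [
    ("win money now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, message_counts = train(training_set)
decision_1 = classify("win cheap money", word_counts, message_counts)
decision_2 = classify("monday meeting", word_counts, message_counts)
```

Once trained, the filter needs nothing from the end-user: every incoming message gets its decision automatically.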

Not every decision is eligible for Big Data

Decisions that are eligible for Big Data processing are everywhere:

  • Choosing the next most profitable prospect to send the paper catalog.
  • Choosing the quantity of goods to replenish from the supplier.
  • Choosing the price of an item in a given store.

Yet, such a decision needs two key ingredients:

  1. A large number of very similar decisions are taken all the time.
  2. Relevant data exist to bootstrap an automated decision process.

Without No1, it's not cost-efficient to even tackle the problem from a Big Data viewpoint; Business Intelligence is better suited. Without No2, it's only rule-based automation.
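
The two ingredients boil down to a small decision rule, sketched below. The function name and its boolean inputs are hypothetical, and checking ingredient No1 before No2 is my own ordering for the case where both are missing.

```python
def suggested_approach(many_similar_decisions, relevant_data_exists):
    """Map the two ingredients to the approach suggested in the text."""
    if not many_similar_decisions:
        # Without No1, automating is not cost-efficient.
        return "Business Intelligence"
    if not relevant_data_exists:
        # Without No2, there is nothing to learn from.
        return "rule-based automation"
    return "Big Data"


# Example: store-level pricing, decided thousands of times with sales history at hand.
pricing = suggested_approach(many_similar_decisions=True, relevant_data_exists=True)
# Example: a one-off strategic decision.
one_off = suggested_approach(many_similar_decisions=False, relevant_data_exists=True)
```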

Choosing the problem first

The worst mistake that a company can make when starting a Big Data project consists of choosing the solution before choosing the problem. As of fall 2013, Hadoop is the worst offender. I am not saying that there is anything wrong with Hadoop per se; however, choosing Hadoop without even checking that it's a relevant solution is a costly mistake. My own experience with service companies (commerce, hospitality, healthcare ...) indicates that, for those companies, you hardly ever need any kind of distributed framework.

The one thing you need to start a Big Data project is a business problem that would be highly profitable to solve:

  • Suffering from too high or too low stock levels.
  • Not offering the right deal to the right person.
  • Not having the right price for the right location.

The ROI is driven by the manpower saved by the automation, and by getting better decisions. Better decisions can be achieved because the machine can be made smarter than a human with less than 5s of brain-time per individual decision. Indeed, in service companies, productivity targets imply that decisions have to be made fast.

Your average supermarket typically carries about 20,000 references, that is, roughly 330 hours of work if employees spend 1 minute per reordered quantity. In practice, when reorders are made manually, employees can only afford a few seconds per reference.
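
The arithmetic behind that figure is short enough to spell out. The sketch below uses the 20,000 references and the 1 minute per reference from the text, plus the 5 seconds of brain-time mentioned earlier; the function name is mine.

```python
REFERENCES = 20_000  # references in an average supermarket, as stated above


def review_hours(seconds_per_reference, references=REFERENCES):
    """Total hours needed to review every reference once."""
    return references * seconds_per_reference / 3600


hours_at_one_minute = review_hours(60)   # about 333 hours
hours_at_five_seconds = review_hours(5)  # about 28 hours
```

Even at 5 seconds per reference, a full pass over the assortment still takes more than three working days.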

Employees are not going to like Big Data

Big Data is a mechanization process. And it's usually easy to spot the real thing just by looking at the internal changes caused by the project. Big Data is not BI (Business Intelligence), where employees/managers are offered a new gizmo, changing nothing about the status quo. The first effect of a Big Data project is typically to reduce the number of employees (sometimes massively) required to process a certain type of decision.

Don't expect teams soon-to-be-replaced-by-machines to be overjoyed by such a perspective. At best, upper management can expect passive resistance from their organization. Across dozens of companies, I don't think I have observed, among service companies, any Big Data project succeeding without direct involvement from the CEO herself. That's unfortunately the steepest cost of Big Data.

The roots of Big Data

The concept of Big Data emerged only in 2012 among mainstream media, but its roots are much older. Big Data results from 3 vectors of innovation:

  • Better computing infrastructures to move data around, to process data, to store data. The latest instance of this trend is cloud computing.
  • Algorithms and statistics. Over the last 15 years, the domain of statistical learning has exploded (driverless cars are a testament to that).
  • Enterprise digitalization. Most (large) companies have completed their digitalization process where each key business operation has its counterpart digital record.

Big Data is first a result of the mix of those 3 ingredients.

Big Data = Big Budget?

Enterprise vendors are now chanting their new motto big data = big budget, and you can't afford not to take the Big Data train, right? Hadoop, SAP Hana, Oracle Exalytics, to name a few, are all going to cost you a small fortune when looking at TCO (total cost of ownership).

For example, when I ask my clients "how much does it cost to store 24TB of data on disk?", most of the answers come in above 10,000€ per month. Well, OVH (hosting company) is offering 24TB servers from 110€ / month. Granted, this is not highly redundant storage, but at this price, you can afford a few spares.
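
The gap between those two monthly figures is worth spelling out. The sketch below uses only the 10,000€ and 110€ quoted above; keeping two spare servers is an assumption picked for the illustration.

```python
CLIENT_ESTIMATE_EUR = 10_000  # typical client answer, per month
OVH_SERVER_EUR = 110          # one 24TB server, per month

price_ratio = CLIENT_ESTIMATE_EUR / OVH_SERVER_EUR           # about 91x
servers_for_same_budget = CLIENT_ESTIMATE_EUR // OVH_SERVER_EUR  # 90 servers

SPARES = 2  # assumed number of spare servers
cost_with_spares = OVH_SERVER_EUR * (1 + SPARES)  # 330 euros per month
```

Even with two full spares, the monthly bill stays around thirty times below the typical estimate.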

Then, the situation looks even more absurd when one realizes that storing 1 year of receipts of Walmart - the largest retailer world-wide - can fit on a USB key. Unless you want to process images or videos, very few datasets cannot be made to fit on a USB key.

Here, there is a media-induced problem: there are about 50 companies world-wide who have web-scale computing requirements, such as Google, Microsoft, Facebook, Apple, Amazon to name a few. Most Big Data frameworks originate from those companies: Hadoop comes from Yahoo, Cassandra comes from Facebook, Storm is now at Twitter, etc. However, all those companies have in common that they process roughly 1000x more data than the largest retailers.

The primary cost of a Big Data project is the focus that top management needs to invest. Indeed, focus comes with a strong opportunity cost: while the CEO is busy thinking how to transform her company with Big Data, fewer decisions can be made on other pressing matters.

Iterations and productivity

A big data solution does not survive its first contact with data.

Big Data is a very iterative process. First, the qualification of data (see below) is an extremely iterative process. Second, tuning the logic to obtain acceptable results (the quality of the decisions) is also iterative. Third, tuning the performance of the system is, again, iterative. Expecting a success at first try is heading for failure, unless you tackle a commoditized problem, like spam filtering, with a commoditized solution, like Akismet.

Since many iterations are unavoidable, the cost of the Big Data project is strongly impacted by the productivity of the people executing the project. In my experience teaching distributed computing at the Computer Science Department of the Ecole normale supérieure, if the data can be kept on a regular 1000€ workstation, the productivity of any developer is tenfold higher compared to a situation where the data has to be distributed - no matter how many frameworks and toolkits are thrown at the problem.

That's why it's critical to keep things lean and simple for as long as possible. Doing so gives your company an edge against all your competitors who will tar pit themselves with Big Stuff.

Premature optimization is the root of all evil. Donald Knuth. 1974

Qualifying the data is hard

The most widely underestimated challenge in Big Data is the qualification of the data. Enterprise data has never been created with the intent to feed some statistical decision-taking processes. Data exist only as the by-product of the software operating the company. Thus, enterprise data is (nearly) always full of subtle artifacts that need to be carefully addressed or mitigated.

The primary purpose of a point of sale is to let people pay; NOT to produce historical sales records. If a barcode has become unreadable, then many cashiers might just scan twice another item that happens to have the same price as the non-scannable one. Such a practice is horrifying from a data analysis perspective, but looking at the primary goal (i.e. letting people pay), it's not unreasonable business-wise.

One of the most frequent antipatterns observed in Big Data projects is the lack of involvement of the non-IT teams with all the technicalities involved. This is a critical mistake. Non-IT teams need to tackle hands-on data problems because solutions only come from a deep understanding of the business.


Big Data is too important to be discarded as an IT problem, it's first and foremost a business modernization challenge.


A buyer’s guide for enterprise software

Through a series of Big Data consulting missions that were overlapping with the entire IT landscape, data being all over the place, I have observed the software purchasing processes of many large companies. Being also an enterprise software vendor myself, I have been baffled countless times by broken buying processes that lead smart people to routinely choose about the worst price-quality ratio that the market has to offer.

In this post, I am trying to gather a survival kit for buying enterprise software. By enterprise, I refer to companies with over 1,000 employees.

Get rid of Requests for Quotes (RFQ)

The rationale goes like this: let’s write down all our requirements, then, we identify all suitable vendors, send them the RFQ, collect the quotes, review demos, and finally select the best option. Fairly reasonable, and yet, as far as software is concerned, deeply dysfunctional.

Why? Because writing an accurate specification for the software you need is harder than writing the software itself. It’s somehow the Heisenberg uncertainty principle transposed to software. Just try to write an RFQ for a webmail (aka GMail) if you’re not convinced. Specifying the fine print would already take hundreds of pages, and that’s only a webmail. Enterprise software tends to be a lot more complex than webmail.

As anecdotal evidence, writing and managing an RFQ is so time consuming and complex that many companies throw consultants at the problem, increasing costs even further.

Then, when the whole process is in place, guess what happens? All decent vendors walk away. Indeed, if you have a decent product that sells by itself (think Microsoft Excel), why would you bother paying an army of account managers to walk through broken RFQs? In the end, the only vendors left are either the ones so outrageously expensive that they can afford whatever it takes, or the ones selling crappy products that would never get sold without the active contribution of dysfunctional (yet widespread) processes.

My advice: Take your favorite search engine and forge yourself an opinion. It’s easier than you think, and in less than 3h, you're likely to have already a convincing shortlist of vendors. A few tips:

  • If all you find about a piece of software is happy talk, the software is worthless or vaporware.
  • If you can’t find screenshots of the software, then it means that the user interface, and probably the user experience as well, is abysmal.
  • If the web documentation is awful or absent, then whatever private documentation exists, it won’t be better.
  • If there is no public pricing, then, you will face somebody who’s professionally trained in reverse engineering the exact budget you have. That's going to be a time sink.
  • If the company does open source, then bonus points. They are not afraid that other people might have a look at the software code they produce.

Customizing Off-the-shelf Software is deadly

Joel Spolsky stated that it takes 10 years to write good software, and looking at the development curve of my own company I tend to agree. Well, it took us 5 years to realize that we were not even solving the correct problem (but this will be the subject for another post).

Software is (mostly) a take it or leave it business. Yes, you can request tiny adjustments, but asking for anything substantial is like swimming in molasses, mostly because of hidden costs. Generic upgrades won't work, support staff will be incompetent, design balance will be thrown out of whack, etc. As a result, good software companies, where people truly care and dedicate a good portion of their lifespan in carefully crafting truly valuable products, will actually decline such requests.

In contrast, vendors who will gladly accept customization requests are the ones putting little value on the integrity of their product, which has already degenerated into some byzantine architecture through the acceptance of disparate customization requests over time.

My advice: choose your battles wisely. If a feature gives you an edge against the competition, then internalize it and treat it as a core asset. Otherwise, it’s easier and cheaper to adjust your own organization to whatever software reflects the dominant practice - as long as the new practice offers a tangible improvement over the old one.

Bargaining over the price is against your interest

Bill Gates said that you don’t get what you deserve: you get what you negotiate. True enough but you should wisely choose what you want to negotiate. Negotiating over the software pricing is enormously expensive for the vending party. It takes a highly talented workforce, with employees both technically savvy and yet having all the smooth skills it takes to interact with large organizations. In most Western countries, the yearly total cost of a seasoned account manager is above $150k.

When you start bargaining with software companies, you also start an adverse selection process. Only the companies willing to afford an expensive sales force remain while others walk away. The vendors who stay are the ones where the business model is geared toward an ever increasing sales force. However, you want to put your money on a vendor that invests the bulk of her revenue in developing good software, not funding an army of people that specializes in reverse engineering budgets.

My advice: negotiating the price downward is usually a dead end. If it's out of budget, then discard the vendor and move to the next one.

Instead of focusing on price, you should try to capture as much attention as you can from the core development team. Since you are an enterprise (hence a large prospect), you already have an edge here. For example, you can negotiate a small case study against a series of meetings with whoever is in charge of the core product development.

By making sure that the people in charge of the product are familiar with your business, you vastly improve the odds that future developments will be aligned with your company's needs. Furthermore, you won’t even need to fund or think about those enhancements: the people you’ve met will do this free of charge, because it’s what engineers naturally do when learning about their clients' problems.

More is less and stay clear of platforms

The inevitable corollary of RFQ is that the more advertised features the better. Furthermore, the impression is amplified by the vendors themselves who promote an ongoing stream of new features to sustain a form of recurring business (even for SaaS).

In the context of enterprise software, more features raise a very specific problem: soon enough supposedly distinct products start to overlap.

  • Both CRM and CMS (Content Management System) want to capture web leads.
  • Both the inventory management system and the accounting system want to manage suppliers.
  • Both BI and Web Analytics want to analyze the sales channels.
  • ...

Overlapping is deadly because data flows within the IT landscape start to look like the Tokyo subway map. Indeed, each time, on both ends of the overlap, divisions want to use their software, and consequently IT gets forced into moving the data around, struggling with inconsistent domain models.

Platforms are the worst offender here, because platforms are bound to overlap with practically everything else in the company. Worse, platforms create rampant dependencies making it quasi-impossible to get rid of the platform vendor later on.

My advice: favor highly focused apps over jack-of-all-trades apps. If you get it wrong, it will fail fast, and you will have ample opportunity to try again. Most enterprises incorrectly think that managing many vendors is a problem, hence favoring Big Systems. However, dealing with hundreds of apps is quasi-painless - well, for a large company - as long as apps remain decoupled, and as long as you’re not bargaining with an army of vendors. In contrast, it only takes two overlapping platforms to create an IT integration nightmare.

Time is of the essence

In the world of enterprise software, vendor lock-in is all over the place; and yet, all the enterprises I met supposedly had everything in place to avoid those lock-ins: contractual reversibility (to be able to revert back to the previous system), contractual migration support (to be able to move forward to a new system), favorable termination clauses, etc. On the surface, extensive safeguards were in place against any vendor lock-in.

How could the reality be so different from the theory? It’s because time is the strongest and most widely underestimated vendor lock-in mechanism. What’s the point of being able to terminate a contract any day if it takes years to disentangle operations from the vendor? As a rule of thumb, phasing out an enterprise vendor takes about 3x as long as phasing the vendor in; and if it’s a platform, then you can reasonably consider your company locked on the platform forever.

Indeed, don’t expect vendors to be overly motivated by the prospect of having their products phased out. At best, the legacy vendor will be slow but ultimately responsive to whatever problems necessarily arise during the transition. But that's the best case ...

My advice: Internet was a revolution. SaaS was a revolution. Cloud computing was a revolution. Mobile internet was a revolution. Software is a fast-paced industry, even more than consumer electronics. Would you buy a tablet and consider it an asset to be amortized over the next decade? Certainly not, and it’s the same for enterprise software. No matter what software you are considering, if you can’t roll out in a matter of weeks, then move on to the next vendor. Otherwise, by the time your company is done with its acquisition, the software industry will have moved on to its next revolution, and you will be left with a freshly obsolete technology.

Your data is your DNA, take ownership

There is one area where COTS (commercial off-the-shelf software) works poorly for large enterprises: it’s the orchestration of the data. By orchestration of the data, I am not referring to databases, ESB (Enterprise Service Bus) or equivalent low-level generic data systems; I am referring to the layer on top that unifies the IT landscape. This layer typically involves a mix of people and some middleware to carry on with the changes.

Indeed, COTS necessarily carry strong domain models, that is, the abstract software representation adopted to model the business itself. However, there is (almost) not a single chance that any of those predesigned models would fit a large company made of mergers, acquisitions, restructurings and possibly somewhat heterogeneous branches. Some software, notoriously SAP, can be made to fit the domain of practically any large company, but there are so many developments involved that it hardly counts as off-the-shelf software.

My advice: if there is one area where every large company should have a small team of software developers, it’s its own private data platform: an entity dedicated to the collection and the servicing of all the data generated within the IT landscape. Here I strongly favor in-house software developers, because business data is always core business, no matter what your company is doing.

Furthermore, for most enterprises, a small tiger team of software developers in charge of the data would typically vastly reduce IT spending compared to vast teams of recycled IT workforce. Indeed, great control over your data grants you the capability to swiftly phase software in and out, and speed is decisive. In practice, it does not take many people, but it takes talented people.


8 tips to turn your Big Data into Small Data

Hectic times. Looking at the last entry, I realize it has been half a year already since my last post.

The more Big Data projects I do, the more I realize how irrelevant scalability aspects usually are for business projects, to the point that the quasi-totality of the valuable data crunching processes could actually be run on a smartphone if the proper approaches are taken. Obviously there is no point in actually doing the analysis on a smartphone; this merely illustrates that it really does not take much computational power.

While all vendors boast of being able to crunch terabytes of data, it turns out that it's very rare to even face a dataset bigger than 100MB when properly represented in memory. The catch is that between a fine-tuned data representation and a verbose representation - say XML or SQL - there is typically a factor of 100x to 1000x involved as far as the data footprint is concerned.

The simplest way to deal with Big Data is to turn it into Small Data. Let's review a couple of handy tricks frequently used at Lokad to compress data.

1. Get rid of everything that is not required

While this might seem obvious, whenever we tackle a Big Data problem, we typically start by ditching about 90% of the data, which is not even required for the task at hand. Frequently, this covers unused fields and segments of the data that can be safely excluded from the analysis.

2. Turn dates into 16-bit integers when the time is not needed

A date-time is represented as an 8-byte data structure in most languages. Yet, a single unsigned 16-bit integer gives you 65536 combinations, that is, enough to cover 179 years of daily increments, which usually proves sufficient. That's a 4x memory saving.
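As a minimal sketch of this trick (Python is used here purely for illustration; the reference date `EPOCH` and the function names are assumptions, not code taken from Lokad), each date is stored as a number of days elapsed since a fixed reference date:

```python
from array import array
from datetime import date, timedelta

# Hypothetical reference date: any date earlier than the oldest record works.
EPOCH = date(2000, 1, 1)

def encode_day(d):
    """Return the number of days since EPOCH, checked to fit in 16 bits."""
    offset = (d - EPOCH).days
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("date outside the range covered by 16 bits")
    return offset

def decode_day(offset):
    """Inverse of encode_day."""
    return EPOCH + timedelta(days=offset)

# array("H") stores unsigned 16-bit integers contiguously (2 bytes each
# on common platforms).
order_days = array("H", (encode_day(d) for d in [date(2013, 6, 1), date(2013, 6, 2)]))
```

The range check matters: a date before the reference date, or more than 65535 days after it, must be rejected rather than silently wrapped around.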

3. Turn 8-byte floating point values into 4-byte or even 2-byte values

Whenever money is involved, businesses rely on 8-byte or even 16-byte floating point values. However, from a statistical viewpoint, such precision typically makes little sense: it's like computing everything in grams, only to round the final result up to the next ton. The 2-byte precision, aka the half-precision floating point format, is sufficient to accurately represent the price of most consumer goods for example. That's a 4x memory saving.
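To get a feel for what is lost, here is a small sketch using the half-precision format code `"e"` of Python's standard `struct` module (the helper names and the sample price are assumptions made for the illustration). Half precision keeps roughly three significant decimal digits and tops out at 65504, so whether it is acceptable depends on the analysis at hand:

```python
import struct

def to_half(value):
    """Pack a float into 2 bytes (IEEE 754 half precision, little-endian)."""
    return struct.pack("<e", value)

def from_half(raw):
    """Unpack 2 bytes of half precision back into a Python float."""
    return struct.unpack("<e", raw)[0]

price = 19.99  # hypothetical consumer price
restored = from_half(to_half(price))  # 19.984375, the nearest half-precision value
relative_error = abs(restored - price) / price
```

For values in the normal range of the format, the relative rounding error stays below 2^-11, that is, about 0.05%.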

4. Replace strings by keys with lookup tables

Lookup tables are an extremely simple and fast data structure. Depending on the situation, you can typically use lookups to replace fields that contain strings with many repeated occurrences. Your mileage may vary (YMMV) but lookups, when applicable, frequently bring a 10x memory saving.
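A minimal sketch of the idea, assuming a column of repeated strings such as country codes (the function name and the sample data are hypothetical): every distinct string is stored once, and the column itself becomes an array of small integer keys.

```python
from array import array

def encode_with_lookup(values):
    """Replace each string by a 16-bit key into a table of distinct strings."""
    lookup = []        # key -> string
    index = {}         # string -> key, only needed while encoding
    keys = array("H")  # unsigned 16-bit keys: at most 65536 distinct strings
    for value in values:
        key = index.get(value)
        if key is None:
            key = len(lookup)
            index[value] = key
            lookup.append(value)
        keys.append(key)  # raises OverflowError past 65535 distinct strings
    return lookup, keys

countries = ["FR", "US", "FR", "FR", "DE", "US"]
lookup, keys = encode_with_lookup(countries)
decoded = [lookup[k] for k in keys]
```

The saving grows with the length of the strings and with the number of repetitions; a column of mostly unique strings gains nothing from this.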

5. Get rid of objects, use value types instead

Objects (as in C# objects or Java objects) are very handy, but unfortunately, they come with a significant memory overhead, typically 16 bytes per object in 64-bit environments, that is, the default situation nowadays. To avoid this overhead, you need to use value types (aka struct, unfortunately not available in Java). Value types usually bring a 2x memory saving.
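Python has no value types in the C# sense, but the memory layout they give can be imitated with the standard `struct` module. The sketch below is only an illustration under assumed field choices (a 16-bit day offset, a 16-bit product key and a half-precision price): each record takes exactly 6 bytes in one contiguous buffer, with no per-record object header.

```python
import struct

# Hypothetical record layout: day offset (uint16), product key (uint16),
# price (half-precision float). "<" disables padding, so 2 + 2 + 2 = 6 bytes.
RECORD = struct.Struct("<HHe")

def pack_records(rows):
    """Store all records back to back in a single bytearray."""
    buffer = bytearray(RECORD.size * len(rows))
    for i, row in enumerate(rows):
        RECORD.pack_into(buffer, i * RECORD.size, *row)
    return buffer

def read_record(buffer, i):
    """Read the i-th record back as a (day, product_key, price) tuple."""
    return RECORD.unpack_from(buffer, i * RECORD.size)

rows = [(4900, 0, 19.5), (4901, 1, 7.25)]
buffer = pack_records(rows)
```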

6. Use plain arrays not "smart" collections

Most modern languages emphasize collections such as dynamic arrays; however, those collections are far from being as memory-efficient as plain old arrays. YMMV but arrays over collections frequently bring a 2x memory saving.
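The same gap can be observed in Python, where the built-in `list` plays the role of the "smart" collection and the standard `array` module gives a plain typed array. This is a rough sketch: the figures returned by `sys.getsizeof` are specific to CPython, and the resulting ratio should not be read as the 2x figure quoted above for other languages.

```python
import sys
from array import array

quantities_list = list(range(1000))
quantities_array = array("i", quantities_list)  # contiguous C ints

# A list holds one pointer per element, plus one integer object per element.
list_bytes = sys.getsizeof(quantities_list) + sum(
    sys.getsizeof(q) for q in quantities_list
)
# An array holds the raw values only, plus a small fixed header.
array_bytes = sys.getsizeof(quantities_array)
```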

7. Use variable length encoding

The variable length encoding represents a simple compression pattern favoring small values over large values. This technique is especially useful when the original dataset is preprocessed to reassign the identifiers based on their usage frequency; i.e. allocating integers by decreasing frequency. YMMV depending on the actual distribution of identifiers in the dataset, but this typically grants a 4x memory saving.
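One common variable length scheme stores 7 bits of payload per byte and uses the high bit as a "more bytes follow" flag, so that values below 128 take a single byte. The sketch below assumes that scheme (the text above does not say which encoding is used at Lokad) and combines it with the frequency-based reassignment of identifiers; all names and sample identifiers are hypothetical.

```python
from collections import Counter

def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varints(data):
    """Decode a concatenation of varints back into a list of integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values

def rank_by_frequency(identifiers):
    """Map each identifier to its rank, the most frequent one getting 0."""
    ordered = Counter(identifiers).most_common()
    return {ident: rank for rank, (ident, _) in enumerate(ordered)}

visitor_ids = [900001, 900001, 900001, 42, 900001, 42, 7777]
ranks = rank_by_frequency(visitor_ids)
encoded = b"".join(encode_varint(ranks[v]) for v in visitor_ids)
```

Here every reassigned identifier fits in one byte, whereas the original identifier 900001 alone would need three bytes with the same encoding.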

8. Vectorize listings when possible

Much of the data represented as listings in the original relational representation can be vectorized somehow. For example, if I am interested in analyzing the return frequency of a web visitor over the last 6 months on a given website, a bit array of 184 bits (aka 23 bytes) already provides boolean flags of visits for every day over the last 6 months. When applicable, this typically grants a 10x memory saving.
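The visitor example can be sketched as follows, assuming the visits arrive as a list of dates and the window starts on a known day (function names and sample dates are hypothetical):

```python
from datetime import date

WINDOW_DAYS = 184  # about 6 months, as in the example above

def visits_to_bits(window_start, visit_days):
    """Turn a listing of visit dates into a 23-byte bit array, one bit per day."""
    bits = bytearray((WINDOW_DAYS + 7) // 8)
    for day in visit_days:
        offset = (day - window_start).days
        if 0 <= offset < WINDOW_DAYS:  # visits outside the window are ignored
            bits[offset // 8] |= 1 << (offset % 8)
    return bits

def visited(bits, offset):
    """True if the visitor came on the day at the given offset."""
    return bool(bits[offset // 8] & (1 << (offset % 8)))

def visit_count(bits):
    """Number of distinct days with at least one visit."""
    return sum(bin(byte).count("1") for byte in bits)

start = date(2013, 1, 1)
bits = visits_to_bits(
    start, [date(2013, 1, 1), date(2013, 1, 9), date(2013, 1, 9), date(2013, 7, 3)]
)
```

Several visits on the same day collapse into a single bit, which is fine when the question is about return frequency per day but loses information if the number of visits per day matters.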


Big Data: choosing the problem before choosing the solution

My company has started several important big data missions, and I am taking the opportunity here to publish some insights that are relevant to all those initiatives.

A major (and frequent) pitfall of Big Data projects consists of starting with a solution instead of starting with a problem. In particular, software vendors (Lokad included) are pushing their own Big Data recipe which will randomly involve:

  • Hadoop
  • HBase
  • Amazon EC2
  • Cassandra
  • Windows Azure
  • Storm
  • Node.js
  • ...

However, the notion of "Big" data is very relative: cheap 1TB hard-drives are now available at your nearest supermarket, and very few problems faced by companies, even very large ones, require more than 100 GB of data to process.

Usually, even the largest data sources of the largest companies do fit on a smartphone when properly represented. 

Impedance mismatch of BIG frameworks

The performance achieved by well-known Big Data frameworks is mind-blowing: Facebook claims to process 100PB of data over Hadoop. That's massive, and massively impressive as well.

However, before jumping on Hadoop (or any similar Big Data framework), one has to really estimate the friction costs involved. While Hadoop is certainly simpler than say MPI, it's still a complicated distributed framework which requires a lot of skill to be properly and efficiently operated.

If the very same goal can be achieved on a single machine within a very acceptable timeframe, then, in my experience, the dumb solution is going to be about 100x cheaper (*) and easier to run and to maintain compared to the "distributed" variant.

(*) I am not referring to hardware costs, but to wetware costs (aka people) which represent 99% of the cost anyway for virtually every company, minus a few social networks and search engines.

The untold story about Hadoop (and its peers) is that it works if, and only if, the data is very meticulously organized to be made suitable for processing through the framework. If the data is incorrectly partitioned, then Hadoop plus thousands of servers is no faster than a single machine.

Enterprise Big Data starts at 100MB

Facebook is facing Petabytes of data, that's millions of Gigabytes, but is really your company facing that much data? Do you need to plug that much data in to solve the problem at hand? Unless you work for a short list of about 100 companies on Earth, I seriously doubt it.

I observe that for most enterprises, "Big Data" starts at 100MB when:

  • Excel is no longer a solution.
  • SQL is no longer a solution (*)

(*) Yes, you can have a lot more than 100MB in a SQL database. However, reading the entire dataset through SQL needs to be done with care to avoid re-scanning the data thousands of times. In practice, in 90% of the data crunching situations, I observe that it's easier to remove the SQL database than to improve the performance of the queries over the relational database.
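To illustrate what "with care" can mean, here is a sketch that reads a table exactly once and does the crunching in memory, instead of issuing one query per product. The `sales` table, its columns and the in-memory SQLite database are assumptions made for the sake of a self-contained example.

```python
import sqlite3
from collections import defaultdict

def load_totals_in_one_pass(conn):
    """Scan the (hypothetical) sales table once and aggregate in memory."""
    totals = defaultdict(int)
    # A single sequential scan; rows are streamed, not materialized at once.
    for product, quantity in conn.execute("SELECT product, quantity FROM sales"):
        totals[product] += quantity
    return dict(totals)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("apple", 3), ("pear", 1), ("apple", 2)],
)
totals = load_totals_in_one_pass(conn)
```

The anti-pattern this avoids is a loop running one `SELECT ... WHERE product = ?` per product, which, without a suitable index, re-scans the table as many times as there are products.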

Facing the problems

Thus, whenever data is involved, the initiative should start by facing the problems that are the true roadblocks to delivering a "solution". Those problems are typically:

  • Collecting and servicing the data: About every single company I visit has problems with collecting and servicing the data. The most obvious symptom is typically the lack of documentation concerning the data itself, and of all the nitty-gritty insights needed to make anything of it. No technology is going to solve that problem, only people and process.
  • Choosing the metrics to be optimized: There are so many parts of the business that could be improved through a smart exploitation of the data, that it is extremely tempting to think that some (hype) technology might be THE answer to everything. This is not going to happen. Solving a problem through data is tough, and without metrics, you don't even know for sure that you're moving in the right direction. Frequently, defining the metric - that is, the problem to be solved - is harder than implementing the solution.

Thus, before jumping to the next cool vendor solution, I urge you to start by facing the very uncool aspects of the problem. Frequently, the "solution" consists of removing an ingredient of the previous solution.