Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Monday, Jun 05, 2017

## .NET-first strategy for CNTK

CNTK is an incredible deep learning toolkit from Microsoft. Despite being little known, the technology rivals the capabilities of TensorFlow under the hood. It’s unfortunate that CNTK, while originating from Microsoft, decided to steer away from the Microsoft ecosystem, which actually makes CNTK a less viable option than TensorFlow as far as .NET is concerned.

My conclusions:

• as a contender for the Python ecosystem, CNTK is a lost cause. TensorFlow has already won by a large margin; just like x86-64 won over IA-64.
• unlike Python, the .NET ecosystem is still vastly underserved as far as deep learning is concerned; this would be a perfect spot for CNTK if CNTK opted for a .NET-first strategy.
• by establishing CNTK as the go-to option for deep learning in .NET, CNTK would be very well positioned to become the go-to option for Java as well.

As a long-time supporter of the Microsoft ecosystem, it really pains me to see otherwise excellent Microsoft technologies heading for the wall, just because their strategy makes them irrelevant by design for the broader community.

At the core, CNTK is a C++ product with a primary focus on raw performance, that is, getting the most accuracy out of the computing resources allocated to a machine learning task. Yet, as confirmed by the team, CNTK is focusing on Python as its high-level entry point.

CNTK also features BrainScript, a tiny DSL (domain specific language) intended to design a deep learning network with a high-level declarative syntax. While BrainScript is advertised as a scripting language, it’s a glorified configuration file, which is an excellent option in the deep learning context.

### A frontal assault on TensorFlow is a lost cause

The Python-first orientation of CNTK is a strategic mistake for CNTK, and will only consolidate CNTK as a distant second behind TensorFlow.

The Python deep-learning ecosystem is already very well-served through TensorFlow and its own ecosystem. As a quick guesstimate, TensorFlow presently has 100x the momentum of CNTK. Amazon tells me that there are 50+ books on TensorFlow against exactly zero books for CNTK.

Can CNTK catch up frontally against TensorFlow? No. TensorFlow has the first-mover advantage, and my own casual observations of Hacker News indicate that TensorFlow is even growing faster than CNTK, further widening the gap. CNTK might have better performance, but it’s not a game changer, not a sustainable one anyway. The TensorFlow teams are strong, and the CNTK performance tuning is being aggressively replicated.

Anecdotally, the Microsoft teams themselves seem to be internally favoring TensorFlow over CNTK. TensorFlow is already more mature for the Microsoft ecosystem, i.e. .NET, than CNTK.

Business 101: don’t engage frontally in a battle that is already lost. You can be ambitious, but you need an "angle".

### Why Python is a lost cause for Microsoft

Microsoft has tried to reach out to the Python ecosystem for more than a decade, with efforts dating back to IronPython in 2006, followed by the Visual Studio tools in 2011. Yet, at the present time, after a decade of efforts, the fraction of the Python ecosystem successfully attracted by Microsoft remains negligible: let's say it underflows measurements. I fail to see why CNTK would be any different.

The Python ecosystem has consolidated itself around strictly non-Microsoft technologies and non-Microsoft environments. There is no SQL Server, it’s PostgreSQL. There is no Microsoft Azure, it’s AWS or Google Cloud. Etc. If I were to bet some money, I would gamble that CNTK won’t have any meaningful presence in the Python machine learning ecosystem of tomorrow. Most likely, it will be TensorFlow and a couple of non-Microsoft contenders (*).

(*) I am not saying that no strong contenders to TensorFlow will emerge from the deep learning community; I am saying that no strong contenders to TensorFlow targeting Python will emerge from Microsoft.

### Python is a major friction for a .NET solution

One deep yet frequent misunderstanding from academic or research circles is the degree of pain that heterogeneous software stacks represent for companies, clients and vendors alike. Maintaining healthy production systems when one stack (e.g. .NET or Java or Python) is involved is already a challenge. Introducing a second stack is very costly in practice.

My company Lokad is developing a complex .NET solution based on Microsoft Azure. If tomorrow Lokad were to start relying on Python, we would have:

• to monitor closely all the Python packages and dependencies, just like we do for .NET packages, if only to be capable of swiftly deploying security fixes.
• to install and maintain consistent Python versions across all our machines, from the development workstations to production servers, and organize company-wide (*) upgrades accordingly.
• to develop and foster a culture of Python, which includes knowing the language (easy) but also all the institutional knowledge needed to build good Python apps (hard).

(*) Sometimes you're out of luck and bits of your software live on the client side too, forcing you to support multiple versions of the base framework itself.

Keeping the technological mass of a software solution under control is a very important concern. LaTeX might have succeeded despite being built out of a dozen programming languages, but this will kill any independent software vendor (ISV) with a high degree of certainty.

All those considerations have nothing to do with whether Python is good or bad. The challenge would be identical if we were to introduce the Java or Swift stacks in our .NET codebase.

### The tech mass of BrainScript is minimal

While Python is a whole technology stack of its own, BrainScript, the DSL of CNTK, is nothing but a glorified configuration file. As far as the technological mass is concerned, managing a DSL like BrainScript is a no-brainer. Let’s face it, this is not a new language to learn. The configuration file for ASP.NET (web.config) is an order of magnitude more complex than BrainScript as a whole, and nobody refers to web.config files as being a programming language of their own.

While the CNTK team decided to steer away from BrainScript, I would, on the contrary, suggest doubling down on this approach. Machine learning is complicated, bugs are hard to track down, and unlike machine learning competitions where datasets are 100% well-prepared, data in the real world is messy and poorly documented. Real software businesses are facing deadlines and limited budgets. We need machine learning tools, but more importantly, we need tools that deliver some degree of correctness by design. BrainScript is imperfect, but it’s a solid step in the right direction.

### .NET is a massive opportunity for CNTK

The .NET ecosystem is vast, arguably significantly larger than that of Python, and yet fully underserved as far as deep learning is concerned.

It is a misconception to think that software solutions built with .NET would benefit less from deep learning than those built with Python. The needs for deep learning and the expected benefits are the same. Python just happens to be much more popular in the publishing community than .NET.

Most .NET solution vendors, like Lokad, would immediately jump on CNTK if .NET were given a strong, clear priority. Indeed, a .NET-first perspective would be a game changer for CNTK in this ecosystem. Instead of struggling with second-class citizens, like the current C# bindings of CNTK, we would benefit from first-class citizens, which would happen to be completely aligned with the rest of the .NET ecosystem.

Also, .NET is very close to Java; at least, Java is much closer to .NET than it is to Python. Establishing CNTK as a deep learning leader in the .NET ecosystem would also make a very strong case to reach out to the Java ecosystem later on.

### .NET/C# is superior to Python for deep learning

Caveat: opinionated section

C# is nearly uniformly superior to Python: performance, productivity, tooling.

Performance is a primary concern for the platform intended as the high-level instrumentation of the deep learning infrastructure. Indeed, one should never underestimate how much development effort the data pipeline represents. It might be possible to describe a deep learning network in 50 lines of code, yet, in practice, the data pipeline that surrounds those lines weighs thousands of lines. Moreover, because we are moving a lot of data around, and because preprocessing the data can be quite challenging on its own, the data pipeline needs to be efficient.

Anecdote: at Lokad, we have multiple data pipelines that involve crunching over 1TB of data on a daily basis. We run those data pipelines with highly optimized C# that makes aggressive use of both the async capabilities of C# and low-level C-like algorithms.

With .NET/C#, my own experience building Lokad indicates that it’s usually possible to achieve over 50% of the raw C performance on CPU by paying minimal attention to performance, i.e. don’t go crazy with objects, use struct when relevant, etc. With CPython, achieving even 10% of the C performance is a struggle, and frequently we end up with 1% of the performance of C. Yes, PyPy exists, but then, the Python ecosystem is badly fragmented, and PyPy is not even compatible with TensorFlow. When it comes to machine learning, the raw performance of the high-level interop language matters a lot, because it’s where all the data preprocessing happens, that is, where 99% of the software investments are made. Falling back to C++ whenever you need performance is no longer a reasonable option in 2017.

.NET/C# is a superior alternative to Python for building type-safe, high-performance, production-grade data pipelines. Moreover, I would argue (another debatable point) that C#, as a language, is also evolving faster than Python.

Finally, with .NET Core being open source and now working on both Linux and Windows, there is no longer any obstacle to using .NET/C# as the middleware glue for CNTK either. Again, this would contribute to making CNTK easier to integrate into .NET solutions, which represent the strategic market that Microsoft has a solid chance to capture.

Monday, Aug 29, 2016

## The sad state of .NET deployments on Azure

One of the core benefits of cloud computing should be ease of deployment. At Lokad, we have been using Azure in production since 2010. We love the platform, and we depend on it absolutely. Yet, it remains very frustrating that .NET deployments have made nowhere near the progress that could have been expected 6 years ago.

The situation of .NET deployments is reminiscent of the data access re-inventions which were driving Joel Spolsky nuts 14 years ago. Now, Microsoft has shifted its attention to app deployments, and waves of stuff keep rolling in, without really addressing core concerns, and leaving the whole community disoriented.

At present time, there are 6 major ways of deploying a .NET app on Azure:

• ASM, WebRole and WorkerRole
• ASM, Classic VM
• ARM, WebApp
• ARM, Azure Batch
• ARM, Azure Functions
• ARM, VM scale set

ASM stands for Azure Service Manager, while ARM stands for Azure Resource Manager. The ASM gathers the first generation of cloud services on Azure. The ARM gathers the second generation of cloud services on Azure.

### ASM to ARM transition is a mess

In Azure, pretty much everything comes in two flavors: the ASM one and the ARM one; even the Blob Storage accounts (equivalent of S3 on AWS). Yet, there is no migration possible. Once you create a resource (a storage account, a VM, etc.) on one side, it cannot be migrated to the other side. It's such a headache. Why is it the responsibility of the clients to deal with this mess? Most resources should be accessible from both sides, or be "migratable" in a few clicks. Then, whenever the ASM/ARM distinction is not relevant (e.g. storage accounts), the distinction should not even be visible.

### So many ways to do the same thing

It's maddening to think that pretty much every service handles .NET deployments in a different way:

• With WebRole and WorkerRole, you locally produce a package of assemblies (think of it as a Zip archive containing a list of DLLs), and you push this package to Azure.
• With the Classic VM, you get a fresh barebone OS, and you do your own cooking to deploy.
• With WebApp, you push the source code through Git, and Azure takes care of compiling and deploying.
• With Azure Batch, you push your DLLs to the blob storage, and script how those files should be injected/executed in the target VM.
• With Azure Functions, you push the source code through Git, except that unlike WebApp, this is not-quite-exactly-C#.
• With the VM scale set, you end up cooking your own OS image that you push to deploy.

Unfortunately, the sanest option, the package option as used for WebRole and WorkerRole, is not even available in the ARM world.

### The problem with Git pushes

Many companies - Facebook or Google for example - leverage a single source code repository. Lokad does too now (we transitioned to a single repository 2 years ago, it's much better now). While having a large repository creates some challenges, it also makes tons of things easier. Deploying through Git looks super cool in a demo, but as soon as your repository reaches hundreds of megabytes, problems arise. As a matter of fact, our own deployments on Azure routinely crash while our Git repository "only" weighs 370MB. By the time our repository reaches 1GB, we will probably have entirely given up on using Git pushes to deploy.

In hindsight, it was to be expected. The size of the VM needed to compile the app has no relevance to the size of the VM needed to run the app. Plus, compiling the app may require many software pieces that are not required afterward (do you need your JS minifier to be shipped along with your webapp?). Thus, all in all, deployment through Git push only gets you so far.

### The problem with OS management

Computer security is tough. For the average software company, or rather for about 99% of the (software) companies, the only way to ensure decent security for their apps consists of not managing the OS layer. Dealing with the OS is only asking for trouble. Delegating the OS to a trusted party who knows what she is doing is about the only way not to mess it up, unless you are fairly good yourself; which, in practice, eliminates 99% of the software practitioners (myself included).

From this perspective, the Classic VM and the VM scale set feel wrong for a .NET app. Managing the OS has no upside: the app will not be faster, the app will not be more reliable, the app will not have more capabilities. OS management only offers dramatic downsides if you get something wrong at the OS level.

### Packages should have solved it all

In retrospect, the earliest deployment method introduced in Azure - the packages used for WebRole and WorkerRole - was really the right approach. Packages scale well and remain uncluttered by the original size of the source code repository. Yet, for some reason this approach was abandoned on the ARM side. That said, the old ASM design never delivered the most obvious benefits that this approach should have offered:

• The packages could have been made even more secure: signing and validating packages is straightforward.
• Deployment could have been super fast: injecting a .NET app into a pre-booted VM is also straightforward.
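To illustrate why signing and validating a package is straightforward once the deployment unit is a single archive, here is a minimal sketch in Python. It is not the Azure package format: the zip layout, the function names and the use of an HMAC with a shared key (instead of a proper asymmetric signature) are all assumptions made to keep the example short.

```python
import hashlib
import hmac
import io
import zipfile

def build_package(files):
    """Bundle {file name: bytes} into one in-memory zip archive (hypothetical layout)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()

def sign(package, key):
    """HMAC-SHA256 over the whole archive; a stand-in for a real signing scheme."""
    return hmac.new(key, package, hashlib.sha256).hexdigest()

def verify(package, signature, key):
    """Constant-time check that the archive was not altered after signing."""
    return hmac.compare_digest(sign(package, key), signature)

package = build_package({"App.dll": b"binary content", "App.config": b"<configuration/>"})
signature = sign(package, key=b"demo-key")
print(verify(package, signature, key=b"demo-key"))          # untouched package
print(verify(package + b"x", signature, key=b"demo-key"))   # tampered package
```

Because the whole app is one blob, a single signature covers every file; with loose files, each file (and the list of files itself) would need to be protected.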

For demo purposes, it would have been simple enough to have a Git-to-package utility service running with Azure to offer Heroku-like swiftness to small projects, with the possibility to transition naturally to package deployments afterward.

### Almost reinventing the packages

Azure Batch is kinda like the package, but without the packaging. It's more like x-copy deployment with files hosted in Blob Storage. Yet, because it's x-copy, it will be tricky to support any signing mechanism. Then, looking further ahead, the 1-blob-per-file pattern is near guaranteed to become a performance bottleneck for large apps. Indeed, the Blob Storage offers much better performance at retrieving one 40MB package than at retrieving 10,000 blobs of 4KB each. Thus, for large apps, batch deployments will be somewhat slow. Then, somebody somewhere will start re-inventing the notion of "package" to reduce the number of files ...

With the move toward .NET Core, .NET has never been more awesome, and yet, it could be so much more with a clarified vision and technology around deployments.

Tuesday, Mar 08, 2016

## Cloud-first programming languages

The art of crafting programming languages is probably one of the most mature fields of software, and yet it’s surprising to realize how much potential there is in rethinking programming from a cloud-first [0] perspective. At my company Lokad, we ended up writing our own programming language – a narrow domain specific language geared toward commerce analytics – and we keep stumbling on elements that would have been hard to achieve from a more traditional perspective.

Our language – Envision – lives within the walled garden of its parent company: Lokad provides the tools to author the code as well as the platform to execute the scripts. While this approach has limitations of its own, it offers some rather unique upsides as well.

Designing a programming language is like any other design challenge: even the most brilliant designer makes mistakes. Then, assuming that the language gains some traction, a myriad of programs get written leveraging what has now become an unintended feature. At this point, rolling back any bad design decision takes a monumental effort, because every single piece of code ever written needs to be upgraded separately. All major programming languages (C++, JavaScript, Python, C#) are struggling with this problem. Overall, change is very slow, measured in decades [1].

However, if the parent company happens to be in control of all the code in existence, then it becomes possible to refactor automatically, through static code analysis, all code ever written, and through refactoring to undo the original design mistake. This does not mean that making mistakes becomes cheap but only that it becomes possible to fix those mistakes within days [2], while regular programming languages mostly have to carry on forever with their past mistakes.
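As an illustration of the kind of mechanical rewrite involved, here is a toy sketch. It is not Envision's tooling: Python's own `ast` module (Python 3.9+ for `ast.unparse`) stands in for the language, and the rule applied, rewriting every `==` that involves a string literal into an explicitly case-insensitive comparison, is a hypothetical migration loosely modeled on the one described in footnote [2].

```python
import ast

class ExplicitCaseInsensitiveEq(ast.NodeTransformer):
    """Rewrite `x == "lit"` into `x.lower() == "lit".lower()` (toy migration rule)."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and isinstance(node.ops[0], ast.Eq):
            sides = (node.left, node.comparators[0])
            # Only touch comparisons where one side is visibly a string literal.
            if any(isinstance(s, ast.Constant) and isinstance(s.value, str) for s in sides):
                node.left = self._lower(node.left)
                node.comparators = [self._lower(node.comparators[0])]
        return node

    @staticmethod
    def _lower(expr):
        method = ast.Attribute(value=expr, attr="lower", ctx=ast.Load())
        return ast.Call(func=method, args=[], keywords=[])

def migrate(source):
    """Apply the rewrite to one script and return the upgraded source."""
    tree = ExplicitCaseInsensitiveEq().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(migrate('is_open = status == "open"'))
print(migrate("is_big = quantity == 100"))   # left untouched: no string involved
```

Running `migrate` over every script in the repository is what "upgrading all code ever written" amounts to; the hard part in practice is being certain that the repository really contains all of it, which is exactly what the walled garden provides.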

From a cloud-first perspective, it’s OK to take some degree of risk with language features as long as the features being introduced are simple enough to be refactored away later on. The language evolution speed-up is massive.

## 2. Identifying and fixing programming antipatterns

Programming languages are for humans and humans make mistakes. Some mistakes can be identified automatically through static code analysis; and then, many more can be identified through dynamic code analysis. Within its walled garden, the company has direct access not only to all the source code, but to all past executions and all the input data as well. In this context, it becomes considerably easier to identify programming antipatterns.

Once an antipattern is identified, it becomes possible to selectively warn impacted programmers with a high degree of accuracy. However, it also becomes possible to think of the deep-fix: the programming alternative that should resolve the antipattern.

For example, at Lokad, we realized a few months ago that lines of code dealing with minimal ordering quantities were frequently buggy. The deep fix was to get rid of this logic entirely through a dedicated numerical solver. The challenge was not so much implementing the solver – although it happened to be a non-trivial algorithm – but realizing that such a solver was needed in the first place.

## 3. Out-of-band calculations

As soon as your logic needs to process a lot of data, computation delays creep in. Calculation delays are typically not an issue in production: results should be served fast, but refreshing the results [3] can typically take minutes without any impact. As long as nobody is waiting for the newer results, latency matters little.

However, there is one point of time when calculation latency is critical: design time, when the programmer is slowly iterating over hundreds of versions of the same code to incrementally craft the intended calculation. At design time, calculation delays are a real hindrance. Data scientists know the pattern too well: add 2 lines to your code, execute, and go grab a coffee while the calculation completes.

But what if the platform was compiling and running your code in the background? What if the platform was even planning things ahead of you, and pre-computing many elements before you actually need them? It turns out that if the language has been designed upfront with this sort of perspective, it’s very feasible; not all the time, just frequently enough. Through Envision, we are already doing this, and it’s not even that hard [4].
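The flat-file case mentioned in footnote [4] can be sketched in a few lines. This is only an illustration of the idea, not how Lokad implements it: the function names `on_upload` and `read_table`, the in-memory cache and the thread pool are all assumptions.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)
_parsed = {}   # file name -> Future holding the parsed rows

def parse_csv(content):
    """The slow step we want to take out of the programmer's waiting time."""
    return list(csv.DictReader(io.StringIO(content)))

def on_upload(name, content):
    """Hypothetical hook called when a file lands on the platform:
    parsing starts immediately, before any script asks for the data."""
    _parsed[name] = _pool.submit(parse_csv, content)

def read_table(name):
    """Called later by a script; it only blocks if parsing has not finished yet."""
    return _parsed[name].result()

on_upload("orders.csv", "sku,qty\nA,3\nB,5\n")
# ... the programmer is still typing the script at this point ...
print(read_table("orders.csv"))
```

The design choice is simply to bet on the most likely next step: a file that gets uploaded will almost certainly be read, so the work can start before it is requested.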

A careful cloud-first design of the programming language can be used to increase the amount of calculations that can be performed out-of-band. Those calculations could be performed on local machines, but in practice, relying on a cloud makes everything easier.

## 4. Data-rich environment

From a classic programming perspective, the programming language – or the framework – is supposed to be decoupled from data. Indeed, why would anyone ship a compiler with datasets in the first place? Except for edge cases, e.g. Unicode ranges or timezones, it’s not clear that it would even make sense to bundle any data with the programming language or the development environment.

Yet, from a cloud-first perspective, it does make sense. For example, in Envision, we provide native access to currency rates, both present and historical. Then, even within the narrow focus of Lokad, there are many more potentially worthy additions: national tax rates, ZIP code geolocation, manufacturer identification through UPC... Other fields would probably have their own domain-specific datasets ranging from the properties of chemical compounds to trademark registrations.

Embedding terabytes of external data along with the programming environment is a non-issue from a cloud-first perspective; and it offers the possibility to make vast datasets readily available with zero hassle for the programmer.

In conclusion, the transition toward a cloud-first programming language represents an evolution similar to the one that happens when transitioning from desktop software to SaaS. From afar, both options look similar, but the closer you get, the more differences you notice.

[0] I am not entirely satisfied with this terminology; it could have been LaaS for “Language as a Service”, or maybe IDEE for “Integrated Development and Execution Environment”.

[1] The upgrade from Python 2 to Python 3 will have cost this community roughly a decade. Improving the way null values are handled in C# is also a process that will most likely span over a decade; the end-game being to make those null values unnecessary in C#.

[2] In the initial version of Envision, we decided that the operator == when applied to strings would perform a case-insensitive equality test. In hindsight, this was a plain bad idea. The operator == should perform a case-sensitive equality test. Recently, we rolled out a major upgrade where all Envision scripts got upgraded toward the new case-insensitive operators, effectively freeing the operator == for the revised intended semantic.

[3] Most people would favor a spam filter introducing 10 seconds of processing delay per message if the filtering accuracy is at 99.99% versus a spam filter needing 0.1 seconds but offering only a 99% accuracy. Similarly, when Lokad computes demand forecasts to optimize containers shipped from China to the USA, speeding up the calculation by a few minutes is irrelevant compared to any extra forecasting accuracy to be gained through a better forecasting model.

[4] If somebody uploads a flat file – say a CSV file – to your data processing platform, what comes next? You can safely assume that loading and parsing the file will come next; and Lokad does just that. Envision has fancier tricks under the hood than flat file pre-parsing, but it's the same sort of idea.

Friday, May 08, 2015

## Nearly all web APIs get paging wrong

Data paging, that is, the retrieval of a large amount of data through a series of smaller data retrievals, is a non-trivial problem. Through Lokad, we have implemented about a dozen extensive API integrations, and reviewed a few dozen other APIs as well.

The conclusion is that as soon as paging is involved, nearly all web APIs get it wrong. Obviously, rock-solid APIs like the ones offered by Azure or AWS are getting it right, but those outstanding APIs are exceptions rather than the norm.

### The obvious pattern that doesn't work

I have lost count of the APIs that propose the following broken pattern to page through their data, a purchase order history for example:

https://example.com/api/purchaseorders?page=2&pagesize=25

Where page is the page number and pagesize the number of orders to be retrieved. This pattern is fundamentally unsafe. Any order deleted while the enumeration is in-progress will shift the indices which, in turn, is likely to cause another order to be skipped.

There are many variants of the pattern, and every time the problem boils down to the same thing: the "obvious" paging pattern leads to a flawed implementation that fails whenever concurrent writes are involved.
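The failure is easy to reproduce with a toy simulation. In the sketch below, an in-memory list stands in for the server, `fetch_page` is a hypothetical stand-in for the `page`/`pagesize` endpoint, and page numbers are assumed to start at 1.

```python
def fetch_page(orders, page, pagesize):
    """Stand-in for GET /api/purchaseorders?page=..&pagesize=.. (pages start at 1)."""
    start = (page - 1) * pagesize
    return orders[start:start + pagesize]

orders = [f"PO-{i}" for i in range(1, 7)]             # PO-1 .. PO-6 on the server

seen = list(fetch_page(orders, page=1, pagesize=2))   # client reads the first page
orders.remove("PO-1")                                 # meanwhile, an order gets deleted
seen += fetch_page(orders, page=2, pagesize=2)        # indices have shifted by one
seen += fetch_page(orders, page=3, pagesize=2)

missed = [o for o in orders if o not in seen]
print(missed)   # orders that still exist but were never returned to the client
```

The order that slid from the second page into the first one after the deletion is never served, and nothing in the responses tells the client that something went wrong.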

### The "updated_after" filter doesn't work either

Another popular approach for paging is to leverage a filter on the update timestamp of the elements to be retrieved, that is:

https://example.com/api/purchaseorders?updated_after=2015-04-29

Then, in order to page the request, the client is supposed to take the most recent updated_at value from the response and to feed this value back to the API to further enumerate the elements.

However, this approach does not (really) work either. Indeed, what if all elements have been updated at once? This can happen because of a system upgrade or because of any kind of bulk operation. Even if the timestamp can be narrowed down to the microsecond, if there are 10,000 elements to be served all having the exact same update timestamp, then the API will keep sending a response where max(updated_at) is equal to the request timestamp.

The client is not enumerating anymore, the pattern has failed.
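A small simulation makes the stall concrete. Everything here is a stand-in: five orders sharing one timestamp, a response capped at two elements, and a `fetch` function whose `inclusive` flag covers both ways an API might interpret `updated_after` (the source does not say which one a given API uses).

```python
from datetime import datetime

BULK = datetime(2015, 4, 29, 12, 0, 0)
# Five orders touched by the same bulk operation: identical timestamps.
orders = [{"id": i, "updated_at": BULK} for i in range(1, 6)]

def fetch(updated_after, limit, inclusive):
    """Stand-in for the API: filter on the timestamp, cap the response size."""
    if inclusive:
        rows = [o for o in orders if o["updated_at"] >= updated_after]
    else:
        rows = [o for o in orders if o["updated_at"] > updated_after]
    rows.sort(key=lambda o: (o["updated_at"], o["id"]))
    return rows[:limit]

def enumerate_ids(inclusive, max_requests=10):
    """Client loop: feed max(updated_at) of each response back to the API."""
    cursor, seen = datetime.min, set()
    for _ in range(max_requests):
        page = fetch(cursor, limit=2, inclusive=inclusive)
        if not page:
            break
        seen.update(o["id"] for o in page)
        cursor = max(o["updated_at"] for o in page)
    return seen

print(enumerate_ids(inclusive=True))    # same page served over and over
print(enumerate_ids(inclusive=False))   # enumeration stops early instead
```

With an inclusive filter the client receives the same two orders until it gives up; with an exclusive filter it stops after the first page. Either way, three of the five orders are never retrieved.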

Sure, it's possible to tweak the timestamps to make sure that all the elements spread gracefully over distinct values, but it's a very non-trivial property to enforce. Indeed, a datetime column isn't really supposed to be defined with a uniqueness constraint in your database. It's feasible, but odd and error-prone.

### The fallacy of the "power" APIs

Some APIs provide powerful filtering and sorting mechanisms. Thus, through those mechanisms, it is possible to correctly implement paging, for example by combining two filters: one on the update datetime of the items and one on the item identifier. A correct implementation is far from trivial however.
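For reference, here is a minimal sketch of what the two-filter enumeration looks like, assuming the API can both sort and filter on the pair (update datetime, identifier) and that identifiers are positive integers. The in-memory `fetch_after` function is a hypothetical stand-in for such an API.

```python
from datetime import datetime

BULK = datetime(2015, 4, 29, 12, 0, 0)
# Worst case for timestamp-only paging: every order shares the same timestamp.
orders = [{"id": i, "updated_at": BULK} for i in range(1, 6)]

def fetch_after(last_key, limit):
    """Stand-in for an API sorting and filtering on (updated_at, id)."""
    rows = sorted(orders, key=lambda o: (o["updated_at"], o["id"]))
    return [o for o in rows if (o["updated_at"], o["id"]) > last_key][:limit]

def enumerate_all(limit=2):
    """Client loop: the cursor is the compound key of the last element received."""
    last_key, ids = (datetime.min, 0), []
    while True:
        page = fetch_after(last_key, limit)
        if not page:
            return ids
        ids.extend(o["id"] for o in page)
        last_key = (page[-1]["updated_at"], page[-1]["id"])

print(enumerate_all())
```

The identifier breaks ties between equal timestamps, so the cursor always moves forward. Note how much the client has to get right on its own: the sort order, the strict comparison on the pair rather than on each field, and the initial cursor value.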

Merely offering the possibility to do the right thing is not sufficient: doing the right thing should be the only possibility. This point is something that Lokad learned the hard way early on: web APIs should offer one and only one way to do each intended operation.

If the API offers a page mechanism but the only way to correctly implement paging is to not use it, then rest assured that the vast majority of the client implementations will get it wrong. From a design viewpoint, it's like baiting developers into a trap.

### The "continuation token" as the only pattern that works

To my knowledge, there is just about one pattern that works for paging: the continuation token pattern.

https://example.com/api/purchaseorders?continue=token

Where every request to a paged resource like the purchase orders has the possibility of returning a continuation token on top of the elements returned when not all elements could be returned in one batch.

On top of being correct, that pattern has two key advantages:

• It's very hard to get it wrong on the client side. There is only one way to do anything with the continuation token: it's to feed it again to the API.
• The API is not committed to returning any specific number of elements (in practice a high upper bound can still be documented). Then, if some elements are particularly heavy or if the server is already under heavy workload, smaller chunks can be returned.

This enumeration should not provide any guarantee that the same element won't be enumerated more than once. The only guarantee that should be provided by paging through tokens is that ultimately all elements will be enumerated at least once. Indeed, you don't want to end up with tokens that embed some kind of state on the API side; and in order to keep it smooth and stateless, it's important to lift this constraint.

Then, continuation tokens should not expire. This property is important in order to let the client perform incremental updates on a daily, weekly or even monthly schedule, depending on what makes sense from a business viewpoint.
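A minimal sketch of the pattern, under stated assumptions: the server is simulated in memory, each order carries a hypothetical change counter `rev`, and the token is just the base64-encoded position of the last element served. The response shape (`items`, `continue`) is invented for the example.

```python
import base64
import json

# Hypothetical server-side data: each order carries a change counter "rev".
ORDERS = [{"id": i, "rev": 100 + i} for i in range(1, 8)]

def _encode(key):
    return base64.urlsafe_b64encode(json.dumps(key).encode()).decode()

def _decode(token):
    return tuple(json.loads(base64.urlsafe_b64decode(token.encode())))

def get_orders(token=None, max_items=3):
    """Stand-in for GET /api/purchaseorders?continue=token.
    The token only encodes a position, so no state is kept on the API side."""
    last_key = _decode(token) if token else (-1, -1)
    rows = sorted(ORDERS, key=lambda o: (o["rev"], o["id"]))
    rows = [o for o in rows if (o["rev"], o["id"]) > last_key]
    page = rows[:max_items]          # the server is free to return fewer items
    more = len(rows) > len(page)
    next_token = _encode([page[-1]["rev"], page[-1]["id"]]) if more else None
    return {"items": page, "continue": next_token}

def download_all():
    """Client side: the only thing to do with a token is to send it back."""
    by_id, token = {}, None
    while True:
        response = get_orders(token)
        for order in response["items"]:
            by_id[order["id"]] = order   # at-least-once: a duplicate just overwrites
        token = response["continue"]
        if token is None:
            return by_id

print(sorted(download_all()))
```

The client never inspects the token, and it keys the received elements by identifier so that an element served twice does no harm. Since the token carries no server-side state, there is also nothing that forces it to expire.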

### No concurrency but data partitions

The continuation token does not support concurrent data retrieval: the current response has to be retrieved before the next request can be posted. Thus, in theory, this pattern somewhat limits the amount of data that can be retrieved.

Well, it's somewhat true, and yet mostly irrelevant. First, Big (Business) Data is exceedingly rare in practice, as the transaction data of the largest companies tend to fit on a USB key. For all the APIs that we have integrated, putting aside the cloud APIs (aka Azure or AWS), there was not a single integration where the amount of data was even close to justifying concurrent data accesses. Slow data retrieval is merely a sign of non-incremental data retrieval.

Second, if the data is so large that concurrency is required, then partitioning the data is typically a much better approach. For example, if you wish to retrieve all the data from a large retail network, then the data can be partitioned per store. Partitioning will make things easier both on the API side and on the client side.
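As a sketch of how partitioning and continuation tokens combine, the example below enumerates each store sequentially with its own token while different stores are downloaded in parallel. The per-store endpoint, the store names and the token format (the last identifier served) are all hypothetical, and plain integers stand in for the orders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: order identifiers partitioned per store.
STORES = {"paris": [1, 2, 3, 4, 5], "lyon": [6, 7], "lille": []}

def get_store_orders(store, token=None, max_items=2):
    """Stand-in for a per-store paged endpoint returning (items, next token)."""
    last_id = int(token) if token else 0
    rows = [o for o in STORES[store] if o > last_id]
    page = rows[:max_items]
    next_token = str(page[-1]) if len(rows) > len(page) else None
    return page, next_token

def download_store(store):
    """Sequential enumeration of a single partition."""
    orders, token = [], None
    while True:
        page, token = get_store_orders(store, token)
        orders.extend(page)
        if token is None:
            return store, orders

def download_network(stores, workers=4):
    """Concurrency happens across partitions, never within one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(download_store, stores))

print(download_network(STORES))
```

Each partition keeps the simple, hard-to-misuse token loop, and the degree of parallelism is bounded by the number of partitions rather than by any paging trick.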

Wednesday, Mar 04, 2015

## Buying software? You should ignore references

Being a (small) software entrepreneur, it is still amazing to witness how all hell breaks loose when certain large software vendors start deploying their “solution”. Even more fascinating is that after causing massive damage, the vendor just signs another massive deal with another large company and hell breaks loose again. Repeat this one hundred times, and you witness a world-wide verticalized software leader crippling an entire industry with half-baked technology.

Any resemblance between the characters in this post and any real retail company is purely coincidental.

I already pointed out that Requests For Quotes (RFQ) were a recipe for disaster, but RFQs alone do not explain the scale of the mess. As I become more and more familiar with selling to large companies, I now tend to think that one heavyweight driver behind these epic failures is a banal flaw of the human mind: we massively overvalue other people’s opinion on a particular subject instead of relying on our own judgment.

In B2B software, a reference is usually a person who works in a company similar to the one you are trying to sell to, and who, when called by your prospects, conveys exceptionally positive feelings about you and extremely vague information about your solution. Having tested this approach myself, I can say that the results are highly impressive: the reference call is an incredibly efficient sales method. Thus, it is pretty safe to assume that any sufficiently large B2B software vendor is acutely aware of this pattern as well.

At this point, for the vendor, it becomes extremely tempting not to merely stumble upon happy customers who happen to be willing to act as referees, but to manufacture these references directly, or even to fake them if it’s what it takes. How hard could this be? It turns out that it’s not hard at all.

As a first-hand witness, I have observed that there are two main paths to manufacturing such references, which I would refer to as the non-subtle path and the subtle path. My observations indicate that both options are routinely leveraged by most B2B software vendors once they reach a certain size.

The non-subtle path is, well, not subtle: you just pay. Don’t get me wrong, there is no bribery involved or anything that would be against the law. Your “reference” company gets paid through a massive discount on its own setup fee, and is under a strict agreement that it will play its part in acting as a reference later on. Naturally, it is difficult to include this in the official contract, but it turns out that you don’t need to. Once a verbal agreement is reached, most business executives stick to the spirit of the agreement, even if they are not bound by written contract to do so. Some vendors even go a step further by directly offering a large referral fee to their flagship references.

The subtle path takes another angle: you overinvest in order to make your “reference” client happy. Indeed, usually, even the worst flaws of an enterprise software can be fixed given unreasonable efforts, that is, efforts that go well beyond the budget of your client. As a vendor, you still have the option to pick a few clients where you decide to overinvest and make sure that they are genuinely happy. When the time comes and a reference has to be provided, the reference is naturally chosen as one of those “happy few” clients who benefit from an outstanding service.

While one can be tempted to argue that the subtle path is morally superior to the non-subtle path, I would argue that they are both equally deceptive, because a prospect gets a highly distorted view of the service actually provided by the vendor. The subtle path has the benefit of not being a soul-crushing experience for the vendor staff, but many people come to terms with the non-subtle path as well.

If you happen to be in a position of buying enterprise software, it means that you should treat all such hand-picked references with downright mistrust. While it is counter-intuitive, the rational option is to refuse any discussions with these references as they are likely to distort your imperfect (but so far unbiased) perception of the product to be acquired.

Refusing calls with references? Insanity, most will say. Let’s step back for one second, and let’s have a look at what can be considered as the “gold standard” [1] of rational assessment: the paper selection process of international scientific publications. The introduction of blind, and now double-blind, peer reviews was precisely motivated by the need to fight the very same kind of mundane human flaws. Nowadays, if a research team were to try to get a paper published on the grounds that they have buddies who think that their work is “cool”, the scientific community would laugh at them, and rightly so. Only the cold examination of the work itself by peers holds its ground.

And that is what references are: they are buddies of the vendor.

In addition, there is another problem with references that is very specific to the software industry: time is of the essence. References are a reflection of the past, and by definition, when looking at the past, you are almost certain to miss recent innovations. However, software is an incredibly fast-paced industry. Since I first launched Lokad, the software business for commerce has been disrupted by three major tech waves: cloud computing, multichannel commerce and mobile commerce; and that is not even counting “minor” waves like Big Data. Buying software is like buying a map: you don’t want an outdated version.

Software that is used to run large companies is typically between one and two decades behind what would be considered as “state of the art”. Thus, even if a vendor is selling technology that is one decade behind the rest of the market, this vendor can still manage to be perceived as an “upgraded” company by players who were two decades behind the market. It is a fallacy to believe that because the situation improved somewhat, the move to purchase a particular software was a good one. The opportunity to get up to speed with the market has been wasted, and the company remains uncompetitive.

No matter which approach is adopted by the vendor to obtain its references, one thing is certain: it takes a tremendous amount of time to obtain references, typically years. Thus, by the time references are obtained, chances are high that the technology that has been assessed by the referee has now become outdated. At Lokad, it happened to us twice: by the time we obtained references for our “classic” forecasting technology, we had already released our “quantile” forecasting technology and our former “classic” forecasting software was already history. And three years later, history repeated itself as we released “quantile grids” forecasting that is vastly superior to our former “quantiles”. If companies were buying iPhones based on customer references, they would just be starting to buy the iPhone 1 now, not trusting iPhone 2 yet because it would still lack customer references; and it would be unimaginable to even consider all the different versions from iPhone 3 to iPhone 6 that have not yet been time-tested.

The need for references emerges because the software buyer is vulnerable and insecure, and rightly so, as epic failures are extremely frequent when buying enterprise software. While the need to obtain security during the buying process is real, references, as we have seen, are a recipe for major failures.

A much better approach is to carry out a thorough examination of the solution being proposed, and yes, this usually means becoming a bit of an expert in this field in order to perform an in-depth assessment of the solution being presented by the vendor. Don’t delegate your judgment to people you have no reason to trust in the first place.

[1] The scientific community is not devoid of flaws; it is still a large bunch of humans after all. Peer reviewing is a research area in progress. Publication protocols are still being improved, always seeking to uphold higher standards of rationality.
