Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)

Meta

Entries in machine learning (2)

Tuesday
Jun062017

Details on the .NET first strategy for CNTK

An extensive discussion is taking place on the CNTK project. As I am partly responsible for this discussion, I am gather some more concrete proposals for CNTK.

Correctness by design and BrainScript

My company, Lokad, has built as complex data-driven analytical solution built on .NET. Because machine learning data pipelines are hellish to debug, we seek technologies to ensure as much design correctness as possible. For example, in many programming language, a certain degree of design correctness can be obtained through strong typing. Some languages like Rust or Closure offer other kind of guarantees.

My immediate interest for BrainScript was not for the language itself, but for the degree of design correctness appears to be enforceable in BrainScript at compile time. For example, a static analysis can tell me the total number of model parameters. For example, based on this number, it would be easy to implement a rule in our continuous integration built that prevent an abusively large training task to ever go in production.

Because of the limited expressivity of BrainScript (a good thing!), many more properties can be enforced at compile time, not even starting CNTK. Compile time is important because the continuous integration server may not have access to all the required data this is required to get CNTK up and running.

Then, BrainScript is only one option to deliver this correctness by design. In .NET/C#, it would be straightforward to implement a tiny API that deliver the sample expressivity of BrainScript. The network definition would then be compiled in .NET just like Expression Trees are compiled (*). BrainScript itself could be through at a human-readable serialization format for a valid expression built through this .NET API.

(*) OK, it's not strictly C# compile time, but in practice if your CNTK-network-description-to-be-compiled is reachable through a unit test, then any failure to compile will be caught through unit tests which is good enough in practice.

In the ticket #1962, I was requesting an extension for BrainScript to be made available for Visual Studio Code because, at the time, I was incorrectly thinking that BrainScript was the core strategy for CNTK. Indeed, BrainScript is still listed as one of the Top 8 reasons to favor CNTK over TensorFlow. Then, as it appears that BrainScript is not the core strategy anymore anyway, then I don't see any particular reason for the CNTK team to invest in the BrainScript tooling. I am completely fine with that, as long as a .NET-friendly alternative is provided which share the good properties of BrainScript.

Train vs. Eval, production and versioning support

From a machine learning perspective, training is a very distinct operation from evaluation. However, as far software engineering is concerned, the two operations typically live very close. Indeed, it usually one system that collects the data, feed the data to the training logic, collect the model, distribute the model to possibly "clients", and ensure that those "clients" are capable of executing the training logic. In company like Lokad, we have complex data pipelines, and the best way to ensure that the training data are consistent with the evaluation inputs is to factorize the logic - aka have the same bits of code (C# in our case) being used to cover both use cases.

By design this implies that any machine learning toolkit that does not offer a unified support for both training and evaluation is a major friction. It's not only a lot more costly to implement, it's also very error prone, as we need to find alternative ways to ensure that the two implementations (training-side and evaluation-side) are and remain strictly consistent in the way data are feed to the deep learning toolkit on both sides. In particular, this is why Python is so painful for a .NET solution: we not only end-up spreading an alternative stack all over the place, we end-up duplicating implementations.

Then, from v1 to v2, the CNTK changed the serialization format for models. Companies may end-up significant amount of resources invested in training one particular model, thus breaking the serialization format is bad. Yet, in the same time, it would be unreasonable to freeze the model serialization format forever, because it would actually prevent many desirable improvements for CNTK.

Once again, the solution is simply .NET. In C#, implementing complex binary (de)serializer is straightforward; arguably less than 1/10th of the effort compared to C++. Thus, CNTK could adopt an approach where the C++ toolkit only supports one format - the latest; and transfer the burden of maintaining multiple (de)serializers to C#. This approch would also offer the possibility to easily translate models in the "old" formats to the new formats. Moreover, the translation could even be done at runtime in .NET/C# if performance is not concern (it's not always a concern).

The laundry list for .NET-first CNTK

In this section, I try to briefly cover the most pressing elements for a .NET-first CNTK.

Naked Nugget deployments. Nugget is the de-factor approach to deploy components in the .NET ecosystem. It's already adopted by CNTK for the C# Evaluation bindings but not for the other parts. In particular, deployements should not involve 3rd party stacks like Python (or Node.js or Java).

A network description API in .NET/C#. The important angle is: the API is declarative, and offers the possibility to ensure some degree of correctness by design. The CNTK team is not even expected to provide the tooling to ensure the correctness by design. As long the network description can be reflected in C#, the community can handle that part.

Low-level abstractions for high perf I/O. As posted at #1963, it's important to offer the possibility to efficiently stream data to CNTK. From a .NET/C# perspective, a p/invoke passing around byte arrays is good enough as long as the corresponding binary serializers are provided in .NET/C# as well.

The non-goals for a .NET-first CNTK

Alternatively, there also non-goals the first version of a .NET-first CNTK.

Fully managed implementation. For a high-performance library like CNTK, a native C++ implementation feels just fine. Many low level parts of .NET are implemented this way, like System.Numerics.

ASP.NET specifics. As long as compatibility is ensured with .NET, compatibility will be ensured for ASP.NET. I don't anything to be done specifically for ASP.NET.

Jupyter notebooks. Jupyter is cool, no question. Yet, the interactive perspective is a very Pythonic way of doing things. While more features is desirable, Jupyter does not strike me as critical. Interative C# has been around for a long time, but there is very little community traction to support it.

Visual designer for networks. Visual design is cool for teaching, but this does not strike me as a high-priority feature for the .NET ecosystem. Again, the tools you need for 2h training sessions are very different from what you need for a mission-critical business system.

Unity specifics. Unity is very cool, but what Unity needs most - as far CNTK is concerned - is a clean .NET integration for CNTK itself. The rest is a bonus.

Monday
Jun052017

.NET-first strategy for CNTK

CNTK is an incredible deep learning toolkit from Microsoft. Despite being little known, under the hood, the technology rivals the capabilities of TensorFlow. While originating from Microsoft, it’s unfortunate that CNTK decided steer away from the Microsoft ecosystem, actually making CNTK a less viable option than TensorFlow as far .NET is concerned.

My conclusions:

  • as a contender for the Python ecosystem, CNTK is a lost cause. TensorFlow has already won by a large margin; just like x86-64 won over IA-64.
  • unlike Python, the .NET ecosystem is still vastly underserved as deep learning is concerned, this would be a perfect spot for CNTK if CNTK opted for a .NET-first strategy.
  • by establishing CNTK as the go-to option for deep learning in .NET, CNTK would be very well positioned to become the go-to option for Java as well.

As a long time supporter of the Microsoft ecosystem, it really pains me to see otherwise excellent Microsoft technologies heading to the wall, just because their strategy make them irrelevant by design for the broader community.

At the core, CNTK is a C++ product, with a primary focus on raw performance, that is, making the most accuracy-wise of the computing resources that are allocated to machine learning task. Yet, as confirmed by the team, CNTK is focusing on Python as the high-level entry point for CNTK.

CNTK also features BrainScript, a tiny DSL (domain specific language) intended to design a deep learning network with a high-level declarative syntax. While BrainScript is advertized as a scripting language, it’s a glorified configuration file; which is an excellent option in the deep learning context.

A frontal assault on TensorFlow is a lost cause

The Python-first orientation of CNTK is a strategic mistake for CNTK, and will only consolidate CNTK as a distant second behind TensorFlow.

The Python deep-learning ecosystem is already very well-served through TensorFlow and its own ecosystem. As a quick guestimation, TensorFlow has presently 100x the momentum of CNTK. Amazon tells me that there are 50+ books on TensorFlow against exactly zero books for CNTK.

Can CNTK catch-up frontally against TensorFlow? No. TensorFlow has the first-mover advantage and my own casual observations of HackerNews indicates that TensorFlow is even growing faster than CNTK, further widening the gap. CNTK might have better performance, but it’s not a game changer, not a sustainable game changer anyway. The TensorFlow teams are strong, and the CNTK performance tuning is being aggressively replicated.

Anecdotally, the Microsoft teams themselves seem to be internally favoring TensorFlow over CNTK. TensorFlow is already more mature for the Microsoft ecosystem, i.e. .NET, than CNTK.

Business 101: don’t engage frontally is a battle that is already lost. You can be ambitious, but you need an "angle".

Why Python is a lost cause for Microsoft

Microsoft has tried to reach out to the Python ecosystem for more than a decade; the efforts dating back from IronPython in 2006, followed by the Visual Studio tools in 2011. Yet, at present time, after a decade of efforts, the fraction of the Python ecosystem successfully attracted by Microsoft remains negligible: let's say it underflows measurements. I fail to see why CNTK would be any different.

The Python ecosystem has consolidated itself around strictly non-Microsoft technologies and non-Microsoft environments. There is no SQL Server, it’s PostgreSQL. There is no Microsoft Azure, it’s AWS or Google Cloud. Etc. If I were to bet some money, I would gamble that CNTK won’t have any meaningful presence in the Python machine learn ecosystem of tomorrow. Most likely, it will be TensorFlow and a couple of non-Microsoft contenders (*).

(*) I am not saying that no strong contenders for TensorFlow will emerge from the deep learning community; I am saying is that no strong contenders for TensorFlow targeting Python will emerge from Microsoft.

Python is a major friction for a .NET solution

One deep yet frequent misunderstanding from academic or research circles is the degree of pain that heterogeneous software stacks represent for companies, clients and vendors alike. Maintaining healthy production systems when one stack (e.g. .NET or Java or Python) is involved is already a challenge. Introducing a second stack is very costly in practice.

My company Lokad is developing a complex .NET solution based on Microsoft Azure. If tomorrow Lokad were to start relying on Python, we would have:

  • to monitor closely all the Python packages and dependencies, just like we do for .NET packages, if only to be capable of swiftly deploying security fixes.
  • to install and maintain consistent Python versions across all our machines, from the development workstations to production servers, and organize company-wide (*) upgrades accordingly.
  • to develop and foster a culture of Python, includes knowing the language (easy) but also all the institutional knowledge to build good Python apps (hard).

(*) Sometimes you're out of lucks and bits of your software live on the client side too, forcing you into multi-version supports of the base framework itself.

Keeping the technological mass of a software solution under control is a very important concern. LaTeX might have succeeded despite being built out of a dozen of programming languages, but this will kill any independent software vendor (ISV) with a high degree of certainty.

All those considerations have nothing to do whether Python is good or bad. The challenge would be identical if we were to introduce the Java or Swift stacks in our .NET codebase.

The tech mass of BrainScript is minimal

While Python is whole technology stack of its own, BrainScript, the DSL of CNTK is nothing but a glorified configuration file. As far the technological mass is concerned, managing a DSL like BrainScript, is a no-brainer. Let’s face it, this is not a new language to learn. The configuration file for ASP.NET (web.config) is an order of magnitude more complex than BrainScript as a whole, and nobody refers to web.config files as being a programming language of their own.

While the CNTK team decided to steer away from BrainScript, I would, on the contrary suggest to double-down on this approach. Machine learning is complicated, bugs are hard to track down, and unlike machine learning competitions where datasets are 100% well-prepared, data in the real world is messy and poorly documented. Real software businesses are facing deadlines and limited budgets. We need machine learning tools, but more importantly, we need tools that deliver some degree of correctness by design. BrainScript is imperfect, but it’s a solid step in the right direction.

.NET is a massive opportunity for CNTK

The .NET ecosystem is vast, arguably significantly larger that the one of Python, and yet fully underserved as far deep learning is concerned.

It is a misconception to think that .NET software solutions designed with .NET would benefit less from deep learning than Python software solutions. The needs for deep learning and the expected benefits are the same. Python just happens to be much more popular in the publishing community than .NET.

Most .NET solution vendors, like Lokad, would immediately jump on CNTK if .NET was given a strong clear priority. Indeed, a .NET-first perspective would be a game changer for CNTK in this ecosystem. Instead, of struggling with second-class citizens, like the current C# bindings of CNTK, we would benefit from first-class citizens, which would happen to be completely aligned with the rest of the .NET ecosystem.

Also, .NET is very close to Java - at least, Java is much closer to .NET than it is from Python. Establishing CNTK as a deep learning leader in the .NET ecosystem would also make a very strong case to reach out to the Java ecosystem later on.

.NET/C# is superior to Python for deep learning

Caveat: opinionated section

C# is nearly uniformly superior to Python: performance, productivity, tooling.

Performance is a primary concern for the platform intended as the high-level instrumentation of the deep learning infrastructure. Indeed, one should never underestimate how much development efforts the data pipeline represent. It might be possible to describe a deep learning network in 50 lines of codes, yet, in practice, the data pipeline that surrounds those lines is weighing thousands of lines. Moreover, because we are moving a lot of data around, and because preprocessing the data can be quite challenging on its own, the data pipeline needs to be efficient.

Anecdote: at Lokad, we have multiple data pipelines that involve crunching over 1TB of data on a daily basis. We use highly optimized C# to run those data pipelines that make aggressive use of both async capabilities of C# but also of low level C-like algorithms.

With .NET/C#, my own experience at building Lokad, indicates that it’s usually possible to achieve over 50% of the raw C performance on CPU by paying minimal attention to performance - aka don’t go crazy with objects, use struct when relevant, etc. With CPython, achieving even 10% of the C performance is a struggle, and frequently we end-up with 1% of the performance of C. Yes, PyPy exists, but then, the Python ecosystem is badly fragmented, and PyPy is not even compatible with TensorFlow. When it comes to machine learning, raw performance of the high-level interop language matters a lot, because it’s where all the data preprocessing happen, that is where 99% of the software investments are made. Falling back to C++ whenever you need performance is no more a reasonable option in 2017.

.NET/C# is a superior alternative to Python to built type-safe high-performance production-grade data pipelines. Moreover, I would argue, another debatable point, that C# as a language, is also evolving faster than Python.

Finally, .NET Core being open source and now working both on Linux and Windows, there is no more limitations not to use .NET/C# as the middleware glue for CNTK either. Again, this would contribute in making make CNTK easier easy to integrate into .NET solutions which represent the strategic market that Microsoft has a solid chance to capture.