Details on the .NET first strategy for CNTK

Jun 6, 2017
review

An extensive discussion is taking place on the CNTK project. As I am partly responsible for this discussion, I am gather some more concrete proposals for CNTK.

Correctness by design and BrainScript

My company, Lokad, has built as complex data-driven analytical solution built on .NET. Because machine learning data pipelines are hellish to debug, we seek technologies to ensure as much design correctness as possible. For example, in many programming language, a certain degree of design correctness can be obtained through strong typing. Some languages like Rust or Closure offer other kind of guarantees.

My immediate interest for BrainScript was not for the language itself, but for the degree of design correctness appears to be enforceable in BrainScript at compile time. For example, a static analysis can tell me the total number of model parameters. For example, based on this number, it would be easy to implement a rule in our continuous integration built that prevent an abusively large training task to ever go in production.

Because of the limited expressivity of BrainScript (a good thing!), many more properties can be enforced at compile time, not even starting CNTK. Compile time is important because the continuous integration server may not have access to all the required data this is required to get CNTK up and running.

Then, BrainScript is only one option to deliver this correctness by design. In .NET/C#, it would be straightforward to implement a tiny API that deliver the sample expressivity of BrainScript. The network definition would then be compiled in .NET just like Expression Trees are compiled (*). BrainScript itself could be through at a human-readable serialization format for a valid expression built through this .NET API.

(*) OK, it’s not strictly C# compile time, but in practice if your CNTK-network-description-to-be-compiled is reachable through a unit test, then any failure to compile will be caught through unit tests which is good enough in practice.

In the ticket #1962, I was requesting an extension for BrainScript to be made available for Visual Studio Code because, at the time, I was incorrectly thinking that BrainScript was the core strategy for CNTK. Indeed, BrainScript is still listed as one of the Top 8 reasons to favor CNTK over TensorFlow. Then, as it appears that BrainScript is not the core strategy anymore anyway, then I don’t see any particular reason for the CNTK team to invest in the BrainScript tooling. I am completely fine with that, as long as a .NET-friendly alternative is provided which share the good properties of BrainScript.

Train vs. Eval, production and versioning support

From a machine learning perspective, training is a very distinct operation from evaluation. However, as far software engineering is concerned, the two operations typically live very close. Indeed, it usually one system that collects the data, feed the data to the training logic, collect the model, distribute the model to possibly “clients”, and ensure that those “clients” are capable of executing the training logic. In company like Lokad, we have complex data pipelines, and the best way to ensure that the training data are consistent with the evaluation inputs is to factorize the logic - aka have the same bits of code (C# in our case) being used to cover both use cases.

By design this implies that any machine learning toolkit that does not offer a unified support for both training and evaluation is a major friction. It’s not only a lot more costly to implement, it’s also very error prone, as we need to find alternative ways to ensure that the two implementations (training-side and evaluation-side) are and remain strictly consistent in the way data are feed to the deep learning toolkit on both sides. In particular, this is why Python is so painful for a .NET solution: we not only end-up spreading an alternative stack all over the place, we end-up duplicating implementations.

Then, from v1 to v2, the CNTK changed the serialization format for models. Companies may end-up significant amount of resources invested in training one particular model, thus breaking the serialization format is bad. Yet, in the same time, it would be unreasonable to freeze the model serialization format forever, because it would actually prevent many desirable improvements for CNTK.

Once again, the solution is simply .NET. In C#, implementing complex binary (de)serializer is straightforward; arguably less than 1/10th of the effort compared to C++. Thus, CNTK could adopt an approach where the C++ toolkit only supports one format - the latest; and transfer the burden of maintaining multiple (de)serializers to C#. This approch would also offer the possibility to easily translate models in the “old” formats to the new formats. Moreover, the translation could even be done at runtime in .NET/C# if performance is not concern (it’s not always a concern).

The laundry list for .NET-first CNTK

In this section, I try to briefly cover the most pressing elements for a .NET-first CNTK.

Naked Nugget deployments. Nugget is the de-factor approach to deploy components in the .NET ecosystem. It’s already adopted by CNTK for the C# Evaluation bindings but not for the other parts. In particular, deployements should not involve 3rd party stacks like Python (or Node.js or Java).

A network description API in .NET/C#. The important angle is: the API is declarative, and offers the possibility to ensure some degree of correctness by design. The CNTK team is not even expected to provide the tooling to ensure the correctness by design. As long the network description can be reflected in C#, the community can handle that part.

Low-level abstractions for high perf I/O. As posted at #1963, it’s important to offer the possibility to efficiently stream data to CNTK. From a .NET/C# perspective, a p/invoke passing around byte arrays is good enough as long as the corresponding binary serializers are provided in .NET/C# as well.

The non-goals for a .NET-first CNTK

Alternatively, there also non-goals the first version of a .NET-first CNTK.

Fully managed implementation. For a high-performance library like CNTK, a native C++ implementation feels just fine. Many low level parts of .NET are implemented this way, like System.Numerics.

ASP.NET specifics. As long as compatibility is ensured with .NET, compatibility will be ensured for ASP.NET. I don’t anything to be done specifically for ASP.NET.

Jupyter notebooks. Jupyter is cool, no question. Yet, the interactive perspective is a very Pythonic way of doing things. While more features is desirable, Jupyter does not strike me as critical. Interative C# has been around for a long time, but there is very little community traction to support it.

Visual designer for networks. Visual design is cool for teaching, but this does not strike me as a high-priority feature for the .NET ecosystem. Again, the tools you need for 2h training sessions are very different from what you need for a mission-critical business system.

Unity specifics. Unity is very cool, but what Unity needs most - as far CNTK is concerned - is a clean .NET integration for CNTK itself. The rest is a bonus.

Joannes Vermorel's blog

Details on the .NET first strategy for CNTK

Correctness by design and BrainScript

Train vs. Eval, production and versioning support

The laundry list for .NET-first CNTK

The non-goals for a .NET-first CNTK