Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Sunday, Dec 17, 2017

Terabyte blocks for Bitcoin Cash

Terabyte blocks are feasible both technically and economically. They would allow over 50 transactions per human on earth per day for a cost of less than 1/10th of a US cent per human per day. This analysis assumes no further decrease in hardware costs and no further software breakthrough, only the assembly of existing, proven technologies.

Introduction

As pointed out in the original Bitcoin whitepaper, achieving very large blocks does require taking advantage of Moore's Law rather than being stuck with fixed-capacity devices. A terabyte block represents a block of 1e12 bytes, which can contain about 4 billion Bitcoin transactions. Assuming a worldwide population of 10 billion humans, terabyte blocks offer about 50 transactions per human per day (57 actually, but the extra numerical precision is not significant).
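
As a quick sanity check, the arithmetic above can be replayed in a few lines of Python. This is only a sketch: the 250-byte average transaction size is the figure used later in this post, and the variable names are mine.

```python
# Capacity of a terabyte block, using the figures quoted in this post.
BLOCK_SIZE_BYTES = 10**12        # 1 TB block
TX_SIZE_BYTES = 250              # average transaction size assumed in this post
BLOCK_INTERVAL_SECONDS = 600     # one block every 10 minutes
WORLD_POPULATION = 10 * 10**9    # 10 billion humans

tx_per_block = BLOCK_SIZE_BYTES // TX_SIZE_BYTES
blocks_per_day = 24 * 3600 // BLOCK_INTERVAL_SECONDS
tx_per_human_per_day = tx_per_block * blocks_per_day / WORLD_POPULATION
tx_per_second = tx_per_block / BLOCK_INTERVAL_SECONDS

print(f"{tx_per_block:,} transactions per block")
print(f"{tx_per_human_per_day:.1f} transactions per human per day")
print(f"{tx_per_second / 1e6:.1f} million transactions per second")
```

The same numbers drive every estimate in the rest of this post: roughly 4 billion transactions per block and close to 7 million transactions per second.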

50 transactions per day per human appears sufficient to cover all human-driven activities; only a healthy machine-to-machine market would require an even greater number of transactions. Such a market remains hypothetical at present, and goes beyond the scope of this post.

Bigger blocks are the go-to plan to make the most of the hashing power invested in the Bitcoin network. Indeed, the hashing power provides the same security no matter if 1MB blocks or 1TB blocks are used, yet in the latter case, each transaction is secured with a million times less energy.

The on-chain scalability challenge is irrelevant for Bitcoin Core, as blocks are capped at 1MB, which ensures no more than half a dozen transactions per second. However, terabyte blocks are relevant for Bitcoin Cash, which could face about 7 million transactions per second while producing terabyte blocks. In the following, for the sake of concision, the term Bitcoin always refers to Bitcoin Cash.

The mining rig detailed below, a combination of existing and proven hardware and software technologies, delivers the data processing capacity to process terabyte blocks. The cost associated with this mining rig is also sufficiently low to ensure a healthy decentralized market that includes hundreds of independent miners; arguably a more decentralized market than Bitcoin mining as of today.

For the sake of the scalability analysis, I am excluding the Bitcoin emission revenues, focusing only on the transaction fees and other alternative revenue streams which do not depend on Bitcoin inflation. Naturally, for the next decades, the bulk of the mining revenues are expected to be associated with the emission of Bitcoins rather than transaction fees.

A terabyte block mining rig

The mining rig includes 256 nodes, where each node includes:

  • 1 Intel Xeon Processor E7, 8 cores (USD 1250)
  • 2 Intel Xeon Phi 7210, 64 cores (USD 4000)
  • 1 Intel Optane 4800X 750GB (USD 3400)
  • 2 Samsung 64GB PC4-19200 DDR4 (USD 1400)
  • 2 WD Red 10TB HDD (USD 750)
  • Misc (rack, power, network) (USD 3000)

The prices have been obtained from public sources such as Amazon. Totaling those 256 nodes gives a price point of 3.5M USD. In addition to the nodes, the rig includes an optical storage layer based on the Freeze-Ray technology of Panasonic. While the price point of this technology is not publicly advertised, various sources are quoting 10 USD/TB as the price point for optical storage. This figure also matches the price point of the optical storage cartridges sold by Sony. Then, Facebook, which has deployed Freeze-Ray, claims an 80% reduction in energy consumption compared to HDDs. As current 10TB HDDs have a typical consumption of 5W when active, that is 0.5W per TB, the Freeze-Ray energy consumption should be about 0.1W per TB. I will be using those two estimates in the following.

In order to cover 20 years' worth of terablocks, the storage layer would require 553 Freeze-Ray rackable units of 1.9PB, which would represent a cost of 11M USD.

Then, the cost of energy should be accounted for. Each node consumes about 700W according to the nominal consumption of its parts, which gives about 180kW for the 256 nodes. Also, with 0.1W per TB, the storage layer consumes an extra 100kW. Assuming a kWh at 0.1 USD, the yearly energy cost would be 250k USD, totaling 5M USD over 20 years.

Finally, a 50Gbps internet connection is added for a price of 25,000 USD per month, which totals 6M USD over 20 years.

The cost of the mining equipment per se, i.e. the hardware computing terahashes, is deliberately ignored, because this hardware can be considered as independently funded through the Bitcoin inflation.

Thus, I am considering here a 26M USD investment, to be amortized over 20 years; that is 1.3M USD/year of funding. At this point, it still needs to be proven that (A) the Bitcoin fee market can sustain such an expensive mining rig and (B) this mining rig is capable of processing terabyte blocks.
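
The budget above can be re-derived with a short script. It is a sketch under my own reading of the figures: each listed price is taken as the total for its line (which is what matches the 3.5M USD figure), and the storage layer is assumed to draw power for its full 20-year capacity from day one.

```python
# Re-deriving the 20-year budget of the rig from the figures quoted above.
NODE_COUNT = 256
NODE_PARTS_USD = {                       # each price read as the total for its line
    "Xeon E7 CPU": 1250,
    "2x Xeon Phi 7210": 4000,
    "Optane 4800X 750GB": 3400,
    "2x 64GB DDR4": 1400,
    "2x WD Red 10TB": 750,
    "misc (rack, power, network)": 3000,
}
YEARS = 20
BLOCKS_PER_YEAR = 144 * 365              # one 1 TB block every 10 minutes
OPTICAL_USD_PER_TB = 10
OPTICAL_WATT_PER_TB = 0.1
NODE_WATT = 700
USD_PER_KWH = 0.1
BANDWIDTH_USD_PER_MONTH = 25_000

nodes_usd = NODE_COUNT * sum(NODE_PARTS_USD.values())
storage_tb = BLOCKS_PER_YEAR * YEARS     # 1 TB of storage per block
storage_usd = storage_tb * OPTICAL_USD_PER_TB
power_kw = (NODE_COUNT * NODE_WATT + storage_tb * OPTICAL_WATT_PER_TB) / 1000
energy_usd = power_kw * 24 * 365 * USD_PER_KWH * YEARS
bandwidth_usd = BANDWIDTH_USD_PER_MONTH * 12 * YEARS
total_usd = nodes_usd + storage_usd + energy_usd + bandwidth_usd

for label, usd in [("nodes", nodes_usd), ("storage", storage_usd),
                   ("energy", energy_usd), ("bandwidth", bandwidth_usd),
                   ("total", total_usd), ("per year", total_usd / YEARS)]:
    print(f"{label:>10}: {usd / 1e6:6.2f}M USD")
```

The unrounded total comes out close to 25M USD; the 26M USD figure used in this post sums the rounded components and rounds up, which keeps the estimate on the pessimistic side.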

As it does not make sense to build such a rig if the market cannot reasonably fund it, let's start with the financing part.

Financing terablocks

Assuming 250 bytes per transaction, terabyte blocks would deliver about 57 transactions per human per day, assuming a rough 10 billion humans on earth. The exact count of humans is not important, as the cost of the mining rig is essentially linear in the number of transactions, which is also itself essentially linear in the number of humans transacting on the blockchain. If there are fewer humans using the blockchain, then the mining rig is linearly cheaper.

If we assume that the same 10 billion humans contribute 1/10th of a cent per day to fund the miners through their transaction fees, then the yearly transaction fees would be 3.65 billion USD. Thus, those yearly transaction fees would cover the amortized cost of over 3650 / 1.3 = 2800 mining rigs. Assuming that miners want to profit beyond the marginal cost of operating a rig, a gross operating margin at 60% would still leave room for over 1000 profitable miners.
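
The fee arithmetic can be checked the same way. This is a sketch with my own variable names; the 60% margin is interpreted as leaving 40% of a miner's fee revenue to pay for the rig.

```python
# How many rigs can the fee market fund, using the figures of this post?
POPULATION = 10 * 10**9
FEE_USD_PER_HUMAN_PER_DAY = 0.001      # 1/10th of a cent
RIG_COST_USD_PER_YEAR = 1.3e6          # amortized cost of one rig
GROSS_MARGIN = 0.60

yearly_fees_usd = POPULATION * FEE_USD_PER_HUMAN_PER_DAY * 365
rigs_at_cost = yearly_fees_usd / RIG_COST_USD_PER_YEAR
# With a 60% gross margin, the rig may only absorb 40% of a miner's revenue.
rigs_at_margin = yearly_fees_usd * (1 - GROSS_MARGIN) / RIG_COST_USD_PER_YEAR

print(f"yearly fees: {yearly_fees_usd / 1e9:.2f} billion USD")
print(f"rigs funded at cost: {rigs_at_cost:.0f}")
print(f"rigs funded at a 60% gross margin: {rigs_at_margin:.0f}")
```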

Funding a large number of copies of the blockchain is important to ensure a high degree of decentralization. The analysis that has been carried out so far shows that minimal transaction fees would be sufficient to fund over a hundred competing yet very profitable miners. However, our analysis is ignoring all the economic value that can be generated by holding a copy of the blockchain data for purposes other than validating transactions.

Let's assume that a wallet app, which displays an ad along with the Bitcoin balance, could earn about 1 USD per 100,000 views for a simple non-intrusive banner. Assuming that the same humans would check their balance once a week on such a service, we are considering a 5 billion USD market just through advertising revenues. This single use case would fund by itself hundreds of additional copies of the terabyte blockchain.

Then, assuming that the upper bound for the monetization of a terabyte blockchain is at 0.5 USD per user per year is conservative. In 2016, Google was extracting about 7 USD per user per year, while Facebook was extracting about 16 USD per user per year. If Bitcoin reaches 1TB per block, a large portion of the world economy will be running on top of this blockchain, offering numerous monetization opportunities.

I fail to see why, collectively, the market would not manage to extract at least 5 USD per user per year on average through blockchain related services. At this point, we are entering the realm of profitably funding a thousand more copies of the terabyte blockchain.

Scaling the terabyte blockchain

Some data processing problems are intrinsically difficult to spread over multiple computers (like machine learning) or are even designed to be prohibitively difficult (like breaking encryption). However, Bitcoin is neither. Bitcoin is an embarrassingly parallel problem, the easiest and most straightforward kind of problem to address through distributed systems.

The scalability challenges faced by Bitcoin are:

  1. Propagating transactions
  2. Validating transactions
  3. Building and broadcasting blocks

Let's review each one of those challenges.

Scaling the transaction propagation

Propagating transactions is the easiest. It merely requires bandwidth. As Bloom filters, or even better filters, can be used, the P2P propagation of 1TB worth of transactions needs less than 3 TB of bandwidth per miner every 10 min (assuming that each miner resends every transaction twice for fast propagation of the transaction through the network). Indeed, miners transmit the filters first, which are vastly more compact, and transfer the actual transactions only when those transactions are actually requested.

A direct calculation gives a minimal requirement of 45 Gbps to operate. The mining rig has 50 Gbps which is sufficient to reach a sustained throughput of 1TB blocks while aggressively relaying transactions.

Scaling the cryptographic validation

Validating the correctness of a Bitcoin transaction is a two-fold process. First, the cryptographic correctness of the transaction must be validated: the miner must verify that the transaction has been properly signed by the sender of the funds. Second, the economic correctness of the transaction must be validated: the miner must verify that the originating address contains enough funds to cover the transaction. In this section, I am focusing on the first part of this challenge, the cryptographic correctness.

Based on [1], I assume a 2ms CPU cost per transaction on a regular 2GHz x86 CPU. At 250 bytes per transaction, a 1TB block every 10 minutes represents 6.7 million transactions per second. With 2ms of CPU per transaction, we need 13400 CPU cores to perform the concurrent validation. The mining rig contains 256 * 2 * 64 = 32768 cores through the Intel Xeon Phi boards. The mining rig is largely sufficient to keep up with the transaction validation. The rig even has spare capacity to catch up with a delayed validation which could, for example, occur in case of a local network outage. As transactions can be trivially partitioned against a fast hash, achieving a linear scaling of the cryptographic validation is straightforward.
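
To make the partitioning concrete, here is a minimal sketch of both the core budget and a hash-based assignment of transactions to nodes. The choice of BLAKE2b as the "fast hash" and the function name are my own assumptions for illustration; any well-mixed hash would do.

```python
import hashlib

TX_PER_SECOND = 6.7e6          # throughput of terabyte blocks
CPU_SECONDS_PER_TX = 0.002     # 2ms per signature validation, as assumed above
NODE_COUNT = 256
CORES_PER_NODE = 2 * 64        # two Xeon Phi boards of 64 cores each

cores_needed = TX_PER_SECOND * CPU_SECONDS_PER_TX
cores_available = NODE_COUNT * CORES_PER_NODE

def node_for_transaction(txid: bytes, node_count: int = NODE_COUNT) -> int:
    """Map a transaction id to the node responsible for validating it.

    Hypothetical helper: the hash spreads transactions evenly, and every
    node can compute the assignment locally without any coordination.
    """
    digest = hashlib.blake2b(txid, digest_size=8).digest()
    return int.from_bytes(digest, "big") % node_count

print(f"cores needed: {cores_needed:.0f}, cores available: {cores_available}")
print("node for a sample txid:", node_for_transaction(b"sample-transaction-id"))
```

Because the assignment depends only on the transaction id, adding nodes increases the validation capacity linearly, which is the property the argument relies on.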

Scaling the economic validation

As pointed out above, in order to validate a transaction, the miner must also check the balance of the Bitcoin addresses in order to ensure that a transaction does not end up creating Bitcoins out of thin air. In the present implementation of Bitcoin, this validation is performed through a software component known as the UTXO database, the database of unspent transaction outputs.

Terabyte blocks represent 7 million transactions per second. An optimized implementation only requires 2 reads and 2 writes per transaction to the persistent UTXO storage:

  • First read: check whether the transaction is even legit.
  • First write: if the transaction is legit, the address is marked as dirty with the funds removed.
  • Second read: if the transaction makes its way into the next block (produced by another miner), another check is performed to recheck correctness.
  • Second write: if the foreign block is correct, update the balance of the transaction.

Thus, the miner needs a sustained throughput of 4*7=28 million IOPS. As every Intel Optane card offers 550,000 IOPS, the mining rig delivers a collective 140 million IOPS, largely sufficient to sustain the throughput associated with 1TB blocks. Moreover, the rig also has spare capacity to catch up after an outage.
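
The I/O budget is a one-liner to verify. A small sketch, with names of my own choosing, also makes the headroom explicit:

```python
# I/O budget of the UTXO storage, from the figures quoted above.
TX_PER_SECOND = 7_000_000
IO_PER_TX = 4                  # 2 reads + 2 writes per transaction
NODE_COUNT = 256
OPTANE_IOPS = 550_000          # per Intel Optane card, one card per node

required_iops = TX_PER_SECOND * IO_PER_TX
available_iops = NODE_COUNT * OPTANE_IOPS
headroom = available_iops / required_iops

print(f"required: {required_iops / 1e6:.0f}M IOPS")
print(f"available: {available_iops / 1e6:.1f}M IOPS")
print(f"headroom: {headroom:.1f}x")
```

The rig offers about five times the required throughput, which is where the spare capacity to catch up after an outage comes from.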

Once again, sharding transactions against a fast hash is trivial; thus, implementing a Cassandra-like UTXO database is straightforward. Using Cassandra, Netflix had already benchmarked up to 1 million writes/sec back in 2011, while Intel Optane delivers more than 50x the IOPS available back in 2011 through SSDs. Thus, there is no doubt that a specialized database could scale to 28M IOPS and more.

Then, beyond the IOPS, the miner also needs to have enough storage to hold the UTXO database. A compact binary encoding of the UTXO database requires:

  • 1 byte for flags (up to 8)
  • 3 bytes for the block height
  • 20 bytes for the Bitcoin address
  • 4 bytes for the "clean" amount in Satoshis (*)
  • 4 bytes for the "dirty" amount in Satoshis (*)

(*) There are only 21 million Bitcoins, and each Bitcoin contains only 100 million Satoshis. Thus, the number of Bitcoin addresses that can contain over 4 billion (2^32) Satoshis (40 bitcoins) is limited to 550,000 addresses or so. This number of "super-rich" addresses is very small, and thus would be special-cased in order to let the rest of the UTXO database benefit from a more compact encoding. In total, 32 bytes are needed per entry in the UTXO database.
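
The 32-byte layout listed above maps directly onto a fixed-size binary record. The sketch below uses Python's struct module; the field order, the little-endian byte order and the function names are my own choices for illustration, not a specification.

```python
import struct

# flags (1 byte) | block height (3 bytes) | address (20 bytes)
# | clean amount (4 bytes) | dirty amount (4 bytes)
# "<" selects little-endian with no padding, so the record is exactly 32 bytes.
ENTRY = struct.Struct("<B3s20sII")

def pack_entry(flags: int, height: int, address: bytes,
               clean_sat: int, dirty_sat: int) -> bytes:
    """Encode one UTXO entry; amounts must fit in 32 bits (about 42 bitcoins)."""
    if len(address) != 20:
        raise ValueError("address must be exactly 20 bytes")
    return ENTRY.pack(flags, height.to_bytes(3, "little"),
                      address, clean_sat, dirty_sat)

def unpack_entry(raw: bytes):
    """Decode one 32-byte UTXO entry back into its five fields."""
    flags, height, address, clean_sat, dirty_sat = ENTRY.unpack(raw)
    return flags, int.from_bytes(height, "little"), address, clean_sat, dirty_sat

raw = pack_entry(1, 500_000, bytes(range(20)), 1234, 0)
print(len(raw), unpack_entry(raw))
```

An amount above 2^32 - 1 Satoshis makes the packing fail, which is the point where the special-cased "super-rich" addresses would need their own encoding.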

With 256 nodes equipped with a 750GB Intel Optane, there is enough storage to store 6e12 hot addresses, that is, 600 addresses per user considering 1e10 humans. Then, the HDDs of the nodes, which provide over 20TB of additional storage per node, could be used to increase the number of hot addresses to 6000 per human, while keeping more than half of the original storage capacity to spare for other needs.
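
These capacity figures follow from the 32-byte entry size. A short sketch, with my own variable names:

```python
# Capacity of the hot UTXO storage, from the figures quoted above.
ENTRY_BYTES = 32
NODE_COUNT = 256
OPTANE_BYTES = 750 * 10**9          # one 750GB Optane card per node
HDD_BYTES = 2 * 10 * 10**12         # two 10TB HDDs per node
POPULATION = 10**10

optane_entries = NODE_COUNT * OPTANE_BYTES // ENTRY_BYTES
optane_per_human = optane_entries / POPULATION

# Share of the HDD capacity consumed by 6000 hot addresses per human.
hdd_bytes_for_6000 = 6000 * POPULATION * ENTRY_BYTES
hdd_share_used = hdd_bytes_for_6000 / (NODE_COUNT * HDD_BYTES)

print(f"{optane_entries:.1e} entries on Optane, {optane_per_human:.0f} per human")
print(f"6000 addresses per human use {hdd_share_used:.1%} of the HDD capacity")
```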

In practice, both modern HDDs and the Intel Optane perform 4KB block reads and 4KB block writes at the hardware level (beware: these storage blocks should not be confused with the blockchain blocks). Thus, the most efficient strategy when writing would be to read a storage block, which contains 4096/32 = 128 entries, and to evict the oldest entry, according to the blockchain block height.
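
The eviction strategy can be sketched as follows. The storage block is modeled as an in-memory dictionary for readability, whereas a real implementation would work on the raw 4KB buffer; what happens to the evicted entry (for example, relying on the colder storage layers) is left to the caller.

```python
PAGE_BYTES = 4096
ENTRY_BYTES = 32
ENTRIES_PER_PAGE = PAGE_BYTES // ENTRY_BYTES   # 128 entries per storage block

def upsert(page: dict, address: bytes, height: int, amount: int):
    """Insert or update an entry in a storage block.

    `page` maps address -> (block height, amount). When the block is full and
    the address is new, the entry with the lowest block height is evicted.
    Returns the evicted (address, height, amount), or None.
    """
    evicted = None
    if address not in page and len(page) >= ENTRIES_PER_PAGE:
        oldest = min(page, key=lambda a: page[a][0])
        old_height, old_amount = page.pop(oldest)
        evicted = (oldest, old_height, old_amount)
    page[address] = (height, amount)
    return evicted

page = {}
for i in range(ENTRIES_PER_PAGE):
    upsert(page, i.to_bytes(20, "big"), height=i, amount=1)
print(upsert(page, (128).to_bytes(20, "big"), height=128, amount=1))
```

The last call evicts the entry written at block height 0, since the page already holds 128 entries.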

Beyond those hot addresses, the miner leverages its slower optical storage layer, which contains checkpointed copies of the full UTXO database. As it would take more than 100 days for all users to collectively touch more than their 6000 "hot" addresses, the full snapshots of the UTXO database can be done rather infrequently, probably about once per month, the final tuning being dependent on the precise hardware specification.

Updating the UTXO database once a new block is found is also a non-issue. The mining rig has 32TB of RAM available, and this RAM can be used to keep the latest blocks in-memory while those blocks are being gradually written to the UTXO database. In particular, the amount of RAM is sufficient to cover even the rarest situations where a dozen or so blocks end up being orphaned.

Scaling the block propagation

Once a miner has found a target hash, there is a strong incentive to quickly broadcast the corresponding block; otherwise, another miner might win the mining race by broadcasting its own alternative block faster in the meantime. However, by the time a block is found, the bulk of its content, the transactions, is already known to the other miners. Thus, the only information that needs to be transferred is a compact filter which points out the exact set of transactions that has been included in the block.

This mechanism is leveraged by Graphene, which reduces the amount of data that needs to be broadcast when a new block is found to a fraction of the original block size. Graphene demonstrates a compression factor of 186, which would bring down a 1TB block to 5.5GB. As the mining rig has a 50Gbps network connection, it will take less than 1 second to transfer the full payload to a second miner, triggering an exponential cascade of broadcasts. However, it would be inefficient for the receiving miner to wait for the full payload to be received; the cascade of broadcasts would usually start from the first "chunk" received. The Graphene payload would be split into smaller chunks of, say, 100MB.
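
The payload size and transfer times follow directly from those figures. A small sketch, assuming the full 50Gbps link is available to a single transfer:

```python
import math

BLOCK_BYTES = 10**12                   # 1 TB block
GRAPHENE_COMPRESSION = 186             # compression factor quoted above
LINK_BITS_PER_SECOND = 50 * 10**9      # 50Gbps connection
CHUNK_BYTES = 100 * 10**6              # 100MB chunks

payload_bytes = BLOCK_BYTES / GRAPHENE_COMPRESSION
transfer_seconds = payload_bytes * 8 / LINK_BITS_PER_SECOND
chunk_count = math.ceil(payload_bytes / CHUNK_BYTES)
chunk_seconds = CHUNK_BYTES * 8 / LINK_BITS_PER_SECOND

print(f"payload: {payload_bytes / 1e9:.2f} GB")
print(f"full transfer: {transfer_seconds:.2f} s in {chunk_count} chunks")
print(f"one chunk: {chunk_seconds * 1000:.0f} ms")
```

A single chunk takes only a few tens of milliseconds on the wire, which is why starting the cascade from the first chunk matters far more than the total payload time.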

Indeed, the economic interest of the miners is to always work on the latest block. Thus, if a miner claims to have found a new valid block and the latest, say, 100 claims made by the same miner all proved to be correct, then it would be a profitable assumption to put a limited trust into this miner and immediately start the cascade of broadcasts. Breaching this trust would not earn the miner anything, as its peers would still reject the faulty block within a minute. Worse, the misbehaving miner would immediately lose its hard-earned reputation, hence slowing down the propagation of its own future blocks for tens of blocks, as the other miners would opt for the full prior validation. In practice, such a miner would most likely have to re-earn the trust of its peers by mining dozens of reduced blocks (faster to transmit), forfeiting most of the transaction fees to the benefit of its peers.
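
This limited-trust rule can be sketched as a small bookkeeping class. The class and method names are hypothetical, and the threshold of 100 consecutive valid claims is simply the example figure from the paragraph above.

```python
from collections import defaultdict

TRUST_THRESHOLD = 100   # consecutive valid claims, as in the example above

class PeerReputation:
    """Tracks, per miner, the current streak of block claims that proved valid."""

    def __init__(self, threshold: int = TRUST_THRESHOLD):
        self.threshold = threshold
        self.streak = defaultdict(int)

    def record(self, miner_id: str, block_was_valid: bool) -> None:
        """Update the streak once a claimed block has been fully validated."""
        if block_was_valid:
            self.streak[miner_id] += 1
        else:
            self.streak[miner_id] = 0   # one faulty block wipes the reputation

    def relay_before_validation(self, miner_id: str) -> bool:
        """True if chunks from this miner may be relayed before full validation."""
        return self.streak[miner_id] >= self.threshold

reputation = PeerReputation()
for _ in range(100):
    reputation.record("miner-a", True)
print(reputation.relay_before_validation("miner-a"))   # trusted after 100 valid claims
reputation.record("miner-a", False)
print(reputation.relay_before_validation("miner-a"))   # back to full prior validation
```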

Through an early broadcast, and assuming that the Bitcoin network is comprised of miners with similar or superior internet bandwidth, the full broadcast of the 5.5GB to 10,000 miners is straightforward to achieve in 10 seconds or so, assuming that each miner starts propagating the fresh data upon reception of the first chunk, which would happen in less than 200ms no matter the distance between two miners on earth.
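
A back-of-the-envelope model of the cascade, under a deliberately simple assumption of mine: at every round, each miner that already holds the first chunk forwards it to one new miner, so the number of informed miners doubles every 200ms. This only bounds the time for the first chunk to reach everyone; it does not model the sharing of upload bandwidth between peers.

```python
import math

MINERS = 10_000
FIRST_CHUNK_LATENCY_S = 0.2     # upper bound on one hop, as stated above

# With one new peer informed per informed miner per round, coverage doubles.
rounds = math.ceil(math.log2(MINERS))
first_chunk_everywhere_s = rounds * FIRST_CHUNK_LATENCY_S

print(f"{rounds} rounds, first chunk everywhere after {first_chunk_everywhere_s:.1f} s")
```

Under this model, the first chunk reaches all 10,000 miners in under 3 seconds, which leaves a comfortable margin within the 10 seconds or so mentioned above for the remaining chunks to stream behind it.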

Conclusions

At this point, we have seen that a rig costing 1.3M USD a year in amortized costs is sufficient to support terabyte blocks. However, my hardware and bandwidth cost assumptions are unrealistically pessimistic. It will take at least 5 years from now for the Bitcoin ecosystem to reach the point where terabyte blocks are needed (onboarding mankind just takes time). Within 5 years from now, the hardware costs will have diminished - a lot.

Since the publication of the original Bitcoin paper 8 years ago, practically every cost quoted in this document has been reduced by a factor greater than 10. The cost of long-term data storage is already anticipated to be divided by 3 by 2020. The bandwidth cost is also expected to decrease by 30% per year for the coming years.

Then, I am not accounting for any additional software improvements. Flexible Transactions and Schnorr signatures could reduce the transaction size by more than 20%. Pruning the blockchain itself could probably halve the amount of storage actually needed.

Thus, within 5 years, it is conservative to assume that the amortized cost will only be 1/3 of my present estimate with a conservative mix of cheaper hardware and more efficient software. At this point, we would be reaching 400k USD/year for a rig capable of processing all the transactions that mankind will ever need (maybe not all the transactions that machines will ever need though, but that's a different scenario altogether).

For the average individual, 400k USD/year may feel like a huge amount of money, yet from a business perspective, this is a modest amount. In Paris, many well-placed boutiques are paying more than that for the rent alone. A small consultancy firm of 50 consultants, still in Paris, also pays over 400k USD/year for its offices. Opening an IKEA store is considered to be a typical 50M USD investment, twice as much as the mining rig presently considered. The investment cost associated with a small 10-turbine wind farm would also exceed the cost of such a mining rig.

While it is true that this cost represents an entry barrier, mining has been a highly specialized business with high entry barriers for years already. Impotent miners, nodes that do not mine blocks, do not add security to the network. The only option to further decentralize Bitcoin is not to wish for a downsizing of miners, but to organize a massive expansion of the mining pie, which will comparatively shrink every miner.

Saturday, Nov 11, 2017

Bitcoin Cash is Bitcoin, a software CEO perspective

TLDR: my company, Lokad, is redirecting its attention to Bitcoin Cash, as the true Bitcoin

Like Jeff Bezos, I also believe that being successful in business depends on being right rather than being smart. Smarter means that you will solve given problems faster and better. Righter means that you will identify better problems. As I have written in the past, smarter problems trump smarter solutions. Every single time.

What’s Bitcoin about? The intent is to let anyone send and receive secure money in a way that is almost free and almost instant (check the original paper). Over the last two years, Blockstream, a heavily funded company, has brought “smart” but terribly wrong “improvements” to Bitcoin:

  • They have denied the almost free property of Bitcoin by capping the block size.
  • They have denied the almost instant property of Bitcoin through RBF (Replace By Fee).
  • They have weakened the security of Bitcoin through SegWit (Segregated Witness).

To be fair to the Blockstream team, they can’t claim the full ownership of this mess. They got help from other, smart, but unfortunately equally wrong, people.

Now, the Bitcoin community is not without resources. Reasonable people, including the very first non-anonymous Bitcoin developer, have been pointing in the same direction for years. Thus, last August, the community finally made a stand: Bitcoin Cash.

The only thing that you really need to know about Bitcoin Cash is that Bitcoin Cash is Bitcoin. Bitcoin Cash has simply undone the damaging Bitcoin features; and yes, sending money is back to being almost free and almost instant. Plus, the whole thing no longer relies on insecure shenanigans such as "anyone can spend" tricks.

For my company Lokad, which specializes in supply chain optimization, the blockchain has many promising applications. Naturally, it’s always possible to roll out your own blockchain, but that somewhat defeats the purpose of having a globally unified ledger. Yet, a ledger limited to 7 transactions per second is unusable. At Lokad, we have clients who are already doing more than that! My company needs a ledger that can process tens of thousands of transactions per second; and this happens to be exactly what Bitcoin Cash is about.

Finally, the biggest hurdle that I see with SegWit Bitcoin (for lack of a better name) is SegWit. From my software engineering perspective, this feature is poison: an over-engineered mess that is going to increasingly hurt as time passes. Having personally rewritten the core forecasting engine of my own company four times, I do claim some experience in recognizing an unsustainable engineering mess when I see one: SegWit is one of them. If you really seek to fix malleability (a non-urgent problem btw) then FlexTrans is a much simpler and more secure alternative. Removing SegWit from SegWit Bitcoin feels more unrealistic with every passing day.

Thus, Bitcoin Cash remains as the only viable option, which fortunately, happens to be a very good option.

Tuesday, Jun 6, 2017

Details on the .NET first strategy for CNTK

An extensive discussion is taking place on the CNTK project. As I am partly responsible for this discussion, I am gathering some more concrete proposals for CNTK.

Correctness by design and BrainScript

My company, Lokad, has built a complex data-driven analytical solution on .NET. Because machine learning data pipelines are hellish to debug, we seek technologies to ensure as much design correctness as possible. For example, in many programming languages, a certain degree of design correctness can be obtained through strong typing. Some languages like Rust or Clojure offer other kinds of guarantees.

My immediate interest in BrainScript was not for the language itself, but for the degree of design correctness that appears to be enforceable in BrainScript at compile time. For example, a static analysis can tell me the total number of model parameters. Based on this number, it would be easy to implement a rule in our continuous integration build that prevents an abusively large training task from ever going into production.
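
To illustrate the kind of check I have in mind, here is a minimal sketch of a declarative network description whose parameter count is computed statically, without starting any deep learning toolkit. It is written in Python purely for brevity; the Dense class, the helper and the budget are hypothetical and are not part of CNTK or BrainScript.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dense:
    """A fully connected layer: inputs x outputs weights plus one bias per output."""
    inputs: int
    outputs: int

    def parameter_count(self) -> int:
        return self.inputs * self.outputs + self.outputs

def total_parameters(layers) -> int:
    """Check that consecutive layers fit together, then count all parameters."""
    for previous, current in zip(layers, layers[1:]):
        if previous.outputs != current.inputs:
            raise ValueError(f"shape mismatch: {previous} -> {current}")
    return sum(layer.parameter_count() for layer in layers)

MAX_PARAMETERS = 1_000_000   # hypothetical budget enforced by the CI build

network = [Dense(784, 256), Dense(256, 10)]
count = total_parameters(network)
print(count)
assert count <= MAX_PARAMETERS, "network too large to go into production"
```

Both the shape mismatch and the oversized model are caught by a plain unit test, with no training data and no GPU required.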

Because of the limited expressivity of BrainScript (a good thing!), many more properties can be enforced at compile time, without even starting CNTK. Compile time is important because the continuous integration server may not have access to all the data that is required to get CNTK up and running.

Then, BrainScript is only one option to deliver this correctness by design. In .NET/C#, it would be straightforward to implement a tiny API that delivers the same expressivity as BrainScript. The network definition would then be compiled in .NET just like Expression Trees are compiled (*). BrainScript itself could be thought of as a human-readable serialization format for a valid expression built through this .NET API.

(*) OK, it's not strictly C# compile time, but in practice if your CNTK-network-description-to-be-compiled is reachable through a unit test, then any failure to compile will be caught through unit tests which is good enough in practice.

In the ticket #1962, I was requesting an extension for BrainScript to be made available for Visual Studio Code because, at the time, I was incorrectly thinking that BrainScript was the core strategy for CNTK. Indeed, BrainScript is still listed as one of the Top 8 reasons to favor CNTK over TensorFlow. Then, as it appears that BrainScript is not the core strategy anymore anyway, I don't see any particular reason for the CNTK team to invest in the BrainScript tooling. I am completely fine with that, as long as a .NET-friendly alternative is provided which shares the good properties of BrainScript.

Train vs. Eval, production and versioning support

From a machine learning perspective, training is a very distinct operation from evaluation. However, as far as software engineering is concerned, the two operations typically live very close. Indeed, it is usually one system that collects the data, feeds the data to the training logic, collects the model, distributes the model to possibly many "clients", and ensures that those "clients" are capable of executing the training logic. In a company like Lokad, we have complex data pipelines, and the best way to ensure that the training data are consistent with the evaluation inputs is to factorize the logic - aka have the same bits of code (C# in our case) being used to cover both use cases.

By design, this implies that any machine learning toolkit that does not offer unified support for both training and evaluation is a major friction. It's not only a lot more costly to implement, it's also very error-prone, as we need to find alternative ways to ensure that the two implementations (training-side and evaluation-side) are and remain strictly consistent in the way data are fed to the deep learning toolkit on both sides. In particular, this is why Python is so painful for a .NET solution: we not only end up spreading an alternative stack all over the place, we end up duplicating implementations.

Then, from v1 to v2, CNTK changed the serialization format for models. Companies may end up with a significant amount of resources invested in training one particular model, thus breaking the serialization format is bad. Yet, at the same time, it would be unreasonable to freeze the model serialization format forever, because it would actually prevent many desirable improvements for CNTK.

Once again, the solution is simply .NET. In C#, implementing a complex binary (de)serializer is straightforward; arguably less than 1/10th of the effort compared to C++. Thus, CNTK could adopt an approach where the C++ toolkit only supports one format - the latest - and transfers the burden of maintaining multiple (de)serializers to C#. This approach would also offer the possibility to easily translate models from the "old" formats to the new formats. Moreover, the translation could even be done at runtime in .NET/C# if performance is not a concern (it's not always a concern).
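
The pattern is simple enough to sketch in a few lines. The example below is in Python for brevity, and both binary layouts are made up for the illustration; they are not the actual CNTK v1 or v2 model formats. The point is only the shape of the solution: one reader per historical version, one writer for the latest version, and an upgrade function on top.

```python
import struct

LATEST_VERSION = 2

def read_v1(payload: bytes) -> dict:
    # Hypothetical v1 layout: uint32 count, then `count` float32 weights.
    (count,) = struct.unpack_from("<I", payload, 0)
    weights = list(struct.unpack_from(f"<{count}f", payload, 4))
    return {"name": "unnamed", "weights": weights}

def read_v2(payload: bytes) -> dict:
    # Hypothetical v2 layout: uint8 name length, UTF-8 name, uint32 count, float32 weights.
    name_len = payload[0]
    name = payload[1:1 + name_len].decode("utf-8")
    (count,) = struct.unpack_from("<I", payload, 1 + name_len)
    weights = list(struct.unpack_from(f"<{count}f", payload, 5 + name_len))
    return {"name": name, "weights": weights}

def write_v2(model: dict) -> bytes:
    name = model["name"].encode("utf-8")
    weights = model["weights"]
    return (bytes([len(name)]) + name
            + struct.pack("<I", len(weights))
            + struct.pack(f"<{len(weights)}f", *weights))

READERS = {1: read_v1, 2: read_v2}

def upgrade(blob: bytes) -> bytes:
    """Translate a model blob of any known version into the latest format.

    The first byte of the blob is the format version.
    """
    version, payload = blob[0], blob[1:]
    if version not in READERS:
        raise ValueError(f"unknown model format version {version}")
    return bytes([LATEST_VERSION]) + write_v2(READERS[version](payload))

old_blob = bytes([1]) + struct.pack("<I", 2) + struct.pack("<2f", 1.0, 2.5)
new_blob = upgrade(old_blob)
print(new_blob[0], read_v2(new_blob[1:]))
```

With this split, the native toolkit only ever sees the latest format, while the list of readers grows by one small function per format change.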

The laundry list for .NET-first CNTK

In this section, I try to briefly cover the most pressing elements for a .NET-first CNTK.

Naked NuGet deployments. NuGet is the de facto approach to deploy components in the .NET ecosystem. It's already adopted by CNTK for the C# Evaluation bindings but not for the other parts. In particular, deployments should not involve 3rd party stacks like Python (or Node.js or Java).

A network description API in .NET/C#. The important angle is: the API is declarative, and offers the possibility to ensure some degree of correctness by design. The CNTK team is not even expected to provide the tooling to ensure the correctness by design. As long as the network description can be reflected in C#, the community can handle that part.

Low-level abstractions for high perf I/O. As posted at #1963, it's important to offer the possibility to efficiently stream data to CNTK. From a .NET/C# perspective, a p/invoke passing around byte arrays is good enough as long as the corresponding binary serializers are provided in .NET/C# as well.

The non-goals for a .NET-first CNTK

Conversely, there are also non-goals for the first version of a .NET-first CNTK.

Fully managed implementation. For a high-performance library like CNTK, a native C++ implementation feels just fine. Many low level parts of .NET are implemented this way, like System.Numerics.

ASP.NET specifics. As long as compatibility is ensured with .NET, compatibility will be ensured for ASP.NET. I don't see anything to be done specifically for ASP.NET.

Jupyter notebooks. Jupyter is cool, no question. Yet, the interactive perspective is a very Pythonic way of doing things. While more features are desirable, Jupyter does not strike me as critical. Interactive C# has been around for a long time, but there is very little community traction to support it.

Visual designer for networks. Visual design is cool for teaching, but this does not strike me as a high-priority feature for the .NET ecosystem. Again, the tools you need for 2h training sessions are very different from what you need for a mission-critical business system.

Unity specifics. Unity is very cool, but what Unity needs most - as far as CNTK is concerned - is a clean .NET integration for CNTK itself. The rest is a bonus.

Monday, Jun 5, 2017

.NET-first strategy for CNTK

CNTK is an incredible deep learning toolkit from Microsoft. Despite being little known, under the hood, the technology rivals the capabilities of TensorFlow. While originating from Microsoft, it’s unfortunate that CNTK decided to steer away from the Microsoft ecosystem, actually making CNTK a less viable option than TensorFlow as far as .NET is concerned.

My conclusions:

  • as a contender for the Python ecosystem, CNTK is a lost cause. TensorFlow has already won by a large margin; just like x86-64 won over IA-64.
  • unlike Python, the .NET ecosystem is still vastly underserved as far as deep learning is concerned; this would be a perfect spot for CNTK if CNTK opted for a .NET-first strategy.
  • by establishing CNTK as the go-to option for deep learning in .NET, CNTK would be very well positioned to become the go-to option for Java as well.

As a long-time supporter of the Microsoft ecosystem, it really pains me to see otherwise excellent Microsoft technologies heading to the wall, just because their strategy makes them irrelevant by design for the broader community.

At the core, CNTK is a C++ product, with a primary focus on raw performance, that is, making the most accuracy-wise of the computing resources that are allocated to a machine learning task. Yet, as confirmed by the team, CNTK is focusing on Python as the high-level entry point for CNTK.

CNTK also features BrainScript, a tiny DSL (domain specific language) intended to design a deep learning network with a high-level declarative syntax. While BrainScript is advertised as a scripting language, it’s a glorified configuration file, which is an excellent option in the deep learning context.

A frontal assault on TensorFlow is a lost cause

The Python-first orientation of CNTK is a strategic mistake for CNTK, and will only consolidate CNTK as a distant second behind TensorFlow.

The Python deep-learning ecosystem is already very well-served through TensorFlow and its own ecosystem. As a quick guesstimate, TensorFlow presently has 100x the momentum of CNTK. Amazon tells me that there are 50+ books on TensorFlow against exactly zero books for CNTK.

Can CNTK catch up frontally against TensorFlow? No. TensorFlow has the first-mover advantage and my own casual observations of HackerNews indicate that TensorFlow is even growing faster than CNTK, further widening the gap. CNTK might have better performance, but it’s not a game changer, not a sustainable game changer anyway. The TensorFlow teams are strong, and the CNTK performance tuning is being aggressively replicated.

Anecdotally, the Microsoft teams themselves seem to be internally favoring TensorFlow over CNTK. TensorFlow is already more mature for the Microsoft ecosystem, i.e. .NET, than CNTK.

Business 101: don’t engage frontally in a battle that is already lost. You can be ambitious, but you need an "angle".

Why Python is a lost cause for Microsoft

Microsoft has tried to reach out to the Python ecosystem for more than a decade; the efforts date back to IronPython in 2006, followed by the Visual Studio tools in 2011. Yet, at the present time, after a decade of efforts, the fraction of the Python ecosystem successfully attracted by Microsoft remains negligible: let's say it underflows measurements. I fail to see why CNTK would be any different.

The Python ecosystem has consolidated itself around strictly non-Microsoft technologies and non-Microsoft environments. There is no SQL Server, it’s PostgreSQL. There is no Microsoft Azure, it’s AWS or Google Cloud. Etc. If I were to bet some money, I would gamble that CNTK won’t have any meaningful presence in the Python machine learning ecosystem of tomorrow. Most likely, it will be TensorFlow and a couple of non-Microsoft contenders (*).

(*) I am not saying that no strong contenders for TensorFlow will emerge from the deep learning community; I am saying that no strong contenders for TensorFlow targeting Python will emerge from Microsoft.

Python is a major friction for a .NET solution

One deep yet frequent misunderstanding in academic or research circles concerns the degree of pain that heterogeneous software stacks represent for companies, clients and vendors alike. Maintaining healthy production systems when a single stack (e.g. .NET or Java or Python) is involved is already a challenge. Introducing a second stack is very costly in practice.

My company Lokad is developing a complex .NET solution based on Microsoft Azure. If tomorrow Lokad were to start relying on Python, we would have:

  • to monitor closely all the Python packages and dependencies, just like we do for .NET packages, if only to be capable of swiftly deploying security fixes.
  • to install and maintain consistent Python versions across all our machines, from the development workstations to production servers, and organize company-wide (*) upgrades accordingly.
  • to develop and foster a culture of Python, which includes knowing the language (easy) but also all the institutional knowledge needed to build good Python apps (hard).

(*) Sometimes you're out of luck and bits of your software live on the client side too, forcing you into multi-version support of the base framework itself.

Keeping the technological mass of a software solution under control is a very important concern. LaTeX might have succeeded despite being built out of a dozen programming languages, but this kind of sprawl will kill any independent software vendor (ISV) with a high degree of certainty.

All those considerations have nothing to do with whether Python is good or bad. The challenge would be identical if we were to introduce the Java or Swift stacks in our .NET codebase.

The tech mass of BrainScript is minimal

While Python is a whole technology stack of its own, BrainScript, the DSL of CNTK, is nothing but a glorified configuration file. As far as the technological mass is concerned, managing a DSL like BrainScript is a no-brainer. Let’s face it, this is not a new language to learn. The configuration file for ASP.NET (web.config) is an order of magnitude more complex than BrainScript as a whole, and nobody refers to web.config files as being a programming language of their own.

While the CNTK team decided to steer away from BrainScript, I would, on the contrary, suggest doubling down on this approach. Machine learning is complicated, bugs are hard to track down, and unlike machine learning competitions where datasets are 100% well-prepared, data in the real world is messy and poorly documented. Real software businesses are facing deadlines and limited budgets. We need machine learning tools, but more importantly, we need tools that deliver some degree of correctness by design. BrainScript is imperfect, but it’s a solid step in the right direction.

.NET is a massive opportunity for CNTK

The .NET ecosystem is vast, arguably significantly larger than that of Python, and yet fully underserved as far as deep learning is concerned.

It is a misconception to think that software solutions designed with .NET would benefit less from deep learning than Python software solutions. The needs for deep learning and the expected benefits are the same. Python just happens to be much more popular in the publishing community than .NET.

Most .NET solution vendors, like Lokad, would immediately jump on CNTK if .NET was given a strong, clear priority. Indeed, a .NET-first perspective would be a game changer for CNTK in this ecosystem. Instead of struggling with second-class citizens, like the current C# bindings of CNTK, we would benefit from first-class citizens, which would happen to be completely aligned with the rest of the .NET ecosystem.

Also, .NET is very close to Java - at least, Java is much closer to .NET than it is to Python. Establishing CNTK as a deep learning leader in the .NET ecosystem would also make a very strong case to reach out to the Java ecosystem later on.

.NET/C# is superior to Python for deep learning

Caveat: opinionated section

C# is nearly uniformly superior to Python: performance, productivity, tooling.

Performance is a primary concern for the platform intended as the high-level instrumentation of the deep learning infrastructure. Indeed, one should never underestimate how much development effort the data pipeline represents. It might be possible to describe a deep learning network in 50 lines of code, yet, in practice, the data pipeline that surrounds those lines weighs thousands of lines. Moreover, because we are moving a lot of data around, and because preprocessing the data can be quite challenging on its own, the data pipeline needs to be efficient.

Anecdote: at Lokad, we have multiple data pipelines that involve crunching over 1TB of data on a daily basis. We run those data pipelines with highly optimized C# that makes aggressive use of both the async capabilities of C# and low-level C-like algorithms.

With .NET/C#, my own experience at building Lokad indicates that it’s usually possible to achieve over 50% of the raw C performance on CPU by paying minimal attention to performance - aka don’t go crazy with objects, use struct when relevant, etc. With CPython, achieving even 10% of the C performance is a struggle, and frequently we end up with 1% of the performance of C. Yes, PyPy exists, but then, the Python ecosystem is badly fragmented, and PyPy is not even compatible with TensorFlow. When it comes to machine learning, raw performance of the high-level interop language matters a lot, because it’s where all the data preprocessing happens, that is, where 99% of the software investments are made. Falling back to C++ whenever you need performance is no longer a reasonable option in 2017.

.NET/C# is a superior alternative to Python to build type-safe, high-performance, production-grade data pipelines. Moreover, I would argue, as another debatable point, that C# as a language is also evolving faster than Python.

Finally, with .NET Core being open source and now working on both Linux and Windows, there is no longer any obstacle to using .NET/C# as the middleware glue for CNTK either. Again, this would contribute to making CNTK easier to integrate into .NET solutions, which represent the strategic market that Microsoft has a solid chance to capture.

Monday
Aug292016

The sad state of .NET deployments on Azure

One of the core benefits of cloud computing should be ease of deployment. At Lokad, we have been using Azure in production since 2010. We love the platform, and we depend on it absolutely. Yet, it remains very frustrating that .NET deployments have not made nearly as much progress as could have been expected 6 years ago.

The situation of .NET deployments is reminiscent of the data access re-inventions which were driving Joel Spolsky nuts 14 years ago. Now, Microsoft has shifted its attention to app deployments, and waves of stuff keep rolling in, without really addressing core concerns, and leaving the whole community disoriented.

At present time, there are 6 major ways of deploying a .NET app on Azure:

  • ASM, WebRole and WorkerRole
  • ASM, Classic VM
  • ARM, WebApp
  • ARM, Azure Batch
  • ARM, Azure Functions
  • ARM, VM scale set

ASM stands for Azure Service Manager, while ARM stands for Azure Resource Manager. The ASM gathers the first generation of cloud services on Azure. The ARM gathers the second generation of cloud services on Azure.

ASM to ARM transition is a mess

In Azure, pretty much everything comes in two flavors: the ASM one and the ARM one; even the Blob Storage accounts (equivalent of S3 on AWS). Yet, there is no migration possible. Once you create a resource - a storage account, a VM, etc - on one side, it cannot be migrated to the other side. It's such a headache. Why is it the responsibility of the clients to deal with this mess? Most resources should be accessible from both sides, or be "migratable" in a few clicks. Then, whenever the ASM/ARM distinction is not even relevant (e.g. storage accounts), the distinction should not even be visible.

So many ways to do the same thing

It's maddening to think that pretty much every service handles .NET deployments in a different way:

  • With WebRole and WorkerRole, you locally produce a package of assemblies (think of it as a Zip archive containing a list of DLLs), and you push this package to Azure.
  • With the Classic VM, you get a fresh barebone OS, and you do your own cooking to deploy.
  • With WebApp, you push the source code through Git, and Azure takes care of compiling and deploying.
  • With Azure Batch, you push your DLLs to the blob storage, and script how those files should be injected/executed in the target VM.
  • With Azure Functions, you push the source code through Git, except that unlike WebApp, this is not-quite-exactly-C#.
  • With the VM scale set, you end up cooking your own OS image that you push to deploy.

Unfortunately, the sanest option, the package option as used for WebRole and WorkerRole, is not even available in the ARM world.

The problem with Git pushes

Many companies - Facebook or Google for example - leverage a single source code repository. Lokad does too now (we transitioned to a single repository 2 years ago, it's much better now). While having a large repository creates some challenges, it also makes tons of things easier. Deploying through Git looks super cool in a demo, but as soon as your repository reaches hundreds of megabytes, problems arise. As a matter of fact, our own deployments on Azure routinely crash while our Git repository "only" weighs 370MB. By the time our repository reaches 1GB, we will probably have entirely given up on using Git pushes to deploy.

In hindsight, it was to be expected. The size of the VM needed to compile the app has no relevance to the size of the VM needed to run the app. Plus, compiling the app may require many software pieces that are not required afterward (do you need your JS minifier to be shipped along with your webapp?). Thus, all in all, deployment through Git push only gets you so far.

The problem with OS management

Computer security is tough. For the average software company, or rather for about 99% of the (software) companies, the only way to ensure decent security for their apps consists of not managing the OS layer. Dealing with the OS is only asking for trouble. Delegating the OS to a trusted party who knows what she is doing is about the only way not to mess it up, unless you are fairly good yourself; which, in practice, eliminates 99% of the software practitioners (myself included).

From this perspective, the Classic VM and the VM scale set feel wrong for a .NET app. Managing the OS has no upside: the app will not be faster, the app will not be more reliable, the app will not have more capabilities. OS management only offers dramatic downsides if you get something wrong at the OS level.

Packages should have solved it all

In retrospect, the earliest deployment method introduced in Azure - the packages used for WebRole and WorkerRole - was really the right approach. Packages scale well and remain uncluttered by the original size of the source code repository. Yet, for some reason this approach was abandoned on the ARM side. Now, the old ASM design does not offer the most obvious benefits that should have been offered by this approach:

  • The packages could have been made even more secure: signing and validating packages is straightforward.
  • Deployment could have been super fast: injecting a .NET app into a pre-booted VM is also straightforward.
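
To illustrate why signing and validating a package is straightforward, here is a minimal sketch in Python. It is not the Azure package format: the function names are hypothetical, the package is a plain Zip archive of files, and an HMAC with a shared secret stands in for a proper public-key signature, which a real deployment system would presumably use instead.

```python
import hashlib
import hmac
import io
import zipfile


def build_package(files):
    """Bundle named files (e.g. DLLs) into one in-memory Zip archive.

    `files` maps a file name to its content as bytes (hypothetical layout).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()


def sign_package(package, key):
    """Return a hex signature of the package bytes.

    Assumption: HMAC-SHA256 with a shared secret, used here as a stand-in
    for a real public-key signature scheme.
    """
    return hmac.new(key, package, hashlib.sha256).hexdigest()


def verify_package(package, signature, key):
    """Check the signature with a constant-time comparison."""
    expected = sign_package(package, key)
    return hmac.compare_digest(expected, signature)
```

Because the whole app is a single blob of bytes, one signature covers everything: the deployment side only has to call verify_package once before injecting the app into a VM.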

For demo purposes, it would have been simple enough to have a Git-to-package utility service running with Azure to offer Heroku-like swiftness to small projects, with the possibility to transition naturally to package deployments afterward.

Almost reinventing the packages

Azure Batch is kinda like the package, but without the packaging. It's more like x-copy deployment with files hosted in a Blob Storage. Yet, because it's x-copy, it will be tricky to support any signing mechanisms. Then, looking further ahead, the pattern 1-blob-per-file is nearly guaranteed to become a performance bottleneck for large apps. Indeed, the Blob Storage offers much better performance at retrieving a 40MB package rather than 10,000 blobs of 4KB each. Thus, for large apps, batch deployments will be somewhat slow. Then, somebody somewhere will start re-inventing the notion of "package" to reduce the number of files ...
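
The gap between one 40MB package and 10,000 blobs of 4KB can be put into rough numbers with a back-of-the-envelope model, sketched below in Python. The per-request latency and the bandwidth are assumptions picked for illustration, not measurements of the Blob Storage, and the model assumes blobs are fetched one after the other.

```python
def retrieval_seconds(blob_count, blob_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate a sequential download: every blob pays one request
    round-trip plus its own transfer time."""
    return blob_count * (latency_s + blob_bytes / bandwidth_bytes_per_s)


# Hypothetical figures, not measured on Azure.
LATENCY_S = 0.010        # assumed 10 ms of overhead per request
BANDWIDTH = 50_000_000   # assumed 50 MB/s of throughput

# Same 40MB of payload in both cases.
one_package = retrieval_seconds(1, 40_000_000, LATENCY_S, BANDWIDTH)
many_blobs = retrieval_seconds(10_000, 4_000, LATENCY_S, BANDWIDTH)

print(f"single package: {one_package:.2f} s")   # about 0.81 s
print(f"10,000 blobs:   {many_blobs:.1f} s")    # about 100.8 s
```

Under those assumptions the payload is identical, but the per-request overhead is paid 10,000 times instead of once. Parallel downloads would shrink the gap without removing it, since the overhead still grows with the number of files.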

With the move toward .NET Core, .NET has never been more awesome, and yet, it could be so much more with a clarified vision and technology around deployments.