I am Joannes Vermorel, founder of Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.




Thinking about an academic package for Azure

This year marks the 4th time that I am teaching Software Engineering and Distributed Computing at the ENS. The classroom project of 2010 is based on Windows Azure, like the one of 2009.

Today, I have been kindly asked by Microsoft folks to suggest an academic package for Windows Azure, to help those like me who want to get their students started with cloud computing in general, and Windows Azure in particular.

Thus, I decided to post my early thoughts here. Keep in mind that I have strictly no power whatsoever over Microsoft's strategy. I am merely publishing some thoughts in the hope of getting extra community feedback.

I believe that Windows Azure, with its strong emphasis on tooling experience, is a very well suited platform for introducing cloud computing to Computer Science students.

Unfortunately, getting an extra budget allocated for a cloud computing course in the Computer Science Department of your local university is typically complicated. Indeed, very few CS courses require a specific budget for experiments. Furthermore, the pay-as-you-go pricing of cloud computing goes against nearly every budgeting rule that universities have in place - at least in France, but I would guess similar policies exist in other countries. Administrations tend to be wary of “elastic” budgets, and IMHO rightfully so. This is not a criticism, merely an observation.

In my short two years of cloud teaching experience, a nice academic package sponsored by Microsoft has tremendously simplified the delivery of the “experimental” part of my cloud computing course.

Obviously, unlike software licenses, offering cloud resources costs very real money. In order to make the most of resources allocated to academics, I would suggest narrowing the offer down to the students who are the most likely to have an impact on the software industry in the next decade.

The following conditions could be applied to the offer:

  1. Course is setup by a Computer Science Department.
  2. It focuses on cloud computing and/or software engineering.
  3. Hands-on, project-oriented teaching.
  4. Small classroom, 30 students or less.

Obviously there are plenty of situations where cloud computing would make sense and yet not fit into these constraints, such as a bioinformatics class with a data-crunching project, or large-audience courses with 100+ students ... but the package that I am proposing below is unlikely to match the needs of those situations anyway.

For the academic package itself, I would suggest:

  1. 1 book per student on Visual Studio 20XY (replace XY by the latest edition).
  2. 4 months of Azure hosting with:
    • 4 small VMs
    • 30 storage accounts, cumulative storage limited to 50GB.
    • 4 small SQL instances of 1GB.
    • Cumulative bandwidth limited to 5GB per day.

Although it might seem surprising in this day and age of cloud computing, I have found that artifacts made out of dead trees tend to be the most valuable ingredient of a good cloud computing course, especially the 1000+ page ones about the latest versions of Visual Studio / .NET (future editions might even include a chapter or two about Windows Azure, which would be really nice).

Indeed, in order to tackle the cloud, students must first overcome the difficulties posed by their programming environments. One can argue that everything can be found on the web. That's true, but there is so much information online about .NET and Visual Studio that students get lost and lose their motivation if they have to browse through an endless flow of information.

Furthermore, I have found that teaching the basics of C# or .NET in a Computer Science course is a bad idea. First, it's like an attempt to kill students out of sheer boredom. Just imagine yourself listening for 3h straight to someone enumerating the keywords of a programming language. Second, you have little or no control over the background of your students. Some might be Java or C++ gurus already, while some might have never heard of OO programming.

With the book in hand, I suggest simply asking students to read a couple of chapters from one week to the next, and interrogating them on their reading at the beginning of each session.

Then, concerning the Windows Azure package itself, I suggest 4 months worth of CPU, as that should fit most courses. If the course spreads over more than 4 months, then I would suggest that students start optimizing their app so as not to use all 4 VMs all the time.

4 VMs seems just enough to feel both the power and the pain of scaling out. It brings a handy 4x speed-up if the app is well designed, but represents a world of pain if the app does not correctly handle concurrency.
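The concurrency pain is easy to demonstrate with a toy compare-and-swap sketch. The example below is illustrative Python, not an Azure API: `Blob` is a hypothetical stand-in for a cloud blob with etag-based conditional writes, showing how a naive read-modify-write loses updates while an etag-guarded write detects the conflict.

```python
class Blob:
    """Toy stand-in for a cloud blob with an etag per version (illustrative)."""
    def __init__(self, value):
        self.value, self.etag = value, 0

    def read(self):
        return self.value, self.etag

    def write(self, value, expected_etag=None):
        # Conditional write: fails if someone else wrote in between.
        if expected_etag is not None and expected_etag != self.etag:
            return False
        self.value, self.etag = value, self.etag + 1
        return True

# Naive interleaving: both workers read 0, both write 1 -> one increment lost.
blob = Blob(0)
v1, _ = blob.read()
v2, _ = blob.read()
blob.write(v1 + 1)
blob.write(v2 + 1)
assert blob.value == 1  # lost update

# Etag-guarded interleaving: the stale write fails, the worker retries.
blob = Blob(0)
v1, e1 = blob.read()
v2, e2 = blob.read()
assert blob.write(v1 + 1, e1)
while not blob.write(v2 + 1, e2):  # conflict detected, re-read and retry
    v2, e2 = blob.read()
assert blob.value == 2  # both increments kept
```

With 4 VMs pounding shared storage, every unguarded read-modify-write is an opportunity for the first scenario to happen.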

The same idea applies to SQL instances. Offering a single 10GB instance would make things easier, but the course should be focused on scaling out, not scaling up. Thus, there is no reason to make things easier here.

In practice, I have found that offering individual storage accounts simplifies experiments, although there is little need to offer either a lot of storage or a lot of bandwidth.

In total, the package would represent a value of roughly $2500 (assuming $30 per book), or, from a different angle, about $100 per student. Not cheap, but attracting talented students seems a worthy (albeit long-term) investment.



You don't know how much you'd miss an O/C mapper till you get one

When we started moving our enterprise app toward Windows Azure, we quickly realized that scalable enterprise cloud apps were tough to develop, real tough.

Windows Azure wasn't at fault here, quite the opposite actually, but the cloud computing paradigm itself makes enterprise apps tough to develop. Indeed, scalability in enterprise apps can't be solved by just piling up tons of memcached servers.

Enterprise apps aren't about scaling out some super-simplistic webapp to a billion users who will be performing reads 99.9% of the time, but rather about scaling out complex business logic along with accordingly complex business data.

This led us to implement Lokad.Cloud, an open source .NET O/C mapper (object-to-cloud), similar in spirit to O/R mappers such as NHibernate, but tailored for NoSQL storage.

I am proud to announce that Lokad.Cloud has reached its v1.0 milestone.

As a matter of fact, you've probably never heard of O/C mappers, so I will explain why relying on a decent O/C mapper should be a primary concern for any ambitious cloud app developer.

To illustrate the point, I am going to list a few subtleties that arise as soon as you start using the Queue Storage. As far as cloud apps are concerned, Queue Storage is one of the most powerful and most handy abstractions to achieve true scale-out behavior.

Microsoft provides the StorageClient, which is basically a .NET wrapper around the REST API offered by the Queue Storage. Let's see how an O/C mapper implemented on top of the StorageClient can make queues even better:

  • Strongly typed messages: Queue Storage deals with binary messages, not with objects. Obviously, you don't want to entangle your business logic with serialization/deserialization logic. Business logic only cares about the semantics of the processing, not about the underlying data format used to persist and transit data over the cloud. The O/C mapper is here to provide a strongly typed view of the Queue Storage.
  • Overflowing messages: Queue Storage caps messages at 8kB. This limitation is fine, as the Blob Storage is available to deal with large (even gigantic) blobs. Yet again, you don't want to mix storage contingencies (the 8kB message limit) with your business logic. The O/C mapper lets large messages overflow into the Blob Storage.
  • Garbage collection: you might think that manually handling overflowing messages is just fine. Not quite so. What will happen to your overflowing messages, conveniently stored in the Blob Storage, if the queue (for good or ill reasons) happens to be cleared? Simple: you end up with a cloud storage leak, with dead pieces of data piling up in your storage while you get charged for them. In order to avoid such a situation, you need a cloud garbage collector that makes sure expired data is automatically collected. The O/C mapper embeds a storage garbage collector.
  • Auto-deletion of messages: messages should not only be retrieved from the Queue, but also deleted once processed. Following the GC idea, developers should not be expected to delete queue messages when the message processing goes OK, much like you don't have to care about destroying objects that go out of reach. The O/C mapper auto-deletes queue messages upon process completion.
  • Delayed messages: Queue Storage does not offer any simple way to schedule a message to reappear in the queue at a specified time. You can come up with your own custom logic, but again, why should the business logic even bother with such details? The O/C mapper supports delayed messages so that you don't have to think about it.
  • Poisoned queues: that one is a deadly subtle one. A poisoned queue message refers to a message that leads to faulty processing, typically an uncaught exception being thrown by the business logic while trying to process the message. The problem is intricately coupled to the good behavior of the Queue: if a retrieved message fails to be deleted within a certain amount of time, the message reappears in the Queue. This behavior is excellent for building robust cloud apps, but deadly if not properly handled. Indeed, faulty messages are going to fail and reappear over and over, consuming ever-increasing cloud resources for no good reason. In a way, poisoned messages represent processing leaks. The O/C mapper detects poisoned messages and isolates them for further investigation and eventual reprocessing once the code is fixed.
  • Abandoning messages: in the cloud, you should not expect VM instances to stay up forever. In addition to hardware faults, the fabric might decide at any time to shut down one of your instances. If a worker instance gets shut down while processing a message, then the processing is lost until the message reappears in the Queue. Nevertheless, such an extra delay might negatively impact your business service level, as an operation that was supposed to take only half a minute might suddenly take 1h (the expiration delay of your message). If the VM gets the chance to be notified of the upcoming shutdown, the O/C mapper abandons in-process messages, making them available for processing again without waiting for expiration.
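To make a few of these bullet points concrete, here is a toy sketch in Python, which is illustrative only and not the Lokad.Cloud API. It combines typed serialization, blob overflow for oversized messages, auto-deletion on success, and poisoned-message quarantine based on a dequeue count. All names and limits here are assumptions for the sake of the sketch.

```python
import pickle

MAX_INLINE = 8 * 1024   # Queue Storage caps messages at 8kB
MAX_TRIALS = 3          # dequeue count beyond which a message is deemed poisoned

class TypedQueue:
    """Toy O/C-mapper-style queue: typed messages, blob overflow,
    auto-deletion and poisoned-message quarantine (illustrative only)."""
    def __init__(self):
        self.queue = []        # entries: [payload_or_blob_key, trials]
        self.blobs = {}        # stand-in for Blob Storage
        self.quarantine = []   # poisoned messages, kept for investigation

    def put(self, message):
        data = pickle.dumps(message)             # typed object -> bytes
        if len(data) > MAX_INLINE:               # overflow into Blob Storage
            key = f"overflow-{len(self.blobs)}"
            self.blobs[key] = data
            data = key
        self.queue.append([data, 0])

    def process(self, handler):
        entry = self.queue.pop(0)
        data, trials = entry
        raw = self.blobs[data] if isinstance(data, str) else data
        message = pickle.loads(raw)              # bytes -> typed object
        try:
            handler(message)                     # auto-deleted on success
            if isinstance(data, str):
                del self.blobs[data]             # GC the overflow blob too
        except Exception:
            entry[1] = trials + 1
            if entry[1] >= MAX_TRIALS:
                self.quarantine.append(message)  # isolate the poisoned message
            else:
                self.queue.append(entry)         # message reappears, as on Azure
```

The business-logic `handler` only ever sees plain objects; serialization, the 8kB limit and retry accounting all stay inside the wrapper, which is the whole point of an O/C mapper.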

I have only illustrated a few points about Queue Storage here, but Blob Storage, Table Storage, the Management API, performance monitoring, ... also need to rely on higher-level abstractions, as offered by an O/C mapper such as Lokad.Cloud, to become fluently usable.

Don't waste any more time crippling your business logic with cloud contingencies, and start using an O/C mapper. I suggest Lokad.Cloud, but I admit this is a biased viewpoint.


MapReduce as burstable low-cost CPU

About two months ago, when Mike Wickstrand set up a UserVoice instance for Windows Azure, I immediately posted my own suggestion concerning MapReduce. MapReduce is a distributed computing concept initially published by Google in late 2004.

Against all odds, my suggestion, driven by the needs of Lokad, made it into the top 10 most requested features for Windows Azure (well, 9th rank, with about 20x fewer votes than the No. 1 request for scaled-down hosting).

Lately, I had the opportunity to discuss this item further with folks at Microsoft gathering market feedback. In the software business, there is a frequent tendency for users to ask for features they don't actually want in the end. The difficulty is that proposed features may or may not correctly address the initial problems.

While preparing the interview, I realized that, to some extent, I had fallen into the same trap when asking for MapReduce. Actually, we have already reimplemented our own MapReduce equivalent, which is not that hard thanks to the Queue Storage.
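The claim that a MapReduce equivalent is not that hard on top of queues can be sketched in a few lines. The example below is illustrative Python, not the actual Lokad implementation: plain lists stand in for a cloud queue of map tasks and a queue of partial results, and each loop turn plays the role of a worker dequeue.

```python
from collections import Counter

def map_reduce(chunks, mapper, reducer):
    task_queue = list(chunks)    # stand-in for a cloud queue of map tasks
    results = []                 # stand-in for a queue of partial results
    while task_queue:            # each loop turn = one worker pulling a task
        results.append(mapper(task_queue.pop(0)))
    out = results.pop(0)
    while results:               # fold the partial results pairwise
        out = reducer(out, results.pop(0))
    return out

# Classic word count, the "hello world" of MapReduce.
counts = map_reduce(
    ["the cloud", "the queue", "the cloud again"],
    mapper=lambda text: Counter(text.split()),
    reducer=lambda a, b: a + b,
)
assert counts["the"] == 3 and counts["cloud"] == 2
```

On Azure, the two lists become Queue Storage queues and the loop body becomes worker roles pulling in parallel; the structure of the computation stays the same.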

I care very little about framework specifics, be it MapReduce, Hadoop, DryadLinq or something not-invented-yet. Lokad has no cloud legacy calling for a specific implementation.

What I do care about is much simpler. In order to deliver truckloads of forecasts, Lokad needs:

  1. large scale CPU
  2. burstable CPU
  3. low cost CPU

Windows Azure is already doing a great job addressing Point 1. Thanks to the massive Microsoft investments in Azure datacenters, thousands of VMs can already be instantiated if needed.

When asking for MapReduce, I was instead expressing my concern for Point 2 and Point 3. Indeed,

  • Amazon MapReduce offers 5x cheaper CPU compared to classical VM-based CPU.
  • VM-based CPU is not very burstable: it takes minutes to spawn a new VM, not seconds.

Then, low-cost CPU somewhat conflicts with burstable CPU, as illustrated by Amazon's Reserved Instances pricing.

As far as low-level cloud computing components are concerned, lowering costs usually means giving up expressiveness as a trade-off:

  • Relational DB at $10/GB too expensive? Go for NoSQL storage at $0.1/GB: much cheaper, but much weaker as far as querying capabilities are concerned.
  • Guaranteed VMs too expensive? Go for Spot VMs: the price is lower on average, but you no longer have any certainty about either the price or the availability of VMs.
  • Latency of cloud storage too high? Go for a CDN: latency is much better for reads, yet much worse for writes.

Seeking large-scale burstable CPU, here is the list of items that we would be very willing to surrender in order to lower the CPU pricing:

  • No need for local storage. VMs come with a 250GB hard drive, which we typically don't need.
  • No need for 2GB of memory. Obviously, we still need a bit of memory, but 512MB would be fine.
  • No need for any level of access to the OS.
  • The runtime could be made .NET only, and restricted to safe IL (which would facilitate code sandboxing).
  • No need for generic network I/O. Constrained access to specific Tables / Queues / Containers would be fine. This would facilitate colocation of storage and CPU.
  • No need for geolocalized resources. The cloud can push work wherever CPU is available. Yet, we would expect not to be charged for bandwidth between cloud data centers (if the transfer is caused by offsite computations).
  • No need for fixed pricing. Prioritization of requests based on variable pricing would be fine (considering that the CPU price would be lowered on average).

Obviously, options are plenty to drag the price down in exchange for a more constrained framework. Since Azure has the unique opportunity to deliver some very .NET-oriented features, I am especially interested in approaches that would leverage sandboxed code execution, giving up entirely on the OS itself to focus purely on the .NET Runtime.

I am very eager to see how Microsoft will be moving forward on this request. Stay tuned.


Paging indices vs Continuation tokens

Developers coming from the world of relational databases are well familiar with indexed paging. Paging is rather straightforward:

  • Each row gets assigned a unique integer, starting from 0 and incremented by 1 for each additional row.
  • The query is said to be paged because a constraint specifies that only the rows assigned an index greater than or equal to N and lower than N+PageSize are retrieved.
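As a sketch, the client-driven loop looks like this. This is illustrative Python: `fetch_page` is a hypothetical stand-in for a paged SQL query along the lines of `... WHERE id >= :n AND id < :n + :size`.

```python
def fetch_page(rows, n, page_size):
    """Server side: return the rows with index in [n, n + page_size)."""
    return rows[n:n + page_size]

rows = list(range(95))          # 95 rows, indexed 0..94
page_size = 10                  # chosen by the client
collected = []
n = 0
while True:
    page = fetch_page(rows, n, page_size)
    if not page:                # empty page: enumeration finished
        break
    collected.extend(page)
    n += page_size              # the client increments the index itself
assert collected == rows
```

Note that the chunk size and the index bookkeeping both live on the client side, which is exactly the property criticized below.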

I call such a pattern a chunked enumeration: instead of trying to retrieve all the data at once, the client app retrieves chunks of data, potentially splitting a very large enumeration into a very large number of much smaller chunks.

Indexed paging is a client-driven process. It is the client code (aka the code retrieving the enumerated content) that decides how big each chunk is supposed to be, and it is the client code that is responsible for incrementally updating indices from one request to the next. In particular, the client code might decide to make a fast-forward read on the data.

Although indexed paging is a well-established pattern, I have found that it's not such a good fit for cloud computing. Indeed, client-driven enumeration causes several issues:

  • Chunks may be highly heterogeneous in size.
  • Retrieval latency on the server-side might be erratic too.
  • If a chunk retrieval fails (chunk too big), client code has no option but to initiate a tedious trial-and-error process to gradually go for smaller chunks.
  • Chunking optimization is done on the client side, injecting replicated logic into every single client implementation.
  • Fast forward may be completely impractical to implement on the server side.

For those reasons, continuation tokens are usually favored in cloud computing situations. This pattern is simple too:

  1. Request a chunk, passing a continuation token if you have one (there is none for the first call).
  2. The server returns an arbitrarily sized chunk, plus possibly a continuation token.
  3. If no continuation token is returned, then the enumeration is finished.
  4. If a token is returned, then go back to 1, passing the token in the call.
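The steps above can be sketched as follows, in illustrative Python: `server_chunk` is a hypothetical stand-in for a server endpoint, and the varying chunk size mimics the server picking whatever size suits it, opaquely to the client.

```python
def server_chunk(rows, token):
    """Server side: pick the chunk size, return (chunk, next_token)."""
    start = token or 0
    size = 7 if start < 50 else 13          # server-side decision, opaque to client
    end = min(start + size, len(rows))
    return rows[start:end], (end if end < len(rows) else None)

rows = list(range(95))
collected, token = [], None
while True:
    chunk, token = server_chunk(rows, token)   # steps 1-2
    collected.extend(chunk)
    if token is None:                          # step 3: enumeration finished
        break                                  # step 4: otherwise loop with the token
assert collected == rows
```

The client loop is trivial and stays trivial no matter how the server tunes its chunking; only the token round-trips.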

Although this pattern looks similar to indexed paging, the constraints are very different. Continuation tokens are a server-driven process. It's up to the server to decide how much data should be sent with each request, which yields many benefits:

  • Chunk size and retrieval latency can be made much more homogeneous.
  • The server has (potentially) much more local information to wisely choose appropriate chunk sizes.
  • Clients no longer hold complex optimization logic for data retrieval.
  • Fast-forward is disabled, which leads to a simpler server-side implementation.

Then, there are even more subtle benefits:

  • Better resilience against denial of service. If the server suffers an overload, it can optionally delay the retrieval by returning nothing but the current continuation token (a proper server-driven way of saying busy, try again later to the client).
  • Better client resilience to evolution. Indeed, the logic that optimizes the chunking process might evolve on the server side over time, but the client code is not impacted and implicitly benefits from those improvements.

Bottom line: unless you specifically want to offer support for fast-forward, you are nearly always better off relying on continuation tokens in your distributed computing patterns.


Ambient Cloud is bunk and irrelevant

Every couple of weeks, I hear about computing resource scavenging projects of some sort (think Folding@Home). Indeed, there is an astonishing amount of latent processing power out there, just think of those billions of idle personal computers.

Latest news: the scavenging thingy has been renamed the Ambient Cloud. Unfortunately, the Ambient Cloud is both bunk and irrelevant, as far as raw processing power is concerned.

Energy accounts for more than 50% of the total amortized cost of a modern data center. Doing computation anywhere else but in a data center increases the energy cost by 2x to 10x depending on your configuration. Modern data centers are industrial units that optimize energy consumption in ways that are simply not available to home and office environments.

If we include the energy costs related to data transport, the figure is probably even worse.

Hence, with the Ambient Cloud, although you might be saving pennies on hardware assets, you end up wasting truckloads of dollars on the energy bill. Energy is rather cheap, and people tend to neglect that running any electrical device, however small, costs very real $$$ at the end of the month.

Hence, there is an astonishing amount of latent processing power around us, but it comes with:

  • Massive energy overheads, which equate to massive financial overheads.
  • High latency, low bandwidth and unreliable connectivity, which is unsuitable for 99.99% of situations.

Which brings me back to my initial point: the Ambient Cloud is bunk and irrelevant as far as intensive computing is concerned.