MapReduce as burstable low-cost CPU
About two months ago, when Mike Wickstrand set up a UserVoice instance for Windows Azure, I immediately posted my own suggestion concerning MapReduce. MapReduce is a distributed computing concept first published by Google in late 2004.
Against all odds, my suggestion, driven by the needs of Lokad, made it into the Top 10 most requested features for Windows Azure (well, 9th rank, with roughly 20 times fewer votes than the No. 1 request for scaled-down hosting).
Lately, I had the opportunity to discuss the matter further with folks at Microsoft who are gathering market feedback on this item. In the software business, users frequently ask for features that they don't actually want in the end. The difficulty is that proposed features may or may not correctly address the initial problems.
While preparing for the interview, I realized that, to some extent, I had fallen into the same trap when asking for MapReduce. Actually, we have already reimplemented our own MapReduce equivalent, which is not that hard thanks to the Queue Storage.
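Since the post says Lokad reimplemented its own MapReduce equivalent on top of the Queue Storage, here is a minimal, single-process Python sketch of that queue-based pattern, not Lokad's actual code. All names (`task_queue`, `blob_store`, `map_word_count`) are hypothetical, and in-memory structures stand in for the Azure Queue and Blob services:

```python
import queue
from collections import defaultdict

# In-memory stand-ins for the cloud services (a deliberate simplification;
# the real Azure Queue and Blob services are accessed over HTTP).
task_queue = queue.Queue()   # plays the role of the Queue Storage
blob_store = {}              # plays the role of the Blob Storage

def map_word_count(text):
    """A classic word-count mapper: emits (word, 1) pairs."""
    return [(word, 1) for word in text.split()]

def run_mappers():
    """Workers poll the queue, map each task, persist results as 'blobs'."""
    i = 0
    while not task_queue.empty():
        text = task_queue.get()
        blob_store[f"mapped/{i}"] = map_word_count(text)
        i += 1

def run_reducer():
    """Aggregate all intermediate blobs into the final word counts."""
    counts = defaultdict(int)
    for key, pairs in blob_store.items():
        if key.startswith("mapped/"):
            for word, n in pairs:
                counts[word] += n
    return dict(counts)

# Feed the queue with input shards, as cloud clients would.
for shard in ["the quick brown fox", "the lazy dog", "the fox"]:
    task_queue.put(shard)

run_mappers()
final_counts = run_reducer()
```

In a real deployment, each mapper would be a worker role polling the Queue Storage and each intermediate result would land in Blob Storage; that the queue already handles task distribution and retries is precisely what makes the implementation short.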
I care very little about framework specifics, be it MapReduce, Hadoop, DryadLinq or something not-invented-yet. Lokad has no cloud legacy calling for a specific implementation.
What I do care about is much simpler. In order to deliver truckloads of forecasts, Lokad needs:
- large scale CPU
- burstable CPU
- low cost CPU
Windows Azure is already doing a great job addressing Point 1. Thanks to Microsoft's massive investments in Azure datacenters, thousands of VMs can already be instantiated if needed.
When asking for MapReduce, I was instead expressing my concern for Point 2 and Point 3. Indeed,
- Amazon MapReduce offers CPU about 5x cheaper than classical VM-based CPU.
- VM-based CPU is not very burstable: it takes minutes to spawn a new VM, not seconds.
Then, low-cost CPU somewhat conflicts with burstable CPU, as illustrated by Amazon's Reserved Instances pricing.
As far as low-level cloud computing components are concerned, lowering costs usually means giving up expressiveness as a trade-off:
- Relational DB at $10/GB too expensive? Go for NoSQL storage at $0.1/GB: much cheaper, but much weaker as far as querying capabilities are concerned.
- Guaranteed VMs too expensive? Go for Spot VMs: the price is lower on average, but you no longer have any certainty about either the price or the availability of VMs.
- Latency of cloud storage too high? Go for a CDN: latency is much better for reads, yet much worse for writes.
Seeking large scale burstable CPU, here is the list of items that we would be very willing to surrender in order to lower the CPU pricing:
- No need for local storage. VMs come with a 250GB hard drive, which we typically don't need.
- No need for 2GB of memory. Obviously, we still need a bit of memory, but 512MB would be fine.
- No need for any level of access to the OS.
- Runtime could be made .NET only, and restricted to safe IL (which would facilitate code sandboxing).
- No need for generic network I/O. Constrained access to specific Tables / Queues / Containers would be fine. This would facilitate colocation of storage and CPU.
- No need for geolocalized resources. The cloud can push computations wherever CPU is available. Yet, we would expect not to be charged for bandwidth between cloud data centers (if the transfer is caused by offsite computations).
- No need for fixed pricing. Prioritization of requests based on variable pricing would be fine (considering that the CPU price could be lowered on average).
Obviously, there are plenty of options to drag the price down in exchange for a more constrained framework. Since Azure has the unique opportunity to deliver some very .NET oriented features, I am especially interested in approaches that would leverage sandboxed code execution, giving up entirely on the OS itself to focus purely on the .NET Runtime.
I am very eager to see how Microsoft will be moving forward on this request. Stay tuned.
Reader Comments (2)
I was wondering if you got a chance to check the Cloud MapReduce implementation before reimplementing your own MapReduce.
March 2, 2010 | Alex Popescu
Hi Alex, yes I had a look at a couple of papers before reimplementing our own version. Yet, the trick is that once you have the Queue Storage and the Blob Storage in hand, MapReduce suddenly becomes a lot simpler to implement. In fact, most of the actual MapReduce complexity just gets abstracted away by the Azure Storage itself. The code length is roughly equivalent to the one outlined in the Cloud MapReduce paper.
March 3, 2010 | Joannes Vermorel