Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.

Monday
Sep142009

Thinking the Table Storage of Windows Azure

Disclaimer: I am not exactly a Table Storage expert. In this post, I am just trying to sort out my own thoughts about this service offered with Windows Azure. Check my follow-up post.

Soon after the release announcement of our new O/C mapper (object to cloud) named Lokad.Cloud, folks on the Azure Forums raised the question of the Table Storage.

Although it might be surprising, Lokad.Cloud does not provide - yet - any support for Table Storage.

At this point, I feel very uncertain about Table Storage, not in the sense that I do not trust Microsoft to end up with a finely tuned product, but rather at the patterns and practices level.

Basically, the Table Storage is an entity storage that features three special system properties:

• PartitionKey: a grouping criterion - data having the same PartitionKey being kept close.

• RowKey: the unique identifier for the entity.

• Timestamp: the equivalent of Blob Storage ETag.

So far, I get the feeling that many developers are attracted to the Table Storage for the wrong reasons. In particular, Table Storage is not a substitute for your plain old SQL tables:

• No support for transactions.

• No support for keys (let alone foreign keys).

• No possible refactoring (properties are frozen at setup).

If you are looking for those features, you're most likely betting on the wrong horse. You should be considering SQL Azure instead.

Then, some might argue that SQL Azure won't scale above 10GB (at least considering the current pricing plans offered by Microsoft). Well, the catch is that Table Storage won't scale either, at least not unless you're very cautious with your queries.

AFAIK, the only indexed column of the Table Storage is the RowKey. Thus, any filtering criterion based on custom entity properties is likely to get abysmal performance as soon as your Table Storage gets large.

Well, sort of: the most probable scenario is likely to be worse, as your queries are just going to time out after exceeding 60s.
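To make the point concrete, here is a minimal sketch of the difference between a lookup by key and a filter on a custom property. It is a toy in-memory model, not the Table Storage API: `Entity`, `ToyTable` and the `country` property are hypothetical names made up for the illustration, and the "cost" is simply the number of entities touched.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Entity:
    partition_key: str
    row_key: str
    properties: Dict[str, Any] = field(default_factory=dict)


class ToyTable:
    """In-memory stand-in for a table; NOT the Azure Table Storage API."""

    def __init__(self) -> None:
        # The only "index": entities addressed by their keys.
        self._by_key: Dict[Tuple[str, str], Entity] = {}

    def insert(self, entity: Entity) -> None:
        self._by_key[(entity.partition_key, entity.row_key)] = entity

    def get(self, partition_key: str, row_key: str) -> Tuple[Optional[Entity], int]:
        """Keyed lookup: touches a single entry whatever the table size."""
        return self._by_key.get((partition_key, row_key)), 1

    def filter_by_property(self, name: str, value: Any) -> Tuple[List[Entity], int]:
        """Filter on a custom property: every entity has to be inspected."""
        matches, touched = [], 0
        for entity in self._by_key.values():
            touched += 1
            if entity.properties.get(name) == value:
                matches.append(entity)
        return matches, touched


table = ToyTable()
for i in range(1000):
    country = "FR" if i % 10 == 0 else "US"
    table.insert(Entity("customers", f"id-{i}", {"country": country}))

hit, point_cost = table.get("customers", "id-500")
matches, scan_cost = table.filter_by_property("country", "FR")
print(point_cost, scan_cost, len(matches))
```

In this model the keyed lookup touches one entity while the property filter touches all of them, so the cost of the second query grows with the size of the table.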

Again, my goal here is not to bash the Table Storage, but it must be understood that the Table Storage is clearly not a magically scalable equivalent of the plain old SQL tables.

Back to Lokad.Cloud: we did not consider adding Table Storage because we did not feel the need for it, although our forecasting back-end probably sits very high in the current complexity spectrum of cloud apps.

Indeed, the Blob Storage is surprisingly powerful, with very predictable performance too:

• Storing complex objects is a non-issue with a serializer at hand.

• A blob name prefix is a very efficient substitute to the PartitionKey.
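The two points above can be sketched as follows. This is a toy in-memory container, not the Blob Storage API and not the Lokad.Cloud code: `ToyBlobStore` and the `customers/<id>/orders/<n>` naming scheme are assumptions chosen for the example, and JSON stands in for whatever serializer you have at hand.

```python
import bisect
import json
from typing import Any, Dict, List


class ToyBlobStore:
    """In-memory stand-in for a blob container; NOT the Azure Blob Storage API."""

    def __init__(self) -> None:
        self._names: List[str] = []        # kept sorted, so a prefix is a contiguous range
        self._blobs: Dict[str, bytes] = {}

    def put(self, name: str, obj: Any) -> None:
        """Serialize the object and store it under the given blob name."""
        if name not in self._blobs:
            bisect.insort(self._names, name)
        self._blobs[name] = json.dumps(obj).encode("utf-8")

    def get(self, name: str) -> Any:
        return json.loads(self._blobs[name].decode("utf-8"))

    def list_names(self, prefix: str) -> List[str]:
        """Return the blob names starting with prefix, without scanning the rest."""
        i = bisect.bisect_left(self._names, prefix)
        found = []
        while i < len(self._names) and self._names[i].startswith(prefix):
            found.append(self._names[i])
            i += 1
        return found


store = ToyBlobStore()
store.put("customers/42/orders/1", {"total": 10})
store.put("customers/42/orders/2", {"total": 25})
store.put("customers/43/orders/1", {"total": 7})

# The prefix plays the role of the PartitionKey: all orders of customer 42.
orders_of_42 = store.list_names("customers/42/")
print(orders_of_42)
```

The prefix groups related blobs the way a PartitionKey groups entities, and the full blob name acts as the unique identifier.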

Basically, it seems to me that, for now, any Table Storage operation can be executed with the same performance on the Blob Storage. Later on, when the Table Storage starts supporting secondary indexes, this situation is likely to evolve, but in the meantime I still cannot think of a single situation that would definitively favor Table Storage over Blob Storage.

Monday
Sep142009

O/C mapper - object to cloud

When we started to port our forecasting technology toward the cloud, we decided to create a new open source project called Lokad.Cloud that would isolate all the pieces of our cloud infrastructure that weren't specific to Lokad.

The project was initially subtitled Lokad.Cloud - .NET execution framework for Windows Azure, as the primary goal of this project was to provide some cloud equivalent of the plain old Windows Services. We quickly ended up with QueueServices, which happen to be quite handy for designing horizontally scalable apps.

But more recently, the project has taken a new orientation, becoming more and more an O/C mapper (object to cloud) inspired by the terminology used by O/R mappers. When it comes to horizontal scaling, a key idea is that data and data processing cannot be considered in isolation anymore.

With classic client-server apps, persistence logic is not supposed to invade your core business logic. Yet, when your business logic happens to become so intensive that it must be distributed, you end up in a very cloudy situation where data and data processing become closely coupled in order to achieve horizontal scalability.

That being said, close coupling between data and data processing isn't doomed to be an ugly mess. We have found that obsessively object-oriented patterns applied to Blob Storage can make the code both elegant and readable.

Lokad.Cloud is entering its beta stage with the release of the 0.2.x series, check it out.

Tuesday
Jul282009

Thoughts about the Windows Azure pricing

Microsoft has recently unveiled its pricing for Windows Azure. In short, Microsoft has aligned exactly with the pricing offered by Amazon. CPU costs $0.12 / h, meaning that a single instance running around the clock for a 30-day month costs $86.4, which is fairly expensive compared to classical hosting providers, where you can get more for basically half the price.

But well, this situation was expected, as Microsoft probably does not want to start a price war with its business partners still selling dedicated Windows Server hosting. Current Azure pricing is sufficiently high to deter most companies except the ones who happen to have peaky needs.

To me, the Azure pricing is fine except in 3 areas:

• Each Azure WebRole costs at least $86.4 / month no matter how little web traffic you have (reminder: with Azure you need a distinct WebRole for every distinct webapp). This situation is caused by the architecture of Windows Azure, where a VM gets dedicated to every WebRole. Compared to Google App Engine (GAE), the situation does not look too good for Azure: with GAE, hosting a low-traffic webapp is virtually free. Free vs. $1000 / year is likely to make a difference for most small / medium businesses, especially if you end up with a dozen webapps to cover all your needs.

• Cloud Storage operations are expensive: the storage itself is rather cheap at $0.15 / GB / month, but the cost of $0.01 per 10K operations might be a killer for cloud apps relying intensively on small storage operations. Yes, one can argue that this price is no cheaper with AWS, but this is not entirely true, as AWS provides other services such as block storage that come with a 10x lower price per operation (EBS could be used to lower the pressure on blob storage whenever possible).

• Raw CPU at $0.12 / h is expensive and Azure offers no solution to lower this price, whereas AWS offers CPU at $0.015 / h through their MapReduce service.
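The arithmetic behind those three points can be checked with a few lines. The prices are the ones quoted in this post; the 30-day month and the workload of 100 storage operations per second are assumptions picked purely for illustration.

```python
from decimal import Decimal

# Prices quoted in the post (USD).
CPU_PER_HOUR = Decimal("0.12")
STORAGE_OPS_PER_10K = Decimal("0.01")
AWS_MAPREDUCE_CPU_PER_HOUR = Decimal("0.015")


def monthly_instance_cost(hours_per_day: int = 24, days: int = 30) -> Decimal:
    """Cost of one instance kept running for a whole month (30 days assumed)."""
    return CPU_PER_HOUR * hours_per_day * days


def storage_operations_cost(operations: int) -> Decimal:
    """Cost of a given number of storage operations."""
    return Decimal(operations) / 10_000 * STORAGE_OPS_PER_10K


per_month = monthly_instance_cost()
per_year = per_month * 12

# Hypothetical workload: 100 small storage operations per second, all month long.
ASSUMED_OPS_PER_SECOND = 100
ops_per_month = ASSUMED_OPS_PER_SECOND * 60 * 60 * 24 * 30
ops_bill = storage_operations_cost(ops_per_month)

cpu_ratio = CPU_PER_HOUR / AWS_MAPREDUCE_CPU_PER_HOUR

print(per_month, per_year, ops_bill, cpu_ratio)
```

Under those assumptions, the storage operations alone cost about three times as much per month as the instance issuing them, and the raw CPU price is 8 times the MapReduce price quoted above.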

Obviously, those pricing weaknesses closely reflect missing cloud technologies for Azure (at the moment). The MapReduce issue will be fixed when Microsoft ports DryadLinq to Azure. Block storage and shared low-cost web hosting might also be on their way (although I have little info on that matter). As a side note, the Azure Cache Provider might be a killer tool to reduce the pressure on the cloud storage (but pricing is unknown yet).

As a final note, it's interesting to see that the cloud computing pricing is really dependent on the quality of the software used to run the cloud. Better software typically leads to computing hardware being delivered at much lower costs, almost 10x lower costs in many situations.

Monday
Apr062009

Cloud Computing vs. Hardware as a Service

In a previous post, I have discussed why I believed that cloud computing was going to be a big player arena, and not a friendly place for the little guys.

Recently, many people have told me about such and such small company that was supposed to deliver cloud computing too, and whose service would match the ones offered by the big players.

Basically, the discussion goes like this:

Hey, we too are able to instantiate virtual machines on-demand. We have some nice virtual machine deployment scripts, a nice WebUI to administrate all the nodes, we are now matching the Amazon offer.

Nope, you’re not.

Basically, what those little players are doing is simply Hardware as a Service. For years now, computing hardware has been more or less an on-demand commodity. My favorite host can typically set up a new server in 48h, and I can cancel my subscription anytime (although I will have to pay for the entire month). Some more aggressive host providers are providing fully automated server setup, and your new server is usually available in less than 1h.

Now, what those small companies calling themselves cloud providers are able to do is set up new servers in seconds instead of minutes; and the trick to do that is simple: they use virtualization and deployment scripts.

But, in my opinion, this isn’t cloud computing; this is just hardware as a service with a lower overhead, both at the infrastructure level and for the system administrators themselves.

So, what is so radically different with cloud computing?

In my opinion, the radical novelty of cloud computing is the promise that you won’t have to worry about resource allocation anymore.

In particular, I don’t want to figure out if I need 1, 2, 3 or 42 computing nodes to handle a massive web traffic surge from Slashdot, I just want to tell my cloud provider:

Here is the script for my web page, do whatever is needed to ensure good performance, and send me the bill at the end of the month.

Note that this is exactly what Google App Engine is doing. Google App Engine is relieving web developers from the burden of having to figure out how they are going to scale their web apps. Google is doing the magic for them so that web developers can focus on the specific value of their web apps instead of focusing on the complex infrastructure actually needed to achieve scalability.

Quoting Thomas Serval from Round Table about Azure a few months ago:

In the past, each time we have multiplied the traffic by 10 on our applications, we have been forced more or less to rewrite the application from scratch. The promise of cloud computing is to let you achieve unlimited scalability from day one.

Obviously, cloud computing isn’t magic, thus applications will need to be carefully designed to achieve unlimited scalability, yet I believe that thanks to the cloud computing frameworks currently being published, it won’t be that hard in the future.

Thus cloud computing is not Hardware as a Service simply because Hardware as a Service does not do anything about scalability in itself.

The true benefit of cloud computing is to provide what I would call scalable computing abstractions. Those abstractions represent physical resources such as CPU, memory or bandwidth, but with additional constraints (usually structural constraints) so that it actually becomes possible to provide an infinitely scalable instance of the desired resource.

For example, nowadays more or less all cloud providers are including in their offer a distributed and reliable hashtable implementation: S3 for Amazon, Blob Storage for Windows Azure, ... FIPFO is another popular scalable storage abstraction: First-In Probably First Out, i.e. queues but without deterministic behavior. As long as you rely only on those scalable storage abstractions, you should not care about scalability of your storage.

So far, scalable storage abstractions have been the primary focus of most cloud providers. Yet, I suspect that the next battle will be scalable CPU abstractions.

Indeed, Amazon has recently unveiled their new Amazon Elastic MapReduce, and like other people, I believe that MapReduce will be a game changer. First, Amazon is delivering CPU at 0.015 USD/h while its competitors are still above 0.10 USD/h at this time. Then, if we consider that the Amazon native MapReduce implementation is going to be way more efficient than custom in-house implementations - simply because Amazon folks have the time and the experience needed to get the load balancing settings right - then Amazon has just divided the CPU price by 10.

Then, what I see as the killing benefit of MapReduce is that I don’t have to care anymore about how many nodes I need.

For example, at Lokad, we have tons of time-series to process. Let's say that we want to extract seasonality patterns out of 100 million time-series (each time-series ranging from a few hundred to a few thousand points). With MapReduce, I just have to specify the algorithm to process a single time-series and pass the huge time-series collection as argument. The cloud infrastructure will be handling all the magic for me. In particular, I don’t have to care anymore about nodes crashing along the way, or about dynamically expanding / shrinking the number of computing nodes.
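As a rough sketch of that programming model, the code below defines a function for a single time-series and applies it to a whole collection. Everything in it is an assumption made for the illustration: `seasonal_profile` is a deliberately naive ratio-to-mean computation, not the algorithm used at Lokad, and the local loop in `process_collection` merely stands in for the distribution work that a MapReduce service would take care of.

```python
from statistics import mean
from typing import Dict, List


def seasonal_profile(series: List[float], period: int = 12) -> List[float]:
    """Naive seasonality: average of each phase of the cycle, relative to the overall mean.

    Hypothetical per-series function, written only to illustrate the map step.
    """
    if len(series) < period:
        raise ValueError("series must cover at least one full period")
    overall = mean(series)
    if overall == 0:
        return [0.0] * period
    # series[phase::period] picks every point falling on the same phase of the cycle.
    return [mean(series[phase::period]) / overall for phase in range(period)]


def process_collection(collection: Dict[str, List[float]], period: int = 12) -> Dict[str, List[float]]:
    """Local stand-in for the map step: the same function applied to every series."""
    return {name: seasonal_profile(points, period) for name, points in collection.items()}


collection = {
    "product-a": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "product-b": [4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
}
profiles = process_collection(collection, period=3)
print(profiles)
```

The per-series function knows nothing about nodes or collection size; that is the part the cloud infrastructure is supposed to handle.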

MapReduce is a very constrained framework that forces you to apply the very same function everywhere, but the input collection can be arbitrarily large. In my experience, if you’re not able to scale a data-mining problem through MapReduce, then nothing will – or, more precisely, the design complexity will be so great that you are most likely to give up anyway.

Those scalable resource abstractions represent the core value offered by cloud providers. Yet, those scalable resource abstractions are truly hard to design and even harder to optimize. Yes, you might know a small company that auto-deploys virtual machines, but, in my opinion, this does not reach even 10% of the potential benefits brought by the cloud.

Those benefits will be achieved through scalable resource abstractions; and each one of those abstractions is going to cost a massive amount of brain power to get right.

Wednesday
Jan072009