I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades.


Entries in storage (9)


Table Storage or the 100x cost factor

Until very recently, I was a bit puzzled by the Table Storage. I couldn't manage to get a clear understanding of how the Table Storage could be a killer option compared to the Blob Storage.

I get it now: Table Storage can cut your storage costs by 100x.

As outlined by other folks already, I/O costs typically represent more than 10x the storage costs if your objects weigh less than 6KB (the computation has been done for the Amazon S3 pricing, but the Windows Azure pricing happens to be nearly identical).
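To put rough numbers on this, here is a back-of-the-envelope sketch; the prices used below are the Windows Azure list prices as announced in 2009 ($0.15 per GB-month of storage, $0.01 per 10,000 transactions) and should be treated as assumptions, not as a quote:

class StorageVsIoCost
{
    static void Main()
    {
        const double storagePricePerGbMonth = 0.15;        // assumed $/GB/month
        const double pricePerTransaction = 0.01 / 10000;   // assumed $/transaction

        const double objectSizeGb = 6.0 / (1024 * 1024);   // a 6KB object
        double storagePerMonth = objectSizeGb * storagePricePerGbMonth;

        // A single GET or PUT already costs about as much as storing
        // the object for a full month...
        System.Console.WriteLine(pricePerTransaction / storagePerMonth);       // ~1.2

        // ...so a dozen accesses per month already puts the I/O bill
        // at more than 10x the storage bill.
        System.Console.WriteLine(12 * pricePerTransaction / storagePerMonth);  // ~14
    }
}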

Thus, if you have loads of fine-grained objects to store in your cloud, say less-than-140-character tweets for example, you're likely to end up with an insane I/O bill if you store those fine-grained items in the Blob Storage.

But don't lower your hopes: that's precisely the sort of situation the Table Storage has been designed for, as this service lets you insert/update/delete entities in batches of 100 through Entity Group Transactions.
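As a rough sketch of what such a batch looks like with the StorageClient library shipped in the Azure SDK (TweetEntity and the "tweets" table are made-up names for illustration); note that all the entities of a group transaction must share the same PartitionKey:

using System.Data.Services.Client;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// Hypothetical entity: a tweet keyed by author (PartitionKey) and tweet id (RowKey).
public class TweetEntity : TableServiceEntity
{
    public string Text { get; set; }
}

public static class BatchInsertSample
{
    // Inserts up to 100 tweets, all belonging to the same partition,
    // as a single Entity Group Transaction (hence a single billed transaction).
    public static void InsertBatch(CloudStorageAccount account, TweetEntity[] tweets)
    {
        var tableClient = account.CreateCloudTableClient();
        tableClient.CreateTableIfNotExist("tweets");

        var context = tableClient.GetDataServiceContext();
        foreach (var tweet in tweets)
        {
            context.AddObject("tweets", tweet);
        }

        // SaveChangesOptions.Batch turns the whole call into one group transaction.
        context.SaveChangesWithRetries(SaveChangesOptions.Batch);
    }
}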

This fine-grained item orientation is reflected in the limitations that apply to entities:

  • A single entity should not weigh more than 1MB.

  • A single group transaction should not weigh more than 4MB.

  • A single entity property should not weigh more than 64KB.

Situations where your cloud apps end up processing loads of small items - the threshold being around 60KB - are likely to be good candidates for the Table Storage.

We will definitely try to reflect this in our favorite O/C mapper.


Thinking the Table Storage of Windows Azure

Disclaimer: I am not exactly a Table Storage expert. In this post, I am just trying to sort out my own thoughts about this service offered with Windows Azure. Check my follow-up post.

Soon after the release announcement of our new O/C mapper (object to cloud) named Lokad.Cloud, folks on the Azure Forums raised the question of the Table Storage.

Although it might be surprising, Lokad.Cloud does not provide - yet - any support for Table Storage.

At this point, I feel very uncertain about the Table Storage, not in the sense that I do not trust Microsoft to end up with a finely tuned product, but rather at the patterns and practices level.

Basically, the Table Storage is an entity storage that features three special system properties:

  • PartitionKey: a grouping criterion - data sharing the same PartitionKey is kept close.

  • RowKey: the unique identifier of the entity within its partition.

  • Timestamp: the equivalent of the Blob Storage ETag.
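In code, with the StorageClient library of the Azure SDK, an entity is typically a class deriving from TableServiceEntity, which already carries those three system properties; FooEntity and its Payload property are made-up names:

using Microsoft.WindowsAzure.StorageClient;

// Hypothetical entity: only the custom properties need to be declared,
// PartitionKey, RowKey and Timestamp come from the base class.
public class FooEntity : TableServiceEntity
{
    public FooEntity() { } // parameterless constructor required by the client library

    public FooEntity(string partitionKey, string rowKey)
        : base(partitionKey, rowKey) { }

    public string Payload { get; set; }
}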

So far, I have the feeling that many developers feel attracted to the Table Storage for the wrong reasons. In particular, the Table Storage is not a substitute for your plain old SQL tables:

  • No support for transactions.

  • No support for keys (let alone foreign keys).

  • No possible refactoring (properties are frozen at setup).

If you are looking for those features, you're most likely betting on the wrong horse. You should be considering SQL Azure instead.

Then, some might argue that SQL Azure won't scale above 10GB (at least considering the current pricing plans offered by Microsoft). Well, the trick is that the Table Storage won't scale either, at least not unless you're very cautious with your queries.

AFAIK, the only indexed properties of the Table Storage are the PartitionKey and the RowKey. Thus, any filtering criterion based on custom entity properties is likely to get abysmal performance as soon as your table gets large.

Well, sort of: the most probable scenario is likely to be worse, as your queries are just going to time out after exceeding 60s.
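To illustrate with the hypothetical FooEntity sketched above, the contrast between a key-based lookup and a filter on a custom property looks roughly like this (table name and key values are made up):

using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class QuerySample
{
    public static void Run(CloudStorageAccount account)
    {
        var context = account.CreateCloudTableClient().GetDataServiceContext();

        // Fast: pinpoints one partition and one row, served by the native index.
        var lookup = context.CreateQuery<FooEntity>("foos")
            .Where(e => e.PartitionKey == "p42" && e.RowKey == "r1337")
            .ToList();

        // Slow: filtering on a custom property forces a scan of the whole table,
        // and a long-running scan simply gets cut off by the 60s timeout.
        var scan = context.CreateQuery<FooEntity>("foos")
            .Where(e => e.Payload == "needle")
            .ToList();
    }
}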

Again, my goal here is not to bash the Table Storage, but it must be understood that the Table Storage is clearly not a magically scalable equivalent of the plain old SQL tables.

Back to Lokad.Cloud: we did not consider adding Table Storage support because we did not feel the need either, although our forecasting back-end probably sits very high in the current complexity spectrum of cloud apps.

Indeed, the Blob Storage is surprisingly powerful, with very predictable performance too:

  • Storing complex objects is a non-issue with a serializer at hand.

  • A blob name prefix is a very efficient substitute for the PartitionKey.

Basically, it seems to me that, for now, any Table Storage operation can be executed with the same performance against the Blob Storage. Later on, when the Table Storage starts supporting secondary indexes, this situation is likely to evolve, but in the meantime I still cannot think of a single situation that would definitively favor the Table Storage over the Blob Storage.
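As a minimal sketch of the idea (not the actual Lokad.Cloud implementation), using the StorageClient library, a serializer, and blob names of the form "partition/rowKey"; the container layout and the BinaryFormatter choice are only illustrative:

using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using Microsoft.WindowsAzure.StorageClient;

public static class BlobAsTableSample
{
    // Stores a serializable object under "partition/rowKey", mimicking the
    // PartitionKey/RowKey addressing of the Table Storage on top of blobs.
    public static void Put(CloudBlobContainer container,
        string partition, string rowKey, object item)
    {
        var blob = container.GetBlobReference(partition + "/" + rowKey);
        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, item);
            stream.Position = 0;
            blob.UploadFromStream(stream);
        }
    }

    // Enumerates all the items of a "partition": the blob name prefix
    // plays the role of the PartitionKey.
    public static IEnumerable<object> GetPartition(CloudBlobContainer container, string partition)
    {
        var formatter = new BinaryFormatter();
        foreach (CloudBlob blob in container.GetDirectoryReference(partition).ListBlobs())
        {
            using (var stream = new MemoryStream())
            {
                blob.DownloadToStream(stream);
                stream.Position = 0;
                yield return formatter.Deserialize(stream);
            }
        }
    }
}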


Delete-proof data paging

In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like this (the syntax is approximate and vaguely inspired by MS SQL Server 2005):

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY Id) AS RowNumber FROM Foo) AS T
WHERE T.RowNumber BETWEEN @Index AND @Index + @PageSize

The queries are made iteratively until no more rows get returned. Yet, this approach fails if rows are either added to or deleted from the table Foo during the iteration.

If rows are inserted, then the row numbers will be impacted, as some rows will see their number incremented. As a consequence:

  • The newly inserted rows may be missed.

  • Certain rows are going to be retrieved twice.

Overall, the situation isn't too bad because, after all, every row that was present when the retrieval iteration started will still be retrieved.

On the other hand, if rows are deleted, then some rows will get their row numbers decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.

One workaround for this situation would be to add an IsDeleted flag to the table (instead of actually deleting rows, the flag would be flipped from false to true), and to purge the database only once in a while. Yet, this solution has many drawbacks: the table schema must be changed, and all existing queries that target this table must add the extra condition IsDeleted = false.

A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that corresponds to a generated row number, let's directly use @IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes:

WHERE Foo.Id > @IndexId

With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven't made any change to the table schema.
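A minimal ADO.NET sketch of the resulting retrieval loop (assuming a table Foo with a positive integer primary key named Id; all names are illustrative):

using System.Collections.Generic;
using System.Data.SqlClient;

public static class DeleteProofPaging
{
    // Retrieves all ids of Foo, one page at a time, without ever skipping a row
    // even if other rows get deleted while the iteration is running.
    public static IEnumerable<int> RetrieveAllIds(string connectionString, int pageSize)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            var lastId = 0; // greatest identifier retrieved so far (assumes positive ids)

            while (true)
            {
                var page = new List<int>();
                using (var command = new SqlCommand(
                    "SELECT TOP (@PageSize) Id FROM Foo WHERE Id > @IndexId ORDER BY Id",
                    connection))
                {
                    command.Parameters.AddWithValue("@PageSize", pageSize);
                    command.Parameters.AddWithValue("@IndexId", lastId);

                    using (var reader = command.ExecuteReader())
                    {
                        while (reader.Read()) page.Add(reader.GetInt32(0));
                    }
                }

                if (page.Count == 0) yield break; // no rows left to retrieve

                foreach (var id in page)
                {
                    lastId = id; // rows come back ordered, the last one is the greatest
                    yield return id;
                }
            }
        }
    }
}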


Weird consequences of full transaction logs

Let's say that you have an ASP.NET 2.0 web application running on top of MS SQL Server 2005. Guess what happens if your database transaction log gets full? Well, you will get a large number of weird side effects, most of them seemingly totally unrelated to the saturation of the transaction log.

Among the problems that I have encountered:

  • The web services of your website start sending totally misleading error messages, like "authentication failed".

  • You cannot log in through the web form into your ASP.NET application any more. You will not get any error message; the login control just tells you that your password is wrong.

  • You look at your error logs (like ELMAH), but nothing gets recorded.

  • You decide to go through the "recover password" process (because you're still not suspecting the transaction logs), but it actually fails and no email is sent.

For the record, the following SQL query clears your transaction logs.
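On SQL Server 2005, the usual commands go along these lines (MyDb and MyDb_Log are placeholder names for the database and its log file):

BACKUP LOG MyDb WITH TRUNCATE_ONLY;
DBCC SHRINKFILE (MyDb_Log, 1);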
