Entries in algorithm (8)

Friday
May 8, 2015

Nearly all web APIs get paging wrong

Data paging, that is, the retrieval of a large amount of data through a series of smaller data retrievals, is a non-trivial problem. Through Lokad, we have implemented about a dozen extensive API integrations, and reviewed a few dozen other APIs as well.

The conclusion is that as soon as paging is involved, nearly all web APIs get it wrong. Obviously, rock-solid APIs like the ones offered by Azure or AWS are getting it right, but those outstanding APIs are exceptions rather than the norm.

The obvious pattern that doesn't work

I have lost count of the APIs that propose the following broken pattern to page through their data, a purchase order history for example:

https://example.com/api/purchaseorders?page=2&pagesize=25

Where page is the page number and pagesize the number of orders to be retrieved. This pattern is fundamentally unsafe. Any order deleted while the enumeration is in progress will shift the indices, which, in turn, is likely to cause another order to be skipped.

There are many variants of this pattern, and every time the problem boils down to the same thing: the "obvious" paging pattern leads to a flawed implementation that fails whenever concurrent writes are involved.
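
To make the failure concrete, here is a minimal C# sketch, with an in-memory stand-in for the API (not any particular vendor's implementation), where a single concurrent delete causes the client to skip an order:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    var orders = new List<string> { "PO-1", "PO-2", "PO-3", "PO-4", "PO-5", "PO-6" };
    const int pageSize = 2;

    // Simulates GET /api/purchaseorders?page={page}&pagesize={pageSize}
    List<string> GetPage(int page) =>
        orders.Skip((page - 1) * pageSize).Take(pageSize).ToList();

    var page1 = GetPage(1);   // returns PO-1, PO-2
    orders.Remove("PO-2");    // a concurrent delete happens between the two client requests
    var page2 = GetPage(2);   // returns PO-4, PO-5 -- PO-3 shifted into page 1 and is silently skipped

    Console.WriteLine(string.Join(", ", page1.Concat(page2)));  // PO-1, PO-2, PO-4, PO-5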

The "updated_after" filter doesn't work either

Another popular approach for paging is to leverage a filter on the update timestamp of the elements to be retrieved, that is:

https://example.com/api/purchaseorders?updated_after=2015-04-29

Then, in order to page through the data, the client is supposed to take the most recent updated_at value from the response and feed it back to the API to keep enumerating the elements.

However, this approach does not (really) work either. Indeed, what if all elements have been updated at once? This can happen because of a system upgrade or because of any kind of bulk operation. Even if the timestamp can be narrowed down to the microsecond, if there are 10,000 elements to be served all having the exact same update timestamp, then the API will keep sending a response where max(updated_at) is equal to the request timestamp.

The client is no longer enumerating; the pattern has failed.
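
A minimal sketch, again with an in-memory stand-in for the API, shows the loop getting stuck once a bulk operation has stamped every element with the same timestamp:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // In-memory stand-in for the API: 10,000 orders bulk-updated at the exact same instant.
    var bulkUpdate = new DateTime(2015, 4, 29, 0, 0, 0, DateTimeKind.Utc);
    var orders = Enumerable.Range(1, 10_000)
        .Select(i => (Id: i, UpdatedAt: bulkUpdate))
        .ToList();

    // Simulates GET /api/purchaseorders?updated_after={cursor}, capped at 100 elements per response.
    List<(int Id, DateTime UpdatedAt)> Fetch(DateTime cursor) =>
        orders.Where(o => o.UpdatedAt >= cursor).Take(100).ToList();

    var cursor = DateTime.MinValue;
    for (var i = 0; i < 5; i++)  // bounded here; the real client loop would never terminate
    {
        var batch = Fetch(cursor);
        cursor = batch.Max(o => o.UpdatedAt);  // never advances past the bulk-update timestamp
        Console.WriteLine($"fetched {batch.Count} elements, cursor = {cursor:o}");
    }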

Sure, it's possible to tweak the timestamps to make sure that all the elements spread gracefully over distinct values, but it's a very non-trivial property to enforce. Indeed, a datetime column isn't really supposed to be defined with a uniqueness constraint in your database. It's feasible, but odd and error-prone.

The fallacy of the "power" APIs

Some APIs provide powerful filtering and sorting mechanisms. Through those mechanisms, it is possible to implement paging correctly, for example by combining two filters: one on the update datetime of the items and one on the item identifier. A correct implementation is far from trivial, however.

Merely offering the possibility to do the right thing is not sufficient: doing the right thing should be the only possibility. This is something that Lokad learned the hard way early on: web APIs should offer one and only one way to do each intended operation.

If the API offers a paging mechanism but the only way to implement paging correctly is to not use it, then rest assured that the vast majority of client implementations will get it wrong. From a design viewpoint, it's like baiting developers into a trap.

The "continuation token" as the only pattern that works

To my knowledge, there is only one pattern that works for paging: the continuation token pattern.

https://example.com/api/purchaseorders?continue=token

Where every request to a paged resource, like the purchase orders, may return a continuation token on top of the elements themselves whenever not all elements could be returned in one batch.
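
As an illustration, here is a minimal client/server sketch. The in-memory "server" and its token (here simply the last identifier served, which keeps the server stateless) are assumptions made to keep the example self-contained; a real token would be opaque to the client:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // In-memory stand-in for the API.
    var orderIds = Enumerable.Range(1, 250).ToList();

    (List<int> Elements, int? Continuation) GetPage(int? token)
    {
        // The server is free to pick a smaller chunk when elements are heavy or the load is high.
        var chunk = orderIds.Where(id => id > (token ?? 0)).Take(100).ToList();
        return (chunk, chunk.Count == 100 ? chunk[^1] : (int?)null);
    }

    // Client side: the only thing to do with a continuation token is to feed it back.
    int? continuation = null;
    do
    {
        var (elements, next) = GetPage(continuation);
        Console.WriteLine($"received {elements.Count} elements");
        continuation = next;
    } while (continuation != null);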

On top of being correct, that pattern has two key advantages:

  • It's very hard to get it wrong on the client side. There is only one way to do anything with the continuation token: feed it back to the API.
  • The API is not committed to returning any specific number of elements (in practice, a high upper bound can still be documented). Then, if some elements are particularly heavy or if the server is already under a heavy workload, smaller chunks can be returned.

This enumeration should not provide any guarantee that the same element won't be enumerated more than once. The only guarantee that should be provided by paging through tokens is that ultimately all elements will be enumerated at least once. Indeed, you don't want to end up with tokens that embed some kind of state on the API side; and in order to keep it smooth and stateless, it's important to lift this constraint.

Then, continuation tokens should not expire. This property is important in order to offer the client the possibility of performing incremental updates on a daily, weekly or even monthly schedule, depending on what makes sense from a business viewpoint.

No concurrency but data partitions

The continuation token does not support concurrent data retrieval: the current response has to be retrieved before the next request can be posted. Thus, in theory, this pattern somewhat limits the amount of data that can be retrieved.

Well, it's somewhat true, and yet mostly irrelevant. First, Big (Business) Data is exceedingly rare in practice, as the transaction data of the largest companies tends to fit on a USB key. For all the APIs that we have integrated, putting aside the cloud APIs (i.e. Azure or AWS), there was not a single integration where the amount of data was even close to justifying concurrent data accesses. Slow data retrieval is merely a sign of non-incremental data retrieval.

Second, if the data is so large that concurrency is required, then partitioning the data is typically a much better approach. For example, if you wish to retrieve all the data from a large retail network, the data can be partitioned per store. Partitioning makes things easier both on the API side and on the client side.
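
As a rough sketch of what this could look like on the client side, assuming a hypothetical store filter on the same endpoint and a hypothetical FetchPageAsync helper wrapping the HTTP call, each store simply gets its own independent continuation-token enumeration:

    // One independent continuation-token enumeration per store, run in parallel.
    var storeIds = new[] { "store-001", "store-002", "store-003" };  // hypothetical partition keys

    await Task.WhenAll(storeIds.Select(async storeId =>
    {
        string? token = null;
        do
        {
            // GET .../purchaseorders?store={storeId}&continue={token}  (hypothetical endpoint)
            var (elements, next) = await FetchPageAsync(storeId, token);
            Console.WriteLine($"{storeId}: received {elements.Count} elements");
            token = next;
        } while (token != null);
    }));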

Monday
Feb 16, 2015

Super-fast flat file parsing in C# and Java with a perfect hash function

At Lokad, (almost) all we do is crunch flat text files. It's not that we haven't tried anything else - we did - many times - and it went poorly. Flat files are ubiquitous, well understood, and they yield very good performance both on the write side and on the read side when working under tight budgets.

Keep in mind that the files we crunch are frequently generated by our clients, so while ProtoBuf or Cap'n Proto are very cool, asking our clients to deliver such formats would be roughly equivalent to asking them to reimplement their in-house Java ERP in Haskell. To preserve the sanity of our clients, we keep it simple and we stick to flat files.

However, we have decided to make flat file reading fast, really fast. Thus, one of us decided to tackle the challenge head-on, and came up with a very nice pattern: file parsing starts with a Perfect Hash Function preprocessing. Simply put, the flat file gets tokenized, and then each token gets replaced by an integer uniquely identifying this piece of string. Not only does this save a tremendous amount of string object instantiations, but afterward, all the complex parsing operations, such as parsing a date, can be performed only once, even if the token is encountered hundreds of times in the file. Performance-wise, it works because flat files tend to be very denormalized and very redundant.
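
Here is a minimal C# sketch of the idea - not the actual Lokad.FlatFiles API: a plain dictionary stands in for the perfect hash, and the input file and column layout are assumptions, but the substitution of tokens by integers and the once-per-distinct-token parsing are the same:

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;

    var tokenIds = new Dictionary<string, int>();      // token -> unique integer id
    var datesById = new Dictionary<int, DateTime>();   // the expensive parse, done once per distinct token

    int Intern(string token)
    {
        if (!tokenIds.TryGetValue(token, out var id))
        {
            id = tokenIds.Count;
            tokenIds.Add(token, id);
        }
        return id;
    }

    foreach (var line in File.ReadLines("orders.csv"))  // hypothetical input file
    {
        var cells = line.Split(';');
        var dateId = Intern(cells[0]);                  // assume the first column holds a date

        // Flat files are highly redundant: the same date token shows up thousands of times,
        // but it only ever gets parsed once.
        if (!datesById.ContainsKey(dateId))
            datesById[dateId] = DateTime.Parse(cells[0], CultureInfo.InvariantCulture);
    }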

We have released a tiny open source package codenamed Lokad.FlatFiles for C#/.NET (and a Java version too) under the MIT license. This library takes care of generating the perfect hashes out of a flat file. Our (unfair) benchmarks indicate that we typically reach about 30MB/second on a single CPU. Then, when the subsequent parsing operations take advantage of the token hashing, the speed-up is so massive that this initial perfect hashing tends to completely dominate the total CPU cost - so we stay at roughly 30MB/second.

Saturday
Feb 27, 2010

MapReduce as burstable low-cost CPU

About two months ago, when Mike Wickstrand set up a UserVoice instance for Windows Azure, I immediately posted my own suggestion concerning MapReduce. MapReduce is a distributed computing concept initially published by Google in late 2004.

Against all odds, my suggestion, driven by the needs of Lokad, made it into the Top 10 most requested features for Windows Azure (well, 9th rank, with about 20x fewer votes than the No. 1 request for scaled-down hosting).

Lately, I had the opportunity to discuss this further with folks at Microsoft gathering market feedback on this item. In the software business, there is a frequent tendency for users to ask for features they don't want in the end. The difficulty is that the proposed features may or may not correctly address the initial problems.

Preparing for the interview, I realized that, to some extent, I had fallen into the same trap when asking for MapReduce. Actually, we have already reimplemented our own MapReduce equivalent, which is not that hard thanks to the Queue Storage.

I care very little about framework specifics, be it MapReduce, Hadoop, DryadLINQ or something not invented yet. Lokad has no cloud legacy calling for a specific implementation.

What I do care about is much simpler. In order to deliver truckloads of forecasts, Lokad needs:

  1. large scale CPU
  2. burstable CPU
  3. low cost CPU

Windows Azure is already doing a great job addressing Point 1. Thanks to the massive Microsoft investments in Azure datacenters, thousands of VMs can already be instantiated if needed.

When asking for MapReduce, I was instead expressing my concern for Point 2 and Point 3. Indeed,

  • Amazon MapReduce offers 5x cheaper CPU compared to classical VM-based CPU.
  • VM-based CPU is not very burstable: it takes minutes to spawn a new VM, not seconds.

Then, low-cost CPU somewhat conflicts with burstable CPU, as illustrated by Amazon's Reserved Instances pricing.

As far as low-level cloud computing components are concerned, lowering costs usually means giving up on expressiveness as a trade-off:

  • Relational DB at $10/GB too expensive? Go for NoSQL storage at $0.1/GB: much cheaper, but much weaker as far as querying capabilities are concerned.
  • Guaranteed VMs too expensive? Go for Spot VMs: the price is lower on average, but you no longer have any certainty about either the price or the availability of VMs.
  • Latency of cloud storage too high? Go for a CDN: latency is much better for reads, yet much worse for writes.

Seeking large-scale burstable CPU, here is the list of items that we would be very willing to surrender in order to lower the CPU pricing:

  • No need for local storage. The VM comes with a 250GB hard drive, which we typically don't need.
  • No need for 2GB of memory. Obviously, we still need a bit of memory, but 512MB would be fine.
  • No need for any level of access to the OS.
  • Runtime could be made .NET only, and restricted to safe IL (which would facilitate code sandboxing).
  • No need for generic network I/O. Constrained access to specific Tables / Queues / Containers would be fine. This would facilitate colocation of storage and CPU.
  • No need for geolocalized resources. The cloud can push the workload wherever CPU is available. Yet, we would expect not to be charged for bandwidth between cloud data centers (if the transfer is caused by offsite computations).
  • No need for fixed pricing. Prioritization of requests based on variable pricing would be fine (considering that the CPU price could be lowered on average).

Obviously, there are plenty of options to drag the price down in exchange for a more constrained framework. Since Azure has the unique opportunity to deliver some very .NET-oriented features, I am especially interested in approaches that would leverage sandboxed code execution - giving up entirely on the OS itself to focus purely on the .NET runtime.

I am very eager to see how Microsoft will be moving forward on this request. Stay tuned.

Tuesday
May 19, 2009

FIPFO - First In Probably First Out

The FIFO (First In, First Out) is a very well-known concept in computer science. In one of my previous posts, I used the word FIPFO, First In Probably First Out, to refer to the cloud equivalent of the FIFO.

Indeed, the basic idea behind that term is that pure FIFOs can't scale much due to synchronization constraints. Yet, if you loosen the semantics just a little bit, that is to say, FIPFO, then you have an infinitely scalable data structure.
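
A hedged sketch of what consuming a FIPFO looks like, written against a hypothetical queue client rather than the actual Windows Azure Queue API: since elements come back roughly in order and at least once, the processing has to be idempotent.

    // 'queue' is a hypothetical client exposing FIPFO semantics (roughly ordered, at-least-once).
    while (true)
    {
        var message = queue.Dequeue(visibilityTimeout: TimeSpan.FromMinutes(1));
        if (message == null) break;      // queue drained, at least for now

        ProcessIdempotently(message);    // may run twice if a previous consumer timed out
        queue.Delete(message);           // if this line is never reached, the message reappears
    }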

Considering its simplicity and usefulness, I believe that FIPFO will be ubiquitous in future cloud computing applications.

Matthieu, a colleague at Lokad, was asking me if FIPFO was a well-known concept or yet another wacky term that I had just made up on my blog. Well, sorry, I can't really remember. FIPFO just seems to be a very appropriate way to describe data structures such as the Windows Azure Queue, but I am not sure whether I read about it elsewhere.

According to Google, there is just a single other result at the time for FIPFO, by another person who came up with the term a few days before my initial post while describing the Drupal API. Yet, I had never read the Drupal forums before, so I guess we separately ended up with the same idea and terminology.

Friday
May 15, 2009

Machine learning company, what’s so special?

Developing a machine learning startup is a very particular venture. Check out: Machine learning company, what’s so special? (based on my experience at Lokad)