Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)

Meta
« Cloud-first programming languages | Main | Buying software? You should ignore references »
Friday
May082015

Nearly all web APIs get paging wrong

Data paging, that is, the retrieval of a large amount of data through a series of smaller data retrievals, is a non-trivial problem. Through Lokad, we have implemented about a dozen of extensive API integrations, and reviewed a few dozens of other APIs as well.

The conclusion is that as soon as paging is involved, nearly all web APIs get it wrong. Obviously, rock-solid APIs like the ones offered by Azure or AWS are getting it right, but those outstanding APIs are exceptions rather than the norm.

The obvious pattern that doesn't work

I have lost count of the APIs that propose the following broken pattern to page through their data, a purchase order history for example:

https://example.com/api/purchaseorders?page=2&pagesize=25

Where page is the page number and pagesize the number of orders to be retrieved. This pattern is fundamentally unsafe. Any order deleted while the enumeration is in-progress will shift the indices which, in turn, is likely to cause another order to be skipped.

There are many variants of the pattern, and everytime the problem boils down to: the "obvious" paging pattern leads to a flawed implementation that fail whenever concurrent writes are involved.

The "updated_after" filter doesn't work either

Another popular approach for paging is to leverage a filter on the update timestamp of the elements to be retrieved, that is:

https://example.com/api/purchaseorders?updated_after=2015-04-29

Then, in order to page the request, the client is supposed to take the most recent updated_at value from the response and to feed this value back to the API to further enumerate the elements.

However this approach does not (really) work either. Indeed, what if all elements have been updated at once? This can happen because of a system upgrade or because of any kind of bulk operation. Even if the timestamp can be narrowed down to the microsecond, if there are 10,000 elements to be served all having the exact same udpate timestamp, then, the API will keep sending a response where max(updated_at) is equal to the request timestamp.

The client is not enumerating anymore, the pattern has failed.

Sure, it's possible to tweak the timestamps to make sure that all the elements gracefully spread over distinct values, but it's a very non-trivial property to enforce. Indeed, a datetime column isn't really supposed to be defined with unicity constraint in your database. It's feasible, but odd and error prone.

The fallacy of the "power" APIs

Some APIs provides powerful filtering and sorting mechanisms. Thus, through those mechanims, it is possible to correctly implement paging. For example by combining two filters: one the update datetime of the items and one on the item identifier. A correct implementation is far from trivial however.

Merely offering the possibility to do the right thing is not sufficient: doing the right thing should be the only one possibility. This point is something that Lokad learned the hard way early on: web APIs should offer one and only one way to do each intended operation.

If the API offers a page mechanism but that the only way to correctly implement paging is to not use it; then, rest assured that the vast majority of the client implementations will get it wrong. From a design viewpoint, it's like baiting developers into a trap.

The "continuation token" as the only pattern that works

To my knowledge, there is about only one pattern that works for paging, it's the continuation token pattern. 

https://example.com/api/purchaseorders?continue=token

Where every request to a paged resource like the purchase orders has the possibility of returning a continuation token on top of the elements returned when not all elements could be returned in one batch.

On top of being correct, that pattern has two key advantages:

  • It's very hard to get it wrong on the client side. There is only one way to do anything with the continuation token: it's to feed it again to the API.
  • The API is not commited into returning any specific number of elements (in practice a high upper bound can still be documented). Then, if some elements are particularly heavy or if the server is already under heavy workload, smaller chunks can be returned.

This enumeration should not provide any garantee that the same element won't be enumerated more than once. The only garantee that should be provided by the paging through tokens is that ultimately all elements will be enumerated at least once. Indeed, you don't want to end-up with tokens that embed some kind of state on the API side; and in order to keep it smooth and stateless, it's important to lift this constraint.

Then, continuation tokens should not expire. This property is important in order to offer the possibility to the client perform incremental update on a daily, weekly or even  on a monthly schedule depending on what makes sense from a business viewpoint.

No concurrency but data partitions

The continuation token does not support concurrent data retrieval: the next response has to be retrieved before being able to post the next request. Thus, in theory, this pattern somehow limit the amount of data that can be retrieved.

Well, it's somewhat true, and yet mostly irrevelant. First, Big (Business) Data is exceedingly rare in practice, as the transation data of the largest companies tend to fit on a USB key. For all the APIs that we have integrated, putting aside the cloud APIs (aka Azure or AWS), there was not a single integration where the amount of data was even to close to justifying concurrent data accesses. Slow data retrieval is merely a sign a non-incremental data retrieval.

Second, if the data is so large that concurrency is required, then, partitionning the data is typically a much better approach. For example, if you with to retrieve all the data from a large retail network, then the data can be partitionned per store. Partitionning will be making things easier both on the API side and on the client side.

Reader Comments (6)

Too high of an implementation burden for the benefits it provides, and has its disadvantages as well: client cannot skip to a particular subset of the collection without iterating through lots of tokens. Also would feel weird in a UI where the number of results is unexpected or results are enumerated again (though other pagination strategies have this problem as well).

It's a tradeoff and I don't see it as better than providing limit/offset. One could even argue that worse is better, it is certainly the more dominant pattern.
May 9, 2015 | Unregistered CommenterDali
Interesting idea.. But if the tokens never expire, how do you avoid maintaining state on the server side?

Wouldn't you need to have some canonical ordering, and expose an id that maps directly into that order? That's easy for the simple case, but not if the consumer can specify sorting in any way.

Or are you referring to continuations in the formal CS sense? That would certainly require state and memory on the server, and is hard to do in many languages.

The "limit and offset" problem is vexing. It'd be great to have a good, flexible alternative!
May 9, 2015 | Unregistered CommenterA
They/we don't get it wrong.

It's a trade off between simplicity, speed, efford and result.
May 9, 2015 | Unregistered CommenterSigi
Nice post. Continuation tokens can be tricky to implement. I wrote up something called "timestamp offset checksum" cursors ( https://fanout.io/docs/smartfeeds.html#cursors ), which is a way to implement these kinds of tokens without requiring a special DB schema (only need datetime and primary key columns) and without needed to enforce unique timestamps. It also assumes the same concession you suggest, which is that sometimes the server may send the same item more than once. Thought you might find it interesting.
May 10, 2015 | Unregistered CommenterJustin
Pagination is actually easy. The problem is that the api pattern is designed for a localized architecture causing an architectural cross cutting concern. We have to redesign the api pattern for a distributed architecture thus getting rid of the architectural cross curring concern.

By unbinding communications (request/response) from the business logic and moving it higher up in the MVC architecture in an application, we can use the filter/handlerIntercepter to preProcess/postProcess (much like the proxy and MQ due) creating a better and more abstracted IO flow
May 10, 2015 | Unregistered CommenterOwen Rubel
Dali / Sigi. No, it's not a trade-off. Just buggy design. Almost all the companies where we did point out the flaw in their API design are now considering a next version with a correct paging mechanism.

Justin (Fanout). Yes, tricky to implement indeed. However, as soon as you do it right for one list of objects, replicating the solution to all the other lists is easy. Thanks for the link!
May 11, 2015 | Registered CommenterJoannes Vermorel
Comments for this entry have been disabled. Additional comments may not be added to this entry at this time.