Author

I am Joannes Vermorel, founder at Lokad. I am also an engineer from the Corps des Mines who initially graduated from the ENS.

I have been passionate about computer science, software matters and data mining for almost two decades. (RSS - ATOM)

Meta
Tags
« Hard times ahead for shopping cart providers | Main | Tracking file downloads in Google Analytics AND Google Adwords »
Monday
Nov192007

Delete-proof data paging

In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like (the syntax is approximate and vaguely inspired from MS SQL Server 2005)

SELECT Foo.Bar
FROM Foo
WHERE RowNumber() BETWEEN @Index AND @Index + @PageSize
ORDER BY Foo.Id;

The queries are made iteratively until no rows get returned any more. Yet, this approach fails both if rows are added or deleted in the table Foo during the iteration.

If rows are inserted, then the RowNumbers() will be impacted. Some rows will see their number to be incremented.

  • The newly inserted rows maybe missed.

  • Certain rows are going to be retrieved twice.

In the overall, the situation isn't bad, because, after all, all rows that were present when the retrieval iteration started will be retrieved.

In the other hand, if rows are deleted, then some rows will get their RowNumbers() decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.

One workaround is to this situation would be to add some IsDeleted flag to the table (instead of actually deleting rows, flags would be changed from false to true); and to purge the database only once a while. Yet, this solution has many drawbacks: table schema must be changed and all existing queries that target this table must add the extra condition IsDeleted = false.

A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that correspond to a generate RowNumbers(), let directly use @IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes

SELECT TOP @PageSize
FROM Foo
WHERE Foo.Id > @IndexId
ORDER BY Foo.Id

With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven't made any change to the table schema.

Reader Comments (3)

If the ID is GUID that's being generated by the newid() routine (or some other proper guid generator) then there's problem with this solution. Newly inserted rows may be missed since they are not being added to the end of the ordered collection.

November 30, 2007 | Unregistered CommenterRinat Abdullin

Joannes,

This is not a comment on your current post. Apologies for the off-topic comment.

I was lead to your blog from the NGrid site. Is the NGrid project still active? It seems the last activity there was a couple of years ago.
I was looking for a open source Grid computing framework in .Net when I "discovered" NGrid.

Thanks,
Vyas

December 9, 2007 | Unregistered CommenterVyas Bharghava

I've had the old school 1st gen iPhone (EDGE, no GPS, etc.) for a couple of years now and just finally switched to the new iPhone 3G. Having never been able to use any of the GPS features I was wondering if anyone out there could suggest some cool apps in the Apple App Store that take advantage of the 3G's GPS and other features. Thanks in advance!

________________
http://www.youtube.com/watch?v=aBoMmqEIsGk" rel="nofollow">unlock iphone 3g

October 14, 2009 | Unregistered CommenterAbern
Comments for this entry have been disabled. Additional comments may not be added to this entry at this time.