Delete-proof data paging
In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like (the syntax is approximate and vaguely inspired from MS SQL Server 2005)
SELECT Foo.Bar FROM Foo WHERE RowNumber() BETWEEN @Index AND @Index + @PageSize ORDER BY Foo.Id;
The queries are made iteratively until no rows get returned any more. Yet, this approach fails both if rows are added or deleted in the table
Foo during the iteration.
If rows are inserted, then the
RowNumbers() will be impacted. Some rows will see their number to be incremented.
- The newly inserted rows maybe missed.
- Certain rows are going to be retrieved twice.
In the overall, the situation isn’t bad, because, after all, all rows that were present when the retrieval iteration started will be retrieved.
In the other hand, if rows are deleted, then some rows will get their
RowNumbers() decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.
One workaround is to this situation would be to add some
IsDeleted flag to the table (instead of actually deleting rows, flags would be changed from
true); and to purge the database only once a while. Yet, this solution has many drawbacks: table schema must be changed and all existing queries that target this table must add the extra condition
IsDeleted = false.
A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that correspond to a generate
RowNumbers(), let directly use
@IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes
SELECT TOP @PageSize FROM Foo WHERE Foo.Id > @IndexId ORDER BY Foo.Id
With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven’t made any change to the table schema.
Reader Comments (3)
If the ID is GUID that’s being generated by the newid() routine (or some other proper guid generator) then there’s problem with this solution. Newly inserted rows may be missed since they are not being added to the end of the ordered collection.
November 30, 2007 | Rinat Abdullin
Joannes, This is not a comment on your current post. Apologies for the off-topic comment. I was lead to your blog from the NGrid site. Is the NGrid project still active? It seems the last activity there was a couple of years ago. I was looking for a open source Grid computing framework in .Net when I “discovered” NGrid. Thanks, Vyas
December 9, 2007 | Vyas Bharghava