In order to retrieve a large amount of data from a SQL table, you need to resort to a data paging scheme. Conceptually, a typical paged SQL query looks like (the syntax is approximate and vaguely inspired from MS SQL Server 2005)
WHERE RowNumber() BETWEEN @Index AND @Index + @PageSize
ORDER BY Foo.Id;
The queries are made iteratively until no rows get returned any more. Yet, this approach fails both if rows are added or deleted in the table
Foo during the iteration.
If rows are inserted, then the
RowNumbers() will be impacted. Some rows will see their number to be incremented.
- The newly inserted rows maybe missed.
- Certain rows are going to be retrieved twice.
In the overall, the situation isn't bad, because, after all, all rows that were present when the retrieval iteration started will be retrieved.
In the other hand, if rows are deleted, then some rows will get their
RowNumbers() decremented. As a consequence, certain rows will never get retrieved. This issue is quite troublesome, because you would not expect deleted rows to prevent the retrieval of other (valid) rows.
One workaround is to this situation would be to add some
IsDeleted flag to the table (instead of actually deleting rows, flags would be changed from
true); and to purge the database only once a while. Yet, this solution has many drawbacks: table schema must be changed and all existing queries that target this table must add the extra condition
IsDeleted = false.
A more subtle approach to the problem consists in changing the definition of the iterating index. Instead of using an index that correspond to a generate
RowNumbers(), let directly use
@IndexId, the greatest identifier retrieved so far. Thus, the paged query becomes
SELECT TOP @PageSize
WHERE Foo.Id > @IndexId
ORDER BY Foo.Id
With this new approach, we are now certain that no rows will be skipped during the retrieval process. Plus, we haven't made any change to the table schema.