I had a question yesterday on an internal IRC channel from one of my colleagues in Norway about a SQL query that would "for a given id value, return the 50 rows centered around the row with this id", where the id column can contain gaps (either because they were inserted with gaps, or because there are further WHERE restrictions in the query).
I came up with a reasonably working solution fairly quickly, but I made one mistake. For fun, I asked around a number of my PostgreSQL contacts on IM and IRC for their solutions, and it turns out that almost everybody made the exact same mistake at first. I'm pretty sure all of them, like me, would've found and fixed that issue within seconds if they were in front of a psql console. But I figured that was a good excuse to write a blog post about it.
The solution itself becomes pretty simple if you rephrase the problem as "for a given id value, return the 25 rows preceding and the 25 rows following the row with this id". That pretty much spells a UNION query. Thus, the solution to the problem is:
SELECT * FROM (
SELECT id,field1,field2 from mytable where id >= 123456 order by id limit 26
) AS a
UNION ALL
SELECT * FROM (
SELECT id,field1,field2 from mytable where id < 123456 order by id desc limit 25
) AS b
ORDER BY id;
The mistake everybody made? Forgetting that you need a subselect in order to use LIMIT. Without subselects, you can't put ORDER BY or LIMIT inside the two separate parts of the query, only at the outer end of it. But we specifically need to apply the LIMIT individually, and the ORDER BY needs to be different for the two parts.
Another question I got around this was, why use UNION ALL. We know, after all, that there are no overlapping rows so the result should be the same as for UNION. And this is exactly the reason why UNION ALL should be used, rather than a plain UNION. We know it - the database doesn't. A UNION query will generate a plan that requires an extra unique node at the top, to make sure that there are no overlapping rows. So the tip here is - always use UNION ALL rather than UNION whenever you know that the results are not overlapping.
All things considered, this query produces a pretty quick plan even for large datasets, since it allows us to do two independent index scans, one backwards. Since there are LIMIT nodes on the scans, they will stop running as soon as they have produced the required number of rows, which is going to be very small compared to the size of the table. This is the query plan I got on my test data:
Sort (cost=54.60..54.73 rows=51 width=86)
Sort Key: id
-> Append (cost=0.00..53.15 rows=51 width=86)
-> Limit (cost=0.00..35.09 rows=26 width=51)
-> Index Scan using mytable_pk on mytable (cost=0.00..55425.06 rows=41062 width=51)
Index Cond: (id >= 100000)
-> Limit (cost=0.00..17.04 rows=25 width=51)
-> Index Scan Backward using mytable_pk on mytable (cost=0.00..56090.47 rows=82306 width=51)
Index Cond: (id < 100000)
And yes, the final ORDER BY is still needed if we want the total result to come out in the correct order. With the default query plan, it will come out in the wrong order after the append node. But it's important to remember that by the specification the database is free to return the rows in any order it chooses unless there is an explicit ORDER BY in the query. The rows may otherwise be returned in a completely different order between different runs, depending on the size/width of the table and other parameters.
It's for just this kind of use case that OLAP functions were invented. The ROWS n (PRECEDING|FOLLOWING) syntax simply solves the problem.
The bad news: it's an 8.5 feature.
Assuming id's are coming without "holes" such statement may be used:
SELECT * FROM customer ORDER BY ABS(custno - 123456) LIMIT 51
Well, it assumes there are no gaps, and that the sequence starts at a controlled value. And that there are no other restrictions in WHERE.
All of which were requirements here...
New comments can no longer be posted on this entry.
instead of doing select [*] from (select ... limit) as a union all, you can also use this notation:
(select [] from x where i < 25 order by i desc limit 5) union all (select [] from x where i >= 25 order by i asc limit 6) order by i;
(p.s. sorry, i put the asterisk in [], because otherwise it was treated as bold marker) which works just as well, but is easier (at least for me) to read. also - there is no way (or i don't see any way) to put line breaks in comments :(