Finding gaps in partitioned sequences

Posted on Jan 27, 2012 at 16:53. Tags: postgresql, sql, window aggregates.

There are an almost unlimited number of articles on the web about how to find gaps in sequences in SQL. And it doesn't have to be very hard. Doing it in a "partitioned sequence" makes it a bit harder, but still not very hard. But when I turned to a window aggregate to do that, I was immediately told "hey, that's a good example of a window aggregate to solve your daily chores, you should blog about that". So here we go - yet another example of finding a gap in a sequence using SQL.

I have a database that is very simply structured - it's got a primary key made out of (groupid, year, month, seq), all integers. On top of that it has a couple of largish text fields and an fti field for full text search. (Initiated people will know right away which database this is). The sequence in the seq column resets to zero for each combination of (groupid, year, month). And I wanted to find out where there were gaps in it, and how big they were, to debug the tool that wrote the data into the database. This is really easy with a window aggregate:


SELECT * FROM (
   SELECT
      groupid,
      year,
      month,
      seq, 
      seq-lag(seq,1) OVER (PARTITION BY groupid, year, month ORDER BY seq) AS gap FROM mytable
) AS t
WHERE NOT (t.gap=1)
ORDER BY groupid, year, month, seq

One advantage to using a window aggregate for this is that we actually get the whole row back, and not just the primary key - so it's easy enough to include all the data you need to figure something out.

What about performance? I don't really have a big database to test this on, so I can't say for sure. It's going to be a sequential scan, since I look at the whole table,and not just parts of it. It takes about 4 seconds to run over a table of about a million rows, 2.7Gb, on a modest VM with no actual I/O capacity to speak of and a very limited amount of memory, returning about 100 rows. It's certainly by far fast enough for me in this case.

And as a bonus, it found me two bugs in the loading script and at least one bug in somebody elses code that I'm now waiting on to get fixed...

Comments

what is seq-lag ? i assume this is a user defined func.

Posted on Jan 30, 2012 at 17:23 by mark.

No, there are no user defined functions in this.

seq is the column name, then minus the operator, then lag the builtin window function. So a standard integer minus, while calling the window function lag(), which is a standard one.

Posted on Jan 30, 2012 at 17:25 by Magnus Hagander.

thanks, sorry I should have noticed that was the seq MINUS the lag function.

p.s. you example may have a typo in it.

groupid != gropid, line 3

Posted on Jan 31, 2012 at 19:01 by mark.

Hah, thanks for pointing that out. That's what I get for trying to clean up the formatting when posting it...

Posted on Feb 6, 2012 at 10:18 by Magnus Hagander.

Add comment

New comments can no longer be posted on this entry.