Diffing two sets

Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case which came up again this week is when you have a set of snapshots in a history table, and want to find which keys were added and removed between the two occassions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do this with OUTER JOINs, which can often be a lot more efficient given that you can answer all your questions in one query.

So, to illustrate this, here's a simple, sample setup. It's basically a table that contains snapshots of a set of personnummer (swedish social security numbers, but it can of course be anything you want. If it's a single unique field, that makes life a lot easier). Snapshots are generated at a certain date by a scheduled taks that inserts the result of a select on a different set of tables. For this illustration, I'll just insert a couple of bogus rows of data. CREATE TABLE snapshots (

snapdate timestamptz NOT NULL,

personnr char(13) NOT NULL,

CONSTRAINT pk_snapshots PRIMARY KEY (snapdate,personnr)

);

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');



Now, to simply get whatever rows were either added or removed between these two sets, use the following query:

<code style="white-space:pre">
SELECT 

     COALESCE(s1.personnr,s2.personnr) AS personnr,

     CASE WHEN s1.personnr IS NULL THEN 'Added' 

          ELSE 'Removed' END AS action

FROM

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

    AS s1

FULL OUTER JOIN

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

    AS s2

ON s1.personnr=s2.personnr

WHERE s1.personnr IS NULL OR s2.personnr IS NULL

If you want to see both the changed and the unchanged rows, remove the WHERE statement and expand the CASE like this:

SELECT

 COALESCE(s1.personnr,s2.personnr) AS personnr,

 CASE WHEN s1.personnr IS NULL THEN 'Added'

      WHEN s2.personnr IS NULL THEN 'Removed'

      ELSE 'Unchanged' END AS action

FROM

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

AS s1

FULL OUTER JOIN

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

AS s2

ON s1.personnr=s2.personnr ```

If you want only changes in one direction, you should be using LEFT or RIGHT JOIN instead, and if you want just the ones not changed, INNER JOIN. You can still do it with OUTER JOIN, but it will likely be much better performing using the proper JOIN for the case.


Add comment

New comments can no longer be posted on this entry.

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences