Diffing two sets

Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case which came up again this week is when you have a set of snapshots in a history table, and want to find which keys were added and removed between the two occassions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do this with OUTER JOINs, which can often be a lot more efficient given that you can answer all your questions in one query.

So, to illustrate this, here's a simple, sample setup. It's basically a table that contains snapshots of a set of personnummer (swedish social security numbers, but it can of course be anything you want. If it's a single unique field, that makes life a lot easier). Snapshots are generated at a certain date by a scheduled taks that inserts the result of a select on a different set of tables. For this illustration, I'll just insert a couple of bogus rows of data. CREATE TABLE snapshots (

snapdate timestamptz NOT NULL,

personnr char(13) NOT NULL,

CONSTRAINT pk_snapshots PRIMARY KEY (snapdate,personnr)

);

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');

INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');



Now, to simply get whatever rows were either added or removed between these two sets, use the following query:

<code style="white-space:pre">
SELECT 

     COALESCE(s1.personnr,s2.personnr) AS personnr,

     CASE WHEN s1.personnr IS NULL THEN 'Added' 

          ELSE 'Removed' END AS action

FROM

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

    AS s1

FULL OUTER JOIN

  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

    AS s2

ON s1.personnr=s2.personnr

WHERE s1.personnr IS NULL OR s2.personnr IS NULL

If you want to see both the changed and the unchanged rows, remove the WHERE statement and expand the CASE like this:

SELECT

 COALESCE(s1.personnr,s2.personnr) AS personnr,

 CASE WHEN s1.personnr IS NULL THEN 'Added'

      WHEN s2.personnr IS NULL THEN 'Removed'

      ELSE 'Unchanged' END AS action

FROM

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')

AS s1

FULL OUTER JOIN

(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')

AS s2

ON s1.personnr=s2.personnr ```

If you want only changes in one direction, you should be using LEFT or RIGHT JOIN instead, and if you want just the ones not changed, INNER JOIN. You can still do it with OUTER JOIN, but it will likely be much better performing using the proper JOIN for the case.


Add comment

New comments can no longer be posted on this entry.

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

SCaLE 2023
Mar 9-12, 2023
Pasadena, CA, USA
Nordic PGDay 2023
Mar 21, 2023
Stockholm, Sweden
pgday.Paris 2023
Mar 23, 2023
Paris, France
PGCon 2023
May 30-Jun 2, 2023
Ottawa, Canada
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia

Past

FOSDEM 2023
Feb 2-5, 2023
Brussels, Belgium
PGConf.EU 2022
Oct 25-28, 2022
Berlin, Germany
PGConf.NYC 2022
Sep 22-23, 2022
New York, USA
Swiss PGDay 2022
Jul 1, 2022
Rapperswil, Switzerland
pgCon 2022
May 24-27, 2022
Online, Online
More past conferences