Viewing entries tagged with sql.

Getting random rows faster. Very much faster.

Getting a single random row, or a few rows, from a table (for example, to get some representative sample data) is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random(), like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table, this can be very slow, because it has to scan the entire table, computing a random value for every row, just to return one. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up in a gist) is still about twice as fast as ORDER BY random() in my test (with his data), it comes with some problems. For example, it requires a contiguous set of integer id values. And it still takes about a second to run on my machine with his sample of 5 million rows -- and will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.

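As a teaser, here is a minimal sketch of what that can look like (the full post covers the details and caveats; the 1 here is just a hypothetical sample percentage):

-- SYSTEM sampling picks random pages rather than random rows,
-- so the result is "mostly random": very fast, but subject to
-- the physical clustering of the data
SELECT id FROM data TABLESAMPLE SYSTEM (1) LIMIT 1;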

Tracking foreign keys throughout a schema

Working with a customer, I recently ran into the need to track the usage of a specific key throughout the schema. Basically, "what are all the tables and columns referencing this key, directly or indirectly?". Luckily, with a little bit of catalog querying, that's not hard:

WITH RECURSIVE what (tbl) AS (
   -- the table whose primary key we want to track
   VALUES ('public.tt')
),
t (oid, key, constrid) AS (
   -- seed the set with the primary key of that table
   SELECT tbl::regclass::oid, conkey, NULL::oid
   FROM what INNER JOIN pg_constraint ON (contype='p' AND conrelid=tbl::regclass)
 UNION ALL
   -- then recursively add every foreign key that references a
   -- table and key already in the set
   SELECT conrelid, conkey, c.oid
   FROM pg_constraint c
   INNER JOIN t ON (c.confrelid=t.oid AND c.confkey=t.key)
   WHERE contype='f'
)
SELECT nspname, relname, key, ARRAY(
    -- resolve the column numbers to column names
    SELECT attname FROM pg_attribute a WHERE a.attrelid=t.oid AND attnum=ANY(key)
  )
FROM t
INNER JOIN pg_class cl ON cl.oid=t.oid
INNER JOIN pg_namespace n ON n.oid=cl.relnamespace

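For concreteness, here is a hypothetical schema (my reconstruction to match the output below, not from the actual customer case) that the query could be run against:

CREATE TABLE tt (ttt int PRIMARY KEY);
CREATE TABLE foo1 (a int PRIMARY KEY REFERENCES tt (ttt));
CREATE TABLE foo2 (x text, y text, z int REFERENCES foo1 (a));
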
The output can be similar to:

 nspname | relname | key | array 
---------+---------+-----+-------
 public  | tt      | {1} | {ttt}
 public  | foo1    | {1} | {a}
 public  | foo2    | {3} | {z}

for a single column key (tt being the table holding the primary key, and foo1 and foo2 referencing it directly or through the other one), or:

 nspname | relname |  key  | array 
---------+---------+-------+-------
 public  | m1      | {1,2} | {a,b}
 public  | m2      | {1,2} | {a,b}

for a multi-column foreign key.
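
A hypothetical pair of tables producing that second output (my construction, with the seed in the query changed to public.m1) would be:

CREATE TABLE m1 (a int, b int, PRIMARY KEY (a, b));
CREATE TABLE m2 (a int, b int, FOREIGN KEY (a, b) REFERENCES m1 (a, b));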

In this particular use-case, it was an efficient way to track down key usage in a schema where the naming standards for the key had not always been followed. And of course, we also found a couple of cases where the column had the correct name but lacked the actual FOREIGN KEY definition - those we found by just looking at the column names.
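
That second check was manual, but a hypothetical sketch of how it could be automated (my construction, not from the original work; it assumes the naming standard says the key column is called ttt) is to list columns with the right name that are not part of any FOREIGN KEY. Note that the table owning the key itself will also show up, and can be ignored:

SELECT n.nspname, c.relname, a.attname
FROM pg_attribute a
INNER JOIN pg_class c ON c.oid = a.attrelid
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE a.attname = 'ttt'              -- hypothetical naming standard for the key
  AND a.attnum > 0 AND NOT a.attisdropped
  AND c.relkind = 'r'                -- ordinary tables only
  AND NOT EXISTS (
     SELECT 1 FROM pg_constraint con
     WHERE con.conrelid = c.oid
       AND con.contype = 'f'
       AND a.attnum = ANY (con.conkey)
  );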

Finding gaps in partitioned sequences

There are an almost unlimited number of articles on the web about how to find gaps in sequences in SQL. And it doesn't have to be very hard. Doing it in a "partitioned sequence" makes it a bit harder, but still not very hard. But when I turned to a window aggregate to do that, I was immediately told "hey, that's a good example of a window aggregate to solve your daily chores, you should blog about that". So here we go - yet another example of finding a gap in a sequence using SQL.

I have a database that is very simply structured - it's got a primary key made out of (groupid, year, month, seq), all integers. On top of that it has a couple of largish text fields and an fti field for full text search. (Initiated people will know right away which database this is). The sequence in the seq column resets to zero for each combination of (groupid, year, month). And I wanted to find out where there were gaps in it, and how big they were, to debug the tool that wrote the data into the database. This is really easy with a window aggregate:


SELECT * FROM (
   SELECT
      groupid,
      year,
      month,
      seq,
      seq - lag(seq, 1) OVER (PARTITION BY groupid, year, month ORDER BY seq) AS gap
   FROM mytable
) AS t
WHERE NOT (t.gap=1)
ORDER BY groupid, year, month, seq

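To illustrate with made-up data (a hypothetical stand-in for the real table): given seq values 0, 1, 2, 5, 6 within one group, the query flags the row where the sequence jumps:

SELECT * FROM (
   SELECT
      groupid, year, month, seq,
      seq - lag(seq, 1) OVER (PARTITION BY groupid, year, month ORDER BY seq) AS gap
   FROM (VALUES
      (1, 2009, 5, 0), (1, 2009, 5, 1), (1, 2009, 5, 2),
      (1, 2009, 5, 5), (1, 2009, 5, 6)
   ) AS mytable (groupid, year, month, seq)
) AS t
WHERE NOT (t.gap=1);
-- returns a single row: (1, 2009, 5, 5) with gap = 3,
-- i.e. seq jumped from 2 to 5, so 3 and 4 are missing
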
One advantage of using a window aggregate for this is that we actually get the whole row back, and not just the primary key - so it's easy enough to include all the data you need to figure something out.

What about performance? I don't really have a big database to test this on, so I can't say for sure. It's going to be a sequential scan, since I look at the whole table, and not just parts of it. It takes about 4 seconds to run over a table of about a million rows, 2.7 GB in size, on a modest VM with no actual I/O capacity to speak of and a very limited amount of memory, returning about 100 rows. It's certainly by far fast enough for me in this case.

And as a bonus, it found me two bugs in the loading script and at least one bug in somebody else's code that I'm now waiting on to get fixed...

Getting a range of entries centered around a point

I had a question yesterday on an internal IRC channel from one of my colleagues in Norway about a SQL query that would "for a given id value, return the 50 rows centered around the row with this id", where the id column can contain gaps (either because they were inserted with gaps, or because there are further WHERE restrictions in the query).

I came up with a reasonably working solution fairly quickly, but I made one mistake. For fun, I asked around a number of my PostgreSQL contacts on IM and IRC for their solutions, and it turns out that almost everybody made the exact same mistake at first. I'm pretty sure all of them, like me, would've found and fixed that issue within seconds if they were in front of a psql console. But I figured that was a good excuse to write a blog post about it.

The solution itself becomes pretty simple if you rephrase the problem as "for a given id value, return the 25 rows preceding and the 25 rows following the row with this id". That pretty much spells a UNION query. Thus, the solution to the problem is:


    SELECT * FROM (
        -- 26 rows: the row with the given id itself, plus the 25 following it
        SELECT id, field1, field2 FROM mytable WHERE id >= 123456 ORDER BY id LIMIT 26
    ) AS a
UNION ALL
    SELECT * FROM (
        -- the 25 rows preceding it
        SELECT id, field1, field2 FROM mytable WHERE id < 123456 ORDER BY id DESC LIMIT 25
    ) AS b
ORDER BY id;

The mistake everybody made? Forgetting that you need the subselects in order to use LIMIT. Without them, you can't put ORDER BY or LIMIT inside the two separate parts of the query, only on the query as a whole. But we specifically need to apply the LIMIT to each part individually, and the ORDER BY needs to be different for the two parts.
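
In other words, the first attempt looked roughly like this (a reconstruction of the mistake, not anyone's actual code), and the parser rejects it outright:

-- Syntax error: ORDER BY and LIMIT cannot be attached to an individual
-- branch of a UNION without wrapping that branch in a subselect
-- (or parentheses)
SELECT id, field1, field2 FROM mytable WHERE id >= 123456 ORDER BY id LIMIT 26
UNION ALL
SELECT id, field1, field2 FROM mytable WHERE id < 123456 ORDER BY id DESC LIMIT 25;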

Another question I got around this was, why use UNION ALL. We know, after all, that there are no overlapping rows so the result should be the same as for UNION. And this is exactly the reason why UNION ALL should be used, rather than a plain UNION. We know it - the database doesn't. A UNION query will generate a plan that requires an extra unique node at the top, to make sure that there are no overlapping rows. So the tip here is - always use UNION ALL rather than UNION whenever you know that the results are not overlapping.
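
For comparison, with a plain UNION the plan would look roughly like this - a schematic shape only, not actual EXPLAIN output, and newer versions may use a HashAggregate instead of Unique over Sort:

 Unique
   ->  Sort
         Sort Key: id, field1, field2
         ->  Append
               ->  Limit
               ->  Limit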

All things considered, this query produces a pretty quick plan even for large datasets, since it allows us to do two independent index scans, one backwards. Since there are LIMIT nodes on the scans, they will stop running as soon as they have produced the required number of rows, which is going to be very small compared to the size of the table. This is the query plan I got on my test data:


 Sort  (cost=54.60..54.73 rows=51 width=86)
   Sort Key: id
   ->  Append  (cost=0.00..53.15 rows=51 width=86)
         ->  Limit  (cost=0.00..35.09 rows=26 width=51)
               ->  Index Scan using mytable_pk on mytable  (cost=0.00..55425.06 rows=41062 width=51)
                     Index Cond: (id >= 100000)
         ->  Limit  (cost=0.00..17.04 rows=25 width=51)
               ->  Index Scan Backward using mytable_pk on mytable  (cost=0.00..56090.47 rows=82306 width=51)
                     Index Cond: (id < 100000)

And yes, the final ORDER BY is still needed if we want the total result to come out in the correct order. Without it, the rows would come back in whatever order the Append node produces them, which is not the overall id order. And it's important to remember that, by the specification, the database is free to return the rows in any order it chooses unless there is an explicit ORDER BY in the query - they may come back in a completely different order between runs, depending on the size/width of the table and other parameters.
