Getting random rows faster. Very much faster.

Getting a single random row, or a few random rows, from a table -- for example to get representative sample data -- is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random(), like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table this can be very slow, because it has to read the entire table (and sort it by the random value) just to pick a single row. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up in a gist) is still about twice as fast as ORDER BY random() in my test (with his data), it comes with some problems. For example, it requires a contiguous set of id values, which have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows -- and it will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.
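
As a minimal sketch of where this is going -- using the same made-up table as above, and keeping in mind that the caveats are exactly what the rest of the post is about (SYSTEM sampling works on whole blocks, so it's only mostly random):

-- Built-in block-level sampling: read roughly 1% of the pages, keep one row
SELECT id FROM data TABLESAMPLE SYSTEM (1) LIMIT 1;

-- The tsm_system_rows extension (shipped in contrib) lets you ask for an exact row count
CREATE EXTENSION tsm_system_rows;
SELECT id FROM data TABLESAMPLE SYSTEM_ROWS(1);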

Continue reading

A surprising? lesson in the speed of languages

Scenario: I needed a small tool that would parse logfiles of approximately 6 million lines, apply two regexps to each line to extract a few values, and use two separate dictionaries/hashmaps (choose your poison wrt terminology) to count how many times each capture group in the regexps shows up.

All in all, this is not going to be a very performance-critical job, as it will only parse these 6 million lines about 4 times per day, so even in the worst case we're talking less than 100M regexp matches per day. Piece of cake for any language. And since it's a weekend and I have no deadline for delivering the finished script, I set out to prove to myself that Python is going to be fast enough for this.

Continue reading

FDWs, curl and LIMIT

I'm sure everybody is aware you can have PostgreSQL fetch data live across the internet (or locally, of course) as part of your queries and use that data directly. In fact, there is a large number of foreign data wrappers available for accessing different kinds of data. But in the simplest case, we can just use the file_fdw wrapper that's included in the standard PostgreSQL packages, together with everybody's HTTP swiss-army knife, curl.

In adapting this post for the public, what more timely dataset to work with in these pandemic-affected times than that provided by the ECDC -- being both open data and very current. In particular for this example, they provide public datasets with COVID numbers from across the world (actually public, requiring no registration to read, and actual data, not just a limited API).

So, let's see how we can access this data from PostgreSQL:

CREATE EXTENSION file_fdw;

CREATE SERVER curly FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE _rawdata (
 daterep text not null,
 day int not null,
 month int not null,
 year int not null,
 cases int not null,
 deaths int not null,
 countries text not null,
 geoid text not null,
 countrycode text null,
 popdata int null,
 continent text not null,
 cumulative14days float null
)
SERVER curly
OPTIONS (
 PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/',
 FORMAT 'csv',
 HEADER 'on'
);

And to use this we can simply query the foreign table, which will then trigger an HTTP GET of the data:

covid=# SELECT count(*) FROM _rawdata;
 count 
-------
 41837
(1 row)

However, there is an unfortunate interaction with LIMIT. If we, for example, try to get just the first 10 rows:

covid=# SELECT * FROM _rawdata LIMIT 10;
ERROR:  program "curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/" failed
DETAIL:  child process exited with exit code 23
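
As a hint at what's going on: exit code 23 is curl's "write error", which is what curl gets when the pipe it is writing to is closed early -- and that is exactly what happens when PostgreSQL stops reading as soon as the LIMIT is satisfied. One possible workaround, sketched here with the obvious caveat that it also hides genuine download failures, is to have the shell swallow curl's exit code:

-- Hedged workaround sketch: ignore curl's exit status entirely
ALTER FOREIGN TABLE _rawdata OPTIONS (
 SET program 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ || true'
);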

Continue reading

Repartitioning with logical replication in PostgreSQL 13

So, you have a partitioned table. And you want to change your mind. Re-partitioning is "easy" if you can take downtime -- just create a new table with a new name and copy all the data over. But what if we want to do it without downtime? The logical replication enhancements in PostgreSQL 13 bring us some new options for this!

But first a disclaimer -- this is definitely not pretty! And it does not take things like foreign keys and similar into consideration. But sometimes a quick hack can be the best hack.

So let's go!
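
The core building blocks look roughly like this -- a minimal sketch with made-up table and connection names, relying on two PostgreSQL 13 additions: the publish_via_partition_root publication option, and the ability to apply logical replication into a partitioned table on the subscriber side:

-- On the source: publish the old partitioned table as if it were one table
CREATE PUBLICATION repart FOR TABLE mydata
 WITH (publish_via_partition_root = true);

-- On the target: create the table with the new partitioning scheme,
-- then subscribe and let the initial sync route rows into the new partitions
CREATE SUBSCRIPTION repart_sub
 CONNECTION 'host=oldhost dbname=mydb'
 PUBLICATION repart;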

Continue reading

Connecting to Azure PostgreSQL with libpq 12 in a Kerberos environment

If you are using Azure PostgreSQL and have upgraded your client side libpq to version 12 (which can happen automatically for example if you use the PostgreSQL apt repositories), you may see connection attempts fail with symptoms like:

$ psql -hZZZZZZ.postgres.database.azure.com -dpostgres -UXXXXX_dba@ZZZ-db01
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

With no log information whatsoever available. This can happen if your client is in a Kerberos environment and has valid Kerberos credentials (which can be verified with the klist command). In this case, libpq 12 will attempt to negotiate GSSAPI encryption with the server, and it appears the connection handler in Azure PostgreSQL is unable to handle this and just kills the connection.

When running the same thing against a local PostgreSQL server prior to version 12, a message like the following will show up in the log:

2020-02-20 10:48:08 CET [35666]: [2-1] client=1.2.3.4 FATAL:  unsupported frontend protocol 1234.5680: server supports 2.0 to 3.0

This is a clear indicator of what's going on, but unfortunately the information isn't always available when connecting to a managed cloud service, such as Azure PostgreSQL. The hard error from Azure also prevents libpq from retrying without GSSAPI encryption, which is what would happen when connecting to a regular PostgreSQL backend or for example through pgbouncer.

The fix/workaround? Disable GSSAPI encryption in the client:

$ export PGGSSENCMODE=disable
$ psql -hZZZZZZ.postgres.database.azure.com -dpostgres -UXXXXX_dba@ZZZ-db01
Password for user XXXXX_dba@ZZZ-db01:
psql (11.6 (Ubuntu 11.6-1.pgdg16.04+1), server 9.5.20)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=>

If you have this type of issue, it's probably worth putting this environment variable in your startup scripts. It can also be set using the gssencmode parameter as part of the connection string, in environments where this is more convenient.
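
For example, something like this (with the same placeholder hostname and username as above):

$ psql "host=ZZZZZZ.postgres.database.azure.com dbname=postgres user=XXXXX_dba@ZZZ-db01 gssencmode=disable"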

When a vulnerability is not a vulnerability

Recently, references to a "new PostgreSQL vulnerability" have been circulating on social media (and maybe elsewhere). It's even got its own CVE entry (CVE-2019-9193). The origin appears to be a blogpost from Trustwave.

So is this actually a vulnerability? (Hint: it's not) Let's see:

Continue reading

PGConf.EU 2018 - the biggest one yet!

It's now almost a month since PGConf.EU 2018 in Lisbon. PGConf.EU 2018 was the biggest PGConf.EU ever, and as far as I know the biggest PostgreSQL community conference in the world! So it's time to share some of the statistics and feedback.

I'll start with some attendee statistics:

451 registered attendees
2 no-shows
449 actually present attendees

Of these 451 registrations, 47 were sponsor tickets, some of which were used by the sponsors themselves, and some were given away to their customers and partners. Another 4 sponsor tickets went unused.

Another 52 were speakers.

This year we had more cancellations than usual, but thanks to the conference waitlist we managed to re-fill all those spots before the event started.

Continue reading

Tracking foreign keys throughout a schema

I recently ran into the need with a customer to track the usage of a specific key throughout the schema. Basically: "what are all the tables and columns referencing this key, directly or indirectly?". Luckily, with a little bit of catalog querying, that's not hard:

WITH RECURSIVE what (tbl) AS (
   -- The table whose primary key we want to track
   VALUES ('public.tt')
),
t (oid, key, constrid) AS (
 -- Start with the primary key constraint of that table
 SELECT tbl::regclass::oid, conkey, NULL::oid
  FROM what INNER JOIN pg_constraint ON (contype='p' AND conrelid=tbl::regclass)
UNION ALL
 -- Then recursively add every foreign key referencing a key already found
 SELECT conrelid, conkey, c.oid
 FROM pg_constraint c
 INNER JOIN t ON (c.confrelid=t.oid AND c.confkey=t.key)
 WHERE contype='f'
)
-- Finally, resolve table and column names for everything found
SELECT nspname, relname, key, ARRAY(
    SELECT attname FROM pg_attribute a WHERE a.attrelid=t.oid AND attnum=ANY(key)
  )
FROM t
INNER JOIN pg_class cl ON cl.oid=t.oid
INNER JOIN pg_namespace n ON n.oid=cl.relnamespace;

The output can be similar to:

 nspname | relname | key | array 
---------+---------+-----+-------
 public  | tt      | {1} | {ttt}
 public  | foo1    | {1} | {a}
 public  | foo2    | {3} | {z}

for a single-column key (tt being the table the primary key is in, and foo1 and foo2 the tables referencing it, directly or through the other one), or:

 nspname | relname |  key  | array 
---------+---------+-------+-------
 public  | m1      | {1,2} | {a,b}
 public  | m2      | {1,2} | {a,b}

for a multi-column foreign key.

In this particular use-case, it was an efficient way to track down key usage where the naming standards for the key had not always been followed. And of course, we also found a couple of cases where the column had the correct name but lacked the actual FOREIGN KEY definition -- but finding those was just a matter of looking at the column names.
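
That last check can of course also be done from the catalogs. A hypothetical sketch, reusing the single-column example above (so looking for columns named ttt that are not part of any foreign key):

SELECT attrelid::regclass AS tbl, attname
FROM pg_attribute a
WHERE attname='ttt' AND attnum > 0 AND NOT attisdropped
AND NOT EXISTS (
  -- Keep only columns not covered by any foreign key constraint
  SELECT 1 FROM pg_constraint c
  WHERE c.conrelid=a.attrelid AND c.contype='f' AND a.attnum=ANY(c.conkey)
);
-- Note: this also lists the primary key table itself, where the column
-- is of course a key but not a foreign one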

Updates about upcoming conferences

Summer vacation times are over. Well, for some of us at least -- clearly some are still lucky enough to be off, which is showing a bit (see below). But as both conference organisation and participation throughout the rest of the year are starting to become clear, I figure it's time to share some updates around the different events.

Postgres Open SV

First of all - if you haven't already, don't forget to register for Postgres Open SV in San Francisco, in two weeks' time! Registration for the main US West Coast/California PostgreSQL community conference will close soon, so don't miss your chance. I'm looking forward to meeting many old and new community members there.

PostgreSQL Conference Europe

Next up after Postgres Open will be pgconf.eu, the main European PostgreSQL community conference of 2018. The planning for this year's conference is at full speed, but unfortunately we are slightly behind. In particular, we were supposed to notify all speakers today whether they were accepted or not, and unfortunately our program committee is a bit behind schedule on this one. We had over 200 submissions this year, which makes their work even bigger than usual. But speakers will be receiving their notifications over the next couple of days.

Once all speakers have been able to confirm their attendance, we will hopefully also have a schedule out soon. Until then, you can always look at the list of accepted talks so far. This list is dynamically updated as speakers are approved and confirm their talks.

We have already sold approximately half of the tickets that we have available this year, so if you want to be sure to get your spot, we strongly recommend that you register as soon as you can! And if you want to attend the training sessions, you should hurry even more as some are almost sold out!

PGConf.ASIA

Work on the program committee of PGConf.ASIA has also been going on over the late summer, and is mostly done! The schedule is not quite ready yet, but expected out shortly. You can look forward to a very interesting lineup of speakers, so if you are in the Asian region, I strongly recommend keeping an eye out for when the registration opens, and join us in Tokyo!

FOSDEM PGDay

As has been announced, PostgreSQL Europe will once again run a FOSDEM PGDay next to the big FOSDEM conference in Brussels in February next year. We hope to also run our regular booth and developer room during FOSDEM, but those are not confirmed yet (more info to come). The Friday event, however, is fully confirmed. It's of course not open for registration yet, but we'll get there.

Nordic PGDay

Nordic PGDay has been confirmed for March 19th next year. The format will be similar to previous years, and we will soon announce the location. For now, mark your calendars to make sure you don't double book! And rest assured, the conference will take place somewhere in the Nordics!

Usergroups and PGDays

Then there are, of course, a number of smaller events. Next week, I will speak at the Prague PostgreSQL Meetup. We should also be kicking off the Stockholm usergroup. PDXPUG runs a PGDay in Portland in September (which I unfortunately won't be able to attend). In general, usergroups seem to be getting going again after the summer break, so check with your local group(s) to see what's happening!

What does it mean to be on the board of PostgreSQL Europe

With the upcoming elections in PostgreSQL Europe, I'm excited to see that we have more candidates than ever before. But during the FOSDEM conference we just finished in Brussels, that also led to a fairly large number of people asking me the simple question "what does it actually mean to be on the board of PostgreSQL Europe?". So I think it's time to summarize that, both for those standing for election and for the members of the organisation in general.

For a TL;DR version: being on the board basically means a lot of administrative work :) But read on for some details.

Continue reading

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences