Viewing entries tagged with extensions. Return to full view.

Getting random rows faster. Very much faster.

Getting a single random row, or a few rows, from a table in order to get representative data for example is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random() like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table this can be very slow because it will have to scan the entire table to find the rows. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up on a gist) is still about twice as fast as ORDER BY random() on my test (with his data), it comes with some problems. For example, it requires a contiguous set of id values, that have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows -- and will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.

Continue reading

FDWs, curl and LIMIT

I'm sure everybody is aware you can have PostgreSQL fetch data live across the internet (or locally of course) as part of your queries and use the data directly. In fact there are a large number of drivers available to access different kinds of data for download. But in the simplest case, we can also just use the file_fdw wrapper that's included in the standard PostgreSQL packages, together with everybody's http-swiss-army-knife, curl.

In attempting to adapt this post for the public, what more time-relevant dataset to work off in these pandemic-affected times than the open data provided by the ECDC, being both open data and very current. In particular for this example, they provide public datasets with COVID numbers from across the world (actual public data, requiring no registration to read, and actual data, not just a limited API).

So, let's see how we can access this data from PostgreSQL:

CREATE EXTENSION file_fdw;

CREATE SERVER curly FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE _rawdata (
 daterep text not null,
 day int not null,
 month int not null,
 year int not null,
 cases int not null,
 deaths int not null,
 countries text not null,
 geoid text not null,
 countrycode text null,
 popdata int null,
 continent text not null,
 cumulative14days float null
)
SERVER curly
OPTIONS (
 PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/',
 FORMAT 'csv',
 HEADER 'on'
);

And to use this we can simply query the foreign table, which will then trigger a http GET of the data:

covid=# SELECT count(*) FROM _rawdata;
 count 
-------
 41837
(1 row)

However, there is an unfortunate interaction with LIMIT. So if we for example try to get just the first 10 rows:

covid=# SELECT * FROM _rawdata LIMIT 10;
ERROR:  program "curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/" failed
DETAIL:  child process exited with exit code 23

Continue reading

Getting information about PostgreSQL SSL connections

PostgreSQL can, as many other products, use SSL to secure client/server communications. It can be configured to be optional or required, to require a client certificate, or even to use the SSL client certificate to log in. However, the DBA tools are currently a bit lacking in this regard. In particular, there is no way for a DBA to see what SSL parameters are in force (or indeed if SSL is enabled at all) for an already existing connection.

There are multiple ways to see the status of a specific connection (such as the libpq PQgetssl() function, the psql startup message or the sslinfo module. Unfortunately all these methods have one thing in common - they are only available to the process making the connection to the database, not to a DBA inspecting the system from the outside.

9.4 will make it a little bit better, because log_connections now include SSL information when the user connects, similar to:

LOG:  connection authorized: user=mha database=postgres SSL enabled (protocol=TLSv1.1, cipher=ECDHE-RSA-AES256-SHA)

But this can still be a bit of a pain to figure out for existing connectioons of course.

To deal with this problem, I've written a small PostgreSQL extension called pg_sslstatus. When loaded using shared_preload_libraries it will keep track of the SSL status for each connection, and expose it through a system view named pg_sslstatus like this:

postgres=# SELECT * FROM pg_sslstatus;
  pid  | ssl | bits | compression | version |        cipher        |                         clientdn                         
-------+-----+------+-------------+---------+----------------------+----------------------------------------------------------
 27286 | t   |  256 | f           | TLSv1   | ECDHE-RSA-AES256-SHA | 
 26682 | t   |  256 | t           | TLSv1   | ECDHE-RSA-AES256-SHA | /C=AU/ST=Some-State/O=Internet Widgits Pty Ltd/CN=magnus
 26693 | f   |      |             |         |                      | 
(3 rows)

It will include one row for each connection, and can then be joined with either pg_stat_activity or pg_stat_replication to view the SSL data in a bigger context:

postgres=# SELECT datname, usename, client_addr, ssl, cipher FROM pg_stat_activity INNER JOIN pg_sslstatus ON pg_stat_activity.pid=pg_sslstatus.pid;
 datname  | usename | client_addr | ssl |        cipher        
----------+---------+-------------+-----+----------------------
 postgres | mha     | 127.0.0.1   | t   | ECDHE-RSA-AES256-SHA
 postgres | mha     |             | f   | 
 pgweb    | pgweb   |             | f   | 
(2 rows)

The extension is available from my github account today for existing version of PostgreSQL, under the standard PostgreSQL license. My plan is to work on having a similar view included in PostgreSQL 9.5 by default, but it's unfortunately too late to include it in 9.4 at this point.

Extensions in PostgreSQL 9.1 fixes another pet-peeve

One thing I've really disliked is the fact that contrib modules had installation scripts that enforced the schema to public for the installation. In my opinion, for no useful reason at all.

For example, I often install the pgcrypto contrib module. And I install this in the pgcrypto schema, that I then either add to the search_path variable or just explicitly use in my queries, with things like pcrypto.crypt('foobar','barfoo'). For versions prior to 9.1, being able to do this required me to manually edit the installed pgcrypto.sql file, to remove the SET search_path = public; command.

Extensions in 9.1 makes this so much nicer. To get pgcrypto into it's own schema, I now just need to do:

postgres=# CREATE SCHEMA pgcrypto;
CREATE SCHEMA
postgres=# CREATE EXTENSION pgcrypto SCHEMA pgcrypto;
CREATE EXTENSION

If I happen to create it in public by mistake, I can even move it after the fact!

postgres=# ALTER EXTENSION pgcrypto SET SCHEMA pgcrypto;
ALTER EXTENSION

You still need to create the schema manually - in theory we could auto-create that, but the work is still a lot easier than before. And fully supported!

Remote log reading in PostgreSQL 9.1

PostgreSQL 9.1 beta1 now available - now is a great time to start testing it, and trying out all the great new features.

There have always been a number of ways to read your PostgreSQL logs remotely, over a libpq connection. For example, you can use the pg_read_file() function - which is what pgadmin does. PostgreSQL 9.1 adds a new and more convenient way (in some ways) to do this - using SQL/MED.

PostgreSQL 9.1 comes with SQL standard SQL/MED functionality. The MED in is short for "Managemend of External Data", and as the name sounds, it's about accessing data that's external to the PostgreSQL server. The SQL/MED functionality is not (yet) complete, but it's already very useful in it's current state.

In SQL/MED, there is something called a Foreign Data Wrapper, that can be compared to a driver. Using this FDW, we can create one or more Foreign Servers, which is a definition of how to connect to a specific instance of the service - if any. Finally, we can create one or more Foreign Tables on each of the Foreign Servers, giving us direct access to the remote data using SQL.

Continue reading

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences