FDWs, curl and LIMIT

I'm sure everybody is aware you can have PostgreSQL fetch data live across the internet (or locally of course) as part of your queries and use the data directly. In fact there are a large number of drivers available to access different kinds of data for download. But in the simplest case, we can also just use the file_fdw wrapper that's included in the standard PostgreSQL packages, together with everybody's http-swiss-army-knife, curl.

In attempting to adapt this post for the public, what more time-relevant dataset to work off in these pandemic-affected times than the open data provided by the ECDC, being both open data and very current. In particular for this example, they provide public datasets with COVID numbers from across the world (actual public data, requiring no registration to read, and actual data, not just a limited API).

So, let's see how we can access this data from PostgreSQL:

CREATE EXTENSION file_fdw;

CREATE SERVER curly FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE _rawdata (
 daterep text not null,
 day int not null,
 month int not null,
 year int not null,
 cases int not null,
 deaths int not null,
 countries text not null,
 geoid text not null,
 countrycode text null,
 popdata int null,
 continent text not null,
 cumulative14days float null
)
SERVER curly
OPTIONS (
 PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/',
 FORMAT 'csv',
 HEADER 'on'
);

And to use this we can simply query the foreign table, which will then trigger a http GET of the data:

covid=# SELECT count(*) FROM _rawdata;
 count 
-------
 41837
(1 row)

However, there is an unfortunate interaction with LIMIT. So if we for example try to get just the first 10 rows:

covid=# SELECT * FROM _rawdata LIMIT 10;
ERROR:  program "curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/" failed
DETAIL:  child process exited with exit code 23

So what actually causes this problem?

When the PostgreSQL query executes with LIMIT 10 it stops reading after 10 rows, and closes the pipe from curl. At which point curl gives us an error. If you run the command manually with a pipe to head and without silent mode, this becomes clear:

$ curl  https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ | head -1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
  0 2797k    0 17109    0     0   556k      0  0:00:05 --:--:--  0:00:05  556k
curl: (23) Failed writing body (723 != 1448)

It's hard to complain about curl putting out an error here, because, well, it is an error. Unfortunately, while curl has a crazy amount of command-line parameters, it does not have one that would help with this scenario. However, a little bit of shell trickery can actually help us. Let's instead create the foreign table like this:

CREATE FOREIGN TABLE _rawdata (
 daterep text not null,
 day int not null,
 month int not null,
 year int not null,
 cases int not null,
 deaths int not null,
 countries text not null,
 geoid text not null,
 countrycode text null,
 popdata int null,
 continent text not null,
 cumulative14days float null
)
SERVER curly
OPTIONS (
 PROGRAM 'curl -s https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/ || exit $(( $? == 23 ? 0 : $? ))',
 FORMAT 'csv',
 HEADER 'on'
);

Since the parameter for PROGRAM in file_fdw can be an arbitrary shell expression, we can use this type of syntax to turn error 23 into error 0 (no error), while at the same time still detecting any other errors.

And once we have done this we can query with LIMIT just fine (wrapped in a subquery to make the result shorter for this post), and we can notice that it works correctly because it runs a lot faster than across the whole table so clearly the curl command aborted soon after it started:

covid=# SELECT count(*) FROM (SELECT * FROM _rawdata LIMIT 10) x;
 count 
-------
    10
(1 row)

Time: 54.818 ms
covid=# SELECT count(*) FROM _rawdata;
 count 
-------
 41837
(1 row)

Time: 6563.012 ms (00:06.563)

This little hack can be useful in a lot of cases when we really just need to peek at the beginning of the data (perhaps to validate that something we're doing is done the right way, such as column name mapping, or just because that's where the interesting data is). But in the example data picked for this case, we know that the data is only published once a day, and that we are likely to want to look at more than just the first n rows of it. In this case, the obvious thing we want to do is to materialize the data so we can query it repeatedly, and also to convert some data to a more reasonable format:

CREATE MATERIALIZED VIEW covid AS
 SELECT to_date(daterep, 'dd/mm/yyyy') AS date,
        cases,
        deaths,
        countries,
        geoid,
        countrycode,
        popdata,
        continent,
        cumulative14days
 FROM _rawdata;

And of course much faster since the data is now local:

covid=# SELECT count(*) FROM covid;
 count 
-------
 41837
(1 row)

Time: 34.267 ms

And for any actual analytics we might want to create a couple of local indexes:

CREATE UNIQUE INDEX idx_covid_datecountry ON covid(date, countrycode);
CREATE INDEX idx_covid_countrycode ON covid(countrycode);
CREATE INDEX idx_covid_countries ON covid(countries);

And then schedule a job that once per day refreshes our data so that we always have the latest:

covid=# REFRESH MATERIALIZED VIEW CONCURRENTLY covid;
REFRESH MATERIALIZED VIEW
Time: 6684.797 ms (00:06.685)

This obviously takes a bit of time as it re-downloads the data, but then we have all our data freshly indexed and available for whatever type of analytics we want to do.

In conclusion, using a MATERIALIZED VIEW for this is clearly the correct solution, and if we do that we can use LIMIT or any other query constructs without having to consider the underlying implementation at all. However, there are many cases where this might not make sense at all, such as when the freshness of the data is more important. And when this is the case, using the exit code shell hack can be quite useful to be able to run a LIMIT query against the foreign table.


Add comment

New comments can no longer be posted on this entry.

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences