Viewing entries tagged with tablesample. Return to full view.

Getting random rows faster. Very much faster.

Getting a single random row, or a few rows, from a table in order to get representative data for example is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random() like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table this can be very slow because it will have to scan the entire table to find the rows. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up on a gist) is still about twice as fast as ORDER BY random() on my test (with his data), it comes with some problems. For example, it requires a contiguous set of id values, that have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows -- and will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.

Continue reading

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

SCaLE 2023
Mar 9-12, 2023
Pasadena, CA, USA
Nordic PGDay 2023
Mar 21, 2023
Stockholm, Sweden
pgday.Paris 2023
Mar 23, 2023
Paris, France
PGCon 2023
May 30-Jun 2, 2023
Ottawa, Canada
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia

Past

FOSDEM 2023
Feb 2-5, 2023
Brussels, Belgium
PGConf.EU 2022
Oct 25-28, 2022
Berlin, Germany
PGConf.NYC 2022
Sep 22-23, 2022
New York, USA
Swiss PGDay 2022
Jul 1, 2022
Rapperswil, Switzerland
pgCon 2022
May 24-27, 2022
Online, Online
More past conferences