New search live on www.postgresql.org

Today I finally got around to cleaning up the final things and activating the new tsearch2-based search engine for www.postgresql.org. Other than Stefan finding an escaping bug (hello magic_quotes, you should be off) and one encoding bug (yes, even though everything is utf-8 everywhere, you still have to explicitly tell htmlentities that it should be utf-8), things are looking good. The load on the search machine is way down, mainly because the indexing part of aspseek loaded the machine horribly. The search itself wasn't too bad before, and it's not too bad now either.

So, feel free to pound it a bit, and let me know if there are issues.

Oh, and yes, I'll be doing a complete writeup on it. Eventually :-)

Update: As Devrim so kindly informed me, I forgot to credit him. He of course helped me rsync some changes to the archives box at commandprompt, but most importantly he restored from backup the file that I accidentally removed even though it wasn't in CVS. So - thanks Devrim, couldn't have done it without you!

Custom synonym dictionaries in tsearch2

(this could be considered part 4 of the PostgreSQL website search update)

Custom synonym dictionaries are not actually new functionality in 8.2 - I think they've been around since 7.4 or so. But it's definitely functionality I did not know about, and it can be really useful in scenarios where you know some specifics about your data. In general, selecting and tuning your dictionaries in tsearch2 can give a big boost to the accuracy of your searches, so it's definitely worth looking into.

The concept of custom synonym dictionaries is really simple (and so is the implementation) - you get to define synonyms for certain words that all the tsearch2 functions will then apply to both indexing and searching, including the headline() function. For example, how many people incorrectly call PostgreSQL "postgre"? Answer: way too many. But that doesn't mean we shouldn't help them by providing whatever search matches we can. Also, some people refer to it as pgsql, and why should the user need to know that?
So how is this done in tsearch2? With the dict_syn dictionary it's very simple, and this dictionary is built and installed by default when you install tsearch2 from source (I assume it's in all binary packages as well, since there is no configure switch to turn it off).

The first step is to create the dictionary file itself. It's a simple text file (though you could keep the synonyms in the database and write the file with a trigger, in for example pl/perlu, if you wanted to) with two whitespace-separated fields listing a word and its synonym. For the web search, I started out with the following, a simple set of synonyms for the word postgres (all words are case-insensitive at this point):

```
postgresql postgres
pgsql postgres
postgre postgres
```

I chose to store this along with the other tsearch2 files in *<pgdir>/share/contrib/pg_dict*, but you can of course store it anywhere you like as long as the backend can read the file.
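As an aside, if you'd rather keep the synonyms in a table as mentioned above, a trigger can regenerate the file whenever the table changes. Here's a minimal sketch in pl/perlu - the pg_dict_words table and the sync_pg_dict() function are names I made up for this example, and the file path assumes the location used above:

```
-- Hypothetical table holding the synonym pairs.
CREATE TABLE pg_dict_words (
   word text NOT NULL,
   synonym text NOT NULL
);

-- Rewrite the dictionary file from the table contents.
CREATE FUNCTION sync_pg_dict() RETURNS trigger AS $$
    my $rv = spi_exec_query('SELECT word, synonym FROM pg_dict_words');
    open(my $fh, '>', '/usr/local/pgsql/share/contrib/pg_dict')
        or elog(ERROR, "could not write dictionary file: $!");
    foreach my $i (0 .. $rv->{processed} - 1) {
        my $row = $rv->{rows}[$i];
        print $fh "$row->{word} $row->{synonym}\n";
    }
    close($fh);
    return;
$$ LANGUAGE plperlu;

CREATE TRIGGER pg_dict_sync
    AFTER INSERT OR UPDATE OR DELETE ON pg_dict_words
    FOR EACH STATEMENT EXECUTE PROCEDURE sync_pg_dict();
```

Remember that the backend will still only re-read the file according to its normal caching behavior (see the note about caching further down).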



Once the file is in place, you need to add the dictionary to the list of available ones in tsearch2, using a command like this (this assumes you have already installed tsearch2 in your database):

```
INSERT INTO pg_ts_dict (SELECT 'pg_dict', dict_init,
    '/usr/local/pgsql/share/contrib/pg_dict',
    dict_lexize, 'pg-specific dictionary'
  FROM pg_ts_dict WHERE dict_name='synonym'
);
```

At this time, you should be able to test the new dictionary using the lexize function. Notice how any word defined in the dictionary returns the synonym, in this case postgres:

```
search=# select lexize('pg_dict','postgresql'), lexize('pg_dict','pgsql');
   lexize   |   lexize
------------+------------
 {postgres} | {postgres}
(1 row)
```



Finally, you need to modify the tsearch2 configuration to use it. Normally you'll probably be working off a custom configuration, but if you're working off the default one, the command below will activate the *pg_dict* dictionary along with the English stemmer for the default config:

```
UPDATE pg_ts_cfgmap SET dict_name='{pg_dict,en_stem}'
 WHERE tok_alias IN ('lword', 'lhword', 'lpart_hword')
   AND ts_name='default';
```

With this done, you should now be able to verify that the system works properly by converting a string to a tsvector:

```
search=# SELECT to_tsvector('postgresql');
 to_tsvector
--------------
 'postgres':1
(1 row)
```

And that's all there is to it. You can easily modify the dictionary by just editing the file. Note that you will need to re-index your tables (or more specifically, re-create the contents of the tsvector columns) if you change the dictionary and want those changes reflected in your searchable data. Also, the contents of the synonym dictionary are cached the first time it's used in a backend, so if you're using connection pooling you may need to forcibly close your connections for the changes to show up.
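For example, with a hypothetical table docs that has a body text column and a fti tsvector column (the names are mine, not anything from the site setup), re-creating the search data after a dictionary change could look like this:

```
-- Rebuild the tsvector contents so the new synonyms take effect
-- ("docs", "body" and "fti" are hypothetical names - adjust to your schema).
UPDATE docs SET fti = to_tsvector(body);
```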

Finally, a big thanks to Oleg Bartunov for helping me sort out this configuration stuff in the first place, and for rapidly fixing the couple of issues I came across during the testing of 8.2.

Homepage facelift

A short while ago I committed a bunch of changes to the frontpage of www.postgresql.org, trying to include a couple of things that have been much requested. Most of the work is based on a design proposal by Simon Riggs.

The most obvious change is that we now list more items under news and events, and also a short aggregation of some posts from Planet PostgreSQL. Further incremental changes are sure to come...

Diffing two sets

Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case, which came up again this week, is when you have a set of snapshots in a history table and want to find which keys were added and removed between two occasions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do it with OUTER JOINs, which can often be a lot more efficient, given that you can answer all your questions in one query.

So, to illustrate this, here's a simple sample setup. It's basically a table that contains snapshots of a set of personnummer (Swedish social security numbers, but it can of course be anything you want - a single unique field just makes life a lot easier). Snapshots are generated at a certain date by a scheduled task that inserts the result of a select on a different set of tables. For this illustration, I'll just insert a couple of bogus rows of data:

```
CREATE TABLE snapshots (
   snapdate timestamptz NOT NULL,
   personnr char(13) NOT NULL,
   CONSTRAINT pk_snapshots PRIMARY KEY (snapdate, personnr)
);

INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');
```



Now, to simply get whatever rows were either added or removed between these two sets, use the following query:

```
SELECT
     COALESCE(s1.personnr, s2.personnr) AS personnr,
     CASE WHEN s1.personnr IS NULL THEN 'Added'
          ELSE 'Removed' END AS action
FROM
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00') AS s1
FULL OUTER JOIN
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00') AS s2
ON s1.personnr = s2.personnr
WHERE s1.personnr IS NULL OR s2.personnr IS NULL;
```
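With the sample data above, this gives the two rows that disappeared and the two that appeared (row order may vary, since there is no ORDER BY):

```
   personnr    | action
---------------+---------
 11111111-1111 | Removed
 44444444-4444 | Removed
 55555555-5555 | Added
 66666666-6666 | Added
(4 rows)
```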

If you want to see both the changed and the unchanged rows, remove the WHERE clause and expand the CASE like this:

```
SELECT
     COALESCE(s1.personnr, s2.personnr) AS personnr,
     CASE WHEN s1.personnr IS NULL THEN 'Added'
          WHEN s2.personnr IS NULL THEN 'Removed'
          ELSE 'Unchanged' END AS action
FROM
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00') AS s1
FULL OUTER JOIN
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00') AS s2
ON s1.personnr = s2.personnr;
```

If you only want changes in one direction, you should use a LEFT or RIGHT JOIN instead, and if you want just the unchanged rows, an INNER JOIN. You can still do it with an OUTER JOIN, but it will likely perform much better with the proper JOIN for the case.
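For example, a LEFT JOIN variant that lists only the removed rows could look like this (same snapshot dates as above):

```
-- Rows present on 2006-12-01 but gone on 2006-12-02.
SELECT s1.personnr
FROM
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00') AS s1
LEFT JOIN
  (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00') AS s2
ON s1.personnr = s2.personnr
WHERE s2.personnr IS NULL;
```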

Interesting analysis of db security

David Litchfield has posted a (fairly short, which is nice) analysis of the security of Oracle vs SQL Server. Unbreakable? Don't think so.

It's interesting to note that PostgreSQL has had nowhere near as many security issues as Oracle. Then again, I'll readily admit that PostgreSQL also hasn't received as much attention from professional security researchers, but it's still interesting.

Sometimes you just have to try

Finally took the time to look at the issue with attached files on the new techdocs. Turns out it was really easy - unless you specify a file extension, the mirroring script used to generate our static mirrors gets confused. The problem was just that the instructions told the person adding the files not to use an extension when referencing them. Oops.

While I was at it, I also fixed the problem that missing attached files didn't return a 404 - instead the site returned a proper HTML page saying the file didn't exist. Which was then dutifully mirrored by all the static mirrors. Oops again, but another very easy fix.

Sometimes it's just so easy to find the bugs in your code. Sometimes I wish life were that easy to debug...

Revamping the postgresql.org web search, part 3

Just a couple of notes during the further progress I've made:

  • While OOP in PHP is certainly pretty far from polished, it is a lot nicer than in Perl. As for the actual implementation details, I think they pretty much even out in the end - but the fact that I couldn't get encoding to work at all in Perl was the killer for it. So the PHP implementation will be the one that's used.

  • Being able to use persistent connections when connecting to websites to download their content for indexing would give noticeable speedups, especially over slow connections. But doing that requires implementing HTTP/1.1, which in turn requires implementing chunked encoding. Something for the future - I can still fully index all the sites we pull down except the archives (a little over 200 sites) in less than 10 minutes. And the archives index pretty fast anyway, since Josh has kindly set me up with a box that lives on the same network as the archives server.

  • Many sites don't implement If-Modified-Since properly. Luckily I've been able to bug a couple of the site owners into fixing it, given that they are pg sites and I have "fairly good connections" with some of the webmasters there. Common problems include not implementing it at all, or comparing exact values only instead of handling ranges (this second one is actually mentioned in a lot of places as a caveat for implementers, but hey, I want to use it...)

  • I need to tune my tsearch2 dictionaries. I'm just running on the standard one now; it can probably be a lot more effective using ispell and/or snowball. And I need to tune some stopwords.

  • While not done yet, this is progressing nicely and I should be able to move to proper testing fairly soon. Yay.

(Updated: removed the note about headline() because obviously I can't read my own test results properly!)

GIN performance (postgresql.org websearch part 2)

As I finished loading the approximately 550,000 emails in our archives into my UTF8-encoded database again (the earlier load had bad encoding), I decided to run a quick benchmark of the different ways of doing fulltext search on it with tsearch2, partially because Devrim requested it. I ran the same query a couple of times to bring things into the cache, and then saw very stable performance on all the queries. The query I ran was:

```
explain analyze select count(*) from messages where fti @@ plainto_tsquery('magnus hagander');
```

Which should be simple and self-explanatory. I got the following search times:

  • tsearch2 with no index: ~6000 milliseconds (6 seconds)
  • tsearch2 with GiST index: ~250ms (0.25 seconds)
  • tsearch2 with GIN index: ~6ms (very fast)

This shows that tsearch2 with GIN is a lot faster than the GiST version we had in 8.1. Updates are more expensive, but when your dataset sees a lot more reads than writes, GIN will definitely be a big win.
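For reference, the two index variants would be created with something along these lines (the table and column come from the query above; the index names are mine):

```
-- GiST-based fulltext index (the 8.1-era approach)
CREATE INDEX messages_fti_gist ON messages USING gist (fti);

-- GIN-based fulltext index (new in 8.2)
CREATE INDEX messages_fti_gin ON messages USING gin (fti);
```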

Revamping the postgresql.org web search, part 1

Over the past couple of weeks (that's how I mostly do these things - an hour here, half an hour there, over an extended period of time), I've been working on revamping the search on www.postgresql.org and archives.postgresql.org. Today we're using a custom patched version of ASPSeek backed by a PostgreSQL database. Unfortunately, it's a real pain to maintain - upstream didn't want the patches John wrote (if I understood the situation correctly), it requires a very specific version of GCC to build, and even the web interface is in C++ and thus a pain to edit for layout changes. Short story: time to look at something else.

The new solution I'm working on is based on PostgreSQL 8.2 with tsearch2 and GIN indexes. So far it's showing good performance, and very good flexibility, given that you get to use metadata in the PostgreSQL database to further enhance hits. Plus, the web interface can be integrated with the main site layout engine. Finally, the indexer is "context aware" and knows how to read our archives in a way a generic crawler never could.

This has also taught me some bad things about the common languages/frameworks out there and their (non-)handling of encodings. Basically, the system needs to deal with multiple encodings (iso-8859-1, utf-8, etc.), and more specifically with files that have broken encodings (such as claiming to be utf-8 when half the file is utf-8 and the other half iso-8859-1).

Initially, my indexer implementation was in Perl using LWP and HTML::Parser. Eventually I had to abandon this completely, because I just could not find any way to get Perl to ensure the output data was proper UTF-8, which is required to insert it into a PostgreSQL database with UTF8 encoding. I tried several different ways (after all, it's Perl, so there's more than one way to do it), but it always broke one way or another.

I've currently re-implemented most of the indexer in PHP instead. This does appear to work much better. The iconv() function actually works as advertised and can be set to always output clean UTF8, ignoring broken input characters by replacing them with blanks. Initially I was using the Tidy extension to PHP to parse the HTML, but had to give that up because of insane memory leaks (such as eating up a gigabyte of memory after indexing fewer than 10,000 pages - and I need to index more than 500,000). There's also a bug, in 5.1.x at least, where strtotime() causes a core dump, but it appears to be fixed in 5.2.

The current version uses preg_match() with a couple of fairly simple regexps, and this appears to be working much better. It also gives significantly better performance than the Perl version, because all the "heavy duty" work happens in C code linked into PHP, not in interpreted code.

There are still some issues with the PHP indexer, but it looks a lot better. Will keep posting more info when I have it :-)

Meeting at EnterpriseDB

I just got back (well, yesterday) from the meeting at EnterpriseDB about their cooperation with the community and such. It was basically me, a couple of -core guys, and a bunch of EDB folks. IMHO a very good meeting, and there appear to be a lot of very interesting things that EDB is about to contribute back to the community. Most of it has already been discussed on the lists or at the conference in Toronto, but there are a couple more very nice things in the works that we'll know more about shortly. Apart from that, it was also another nice chance to meet some of the PostgreSQL people face to face.

So big thanks to Denis at EDB for including me in the group for this meeting. Just make sure Rich gets the mints right next time, so everybody's happy...
