Custom synonym dictionaries in tsearch2

(this could be considered part 4 of the PostgreSQL website search update)

Custom synonym dictionaries is not actually new functionality for 8.2 - I think it's actually been around since 7.4 or so. But it's definitly been a function I did not know about, and it can be really useful in scenarios when you know some specifics about your data. In general, selecting and tuning your dictionaries in tsearch2 can give a big boost to the accuracy of your searches, so it's definitly worth looking into.

The concept of custom synonym dictionaries is really simple (and so is the implementation) - you get to define synonyms for certain words that all the tsearch2 functions will then apply to both indexing and searching, including the headline() function. For example, how many people incorrectly call PostgreSQL "postgre"? Answer: way too many. But that doesn't mean we shouldn't help them by providing whatever search matches we can. Also, some people refer to it as pgsql, and why should the user need to know that?
So how is this done in tsearch2? With the dict_syn dictionary it's very simple, and this dictionary is built and installed by default as long as you install tsearch2 (from source - I assume it's in all binary packages as well, since there is no config switch to turn it off automaticaly).

The first step is to create the dictionary file itself. It's a simple textfile (though you could keep it in the database and update it with a trigger in for example pl/perlu if you wanted to) with two whitespace separated fields listing the word and it's synonym. For the websearch, I started out with the following, a simple set of synonyms for the word postgres (all words are case-insensitive at this point):

postgresql postgres

pgsql postgres

postgre postgres



I chose to store this along with other tsearch2 files in *<pgdir>/share/contrib/pg_dict*, but you can of course store this anywhere you like as long as the backend can read the file.



With this done, you need to add the dictionary to the list of available ones in tsearch2, using a command like this (this assumes you have installed tsearch2 in your database already):

<code style="white-space:pre">
INSERT INTO pg_ts_dict (SELECT 'pg_dict',dict_init,

 '/usr/local/pgsql/share/contrib/pg_dict',

 dict_lexize, 'pg-specific dictionary'

 FROM pg_ts_dict WHERE dict_name='synonym'

);

At this time, you should be able to test the new dictionary, using the lexize function. Notice how any word defined in the dictionary returns the synonym, in this case postgres. search=# select lexize('pg_dict','postgresql'),lexize('pg_dict','pgsql');

lexize | lexize

------------+------------

{postgres} | {postgres}

(1 row)



Finally, you need to modify the tsearch2 configuration for it. Normally you'll probalby be working off a custom configuration, but if you're working off the default one, the command below will activate the *pg_dict* dictionary along with the english stemmer for the default config:
<code style="white-space:pre">
UPDATE pg_ts_cfgmap SET dict_name='{pg_dict,en_stem}'

 WHERE tok_alias IN ( 'lword', 'lhword', 'lpart_hword' )

  AND ts_name='default';

With this done, you should now be able to verify that the system works properly by converting a string to a tsvector:

search=# SELECT to_tsvector('postgresql');

to_tsvector


'postgres':1

(1 row) ```

And that's all there is to it. You can easily modify the dictionary by just editing the file. Note that you will need to re-index your tables (or more specifically, re-create the contents of the tsvector columns) if you change the dictionary and want those changes to be usable to search data. Also, the contents of the synonym dictionary is cached the first time it's used in a backend, so if you're using connection pooling you may need to forcibly close your connections in order for the changes to show up.

Finally, a big thanks to Oleg Bartunov for helping me sort out this configuration stuff in the first place, and the rapid fixing of the couple of issues I came across during the testing of 8.2.


Add comment

New comments can no longer be posted on this entry.

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences