I think I (we) may have finally nailed the stats issue. The symptom was the stats collector stopping to process data under load (this is different from the old stats-collector crashes under load that we had back in early 8.0). Since the latest changes around autovacuum pushed the probability of it happening to around 100% when running the parallel regression tests, debugging suddenly became a lot easier.
Turns out there was a big problem in the pgwin32_select() emulation code (src/backend/port/win32/socket.c for the interested): under high load, it would simply stop telling the caller that there was data available on the socket. And it only happens for UDP sockets (the stats collector uses UDP because it's designed to drop packets under really heavy load rather than slow down the actual database work).
pgwin32_select() still isn't fixed, but with the current collector architecture it's no longer needed there. Originally, when we had both a collector and a bufferer, the same code had to watch two sockets at once and needed select(). Now it only needs to watch one socket, so we can use pgwin32_waitforsinglesocket(), which doesn't have this problem (as far as I've been able to tell, at least).
It's been applied to 8.2 and HEAD, and I'll hopefully have time to see if it is an easy backport to 8.1 sometime tomorrow. Per discussions on the list, we are not likely to backport it to 8.0 at all.
Hopefully, I'll be able to get rid of the hack that is pgwin32_select() altogether soon - with this change there is only one place left that uses it, and I'm going to look at replacing that with a proper native implementation as well.
Some of you may have the impression that 8.2 already builds with Visual C++. And it does - just not quite all of it, and with a lot of rough edges. Hopefully, 8.3 will be much nicer in that regard. The current goal is to get it up on the buildfarm as soon as possible, to make it a "proper platform". Unfortunately, I don't know the buildfarm code well enough to do that part myself without a lot of learning first.
It's not that far off. There's still one fairly large patch needed for the regression tests (or rather, for the regression test execution program), but the really big parts are in. Then it needs some glue to make it run more nicely - having to remember a three-line command line is not very nice...
There have been plenty of small fixes to make it work in environments that aren't identical to mine, and I know there is at least one more pending. And I just submitted a patch to clean up most of the compiler warnings generated (because VC++ emits different warnings than gcc and friends).
Bottom line - it's now definitely supposed to be in a state where others can try it out. So if you are interested and have the time, more testers are definitely needed. You need VC++ 2005 (Express should work fine) and current CVS HEAD of PostgreSQL. All the VC stuff is in src/tools/msvc; see the README file.
Per a news report I noticed today, it seems New Zealand's Electoral Enrollment Centre has "consolidated the country's electoral roll on the open source PostgreSQL database". Sounds like an interesting project, and a very good use for an Open Source database in general and PostgreSQL in particular.
It's not a particularly huge database, but it's distributed. And they're even running it on 7.4 (though they plan to upgrade) which certainly means they're missing out on a lot of things in 8.x, so they'll likely be even more pleased in the future.
And it's a matter of principle when it comes to sensitive data belonging to the public ;-) Dealing with it using open source software just makes a lot of sense (even more than just the monetary savings and the very stable platform that you get). Now I just wish something similar would happen here...
Now that the new GIN/tsearch2 based search engine is running on search.postgresql.org, it's time for a bit of statistics on the searches being performed. The following statistics are pulled from the most recent 99,733 searches. Note that these statistics are taken pre-lexing, so searches for different variants of the same word count as different words.
A total of 32,810 unique search terms were submitted, meaning that each search term was used on average only 3 times.
4,930 of the searches stepped past the first page of results, meaning that 95% of all searchers only looked at the first page.
17 searches stepped all the way to the 50th page (the maximum number of pages shown). My guess is that all of these were people testing and not actually looking for the hits there.
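For the curious, numbers like these can be pulled with plain aggregate queries over the search log. A minimal sketch - the searches table and its term and page columns are assumptions about how you might log searches, not my actual schema:
-- Total searches and unique search terms (table/column names assumed)
SELECT count(*) AS total_searches,
       count(DISTINCT term) AS unique_terms
FROM searches;
-- How many searches stepped past the first page of results
SELECT count(*) FROM searches WHERE page > 1;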
The 25 most common search terms, and the number of searches for them, were:
foobar 3873
search 1796
alter table 732
create table 588
jdbc 517
select 508
update 507
insert 475
copy 463
pg_dump 458
odbc 386
sequence 381
vacuum 368
date 361
cast 350
case 349
replication 330
grant 320
psql 318
join 296
create user 275
between 266
tsearch2 264
timestamp 260
cluster 250
The first two are certainly very interesting, with foobar representing almost 4% of the searches. You may wonder why this is, and it has a very logical explanation - that's the Nagios system that monitors the server's functionality...
The second one is explained by "search" being what you get if you hit the search button on the frontpage without typing anything into the search box first.
The good news is that system load has remained vastly below what it was with the old search engine. Most of the time there's hardly any load visible on the server at all; the load average is 0.02/0.01/0.00 right now (my stats processing being what brought it up that far).
Rebuilding a tsearch2 index takes time. A lot of time. By rebuilding, I mean when you, for example, have to UPDATE all your tsvector fields due to a change in the tsearch2 configuration. (Yes, I had to do that because I had a slightly incorrect tsearch2 configuration for the archives search database.) So don't do it. Instead, use the fact that PostgreSQL has nice transactional DDL and do something like this:
BEGIN TRANSACTION;
CREATE TABLE messages_new AS SELECT id,txt,to_tsvector(txt) AS fti
FROM messages;
CREATE INDEX messages_new_fti ON messages_new USING gin(fti);
ANALYZE messages_new;
ALTER TABLE messages RENAME TO messages_old;
ALTER TABLE messages_new RENAME TO messages;
COMMIT;
DROP TABLE messages_old;
ALTER INDEX messages_new_fti RENAME TO messages_fti;
(apologies for any typos, I didn't bother to actually type these commands into the database again, and I lost my cut-and-paste of what I ran)
This way, the messages table can still serve up searches without any disruption to the searches at all. And creating the new index is a lot faster than updating the existing one if you have to touch all rows.
Today I finally got around to cleaning up the final things, and activating the new tsearch2 based search engine for www.postgresql.org. Other than Stefan finding an escaping bug (hello magic_quotes, you should be off) and one encoding bug (yes, even though everything is utf-8 everywhere, you still have to explicitly tell htmlentities that it should be utf-8), things are looking good. The load on the search machine is way down, mainly because the indexing part of aspseek loaded things up horribly. The search itself wasn't too bad, and it's not too bad now either.
So, feel free to pound it a bit, and let me know if there are issues.
Oh, and yes, I'll be doing a complete writeup on it. Eventually :-)
Update: As Devrim so kindly informed me, I forgot to credit him. He of course helped me rsync some changes to the archives box at commandprompt, but most importantly he restored from backup the file that I accidentally removed even though it wasn't in CVS. So - thanks Devrim, couldn't have done it without you!
(this could be considered part 4 of the PostgreSQL website search update)
Custom synonym dictionaries are not actually new functionality in 8.2 - I think they've been around since 7.4 or so. But it's definitely a feature I did not know about, and it can be really useful in scenarios where you know some specifics about your data. In general, selecting and tuning your dictionaries in tsearch2 can give a big boost to the accuracy of your searches, so it's definitely worth looking into.
The concept of custom synonym dictionaries is really simple (and so is the implementation) - you get to define synonyms for certain words that all the tsearch2 functions will then apply to both indexing and searching, including the headline() function. For example, how many people incorrectly call PostgreSQL "postgre"? Answer: way too many. But that doesn't mean we shouldn't help them by providing whatever search matches we can. Also, some people refer to it as pgsql, and why should the user need to know that?
So how is this done in tsearch2? With the dict_syn dictionary it's very simple, and this dictionary is built and installed by default when you install tsearch2 (from source - I assume it's in all binary packages as well, since there is no configure switch to turn it off automatically).
The first step is to create the dictionary file itself. It's a simple text file (though you could keep it in the database and update the file with a trigger in, for example, pl/perlu if you wanted to) with two whitespace-separated fields listing a word and its synonym. For the web search, I started out with the following - a simple set of synonyms for the word postgres (all words are case-insensitive at this point):
postgresql postgres
pgsql postgres
postgre postgres
I chose to store this along with other tsearch2 files in *<pgdir>/share/contrib/pg_dict*, but you can of course store this anywhere you like as long as the backend can read the file.
With this done, you need to add the dictionary to the list of available ones in tsearch2, using a command like this (this assumes you have installed tsearch2 in your database already):
INSERT INTO pg_ts_dict (SELECT 'pg_dict',dict_init,
'/usr/local/pgsql/share/contrib/pg_dict',
dict_lexize, 'pg-specific dictionary'
FROM pg_ts_dict WHERE dict_name='synonym'
);
At this time, you should be able to test the new dictionary, using the lexize function. Notice how any word defined in the dictionary returns the synonym, in this case postgres.
search=# select lexize('pg_dict','postgresql'),lexize('pg_dict','pgsql');
lexize | lexize
------------+------------
{postgres} | {postgres}
(1 row)
Finally, you need to modify the tsearch2 configuration to use it. Normally you'll probably be working off a custom configuration, but if you're working off the default one, the command below will activate the *pg_dict* dictionary along with the English stemmer for the default config:
UPDATE pg_ts_cfgmap SET dict_name='{pg_dict,en_stem}'
WHERE tok_alias IN ( 'lword', 'lhword', 'lpart_hword' )
AND ts_name='default';
With this done, you should now be able to verify that the system works properly by converting a string to a tsvector:
search=# SELECT to_tsvector('postgresql');
 to_tsvector
--------------
 'postgres':1
(1 row)
And that's all there is to it. You can easily modify the dictionary by just editing the file. Note that you will need to re-index your tables (or more specifically, re-create the contents of the tsvector columns) if you change the dictionary and want those changes reflected in searchable data. Also, the contents of the synonym dictionary are cached the first time it's used in a backend, so if you're using connection pooling you may need to forcibly close your connections for the changes to show up.
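That re-creation can be as simple as re-running to_tsvector() over the source text. A minimal sketch, reusing the messages/fti names from the index-rebuild example in the previous post (your table and column names will of course differ):
-- Re-create the tsvector contents so the new synonyms take effect
UPDATE messages SET fti = to_tsvector(txt);
-- Keep planner statistics current after rewriting the whole column
ANALYZE messages;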
Finally, a big thanks to Oleg Bartunov for helping me sort out this configuration stuff in the first place, and the rapid fixing of the couple of issues I came across during the testing of 8.2.
A short while ago I committed a bunch of changes to the frontpage of www.postgresql.org, trying to include a couple of things that have been much requested. Most of the work is based on a design proposal by Simon Riggs.
The most obvious change is that we now list more items under news and events, and also a short aggregation of some posts from Planet PostgreSQL. Further incremental changes are sure to come...
Something I've had to do fairly often in SQL is finding the difference between two sets. A typical case, which came up again this week, is when you have a set of snapshots in a history table and want to find which keys were added and removed between two occasions. One way to do this is with a set of queries - one for what was added, one for what was removed, and possibly one for what wasn't changed (if you need that data). You can also do it with OUTER JOINs, which can often be a lot more efficient given that you can answer all your questions in one query.
So, to illustrate this, here's a simple sample setup. It's basically a table that contains snapshots of a set of personnummer (Swedish social security numbers), but it can of course be anything you want; if it's a single unique field, that makes life a lot easier. Snapshots are generated at a certain date by a scheduled task that inserts the result of a SELECT on a different set of tables (a sketch of that follows the sample data below). For this illustration, I'll just insert a couple of bogus rows of data.
CREATE TABLE snapshots (
snapdate timestamptz NOT NULL,
personnr char(13) NOT NULL,
CONSTRAINT pk_snapshots PRIMARY KEY (snapdate,personnr)
);
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','11111111-1111');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','22222222-2222');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','33333333-3333');
INSERT INTO snapshots VALUES ('2006-12-01 00:00:00','44444444-4444');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','22222222-2222');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','33333333-3333');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','55555555-5555');
INSERT INTO snapshots VALUES ('2006-12-02 00:00:00','66666666-6666');
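In a real setup, the scheduled task would do something along these lines - current_members is a made-up name standing in for whatever tables you actually snapshot:
-- Take a new snapshot of the current state; current_members is hypothetical
INSERT INTO snapshots (snapdate, personnr)
SELECT now(), personnr FROM current_members;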
Now, to simply get whatever rows were either added or removed between these two sets, use the following query:
SELECT
COALESCE(s1.personnr,s2.personnr) AS personnr,
CASE WHEN s1.personnr IS NULL THEN 'Added'
ELSE 'Removed' END AS action
FROM
(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')
AS s1
FULL OUTER JOIN
(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')
AS s2
ON s1.personnr=s2.personnr
WHERE s1.personnr IS NULL OR s2.personnr IS NULL;
If you want to see both the changed and the unchanged rows, remove the WHERE statement and expand the CASE like this:
SELECT
COALESCE(s1.personnr,s2.personnr) AS personnr,
CASE WHEN s1.personnr IS NULL THEN 'Added'
WHEN s2.personnr IS NULL THEN 'Removed'
ELSE 'Unchanged' END AS action
FROM
(SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')
AS s1
FULL OUTER JOIN
(SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')
AS s2
ON s1.personnr=s2.personnr;
If you want only changes in one direction, use a LEFT or RIGHT JOIN instead, and if you want just the unchanged rows, an INNER JOIN. You can still do it with an OUTER JOIN, but it will likely perform much better with the proper JOIN for the case.
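For example, a sketch of the one-direction case, listing only the rows removed between the two snapshots (swapping the subqueries, or using a RIGHT JOIN, gives you the added ones):
SELECT s1.personnr
FROM
   (SELECT personnr FROM snapshots WHERE snapdate='2006-12-01 00:00:00')
   AS s1
   LEFT JOIN
   (SELECT personnr FROM snapshots WHERE snapdate='2006-12-02 00:00:00')
   AS s2
   ON s1.personnr=s2.personnr
WHERE s2.personnr IS NULL;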
David Litchfield has posted a (fairly short, which is nice) analysis of the security of Oracle vs SQL Server. Unbreakable? Don't think so.
It's interesting to note that PostgreSQL has had nowhere near the number of security issues that Oracle has. Then again, I'll readily admit that PostgreSQL also hasn't received as much attention from professional security researchers, but it's still interesting.