Viewing entries tagged with postgresql. Return to full view.

Sometimes you just have to try

Finally took the time look at the issue with attached files to the new techdocs. Turns out it was really easy - unless you specify a file extension, the mirroring script used to generate our static mirrors gets confused. Only the instructions told the person adding the files not to use an extension when referencing them. Oops.

While at it I also fixed the problem that it didn't return 404 for missing attached files - instead it returned a proper HTML page that said the file didn't exist. Which was then dutifully mirrored by all the static mirrors. Oops again, but another very easy fix.

Sometimes it's just so easy to find the bugs your code. Sometimes I wish life was that easy to debug...

Revamping the postgresql.org web search, part 3

Just a couple of notes during the further progress I've made:

  • While OOP in PHP is certainly pretty far from polished, it is a lot nicer than in Perl. As for the actual implementation details, I think they pretty much even out in the end - but the fact that I couldn't get encoding to work at all in Perl was the killer for it. So the PHP implementation will be the one that's used.

  • Being able to use persistent connections when connecting to websites to download their content for indexing would give noticeable speedups, specifically over slow connections. But doing that requires implementing HTTP/1.1, which in turn requires implementing chunked encoding. Something for the future - I can still full-index all the sites we pull down except the archives (a little over 200 sites) in less than 10 minutes. And archives index pretty fast anyway, since Josh has kindly set me up with a box that lives on the same network as the archives server.

  • Many sites don't implement If-Modified-Since properly. Luckily I've been able to bug a couple of the site-owners into fixing it, given that they are pg sites and I have "fairly good connections" with some of the webmasters there. Common problems include not implementing it at all, or just comparing exact values instead of ranges (this second one is actually mentioned in a lot of places as a caveat for implementations, but hey, I want to us eit..)

  • I need to tune my tsearch2 dictionaries. Just running on the standard one now, can probably be a lot more efficient using ispell and/or snowball. And tune some stopwords.

  • While not done yet, this is progressing nicely and I should be able to move to proper testing fairly soon. Yay.

(Updated: removed note about headline() because obviously I can't read my own testresults properly!)

GIN performance (postgresql.org websearch part 2)

As I finished loading the approximately 550,000 emails in our archives into my UTF8 encoded database again (bad encoding earlier), I decided to run a quick benchmark of the different ways of doing fulltext search on it with tsearch2, partially because Devrim requested it. I ran the same query a couple of times to bring things into the cache, and then saw very stable performance on all the queries. The query I ran was:

explain analyze select count(*) from messages where fti @@ plainto_tsquery('magnus hagander');

Which should be simple and self-explaining. I got the following search times:

  • tsearch2 with no index: ~6000 milliseconds (6 seconds)
  • tsearch2 with GiST index: ~250ms (0.25 seconds)
  • tsearch2 with GIN index: ~6ms (very fast)

This shows that tsearch2 with GIN is a lot faster than the GiST version we had in 8.1. Updates are more expensive, but when your dataset has a lot more read than write, then GIN will definitly be a big win.

Revamping the postgresql.org web search, part 1

Over the past couple of weeks (that's how I mostly do these things - an hour here, half an hour there, over an extended period of time), I've been working on revamping the search on www.postgresql.org and archives.postgresql.org. oday we're using a custom patched version of ASPSeek backed with a PostgreSQL database. Unfortunatly, it's a real pain to maintain - upstream didn't want the patches John wrote (if I understood the situation correctly), it requires a very special version of GCC to build, even the web interface is in C++ and thus a pain to edit for layout changes etc. Short story, time to look at something else.

The new solution I'm working on is based on PostgreSQL 8.2 with tsearch2 and GIN indexes. So far it's showing good performance, and a very good flexibility given that you get to use metadata in the PostgreSQL database to further enhance hits. Plus, the web interface can be integrated with the main site layout engine. Finally, the indexer is "context aware" and knows how to read our archives in a way

This has also taught me some bad things about the common languages/frameworks used out there, and their (non) dealing with encoding. Basically, the system needs to deal with multiple encodings (iso-8859-1, utf-8 etc etc), and more specifically with files that have broken encodings (such as claiming to be utf-8 but half the file is utf-8 and the other half iso-8859-1).

Initially, my indexer implementation was in Perl using LWP and HTML::Parser. Eventually I had to abandon this completely, because I could just not find any way to get Perl to ensure the output data was proper UTF-8, which is required to insert it into a PostgreSQL database with UTF8 encoding. I tried several different ways (after all, it's Perl, so you should do it different ways), but it always broke one way or another.

I've currently re-implemented most of the indexer in PHP instead. This does appear to work much better. The iconv() function actually works as advertised and can be set to always output clean UTF8 and just ignore broken encoding on input characters replacing them with blanks. Initially, I was using the Tidy extensions to PHP to parse the HTML, but had to give this one up because of the insane memory leaks (such as eating up a gigabyte of memory after indexing <10,000 pages - and I need to index more than 500,000). There's also a bug in 5.1.x at least wrt strtotime() that causes a coredump, but it appears to be fixed in 5.2.

Current version uses preg_match() with a couple of fairly simple regexps, and this appears to be working much better. It also gives significantly better performance than the Perl version, because all the "heavy duty" work is in C code linked into PHP, and not in interpreted code.

There are still some issues with the PHP indexer, but it looks a lot better. Will keep posting more info when I have it :-)

Meeting at EnterpriseDB

I just got back (well, yesterday) from the meeting at EnterpriseDB, about their cooperation with the community and such. It was basically me, a couple of -core guys and a bunch of EDB folks. IMHO, a very good meeting, and there appears to be a lot of very interesting things that EDB are about to contirbute back into the community. Most if it has already been discussed on the lists or at the conference in Toronto, but there are a couple of more very nice things in the works that we'll know more about shortly. Apart from that, it was also another nice chance to meet with some of the PostgreSQL people face to face.

So big thanks to Denis at EDB for including me in the group for this meeting. Just make sure Rich gets the mints right next time, so everybody's happy...

8.2beta1 and website updates

I've just committed the changes to the PostgreSQL website announcing 8.2beta1. It'll take a short while before it replicates to all our mirrors, but the files are already available from the download site.

While hacking the website today, I've also made two other changes. First it's yet another attempt to make the new techdocs code do what it's supposed to do - by allowing code tags, and by forcibly stripping forbitten STYLE attributes that appear to sneak in now and then.

Finally, the interactive docs section now requires a community login before you can post comments. This is yet another attempt at getting rid of the insane amount of commentspam we're getting on those these days. Let's hope it works for those, but does not make it too much of a hassle for the users that want to contribute. This update will take a bit longer, because we don't update the docs section of the site as often as the rest.

Getting ready for beta

Why does this always happen just before beta... (ok, there's a logical explanation from that, but it can still be a bother when it does happen)

Dave's been building the beta MSI files, and came across a nasty bug that pretty much makes the current cvs head unusable if you enable redirect_stderr to a logfile. I've submitted a patch for this, hopefully it'll be applied before we need to push beta1, and we can steal it into the installer (even though official beta1 has already been tagged from what I can see in the cvs logs).

While hacking around things here, I also fixed a couple of other minor things for pg 8.2: a warning message about job object APIs won't be logged everytime the server is started on Windows 2000. Also, pg_regress has been updated with a couple of things it couldn't do on Windows before (missed them during Toms cleanup while I was on vacation). None of this is applied, but hopefully it will go in soon.

As for the installer, with the code I just committed we no longer ship postmaster.exe, which should reduce the size of the distribution with about 3.5Mb. It should not be needed anymore, but if it does, we'll just have to put it back.

I've also ripped out the check for terminal services install on Windows 2003 - in all the tests I've run it is not a problem installing through TS on 2003. Only on 2000, where the check remains.

Finally, I've changed some build stuff so it's not hardcoded to the location of the new MSVC8 dependencies introduced by pgAdmin 1.6.

Dave, as usual, have done a lot of work updating all dependencies for 8.2 to their new versions, so we should be good to go.

Life is good, let's hope PostgreSQL 8.2 also turns out good!

Identifying people from the conference

I've put up the conference picture from this summers Anniversary Summit on the postgresql.org website, including an imagemap that puts name on all the faces in it. (Warning, fairly large picture, almost 5Mb. Don't load if you're on GPRS or something..)

Check it out, find out if you know someone on the picture without a name, and let me/us know what to put there!

Dave Page goes insane?

Seems Dave has crossed the line to the insane. Beleive it or not, but he gave me commit permissions on pgAdmin3. I guess he got tired of me nagging on him for quick processing of my patches. Now wait for things to blow up completely...

So far I've done some hackery on the Query Tool, so 1.6 will have the capability to launch the query tool separate from the main pgAdmin window, making startup significantly faster if you just want a GUI query tool. It will also be possible to open a new Query Tool window from inside the current one, with no need to go back out to the main object tree and open from there.

For those not knowing, Dave has announced feature freeze on pgAdmin3 1.6 for monday so that translators can get things done in good time before the PostgreSQL 8.2 release (which will ship with pgAdmin3 1.6 on at least Win32).

(oh yeah: Thanks Dave!)

LDAP issue fixed (hopefully)

After spending the weekend doing completely unrelated things that were a lot more fun than coding PostgreSQL (what? Is that possible, you ask. Yes, it is. Really. I promise.), I've now put together a quick patch for the LDAP-breaks-everything-on-Windows-2000 issue. Very simple patch, so I think it's quite safe.

I've also noticed that the reason that there weren't any comments on my vc++ build patch is that the -patches list ate it. Twice. So nobody ever actually saw it. Oops. Reposted with a link instead today. Let's hope that list issue gets sorted quickly.

Conferences

I speak at and organize conferences around Open Source in general and PostgreSQL in particular.

Upcoming

PGDay Chicago 2024
Apr 26, 2024
Chicago, USA
PGConf.DEV 2024
May 28-31, 2024
Vancouver, Canada

Past

SCaLE 2024
Mar 14-17, 2024
Pasadena, USA
Nordic PGDay 2024
Mar 12, 2024
Oslo, Norway
FOSDEM PGDay 2024
Feb 2-4, 2024
Brussels, Belgium
PGConf.EU 2023
Dec 12-15, 2023
Prague, Czechia
PGConf.NYC 2023
Oct 3-5, 2023
New York, USA
More past conferences