PostgreSQL - now on git! - Magnus Hagander's blog

Posted on Sep 21, 2010 at 11:14. Tags: cvs, git, postgresql.

So it finally happened. The official PostgreSQL master source tree is now managed in git instead of cvs. This means, amongst other things, that the world's most advanced open source database now has a version control system with... eh, atomic commits!

Like the first run, this one had some issues, but they were smaller and were resolved in time to avoid a rollback. This time, it turned out that the cvs version that ships in Debian GNU/Linux comes with patches that change the default date format to the ISO standard. Since one of our main requirements for the conversion was to faithfully represent the old versions of the code, this broke every single file, because we used CVS keyword expansion in the old tree and the expanded dates no longer matched. Once we found this, it was a simple case of adding the DateFormat=old parameter to the CVS config file and re-running the whole conversion - which took several hours.
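To make the date format problem concrete, here is a minimal Python sketch of how the same revision timestamp expands differently under the two settings. The two format strings are my assumption of what the "old" and ISO-style CVS formats look like, the timestamp is made up, and `expand_date_keyword` is a hypothetical helper, not part of any conversion tool.

```python
from datetime import datetime

# Assumed formats: slash-separated dates for the classic CVS style
# (what DateFormat=old is expected to restore) and dash-separated
# dates for the ISO-style default in the patched cvs.
OLD_CVS_FORMAT = "%Y/%m/%d %H:%M:%S"
ISO_FORMAT = "%Y-%m-%d %H:%M:%S"


def expand_date_keyword(when, fmt):
    """Return the text a $Date$ keyword would expand to (hypothetical helper)."""
    return "$Date: " + when.strftime(fmt) + " $"


# A made-up revision timestamp, for illustration only.
when = datetime(2010, 9, 20, 12, 0, 0)

old_style = expand_date_keyword(when, OLD_CVS_FORMAT)
iso_style = expand_date_keyword(when, ISO_FORMAT)

print(old_style)
print(iso_style)
# Any file containing an expanded keyword differs between the two runs,
# so a byte-for-byte comparison against the old tree fails.
print(old_style == iso_style)
```

Since practically every file in the tree carried an expanded keyword, a single changed separator was enough to make every checked-out file differ from its historical contents.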

A lot of work went into making the repository conversion correct. Some of this was due to issues in the toolchain used - many thanks to Michael Haggerty and Max Bowsher for getting those fixed and explaining some of the behaviors of the software to us. In the end, a number of things needed to be changed in our existing CVS repository to make it migrate properly. Tom Lane provided a big patch to apply to the CVS repository itself prior to the conversion that cleaned most of those up - you can find a copy of it on my github page if you're interested.

With this patch applied, we managed a conversion that was very close to the original repository. I personally think this is only because the PostgreSQL project has been very careful about how it deals with its CVS repository - using it in a fairly simple way. And even with that, we had a number of issues - such as tags moved "after the fact", and branches created off partial checkouts. A fair number of the issues were simply because CVS doesn't have ways to represent everything in a reasonable way, for example when a file was deleted, re-added and deleted again, with this happening differently on different branches.

Git obviously deals with this better, and hopefully we'll have no such issues creeping into the new repository. However, the PostgreSQL project will be sticking with our "conservative approach" to source control - at least for the time being. For this reason, we are restricting what committers can use within git. We still allow any developers (and committers) to use whatever parts of git they want as they develop, but for commits going into the main tree, we are making a number of restrictions:

We will not allow merge commits. The PostgreSQL project doesn't follow the "git workflow" - we generally develop our patches on the master branch, and then back-patch to released stable branches for important bugs. We will continue doing this as separate commits and not using merges, thus keeping history linear.
We will not use the author field in git to tag it with the patch's original author (even in the few cases when the patch is actually authored by a single person). Instead, we will require that author and committer are always set to the same thing, and we will then credit the author(s) (along with the reviewer(s)) in the commit message, just like we've done before.
As a follow-on to that requirement, we will require that all committers are the ones registered with the project, using the same name and address on all commits. So even if a patch is developed on a topic branch on, say, github, it will get collapsed into a single commit (or maybe a couple, depending on size) tagged with the committer's name.

There has been a lot of discussion around this, and this is how the PostgreSQL project has worked and wants to continue working. We may change this sometime in the future, but not now - we are only changing the tool, and not the workflow.

To enforce these requirements, I've developed a policy hook for our git server that makes sure we don't break these rules by mistake. It's up on my github page, along with the script we use to generate commit mails to the pgsql-committers list that look just the way we want them to.
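For readers curious what such checks amount to, the three restrictions above can be sketched in a few lines of Python. This is not the actual hook from my github page, just an illustration under stated assumptions: the `Commit` class, `policy_violations` and `check_push` are hypothetical names, and a real server-side hook would read this metadata from git rather than from hand-built objects.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Commit:
    """Hypothetical, minimal view of the metadata a policy hook inspects."""
    sha: str
    parents: List[str]
    author: str      # "Name <email>"
    committer: str   # "Name <email>"


def policy_violations(commit: Commit, registered: Set[str]) -> List[str]:
    """Return a list of rule violations for one commit (empty if it is fine)."""
    problems = []
    # Rule 1: no merge commits, so history stays linear.
    if len(commit.parents) > 1:
        problems.append(f"{commit.sha}: merge commits are not allowed")
    # Rule 2: author and committer must be set to the same thing.
    if commit.author != commit.committer:
        problems.append(f"{commit.sha}: author and committer differ")
    # Rule 3: the committer must be registered with the project.
    if commit.committer not in registered:
        problems.append(f"{commit.sha}: committer is not registered")
    return problems


def check_push(commits: List[Commit], registered: Set[str]) -> List[str]:
    """Collect violations for every commit in a push.

    A real hook would reject the whole push if this list is non-empty.
    """
    return [p for c in commits for p in policy_violations(c, registered)]
```

The point of the design is that nothing gets rewritten on the server: a push with any violation is simply rejected, and the committer fixes the commits locally and pushes again.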

What does this mean for you as a PostgreSQL user? Really, nothing at all.

What does this mean for you as a PostgreSQL patch developer? Not much. If you did your work off the cvs-to-git mirror, you need to do a new clone. This repository is converted from scratch, so the old one is not valid anymore. We still encourage you to use, for example, github if you want to do your development there, but the patch submission process remains the same - send a context-style diff to the pgsql-hackers mailing list.

What does this mean for you as a buildfarm-animal maintainer? You need to reconfigure it to use git. I expect Andrew to post instructions on exactly what to do, and keep track of who hasn't done it ;)

Thanks and well done to all the people involved in making this happen!

Comments

Bravo.

This is what I love about the Postgres community. A task may take time, which can be frustrating for some, but it will be done right.

Posted on Sep 21, 2010 at 12:56 by Gurjeet Singh.

kudos !

Posted on Sep 21, 2010 at 13:32 by damien clochard.

Hip, hip, hooray!

Posted on Sep 21, 2010 at 14:28 by gabrielle.

I think you misunderstood the goal of the separate 'Author'-field in git. To my best knowledge, Author is meant for showing the committer of the original commit, and Committer for when a rebase or cherry-pick or some such is run, recreating commits based on a different commit than the one the original was based on.

In fact, because it's usually not so interesting to know who rebased or cherry-picked a certain commit, the Committer field is not shown by git commands with --pretty up to and including 'medium' (which is the default setting for git-log and git-whatchanged, for example).

So the requirement that the two are always the same seems a bit strange to me (I mean, which one do you keep when someone cherry-picks changes from a different branch?); I think it would be better to simply ask people allowed to push patches to the PostgreSQL master to set their own name and email address in .gitconfig (or for all postgresql checkouts, in .git/config), and not override that. Patches from people not having write access to the main PostgreSQL git repo will then automatically get an 'Author' field set to the person with write access who applies his patch and commits it.

Anyway, kudos on having switched to one of the most awesome VC systems on the planet! :-)

(PS. Could you please turn on nl2br on this blog? I can't figure out any way to type newlines in my comment, so this one became one huge blob of text, which doesn't help its legibility. Or is there some other way to type newlines here?)

Posted on Sep 21, 2010 at 16:59 by Sytse Wielinga.

From the perspective of the repository, we don't cherry pick. We don't merge. This is just one of the ways to enforce this.

And we don't override anything. If the two fields don't match, we reject the push and the committer gets to fix his problem and try again.

(As for nl2br - I had it on before, and it broke something else. I'll see if I can figure out how to set it for just comments)

Posted on Sep 21, 2010 at 17:02 by Magnus.

Turns out I could - so I've enabled nl2br for the comments. Thanks for pointing that out.

Posted on Sep 21, 2010 at 17:04 by Magnus.

You don't cherry pick? So how do you backport important patches from the trunk to older stable branches? That's what cherry pick is for, mainly. It's simply a shortcut for git diff blah^ blah | patch -p1, with the additional benefits that a merge is attempted automatically (with failing hunks marked in the source files and left out of the index), the result is added to the index and, optionally, committed automatically, with the option of editing the commit message in your favorite editor before you commit. So why not use it if it's there?

I've personally always seen it as a tool to accomplish the same thing more efficiently, not as something that changes the way you work with git.

Thanks for the amazingly fast response btw.

Posted on Sep 21, 2010 at 17:14 by Sytse Wielinga.

Maybe just wrong choice of terminology. We don't necessarily use "git cherry-pick", no. We might do that more - we just don't know yet :-)

So we probably do, just not necessarily with the tool - just yet.

Posted on Sep 21, 2010 at 17:22 by Magnus Hagander.

Exactly my point :-)

Posted on Sep 21, 2010 at 17:30 by Sytse Wielinga.

We'll get there, likely. Keep in mind that several of the key committers are not yet familiar with git. So we're taking this by baby steps.

Posted on Sep 22, 2010 at 03:05 by Tom Lane.
