Subversion and Git

This page dates back to 2013, when we started actively considering converting our (then) Subversion repositories to Git. We eventually did go ahead with that change.

Introduction

This page summarises my personal view of the version control system question. It does not claim to represent the position of the Shibboleth committers as a whole. The purpose of the page is to track previous discussions so that when (not if) this issue comes round again we're not starting from scratch. I think periodic review of the situation is good, but we need to make progress each time and not just keep going over the same ground.

Over the past couple of years the Shibboleth team has repeatedly looked at the question of whether we're using the best version control system for our purposes. In particular, we have more than once considered a move from our current choice of Subversion (a conventional centralised version control system) to Git, which appears to be the leading decentralised version control system used by all the cool kids.  I think it's fair to say that each time we've looked at this question we have ended up more positive (or at least no more negative) about the idea of moving, but have in each case determined that the time is not yet right.

My own position is that a move to one of the distributed systems is inevitable in the long term, and that Git is the best current choice amongst those systems. (I've also looked at Mercurial; while each system has its own technical pros and cons, Git seems to have the advantage of more widespread adoption by our community.)

However, I also know that moving from one version control system to another would be disruptive, and that this is particularly true when moving to a system which repays thinking about version control in a different way.  Given that we're in the middle of V3 IdP development, this is probably not the best time to make a disruptive change unless we can convince ourselves that we'd pay back the disruption during the V3 development cycle.

Hosting

If we moved to Git, we'd have a number of alternative hosting choices available. Although, as with any DVCS, there is no absolute "central" repository, I assume that we'd want to designate one set of repositories as the canonical copy of the source trees, and they would have to live somewhere.

  • We could go back to institutionally hosted repositories as we did earlier with Subversion at Georgetown.  Many of the institutions with which committers are associated already run Git hosting infrastructures we could piggy-back on.

  • We could host our central repositories at GitHub; one advantage of this would be the ability of non-committer contributors to make use of the GitHub fork/pull-request workflow to work alongside us. We don't necessarily have to keep the "master" repositories there for this to be possible, however, as we can always set up a GitHub organization and a set of repository clones there to get most of that functionality while still hosting the main repository copies elsewhere (this is what Linus does with the Linux kernel repository, for example).

  • We could use something like gitolite to manage a set of repositories on the shibboleth.net infrastructure servers.
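
As a rough illustration of the second option, here is a sketch of keeping a secondary set of clones in step with canonical repositories hosted elsewhere. It makes a number of assumptions: two local bare repositories stand in for the real canonical and GitHub locations, and the directory names, the committer identity and the sync_mirror helper are all invented for the example.

```shell
#!/bin/sh
set -e

# Hypothetical layout: two local bare repositories stand in for the
# canonical repository and for its clone on a service such as GitHub.
base=$(mktemp -d)
git init -q --bare "$base/canonical.git"
git init -q --bare "$base/mirror.git"

# A committer clones the (still empty) canonical repository and pushes work to it.
git clone -q "$base/canonical.git" "$base/dev" 2>/dev/null
cd "$base/dev"
git config user.name "Example Committer"
git config user.email "committer@example.org"
echo "hello" > README
git add README
git commit -q -m "Initial commit"
git tag v0.1
git push -q origin HEAD --tags

# sync_mirror (hypothetical helper): copy every ref from the repository
# at $1 to the repository at $2, making the second a mirror of the first.
sync_mirror() {
    git --git-dir="$1" push -q --mirror "$2"
}
sync_mirror "$base/canonical.git" "$base/mirror.git"

# Both repositories should now list the same branches and tags.
canonical_refs=$(git --git-dir="$base/canonical.git" for-each-ref --format='%(objectname) %(refname)')
mirror_refs=$(git --git-dir="$base/mirror.git" for-each-ref --format='%(objectname) %(refname)')
printf '%s\n' "$mirror_refs"
```

In a real deployment the second argument to sync_mirror would be the URL of the clone, and the push would presumably be run from a hook or a scheduled job on the canonical host; that part is not shown here.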

Conversion

There are tools (git svn, and svn2git, which is layered on top of it) to perform basic conversions of existing Subversion repositories to Git ones.

They do a reasonably good job once you have figured out the options to use, but there are some wrinkles:

  • If you want the history to make sense (and I think that's non-negotiable) then you have to figure out the IDs for every Subversion committer and generate a mapping to Git's system which involves full names and e-mail addresses.  svn2git allows for an "authors file" designated by --authors for this purpose. The good news is that this is probably a one-time thing across all our repositories.

  • Tags can be a problem in general.  Going forward, we'd want to use signed Git tags in the repositories but the import won't do that. It is possible to go back and re-tag but it's a pain, particularly if you want the date to appear correct.  You can do this kind of thing to fake retrospective dates on new signed tags:

    GIT_COMMITTER_DATE="yyyy-mm-ddThh:mm:ss" git tag -s v1.1 -m "Version 1.1" <revision>

  • In some cases, we have created Subversion tags from the workspace rather than from the repository's HEAD (indeed this is our currently recommended approach for releases). I am not sure whether svn2git knows how to handle that, as it is (pretty much by definition) not how you do things in Git-land. We would need to test this.

  • Our projects currently use Subversion "externals" to arrange for the project's Eclipse .settings directory to be checked out from a subdirectory of the parent project. Although Git has a conceptually similar facility (Git submodules), I don't believe it can be used in quite this way. Instead, we might have to make the whole parent project a submodule and use a symbolic link for .settings, or separate the settings as a new repository, or something of that kind. Another option seems to be git subtree. We'd need to look into this. We might also take the opportunity to consider whether there might not be a better way to standardise Eclipse settings within projects, as I've never really been happy with using Subversion externals. As an example of the kind of thing I mean, Workspace Mechanic can be used to standardise workspace-level settings, and I wonder whether something similar might not be available at the project level.

  • Our projects set Subversion properties to handle line ending conventions. This is normally achieved via a configuration file set within Eclipse. git svn does not pay any attention to this Subversion property (it ignores everything except the executable property) so we might need to make use of the .gitattributes file in some cases.

  • git svn doesn't convert the svn:ignore property into .gitignore files by default, but there is an additional subcommand (git svn create-ignore) to do that.

  • Our parent projects have a bin directory with svn-dependent scripts which would need to be converted. I think most of them are related to the externals issue, though.
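
To make the authors file point above more concrete, here is a minimal sketch of generating a template for one. It assumes the usual `svn log -q` output format as input and produces lines in the "svn-id = Full Name <email>" form that git svn and svn2git read. The authors_template function, the sample committers and the example.org addresses are all made up for illustration; the real names and addresses would still have to be filled in by hand.

```shell
#!/bin/sh
# authors_template (hypothetical helper): read `svn log -q` output on
# stdin and print one authors-file line per distinct committer ID.
# The name and address on each line are placeholders to be edited later.
authors_template() {
    awk -F '|' '/^r[0-9]+ / {
        id = $2
        gsub(/^ +| +$/, "", id)   # trim the padding around the committer ID
        print id " = " id " <" id "@example.org>"
    }' | sort -u
}

# Sample input in the format printed by `svn log -q` (invented committers).
sample_log='------------------------------------------------------------------------
r3 | alice | 2013-08-09 00:07:00 +0100 (Fri, 09 Aug 2013)
------------------------------------------------------------------------
r2 | bob | 2013-08-08 12:00:00 +0100 (Thu, 08 Aug 2013)
------------------------------------------------------------------------
r1 | alice | 2013-08-07 09:30:00 +0100 (Wed, 07 Aug 2013)
------------------------------------------------------------------------'

printf '%s\n' "$sample_log" | authors_template
```

In real use the function would be fed from `svn log -q` run against each of our repositories, with the combined and hand-edited result passed to svn2git via --authors.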

It would be possible to unbundle the composite repositories (e.g., "utilities") on the fly during such a conversion, so that each project has its own Git repository. I would be in favour of this; using something like gitolite to manage repositories makes it simple enough to create new ones that I don't see any benefit in bundling except in the case of the multi-module Maven projects.  If you use svn2git to do this, my understanding is that only the files from the appropriate part of the composite repository end up in each result, and that the full history of those files is nevertheless available, but that might be worth re-checking.

Required Effort

There has been some discussion around whether the conversion would take a lot of effort (estimates of up to months) or be simple (estimates of a couple of days). My opinion is that what we're doing today, particularly in the area of externals, makes conversion non-trivial, but I'd still put my estimate at the low end. On the other hand, I don't think that's the right (or at least not the only) question to be asking.

There is probably only a few hours' work in figuring out an authors file and a general set of options for svn2git. On the assumption that tags which were made from the workspace rather than the repository HEAD "just work" (we'd need to test this; allow a few more hours for that), the main pre-conversion effort would be in figuring out hosting and how we wanted to handle the .settings external. That last part is hard to estimate, particularly if we decide the right answer is to do something completely different.

Actually converting the repositories is probably a few days' work for a couple of people. I imagine it would mainly be making sure that the conversion had worked correctly, fixing up Jenkins jobs, and the like. We have been consistent enough in terms of repository structure that what works for one repository will probably work for them all. (I should add that my own conversion experience was a lot less straightforward, because I hadn't been consistent about repository structure, particularly in aggregated repositories.)

I personally think the main cost of a Git conversion would be converting those developers who don't have much Git experience, and who don't currently "think in Git". Rod is probably the person with the most experience here, so it may be worth getting his opinion about how much that would cost people with different current levels of exposure. Personally, I think I'm "over the hump" without being completely fluent in the arcane corners.

Mindset

It is possible to treat Git just like Subversion: just follow every git commit with a git push. Doing this of course turns Git back into a centralised version control system, and I think that probably means we'd be missing out on most of the benefits from a conversion (we'd still gain offline working and perhaps some performance improvements). The most obvious way to get the complete set of benefits would be to defer conversion until IdP V3.0 is out of the way and everyone has time to come up to speed by learning Git. We would then be in a position to apply Git-like thinking in everything we do, which might include changing our central repository branching policies and the like.

Another alternative would be to convert to Git sooner and just use it like Subversion to start with. To the extent that the Git way of doing things enables a different way of working for individual developers rather than enforcing change centrally, people could learn about Git at their own pace and take advantage of it or not at their own discretion.
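
The "use it like Subversion" habit described above can be sketched as follows. This is only an illustration under some assumptions: a local bare repository stands in for the central one, and the commit_and_push and branch_tip helpers, the file names and the committer identity are invented for the example.

```shell
#!/bin/sh
set -e

# Hypothetical setup: a local bare repository stands in for the central one.
base=$(mktemp -d)
git init -q --bare "$base/central.git"
git clone -q "$base/central.git" "$base/work" 2>/dev/null
cd "$base/work"
git config user.name "Example Committer"
git config user.email "committer@example.org"

# commit_and_push (hypothetical helper): the Subversion-like habit of
# publishing each commit to the central repository as soon as it is made.
commit_and_push() {
    git commit -q "$@" && git push -q origin HEAD
}

# branch_tip (hypothetical helper): print the commit that branch $1
# currently points at in the central repository.
branch_tip() {
    git --git-dir="$base/central.git" rev-parse "refs/heads/$1"
}

echo "one" > notes.txt
git add notes.txt
commit_and_push -m "First change"
branch=$(git symbolic-ref --short HEAD)
local_after_push=$(git rev-parse HEAD)
central_after_push=$(branch_tip "$branch")

# A plain commit, by contrast, stays local until it is pushed.
echo "two" >> notes.txt
git commit -q -a -m "Second change"
local_unpushed=$(git rev-parse HEAD)
central_unchanged=$(branch_tip "$branch")

echo "pushed commit:   $local_after_push"
echo "unpushed commit: $local_unpushed"
```

The second half is the part that has no Subversion equivalent: the commit exists and can be amended, reordered or discarded locally, and the central repository only sees it if and when the developer chooses to push.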

Here are some random thoughts from me about the Git mindset issue, from previous e-mail.

On 9 Aug 2013, at 00:07, Tom Zeller wrote:

> Well, and branching is really more like tagging, except not svn tags,
> since a branch is actually a hash ?

Subversion tagging and branching are both essentially making (efficient) copies of the state. Git tagging and branching are essentially attaching labels to nodes in the state graph. Nodes in the state graph are identified by hashes.

> I forget, but I think that is why
> it is difficult to compare svn and git branching, they're different.

Right. It's also why you can rebase a local branch in Git but not in Subversion: Git writes a new set of commits to the tree starting at a later node in the history, which to the developer looks a lot like moving the branch point. In Subversion, branches are immutable.

-- Ian

On 9 Aug 2013, at 01:22, Russ Allbery wrote:

> The thing I love the most about Git is that it gives me a lot more control
> over the change sets that I construct.

The above nicely encapsulates one of the Git mindset things that I found wasn't obvious as a Git newbie (which I admit to still being, mostly). In Git, committing is not something that is just a recording of the state of your workspace, it's part of the construction of a changeset that does a particular thing, and could potentially be reviewed as a whole elsewhere before being accepted. You should recognise that as very like the patch-based workflow that the Linux kernel developers have used since the beginning (they have certainly never believed in everyone having commit access to one master repository), and obviously that's not a coincidence.

-- Ian
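
The "labels on nodes" point from the first of those replies can be seen in a throwaway repository. This is just a sketch; the repository location, file name, branch and tag names and committer identity are all invented for the example.

```shell
#!/bin/sh
set -e

# Throwaway repository; every name used below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Committer"
git config user.email "committer@example.org"

echo "first" > file.txt
git add file.txt
git commit -q -m "First commit"

# A new branch and a lightweight tag are both just names for the current commit.
git branch topic
git tag v0.1
head_hash=$(git rev-parse HEAD)
branch_hash=$(git rev-parse topic)
tag_hash=$(git rev-parse v0.1)
echo "HEAD:   $head_hash"
echo "branch: $branch_hash"
echo "tag:    $tag_hash"

# Committing on the branch moves the branch label; the tag stays where it was.
git checkout -q topic
echo "second" >> file.txt
git commit -q -a -m "Second commit"
moved_branch_hash=$(git rev-parse topic)
fixed_tag_hash=$(git rev-parse v0.1)
echo "branch after commit: $moved_branch_hash"
echo "tag after commit:    $fixed_tag_hash"
```

The first three hashes printed are identical, because nothing has been copied: the branch and the tag are simply two more names for the same node. Only the branch name moves when a new commit is made on it.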

Also a comment from Rod: