Use Lots of Git Submodules

March 21, 2011

Overview

A common piece of advice in the README.md of projects on GitHub is: "Copy the contents of this folder into your project, ...". This seems especially prevalent for iOS projects.

The problem here is the amount of extra work this creates. If you find a bug and want to contribute the fix upstream, you need to clone the original, verify the bug still exists there, fix it, copy the files back over to see if the fix works, correct any mistakes in the original, copy the files back over again, commit, and then either e-mail a patch or fork, push and issue a pull request.

The source of the extra work is the copying: a big old DRY (Don't Repeat Yourself) violation that causes duplicate maintenance effort.

Step 1: Git Submodules

Git provides "submodules", similar to Subversion's externals. In essence, you clone another repository into a subdirectory of your own. Git tracks the submodule intelligently. When you commit in the master repository, it stores a reference to the submodule's current commit. If you pull in new changes or back out changes in the submodule, you commit this in the master repository as an update to the submodule.

The catch: you need to be sure to push the contents of the submodule, preferably before you push the master repository. If you push a commit referring to a newer version of a submodule that doesn't yet exist in a remote, people who pull that commit can't fetch the submodule version it refers to.
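
Recent versions of git can check this ordering for you. The following is a minimal sketch under stated assumptions, not a recipe: it builds throwaway repositories in a temporary directory (lib.git, app.git, lib-seed and ThirdParty/lib are stand-in names, and the g helper exists only to keep the demo independent of your local git configuration). It shows "git push --recurse-submodules=check" refusing to push the master repository while the submodule commit is still unpublished.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path submodules allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# Stand-in remotes: bare repositories on local disk.
git init -q --bare lib.git
git --git-dir=lib.git symbolic-ref HEAD refs/heads/master
git init -q --bare app.git

# Seed the library remote with one commit.
git init -q lib-seed
cd lib-seed
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.txt
g add lib.txt
g commit -q -m "Initial library commit"
g push -q "$work/lib.git" master
cd ..

# The master project uses the library as a submodule.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
g remote add origin "$work/app.git"
g submodule --quiet add "$work/lib.git" ThirdParty/lib
g commit -q -m "Added lib as a submodule"
g push -q origin master

# Commit inside the submodule and record it in the master project,
# but do NOT push the submodule yet.
cd ThirdParty/lib
echo v2 > lib.txt
g commit -q -a -m "Local library change"
cd ../..
g add ThirdParty/lib
g commit -q -m "Use newer lib"

# check: the push is refused while the submodule commit is unpublished.
if g push -q --recurse-submodules=check origin master 2>/dev/null; then
  check_result=pushed
else
  check_result=refused
fi
echo "push with check: $check_result"

# Publish the submodule first; the same push then goes through.
(cd ThirdParty/lib && g push -q origin master)
g push -q --recurse-submodules=check origin master
```

The same option also accepts "on-demand", which pushes the affected submodules for you before pushing the master repository.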

The advantages: No more DRY violation. If you want to update the version you are using, or even switch from the stable branch to the development branch, simply do the pull and checkout commands in the submodule, test, commit and push in the master repository.

 $ git submodule add git@github.com:BlueFrogGaming/icuke.git ThirdParty/icuke
 $ git commit -m "Added iCuke as a submodule"
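
To see what the master repository actually recorded, we can inspect the tree entry for the submodule path. This is a small sketch using throwaway repositories in a temporary directory; lib and ThirdParty/lib are stand-ins for icuke, and the g helper is a hypothetical convenience so the demo does not depend on local git configuration.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path submodules allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# A stand-in "library" repository with a single commit.
git init -q lib
cd lib
git symbolic-ref HEAD refs/heads/master
echo hello > lib.txt
g add lib.txt
g commit -q -m "Initial library commit"
lib_commit=$(git rev-parse HEAD)
cd ..

# The master project adds the library as a submodule and commits.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
g submodule --quiet add "$work/lib" ThirdParty/lib
g commit -q -m "Added lib as a submodule"

# The tree entry for the submodule path has mode 160000 and type "commit":
# a pointer to one commit of the library, not a copy of its files.
entry=$(git ls-tree HEAD ThirdParty/lib)
echo "$entry"

# The recorded commit is the library commit that was checked out.
recorded=$(git rev-parse HEAD:ThirdParty/lib)
git submodule status
```

The URL and path of the submodule live in the .gitmodules file, which is committed alongside this entry.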

Cloning a repository with submodules requires an extra option to indicate that git should clone submodules as well:

 $ git clone --recursive git@github.com:BlueFrogGaming/icuke.git

This can be done after the fact by running the following commands in the repository:

 $ git submodule init
 $ git submodule update

The "submodule update" command should be run after "git pull" to get corresponding changes to the submodules.
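
The pull-then-update cycle can be sketched end to end with throwaway repositories. Everything here is an assumption made for illustration: lib, app and checkout are stand-in names, the g helper only isolates the demo from local git configuration, and "update --init --recursive" is used as a one-step form of the init and update commands above.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path submodules allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# Stand-in library and a master project that uses it as a submodule.
git init -q lib
cd lib
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.txt
g add lib.txt
g commit -q -m "Library v1"
cd ..
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
g submodule --quiet add "$work/lib" ThirdParty/lib
g commit -q -m "Added lib as a submodule"
cd ..

# A second checkout of the master project, with its submodule populated.
g clone -q app checkout
cd checkout
g submodule --quiet update --init --recursive
cd ..

# Meanwhile the library gains a commit and the master project adopts it.
cd lib
echo v2 > lib.txt
g commit -q -a -m "Library v2"
lib_v2=$(git rev-parse HEAD)
cd ../app/ThirdParty/lib
g pull -q --ff-only origin master
cd ../..
g add ThirdParty/lib
g commit -q -m "Update lib to v2"
cd ..

# In the second checkout: pull, then move the submodule to the recorded commit.
cd checkout
g pull -q --ff-only
g submodule --quiet update --init --recursive
git submodule status
```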

Step 2: Keeping a Fork

There are two concerns to resolve here. First, what if the remote repository gets destroyed or removed? If we had copied the files into our repository, we would still have a project which can be cloned; however, since we've used a submodule, we can't clone the upstream source anymore.

Second, what if we want to allow other GitHub users access to local modifications of our library? Maybe we have a fork of it that does something different, or maybe we have a bugfix.

In order to resolve these problems, we fork the GitHub repository into our own account, then use our own fork as the submodule. Since we are the owner of the fork, we can be sure that it won't go away. We also have a place to push bugfixes.
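
If the submodule was originally added from the upstream URL, it can be re-pointed at the fork afterwards. The sketch below is one way to do that, assuming throwaway local repositories (upstream, fork.git and ThirdParty/lib are stand-in names, and g is a hypothetical helper that isolates the demo from local configuration): change the URL in .gitmodules, then run "git submodule sync" to copy it into the local configuration.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path submodules allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# Stand-in upstream project and a bare copy playing the role of our fork.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.txt
g add lib.txt
g commit -q -m "Initial commit"
cd ..
g clone -q --bare upstream fork.git

# The master project initially points its submodule at upstream.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/master
g submodule --quiet add "$work/upstream" ThirdParty/lib
g commit -q -m "Added lib as a submodule"

# Re-point the submodule at the fork: edit .gitmodules, then sync the new
# URL into .git/config and into the submodule's own "origin" remote.
git config -f .gitmodules submodule.ThirdParty/lib.url "$work/fork.git"
g submodule --quiet sync
g add .gitmodules
g commit -q -m "Use our fork of lib"

# The submodule's origin now refers to the fork.
git -C ThirdParty/lib config remote.origin.url
```

Other clones of the master project pick up the new URL by running "git submodule sync" after they pull.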

If you keep a fork, you will want to keep an "upstream" remote so that you can pull new changes. This way "origin" refers to our remote, while "upstream" is the remote from which we forked.

 $ git remote add upstream git@github.com:TheOriginator/project.git

You can now "git fetch upstream" or "git pull upstream master".
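
As a rough sketch of how the two remotes work together, the following uses throwaway local repositories in place of GitHub (upstream, fork.git and project are stand-in names, and g is a hypothetical helper that isolates the demo from local configuration). It fetches a new upstream commit, fast-forwards our master to it and publishes the result to our fork. In practice these commands would be run inside the submodule directory.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path transport allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# "upstream": the original project.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.txt
g add lib.txt
g commit -q -m "Initial commit"
cd ..

# "fork.git": our fork; "project": our working clone, whose origin is the fork.
g clone -q --bare upstream fork.git
g clone -q fork.git project
cd project
g remote add upstream "$work/upstream"

# Upstream moves on after we forked.
(cd "$work/upstream" && echo v2 > lib.txt && g commit -q -a -m "Upstream change")

# Bring the new work into our master, then publish it to our fork.
g fetch -q upstream
g merge -q --ff-only upstream/master
g push -q origin master
git remote -v
```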

Step 3: Topic Branches to Contribute Upstream

When contributing upstream, it's considered rude to send a batch of entangled patches that the upstream maintainer must sort out. To avoid this, create a topic branch for each change. This also allows us to play nice with GitHub, since we can only submit a single pull request per branch (and patches pushed to a branch after a pull request is issued might become part of the pull request after the fact).

To ensure each topic branch applies cleanly upstream, fork each branch from the tip of the upstream development branch.

 Project/submodule$ git fetch upstream
 Project/submodule$ git checkout upstream/master
 Project/submodule$ git checkout -b bugfix_a

... make your changes here ...

 Project/submodule$ git commit -a -m "Fix bug a"
 Project/submodule$ git checkout master
 Project/submodule$ git merge bugfix_a

If your bugfix doesn't work, you can delete the topic branch, recreate it from upstream/master and repeat the process until your commits are right.

Merging bugfix_a into our master (not upstream's master) allows us to start using our bugfix right away in our master project.

We can now issue a pull request to the upstream maintainer:

 Project/submodule$ git push origin bugfix_a

From the GitHub UI, you can now create a pull request from the bugfix_a branch to upstream's master.
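
Before opening the pull request, it can help to confirm that the topic branch carries only the intended commits. The sketch below replays the steps above against throwaway local repositories (upstream, fork.git and project are stand-in names, g is a hypothetical helper that isolates the demo from local configuration, and the local-only commit is invented to show what gets left out) and lists exactly what bugfix_a adds on top of upstream's master.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path transport allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# Stand-in upstream project, our fork of it, and a working clone of the fork.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo v1 > lib.txt
g add lib.txt
g commit -q -m "Initial commit"
cd ..
g clone -q --bare upstream fork.git
g clone -q fork.git project
cd project
g remote add upstream "$work/upstream"
g fetch -q upstream

# Unrelated work on our master that upstream should never see.
echo local > local.txt
g add local.txt
g commit -q -m "Local-only change"

# The topic branch starts at upstream's tip, not at our master.
g checkout -q --no-track -b bugfix_a upstream/master
echo fixed > lib.txt
g commit -q -a -m "Fix bug a"

# Exactly the commits a pull request from bugfix_a would carry.
topic_commits=$(git log --format=%s upstream/master..bugfix_a)
echo "$topic_commits"

# Use the fix in our own master right away, then publish the topic branch.
g checkout -q master
g merge -q --no-edit bugfix_a
g push -q origin bugfix_a
```

If the listing shows more than the fix itself, the branch was probably forked from our master rather than from upstream's.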

Step 4: Keep a Project Branch

I suppose this depends on how much you trust git. Theoretically, this isn't necessary but it makes me a little less nervous.

Suppose you create a topic branch, merge it into your master so you can use it, then issue a pull request. Further suppose that upstream merges a bunch of things, including your pull request, in different orders. Now you'd like to pull new changes from upstream.

Git is supposed to be smart enough to do this well. And it is, but since the patches in your master are now in a different order, it might require some conflict resolution on your part.

Instead, keep a local branch (pushed to origin) where you merge your local changes. Your master branch can then track upstream's master without issue. You can drop the local branch and re-create it from upstream's master at will, merging back in any changes that have not yet been accepted upstream. It's OK to drop this branch and then reuse it, since git tracks submodules by reference to their commits rather than to a branch name.
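
One possible shape for this workflow is sketched below with throwaway local repositories. The names are all assumptions: "local" is a hypothetical name for the project branch, upstream, fork.git and project stand in for the real repositories, and g is a helper that isolates the demo from local configuration. Upstream accepts only bugfix_a, after which the project branch is rebuilt from the new upstream tip with bugfix_b merged back in.

```shell
set -eu
work=$(mktemp -d)                 # throwaway directory for the demo
cd "$work"
# Hypothetical helper: fixed identity, and local-path transport allowed.
g() { git -c user.name=Demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# Stand-in upstream project, our fork of it, and a working clone of the fork.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo a1 > a.txt
echo b1 > b.txt
g add a.txt b.txt
g commit -q -m "Initial commit"
cd ..
g clone -q --bare upstream fork.git
g clone -q fork.git project
cd project
g remote add upstream "$work/upstream"
g fetch -q upstream

# Two topic branches, each forked from upstream's tip.
g checkout -q --no-track -b bugfix_a upstream/master
echo a2 > a.txt
g commit -q -a -m "Fix bug a"
g checkout -q --no-track -b bugfix_b upstream/master
echo b2 > b.txt
g commit -q -a -m "Fix bug b"

# The project branch carries both fixes; master stays untouched so it can
# keep following upstream.
g checkout -q -b local master
g merge -q --no-edit bugfix_a
g merge -q --no-edit bugfix_b
g push -q origin local

# Upstream accepts bugfix_a only.
(cd "$work/upstream" && g fetch -q "$work/project" bugfix_a \
   && g merge -q --ff-only FETCH_HEAD)

# Follow upstream, then rebuild the project branch from the new tip,
# merging back only what upstream has not accepted yet.
g fetch -q upstream
g checkout -q master
g merge -q --ff-only upstream/master
g checkout -q -B local master
g merge -q --no-edit bugfix_b
g push -q origin master
g push -q --force origin local
```

The master project then records whichever commit of "local" it was tested against, just as it would for any other submodule commit.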

Conclusion

This process is not too hard, though it might seem a bit intimidating at first. In the end, it makes it much easier to be a good citizen and submit fixes upstream. When copying files into your repo, you always think, "I should submit this later." Now, you will always be set up to do this.

Tags: iOS fixme