A Treatise On The History Of Distributed Version Control

By | January 20, 2013

This is not yet another git vs Mercurial debate. I admit bias towards git, which I use whenever I have a choice. This is most of the time now that gitifyhg is awesome. However, I have been using Mercurial for my day job for about a year. I am more familiar with Mercurial and it’s extensions than many developers who prefer it. I consider myself an advanced git user (not an expert) and an intermediate Mercurial user.

I therefore have the background to claim that Mercurial and git are equally capable. Mercurial doesn’t have certain features of git that I miss, but those features are implementable with development time. Sometimes git’s interface isn’t as easy to use or teach as I would like, but aliases and projects like git extras alleviate this issue.

This article is about philosophy, not technology. Mercurial’s documentation, mailing lists, and stack overflow questions are littered with dire warnings that extensions that rewrite history are dangerous and best avoided. Git, on the other hand, takes a “consenting adults” approach to history rewriting. While it acknowledges that rewriting history can be dangerous and should be avoided in certain circumstances, it also allows the coder to choose when and how to apply this rule.

To avoid comparing the two systems, I’ll refer to the two styles as “permanent history” and “mutable history”. Both git and Mercurial are fully capable of maintaining both styles of history. However, Mercurial users tend to prefer “permanent,” while git users typically adopt a “mutable” approach.

The permanent history philosophy emphasizes that a changeset cannot be altered once it has been committed. It is important to record exactly what state the repository was in when the commit was made. If that state is not acceptable, then a new commit is made to correct it. Future readers of the history in question will see that a commit was made and that later, it was amended in another commit. Permanent history is analogous to a captain’s log or accountant’s general journal. Every action should be recorded separately.

The goal of committing in the permanent history paradigm is to record a specific state of the repository.

The mutable history philosophy, in contrast, sees changesets as individual paragraphs in a living story. It can and should be edited to ensure it tells the story as effectively and coherently as possible. Each changeset should have a topic sentence (the commit message) and supporting sentences (the patch). When a commit is initially made, the book is never assumed to be in the final draft that will go to the publisher.

The goal of committing in the mutable history paradigm is to record a related set of changes.

The different stages of code history

There are several stages that changesets go through as a program is written. These stages are perceived differently by the two styles.

  1. Working directory changes have not yet been committed.
  2. Local changesets have been committed but not yet pushed to any public repository.
  3. Public changesets have been published to a public repository and are available to other coders.

The working directory stage is treated identically by the two philosophies. If it hasn’t been committed, both styles have an “anything goes” attitude. If you screw something up and fix it, you are not expected to commit the bad code for posterity. If you leave a debugging statement in the code but catch it in a git|hg diff command, just delete it before committing. If your tests aren’t passing during the uncommitted changes stage, edit the files to make sure they do pass.

The attitudes diverge slightly when it is time to commit the changes in the working directory. Neither style requires committing all of the changes that are in the working directory. For example, if you edit a test that is not related to the feature you are about to commit, you can separate the two diverse changes into separate commits. However, this practice is more common in mutable history circles than permanent, largely because permanent history coders want to record the current state of the repository while mutable history followers are focusing on related changes.

The two histories have polar opposite beliefs about the local changeset stage. Permanent history maintains that a committed changeset should not be altered in any way. Some proponents may allow amending or rolling back the most recent commit, provided it has not been pushed publicly. They frown upon editing the “second last” or earlier commits, even if they haven’t been published.

Mutable history, on the other hand, takes the same “anything goes” approach that applies to the working directory stage. If changes have never been pushed publicly, then the mutable historian will comfortably rearrange and reorder them, move patch hunks from one changeset to another, or squash relevant commits together.

Permanent history fans may be surprised to learn that their philosophy on public commits is the same as mutable history’s. Once a commit has been pushed to the permanent public repository, both philosophies consider that it should not be changed, ever. If mutable history is likened to writing a story with coherent chapters, then public commits are like a published book. Once it has been published, the book should not be altered.

Let me reiterate: altering permanent published history is considered a Bad Thing by both philosophies.

There is a fourth stage available in the mutable style that permanent history does not allow. This “temporary public” stage lies between the local changesets and public changesets phases. At this stage, other people can see your changesets before they are moved into permanent public history. They may be rearranged and edited as if they are local history, but there must be agreement between all viewers that this section of history is still considered mutable. This is akin to sharing a draft of the book with a proofreader or copy-editor before it is published.

The source code for git itself is managed in this way, as discussed in maintain-git.txt. While it contains permanent public branches that canot be altered, it also describes a “pu” branch that is temporary public. This branch is used to share the state of upcoming changesets; other developers can provide feedback, not only on the quality of the code, but also on the quality of individual patches, commit messages, and ordering.

History Is Communication

There are several reasons to maintain code history. Some examples include:

  1. Preserve a record of past state in case you need to return to it.
  2. Compare two versions of a code base and to find the specific code that introduced a new bug.
  3. Concurrent development via patching and merging is virtually impossible without it.

However, the primary purpose of code history is to communicate. Each changeset implicitly communicates that the developer had some reason to take a snapshot of the repository at that time. It communicates exactly what the state of the repository was when the snapshot was taken, and is even able to communicate what changed between that snapshot and the previous one. The commit message describes these changes in English, preferably with a one line summary followed by a complete description of what changed and why.

While the two ideologies agree that history is extremely useful for communication, mutable and permanent history disagree as to what should be communicated.

Permanent history’s main purpose is to communicate “honestly” what happened, for all posterity to see. Each snapshot shows exactly what occurred in the repository. Two developers created different changesets in parallel and then at some specified point, they merged them. Someone forgot to delete a debugging statement and made a second commit to fix it.

Mutable history prefers to communicate “effectively”. The goal is to make local changesets as readable as possible before pushing them. Each changeset ideally contains a single related set of changes. Related changesets are further grouped together on individual branches. If this is not the case, they are modified or moved before being made public.

If you catch a problem in mutable history after committing but before pushing publicly, fix the commit. If two distinct changesets actually communicate a single idea, squash them together. If a single changeset contains two ideas, split them apart or move one to a different branch so the current one only contains cogent changes.

The permanent history crowd suggests that this rewriting of local changes before they are pushed is dishonest, or lying. However, it is easy to lie at the working directory stage in the permanent history paradigm. If you run an hg|git diff and notice that you forgot to delete a debugging statement committing, then it is perfectly acceptable to delete that line and “lie” about having forgotten it.

If they truly wanted to record what “honestly” occurred, permanent history tools would track every single change at the text editor or IDE level.

I think we all agree that this is ridiculous. In truth, permanent history shares mutable history’s desire to have clean, communicative commits. The primary difference is deciding when it’s “too late” to change them. In permanent history, once committed, you can’t change it. In mutable history, you can and should change it up until the point it is pushed to a permanent public repository.

The ability to change history before pushing allows the developer to separate the two distinct tasks of “coding” and “organizing”. Often, when coding, we encounter a separate issue that needs to be addressed, a missing feature, a bug, documentation that needs writing. In strict permanent history paradigm, your only “honest” option is to commit both features in a single changeset. However, permanent history rules are relaxed before the first commit has been made, so two other available options are:

  • shelve/stash the existing changes, write and commit the second feature, and then unshelve/stash apply
  • write the two distinct features in the same working directory and use git’s index or Mercurial’s crecord extension to commit them as separate patches.

These options are commonly used by mutable history developers, but they also have another option: Commit the two features and continue coding. Then reorganize or split the changesets into a sensible series of commits appropriate to good communication before pushing the features to a permanent public repository.

To create clean, well-ordered commits, the permanent history style demands that we think about one thing at a time and decide what the most relevant history communication path is before we start coding.

The mutable history style understands that programming doesn’t work this way. It is common and acceptable to begin work on one feature and discover a bad comment or FIXME you had forgotten and perform a psychological context switch to work on that.

One of my colleagues once informed me that “Mercurial is for people who don’t need to hide their mistakes.” This is bullshit for two reasons. First, Mercurial, like git, is perfectly capable of hiding mistakes. It’s easy to edit local unpublished changesets in Mercurial before pushing them live. There are numerous extensions — both third party and built in — that allow this kind of operation.

Second, this statement deliberately misrepresents the purpose of history rewriting. We don’t rewrite history to hide our mistakes. We do it for the benefit of future readers of our git|hg log. Reorganizing history when we write it greatly reduces the cognitive overhead for readers trying to understand what we did and why. History, like code, is meant to be read more often than it is written. Crafting it before pushing it publicly eases the amount of work for future readers of that history.

It is difficult to understand a series of cumulative changesets that keep undoing themselves or refactor large sections of code. It is better to order these changesets such that they make sense. I’m not saying that only the final product should be committed, if other changesets are able to communicate useful information. There are legitimate reasons, regardless of history ideology, to record mistakes in the permanent record of the repository. If the mistake has already been pushed publicly, the best thing to do is admit that you did not communicate as effectively as you had intended and make a new changeset to fix the problem. This is akin to providing an errata to a published book.

Another good reason is to record that some experiment you attempted was a failure. Perhaps it made the system unbearably slow or it arbitrarily deletes data. These commits can live in a branch in permanent history, forever documenting that this experiment was attempted, that it failed (so people don’t waste time trying it again), and how it failed (in case someone else wants to improve upon your design). Neither philosophy advocates the hiding of this kind of mistake. However, mutable history does expect that the failed commits be well ordered with commit messages that effectively communicate what you did and what went wrong.

History is a form of documentation. Like any documentation, it should be well-crafted and report the evolution of the system effectively. For example, one of the best early pieces of advice I received when I started using git (and inadvertantly learned the mutable history philosophy) is to “never use the word ‘and’ in a commit message.” The word “and” is a sign that you are trying to communicate two different ideas or changes in a single changeset.

There are other motivators for this piece of advice in addition to disseminating useful information. If you ever want to revert certain related changes, it is easier to do so if those changes exist in a single patch or consecutive set of patches. There is no need to extract the changes that you want to keep from combined snapshots. Both DVCS’s provide commands to trivially reverse earlier changesets, but this is only useful if the changesets contain only the idea you wish to reverse. It also eases communication when a commit message says “reverse the changes from revision X,” compared to “reverse some of the changes that do this and this as committed in revision X, but allow the lines that perform an unrelated operation alone.”

Further, if you want to apply an individual set of changes to a different branch of the project without merging an entire branch, it is a trifling matter if those changes are part of a single coherent, cohesive, consecutive set of patches.

Finally, if you don’t have write access to a project, single idea changesets make it easy for the person who is integrating your patches to see what you did and what you intended. Mutable history integrators will generally reject your patches if they do not communicate effectively. You are unlikely to ever get write access to an upstream project (if it follows the mutable history paradigm) if you do not prove that you adhere to the single concept per changeset guideline.

Admittedly, it takes more effort to alter history than to just take a snapshot when you feel the code is in a semi-acceptable state or you want to have a backup available. This effort has a huge long-term payoff. When people say that they don’t care about maintaining clean history, I get the same sense of distaste as “I don’t bother with writing tests” or “I tend to put documentation off to the last minute”. Not maintaining clean history is a sign of a lazy developer. You may be saving yourself time by just committing, but you are adding overhead to everyone who ever has to interpret your commit history, including yourself.

Code Review

Code review is an extremely useful tool for improving the state of a source code repository. It’s a simple concept: other members of the team review each changeset and make suggestions for future improvement.

Reviewers using the permanent history philosophy can improve the quality of the future codebase, but they cannot suggest improvements to patches already under review. They do not have an opportunity to improve the communication quality of those patches if they have been pushed publicly, which is normally the case if you want to share patches for review.

The mutable history style changes the point at which modifications are not allowed from “in the local directory” to “publicly pushed to the permanent repository”. Code can be pushed to a temporary public location for review purposes. Other team members can review it and comment on the quality, not only of the code, but also of the individual changesets. Once review is completed, and any suggestions integrated into the change history, the temporary repository can be safely deleted.

Thus, the code review phase becomes more than a review of the code, it is also a patch review, a history review. Code review gives other developers the chance to say, “This patch could communicate more effectively if…”.

Incremental Merging

The most dangerous moments in version control occur when two different branches of development that touch overlapping pieces of code have to be merged.Someone has to figure out what the two original sets of changes did and then figure out what the combined code has to do to accommodate both ideas. If the branches have been divergent for a long period of time, this job is nearly as difficult as rewriting both features entirely from scratch.

This is compounded by the fact that normally, the two different branches were written by different developers. While the person doing the merge may be intimately familiar with their own work, they have to become just as well-versed in the alternate branch before they can merge it safely.

Worse, all of these changes get combined into a large “merge commit” that basically includes the entire modified history of the two feature branches squished into a single gigantic diff. This is horrible for communication. If just one line of code was inappropriately merged, it becomes a nightmare to answer the question, “why did this work on the feature branch but fail after the merge”?

In the permanent history paradigm, it is common to attempt to alleviate this problem by merging frequently. This way, a smaller subset of changes can be covered in each merge. Unfortunately, this is a terrific way to introduce malicous or erronious changes into the history. People tend to assume merge commits “did the right thing” and don’t review them as closely as the patches being merged.

Moreover, the history becomes much less readable as these unnecessary merge commits clutter up the intention of both feature branches. Such commits do not serve any communication purpose other than, “the developer of this branch was afraid that divergent changes would be too hard to merge at a later, more appropriate time.”

The cognitive overhead is greatly reduced in the mutable history paradigm. Mutable history encourages rebasing over unnecessary merging. Instead of merging two branches of commits, rebasing makes one branch appear linearly after another branch. When rebased, the branch contains a series of commits that make sense in a linear order with no confusing merges in the history.

When you rebases a branch, each individual changeset is applied against the upstream branch, one at a time. Because the changesets contain small unit of changes, they are less likely to conflict, and therefore apply cleanly. When there is a conflict, it is easier to tell (from both the commit message and the code) how the code needs to be written to apply that changeset against new changes. This “one change at a time” process is much easier to apply than a single large merge commit. In addition, if you have previously rebased a branch and reordered commits for optimal communication, you will find that future merges or rebases onto other branches are even easier.

This means that, compared to a merge, you do not need to be as intimate with upstream changes from other developers. You figure out how you would have written each of your own changeset if you had been applying them directly to the upstream branch. This is not nearly so mentally fatiguing as trying to unravel two parallel sets of changes a la “ok, they did this, and I did this so I need to do this to get things back into a sane state”.

While I am avoiding a git vs Mercurial debate here, I’d like to point out that the various Mercurial utilities for rebasing and history editing are not very effective as compared to git’s tools. The rebase extesion, histedit, Mercurial queues, and pbranch all do the job, but they require more effort than in git. They don’t have git’s famous rerere functionality, they have potential to lose or obliterate history altogether, and they are neither as well integrated nor maintained as git’s tools. I do not say this to convince you to switch to git, but to point out that if you have tried these tools and found them lacking, it is not because the concept of rebasing and mutable history is a bad thing, but because the tools require further development.

Note that while mutable history users avoid unnecessary merges whose soul purpose is to reducing merge fatigue, they are not averse to merge commits that communicate useful information. So it is perfectly sensible to create a merge commit that demonstrate that a feature branch (usually containing a linear set of related changesets) has been merged into default. However, before the merge occurs, the feature branch should have its history edited in such a way that the entire branch will apply cleanly and no merge conflicts will occur that require any diff to be committed with the merge.

Have your cake and eat it, too

Distributed version control systems allow us to have multiple copies of repos in different states. If you feel strongly that an “honest” permanent history is important, perhaps it would be a good idea to keep this honest copy of history in a separate repository or a different branch. But for the sake of effective, coherent communication, maintain the main history in a mutable style.

In git, this can be done with branches. Simply do the development work on one branch (maybe give it a name like permanent/branchname to identify it as such). Push all changes to this branch as they occur. However, keep your master branch clean for most effective communication. When a feature is ready to go live, create a new branch from the commits on the permanent branch and rebase them onto master in a well-ordered manner that communicates clearly.

I’m not sure how viable this would be in Mercurial, since it’s not easy to copy commits between branches. More likely, it would be suitable to have two repositories; one that contains the permanent historical record, and one that contains the edited history.

I don’t personally believe this is necessary. The mutable history paradigm communicates everything I need it to. However, if you are unsure if you are ready to make the switch, I want to make it clear that it is possible to maintain both styles for a period while you experiment with the idea. If it turns out you don’t like the mutable history paradigm, you can always delete the offending mutable branches or repository… though, of course, this would be a mutation of history in itself.

I expect people who perform this experiment to realize that well-crafted history is worth the small amount of extra up-front effort required to maintain it.

3 thoughts on “A Treatise On The History Of Distributed Version Control

  1. Norbert Kéri

    Very nice article. I have started off working with distributed version control with mercurial, and for the current job I’m using git, and I strongly relate to everything you said. Both of them can do what the other can, but because they follow a different a mentality, some things are easier in one, and some are harder than in the other.

  2. btubbs

    I admit to being envious of the curation done on the commit history for things like the Linux kernel. That’s a good example of where an understandable history is hugely important.

    Nice job making this post about mutable vs immutable histories instead of Git vs Mercurial.

    I’ve gotten value in the past from being able to drill down exactly and “hg blame” the changeset that introduced a particular bug. Some of the examples above threaten to subvert that, because commits from several users would be lumped into one.

    But I don’t do it that often, and I think I’d probably get more value from the curated history if I had to choose.

    I do feel that the Mercurial CLI is a lot more human friendly than Git’s. There seem to be a lot fewer obscure terms, commands, and command options to learn. And though it’s a trivial difference, I really like Mercurial’s feature of letting you partially type commands if you’ve given enough characters to disambiguate what you want to do. I’ll have to see what git-extras offers.

    1. Jakub Narebski

      @btubbs: Git doesn’t allow to “partially type commands if you’ve given enough characters to disambiguate” (and I am not sure if it could be considered to be added, as adding new command could make abbreviation no longer work), but it has tab-completion of commands, options and parameters, and does support aliased.


Leave a Reply

Your email address will not be published. Required fields are marked *