Autochrome (repo here) uses a full parse to highlight and structurally diff Clojure source code. It aims to make the experience of reviewing Clojure code just as nice as writing it. It takes the form of a command-line tool which generates diffs as static HTML:
$ lein run owner repo num -o diff.html # write a diff for a github pull request $ lein run --token user:123abc owner repo num # use supplied auth token for github api $ lein run --git-dir /your/repo/ old-tree new-tree # like git diff, using specified repo $ lein run --open ... # try to open the diff in a browser
If generated from GitHub, the line numbers in Clojure diffs link back to the PR. Bold symbols link to documentation.
Structural diffing is something I always wanted for Clojure. When I saw Tristan Hume's article about tree diffing, I was inspired to give it a shot myself using the same A* pathfinding technique he described. I ended up ditching A* for plain old Dijkstra's algorithm however - more on that later. Either way, in order to frame tree diffing as a pathfinding problem, you need to extend the concepts of location, cost, and adjacency to tree diffs. Location is clearly needed to know where you are, but in addition locations need to be comparable, so you know not to bother when you already have a better path to the same place. Cost is what makes some paths preferred over others. For pathfinding on a road network, this would be the total distance traveled along the roads used. Adjacency is what states are reachable from a particular state. For roads you might say that intersections are the nodes and adjacency means there is a road connecting them. In autochrome:
This is the basic version of adjacency that I started with. However, when implemented this way, the algorithm cannot match subtrees at different levels of nesting, since the cursors always move up or down together. To handle changes in nesting, the cursors need to be allowed to move up and down independently, like they are allowed to do within lists. This means that instead of one stack of pairs of pointers, we need a pair of stacks of pointers, one per cursor. Then we need to add some state transitions:
Since there are quite a lot of branch nodes, this creates a ton of extra states for the algorithm to explore. So although it seems like the steps which move both cursors up/down would obsolete, since they can be replicated with two single-cursor movements, they are needed so that performance is not terrible on mostly identical subtrees (ie the common case). It is also helpful to make single-cursor movement cost more than two-cursor movement, so that we only try a single-cursor move after matched movement fails. The extra cost accounts for the fact that single-cursor movement corresponds to adding or removing a set of parens.
I don't know about you, but I'm not the type who can absorb the essence of a complicated algorithm from a wall of text as seen above. So let's look at a detailed log of the states we popped from our priority queue while generating the example diff at the top of this page. The states are numbered in the order in which they were processed. We will look at the goal state and each of its predecessors, starting from the initial state.
I had originally implemented the diff algorithm as A*, which was a lot better at finding diffs with fewer explored states. What made me decide to switch to plain Dijkstra's algorithm was the problem of alignment. When multiple forms in a file are changed, inserted, renamed or deleted, how do you figure out which pairs to diff?A* works great when you know both the source and the target forms, but this proved difficult in practice.
My first idea was to simply diff the entire source file with the entire target file, basically treating each file as if it had [] surrounding the entire thing. This led to a lot of weird diffs; for example when you deleted something and inserted something else in its place, the diff would show how to transform the deleted thing into the new thing, which was confusing. Top-level forms are the basic unit of clojure code, so diffs which span them are unnatural and hard to read. When the change-of-nesting support was implemented, things really got out of hand.
Something had to be done. My next idea was to basically hack it by trying to match forms by their top-level text, for example 'defn somefn' or 'defmethod foo :dval'. This has a lot of obvious problems, including docstrings, but especially renames. It worked better than I expected but the problem was still not solved.
The solution I came up with is to diff each target form in the old file against all forms in the new file. This is done by adding N start states to the priority queue, instead of only one, where N is the number of candidate target forms. Since A* really only makes sense in the context of single-source shortest paths, I decided to just switch to Dijkstra's algorithm, which can deal just fine with multiple origins. Since the diffs are processed in order of increasing cost, we know that the first complete diff we see will be the lowest-cost-possible diff of the source form with any of the target forms. So we trade away single-target diff performance, but in return we get the guaranteed optimal solution to the alignment problem.
Doing diffs this way is technically quadratic, since in the worst case it requires every source form to be diffed against every target form, but there are a couple tricks that can be used to make it more palatable. Most of the time, the majority of the forms in a file will be unchanged, so we can just hash everything first and match those right away. That means the runtime is only quadratic with respect to the number of changed forms, which is better. Second, each target form can only be matched to one source form, so we if we have to diff the first source against N targets, we only need to diff the second against N-1, and so on. Still quadratic but oh well, parsing is usually slower anyway. Finally, in each list of candidate targets we always include nil, representing the cost of deleting the entire source form. This means no states more expensive than that are considered, which kind of controls the number of states we need to explore.
There are a couple of slow cases, but for the most part I think the gains are worth the switch to Dijkstra. Probably the slowest type of change to diff is splitting a very large form into two or more smaller forms, since we will spend a huge amount of time trying to figure out which smaller form is most similar to the original large form. For example, If you split a 100-line function into two pieces and also make a bunch of changes, it might take like 30 seconds to diff. That's not great, but you'll probably spend more than 30 seconds looking at a diff like that anyway.