I've been looking at different diff tools trying to figure out how I would design my own cross-platform portable diff program with the features I use most. The most interesting thing I found out was how different the various outputs of the diff programs all are. There's also no one algorithm that creates the most intuitive output for a human to read. A certain algorithm may be better for one case but not for another. I'll share a summary of what I've researched about diff utilities so far.

The two main diff programs I think about when I consider diff programs are the diff from GNU diffutils and the BSD diff program that's available with many BSD operating systems. The BSD diff program traditionally uses a variant of the Longest Common Subsequence and attributes the algorithm they use to Harold Stone. The latest GNU diff program uses Myers' algorithm. Myers' algorithm was a breakthrough because it managed to solve the problem in O(ND) time, something that was once thought to be impossible. The original GNU diff program predated Myers' algorithm so much older versions used a different solution. I've seen some posts at the BSD site that one of their goals was to add a Myers' implementation to their diff program to improve speed. I haven't seen anyone complete that project to date. The improvement in speed of the Myers' algorithm does come at a cost of requiring more memory. Useful diff algorithms need to keep the trade off of space versus time in mind and not come up with solutions that fail if the files being compared are arbitrarily large.

Most systems use either the GNU diff or the BSD diff or are based on them. Busybox is based on the BSD diff. I did search for other implementations that might be useful for people using an alternative to the standard GNU coreutils and diffutils. I was only able to find a few. It seems diff algorithms are more easily found in version control tools such as git. Toybox has an original implementation of diff. It's in the toys pending directory so I assume it's not part of most standard Toybox installations. I did some experimenting with it to see if I could use it as a stand-alone program. I got as far as finding some bugs in the implementation of displaying diffs in Unidiff format with a specified number of context lines. The sbase project also has an implementation of diff. It's written to optimize for speed. It can use different algorithms depending on the situation. However, in using it, I found cases where it was incorrectly marking lines as different when they were the same line. I originally assumed both projects used Myers' algorithm, but on closer examination, they probably don't. I found another C based implementation of the diff algorithm in GOT (Game of Trees version control system) which does use Myers' algorithm. Was able to get it to compile and compare two files using it. Interestingly, the output from it did not match with the output from GNU diff.

The projects have a wide range of licenses from GPL to BSD 0 clause. Busybox is licensed using GPLv2.
I am not at all sure how they legally use the diff code based on BSD systems which is licensed with a Caldera 4 clause license that is incompatible with GPL licenses. Possibly the Busybox developers wrote an exception for the Busybox license so they could incorporate that code. sbase uses an MIT license. The BSD based diff is mainly using a BSD 3 clause license (or 4 clause in older versions) but includes the Caldera license in the diffreg.c file. I've searched for other versions of the BSD diff tool that use other licenses. While the licenses vary, most copies of the diffreg.c file include the Caldera license. I did run across a few BSD variants that were licensed with BSD 3 clause licenses and didn't mention the Caldera license in diffreg.c but I think this was an omission and don't think their versions of diffreg.c were actually licensed as BSD 3 clause. Finding a decent version of the diff algorithm that can be used in a library and can link with code using GNU GPL licenses or proprietary licenses is no easy task.

I still find it surprising that the various diff implementations all come up with different output and not even different versions of the GNU diff utility will necessarily have the same results. It makes it harder to verify if a diff program's output is valid or not.

I'd be very interested to hear what others think are necessary features in a diff program. Does an alternative to GNU diff work well enough for your situation? Are there features from one diff program you wish were part of another? I'd like to take the best from some of the various diff programs out there like sbase and toybox and combine them to make a diff that has all the functionality I use most. I'd need a tool that I could use in conjunction with programs such as diffh which can display differences in files side by side in HTML format. It would be nice to discuss this further with others looking for lightweight, efficient or cross-platform tool implementations for their systems. How would you design a diff program?

October 2025

S M T W T F S
   1234
567891011
121314151617 18
19202122232425
262728293031 

Syndicate

RSS Atom

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags

No cut tags
Page generated Oct. 31st, 2025 03:43 am
Powered by Dreamwidth Studios