I've been looking at different diff tools trying to figure out how I would design my own cross-platform portable diff program with the features I use most. The most interesting thing I found out was how different the various outputs of the diff programs all are. There's also no one algorithm that creates the most intuitive output for a human to read. A certain algorithm may be better for one case but not for another. I'll share a summary of what I've researched about diff utilities so far.

The two main diff programs I think about when I consider diff programs are the diff from GNU diffutils and the BSD diff program that's available with many BSD operating systems. The BSD diff program traditionally uses a variant of the Longest Common Subsequence and attributes the algorithm they use to Harold Stone. The latest GNU diff program uses Myers' algorithm. Myers' algorithm was a breakthrough because it managed to solve the problem in O(ND) time, something that was once thought to be impossible. The original GNU diff program predated Myers' algorithm so much older versions used a different solution. I've seen some posts at the BSD site that one of their goals was to add a Myers' implementation to their diff program to improve speed. I haven't seen anyone complete that project to date. The improvement in speed of the Myers' algorithm does come at a cost of requiring more memory. Useful diff algorithms need to keep the trade off of space versus time in mind and not come up with solutions that fail if the files being compared are arbitrarily large.

Most systems use either the GNU diff or the BSD diff or are based on them. Busybox is based on the BSD diff. I did search for other implementations that might be useful for people using an alternative to the standard GNU coreutils and diffutils. I was only able to find a few. It seems diff algorithms are more easily found in version control tools such as git. Toybox has an original implementation of diff. It's in the toys pending directory so I assume it's not part of most standard Toybox installations. I did some experimenting with it to see if I could use it as a stand-alone program. I got as far as finding some bugs in the implementation of displaying diffs in Unidiff format with a specified number of context lines. The sbase project also has an implementation of diff. It's written to optimize for speed. It can use different algorithms depending on the situation. However, in using it, I found cases where it was incorrectly marking lines as different when they were the same line. I originally assumed both projects used Myers' algorithm, but on closer examination, they probably don't. I found another C based implementation of the diff algorithm in GOT (Game of Trees version control system) which does use Myers' algorithm. Was able to get it to compile and compare two files using it. Interestingly, the output from it did not match with the output from GNU diff.

The projects have a wide range of licenses from GPL to BSD 0 clause. Busybox is licensed using GPLv2.
I am not at all sure how they legally use the diff code based on BSD systems which is licensed with a Caldera 4 clause license that is incompatible with GPL licenses. Possibly the Busybox developers wrote an exception for the Busybox license so they could incorporate that code. sbase uses an MIT license. The BSD based diff is mainly using a BSD 3 clause license (or 4 clause in older versions) but includes the Caldera license in the diffreg.c file. I've searched for other versions of the BSD diff tool that use other licenses. While the licenses vary, most copies of the diffreg.c file include the Caldera license. I did run across a few BSD variants that were licensed with BSD 3 clause licenses and didn't mention the Caldera license in diffreg.c but I think this was an omission and don't think their versions of diffreg.c were actually licensed as BSD 3 clause. Finding a decent version of the diff algorithm that can be used in a library and can link with code using GNU GPL licenses or proprietary licenses is no easy task.

I still find it surprising that the various diff implementations all come up with different output and not even different versions of the GNU diff utility will necessarily have the same results. It makes it harder to verify if a diff program's output is valid or not.

I'd be very interested to hear what others think are necessary features in a diff program. Does an alternative to GNU diff work well enough for your situation? Are there features from one diff program you wish were part of another? I'd like to take the best from some of the various diff programs out there like sbase and toybox and combine them to make a diff that has all the functionality I use most. I'd need a tool that I could use in conjunction with programs such as diffh which can display differences in files side by side in HTML format. It would be nice to discuss this further with others looking for lightweight, efficient or cross-platform tool implementations for their systems. How would you design a diff program?
I've been interested in the BSD versions of diff and patch for a long while now. I often use a version of patch modified from various BSD patch implementations that were based on patch version 2.0.12u8. My variation includes some support for carriage return/line feed differences among systems. I've found it useful when working with patches from Windows or DOS systems that might accidentally introduce carriage return/line feed sequences instead of just line feed which POSIX systems use.

Recently saw an informative page covering some of the history of patch and diff:
https://invisible-island.net/diffstat/
It mentions that the 12u variant of patch contains no copylefted code. It also mentions a version 2.0.12.u9. So, I thought it was interesting that the various BSD versions found on NetBSD, OpenBSD and FreeBSD seem to start with version 2.0.12u8 as their basis. I'm very curious as to why they started with u8 instead of u9 but have seen no documentation on it. There doesn't seem to be that many differences between u8 and u9 so they probably didn't use much that was useful. If anyone knows anything further on this or know of some projects that started with the u9 version, I'd appreciate hearing about it.

BSD operating systems such as NetBSD switched from the BSD 4 clause license to the 2 clause. The 2 clause license is compatible with GNU license software while the 4 clause is not. So, it's not surprising that most of the code for BSD diff can be found with a 2 clause license. However, the diffreg.c file still contains the 4 clause license and a 3 clause license. I've searched for alternative versions of diffreg.c with other licenses. There are other versions available such as the diffreg.c that comes with the Plan 9 diff. They use the same algorithm. However, they don't contain some of the updates found in the BSD versions such as support for unified diff.

The BSD version of diff uses the longest common subsequence algorithm which performs in O(n2) at worst case and typically performs as O(nlogn). There is a newer algorithm by Myers that performs in O(nD). It's used by GNU diff. I saw a request a long while ago for someone to update the BSD diff to use the Myers algorithm. I don't think anyone's ever completed the task.

While the BSD version of diff is very usable, it would be nice to have a version of diff without the 4 code clause so that it could possibly be used as library code in conjunction with software that might contain GNU licensed source code. I prefer working with a more lenient license like BSD 2 clause or MIT rather than GNU when possible. So, it would be great to find a version of diff that supported unified diff output and avoided the GNU or BSD 4 clause licenses. I think the main alternative that might offer diff with a more lenient license and unified diff support would be Toybox. The Toybox license is BSD 0. However, Toybox doesn't support as many command line options as BSD or GNU diff.

I also just found out that there are versions of diff and patch for sbase. However, they're not in the main source code repository for sbase. You can find the source by searching the dev list from suckless.org. The sbase diff appears to be fully POSIX compliant. It's missing several features that were added to BSD and GNU diff programs. It supports unified diff only. The output is very close to the output of the BSD diff utility and it also uses the longest common subsequence algorithm.

I did run across the code to convert between diff standard output and unified diff output:
https://github.com/AceHusky12/unidiff
It's listed as public domain.

If anyone else is using an unusual variant of diff or patch or is writing their own, I'd be very interested in hearing about. I would love to compare notes on some of the different BSD variants out there and the various features they support.

October 2025

S M T W T F S
   1234
567891011
121314151617 18
19202122232425
262728293031 

Syndicate

RSS Atom

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Nov. 5th, 2025 04:06 am
Powered by Dreamwidth Studios