A Generic - Reusable Diff Algorithm in C#
This is a generic algorithm for finding the difference between 2 objects, no matter what the type. The algorithm uses the ICompare interface to tell if the items are the same. Maybe this could be used for producing a WiKi site in ASP.NET and track revisions.
I just wanted to let all you readers know, I don't have a degree in Computer Science and I haven't read or learned much about algorithm design. I haven't ever really analyzed a Diff algorithm before, but I know a little about them. If there is a way of improving this, or if I am going in the completely wrong direction, please let me know.
I am not really sure of the intimate details of how other Diff algorithms work (like the UNIX Diff utility), but it does use some ideas that I got from general reading on the topic. The Diff Engine uses a bit matrix that employs a sliding window of the 2 files. The matrix looks kind of like this:
The black squares are the squares with matching items. The algorithm starts at 0,0 and looks all the way along the x-axis to find a diagonal sequence, which is 2 or more matching characters in a row. The algorithm looks for the longest sequence, using the the closer of any that are the same length. It then looks down the y-axis, and if it finds a sequence that is larger and closer than any found on the x-axis, it uses that sequence. If it finds a sequence, it performs whatever operation is necessary to get to that sequence. In this case, it wouldn't find a diagonal sequence on the 0 x-axis or the y-axis, so it performs its standard operation, which is to delete the 'B' and insert the 'N'. It then increments our x and y location, so we are now at 1,1. Again, we do not find a diagonal sequence, so we perform the standard operation again. However, at 2,2 we find that we are in a sequence.
Once we find a sequence, first we slide the window down to where we are, since we won't be needing any of the previous characters any more. Then, we continually increment x and y until they reach a non matching square. At 4,4 there is no match, so we switch states again, meaning that we slide the window down, and once again look for a diagonal. Now, on the x-axis 4, we find a diagonal at 4,8, so now must perform whatever is necessary to get there. Moving in the y direction is an insertion, so we insert 'T', 'h', 'e' and ' '. At that point, we enter enter the diagonal, save state, etc.
We follow the diagonal down to the end from there. Pretty straight forward, right?
Using the code
I wrote this code so that it would be as generic as possible, so I built it around 3 basic classes and an interface. There are actually more classes but, I will get to those in a minute.
First, to do a comparison, you will need to subclass ComparableStreamReader. This is an abstract class that I created, and it has only one method, GetNext(), which returns an object that implements an IComparable object. This is really the trick to this. In the demo, I have included a class called ComparableStringReader, which just returns characters from a string, one at a time. If you look at the code, you will notice that it is actually returning strings with one character and not char objects. What can I say - string already implements IComparable, and I am lazy. However, if you want to compare items that are not strings, you may have to create your own custom object that implements IComparable.
publicclass ComparableStringReader : Diff.ComparableStreamReader
public ComparableStringReader(string source)
_source = source;
_location = 0;
publicoverride IComparable GetNext()
if ( _location < _source.Length )
return _source.Substring(_location++, 1);
Once you have created a reader class, you just need to create an instance of the DiffEngine class. DiffEngine has two properties, and one method, Compare(). Compare takes in 2 ComparableStreamReaders, the source (original), and the destination (changed document). Compare() returns an EditScript object, which is just a collection of DiffCommands objects. Before I go into the DiffCommand objects, however, I want to mention the two properties. The property WindowSize is an integer that defaults to 256. You have to be careful with this number however. The higher it goes, the more exact and concise the Diff script will be, but it uses memory at a rate of (n2/8). That's right. So at 256, we are only using 8K of memory to store the matrix (it's actually called an Edit Graph). However, if we decide to move up to 1024, we are not using 128K. Also, the higher the window size, the longer it takes to compute, at about the same rate (I am not really sure how to use big-oh notation, only read it, so I don't give it). 255 should work fine for most cases, especially if you are comparing entire lines, one at a time.
DiffEngine de = new DiffEngine();
de.WindowSize = 10; // very small window for a small script
// create string readers
ComparableStringReader csrSource = new ComparableStringReader(txtSource.Text);
ComparableStringReader csrDest = new ComparableStringReader(txtDestination.Text);
// do the compare
EditScript es = de.Compare(csrSource, csrDest);
Now, once the Compare() method returns, it returns a script (that same script is available through the Script property). The EditScript, like I said before, is a collection of DiffCommands. Each command represents an operation that must be performed against the source (original) to produce the destination (changed document). There are three types of commands, which are members of the DiffCommandTypeenum. There's Insert, Delete, and Skip. The Skip and Delete commands utilize the Length property of DiffCommand class, and the Insert command utilities the Value property. The Length property just represents the number of times to repeat the operation. The Value property represents the object to be inserted, and it is also the exact object that was returned from ComparableStreamReader.GetNext().
In the sample, I have used the EditScript to provide an exploded view of the source and destination strings so that you can see what they had in common. I also generated a very simple Diff script that could be used to recreate the destination from the original.
B i t Matrix
N ot The Matrix
In this example, the Diff script is using three operators - '*' for Skip followed by the number of characters to skip, '-' for Delete followed by the number of characters to delete, and '+' for Insert, followed by the character to insert.
Eventually, I will write a method to merge a source and an EditScript to recreate the original, but first, I want to do some better testing just to make sure that this algorithm is actually viable. If you use it successfully, please let me know. Also, if you have any recommendations for improving the algorithm, also let me know.
Points of Interest
There is a reusable BitMatrix class that this algorithm employs to cut down on memory use. If you want, you can change it to use just bool values. This may or not speed up the algorithm, since you would be using 8 times more memory (unless the C# compiler does lots of optimizing!).
- 4/21/2004 - Version 0.9 - released.
I am the Sr. Applications Designer for the U.S. Olympic Committee. I have been writing software for over ten years, starting with QBasic when I was 10. I do not have a Computer Science degree, but I am very interested in advanced algorithms, so I learn as much as possible (although I must admit, not knowing calculus does make it difficult at times). I am always trying to find the next really difficult problem to solve programatically.