An O(ND) Difference Algorithm for C#
How it works (briefly)
You can find an online working version of the download files here.
- Comparing the characters of two huge text files is not easy to implement and tends to be slow. Comparing numbers is much easier, so the first step is to compute unique numbers for all the text lines. If the text lines are identical then identical numbers are computed.
- There are some options that need to be considered before computing these numbers, which are normally useful for some kind of text: stripping off space characters and comparing case insensitive.
- The core algorithm itself will compare two arrays of numbers, and the preparation is done in the private DiffCodes method and by using a Hashtable.
- The methods are DiffText and DiffInts.
- The core of the algorithm is built using two methods:
- LCS: This is the divide-and-conquer implementation of the longest-common-subsequence algorithm.
- SMS: This method finds the Shortest Middle Snake.
- To get some usable performance, I did some changes to the original algorithm. The original algorithm was described using a recursive approach, and compares zero indexed sequences, and passes parts of these sequences as parameters. Extracting sub-arrays and rejoining them is very CPU and memory intensive. To avoid copying these sequences as arrays around, the used arrays together with the lower and upper bounds are passed while the sequences are not copied around all the time. This circumstance makes the LCS and SMS functions more complicate.
- I added some code to the LCS function to get a fast response on sub-arrays that are identical, completely deleted, or inserted instead of recursively analyzing them down to the single number case.
- The result is stored in two arrays that flag for modified (deleted or inserted) lines in the two data arrays. These bits are then analyzed to produce an array of objects that describe the found differences.
- Read the original article if you want to understand more.
To use this functionality, you only have to call one of the DiffText methods. They all get a pair of strings, and return an array of items that describe the inserts and deletes between the two strings. There are no "changes" reported. Instead, you can find an "insert" and "deleted" pair.
DiffText(string TextA, string TextB)
Finds the difference between two texts, comparing by text lines without any conversion. An array of items containing the differences is returned.
DiffText(string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase)
Finds the difference between two texts, comparing by text lines with some optional conversions. An array of items containing the differences is returned.
Diff(int ArrayA, int ArrayB)
Finds the difference between two arrays of integers. An array of items containing the differences is returned.
A sample application for the algorithm
The sample is a small web application that calculates the difference between two text files and displays the result using HTML.
To setup the sample, you have to create a web-project and copy all the files found in the zip file into the directory. The implementation of the algorithm is given inside the "diff.cs" class.
Further possible optimizations (if you really need speed)
(First rule: don't do it; second: don't do it yet.)
The arrays DataA and DataB are passed as parameters, but are never changed after the creation, so they can be used as members of the class to avoid parameter overhead. In the SMS, there is a lot of boundary arithmetic in the for-D and for-k loops that can be done by incrementing and decrementing the local variables. The DownVector and UpVector arrays are always created and destroyed each time the SMS gets called. It is possible to reuse them when transferring them to members of the class.
See TODO: hints.
Security issues with the web application
- Using this web-site implementation enables clients to view all the source code of your website. Just be sure that you do not use it as-it-is on a public server.
- You should implement a check of the parameters, and allow only a diff output on the files that can be displayed (text based files).
The class now has a built-in self-test, that works without changing the code. If you compile the diff and debug class files together, you get an EXE file that tests some simple diff-scenarios that were used to discover the bugs in versions 1 and 2 (OutOfArrayBounds typically).
The compile command is:
csc /target:exe /out:diffTest.exe /debug /d:SELFTEST Diff.cs Debug.cs
Thanks for your feedback and the two reported cases that did not diff correctly. (It was always my fault, not a problem of the algorithm).
This work was first published here.
- There was a "hang" in some situations.
- Now, I understand a little bit more of the SMS algorithm.
- There have been overlapping boxes; those where analyzed differently. One return-point is enough.
- An assertion was added in CreateDiffs when in debug-mode, that counts the number of equal (not modified) lines in both arrays. They must be identical.
- Out of bounds error in the Up/Down vector arrays in some situations.
- The two vectors are now accessed using different offsets that are adjusted using the start k-Line.
- A test case is added.
- Another test that threw an exception was found, but that seems to be fixed by the 2002.09.20 work.
- Refactored the API to static methods on the diff class to make the usage simpler.
- The self test is now using the standard debug class.