Lựa chọn

 Danh mục

 Chi tiết tài nguyên
Ternary Search Tree Dictionary in C#: Faster String Dictionary!
   Đăng bởi: host | Ngày : 01:43 05/12/09 | Xem: 1343 |        
Ternary Search Tree Dictionary in C#: Faster String Dictionary!
 
Please comment your votes!
Introduction
This article presents a ternary search tree dictionary (a symbol table) that is generally as fast as the Hashtable but also features new search possibilities. This new symbol table can be used in replacement of the other .Net dictionaries when the keys are strings. The main class behaves almost like any IDictionary table, to the difference that it accepts string key only.
This data structure supports also interesting search methods such as partial match and near-neighbor match. For instance, the near-neighbor search can be used to provide similar words while implementing an online thesaurus.
The ternary search tree dictionary implementation is based on the article of Jon L. Bentley and Robert Sedgewick, Fast Algorithms for Sorting and Searching Strings, cfr. [1].
The class comes with a full set of NUnit test, benchmarks with Hashtable and some utility to visualize the tree.
What is a ternary search tree ?
The ternary search tree (TST) is a dictionary data structure specially crafted for the situation where the key are string. It can find, remove and add keys quickly. It can also easily search the tree for partial matches -- auto matches autograph, automatic, and so on. In addition, you can implement near-match functions, which gives you the ability to suggest alternatives for misspelled words -- e.g., impliment matches implement.
A TST stores key-value pairs, where keys are strings and values are objects. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, all of this magic is performed lightning fast (see benchmark). The tremendous flexibility of TSTs provides ample opportunity for creative programming.
The picture in the article front shows how the ternary tree is layout for 12 two letters words: as at be by he in is it of on or to.
How to use it ?
The TstDictionary implements the dictionary. It can be used as any other IDictionary based class, the only restriction being that key must be strings.
TstDictionary tst = new TstDictionary();
Add entries
tst.Add("hello", null);
// attach a value to the key
Object value = ...;
tst.Add("another key",value);
Remove entries
tst.Remove("hello");
Clear dictionary
tst.Clear();
Iterate over entries
foreach(DictionaryEntry de in tst)
{
   // de.Key is the key,
   // de.Value is the value
}
Access elements
tst["hello"]=null;
Object value = tst["hello"].Value;
Partial Match search
Partial match search with wild char character. Searching the dictionary for the pattern *o*o*o matches the single word rococo, while the pattern *a*a*a matches many words, including banana, casaba, and pajama:
foreach(DictionaryEntry de in tst.PartialMatch("*a*a"))
{
   // de.Key matches the pattern
}
Near-Neighbor search
This methods finds all words in the dictionary that are within a given Hamming distance of a query word. For instance, a search for all words within distance two of soda finds code, coma and 117 other words.
foreach(DictionaryEntry de in tst.NearNeighbors("a",1))
{
   // de.Key matches the pattern as, at, ma, ba, etc...
}
Why not implementing IDictionary ?
As mentionned in the introduction, the TstDictionary does not implement the IDictionary interface, although it implements all the methods needed. The rationale for this is to avoid confusing the user by letting him use non-string keys.
Threading
The TstDictionary class provides a Synchronized method to obtain a synchronized (thread-safe) wrapper of the dictionary. This wrapper behaves similarly to other .NET synchronized wrappers.
Benchmarks
Projects
The Tst project is composed of the following projects:
  • Tst, a class library that contains the TST implementation,
  • Tst.Utility, a set of helper classes to visualize the TST modifications,
  • Tst.Tests, NUnit tests
  • Tst.Benchmark, benchmarks with Hashtable,
  • Tst.Console, demo application.
Todo list
  • The enumerator should iterate the keys in alphabetical order, which it does not... yet.
History
  • 11-01-2004, initial release.
References
About Jonathan de Halleux

Jonathan de Halleux is Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. He is now employed at Microsoft in the CLR team (testing the JIT).

 
Bình luận

Thêm nhận xét