Code Analysis Tools - A collection of IL code analysis tools
This tool depends on a DLL that ships with FxCop. You will have to install FxCop (or just the Microsoft.Cci.dll), and you'll most likely have to reset the assembly reference in this project for Microsoft.Cci.dll based on where you installed it.
So the neat thing about .NET (as everyone knows or should know by now) is that no matter what language you are working in, it all gets compiled down to IL. Nothing new there, right? Correct. So, if I'm going to write a code analysis tool, would it make sense to write one that parses C#? Nope, because then I'd have to write one that parses VB.NET too... or the new Python.Net. That whole route would just be kind of stupid because then I'd have to maintain a parser for each language that I want to support. No fun!
But if I accessed and used the compiled IL for my code analysis, then my tool can work on any .NET application, regardless of the language it is written in. OK, so how do I parse the actual IL of a compiled assembly? There are actually several different ways.
You could do it manually. There is an article here written by Sorin Serban (Parsing the IL of a Method Body) that does just this. But it gets quite complex. You could use the API that comes with Reflector to get at the IL. Microsoft Research has a project called Phoenix that also will give you an API for digging through the IL. There is also the Mono project's Cecil which will load an assembly and provide an object graph that represents the assembly contents. I've used this quite a bit, but it sometimes doesn’t play well with Visual Studio compiled DLLs, and is always in a state of change (a good and a bad thing).
But the API I like the best is called the Common Compiler Infrastructure, and ships with FxCop (Microsoft.Cci.dll). It provides an object hierarchy / graph of the contents of the assembly, very much like Cecil does, but I've never had it throw an error with any assembly that I've ever loaded.
To start playing with CCI, create a project, and add Microsoft.Cci.dll to your references (it's located wherever you have installed FxCop), and add a using statement for Microsoft.Cci. The root class that you'll probably use is the AssemblyNode class, and its static GetAssembly method. This method takes a path to an assembly, and returns an AssemblyNode instance. Among many of the properties exposed by AssemblyNode is one called Types. This is a TypeNodeList which contains every Type within the assembly.
AssemblyNode assembly = AssemblyNode.GetAssembly(someAssemblyPath);
foreach (TypeNode type in assembly.Types)
Now, a TypeNode has a Members property which is a MemberList. This list holds every member of that type; fields, events, methods, properties, sub types, and namespace. A Method class is a sub-class of Member, so as you spin through the type's Members, look for one that is of type Method.
foreach (Member member in type.Members)
Method method = member as Method;
if (method != null)
Now that you have a method, you can look at its parameters, attributes, and… its IL instruction list. This is where it gets cool! Get it? You can programmatically spin through an assembly's IL code and look for specific scenarios. This is pretty much exactly what FxCop does. If you've ever written a custom FxCop rule, then you’ve most likely worked with the Method object's IL instruction list.
InstructionList instList = method.Instructions;
for (int index = 0; index < instList.Length; index++)
Each Instruction in the InstructionList has an OpCode property, and a Value property (among many other properties). I'm not going to get into an IL tutorial mode, but the OpCode property is the instruction. It tells the computer what operation to do. The Value property is what the OpCode is going to operate on. For example, if the program is going to load a field onto the stack, the OpCode would be ldfld, and the Value would be a Field instance. Using all this, it makes it fairly easy to figure out what types, methods, and fields are being used by your IL code.
"Not Used" Analysis
Several times throughout my career, I have worked on large software projects that contain dozens of projects in a solution, thousands of classes, and tens of thousands of methods. It is always a chore to keep all this code straight, and eventually after a few years of updates and code changes, some classes and methods become deprecated...unused. But they are still left in the code because no one knows about them.
This is generally considered bad because it clutters your code up, exposes more than you intend to, takes longer to compile, and takes up more memory when the DLL is loaded.
Recently, I had to work in an area of code that I wasn’t familiar with. As I was going through the code, I would look to see how and where each class was being used. I started to become frustrated when I found several classes that were no longer used, but were still hanging out. This is one of my pet peeves.
So I got on a role and started searching for more unused code. I found fields, methods, and classes that were no longer used. Yeah! There is nothing more fun than deleting unused code! But the annoying thing about trying to find unused code is that the only way to check that it is truly not used anywhere in your solution is to compile (unless you are using reflection, at which point you can only do a string search), which, if you have 32 projects, can take a while.
So in order to speed up this process, I set out to write a tool that walks through all my code, looking for where each class, method, and field is referenced or used.
"Not Used" Architecture
The architecture for this tool is fairly simple. There are basically two main classes; a cataloger class and an analyzer class. I created an AssemblyCataloger class to rip through a list of assemblies and catalog every type, field, and method (constructors and properties are included in this) defined in the assembly. I then have a UsageAnalyzer class which accepts the AssemblyCataloger instance and a list of assemblies to check the catalog against.
What the analyzer does is spin through each assembly in the list and look for any type, field, or method that is used or called in any way. When I find a type, method, or field that exists in the catalog, I mark that type, method, or field in the catalog as being called. At the end of the analysis, I then spin through the catalog and look for the types, methods, and fields in the catalog that don't have any callers. These are then considered unused.
I originally used the Visitor model that comes with the Microsoft.Cci.dll in order to "visit" every type, field, and method in order to catalog them. The way you do this is create a class that inherits from Microsoft.Cci.StandardVisitor. You then override various VisitXXX methods like VisitField, VisitTypeNode, VisitMethod, etc. When you tell your visitor class to run against an assembly, it would spin thorough the assembly and call each overridden visitor when it finds a member that corresponds with the overridden visitor. For example, when it finds a field, it would call VisitField, and your overridden method would get called. So, when the overridden VisitXXX method is called, I would catalog that item.
This worked great, was really simple, and cut down on the amount of code I thought I'd have to write in order to traverse the entire assembly to get every field, method, and type. The only problem was that it took a long time to catalog an entire assembly, much less 30+ assemblies.
I then wrote another cataloger that manually traversed the assembly, and found that I could catalog an assembly in 30% of the time that the StandardVisitor class took. So obviously, that’s the model I ended up using.
"Not Used" Rules
The analyzer's structure is very similar to the cataloger, in that it spins through an assembly in the same manner. Basically, it loads an assembly that you want to check the catalog against. It looks at every type, method, field, event, attribute, and parameter too see if they use or reference anything in the catalog. This means getting down to the IL level. Yeah!
The following three sections are a list of the rules I came up with to determine what types, fields, and methods are used by an assembly. It actually turned out to be more complex than I thought it would. There are two categories of rules. "Don’t Catalog" rules: those rules that tell you not to even catalog a specific type, method, or field. The other type of rules are called "Mark as Used" rules: those rules that tell you that a type, field, or method in the catalog is used.
Don't Catalog Rules for Types
- Don’t catalog any type that starts with "<": If you open any assembly up in Reflector, the very first namespace you'll find is called; and it will always contain a type called <Module>, and sometimes another type called <PrivateImplementationDetails>. These are internally used types that get auto generated. For this reason, I filter out (don’t catalog) any type that starts with the character "<". This works well because the CLR doesn’t allow you to manually define types with this character anyway.
Mark as Used Rules for Types
- Look for an interface a type inherits from in the catalog and remove.
- Look for a type's base type in the catalog and remove.
- Look for any attribute that decorates a type in the catalog and remove.
- Look for a field's type in the catalog and remove.
- Look for any attributes that decorate a field in the catalog and remove.
- Look for an event's handler type (delegate) in the catalog and remove.
- Look for any attribute that decorates an event in the catalog and remove.
- Look for a property's return type in the catalog and remove.
- Look for any attributes that decorate a property in the catalog and remove.
- Look for any attribute that decorates a method in the catalog and remove.
- Look for the type of a method's arguments in the catalog and remove.
- Look for any attributes that decorate a method's arguments in the catalog and remove.
- Look for the return type of a method in the catalog and remove.
- Look for any attribute that decorates a method's return type in the catalog and remove.
- The first opcode in any method is a list of local variables that are needed by the method. Look at each local variable's type in the catalog and remove it.
- If the opcode is one of the following (OpCode.Castclass, OpCode._Catch, OpCode.Newarr, OpCode.Box, OpCode.Initobj, OpCode.Isinst, OpCode.Unbox, OpCode.Unbox_Any, OpCode.Ldtoken), look for the opcode's value in the type catalog and remove.
- If the opcode is one of the following (OpCode.Ldfld, OpCode.Ldflda, OpCode.Ldsfld, OpCode.Ldsflda, OpCode.Stfld, OpCode.Ldftn, OpCode.Stsfld, OpCode.Ldtoken) and the opcode value is a Field, then look for the field's type in the catalog and remove.
- If the opcode is one of the following (OpCode.Call, OpCode.Callvirt, OpCode.Calli, OpCode.Newobj, OpCode.Ldvirtftn, OpCode.Ldftn, OpCode.Tail_) and the opcode value is a Method, then look for the method's declaring type in the catalog and remove.
Don't Catalog Rules for Methods
- Don’t catalog static constructors: if a type has a static field, then the compiler will generate a static constructor to initialize the field. Also, static constructors are explicitly called anyway, they are called the first time any member of a type is called.
- Don’t catalog default type constructors (constructors with no parameters): every type must have one of these, so don’t catalog them.
- Don’t catalog instance constructors that have the same parameter list that the type's base type constructor has: If type B inherits from type A, and A has a constructor that takes one string, and B wants to expose a constructor that allows the caller to pass a string to the base, it needs to expose this constructor overload as well. While this is not explicitly needed if it's never actually called, it's considered "good form" to provide it if needed.
- Don’t catalog Finalize methods: type finalizers are not explicitly called, they are called by the GC, so don’t catalog them.
Mark as Used Rules for Methods
- If a method is not in a catalog, but the type that it belongs to implements an interface that defines this method, then keep the method in the catalog since it's needed by the interface.
- If the opcode is one of the following (OpCode.Call, OpCode.Callvirt, OpCode.Calli, OpCode.Newobj, OpCode.Ldvirtftn, OpCode.Ldftn, OpCode.Tail_) and the opcode value is a Method, then look for the method in the method catalog and remove.
- If the method found in the previous rule was found and removed from the catalog, and is also abstract or virtual, then walk up and down methods declaring type inheritance hierarchy, and see if one of those types declares a method with the same signature. When found, remove that from the method catalog as well.
Don't Catalog Rules for Fields
- Don’t catalog fields that are constants: this is because the value of the constant field just gets hard-coded into the IL at compile time. There is no way to know if it ever gets used or not.
- Don’t catalog fields that have the name "value__": I don’t know why the compiler puts these in there, but every enumeration defined has a private field defined by the compiler that is called "value__".
Mark as Used Rules for Fields
- If the opcode is one of the following (OpCode.Ldfld, OpCode.Ldflda, OpCode.Ldsfld, OpCode.Ldsflda, OpCode.Stfld, OpCode.Ldftn, OpCode.Stsfld, OpCode.Ldtoken) and the opcode value is a Field, then look for the field in the field catalog and remove.
Known False Positives
Methods, types, and fields invoked via reflection: since reflection invocations often are initiated with the string name of the method or field, this type of analysis would not catch any type, method, or field that is called via reflection.
There are also reference chains that could give you a false positive, which I currently don't analyze for. For example, take the following reference chain: class A is only used by class B, and class B is only used by method x() of class C, and method x is not called anywhere. Method C.x would be marked as unused, but classes A and B would not be marked as unused. Now, once you have removed method C.x, then class B would be marked as unused. And then once B is removed, then class A would be marked as unused. But this is a lot of steps to manually perform.
Eventually, I'd like to fix both of these deficiencies.
One of the analysis suggestions that I received recently was to check to see if a type, method, or field's scope is more open or visible than it needs to be. For example, if a method is marked as public, but no other assembly calls it, it could be marked as internal. Along those same lines, if that method is called only by other members defined in the same class, it could be marked as private.
This has several benefits such as reducing the 'API surface', reducing documentation and testing requirements, and reducing possible security vulnerabilities. Not to mention it's just good practice to limit the scope of your members to the lowest required visibility.
Well, this style of analysis is very easy to do considering all the data I gather when I catalog and analyze the assemblies during the Not Used analysis, so I added it in.
Note: so far, I only looked for public, private, and internal scopes. I haven't analyzed for protected and protectedinternal (yet).
Not Used and Visibility Analysis Usage and Results
To begin with either analysis (or both), click the "Pick files to get checked..." link. This will popup a file selection dialog. Pick the assemblies that you want to catalog.
Then, click the "Pick files to check against..." link, and again select one or more assemblies. These are the assemblies that you will use to see if the cataloged items are ever used. Alternately, you can click the ">>" button, which will bring over all the assemblies in your "to catalog" list over to your "to analyze" list.
Next, select the analysis options. You can analyze for any combination of classes, methods, fields, and visibility (in order to analyze for visibility, you have to check at least one of the three from the classes, methods, and fields).
Then, click the "Run Analysis" button to actually run the analysis. Depending on how many assemblies you are cataloging and analyzing, this could take up to a minute. My biggest analysis was with 42 DLLs and EXEs, and it took about 50 seconds (though most of that time was spent populating the ListViews to show the results).
On the lower half of the UI, there are four tabs where the results of the analysis are shown. Each tab shows how many members were cataloged and how many were found to be not used. The ListView below that is a member by member breakdown of each member not used. If you double click on any row, you will get the details about that member. This is useful so you can hunt down the offending member and terminate its existence!
I ran my analysis tool against FxCop, cataloging two of FxCop's internal DLLs, and checking them against all seven of its assemblies; I got the following results:
- Out of 110 classes cataloged, 2 were not used.
- Out of 916 methods cataloged, 173 were not used.
- Out of 244 fields cataloged, 10 were not used.
- Out of 1270 total members analyzed, 71 were shown to have too much visibility.
I also ran the tool against Reflector, which only has one assembly. These are the results I got:
- Out of 363 classes cataloged, 15 were not used.
- Out of 2500 methods cataloged, 1313 were not used.
- Out of 454 fields cataloged, 0 were not used (very good!).
- Out of 3317 total members analyzed, 983 were shown to have too much visibility (not so good; but Reflector does expose an add-in API, so that would explain why there are so many public members not being used).
What Does All This Mean?
So if no one calls your method, does that mean you should delete it? It depends. If you are defining a public API for a product and it’s a useful method for your customer, then no, don’t delete. If it’s a private method or field, then yes, it's probably left over from legacy code and not used anymore, so blow it away. Basically, this tool just sniffs out a specific type of smell, but it's up to you to refactor it correctly.
Duplicate Code Analysis
One code smell that is really stinky is duplicate code. But it can be hard to find when, like I’ve said before, you have thousands of classes in your project. When you need a piece of functionality that already exists in one class, to be in another class, it’s just too easy to copy and paste the code. We all know this is bad, and we have all done it at one point or another.
But this again is just added clutter and memory overhead, not to mention a breeding ground for possible bugs. What happens if a function, over the years, is copied to five different classes, and then it's determined that, that function has a bug in it. Well, hopefully, you remember all the other places you copied the code to, so you can go fix it there too. But what about all the places Joey or Stu copied it to that you don’t know about?
That is the motivation behind this tool. Basically it scans your assemblies for methods that are either an exact duplicate or a pretty close near duplicate of other methods. It then (as shown above) reports back all the different clusters of duplicate functions that it found. Then it’s up to you to clean up the code.
Duplicate Code Analysis
In order to determine if one function is a duplicate of another, I spin through each function, and build up a string that consists of each OpCode / OpCode-operand pair in the function appended together. The OpCode operand is the thing that the OpCode is operating upon. For instance, the call OpCode could be operating upon the method System.Console.WriteLine(string), or the ldfld opcode operating on a class field.
Once I have this string built up for a function, I then run it through an MD5 hash to get a 16 byte hash of the function. I check to see if this hash already exists in a dictionary of hashes. If it exists, I’ve found a duplicate and report it back to the UI. If I didn’t, I add it to the dictionary to be checked against other functions. The reason I hash the IL/ IL operands is that the strings get fairly large and if I used the string as my key in the dictionary, I could easily run out of memory if my application was large enough (like I did with when analyzing the entire .NET framework).
There are a few OpCodes that I don’t worry about when building the string of OpCodes: these are the nop OpCode and the _Locals OpCode. The nop OpCode is used in debug builds for syncing break points up with the code in Visual Studio. The _Locals OpCode is the first OpCode of every function that holds information about all the variables that the function declares.
Lots of Duplicates! More filtering
When I first ran this tool against my code, I found a disturbing number of duplicate methods… hundreds of duplicate methods. I mean, I knew I had copied a few methods here and there, once in a while. But not that many! In looking closer at the results, I found two culprits: basic getter / setter properties, and default constructors. Basic getter /setter properties are those properties that do nothing else but return a field, or set a field. Hmmm… this is OK right? Yup. I’m not going worry about those. Default constructors are those constructors that do nothing but just call the constructor on its base class, and are usually inserted into your assemblies by the C# compiler. Again, just noise that can be ignored.
Near Duplicate Code Analysis
This worked pretty good, but I thought to my self… "Self, what about methods that do the same processing steps, but against different types and methods?" These could be indicators of methods that might be candidates for refactoring down to a base class, or a helper method.
For example, what if you had a function that did a series of steps against a FileStream, and another function that did the exact same steps against a MemoryStream. The find duplicates algorithm would not find this because it uses the OpCode operand when calculating the hash-code; namely the methods of FileStream and MemoryStream. But what if you only included the OpCodes in your hash, and not the operands? You might actually find more candidates for refactoring!
Turns out that I was correct. And the existing algorithm made it simple to add this option. Below are two screenshots. The first one is the duplicate code check against the FxCop binaries, and the second one is a near duplicate code check against the FxCop binaries. Notice that it found more candidates for refactoring with the near duplicate analysis.
Another type of duplicate code analysis that I’ve been working on (but not ready for public eyes yet) is one that only looks at smaller groups of IL OpCodes vs. the entire method. This would allow you to find duplicate blocks of code buried in a function. For instance, say you have a method that does a bunch of stuff. Then it opens a file stream, and reads the data out into a byte array or a string. Then the method goes on to do more stuff. A month later you write a new method. This one does some stuff that needs to open and read a file, and then do more stuff.
These two methods might do drastically different things, but have a duplicate block of code in them; namely the code that opens and reads a file. This sounds like another candidate for refactoring! This type of analysis could also look for duplicate blocks of code within the same function as well. Ever seen one of those 500 line functions written by an old school ASP developer? Where there are code blocks after code blocks just copied and pasted over and over again? This type of analysis should be able to find those as well.
I am always looking for new ideas for code analysis tools, and writing them is a passion of mine. So if you can think of other rules that I forgot to check in any of these existing tools, or any other type of code analysis, please leave a comment, or send me an email, and I'll try to work it into the tools.
I have been a professional developer since 1996. My experience comes from many different industries; Data Mining Software, Consulting, E-Commerce, Wholesale Operations, Clinical Software, Insurance, Energy.
I started programming in the military, trying to find better ways to analyze database data, eventually automating my entire job. Later, in college, I automated my way out of another job. This gave me the great idea to switch majors to the only thing that seemed natural…Programming! I am a devout fan of new technology, and all things Microsoft, and in my spare time I enjoy, what else? Programming. God, I'm a geek.