Archive for the ‘C#’ Category

January 12th, 2010

Determining A Significant Change In A Web Page

The problem with many web pages today is that they include useless, hidden pieces of constantly changing information that serve no purpose to the general web, or to anything else besides debugging.  A pretty good example is one that I found on my own blog, and removed, while I was experimenting with the code in this post:

<!-- 12 queries. 0.274 seconds. -->

This really serves no purpose to anybody but the few debuggers who might look at the code for performance reasons every so often.

You may be saying: what’s the big deal? It is a comment, it is not seen by the user, so who cares? Well, you are right, nobody does care, and this post isn’t going to change that fact. This type of debugging flair is going to continue for the entire lifespan of the web.

But it does cause problems for any service that wants to figure out whether your web page has changed over time, such as search engines, proxy caches, and the service I am working on that inspired this post.

The simplest method for determining a change

The simplest method for determining a change is to use a hashing algorithm, like MD5, which will give you a 128-bit (16-byte) hash of whatever length of text you put into it.

MD5("The quick brown fox jumps over the lazy dog")
    = 9e107d9d372bb6826bd81d3542a419d6

Even a small change results in a completely different hash value.

MD5("The quick brown fox jumps over the lazy dog.")
    = e4d909c290d0fb1ca068ffaddf22cbd0

So as you can see, just adding a period to the end of the string resulted in a completely different hash.  This important and desirable property of most hashing algorithms is called the avalanche effect, and it allows you to detect even the smallest of changes.

Because of the avalanche effect, you can just store the hash of your old value and compare it to the hash of your new value to determine whether a change has been made.  And for reasons I won’t get into in this post, this is desirable because you can use it to compare inputs of any size: 1 B, 1 KB, 1 MB, 1 GB, 1 TB, and so on.  In each case it will only produce a simple-to-store 16-byte output.
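The store-and-compare idea above can be sketched in a few lines of C#. This is only a minimal illustration, not the code my service runs: the class and method names (PageHash, Md5Hex, HasChanged) are made up for this example, and it assumes the page text is encoded as UTF-8 before hashing.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class PageHash
{
    // Returns the MD5 hash of the text as a lowercase hex string.
    // Assumption: the text is hashed as UTF-8 bytes.
    public static string Md5Hex(string text)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // True when the current text no longer matches the previously stored hash.
    public static bool HasChanged(string storedHash, string currentText)
    {
        return !string.Equals(storedHash, Md5Hex(currentText), StringComparison.OrdinalIgnoreCase);
    }
}
```

Only the 16-byte hash (32 hex characters here) has to be stored between requests, no matter how large the page is.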

The Problem

Many of you already know where I am going with this.  Many web pages contain this constantly changing debugging flair, which will produce a different hash with every request.

How can you determine if a change has been significant enough to warrant the CPU cycles to reprocess the web page?

Well, that was the problem I was stuck trying to figure out.  I had realized that a simple hash comparison wasn’t enough, and that I was spending too much time storing and processing pages that really didn’t change.  So I came up with a couple of simple rules for determining whether a change was significant enough to warrant reprocessing.

  1. Any change of 5 characters or fewer wouldn’t be counted.
  2. The whole web page had to change by more than 1% for the change to be counted as significant.

After I had these simple rules, I decided to turn to an O(ND) difference algorithm like the one used on StackOverflow.  After a little research I determined that the StackOverflow team was probably using this difference algorithm, which was created in C# and has been around and maintained since 2002.

Using the difference algorithm I created the following method to check for significant changes.

private bool IsChangeSignificant(string text1, string text2)
{
	var differences = Difference.DiffText(text1, text2, true, true, true);
	int text2SignificantChangeLength = 0;
	int text2Length = text2.Length;

	foreach (var diff in differences)
	{
		if (diff.insertedB > 5)
			text2SignificantChangeLength += diff.insertedB;
	}

	// if more than 1% of the document has changed of changes larger than 5 characters,
	// then it is considered significant
	// TODO: probably need a more robust solution in the future
	return ((double)text2SignificantChangeLength / (double)text2Length) > 0.01;
}

As my TODO indicates, I probably need to find a more robust solution in the future; however, this simple routine has so far given me the results I was looking for.  For the most part, all the debugging flair has been marginalized when comparing two web pages.

However there are some downsides I can anticipate in the future. 

  1. A 1% change for a 1 KB web page is only about 10 characters, which is probably enough to weed out the debugging flair; however, a 1% change for a 1 MB web page is about 10,000 characters, and this is well beyond the debugging flair changes.
  2. The CPU overhead of comparing text this way is more significant than comparing two hashes.  In my case it made sense, for various reasons, to incur this overhead; it might not for you.
  3. Only counting changes longer than 5 characters is arbitrary, because significance depends on where those characters occur within the web page.  If the change was in the main body, it is probably more significant than if it occurred in the footer.
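One possible way to soften the first downside is to cap the 1% rule with an absolute character limit, so that very large pages do not need thousands of changed characters before they count. The sketch below is only an idea, not something I have in production: the ChangeThreshold class and the cap of 200 characters are hypothetical, and the run lengths are assumed to come from the insertedB values of the diff.

```csharp
using System;
using System.Collections.Generic;

static class ChangeThreshold
{
    // Hypothetical tuning values; only the first two come from the rules in this post.
    const int MinRunLength = 5;        // runs of 5 characters or fewer are ignored
    const double MaxFraction = 0.01;   // the 1% rule
    const int MaxAbsoluteChars = 200;  // assumed cap for large pages

    public static bool IsSignificant(IEnumerable<int> insertedRunLengths, int documentLength)
    {
        if (documentLength <= 0)
            return false;

        // Add up only the runs that are long enough to matter.
        int significant = 0;
        foreach (int run in insertedRunLengths)
        {
            if (run > MinRunLength)
                significant += run;
        }

        // Use whichever limit is smaller: 1% of the page or the absolute cap.
        double threshold = Math.Min(documentLength * MaxFraction, MaxAbsoluteChars);
        return significant > threshold;
    }
}
```

With these values a 1 MB page is flagged after roughly 200 changed characters instead of 10,000, while small pages behave exactly as they do under the 1% rule.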

Posted in C#, How To

December 10th, 2009

ASP.NET MVC – 0 to 30 in 30 minutes

I recorded the following presentation for DevReady.NET, a new project that I am working on with a bunch of talented MVPs and Microsoft employees.

After I get done with editing the presentation, I will post it up to DevReady.NET and my blog.  This will be my first try at making a video presentation for the web, so I look forward to your comments. 

Posted in ASP.NET, C#, Portfolio

December 3rd, 2009

Sometimes you just need to CodingHorror it!

The title of this post is a tongue-in-cheek reference to SubSonic’s inline query class of the same name, which in turn is a reference to Jeff Atwood’s blog, Coding Horror, which you should all know.  Rob Conery, the SubSonic project leader, named the inline query class CodingHorror after he allegedly read Jeff Atwood’s post titled Embracing Languages Inside Languages, in which he extolled the virtues of inline SQL inside your code, instead of the standard statement that I bet you have all heard: “We do all database work in stored procedures.”  Jeff outlined his thought process on ad-hoc vs stored procs as follows:

The Subsonic project attempts to do something similar for SQL. Consider this SQL query:

SELECT * from Customers WHERE Country = "USA"
ORDER BY CompanyName

Here’s how we would express that same SQL query in SubSonic’s fluent interface:

CustomerCollection c = new CustomerCollection();
c.Where(Customer.Columns.Country, "USA");
c.OrderByAsc(Customer.Columns.CompanyName);
c.Load();

I’ve mentioned before that I’m no fan of object-oriented rendering when a simple string will suffice. That’s exactly the reaction I had here; why in the world would I want to use four lines of code instead of one? This seems like a particularly egregious example. The SQL is harder to write and more difficult to understand when it’s wrapped in all that proprietary SubSonic object noise. Furthermore, if you don’t learn the underlying SQL– and how databases work– you’re in serious trouble as a software developer.

Enough of the history lesson, why are you writing this post?

Well, the reason for this post is twofold: I haven’t written a post in a couple of weeks, and I wanted to write about my own CodingHorror moment while working with the Entity Framework, which solved a big problem I was having with getting the LIKE statement to work in LINQ.

I am working with a client of mine, trying to expose their data to the web via a simple REST service.  Their whole database model has been constructed in the Entity Framework in a simple active record format.  One of the requirements I was given was to support wildcards on a couple of the input fields.  I thought this was no big deal, because I knew that the StartsWith, EndsWith, and Contains methods of the System.String object, when used in LINQ, translated down to the SQL LIKE operator.

However at the time I didn’t anticipate how much of a pain in the butt it was to actually figure out which of these three methods I should use depending on where the wildcard was placed in the string.  And how I would support a wildcard that was placed in the middle of a string.
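To show why that gets painful, here is a rough, in-memory sketch of the dispatch I would have needed. It is not the code from the client project: the WildcardMatcher class is made up, and it assumes '*' is the wildcard character.

```csharp
using System;

static class WildcardMatcher
{
    // Picks the String method that corresponds to where the '*' sits in the pattern.
    // Assumption: '*' is the only wildcard character.
    public static Func<string, bool> Build(string pattern)
    {
        bool leading = pattern.StartsWith("*", StringComparison.Ordinal);
        bool trailing = pattern.Length > 1 && pattern.EndsWith("*", StringComparison.Ordinal);
        string literal = pattern.Trim('*');

        // A wildcard in the middle has no single StartsWith/EndsWith/Contains equivalent.
        if (literal.Contains("*"))
            throw new NotSupportedException("Wildcard in the middle of the pattern.");

        if (leading && trailing)
            return s => s.Contains(literal);
        if (leading)
            return s => s.EndsWith(literal, StringComparison.Ordinal);
        if (trailing)
            return s => s.StartsWith(literal, StringComparison.Ordinal);
        return s => s == literal;
    }
}
```

Three branches plus an unsupported case, and that is before composing any of it into a LINQ to Entities query.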

Enter Entity Framework’s parameterized Where method

The parameterized Where method of the ObjectQuery<T> object really saved my butt.  It allowed me to produce the exact effect of the query I was looking for, without a lot of hacking.  Here is what I did:

// only add the LIKE filter when a value was supplied
if (!String.IsNullOrEmpty(toAddress))
	table = table.Where(
		"it.ToAddress LIKE @toAddress",
		new ObjectParameter("toAddress", toAddress)
	);

var results =
	from result in table
	where result.Account.AccountId == accountId
		&& result.DateTime >= startDate
	select result;

return View(results.ToList());

The best part of the above code is that I can still use LINQ, and jump into standard SQL syntax when the LINQ syntax falls short.
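If the callers of the service send shell-style wildcards rather than SQL ones, the value can be translated before it is handed to the ObjectParameter. The helper below is a hypothetical sketch under that assumption ('*' for any run of characters, '?' for a single character); it does not escape literal % or _ characters that are already in the input.

```csharp
static class LikePattern
{
    // Translates assumed shell-style wildcards into LIKE wildcards.
    // Note: literal '%' and '_' in the input are passed through unescaped.
    public static string FromWildcards(string input)
    {
        return input.Replace('*', '%').Replace('?', '_');
    }
}
```

The result would then be used as the value of the toAddress parameter in the Where call above.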

This, in my opinion, puts the Entity Framework a step above the rest, because it means that I don’t have to complicate my life and my program by moving to a stored proc or making the whole select statement inline SQL (which I wouldn’t ever do).

Features like this, which make developing software easier and obviously have the programmer in mind, are what we should all strive for.  I like this for the very fact that the Entity Framework team has acknowledged that LINQ, while a wonderful addition to .NET, doesn’t meet all the needs for pulling data from the database, and they sought to minimize the shortcomings by allowing developers to jump back into the natural SQL language.  This is a wonderful addition in my book.

Posted in C#, How To, Programming, SQL

August 30th, 2009

Static Constructors in .NET 3.5, still a bad thing?

Recently at the Philly.NET User Group, Kathleen Dollard gave a great presentation on the use of generics and rethinking object orientation.  Both topics were very engaging.  But the part of the night that I found most intriguing was a conversation I had in a Ruby Tuesday after the presentation, about the usage of static constructors and whether they are still a bad thing to use in your code.

Many years ago, I had read the articles by K. Scott Allen and Brad Abrams explaining why the original FxCop rule, “Do not declare explicit static constructors”, existed, and how the IL flag beforefieldinit was behind both the rule and the performance issues.  Jon Skeet explained it best in a recent Stack Overflow post.

Basically, beforefieldinit means “the type can be initialized at any point before any fields are referenced.”

In theory that means it can be very lazily initialized – if you call a static method which doesn’t touch any fields, the JIT doesn’t need to initialize the type.

In practice it means that the class is initialized earlier than it would be otherwise – it’s okay for it to be initialized at the start of the first method which might use it. Compare this with types which don’t have beforefieldinit applied to them, where the type initialization has to occur immediately before the first actual use.

But I had assumed that since this rule was never included in the Code Analysis (FxCop replacement) part of Visual Studio 2005, 2008, or 2010, it was a non-issue in .NET 2.0 and forward, since all the credible articles that I could find were run against the .NET 1.1 framework.  But I had never really looked into it until now.  The first thing I did was create the following program to test 8 different scenarios where a static constructor could be created by the compiler or by me, to test the performance differences.

Each of the 8 scenarios had a public interface like the following:

class StaticX
{
    static string Name = "Nick Berardi";
    static string GetName() { /* to make sure the Name property was referenced */ }
}

The scenarios were broken down as follows:

  1. Static1: static class, name set on field
  2. Static2: static class, name set in constructor
  3. Static3: static class, name set in property
  4. Static4: static class, name set on field and in constructor
  5. Static5: name set on field
  6. Static6: name set in constructor
  7. Static7: name set in property
  8. Static8: name set on field and in constructor

I then used Reflector to give me the following IL dump of the code.

If we look at the IL code for each of these classes and static constructors we start finding some interesting patterns in how the compiler optimizes the code.

Static1, Static2, Static5, and Static6

All of these have the following static constructor (.cctor()), which is interesting because half of them defined the code in the static constructor and half defined the code by setting the field during initialization.

.method private hidebysig specialname rtspecialname static void .cctor() cil managed
{
    .maxstack 8
    L_0000: ldstr "Nick Berardi"
    L_0005: stsfld string ConsoleApplication1.Program/Static1::Name
    L_000a: ret
}

Static3 and Static7

These do not have any static constructor, which we could have probably guessed because there was no constructor defined and no fields set during initialization of the classes.

Static4 and Static8

These do not match any of the other static constructors we have seen, probably because there is an operation (a string concatenation) occurring in them.

.method private hidebysig specialname rtspecialname static void .cctor() cil managed
{
    .maxstack 8
    L_0000: ldstr "Nick"
    L_0005: stsfld string ConsoleApplication1.Program/Static8::Name
    L_000a: ldsfld string ConsoleApplication1.Program/Static8::Name
    L_000f: ldstr " Berardi"
    L_0014: call string [mscorlib]System.String::Concat(string, string)
    L_0019: stsfld string ConsoleApplication1.Program/Static8::Name
    L_001e: ret
}

There is one more place we are going to want to pay attention to before we start looking at the performance, and I sort of alluded to this at the beginning of the article. That is the IL for the class definition:

.class abstract auto ansi sealed nested public beforefieldinit Static1
.class abstract auto ansi sealed nested public                 Static2
.class abstract auto ansi sealed nested public beforefieldinit Static3
.class abstract auto ansi sealed nested public                 Static4
.class          auto ansi        nested public beforefieldinit Static5
.class          auto ansi        nested public                 Static6
.class          auto ansi        nested public beforefieldinit Static7
.class          auto ansi        nested public                 Static8

Please note that I added in the spaces to show the differences between the definitions.
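You don’t need an IL dump just to see whether the flag is present; it is also visible through reflection as TypeAttributes.BeforeFieldInit. The sketch below uses two made-up classes (FieldOnly and ExplicitConstructor, standing in for the Static5 and Static6 style of scenario) and a hypothetical helper to read the flag.

```csharp
using System;
using System.Reflection;

// Field initializer only: the C# compiler generates the .cctor and marks the type beforefieldinit.
class FieldOnly
{
    public static string Name = "Nick Berardi";
}

// Explicit static constructor: the compiler leaves beforefieldinit off.
class ExplicitConstructor
{
    public static string Name;

    static ExplicitConstructor()
    {
        Name = "Nick Berardi";
    }
}

static class BeforeFieldInitCheck
{
    // True when the type's metadata carries the beforefieldinit flag.
    public static bool HasBeforeFieldInit(Type type)
    {
        return (type.Attributes & TypeAttributes.BeforeFieldInit) != 0;
    }
}
```

Calling HasBeforeFieldInit(typeof(FieldOnly)) and HasBeforeFieldInit(typeof(ExplicitConstructor)) should mirror what the class definitions above show for the field-only and constructor scenarios.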

You will notice that abstract and sealed are missing from Static5 through Static8; don’t worry about those keywords, they are the difference between a static class and an instantiatable class. The one we care about is beforefieldinit, which is only added to classes that don’t already contain an explicit static constructor. The definition of how this works has changed slightly from the original specification (ECMA-335). Here is the breakdown as related to .NET Framework releases:

  • 1st Edition (December 2001) – .NET 1.0
  • 2nd Edition (December 2002) – .NET 1.1
  • 3rd Edition (June 2005) – .NET 2.0, .NET 3.0, .NET 3.5
  • 4th Edition (June 2006) – .NET 2.0, .NET 3.0, .NET 3.5 (Changes from the previous edition were made to align this Standard with ISO/IEC 23271:2006.)

I have marked changes that have been added since the first edition.

The semantics of when and what triggers execution of such type initialization methods, is as follows:

  1. A type can have a type-initializer method, or not.
  2. A type can be specified as having a relaxed semantic for its type-initializer method (for
    convenience below, we call this relaxed semantic BeforeFieldInit).
  3. If marked BeforeFieldInit then the type’s initializer method is executed at, or sometime before,
    first access to any static field defined for that type.
  4. If not marked BeforeFieldInit then that type’s initializer method is executed at (i.e., is triggered
    by):

    • first access to any static field of that type, or
    • first invocation of any static method of that type or
    • first invocation of any constructor for that type. (new in 3rd edition)
  5. Execution of any type’s initializer method will not trigger automatic execution of any initializer
    methods defined by its base type, nor of any interfaces that the type implements

START — The following is new to the 3rd edition.
For reference types, a constructor has to be called to create a non-null instance. Thus, for reference types, the
.cctor will be called before instance fields can be accessed and methods can be called on non-null instances. For
value types, an “all-zero” instance can be created without a constructor (but only this value can be created
without a constructor). Thus for value types, the .cctor is only guaranteed to be called for instances of the value
type that are not “all-zero”. [Note: This changes the semantics slightly in the reference class case from the first
edition of this standard, in that the .cctor might not be called before an instance method is invoked if the 'this'
argument is null. The added performance of avoiding class constructors warrants this change. end note]
END

[Note: BeforeFieldInit behavior is intended for initialization code with no interesting side-effects, where exact
timing does not matter. Also, under BeforeFieldInit semantics, type initializers are allowed to be executed at
or before first access to any static field of that type, at the discretion of the CLI.
If a language wishes to provide more rigid behavior—e.g., type initialization automatically triggers execution
of base class’s initializers, in a top-to-bottom order—then it can do so by either:

  • defining hidden static fields and code in each class constructor that touches the hidden static field of its
    base class and/or interfaces it implements, or
  • by making explicit calls to System.Runtime.CompilerServices.RuntimeHelpers.RunClassConstructor (RuntimeHelpers spelling new in 3rd edition; previously Runtime-Helpers)
    (see Partition IV).

end note]
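The second option in that note, the explicit call to RuntimeHelpers.RunClassConstructor, looks roughly like this in C#. The classes here (InitLog, Precise, ForceInit) are made up for illustration; only the RunClassConstructor call itself is the real framework API.

```csharp
using System;
using System.Runtime.CompilerServices;

// Records whether the static constructor below has run.
static class InitLog
{
    public static bool Ran;
}

// Explicit static constructor, so the type is not marked beforefieldinit.
class Precise
{
    public static readonly string Name;

    static Precise()
    {
        Name = "Nick Berardi";
        InitLog.Ran = true;
    }
}

static class ForceInit
{
    // Runs the type initializer now (if it has not already run), without touching any member.
    public static void Run()
    {
        RuntimeHelpers.RunClassConstructor(typeof(Precise).TypeHandle);
    }
}
```

After ForceInit.Run() returns, InitLog.Ran is true even though no field or method of Precise has been accessed yet.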

So to analyze what has changed: it looks like some major changes took place in the CLR for the .NET 2.0 framework in how and when static constructors are run. In K. Scott Allen’s post that I referenced above, he concluded that classes without the beforefieldinit IL flag ran about 5 times slower on the .NET 1.1 framework. Let’s see if that still holds true for the .NET 2.0 framework (and .NET 3.0 and 3.5, which run off the same CLR as .NET 2.0).

To test this, I ran each of the 8 scenarios through a loop Int32.MaxValue times, or about 2.15 billion. Then I repeated the process 5 times (for those counting, each one was run more than 10 billion times in total), averaged the results, and came up with these numbers:
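For anybody who wants to repeat the measurement, a timing loop of this shape can be sketched with Stopwatch. This is not my original test program: the Benchmark class is hypothetical, and because it calls the body through a delegate it adds its own per-call overhead, so absolute numbers from it will not match mine.

```csharp
using System;
using System.Diagnostics;

static class Benchmark
{
    // Runs body() 'iterations' times, repeats that 'repeats' times,
    // and returns the average elapsed time of one repeat.
    public static TimeSpan Average(Action body, int iterations, int repeats)
    {
        long totalTicks = 0;
        for (int r = 0; r < repeats; r++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                body();
            watch.Stop();
            totalTicks += watch.Elapsed.Ticks;
        }
        return TimeSpan.FromTicks(totalTicks / repeats);
    }
}
```

A call such as Benchmark.Average(() => Static1.GetName(), Int32.MaxValue, 5) would correspond to one row of the results, assuming GetName is accessible from the calling code.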

  1. Static1: 05.7505114 seconds (static class, name set on field)
  2. Static2: 07.4920393 seconds (static class, name set in constructor)
  3. Static3: 11.6594099 seconds (static class, name set in property)
  4. Static4: 08.3254804 seconds (static class, name set on field and in constructor)
  5. Static5: 05.7362727 seconds (name set on field)
  6. Static6: 07.6593536 seconds (name set in constructor)
  7. Static7: 12.4801669 seconds (name set in property)
  8. Static8: 08.3191176 seconds (name set on field and in constructor)

We can obviously throw out the outliers of setting the name in the property, because they were 2 times slower than the fastest method. And we can also throw out the name being set half in the field and half in the constructor, because this is obviously not the best performing and is also not really the best way to code this type of field; it was only in there as a baseline for doing both a field and constructor setting in the same class.

So let’s look at the actual numbers of field vs constructor:

(seconds)               field        constructor   difference
static class            05.750511    07.492039     01.741528
instantiatable class    05.7362727   07.6593536    01.9230809
difference              00.0142383   00.1673146

The performance is pretty much a wash between the static class and the instantiatable class, and the difference between field and constructor is too small for everyday programs to justify changing how you write your static code. The difference was about 30%; while that sounds like a lot, it amounted to less than 2 seconds over 2.15 billion calls, so we are really talking about less than a nanosecond per call. It is also nowhere near the 500% reported by K. Scott Allen against .NET 1.1 back in 2004.

You can draw your own conclusions. But my recommendation is to NOT spend your time changing your code from the use of static constructor to fields, because you won’t find the performance gains in your application. Spend the time on something useful, like the amount of data you are pulling from the database or other places.

I think we can finally put this issue to rest and say that, if you are using the .NET 2.0 framework or later, there is no longer a performance penalty large enough for everyday programmers to care about. Which means I probably guessed correctly as to why they removed it from the Code Analysis rules in Visual Studio 2005.

Update: I had to run each static class individually again to get the correct results, because after I had published this I started playing around with other scenarios and noticed that there seemed to be an optimization of static constructor classes (without beforefieldinit), with the same constructor signature, that got loaded later. With the new results, there is now virtually no difference between static classes and instantiatable classes, which is what I sort of assumed would happen from the beginning.

If anybody knows why static constructor classes that share the same signature would be optimized, please let me know; this is a new mystery in the static constructor debate.

Posted in C#, Programming