Performance: Compiled vs. Interpreted Regular Expressions

August 6, 2009 at 10:33 PMBen

When a regular expression in .NET will be used multiple times, it’s common to create that Regex with the Compiled flag, e.g. RegexOptions.Compiled.  Compiled regexp’s take a bit more time to create initially, but will run faster than a regexp created without the Compiled flag.  At least that’s what the documentation states!

Without the Compiled flag, your regexp will be interpreted.  There’s even "precompiled” regular expressions.  You need to compile these regular expressions into an assembly before runtime.  This might be a good option if you have constant regexps that don’t change.  If your regexps are subject to change, pre-compiled is not a good option.  These three types of regexp’s (interpreted, compiled and pre-compiled) are explained with a few more technical details in this somewhat dated MS blog article.

Theory is great, but real benchmarks are more meaningful.  I’ve assembled some code that benchmarks the difference in speed it takes to create and run 5,000 regular expressions.  There’s actually a big difference in the time taken to run a compiled regular expression the first time, versus subsequent times.  So the results shown here will include the first run time as well as the subsequent run times.

Here’s some code to get us started:

    private static List<Regex> _expressions;
    private static object _SyncRoot = new object();

    private static List<Regex> GetExpressions()
    {
        if (_expressions != null)
            return _expressions;

        lock (_SyncRoot)
        {
            if (_expressions == null)
            {
                DateTime startTime = DateTime.Now;

                List<Regex> tempExpressions = new List<Regex>();
                string regExPattern =
                    @"^[a-zA-Z0-9]+[a-zA-Z0-9._%-]*@{0}$";

                for (int i = 0; i < 5000; i++)
                {
                    tempExpressions.Add(new Regex(
                        string.Format(regExPattern,
                        Regex.Escape("domain" + i.ToString() + "." +
                        (i % 3 == 0 ? ".com" : ".net"))),
                        RegexOptions.IgnoreCase | RegexOptions.Compiled));
                }

                _expressions = new List<Regex>(tempExpressions);
                DateTime endTime = DateTime.Now;
                double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
            }
        }

        return _expressions;
    }

We’re storing 5,000 regular expressions in a static list.  Notice the RegexOptions.Compiled flag is being used.  The regexp’s are just looking for email addresses with specific domain names – domain1.net, domain2.net, domain3.com, etc.  Not very useful, but I just wanted the regexp’s to vary.  You can see we’re also recording the number of milliseconds taken to create the regular expressions.  Now here’s the code that calls GetExpressions() and actually invokes the IsMatch function on each regexp.

    private static void CheckForMatches(string text)
    {
        List<Regex> expressions = GetExpressions();
        DateTime startTime = DateTime.Now;

        foreach (Regex e in expressions)
        {
            bool isMatch = e.IsMatch(text);
        }

        DateTime endTime = DateTime.Now;
        double msTaken = endTime.Subtract(startTime).TotalMilliseconds;
    }

And we call CheckForMatches like so:

    CheckForMatches("some random text with email address, [email protected]");

How much time does it take to create and run these 5,000 compiled expressions?  Here’s what I get:

Compiled Regular Expressions
CREATION TIME: 1662 ms
FIRST RUN TIME: 25137 ms
SUBSEQUENT RUN TIMES: 41 ms

Subsequent runs of all 5,000 expressions is very fast.  However, look how much time it takes the first time these 5,000 expressions are run in CheckForMatches() – 25 seconds!!!

Let’s make ONE change.  Remove the RegexOptions.Compiled flag.  By doing this, our regular expressions will be interpreted.  Here’s what we get:

Interpreted Regular Expressions
CREATION TIME:
493 ms
FIRST RUN TIME: 22 ms
SUBSEQUENT RUN TIMES: 20 ms

Interpreted regexp’s beat compiled in every category!  Running these tests several times produces similar results.  The BIG difference here is obviously the First Run Time.  25 seconds versus .022 seconds.

I’ve seen some benchmarks showing static regexp’s performing a little slower than instance regexp’s.  I ran the same tests without the static modifier on the fields and methods above.  Same results – using the Compiled flag takes around 25 seconds for the regular expressions to run the first time.  Without the Compiled flag, they run in hundredths of a second.

Clearly, interpreted regexps are the winner.  Granted, if you’re only dealing with a small number of regular expressions, and you use the compiled flag, the first run time isn’t going to be anywhere near what I’ve shown here with 5,000 regexps.  However, even with just a few regular expressions, in .NET, you’ll see me sticking with interpreted regular expressions!

Beware: CodePlex

February 22, 2009 at 4:45 PMBen

Ever since setting up this blog which is powered by BlogEngine.NET, I spend a small amount of time everyday at the BE.NET code hosting area at CodePlex.  Mostly I participate in the discussions there and have even submitted some bug reports in the issue tracker.  It's been a rewarding experience to be able to contribute to an open-source application used by probably thousands of bloggers out there.

However, my time spent at CodePlex has been anything but pleasant.  CodePlex is a website built, maintained and run by Microsoft.  It's a place anyone can store their open source projects for free.  There's other code hosting services out there such as SourceForge and Google Code.

Allow me to vent a bit and list the problems I find with CodePlex.

  1. SLOW, SLOW, SLOW!  CodePlex has to be the slowest website on the face of the internet.  Just about anything you do on CodePlex takes 10 times longer than it takes to do the same type of action on other websites.  Just pulling up a simple discussion takes 3 seconds for a simple GET request.  That 3 seconds is a best case scenario.  Almost on a daily basis, CodePlex will start slowing up.  Posting a message can take 10 to 20 seconds or even minutes at times.  Searching the Discussions takes too much time.  Even worse is searching the Issue Tracker.  There's been times it takes over 2 or 3 minutes for search results to come back.
  2. Server side errors.  I don't think a day has gone by that I haven't received at least one server side error while on CodePlex.  The typical error is an XML parsing error.  Sometimes the response from the server takes so long that I just get a general timeout error or page cannot be displayed error.
  3. Issue Tracker editor.  When creating an issue in the tracker or adding a comment to an issue, instead of getting a WYSIWYG editor like you get in the Discussions, all you get here is a plain old textarea.  And a pretty small textarea at that.  It's a common need to paste some code into the issue tracker -- after all, that's what CodePlex is for!  But all leading spaces are lost when saving your post.  This means properly indented code is no longer indented, looks crappy and is difficult to follow.  You also can't edit anything you posted in the Issue Tracker.  I'm not sure why you can edit messages and get a WYSIWYG editor in the Discussions area, but not in the Issue Tracker.
  4. Spam.  For a while, various spammers kept posting spam messages in the Discussions area.  CodePlex doesn't seem to have any mechanism to report spammers.  The spammers would post messages with some relevance to technology, but the messages had nothing to do with the Discussion at hand.  Throw in CodePlex's speed woes, and trying to find the real messages while sifting past the spam equates to a waste of my time.
  5. Team Foundation Server unavailable.  The source code for projects at CodePlex is stored in a TFS database.  Every now and then, I get messages stating TFS is unavailable for short to long periods of time.  While TFS is unavailable, the Issue Tracker and Source Code areas are completely unavailable.
  6. RSS Feed Lags.  CodePlex uses caching for its RSS feed which is a good idea.  New items to be added to the feed seem to normally show up within an hour.  There are times, however, that certain items may take several hours before they appear in the feed.

I wouldn't recommend CodePlex to anyone looking for a place to host their code.  I've not yet spent any time at Google Code, but I do visit SourceForge on rare occasion.  I recall no slowness or errors while browsing through the messages in the discussions at SourceForge.

I'm definitely not the first and certainly won't be the last one to bring up some of these CodePlex problems.  Dave Ward had this great blog post where he broke down some of the massive performance inefficiencies in CodePlex with its voting system.  It's disturbing too since this isn't the only Microsoft website to suffer from a performance / reliability standpoint.  A few years ago when I was spending some time on the www.asp.net forums, forum searches were very slow.  Apparently the asp.net website was completely down about a week ago.  The Microsoft blogs website is another site where blog posts and paging through the posts often results in long wait times.

I'd love to see BE.NET move and host its codebase elsewhere.  I unfortunately haven't seen any sign of Microsoft planning to fix CodePlex and they seem perfectly content with the way CodePlex is now.  If they do fix CodePlex, great.  I'm just constantly frustrated everytime I'm at CodePlex, and don't think a code hosting website (or any website) should be a hindrance or distraction from the real reason I'm at the site -- to participate in and have fun with a growing open-source project.

Posted in: Opinion

Tags: , ,

Disable ViewState - reap performance gains

October 19, 2008 at 11:17 AMBen

Turning off viewstate for an entire page or just for certain controls can potentially reduce bandwidth greatly.  This is not news for most developers, but at least for me, I often forget to take this into account when developing ASP.NET pages.

Just the other day, I found two server side controls that had enough content in them to make viewstate much larger than it should have been.  After disabling viewstate on those 2 controls, the viewstate hidden field sent to the client went down from about 6,500 bytes to 500 bytes!  That's 6 KB of unneeded data that was being sent down to each client.  And because viewstate is sent to the client in a hidden input field, if there's a postback, all that data gets sent back up to the server.  Most people don't have a very fast upload speed, so the post hurts performance more than sending the data to the client.

Especially for pages that don't do any postbacks, there's no reason I can think of to even have viewstate turned on at all.  For these pages, disable viewstate at the page level.  It's scary to imagine the number of ASP.NET pages out there that output static reports in large tables (gridview, datagrid, listview etc) with viewstate unnecessarily enabled.  Bandwidth savings on such pages could probably be as large as the tens or even hundreds of kilobytes.

There are reasons to leave viewstate enabled for server controls on pages that do postbacks, however.  During a postback, if the resources required are high for obtaining the data needed to put back into a disabled viewstate control, it may be more advantageous to leave viewstate enabled.  For instance, if you need to make a database call or consume a web service across a network to obtain the data, leaving viewstate turned on may be better.  Same with most standard form controls (e.g. input, select), there's typically not going to be enough of a performance gain to justify turning off viewstate and losing some of the perks you get with viewstate enabled when a page is going back and forth between the client and server one or more times.  The big performance gains are going to be with server controls that output large blocks of HTML not editable by the user.

Posted in: Development

Tags: ,