    Faster, More Scalable SQLCLR String Splitting

    It seems like every couple of months we see yet another post on SQLCLR string splitting routines. Many bloggers, I suppose, are still struggling, even three years later, to find that “perfect” use case for SQLCLR. Is string splitting it? Probably not. And with SQL Server 2008 table-valued parameters now available, SQLCLR string splitting has become an even less interesting exercise than it was before. Nonetheless, a recent post on the Less Than Dot site has inspired me to counter with my own best SQLCLR string splitting method, for those of you who are still interested in solving this problem.

    I’ve noticed that almost invariably, the methods posted online stress how very easy it is to do string splitting in .NET, thanks to the String.Split method. And while this easy method tends to work pretty well for small strings and on workloads that don’t need to scale, it quickly breaks down when any amount of load is introduced (something that, unfortunately, most writers don’t bother considering). The Less Than Dot writer, “onpnt,” did do some testing, and discovered that–surprise, surprise–String.Split isn’t all that great.

    The issue? Well, it all comes down to large memory allocations and the art of scalable .NET programming–an area about which many SQL Server developers can (and do) remain blissfully naïve. In .NET, reduction of memory utilization–especially large allocations done en masse–is king, and String.Split does exactly the wrong thing. It takes the input string, breaks it into N chunks, and allocates all of the memory needed to store those chunks and pointers to those chunks, in one huge operation. This can’t possibly scale, and indeed it doesn’t. I did a quick SQLQueryStress test of a TVF based on String.Split and saw fairly good performance when the input sentences were small (in the 40-400 byte range–see below), but after a certain point the AppDomains began recycling and performance became abysmal. Protections put in place for stability of the CLR host include memory leak detection, and this kicks in quite readily when we force allocation of so much memory in one shot–a good thing for the SQL Server instance as a whole, but not great when we’re legitimately trying to split a huge string.
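    For reference, here is roughly what the String.Split pattern from those posts looks like when wired up as a TVF. This is my own minimal sketch of the shape of that approach, not code from any particular post; the names naive_split and FillRow are illustrative:

    using System.Collections;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(
            FillRowMethodName = "FillRow",
            TableDefinition = "output nvarchar(4000)")]
        public static IEnumerable naive_split(SqlString instr, SqlString delimiter)
        {
            // The entire result, one string per chunk plus the array that
            // holds them, is allocated right here, before a single row has
            // been returned to the caller
            return instr.Value.Split(delimiter.Value[0]);
        }

        public static void FillRow(object obj, out SqlString output)
        {
            output = new SqlString((string)obj);
        }
    }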

    Students of .NET who are concerned with scale (and really, everyone should be) are urged to look at the way problems are handled in LINQ. There, the vast majority of requests are internally handled using streaming iterator patterns, rather than moving around huge chunks of memory. This turns out to be a much more scalable option for several reasons: lower in-flight memory utilization, fewer large object heap allocations, and better access by the garbage collector to collect intermediate data that is no longer needed.
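    To see the difference in miniature, compare an eager split, which materializes every chunk before returning, with a streaming split built on yield return, which produces one chunk per request. This is a standalone sketch to illustrate the pattern, independent of SQLCLR:

    using System.Collections.Generic;

    public static class StreamingDemo
    {
        // Eager: the full string[] exists before the caller sees any element
        public static string[] EagerSplit(string s, char delim)
        {
            return s.Split(delim);
        }

        // Streaming: each chunk is created only when the caller asks for it,
        // and becomes eligible for collection as soon as the caller moves on
        public static IEnumerable<string> StreamingSplit(string s, char delim)
        {
            int start = 0;
            while (start <= s.Length)
            {
                int next = s.IndexOf(delim, start);
                if (next == -1)
                    next = s.Length;

                yield return s.Substring(start, next - start);
                start = next + 1;
            }
        }
    }

    LINQ operators such as Where and Select are built on exactly this kind of deferred, one-element-at-a-time evaluation.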

    So how can we apply streaming to the string splitting problem? Rather than break the string up into all of its component parts upfront, we can walk the string step-by-step, only finding the next piece as required. In order to handle this, I created the following worker class:

    using System.Collections;

    public class splitIt : IEnumerator
    {
        public splitIt(string theString, char delimiter)
        {
            this.theString = theString;
            this.delimiter = delimiter;
            this.lastPos = -1;
            this.nextPos = -1;
        }

        #region IEnumerator Members

        // The current chunk: materialized (and trimmed) only when requested
        public object Current
        {
            get { return theString.Substring(lastPos, nextPos - lastPos).Trim(); }
        }

        // Advance past the last delimiter found and locate the next one;
        // no per-chunk allocation happens here
        public bool MoveNext()
        {
            if (nextPos >= theString.Length)
                return false;
            else
            {
                lastPos = nextPos + 1;

                // The string ended with a delimiter; we're done
                if (lastPos == theString.Length)
                    return false;

                nextPos = theString.IndexOf(delimiter, lastPos);
                if (nextPos == -1)
                    nextPos = theString.Length;

                return true;
            }
        }

        public void Reset()
        {
            this.lastPos = -1;
            this.nextPos = -1;
        }

        #endregion

        private int lastPos;
        private int nextPos;
        private string theString;
        private char delimiter;
    }
    

    This class is a simple enumerator implementation that looks for the next delimiter on each iteration, only when requested. Splitting strings in this way, rather than using String.Split, means that no huge upfront allocation takes place. Aside from the sentence itself, only one “chunk” is in play at any given time, and any chunks that have already been used can be garbage collected as needed when memory is tight.
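    To see the protocol in action outside of SQL Server, you can drive the enumerator by hand. This is the same MoveNext/Current sequence the server follows, calling the fill-row method once per successful MoveNext. A quick test harness (SplitItDemo is my own illustrative name):

    using System;

    public static class SplitItDemo
    {
        public static void Main()
        {
            // One chunk per MoveNext call; nothing is located or copied
            // until the next element is actually requested
            var iter = new splitIt("alpha, beta, gamma", ',');
            while (iter.MoveNext())
                Console.WriteLine((string)iter.Current);

            // Output: alpha, beta, gamma (one per line, trimmed)
        }
    }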

    Wiring this class up in a SQLCLR TVF is just about as simple as when using String.Split:

    using System;
    using System.Collections;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillIt", TableDefinition = "output nvarchar(4000)")]
        public static IEnumerator faster_split
            (
                SqlChars instr,
                [SqlFacet(IsFixedLength=true, MaxSize=1)]
                SqlString delimiter
            )
        {
            // NULL input in either parameter produces an empty enumerator,
            // and therefore zero rows
            return (
                (instr.IsNull || delimiter.IsNull) ?
                new splitIt("", ',') :
                new splitIt(instr.ToSqlString().Value, Convert.ToChar(delimiter.Value)));
        }

        public static void FillIt(object obj, out SqlString output)
        {
            output = (new SqlString((string)obj));
        }
    }
    

    I’ve enhanced this example slightly compared with most of the usual suspects: a SqlFacet attribute is used to make sure that the delimiter is only a single character, and I’ve added a bit of additional code in the main method to deal with NULL inputs, which simply produce an empty enumerator and therefore zero rows.

    The scalability difference between this method and the String.Split method is staggering in the simple tests I ran today on my SQL Server 2008 test server.  With small sentences, even under moderate load (100 concurrent threads), each method performs more or less equivalently.  But as sentence size increases to 50KB, the String.Split method begins slowing, taking almost 2 seconds per iteration, and the occasional AppDomain recycle is seen in the SQL Server log.  The streaming method, on the other hand, continues to complete its job in just over 1/10th of a second, and causes no AppDomain recycles. Increasing to 500KB sentences, String.Split causes numerous AppDomain recycles and time per iteration increases to over 30 seconds, while the streaming method averages just 16 seconds per iteration. Jumping to 5MB sentences, String.Split causes almost continuous AppDomain recycles, and each iteration takes almost 6 minutes to complete. Yet with the streaming method, even with sentences of this size I am still unable to cause an AppDomain recycle to occur, and iterations complete in around 55 seconds.

    The test I did was quite simple, and I won’t post too many details here as I prefer that you test with your own workloads and draw your own conclusion about how this method fares when compared with T-SQL versions or the naïve String.Split method. I hope that if you do test you’ll post back here with your results so that we can all learn the best way to handle these problems–whether or not string splitting really is all that interesting in the post-SQL Server 2005 world.

    Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

    24 COMMENTS

    1. Very impressive, Adam, as with anything I’ve read of yours. I’m glad the one on LTD got you to write this, and that I had the chance to get feedback on the imperfections in mine, along with the more efficient ways of performing the task. It’s easy to say things can be better, but showing it like you have here is the best way we can learn from it. I know I have! -Ted

    2. Hey Adam,
      This debate is currently running over at SQLServerCentral:
      http://www.sqlservercentral.com/Forums/Topic695508-338-1.aspx?Update=1
      We used a very similar TVF to yours, and found that leaving SqlChars.Value as char[] improved the speed of the split slightly.
      That is the current fastest CLR TVF for most cases, though for larger strings with fewer delimiters, a RegEx split is even faster.
      A nice graph showing the two fastest CLR TVFs versus two of the best T-SQL approaches can be found here:
      http://www.sqlservercentral.com/Forums/FindPost704614.aspx
      The other interesting thing to come out so far is that the CLR TVFs are very significantly faster than tally (numbers) table based T-SQL routines, long held to be the best way to split strings.
      There are a large number of tests in that thread, from small strings to splitting the entire text of Moby Dick!
      Finally, it is a shame that while the TVF can stream its output via IEnumerator and IEnumerable, streaming input can’t be achieved at the same time, either via the SqlChars or by using a SqlDataReader to read the rows-to-split from a persistent table.
      Cheers,
      Paul

    3. Hi Paul,
      Thanks for the pointer.  I just took a quick look through the thread but couldn’t find any methods similar to mine.  I saw the "chars" method, but it has the same problem as String.Split: It does all of the splitting work upfront, producing an intermediate collection (it uses an ArrayList — if I’ve looked at the wrong solution please point me to the correct one).  I didn’t see anything in there about anyone actually load testing these methods, and that’s where the scalability factor comes into play.  The upfront allocations simply will not scale.
      By the way, SqlChars does stream the data in, and if I can figure out a way to leverage the char collection without doing an upfront large allocation I’ll post back an updated solution that should be even better.

    4. Thanks for the quick reply – and, yes, you are quite right – all the solutions there do the allocation up-front, not using IEnumerator methods explicitly.  I do apologize for my confusion there.  Sorry.
      The reason for my confusion was that I had experimented with a solution using explicit IEnumerator methods, when trying to write a CLR TVF to read all the strings to split directly from the source table.  The rows were read into a SqlDataReader, and the plan was to stream the input from the Reader and also stream the output in the IEnumerator methods.
      (BTW yes SqlChars can stream, but this method was trying to read all rows at once, rather than streaming individual rows…)
      Sadly, it failed for two reasons:
      (a) CLR TVFs can only do data access in the InitMethod, not in the method called via FillRowMethodName, so I would have had to read all the data up front again.
      (b) Sending the results down the SqlPipe row-by-row using a SqlDataRecord and associated metadata is really slow.  Building a DataTable in a DataSet and sending a SqlDataReader from the DataSet is even worse.
      My apologies again.
      Paul

    5. Hi Paul,
      No need to apologize for anything 🙂
      I did some work with the SqlPipe for a running sums problem and found it to be plenty fast for my needs.  How big was the data you were returning?

    6. It was returning the split string results from Jeff Moden’s test table.  This generates 800K rows of single small strings O.0
      It wasn’t that slow in absolute terms, it was just a little disappointing that it was slower than the CROSS APPLY with the original CLR TVF.  It was about the same speed as the simple tally table solution, which was one of the worst performers overall.
      Of course, it may be simply that my C# skills are not all they could be!
      The table-at-once approach was never intended to be a good addition to the SQL CLR developer’s toolkit, for anyone wondering – it was an attempt to find the fastest possible solution for the original posted problem on SSC.
      Its memory allocation in particular would be horrible – I just wanted to find a way to cut down the round trips between the database engine and CLR, and to reduce the number of calls (late-bound by reflection?) to the IEnumerator methods.
      I was apologising for replying too quickly without taking the time to fully digest your solution.  I stand by that!  I will learn :c)
      Paul

    7. M,
      Yes – see the link to SSC in Adam’s article.  The Regex is modestly faster for very large strings – but see Adam’s next post for the ultimate solution.  As you will see, the problem with all the CLR methods up to this point (including a pre-compiled Regex) is that memory is allocated for the entire split up front…
      Cheers,
      Paul

    8. M: As Paul mentioned, Regex.Split is out, due to upfront memory allocation.  Precompiled regex might still be an option, but I can’t find a way to stream the results (i.e., an IndexOf method or something similar).  Do you have any ideas?

    9. What if you stream the positions back and then use SUBSTRING in the T-SQL? This avoids the need to make any copies of the strings.

    10. Simon,
      Wouldn’t SQL Server be making a copy of the string when it performs the substring? ;c)
      Paul

    11. I tried something similar in the past without much success, but it was 100% T-SQL.  I’m not sure if the performance issue was with SUBSTRING or CHARINDEX but on VARCHAR(MAX) it was pretty bad.  I’ll have to test and see if that makes a difference.  Only issue is that it would require two separate modules to do a split, which becomes a bit more of a maintenance nightmare.

    12. Thank you, this is very useful (though it should be included as a built-in T-SQL function).  I haven’t read through all the many pages spilled on this subject, but I assume the conclusion is that your CLR version is among the fastest, and faster in most scenarios than the T-SQL version?
      btw, I added a comment about an update to your pure T-SQL version, at:
      http://sqlblog.com/blogs/adam_machanic/archive/2006/07/12/splitting-a-string-of-unlimited-length.aspx

    13. Hi Adam
      Would you mind please posting the whole code for your example?  I am trying to learn how to create a CLR TVF do split strings, and this page comes highly recommended.  However, I can’t figure out how to fill in the gaps… Thanks 🙂

    14. The function would be simpler using yield return:
      public partial class UserDefinedFunctions
      {
          [Microsoft.SqlServer.Server.SqlFunction(
              TableDefinition="token nvarchar(4000)",
              FillRowMethodName="GetNextString")]
          public static IEnumerable Split(
              [SqlFacet(MaxSize=-1)] SqlString str,
              [SqlFacet(IsFixedLength=true, MaxSize=1)] SqlString separator)
          {
              if (str.IsNull || separator.IsNull) yield break;
              char splitter = separator.Value[0];
              int currentPosition = 0;
              int nextPosition = 0;
              string inputStr = str.Value;
              int strLength = inputStr.Length;
              do
              {
                  nextPosition = inputStr.IndexOf(splitter, currentPosition);
                  if (nextPosition == -1) nextPosition = strLength;
                  yield return inputStr.Substring(currentPosition, nextPosition - currentPosition).Trim();
                  currentPosition = nextPosition + 1;
              } while (currentPosition < strLength);
          }

          public static void GetNextString(Object obj, out SqlString str)
          {
              str = new SqlString((string)obj);
          }
      }

    15. Give my CLR JSON parser a try. I believe it is the only one that deploys with SAFE permissions. It uses regex with recursion to blow through the string. https://github.com/jgcoding/J-SQL.git. I am working on alternate params and functions to control the depth of the parse.
      Still working on dealing with pre-formatted (prettified) strings, those with line-feeds, carriage returns, and tabs not residing within a valid string value. Special characters residing within a string value pose no issue.

    16. Thanks for sharing, JG. I’ve not yet had any reason to work with JSON from SQL but I’m sure someone will find it useful.

    17. Hi Adam,
      I know this has been posted a while ago and it is a very good one ..my issue is it can do a split with enclosed quotes ("something,something" ) I don’t want to split using the delimiter that’s enclosed within quotes.
      Do you know a way to modify this code to handle to string of these two natures
      12321,john,doe,"1945,07,03",A1A
      "12321","john","doe","1945,07,03",A1A

    18. Sorry .. I meant .. it CANT ..
      my issue is it cant do a split with enclosed quotes ("something,something" ) I don’t want to split using the delimiter that’s enclosed within quotes

    19. Is it possible to have this query output a RowId integer for each row that is created without impacting its scalability?

    20. Thank you so much for this. We are on SQL Server 2019, and the built-in STRING_SPLIT is fast for varchar(8000)/nvarchar(4000), but performance unfortunately falls off a cliff when handling varchar(max)/nvarchar(max).

      Anyone have any links to other user CLR code snippets from established SQL Server MVPs? Is there any way to donate to the author?
