    Tokenize UDF

    Yes, another string splitting UDF from a guy who’s obviously become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call “Tokenization.”

    So what is Tokenization? It’s a word I made up for this problem.

    But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.

    I think this is best illustrated with an example string:

    DECLARE @Tokens VARCHAR(50)
    
    SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''
    

    A basic split string function, of the kind you can find all over the place, will produce the following output:

    SELECT * 
    FROM dbo.SplitString(@Tokens, ',')
    
    OutParam
    -------------
    a
    'b'
    ''c'
    'd'
    'e''
    f
    '1
    2
    3
    4'
    

    Well, that’s wrong. Because what I want to do is maintain the substrings (or, “tokens,” as I like to call them — thus, Tokenization!)
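
    For reference, dbo.SplitString isn't defined in this post. A minimal sketch of such a naive splitter might look like the following. It assumes a dbo.Numbers table of sequential integers starting at 1 and reuses the OutParam column name from the output above; the body is a guess at a typical implementation, not the exact function that produced that output.

    ```sql
    -- Hypothetical naive splitter: cuts on every delimiter and ignores apostrophes.
    -- Assumes dbo.Numbers(Number) holds sequential integers starting at 1.
    CREATE FUNCTION dbo.SplitString
    (
    	@List NVARCHAR(2000),
    	@Delimiter NCHAR(1)
    )
    RETURNS TABLE
    AS
    RETURN
    (
    	SELECT
    		-- Each element runs from n.Number up to the next delimiter
    		LTRIM(RTRIM(SUBSTRING(
    			@List,
    			n.Number,
    			CHARINDEX(@Delimiter, @List + @Delimiter, n.Number) - n.Number
    		))) AS OutParam
    	FROM dbo.Numbers n
    	WHERE n.Number <= LEN(@List)
    		-- An element starts wherever the preceding character is a delimiter
    		AND SUBSTRING(@Delimiter + @List, n.Number, 1) = @Delimiter
    )
    ```

    Because nothing in it looks at apostrophes, every comma becomes a split point, including the ones inside '1,2,3,4'. Row order is also not guaranteed unless the caller adds an ORDER BY.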

    The output I desire is:

    Token
    --------
    a
    'b'
    ''c', 'd', 'e''
    f
    '1,2,3,4'
    

    Notice that substrings — delimited with apostrophes — should be maintained.

    And here’s how I’ve solved this problem…

    CREATE FUNCTION dbo.Tokenize
    (
    	@Input NVARCHAR(2000)
    )
    RETURNS @Tokens TABLE 
    	(
    		TokenNum INT IDENTITY(1,1),
    		Token NVARCHAR(2000)
    	)
    AS
    BEGIN
    	DECLARE @i INT SET @i = 0
    	DECLARE @StartChar INT SET @StartChar = 1
    	DECLARE @Quote INT SET @Quote = 0	
    
    	DECLARE @Chars TABLE 
    	(
    		CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    		TheChar CHAR(1), 
    		TheCount INT,
    		StartChar INT
    	)
    
    	-- Pad both ends with delimiters so the first and last tokens are bounded like the rest
    	SET @Input = ' , ' + @Input + ' , '
    	
    	-- Explode the input into one row per character, in order, via the Numbers table
    	INSERT @Chars (TheChar)
    	SELECT SUBSTRING(@Input, n.Number, 1)
    	FROM Numbers n
    	WHERE n.Number > 0 
    		AND n.Number <= LEN(@Input)
    	ORDER BY n.Number
    	
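    	-- Walk the characters, looking one character ahead (Chars1):
    	--   @i         = characters since the last comma that fell outside apostrophes
    	--   @Quote     = apostrophes seen since the last split (odd means "inside a string")
    	--   @StartChar = position where the current token starts
    	-- This depends on rows being processed in CharNum order (see the warning below)
    	UPDATE Chars SET 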
    		@i = Chars.TheCount = 
    			CASE 
    				WHEN Chars1.TheChar = ',' 
    					AND @Quote % 2 = 0 THEN 0 
    				ELSE @i + 1 
    			END,
    		@Quote = 
    			CASE  
    				WHEN Chars1.TheChar = '''' THEN @Quote + 1 
    				WHEN @i = 0 THEN 0 
    				ELSE @Quote 
    			END,
    		@StartChar = Chars.StartChar =
    			CASE
    				WHEN @i = 1 THEN Chars1.CharNum - 1
    				WHEN @i = 0 THEN @StartChar + 1
    				ELSE @StartChar
    			END
    	FROM @Chars Chars
    	JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1
    
    	-- Return each token trimmed, skipping empty pieces and bare commas
    	INSERT @Tokens(Token)
    	SELECT
    		RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
    	FROM (
    		SELECT StartChar, CharNum
    		FROM @Chars
    		WHERE TheCount = 0
    
    		UNION ALL
    
    		SELECT 
    			MAX
    			(
    				CASE TheCount 
    					WHEN 0 THEN CharNum 
    					ELSE 0 
    				END
    			) + 1, 
    			MAX(CharNum)
    		FROM @Chars
    	) x
    	WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
    	ORDER BY x.StartChar
    	RETURN
    END
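
    The rule the function applies boils down to this: a comma ends a token only when an even number of apostrophes has been seen so far. As a cross-check, here is a slower, loop-based sketch of that same rule that avoids the variable-assigning UPDATE. The variable names are illustrative, and it is meant as a way to verify results rather than a replacement for the function above.

    ```sql
    DECLARE @Input NVARCHAR(2000)
    SET @Input = N'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

    DECLARE @Result TABLE
    (
    	TokenNum INT IDENTITY(1,1),
    	Token NVARCHAR(2000)
    )

    DECLARE @Pos INT SET @Pos = 1        -- current character position
    DECLARE @Start INT SET @Start = 1    -- where the current token begins
    DECLARE @Quotes INT SET @Quotes = 0  -- apostrophes seen so far
    DECLARE @Len INT SET @Len = LEN(@Input)

    -- Run one position past the end so the final token gets flushed
    WHILE @Pos <= @Len + 1
    BEGIN
    	IF @Pos <= @Len AND SUBSTRING(@Input, @Pos, 1) = N''''
    		SET @Quotes = @Quotes + 1

    	-- Split at end of input, or at a comma seen with an even apostrophe count
    	IF @Pos > @Len
    		OR (SUBSTRING(@Input, @Pos, 1) = N',' AND @Quotes % 2 = 0)
    	BEGIN
    		IF LTRIM(RTRIM(SUBSTRING(@Input, @Start, @Pos - @Start))) <> N''
    			INSERT @Result (Token)
    			VALUES (LTRIM(RTRIM(SUBSTRING(@Input, @Start, @Pos - @Start))))

    		SET @Start = @Pos + 1
    	END

    	SET @Pos = @Pos + 1
    END

    SELECT Token
    FROM @Result
    ORDER BY TokenNum
    ```

    For the sample string this should return the same five tokens as the desired output, which makes it a handy baseline when testing the set-based version against other inputs.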
    

    A word of warning: This UDF uses the undocumented — and unsupported — “aggregate update” functionality. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn’t support this one, so handle with care.
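
    To make the warning concrete, here is a minimal sketch of the technique on its own: an UPDATE that assigns a variable and a column in the same SET, carrying the variable from row to row. The table and column names are hypothetical and chosen only for illustration.

    ```sql
    -- Illustrative only: table and column names here are hypothetical.
    DECLARE @Sales TABLE
    (
    	Id INT NOT NULL PRIMARY KEY,
    	Amount INT NOT NULL,
    	RunningTotal INT NULL
    )

    INSERT @Sales (Id, Amount) VALUES (1, 10)
    INSERT @Sales (Id, Amount) VALUES (2, 20)
    INSERT @Sales (Id, Amount) VALUES (3, 30)

    DECLARE @Total INT
    SET @Total = 0

    -- For each row, add Amount to @Total and store the new value in the row.
    -- The result is only a running total if rows are visited in Id order,
    -- which SQL Server does not promise.
    UPDATE @Sales
    SET @Total = RunningTotal = @Total + Amount

    SELECT Id, Amount, RunningTotal
    FROM @Sales
    ORDER BY Id
    ```

    In practice the rows tend to be visited in primary key order, but nothing guarantees it, which is why results from this pattern should be verified rather than assumed.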

    And by the way, you need a numbers table to use this thing. Of course.
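
    If you don't already have one, a numbers table can be built along these lines. This is a sketch that assumes SQL Server 2005 or later (for ROW_NUMBER and sys.all_objects) and the table and column names the function expects, Numbers and Number. The row count of 10000 is an arbitrary choice that comfortably covers the function's NVARCHAR(2000) input.

    ```sql
    -- One row per integer, starting at 1
    CREATE TABLE dbo.Numbers
    (
    	Number INT NOT NULL PRIMARY KEY
    )

    -- Cross join a system view with itself purely as a cheap source of rows,
    -- then number those rows 1..10000
    INSERT dbo.Numbers (Number)
    SELECT TOP (10000)
    	ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
    ```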

    As for using this thing, it’s pretty easy:

    DECLARE @Tokens VARCHAR(50)
    
    SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''
    
    SELECT Token
    FROM dbo.Tokenize(@Tokens)
    
    
    Token
    --------
    a
    'b'
    ''c', 'd', 'e''
    f
    '1,2,3,4'
    

    … and it even appears to work properly!

    Enjoy… an application for this and the other strange things I’ve been posting recently is coming very, very soon.

    Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

    1 COMMENT

    1. Thank you for this. Note that if a token consists of nothing but a space, your script excludes that extra "token". In my situation I always needed to compare a specific token number, so I needed the empty token to be kept.
      These changes are not efficient, but they worked.
      I changed the INSERT section to use a CASE expression: when the trimmed substring is '', it returns the untrimmed substring; otherwise it returns the trimmed one.
      case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) = ''
      then SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)
       else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
      end
      Additionally, I had to change the WHERE clause, because SQL Server treats '' and ' ' as equal (trailing spaces are ignored in = comparisons).
      SUBSTRING(@Input, StartChar, CharNum - StartChar + 1) NOT LIKE ''
      AND
      RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT LIKE ','
