Tokenize UDF

Yes, another string splitting UDF from a guy who’s obviously become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call “Tokenization.”

So what is Tokenization? It’s a word I made up for this problem.

But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.

I think this is best illustrated with an example string:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

A basic split string function, of the kind you can find just about anywhere, will produce the following output:

SELECT * 
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'

Well, that’s wrong, because what I want to do is maintain the substrings (or “tokens,” as I like to call them; thus, Tokenization!).
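
For reference, the post does not show dbo.SplitString itself. The following is only a sketch of what such a naive splitter might look like, not the actual function used above. The parameter names are hypothetical, and it assumes a Numbers table with a Number column, like the one the Tokenize function below relies on. It breaks at every delimiter without regard to apostrophes, which is why quoted substrings fall apart.

```sql
-- Hypothetical stand-in for the dbo.SplitString called above (not the original).
CREATE FUNCTION dbo.SplitString
(
	@List NVARCHAR(2000),
	@Delimiter NCHAR(1)
)
RETURNS TABLE
AS
RETURN
(
	SELECT
		-- Each element runs from its start position up to the next delimiter;
		-- appending a delimiter to the list closes off the final element.
		LTRIM(RTRIM(SUBSTRING(@List, n.Number,
			CHARINDEX(@Delimiter, @List + @Delimiter, n.Number) - n.Number))) AS OutParam
	FROM dbo.Numbers n
	WHERE n.Number BETWEEN 1 AND LEN(@List)
		-- A position starts an element when the character just before it
		-- is a delimiter (or when it is the first position in the list).
		AND SUBSTRING(@Delimiter + @List, n.Number, 1) = @Delimiter
)
```

Nothing in this sketch tracks whether a comma sits between apostrophes, and that missing state is exactly what the Tokenize function adds.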

The output I desire is:

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

Notice that substrings — delimited with apostrophes — should be maintained.

And here’s how I’ve solved this problem…

CREATE FUNCTION dbo.Tokenize
(
	@Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE 
	(
		TokenNum INT IDENTITY(1,1),
		Token NVARCHAR(2000)
	)
AS
BEGIN
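	DECLARE @i INT SET @i = 0 -- characters counted since the last token boundary (0 marks a token's last character)
	DECLARE @StartChar INT SET @StartChar = 1 -- tracks where the current token begins
	DECLARE @Quote INT SET @Quote = 0 -- apostrophes counted in the current token (odd = inside a quoted substring)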

	DECLARE @Chars TABLE 
	(
		CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
		TheChar CHAR(1), 
		TheCount INT,
		StartChar INT
	)

	-- Pad both ends with a delimiter so the first and last tokens are bounded like the rest
	SET @Input = ' , ' + @Input + ' , '
	
	-- Explode the padded input into one row per character, using the Numbers table
	INSERT @Chars (TheChar)
	SELECT SUBSTRING(@Input, n.Number, 1)
	FROM Numbers n
	WHERE n.Number > 0 
		AND n.Number <= LEN(@Input)
	ORDER BY n.Number
	
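	-- Walk the characters in CharNum order, looking one character ahead (Chars1),
	-- and record on each row its running count and its token's start position
	UPDATE Chars SET 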
		@i = Chars.TheCount = 
			CASE 
				WHEN Chars1.TheChar = ',' 
					AND @Quote % 2 = 0 THEN 0 
				ELSE @i + 1 
			END,
		@Quote = 
			CASE  
				WHEN Chars1.TheChar = '''' THEN @Quote + 1 
				WHEN @i = 0 THEN 0 
				ELSE @Quote 
			END,
		@StartChar = Chars.StartChar =
			CASE
				WHEN @i = 1 THEN Chars1.CharNum - 1
				WHEN @i = 0 THEN @StartChar + 1
				ELSE @StartChar
			END
	FROM @Chars Chars
	JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

	-- Emit one token per boundary row (TheCount = 0), plus whatever trails the last boundary
	INSERT @Tokens(Token)
	SELECT
		RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
	FROM (
		SELECT StartChar, CharNum
		FROM @Chars
		WHERE TheCount = 0

		UNION ALL

		SELECT 
			MAX
			(
				CASE TheCount 
					WHEN 0 THEN CharNum 
					ELSE 0 
				END
			) + 1, 
			MAX(CharNum)
		FROM @Chars
	) x
	WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
	ORDER BY x.StartChar
	RETURN
END

A word of warning: This UDF uses the undocumented (and unsupported) “aggregate update” functionality. The UPDATE carries variables from one row to the next, so it only gives the right answer if the rows are processed in CharNum order, and nothing guarantees that. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you not to use it in your own projects without extensive testing! MS doesn’t support this one, so handle with care.

And by the way, you need a numbers table to use this thing. Of course.
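
If you don't have a numbers table yet, here is a minimal sketch of one way to build it, assuming SQL Server 2005 or later (for ROW_NUMBER and sys.all_columns). The table and column names match what the function expects (Numbers and Number). The row count of 10000 is an arbitrary choice that comfortably covers the 2000-character input.

```sql
-- One row per integer, starting at 1; the function only needs 1 through LEN(@Input).
CREATE TABLE dbo.Numbers
(
	Number INT NOT NULL PRIMARY KEY
)

-- Cross joining a system catalog view with itself is just a convenient
-- source of "enough rows" to number; any sufficiently large row source would do.
INSERT dbo.Numbers (Number)
SELECT TOP (10000)
	ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a
CROSS JOIN sys.all_columns b
```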

As for using this thing, it’s pretty easy:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)


Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

… and it even appears to work properly!
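
The result table also carries the TokenNum identity column, so a single token can be picked out by position. A small sketch, assuming the function above and its Numbers table are installed:

```sql
DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

-- TokenNum is assigned by the INSERT ... ORDER BY x.StartChar inside the function,
-- so it should follow each token's left-to-right position in the input.
SELECT TokenNum, Token
FROM dbo.Tokenize(@Tokens)
WHERE TokenNum = 3
```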

Enjoy… an application for this, and for the other strange things I’ve been posting recently, is coming very, very soon.

Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

1 COMMENT

  1. Thank you for this. One note: if a token consists of nothing but a space, your script excludes this extra "token". In my situation I always needed to compare a specific token number, so I needed this empty token.
    These changes are not efficient, but they worked.
    I changed the INSERT section to use a CASE that compares the trimmed value against ''. If it matches, the CASE returns the untrimmed substring; otherwise it uses the LTRIM/RTRIM version.
    case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) = ''
    then SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)
     else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
    end
    Additionally, I had to change the WHERE clause because SQL thinks that '' = ' '.
    SUBSTRING(@Input, StartChar, CharNum - StartChar + 1) NOT LIKE ''
    AND
    RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT LIKE ','
