Tokenize UDF

March 10, 2005

668

Yes, another string splitting UDF from a guy who’s obvioiusly become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call, “Tokenization.”

So what is Tokenization? It’s a word I made up for this problem.

But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.

I think this is best illustrated with an example string:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

The basic split string function that you can find will produce the following output:

SELECT * 
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'

Well, that’s wrong. Because what I want to do is maintain the substrings (or, “tokens,” as I like to call them — thus, Tokenization!)

The output I desire is:

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

Notice that substrings — delimited with apostrophes — should be maintained.

And here’s how I’ve solved this problem…

CREATE FUNCTION dbo.Tokenize
(
	@Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE 
	(
		TokenNum INT IDENTITY(1,1),
		Token NVARCHAR(2000)
	)
AS
BEGIN
	DECLARE @i INT SET @i = 0
	DECLARE @StartChar INT SET @StartChar = 1
	DECLARE @Quote INT SET @Quote = 0	

	DECLARE @Chars TABLE 
	(
		CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
		TheChar CHAR(1), 
		TheCount INT,
		StartChar INT
	)

	SET @Input = ' , ' + @Input + ' , '
	
	INSERT @Chars (TheChar)
	SELECT SUBSTRING(@Input, n.Number, 1)
	FROM Numbers n
	WHERE n.Number > 0 
		AND n.Number <= LEN(@Input)
	ORDER BY n.Number
	
	UPDATE Chars SET 
		@i = Chars.TheCount = 
			CASE 
				WHEN Chars1.TheChar = ',' 
					AND @Quote % 2 = 0 THEN 0 
				ELSE @i + 1 
			END,
		@Quote = 
			CASE  
				WHEN Chars1.TheChar = '''' THEN @Quote + 1 
				WHEN @i = 0 THEN 0 
				ELSE @Quote 
			END,
		@StartChar = Chars.StartChar =
			CASE
				WHEN @i = 1 THEN Chars1.CharNum - 1
				WHEN @i = 0 THEN @StartChar + 1
				ELSE @StartChar
			END
	FROM @Chars Chars
	JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

	INSERT @Tokens(Token)
	SELECT
		RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
	FROM (
		SELECT StartChar, CharNum
		FROM @Chars
		WHERE TheCount = 0

		UNION ALL

		SELECT 
			MAX
			(
				CASE TheCount 
					WHEN 0 THEN CharNum 
					ELSE 0 
				END
			) + 1, 
			MAX(CharNum)
		FROM @Chars
	) x
	WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
	ORDER BY x.StartChar
	RETURN
END

A word of warning: This UDF uses the undocumented — and unsupported — “aggregate update” functionality. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn’t support this one, so handle with care.

And by the way, you need a numbers table to use this thing. Of course.

As for using this thing, it’s pretty easy:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)


Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

… and it even appears to work properly!

Enjoy… and application for this and other strange things I’ve been posting recently coming very, very soon.

1 COMMENT

Jeremy Swartwood May 29, 2013 At 11:39 pm

Thank you for this. To note, if there is a token and nothing but a space, your script excludes this extra "token". In my situation I needed to always compare a specific token number so I needed this empty token.
These changes are not efficient, but they worked.
I changed the INSERT section to use a CASE instead that compared against ” and then it didn’t use the LTREM/RTRIM else it used the trim.
case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) = ”
then SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)
else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)))
end
Additionally, I had to change the WHERE clause because SQL thinks that ” = ‘ ‘.
SUBSTRING(@Input, StartChar, CharNum – StartChar + 1) NOT LIKE ”
AND
RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) NOT LIKE ‘,’

Comments are closed.

Tokenize UDF

1 COMMENT

Popular Posts

Looking Forward 100 Months (T-SQL Tuesday #100): The Roundup

T-SQL Tuesday #200 (v1.0): Last of the DBAs

Invitation: T-SQL Tuesday #100 – Looking Forward 100 Months

The Career Influence of Usenet Strangers (T-SQL Tuesday #096)

Playing the Third-Party Recruiter Game (T-SQL Tuesday #093)

The SQL Hall of Shame

The Guru (T-SQL Tuesday #089)

Solving the Net Changes Problem with Temporal Tables (T-SQL Tuesday #087)