Home Uncategorized Counting occurrences of a substring within a string

Counting occurrences of a substring within a string

22

I have absolutely no idea why anyone wants to do this, but I keep answering the same question in forums: “How do I count the occurrences of a substring [note: usually comma] within a string?”

In an effort to thwart carpal tunnel syndrome, I have created the Ultimate Substring Occurrence Counting UDF.

… And here it is:

CREATE FUNCTION dbo.GetSubstringCount
(
	@InputString TEXT, 
	@SubString VARCHAR(200),
	@NoisePattern VARCHAR(20)
)
RETURNS INT
WITH SCHEMABINDING
AS
BEGIN
	RETURN 
	(
		SELECT COUNT(*)
		FROM dbo.Numbers N
		WHERE
			SUBSTRING(@InputString, N.Number, LEN(@SubString)) = @SubString
			AND PATINDEX(@NoisePattern, SUBSTRING(@InputString, N.Number + LEN(@SubString), 1)) = 0
			AND 0 = 
				CASE 
					WHEN @NoisePattern = '' THEN 0
					ELSE PATINDEX(@NoisePattern, SUBSTRING(@InputString, N.Number - 1, 1))
				END
	)
END

First note: You need (regular readers, you guessed it) a numbers table.

Okay, so what’s it do? Simply put, it returns the number of times @SubString appears within @InputString. But wait! — Act now and you will receive an additional bonus feature at no extra cost! Can you feel the love?

The @NoisePattern parameter allows the user to put the UDF into “exact match” mode.

For instance, let’s say you have a big string containing some text about automobile manufacturers, and for some reason (again, I have no clue why people need this functionality — fill me in if you do!) you want to count the number of occurrences of the word “auto”, but not the number of occurrences of other forms of the word, e.g. “automobile” or …. some word that ends in “auto” (if such a word exists).

By specifying a pattern for @NoisePattern of characters that shouldn’t be adjacent to your word, you’re telling the UDF that any other characters are safe. Leaving the parameter empty means that all occurrences of the substring will be counted. Examples:

SELECT dbo.GetSubstringCount('We like the autos, the autos that go boom.', 'auto', '')
-- Returns 2

SELECT dbo.GetSubstringCount('Autos are fun.  I like to drive my auto.', 'auto', '')
-- Also returns 2

SELECT dbo.GetSubstringCount('Autos are fun.  I like to drive my auto.', 'auto', '%[a-z]%')
-- Only returns 1 -- The exact match must not have adjacent alphabetic characters

SHARE
Previous articleSplitting a string of unlimited length
Next articleValidate a URL from SQL Server
Adam Machanic helps companies get the most out of their SQL Server databases. He creates solid architectural foundations for high performance databases and is author of the award-winning SQL Server monitoring stored procedure, sp_WhoIsActive. Adam has contributed to numerous books on SQL Server development. A long-time Microsoft MVP for SQL Server, he speaks and trains at IT conferences across North America and Europe.

22 COMMENTS

  1. I am looking for this functionality, however I would prefer to do it without a Numbers table.
    My need is that of a search engine sproc.  I need to be able to weight the returned records and one of the items in the weighting system is to check how many times the search pattern occurs in the record.

  2. I need this too. I have a column that basically contains sql and I need to pull out field_ids from it, but each row has a different number of field_ids in the sql. Is there a different(better) solution?

  3. I don’t know why you would want to do it without the numbers table.  It makes the solution an order of magnitude faster.

  4. Hi Adam, I have tried this code but even the included examples do not work!
    SELECT dbo.GetSubstringCount(‘We like the autos, the autos that go boom.’, ‘auto’, ”)
    — returns 0   <– should be 2
    SELECT dbo.GetSubstringCount(‘Autos are fun.  I like to drive my auto.’, ‘auto’, ”)
    — returns 0   <– should be 2
    SELECT dbo.GetSubstringCount(‘Autos are fun.  I like to drive my auto.’, ‘auto’, ‘%[a-z]%’)
    — returns 1   <– should be 1
    Is there a SQL setting / version dependency that O could be missing?
    Craig

  5. Hi Craig,
    Thanks for pointing this out.  It is indeed a version issue.  I wrote this back in the bad old days of SQL Server 2000, and apparently there was a change to the way PATINDEX works between 2000 and 2005. I just tested:
    SELECT PATINDEX(”, ‘ ‘)
    SQL Server 2000 returns 0, whereas SQL Server 2005 returns 1.  This is breaking the third predicate in the WHERE clause within the function, which checks to see if the target string is prepended by anything that matches the input pattern.
    I’ll have to think about how to fix this, but as a temporary workaround if you don’t want to use exact-match mode, you could pass in some character that you know can’t possibly exist in the target string:
    SELECT dbo.GetSubstringCount(‘We like the autos, the autos that go boom.’, ‘auto’, CHAR(255))
    –Returns 2, even in SQL Server 2005 or 2008
    In the meantime, I’m wondering if this "new" behavior makes sense?  I’m not sure, but I’m leaning towards SQL Server 2000’s answer.  An empty pattern shouldn’t, in my opinion, match on anything at all…

  6. Funmarkaz, yes, it should work with a bit of modification; MySQL’s CREATE FUNCTION syntax isn’t quite the same as SQL Server’s.

  7. I found this post because I was searching for a function for  a project where I need to find out how many delimiters exist in a string.  This implementation does not require a numbers table or anything other than the function itself.  There is no "additional bonus" feature in this version but it will count occurences without any additional db objects. Happy Querying!
    CREATE FUNCTION dbo.GetSubStringCount (
          @InputString NVARCHAR(4000),
          @SearchString VARCHAR(255)
    )RETURNS INT WITH SCHEMABINDING AS
    BEGIN
          DECLARE @occurences AS BIGINT
                 ,@position AS BIGINT
          SET @occurences = 0
          SET @position = 0
          WHILE @position < LEN(@InputString)
          BEGIN
                IF CHARINDEX(@SearchString, @InputString, @position) > 0
                BEGIN
                   SET @occurences = @occurences + 1
                   SET @position = CHARINDEX(@SearchString, @InputString, @position)
                END
                SET @position = @position+1
          END
          RETURN @occurences
    END

  8. Mike,
    Thanks for sharing. A simpler and more efficient way to solve the problem (if you don’t want the “bonus” feature) is to do:
    SELECT LEN(@InputString) – REPLACE(@InputString, @SearchString, ”)
    I suspect that the numbers table will provide better performance than a WHILE loop, and both will be less efficient than the above solution, but I’ll leave that testing as an exercise for anyone interested in taking this a bit further. Even better would be to inline the Numbers table version (search my blog for my post on that topic), and a SQLCLR solution would probably be fastest of all. I would personally definitely keep the “bonus” around as it’s been quite useful in a few projects I’ve worked on.

  9. My reason for wanting this functionality: to count line breaks in sys.syscomments in order to measure how many lines of T-SQL there are in the project. The bonus feature will allow blank lines to be excluded from the count.

  10. Adam, thanks for posting this.   Very handy use of the numbers table.  
    I’m troubleshooting a system that is experiencing tempdb meta-data contention and trying to identify queries that are creating temp objects within ad-hoc sql statements.  I have traces of this activity and am using this to identify the ‘worst offenders’. Initially I was just counting each query where TextData was like ‘%table%’ until I found that some batches created 10-15 table variables, so I needed a way to count them.
    Works great.  thx.

  11. Adam,
    Is there a modification I can make to get:
    SELECT dbo.GetSubstringCount(‘ROUTINE MEDICAL EXAMINATION’, ‘NORMAL ROUTINE HISTORY AND PHYSICAL’, ‘%[a-z]%’)
    to return 1 (ROUTINE)
    and
    SELECT dbo.GetSubstringCount(‘REFLUX, ESOPHAGEAL’, ‘ESOPHAGEAL REFLUX’, ‘%[a-z]%’)
    to return 2 (ESOPHAGEAL & REFLUX)

  12. Ron: Sure, it’s doable, but it would be a completely different function. What you want to do is split both strings on any non-alpha character, then intersect the results. You can search my blog for a string splitter (the CLR version would probably be best), and then just use the INTERSECT operator to get your final answer.
    –Adam

  13. I’m not sure if it is faster or slower, but how about this:  a single line, no external tables needed.
    String: ‘The dog and the cat broke the plate before the farmer got home’
    find the length of the string (63)
    upshift the entire string (and the search string), and replace the search string with nothing (”). Find the length of that new string (51)
    Subtract the new length from the old length and divide by the length of the search string. The result will be the number of occurrences of the search string in the original line.
    select
    length(‘The cat and the dog broke the plate before the farmer got home.’) as orig_length,
    length(replace(upper(‘The cat and the dog broke the plate before the farmer got home.’),upper(‘the’),”)) as newlength
    ,length(‘the’) as substringlength
    ,(length(‘The cat and the dog broke the plate before the farmer got home.’) – length(replace(upper(‘The cat and the dog broke the plate before the farmer got home.’),upper(‘the’),”)))/length(‘the’) as NumberOfOccurrences
    from dual
    or with a table that has a line of chars in colC, and you need to search for ‘x’:
    select (length(colC) – length(replace(upper(colC),upper(‘x’)))/length(‘x’) as countByLine
    from myTable;

  14. oops, looks like Adam mentioned this in one of the posts already.  sorry to duplicate your solution.   I was interested in having the count so when we were lookign for a variable name throughout our code, and we were looking to see how it was used, we wouldn’t need to walk through the entire line if it only existed once, or we would know to look for the extra occurrences.  (We also bolded each occurrence so they would stand out, but the occurrence counter in addition to the bold helped out).

  15. SELECT dbo.GetSubstringCount(‘We like the autos, the autos that go boom.’, ‘auto’, ”)
    Does not work. Try searching "the", it returns 0 occurrences

  16. best ive ever seen.
    ALTER FUNCTION [CountSubStrings]
    ( @String VARCHAR(8000), @SubString VARCHAR(100) )
    RETURNS INT
    BEGIN
       RETURN (LEN(@String) –
               LEN(REPLACE(@String, @SubString, ”))) /
               LEN(@SubString)
    END

LEAVE A REPLY

Please enter your comment!
Please enter your name here