I have a query that takes about 3 hours to run on our server, and it doesn't take advantage of parallel processing (about 1.15 million records in dbo.Deidentified, 300 records in dbo.NamesMultiWord). The server has access to 8 cores.
UPDATE dbo.Deidentified
WITH (TABLOCK)
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml),
DE461 = dbo.ReplaceMultiWord(DE461),
DE87 = dbo.ReplaceMultiWord(DE87),
DE15 = dbo.ReplaceMultiWord(DE15)
WHERE InProcess = 1;
and ReplaceMultiWord is a scalar user-defined function defined as:
SELECT @body = REPLACE(@body,Names,Replacement)
FROM dbo.NamesMultiWord
ORDER BY [WordLength] DESC
RETURN @body --NVARCHAR(MAX)
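In full, the function is shaped roughly like this (a sketch; the parameter name is taken from the fragment above):
CREATE FUNCTION dbo.ReplaceMultiWord (@body NVARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SELECT @body = REPLACE(@body, Names, Replacement)
    FROM dbo.NamesMultiWord
    ORDER BY [WordLength] DESC;

    RETURN @body;
END;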
Is the call to ReplaceMultiWord preventing a parallel plan from being formed? Is there a way to rewrite this to allow parallelism?
ReplaceMultiWord applies the replacements in descending order of length because some of the replacements are short versions of others, and I want the longest match to succeed.
For example, there may be an entry for 'George Washington University' and another for 'Washington University'. If the 'Washington University' match ran first, 'George' would be left behind.
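To illustrate with a made-up replacement value:
DECLARE @s NVARCHAR(100) = N'George Washington University';

-- shorter match applied first leaves a fragment behind
SELECT REPLACE(@s, N'Washington University', N'[NAME]');         -- George [NAME]

-- longest match applied first gives the intended result
SELECT REPLACE(@s, N'George Washington University', N'[NAME]');  -- [NAME]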
Technically I can use CLR, I'm just not familiar with how to do so.
2 Answers
The UDF is preventing parallelism. It is also causing that spool.
You could use CLR and a compiled regex to do your search and replace. It doesn't block parallelism as long as the required attributes are present, and it will likely be significantly faster than performing 300 T-SQL REPLACE operations per function call.
Example code is below.
DECLARE @X XML =
(
SELECT Names AS [@find],
Replacement AS [@replace]
FROM dbo.NamesMultiWord
ORDER BY [WordLength] DESC
FOR XML PATH('x'), ROOT('spec')
);
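The resulting @X document ends up shaped like this (values invented for illustration):
<spec>
  <x find="George Washington University" replace="[NAME]" />
  <x find="Washington University" replace="[NAME]" />
  <!-- ... one x element per row of dbo.NamesMultiWord ... -->
</spec>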
UPDATE dbo.Deidentified WITH (TABLOCK)
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X),
DE461 = dbo.ReplaceMultiWord(DE461, @X),
DE87 = dbo.ReplaceMultiWord(DE87, @X),
DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1;
This depends on the existence of a CLR UDF as below. The DataAccessKind.None should mean the spool disappears as well, since the spool is there for Halloween protection and isn't needed when the function doesn't access the target table.
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections.Generic;
using System.Xml;
public partial class UserDefinedFunctions
{
//TODO: Concurrency?
private static readonly Dictionary<string, ReplaceSpecification> cachedSpecs =
new Dictionary<string, ReplaceSpecification>();
[SqlFunction(IsDeterministic = true,
IsPrecise = true,
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None)]
public static SqlString ReplaceMultiWord(SqlString inputString, SqlXml replacementSpec)
{
//TODO: Implement something to drop things from the cache and use a shorter key.
string s = replacementSpec.Value;
ReplaceSpecification rs;
if (!cachedSpecs.TryGetValue(s, out rs))
{
var doc = new XmlDocument();
doc.LoadXml(s);
rs = new ReplaceSpecification(doc);
cachedSpecs[s] = rs;
}
string result = rs.GetResult(inputString.ToString());
return new SqlString(result);
}
internal class ReplaceSpecification
{
internal ReplaceSpecification(XmlDocument doc)
{
Replacements = new Dictionary<string, string>();
XmlElement root = doc.DocumentElement;
XmlNodeList nodes = root.SelectNodes("x");
string pattern = null;
foreach (XmlNode node in nodes)
{
if (pattern != null)
pattern = pattern + "|";
string find = node.Attributes["find"].Value.ToLowerInvariant();
string replace = node.Attributes["replace"].Value;
//TODO: Escape any special characters in the regex syntax
pattern = pattern + find;
Replacements[find] = replace;
}
if (pattern != null)
{
pattern = "(?:" + pattern + ")";
Regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
}
}
private Regex Regex { get; set; }
private Dictionary<string, string> Replacements { get; set; }
internal string GetResult(string inputString)
{
if (Regex == null)
return inputString;
return Regex.Replace(inputString,
(Match m) =>
{
string s;
if (Replacements.TryGetValue(m.Value.ToLowerInvariant(), out s))
{
return s;
}
else
{
throw new Exception("Missing replacement definition for " + m.Value);
}
});
}
}
}
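To make the function callable from T-SQL, the compiled assembly has to be catalogued and a wrapper function bound to it, along these lines (the assembly name and DLL path below are placeholders):
CREATE ASSEMBLY MultiWordReplace
FROM 'C:\CLR\MultiWordReplace.dll'  -- placeholder path to the compiled DLL
WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.ReplaceMultiWord
(
    @inputString     NVARCHAR(MAX),
    @replacementSpec XML
)
RETURNS NVARCHAR(MAX)
AS EXTERNAL NAME MultiWordReplace.UserDefinedFunctions.ReplaceMultiWord;
GO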
- I just benchmarked this. Using the same table and contents for each, the CLR took 3:03.51 to process the 1,174,731 rows, and the UDF took 3:16.21. It did save time. In my casual reading, it looks like SQL Server is loath to parallelize UPDATE queries. – rsjaffe (Mar 23, 2017)
- @rsjaffe Disappointing. I would have hoped for a much better result than that. What is the size of data involved? (Sum of data length of all affected columns.) – Martin Smith (Mar 25, 2017)
- 608 million characters, 1.216 GB, format is NVARCHAR. I was thinking of adding a WHERE clause using a test for a match with the regex, as most of the writes are unnecessary (the density of 'hits' should be low), but my C# skills (I'm a C++ guy) didn't get me there. I was thinking along the lines of a function public static SqlBoolean CanReplaceMultiWord(SqlString inputString, SqlXml replacementSpec) that would return Regex.IsMatch(inputString.ToString()); but I get errors on that return statement, such as 'System.Text.RegularExpressions.Regex is a type but is used like a variable'. – rsjaffe (Mar 27, 2017)
Bottom line: adding criteria to the WHERE clause and splitting the query into four separate queries, one for each field, allowed SQL Server to produce a parallel plan and made the query run 4x as fast as it had without the extra test in the WHERE clause. Splitting the queries into four without the test didn't do that; neither did adding the test without splitting the queries. Optimizing the test reduced the total run time to 3 minutes (from the original 3 hours).
My original UDF took 3 hours 16 minutes to process 1,174,731 rows, with 1.216 GB of NVARCHAR data tested. Using the CLR provided by Martin Smith in his answer, the execution plan was still not parallel and the task took 3 hours and 5 minutes. [Execution plan screenshot: CLR, not parallel]
Having read that WHERE criteria could help push an UPDATE to a parallel plan, I did the following. I added a function to the CLR module to see whether a field had a match to the regex:
[SqlFunction(IsDeterministic = true,
IsPrecise = true,
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None)]
public static SqlBoolean CanReplaceMultiWord(SqlString inputString, SqlXml replacementSpec)
{
string s = replacementSpec.Value;
ReplaceSpecification rs;
if (!cachedSpecs.TryGetValue(s, out rs))
{
var doc = new XmlDocument();
doc.LoadXml(s);
rs = new ReplaceSpecification(doc);
cachedSpecs[s] = rs;
}
return rs.IsMatch(inputString.ToString());
}
and, in the internal class ReplaceSpecification, I added the code to execute the test against the regex:
internal bool IsMatch(string inputString)
{
if (Regex == null)
return false;
return Regex.IsMatch(inputString);
}
If all the fields are tested in a single statement, SQL Server does not parallelize the work:
UPDATE dbo.DeidentifiedTest
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X),
DE461 = dbo.ReplaceMultiWord(DE461, @X),
DE87 = dbo.ReplaceMultiWord(DE87, @X),
DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1
AND (dbo.CanReplaceMultiWord(IndexedXml, @X) = 1
OR dbo.CanReplaceMultiWord(DE461, @X) = 1
OR dbo.CanReplaceMultiWord(DE87, @X) = 1
OR dbo.CanReplaceMultiWord(DE15, @X) = 1);
Time to execute: over 4 1/2 hours and still running. [Execution plan screenshot: test added, single statement]
However, if the fields are split into separate statements, a parallel plan is used, and my CPU usage goes from 12% with the serial plans to 100% with the parallel plans (8 cores).
UPDATE dbo.DeidentifiedTest
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(IndexedXml, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE461 = dbo.ReplaceMultiWord(DE461, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE461, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE87 = dbo.ReplaceMultiWord(DE87, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE87, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE15, @X) = 1;
Time to execute: 46 minutes. Row statistics showed that about 0.5% of the records had at least one regex match. [Execution plan screenshot]
Now the main drag on time was the WHERE clause. I then replaced the regex test in the WHERE clause with the Aho-Corasick algorithm implemented as a CLR function. This reduced total time to 3 minutes 6 seconds.
This required the following changes: load the assembly and functions for the Aho-Corasick algorithm, then change the WHERE clause to:
WHERE InProcess = 1 AND dbo.ContainsWordsByObject(ISNULL(FieldBeingTestedGoesHere,'x'), @ac) = 1;
And add the following before the first UPDATE:
DECLARE @ac NVARCHAR(32);
SET @ac = dbo.CreateAhoCorasick(
(SELECT NAMES FROM dbo.NamesMultiWord FOR XML RAW, root('root')),
'en-us:i'
);
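With that in place, the first of the four UPDATE statements becomes something like this (the other three follow the same pattern):
UPDATE dbo.DeidentifiedTest
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X)
WHERE InProcess = 1
AND dbo.ContainsWordsByObject(ISNULL(IndexedXml, 'x'), @ac) = 1;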
Note that the SELECT @var = REPLACE ... ORDER BY construction is not guaranteed to work as you expect (see the example Connect item and the response from Microsoft). So switching to SQLCLR has the added advantage of guaranteeing correct results, which is always nice.