We're using CDC to capture changes made to a production table. The changed rows are being exported to a data warehouse (Informatica). I know that the __$update_mask column stores which columns were updated, in varbinary form. I also know that I can use a variety of CDC functions to find out from that mask what those columns were.
My question is this: can anyone define the logic behind that mask so that we can identify, over in the warehouse, which columns were changed? Since we're processing outside of the server we don't have easy access to those MSSQL CDC functions. I would rather just break down the mask myself in code. Performance of the CDC functions on the SQL end is problematic for this solution.
In short, I'd like to identify changed columns by hand from the __$update_mask field.
Update:
As an alternative, sending a human-readable list of changed columns over to the warehouse was also acceptable. We found this could be done with far better performance than our original approach.
The CLR answer to this question below meets this alternative and includes details of interpreting the mask for future visitors. However, the accepted answer using XML PATH is the fastest yet for the same final result.
- stackoverflow.com/questions/14607325/… ? – Jon Seigel, Mar 5, 2013 at 3:04
2 Answers
And the moral of the story is... test, try other things, think big, then small, always assume there is a better way.
As scientifically interesting as my last answer was, I decided to try one other approach. I remembered I could concatenate with the XML PATH('') trick. Since I knew from the previous answer how to get the ordinal of each changed column from the captured_columns list, I thought it would be worth testing whether the MS bit function would work better that way for what we needed.
SELECT  __$update_mask ,
        ( SELECT  CC.column_name + ','
          FROM    cdc.captured_columns CC
                  INNER JOIN cdc.change_tables CT ON CC.[object_id] = CT.[object_id]
          WHERE   CT.capture_instance = 'dbo_OurTableName'
                  AND sys.fn_cdc_is_bit_set(CC.column_ordinal,
                                            PD.__$update_mask) = 1
          FOR
            XML PATH('')
        ) AS changedcolumns
FROM    cdc.dbo_OurTableName_CT PD  -- change table: capture instance name plus _CT
It's way cleaner than (though not as fun as) all that CLR, and it returns the approach to native SQL code only. And, drum roll... it returns the same results in less than a second. Since the production data is 100 times bigger, every second counts.
I'm leaving the other answer up for scientific purposes - but for now, this is our correct answer.
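On the receiving side, the changedcolumns value arrives as a comma-terminated string such as Name,Phone, (note the trailing comma). Here is a minimal Python sketch of splitting it into a list. The function name is hypothetical, and it assumes column names contain no commas and that a row with no flagged columns arrives as None or an empty string.

```python
from typing import List, Optional


def parse_changed_columns(value: Optional[str]) -> List[str]:
    """Split the comma-terminated list from the query into column names."""
    if not value:
        # None (SQL NULL) or an empty string: no columns were flagged.
        return []
    # Dropping empty pieces removes the entry left by the trailing comma.
    return [name for name in value.split(",") if name]


print(parse_changed_columns("Name,Phone,"))  # ['Name', 'Phone']
```

Filtering out empty pieces rather than slicing off the last character keeps the helper safe if the trailing comma is ever trimmed upstream.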
- Append _CT to table name in the FROM clause. – Chris Morley, Dec 12, 2014 at 21:04
- Thanks for coming back and answering this. I'm looking for a very similar solution so we can filter it accordingly within the code once a SQL call has been done. I don't fancy doing a call for every column on every row returned from CDC! – nik0lai, Dec 5, 2016 at 15:27
So, after some research we decided to still do this on the SQL side before handing off to the data warehouse. But we're taking this much improved approach (based on our needs and new understanding of how the mask works).
We get a list of the column names and their ordinal positions with this query. The result comes back as XML so that we can pass it off to SQL CLR.
DECLARE @colListXML varchar(max);
SET @colListXML = (SELECT column_name, column_ordinal
                   FROM cdc.captured_columns
                        INNER JOIN cdc.change_tables
                            ON captured_columns.[object_id] = change_tables.[object_id]
                   WHERE capture_instance = 'dbo_OurTableName'
                   FOR XML AUTO);
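For anyone consuming this fragment outside of SQL CLR, here is a rough Python sketch of turning it into an ordinal-to-name lookup. It assumes the fragment uses the element and attribute names shown in the CLR code comment further down (cdc.captured_columns with column_name and column_ordinal). The name parse_column_list is hypothetical. As in the CLR code, the fragment is wrapped in a root element first, because it has no single root of its own.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def parse_column_list(fragment: str) -> Dict[int, str]:
    """Map column_ordinal -> column_name from a rootless XML fragment."""
    # Wrap the sibling elements in a root node so the text parses as one document.
    root = ET.fromstring("<columns>" + fragment + "</columns>")
    return {
        int(node.get("column_ordinal")): node.get("column_name")
        for node in root
    }


sample = (
    '<cdc.captured_columns column_name="Column1" column_ordinal="1" />'
    '<cdc.captured_columns column_name="Column2" column_ordinal="2" />'
)
print(parse_column_list(sample))  # {1: 'Column1', 2: 'Column2'}
```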
We then pass that XML block as a variable, along with the mask field, to a CLR function that returns a comma-delimited string of the columns that changed per the __$update_mask binary field. This CLR function interrogates the mask field for the change bit of each column in the XML list and then returns its name from the related ordinal.
SELECT cdc.udf_clr_ChangedColumns(@colListXML,
           CAST(__$update_mask AS VARCHAR(MAX))) AS changed
FROM cdc.dbo_OurCaptureTableName
WHERE __$update_mask IS NOT NULL;
The C# CLR code, compiled into an assembly called CDCUtilities, looks like this:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString udf_clr_cdcChangedColumns(string columnListXML, string updateMaskString)
    {
        /* xml of column ordinals shall be formatted as follows:
           <cdc.captured_columns column_name="Column1" column_ordinal="1" />
           <cdc.captured_columns column_name="Column2" column_ordinal="2" />
        */
        /* note: ASCIIEncoding maps any character above 0x7F to '?', so mask bytes
           with the high bit set may not survive this conversion */
        System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
        byte[] updateMask = encoding.GetBytes(updateMaskString);
        string columnList = "";
        System.Xml.XmlDocument colList = new System.Xml.XmlDocument();
        colList.LoadXml("<columns>" + columnListXML + "</columns>"); /* generate xml with root node */
        /* append the name of every column whose bit is set in the mask */
        for (int i = 0; i < colList["columns"].ChildNodes.Count; i++)
        {
            if (columnChanged(updateMask, int.Parse(colList["columns"].ChildNodes[i].Attributes["column_ordinal"].Value)))
            {
                columnList += colList["columns"].ChildNodes[i].Attributes["column_name"].Value + ",";
            }
        }
        if (columnList.LastIndexOf(',') > 0)
        {
            columnList = columnList.Remove(columnList.LastIndexOf(',')); /* get rid of trailing comma */
        }
        return columnList; /* return the comma separated list of columns that changed */
    }
    private static bool columnChanged(byte[] updateMask, int colOrdinal)
    {
        unchecked
        {
            /* ordinals are 1-based; the last byte of the mask holds ordinals 1-8,
               the byte before it holds 9-16, and so on */
            byte relevantByte = updateMask[(updateMask.Length - 1) - ((colOrdinal - 1) / 8)];
            /* within that byte, the lowest bit belongs to the lowest ordinal */
            int bitMask = 1 << ((colOrdinal - 1) % 8);
            var hasChanged = (relevantByte & bitMask) != 0;
            return hasChanged;
        }
    }
}
And the SQL function that maps to the CLR method looks like this:
CREATE FUNCTION [cdc].[udf_clr_ChangedColumns]
(@columnListXML [nvarchar](max), @updateMask [nvarchar](max))
RETURNS [nvarchar](max) WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [CDCUtilities].[UserDefinedFunctions].[udf_clr_cdcChangedColumns]
We then append this column list to the rowset and pass it off to the data warehouse for analysis. By using the query and the CLR we avoid having to use two function calls per row per change. We can skip right to the meat with results customized for our change capture instance.
Thanks to the Stack Overflow post suggested by Jon Seigel for the way to interpret the mask.
In our experience with this approach we are able to get a list of all changed columns from 10k CDC rows in under 3 seconds.
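For future visitors who do want to break the mask down by hand outside SQL Server, as the question originally asked, the bit logic in columnChanged above can be sketched in Python roughly like this. The function names and the sample column map are hypothetical. The sketch assumes the mask reaches your code as raw bytes or as 0x-prefixed hex text, and that the ordinals come from cdc.captured_columns. Treating an ordinal that falls outside the mask as unchanged is my assumption, not something CDC documents.

```python
from typing import Dict, List


def mask_from_hex(text: str) -> bytes:
    """Convert hex text such as '0x0A' to bytes (assumed export format)."""
    return bytes.fromhex(text[2:] if text.lower().startswith("0x") else text)


def column_changed(update_mask: bytes, col_ordinal: int) -> bool:
    """Return True if the bit for a 1-based column ordinal is set."""
    byte_from_right = (col_ordinal - 1) // 8  # last byte holds ordinals 1-8
    if byte_from_right >= len(update_mask):
        return False  # assumption: ordinal outside the mask means unchanged
    relevant_byte = update_mask[len(update_mask) - 1 - byte_from_right]
    bit_mask = 1 << ((col_ordinal - 1) % 8)   # lowest bit = lowest ordinal
    return (relevant_byte & bit_mask) != 0


def changed_columns(update_mask: bytes, columns: Dict[int, str]) -> List[str]:
    """Names of the changed columns, in ordinal order."""
    return [name for ordinal, name in sorted(columns.items())
            if column_changed(update_mask, ordinal)]


# Hypothetical column map: ordinal -> name.
columns = {1: "Id", 2: "Name", 3: "Email", 4: "Phone"}
# 0x0A is binary 00001010, so the bits for ordinals 2 and 4 are set.
print(changed_columns(mask_from_hex("0x0A"), columns))  # ['Name', 'Phone']
```

With a two-byte mask such as 0x0180, the rightmost byte (0x80) flags ordinal 8 and the byte to its left (0x01) flags ordinal 9, which is why the byte index is counted from the end of the mask.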
- Thanks for returning with a solution, I might have use for that soon. – Mark Storey-Smith, Mar 5, 2013 at 21:04
- Check out my NEW answer before you do. As cool as the CLR is... we found an even better way. Good luck. – RThomas, Mar 7, 2013 at 18:33