Load and parse Stack Exchange data dump XML into DB table

Question 1

I have put together a stored procedure to load and parse the Stack Exchange Data Dump into a relational database (akin to Stack Exchange Data Explorer). Each site has 8 XML files like these:

xml-files

The stored procedure below performs the following steps:

Fetch the Badges.xml file for the target site from the local file system
Load the XML document into the database
Parse the XML document <row> notes and populate the destination table with each attribute in its own column

I wrote this for Badges data, but I have to apply the same logic for all 8 types of XML data, so I would like to make this procedure as good as possible before I apply its model to processing the other XML files.

The (very simple) structure of the Badges.xml files is as follows:

<?xml version="1.0" encoding="utf-8"?>
<badges>
 <row Id="1" UserId="2" Name="Autobiographer" Date="2011-01-19T20:52:02.027" Class="3" TagBased="False" />
 <row Id="2" UserId="4" Name="Autobiographer" Date="2011-01-19T20:57:02.100" Class="3" TagBased="False" />
 <row Id="3" UserId="6" Name="Autobiographer" Date="2011-01-19T20:57:02.133" Class="3" TagBased="False" />
 ...
 <row Id="176685" UserId="99330" Name="Supporter" Date="2016-03-06T03:34:14.827" Class="3" TagBased="False" />
</badges>

Tables

The following 3 tables are used in conjunction with the procedure:

CREATE TABLE RawDataXml.Badges (
 SiteId UNIQUEIDENTIFIER PRIMARY KEY,
 ApiSiteParameter NVARCHAR(256) NOT NULL,
 RawDataXml XML NULL,
 XmlDataSize BIGINT NULL,
 Inserted DATETIME2 DEFAULT GETDATE(),
 CONSTRAINT fk_Badges_SiteId FOREIGN KEY (SiteId) REFERENCES CleanData.Sites(Id)
);
CREATE TABLE CleanData.Badges (
 SiteId UNIQUEIDENTIFIER NOT NULL,
 ApiSiteParameter NVARCHAR(256) NOT NULL,
 RowId INT,
 UserId INT,
 Name NVARCHAR(256),
 CreationDate DATETIME2,
 Class INT,
 TagBased BIT,
 Inserted DATETIME2 DEFAULT GETDATE(),
 CONSTRAINT fk_Badges_SiteId FOREIGN KEY (SiteId) REFERENCES CleanData.Sites(Id)
);
CREATE TABLE RawDataXml.Globals (
 Parameter NVARCHAR(256) NOT NULL,
 Value NVARCHAR(256) NOT NULL,
 Inserted DATETIME2 DEFAULT GETDATE()
);

The RawDataXml.Globals table contains values such as these. The TargetSite values are meant to be used to run the procedure with a cursor iterating each of the sites (will show an example at the end).

Parameter Value
SourcePath D:\Downloads\stackexchange\
TargetSite codereview.stackexchange.com
TargetSite meta.codereview.stackexchange.com
TargetSite stats.stackexchange.com
TargetSite meta.stats.stackexchange.com

The procedure

This is the CREATE PROCEDURE statement. I added comments throughout to hopefully make it easy to understand and maintain.

IF EXISTS (
 SELECT 1 
 FROM INFORMATION_SCHEMA.ROUTINES
 WHERE SPECIFIC_SCHEMA = 'RawDataXml'
 AND SPECIFIC_NAME = 'usp_LoadBadgesXml'
)
DROP PROCEDURE RawDataXml.usp_LoadBadgesXml;
GO
CREATE PROCEDURE RawDataXml.usp_LoadBadgesXml
 @SiteDirectory NVARCHAR(256),
 -- Delete the loaded XML file after processing if True/1 (default True):
 @DeleteXmlRawDataAfterProcessing BIT = 1,
 -- Display/Return results to caller if @ReturnRows is set to True (default False)
 @ReturnRows BIT = 0
AS 
BEGIN
 SET NOCOUNT ON;
 -- Fetch global source path parameter:
 DECLARE @SourcePath NVARCHAR(256);
 DECLARE @bslash CHAR = CHAR(92);
 SET @SourcePath = (SELECT Value FROM RawDataXml.Globals WHERE Parameter = 'SourcePath');
 -- Make sure path ends with backslash (ASCII char 92)
 IF(SELECT RIGHT(@SourcePath, 1)) <> @bslash SET @SourcePath += @bslash;
 -- Fetch site identifiers based on @SiteDirectory parameter:
 DECLARE @SiteId UNIQUEIDENTIFIER;
 DECLARE @ApiSiteParameter NVARCHAR(256);
 SELECT 
 @SiteId = Id, 
 @ApiSiteParameter = ApiSiteParameter
 FROM CleanData.Sites
 WHERE SiteDirectory = @SiteDirectory;
 -- Throw error if @SiteDirectory parameter does not match an existing site:
 IF @SiteId IS NULL OR @ApiSiteParameter IS NULL
 BEGIN
 DECLARE @ErrMsg NVARCHAR(512) = 'The input site directory "' + @SiteDirectory + '" could not be matched to an existing site. Please verify and try again.';
 RAISERROR(@ErrMsg, 11, 1);
 END 
 -- Delete any previous XML data that may be present for the site:
 DELETE FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
 /** XML FILE HANDLING **
 This section loads the XML file from the file system into a table.
 If @DeleteXmlRawDataAfterProcessing is set to 1 (default)
 this XML data will be deleted from the database (but not from the file system) 
 after the data is parsed into a relational table (below). 
 *****/
 DECLARE @FilePath NVARCHAR(512) = @SourcePath + @SiteDirectory + @bslash + 'Badges.xml';
 DECLARE @SQL_OPENROWSET_QUERY NVARCHAR(1024);
 -- Dynamic SQL is used here because OPENROWSET will only accept a string literal as argument for the file path.
 SET @SQL_OPENROWSET_QUERY = 
 'INSERT INTO RawDataXml.Badges (SiteId, ApiSiteParameter, RawDataXml)' + CHAR(10)
 + 'SELECT ' + QUOTENAME(@SiteId, '''') + ', ' + CHAR(10)
 + QUOTENAME(@ApiSiteParameter, '''') + ', ' + CHAR(10)
 + 'CONVERT(XML, BulkColumn) AS BulkColumn' + CHAR(10)
 + 'FROM OPENROWSET(BULK ' + QUOTENAME(@FilePath, '''') + ', SINGLE_BLOB) AS x;'
 PRINT CONVERT(NVARCHAR(256), GETDATE(), 21) + ' Processing ' + @FilePath;
 -- Execute the dynamic query to load XML into the table:
 EXECUTE sp_executesql @SQL_OPENROWSET_QUERY;
 /** XML DATA PARSING & PROCESSING **
 This section parses the loaded XML document into columns and puts those in CleanData.Badges table.
 If previous data existed, that data is deleted prior to adding new data, to avoid duplication of rows
 and ensure a "fresh" set of data.
 *****/
 -- Clear any existing data:
 DELETE FROM CleanData.Badges
 WHERE SiteId = @SiteId;
 -- Prepare XML document for parsing:
 DECLARE @XML AS XML;
 DECLARE @Doc AS INT;
 SELECT @XML = RawDataXml
 FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
 EXEC sp_xml_preparedocument @Doc OUTPUT, @XML;
 -- Parse XML <row> node attributes and insert them into their respective columns:
 INSERT INTO CleanData.Badges (
 SiteId, 
 ApiSiteParameter, 
 RowId, 
 UserId, 
 Name, 
 CreationDate, 
 Class, 
 TagBased
 )
 SELECT 
 @SiteId,
 @ApiSiteParameter,
 Id,
 UserId,
 Name,
 [Date],
 Class,
 CASE
 WHEN LOWER(TagBased) = 'true' THEN 1
 ELSE 0
 END AS TagBased
 FROM OPENXML(@Doc, 'badges/row')
 WITH (
 Id INT '@Id',
 UserId INT '@UserId',
 Name NVARCHAR(256) '@Name',
 [Date] DATETIME2 '@Date',
 Class INT '@Class',
 TagBased NVARCHAR(256) '@TagBased'
 );
 EXEC sp_xml_removedocument @Doc;
 -- Delete the loaded XML file after processing if True/1 (default True):
 IF @DeleteXmlRawDataAfterProcessing = 1
 BEGIN
 DELETE FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
 END
 -- Display/Return results to caller if @ReturnRows is set to True (default False)
 IF @ReturnRows = 1
 BEGIN
 SELECT * FROM CleanData.Badges
 WHERE SiteId = @SiteId
 ORDER BY CreationDate ASC;
 END
END
GO

Example run with stats

Here is an example run for the 4 sites currently in the Globals table.

Note that this is a post-compile run, i.e., it was ran before this run to calculate the execution plan.

DECLARE @Start DATETIME2 = GETDATE();
DECLARE @RowsProcessed INT;
DECLARE @Now DATETIME2;
DECLARE @CurrentSite NVARCHAR(256);
DECLARE _SitesToProcess CURSOR FOR
 SELECT Value 
 FROM RawDataXml.Globals
 WHERE Parameter = 'TargetSite';
OPEN _SitesToProcess;
FETCH NEXT FROM _SitesToProcess INTO @CurrentSite;
WHILE @@FETCH_STATUS = 0
BEGIN
 SET @Now = GETDATE();
 EXECUTE RawDataXml.usp_LoadBadgesXml @CurrentSite;
 PRINT 'Processing time: ' + CAST(DATEDIFF(MILLISECOND, @Now, GETDATE()) AS VARCHAR(20)) +' ms.';
 FETCH NEXT FROM _SitesToProcess INTO @CurrentSite;
END
CLOSE _SitesToProcess;
DEALLOCATE _SitesToProcess;
PRINT 'TOTAL Processing time: ' + CAST(DATEDIFF(MILLISECOND, @Start, GETDATE()) AS VARCHAR(20)) +' ms.';
SELECT * FROM CleanData.Badges ORDER BY CreationDate DESC;

Which prints the following to console, and finally displays the rows parsed from the XML document.

2016年08月31日 00:05:04.983 Processing D:\Downloads\stackexchange\codereview.stackexchange.com\Badges.xml
Processing time: 8060 ms.
2016年08月31日 00:05:13.033 Processing D:\Downloads\stackexchange\meta.codereview.stackexchange.com\Badges.xml
Processing time: 1517 ms.
2016年08月31日 00:05:14.550 Processing D:\Downloads\stackexchange\stats.stackexchange.com\Badges.xml
Processing time: 8120 ms.
2016年08月31日 00:05:22.670 Processing D:\Downloads\stackexchange\meta.stats.stackexchange.com\Badges.xml
Processing time: 1740 ms.
TOTAL Processing time: 19437 ms.
(345368 row(s) affected)

Finally, here is a screenshot of the nontrivial parts of the actual execution plan:

exec-plan

Question 2

The percentages in the part of the execution plan that is displayed are all quite low (combined 5% of the complete batch), suggesting that the more costly operations are not shown in this part. Could you show which statements in the stored procedure have the highest percentage estimates?

Question 3

Readability

Whitespace

Multi-line statements are written without any indentation, which makes them a bit harder to read. Especially if they are also not separated from each other by extra vertical whitespace. This had me squint a bit on

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
DECLARE @Doc AS INT;
SELECT @XML = RawDataXml
FROM RawDataXml.Badges
WHERE SiteId = @SiteId;
EXEC sp_xml_preparedocument @Doc OUTPUT, @XML;

which I would write like

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
DECLARE @Doc AS INT;
SELECT @XML = RawDataXml
 FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
EXEC sp_xml_preparedocument @Doc OUTPUT, @XML;

making sure that only the first line of each statement is fully left-aligned.

Dynamic SQL

Dynamic SQL always adds some readability issues, as most editors won't syntax highlight it, and readers will often have to count single quotes to make sure they are escaped correctly. But it is not necessary to split strings in order to use newlines with CHAR(10). You could rewrite the string like so:

SET @SQL_OPENROWSET_QUERY = 
 'INSERT INTO RawDataXml.Badges (SiteId, ApiSiteParameter, RawDataXml)
 SELECT ' + QUOTENAME(@SiteId, '''') + '
 , ' + QUOTENAME(@ApiSiteParameter, '''') + '
 , CONVERT(XML, BulkColumn) AS BulkColumn
 FROM OPENROWSET( BULK ' + QUOTENAME(@FilePath, '''') + '
 , SINGLE_BLOB ) AS x;'

This will remove many of the distracting +es, CHAR(10) calls and quotes, and bring the SQL code back to a more readable formatting as well.

Performance

Add primary keys or index to all tables

The CleanData.Badges table is actually not a table but a heap. Because it doesn't have a primary key or other clustered index, SQL Server will always have to consult all full rows when operating on the data. The execution plan screenshot actually has a suggestion for this:

Missing Index (Impact 31.7402): CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname, >] ON [CleanData].[Badges] ([SiteId])

Which means that it would help if there were an index on the SiteId column. Indices only work with tables, not heaps, so you would also need to add a primary key (which would probably be (SiteId, RowId), guessing from the example data).

Using `xml.nodes()` instead of `OPENXML`

Note: This is actually only a suggestion based on a hunch I have, based on what I see in the execution plan, not because I know this to have better performance. I do not have a sql server instance at hand to compare performance.

The execution plan shows a 57% cost for "Remote Scan", which is basically "have a remote server give me all their data". Performance of that remote server is a black box, as far as the execution plan is concerned.

A test like used in this answer may show that using @XML.nodes('/badges/row') may not only be less of a hassle with sp_xml_preparedocument, but may also perform better. The code would look like:

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
SELECT @XML = RawDataXml
 FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
-- Parse XML <row> node attributes and insert them into their respective columns:
INSERT INTO CleanData.Badges (
 SiteId, 
 ApiSiteParameter, 
 RowId, 
 UserId, 
 Name, 
 CreationDate, 
 Class, 
 TagBased
 )
 SELECT 
 @SiteId,
 @ApiSiteParameter,
 x.r.value('@Id','INT') as Id,
 x.r.value('@UserId', 'INT') as UserId,
 x.r.value('@Name', 'NVARCHAR(256)') as Name,
 x.r.value('@Date', 'DATETIME2') as [Date],
 x.r.value('@Class', 'INT') as Class,
 CASE
 WHEN LOWER(x.r.value('@TagBased', 'NVARCHAR(256)')) = 'true' THEN 1
 ELSE 0
 END AS TagBased
 FROM @XML.nodes('/badges/row') as x(r);

This answer may also give pointers on how to select values from multiple xml documents combined, allowing you to process RawDataXml.Badges entries from multiple sites as a set.

joranvar 9927 silver badges16 bronze badges · Accepted Answer · 2016-09-24 12:16:11Z

Readability

Whitespace

Multi-line statements are written without any indentation, which makes them a bit harder to read. Especially if they are also not separated from each other by extra vertical whitespace. This had me squint a bit on

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
DECLARE @Doc AS INT;
SELECT @XML = RawDataXml
FROM RawDataXml.Badges
WHERE SiteId = @SiteId;
EXEC sp_xml_preparedocument @Doc OUTPUT, @XML;

which I would write like

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
DECLARE @Doc AS INT;
SELECT @XML = RawDataXml
 FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
EXEC sp_xml_preparedocument @Doc OUTPUT, @XML;

making sure that only the first line of each statement is fully left-aligned.

Dynamic SQL

Dynamic SQL always adds some readability issues, as most editors won't syntax highlight it, and readers will often have to count single quotes to make sure they are escaped correctly. But it is not necessary to split strings in order to use newlines with CHAR(10). You could rewrite the string like so:

SET @SQL_OPENROWSET_QUERY = 
 'INSERT INTO RawDataXml.Badges (SiteId, ApiSiteParameter, RawDataXml)
 SELECT ' + QUOTENAME(@SiteId, '''') + '
 , ' + QUOTENAME(@ApiSiteParameter, '''') + '
 , CONVERT(XML, BulkColumn) AS BulkColumn
 FROM OPENROWSET( BULK ' + QUOTENAME(@FilePath, '''') + '
 , SINGLE_BLOB ) AS x;'

This will remove many of the distracting +es, CHAR(10) calls and quotes, and bring the SQL code back to a more readable formatting as well.

Performance

Add primary keys or index to all tables

The CleanData.Badges table is actually not a table but a heap. Because it doesn't have a primary key or other clustered index, SQL Server will always have to consult all full rows when operating on the data. The execution plan screenshot actually has a suggestion for this:

Missing Index (Impact 31.7402): CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname, >] ON [CleanData].[Badges] ([SiteId])

Which means that it would help if there were an index on the SiteId column. Indices only work with tables, not heaps, so you would also need to add a primary key (which would probably be (SiteId, RowId), guessing from the example data).

Using `xml.nodes()` instead of `OPENXML`

Note: This is actually only a suggestion based on a hunch I have, based on what I see in the execution plan, not because I know this to have better performance. I do not have a sql server instance at hand to compare performance.

The execution plan shows a 57% cost for "Remote Scan", which is basically "have a remote server give me all their data". Performance of that remote server is a black box, as far as the execution plan is concerned.

A test like used in this answer may show that using @XML.nodes('/badges/row') may not only be less of a hassle with sp_xml_preparedocument, but may also perform better. The code would look like:

-- Prepare XML document for parsing:
DECLARE @XML AS XML;
SELECT @XML = RawDataXml
 FROM RawDataXml.Badges
 WHERE SiteId = @SiteId;
-- Parse XML <row> node attributes and insert them into their respective columns:
INSERT INTO CleanData.Badges (
 SiteId, 
 ApiSiteParameter, 
 RowId, 
 UserId, 
 Name, 
 CreationDate, 
 Class, 
 TagBased
 )
 SELECT 
 @SiteId,
 @ApiSiteParameter,
 x.r.value('@Id','INT') as Id,
 x.r.value('@UserId', 'INT') as UserId,
 x.r.value('@Name', 'NVARCHAR(256)') as Name,
 x.r.value('@Date', 'DATETIME2') as [Date],
 x.r.value('@Class', 'INT') as Class,
 CASE
 WHEN LOWER(x.r.value('@TagBased', 'NVARCHAR(256)')) = 'true' THEN 1
 ELSE 0
 END AS TagBased
 FROM @XML.nodes('/badges/row') as x(r);

This answer may also give pointers on how to select values from multiple xml documents combined, allowing you to process RawDataXml.Badges entries from multiple sites as a set.

Stack Exchange Network

Load and parse Stack Exchange data dump XML into DB table

Tables

The procedure

Example run with stats

1 Answer 1

Readability

Whitespace

Dynamic SQL

Performance

Add primary keys or index to all tables

Using `xml.nodes()` instead of `OPENXML`

You must log in to answer this question.

Hot Network Questions

Load and parse Stack Exchange data dump XML into DB table

Tables

The procedure

Example run with stats

1 Answer 1

Readability

Whitespace

Dynamic SQL

Performance

Add primary keys or index to all tables

Using xml.nodes() instead of OPENXML

You must log in to answer this question.

Related

Hot Network Questions

Using `xml.nodes()` instead of `OPENXML`