SQL Server 2019: Memory performance with graph queries (possible memory leak)

Question 1

I'm currently working on implementing an upgrade to SQL Server 2019 in order to make use of the graph features that are available in it. Our database stores records of files and their children, and the graph features allow us to quickly find all of a file's relations in either direction. Our current dev environment is using SQL Server 2019 Standard (15.0.4023.6) on a Linux server.

I'm noticing a concerning problem when I run my graph queries. The server's 'internal' resource pool appears to not free up all resources after a graph query. Left unchecked, this fills up the resource pool. Larger queries will fail until the SQL Server process to be restarted. Depending on server load, this could happen in as little as 1-2 hours. This can also fill up the tempdb and threaten to fill the storage drive. The files for the tempdb also cannot be shrunk/truncated significantly until the server is restarted. In configuration, 'memory.memorylimitmb' is not set, so this problem happens when the resource pool starts to have used the better part of the default 80% of system memory (12.8 GB, with 16GB of system memory)

To set up the tables within a demo database:

CREATE TABLE FileNode (ID BIGINT NOT NULL CONSTRAINT PK_FileNode PRIMARY KEY) AS NODE
GO
CREATE TABLE FileNodeArchiveEdge AS EDGE
GO
CREATE INDEX [IX_FileNodeArchiveEdge_ChildFile] ON [dbo].[FileNodeArchiveEdge] ($from_id)
GO
CREATE INDEX [IX_FileNodeArchiveEdge_ParentFile] ON [dbo].[FileNodeArchiveEdge] ($to_id)
GO

To populate the demo database tables:

INSERT INTO [FileNode] (ID) VALUES
 (1),(2),(3),(4),(5),
 (6),(7),(8),(9),(10),
 (11),(12),(13),(14),(15)
-- Convenient intermediate table
DECLARE @bridge TABLE (f BIGINT, t BIGINT)
INSERT INTO @bridge (f, t) VALUES
 (1,4),
 (4,9),
 (4,10),
 (1,5),
 (5,11),
 (11,12),
 (2,5),
 (2,6),
 (6,13),
 (6,14),
 (13,15),
 (14,15),
 (15,12),
 (7,14),
 (3,7),
 (3,8)
INSERT INTO FileNodeArchiveEdge
($from_id, $to_id)
SELECT 
 (SELECT $node_id FROM FileNode WHERE ID = f),
 (SELECT $node_id FROM FileNode WHERE ID = t)
FROM @bridge

To get all child IDs of a file (repeating this same query will eat up memory resources and cause 'USERSTORE_SCHEMAMGR' to grow uncontrollably):

DECLARE @parentId BIGINT = 1
SELECT 
 LAST_VALUE(f2.ID) WITHIN GROUP (GRAPH PATH)
FROM
 FileNode f1,
 FileNodeArchiveEdge FOR PATH contains_file,
 FileNode FOR PATH f2
WHERE
 f1.ID = @parentId
 AND MATCH(SHORTEST_PATH(f1(-(contains_file)->f2)+))

Re-running the provided query to retrieve all child nodes of a particular file will eventually see the 'USERSTORE_SCHEMAMGR' memory clerk type eat up the bulk of used resources. The tempdb will also grow uncontrollably.

The demonstration database queries are too small to trigger an explicit message reporting that the 'internal' resource pool. However, running a larger query on the same server should trigger the warning, and performance should still be impacted.

The following queries are useful for monitoring server performance:

-- Memory clerk usage
SELECT TOP(10) mc.[type] AS [Memory Clerk Type],
 CAST((SUM(mc.pages_kb)/1024.0) AS DECIMAL (15,2)) AS [Memory Usage (MB)]
FROM sys.dm_os_memory_clerks AS mc WITH (NOLOCK)
GROUP BY mc.[type]
ORDER BY SUM(mc.pages_kb) DESC OPTION (RECOMPILE);
/*
Example output of above query:
Memory Clerk Type Memory Usage (MB)
------------------------------------------------------------ -----------------
USERSTORE_SCHEMAMGR 9224.26
MEMORYCLERK_SQLSTORENG 1114.73
MEMORYCLERK_SQLBUFFERPOOL 471.50
CACHESTORE_SEHOBTCOLUMNATTRIBUTE 376.47
MEMORYCLERK_SOSNODE 292.02
MEMORYCLERK_SQLGENERAL 19.84
MEMORYCLERK_SQLCLR 12.04
MEMORYCLERK_SQLQUERYPLAN 2.99
MEMORYCLERK_SQLLOGPOOL 2.61
MEMORYCLERK_SQLTRACE 2.14
*/
SELECT cache_memory_kb/1024.0 AS [cache_memory_MB],compile_memory_kb/1024 AS compile_memory_MB, used_memory_kb/1024.0 AS [used_memory_MB] FROM sys.dm_resource_governor_resource_pools

To rapidly query the server and eat up resources, I'm using the following BASH loop:

l=1000 # Number of loops
# The loop will probably need to be run 2M times or so to start to see significant usage.
c=0 # Loop tracker
touch marker # Alternate emergency stop: Remove the marker file from another terminal session.
time while [ $c -lt $l ] && [ -f "marker" ]; do
 c="$((${c}+1))"
 echo ${c}/${l}
 # Notes: SQLCMDPASSWORD has been set in environment variable
 # child-query.sql contains the above child query to loop for the children of file ID 1.
 time sqlcmd -U db_user -S localhost -d DemoDatabase -i child-query.sql > /dev/null || break
done
rm marker

The DROPCLEANBUFFERS/FREEPROCCACHE/FLUSHPROCINDB DBCC commands complete successfully, but don't appear to have an effect.

Is there a configuration or procedure that could solve this problem, or is this a fundamental server problem?

Question 2

I'm able to reproduce this on Windows as well. Do you have a support contract with Microsoft? I would recommend opening a case, as this appears to be a bug. I was unable to clear that cache with DBCC FREESYSTEMCACHE ('ALL'); as well.

Question 3

I am a Program Manager with the SQL team at Microsoft, overseeing the SQL Graph functionality. Thank you for the details, and sincere apologies for the issue. ~~(削除) This issue has been escalated to our engineering team and is being investigated on a priority basis. We will keep you posted on the outcome of the investigation. (削除ここまで)~~

UPDATED August 4th, 2021: This issue has been fixed in SQL Server 2019 CU12.

~~(削除) P.S. in the meantime, (削除ここまで)~~As a reference, here are some tips on how you can also do some further self-investigation (if you'd like) on what's happening under the hood. Firstly, any memory allocation in SQL Server can generally be traced in detail down to a call stack level, using the Extended Events (XE) mechanism. The first step is to define an XE session, filtering on the specific memory clerk name, which in this case is USERSTORE_SCHEMAMGR:

CREATE EVENT SESSION [TraceMemObj] ON SERVER 
ADD EVENT sqlos.page_allocated(
 ACTION(package0.callstack)
 WHERE (memory_clerk_name = 'USERSTORE_SCHEMAMGR')
 )
,
ADD EVENT sqlos.page_freed(
 ACTION(package0.callstack)
 WHERE (memory_clerk_name = 'USERSTORE_SCHEMAMGR')
 )
ADD TARGET package0.histogram 
(SET source_type=1,
 source=N'package0.callstack')
WITH (MAX_MEMORY=32768 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=PER_CPU,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)

Start the Extended Events session:

ALTER EVENT SESSION [TraceMemObj] on server state = start

Then, wait for problem to repro, and then run the below query. Click on the XML output from the query.

SELECT event_session_address,
 target_name,
 execution_count,
 CAST (target_data AS XML) AS MemObjData
FROM sys.dm_xe_session_targets AS xst
 INNER JOIN
 sys.dm_xe_sessions AS xs
 ON (xst.event_session_address = xs.address)
WHERE xs.name = 'TraceMemObj';

Using SQLCallStackResolver the above raw call stacks can be resolved to a more readable version. The readable call stacks will give you an idea where the majority of the allocations are originating from.

Be sure to stop the XE session after a minute or two of the problem reproducing:

ALTER EVENT SESSION [TraceMemObj] ON SERVER 
STATE = STOP;

In this case, for the reproduction scenario you have shared, you will notice some functions on the top allocation related call stacks, which seem to indicate a correlation to SQL Graph query execution.

Arvind Shyamsundar Arvind Shyamsundar 861 silver badge4 bronze badges · Answer 1 · 2021-06-26 00:50:15Z

I am a Program Manager with the SQL team at Microsoft, overseeing the SQL Graph functionality. Thank you for the details, and sincere apologies for the issue. ~~(削除) This issue has been escalated to our engineering team and is being investigated on a priority basis. We will keep you posted on the outcome of the investigation. (削除ここまで)~~

UPDATED August 4th, 2021: This issue has been fixed in SQL Server 2019 CU12.

~~(削除) P.S. in the meantime, (削除ここまで)~~As a reference, here are some tips on how you can also do some further self-investigation (if you'd like) on what's happening under the hood. Firstly, any memory allocation in SQL Server can generally be traced in detail down to a call stack level, using the Extended Events (XE) mechanism. The first step is to define an XE session, filtering on the specific memory clerk name, which in this case is USERSTORE_SCHEMAMGR:

CREATE EVENT SESSION [TraceMemObj] ON SERVER 
ADD EVENT sqlos.page_allocated(
 ACTION(package0.callstack)
 WHERE (memory_clerk_name = 'USERSTORE_SCHEMAMGR')
 )
,
ADD EVENT sqlos.page_freed(
 ACTION(package0.callstack)
 WHERE (memory_clerk_name = 'USERSTORE_SCHEMAMGR')
 )
ADD TARGET package0.histogram 
(SET source_type=1,
 source=N'package0.callstack')
WITH (MAX_MEMORY=32768 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=5 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=PER_CPU,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)

Start the Extended Events session:

ALTER EVENT SESSION [TraceMemObj] on server state = start

Then, wait for problem to repro, and then run the below query. Click on the XML output from the query.

SELECT event_session_address,
 target_name,
 execution_count,
 CAST (target_data AS XML) AS MemObjData
FROM sys.dm_xe_session_targets AS xst
 INNER JOIN
 sys.dm_xe_sessions AS xs
 ON (xst.event_session_address = xs.address)
WHERE xs.name = 'TraceMemObj';

Using SQLCallStackResolver the above raw call stacks can be resolved to a more readable version. The readable call stacks will give you an idea where the majority of the allocations are originating from.

Be sure to stop the XE session after a minute or two of the problem reproducing:

ALTER EVENT SESSION [TraceMemObj] ON SERVER 
STATE = STOP;

In this case, for the reproduction scenario you have shared, you will notice some functions on the top allocation related call stacks, which seem to indicate a correlation to SQL Graph query execution.

Stack Exchange Network

SQL Server 2019: Memory performance with graph queries (possible memory leak)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

SQL Server 2019: Memory performance with graph queries (possible memory leak)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions