I am working to increase the performance of bulk loads: hundreds of millions of records per day.
I moved this over to use the IDataReader interface in lieu of DataTables and got a noticeable performance boost (500,000 more records a minute). The current setup is:
- A custom cached reader to parse the delimited files.
- Wrapping the stream reader in a buffered stream.
- A custom object reader class that enumerates over the objects and implements the IDataReader interface.
- Then SqlBulkCopy writes to the server (a stripped-down sketch of this pipeline is below).
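For reference, this is roughly the shape of the object reader; it is only an illustrative sketch of the approach, not my actual class (real column/type handling trimmed out, header row and single-character delimiter assumed):

```csharp
using System;
using System.Data;
using System.IO;

// Pull-based IDataReader over a delimited file, suitable for SqlBulkCopy.
// Everything is surfaced as strings; SqlBulkCopy converts to the target types.
public sealed class DelimitedFileReader : IDataReader
{
    private readonly StreamReader _reader;
    private readonly char _delimiter;
    private readonly string[] _columns;
    private string[] _current;

    public DelimitedFileReader(string path, char delimiter)
    {
        // StreamReader wrapped in a BufferedStream, as in the real setup.
        _reader = new StreamReader(new BufferedStream(File.OpenRead(path), 1 << 20));
        _delimiter = delimiter;
        _columns = _reader.ReadLine().Split(delimiter); // header row
    }

    // The members SqlBulkCopy actually exercises for a plain IDataReader.
    public int FieldCount { get { return _columns.Length; } }

    public bool Read()
    {
        string line = _reader.ReadLine();
        if (line == null) return false;
        _current = line.Split(_delimiter);
        return true;
    }

    public object GetValue(int i) { return _current[i]; }
    public int GetOrdinal(string name) { return Array.IndexOf(_columns, name); }
    public string GetName(int i) { return _columns[i]; }
    public int GetValues(object[] values) { _current.CopyTo(values, 0); return _current.Length; }
    public bool IsDBNull(int i) { return string.IsNullOrEmpty(_current[i]); }

    public void Close() { _reader.Dispose(); }
    public void Dispose() { Close(); }
    public bool IsClosed { get { return false; } }
    public int Depth { get { return 0; } }
    public int RecordsAffected { get { return -1; } }
    public bool NextResult() { return false; }
    public DataTable GetSchemaTable() { return null; }

    // The remaining IDataRecord members are not hit on this load path.
    public object this[int i] { get { return GetValue(i); } }
    public object this[string name] { get { return GetValue(GetOrdinal(name)); } }
    public string GetString(int i) { return _current[i]; }
    public Type GetFieldType(int i) { return typeof(string); }
    public string GetDataTypeName(int i) { return "nvarchar"; }
    public bool GetBoolean(int i) { throw new NotSupportedException(); }
    public byte GetByte(int i) { throw new NotSupportedException(); }
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public char GetChar(int i) { throw new NotSupportedException(); }
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) { throw new NotSupportedException(); }
    public IDataReader GetData(int i) { throw new NotSupportedException(); }
    public DateTime GetDateTime(int i) { throw new NotSupportedException(); }
    public decimal GetDecimal(int i) { throw new NotSupportedException(); }
    public double GetDouble(int i) { throw new NotSupportedException(); }
    public float GetFloat(int i) { throw new NotSupportedException(); }
    public Guid GetGuid(int i) { throw new NotSupportedException(); }
    public short GetInt16(int i) { throw new NotSupportedException(); }
    public int GetInt32(int i) { throw new NotSupportedException(); }
    public long GetInt64(int i) { throw new NotSupportedException(); }
}
```

SqlBulkCopy just calls Read and GetValue row by row against this, so nothing is materialized up front.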
The bulk of the performance bottleneck is directly in SqlBulkCopy.WriteToServer. If I unit test the process up to, but excluding, the WriteToServer call, it returns in roughly 1 minute; WriteToServer takes an additional 15+ minutes. The unit test runs on my local machine, on the same drive the database lives on, so it is not copying the data across the network.
I am using a heap table (no indexes, clustered or nonclustered), and I have played around with various batch sizes without major differences in performance.
There is a need to decrease the load times, so I am hoping someone might know a way to squeeze a little more blood out of this turnip.
- Are you running release mode? What's the bitness? Have you profiled? During the 15+ minutes, is the CPU load mostly in the server or in the client? How much data are you actually copying? What settings did you use for the bulk copy? What's the I/O load during the copy? – Eamon Nerbonne, Mar 20, 2013 at 14:51
- Yes, this is in release mode. 32-bit on my local machine where I am unit testing. The client is running on the same box as the db. Holding steady at roughly 12% CPU usage for the process and 80k of memory. For this unit test, 12 million records. I used the table lock setting. Good idea, I will run the SQL profiler on this process and see if it turns anything up. – Michael S, Mar 20, 2013 at 15:02
- I mean the CPU load on the client process vs. the server process; and you should look at a CPU profiler for the client, which might tell you more than the SQL profiler here - it sounds like the SQL side is very simple (which is good). 12% CPU load - that might be 1 core of an 8-core machine maxed out - is it? How many bytes is one record? Oh, finally, are you running without a debugger attached (possible even in release mode)? – Eamon Nerbonne, Mar 20, 2013 at 16:07
- To help benchmark, you might want to include the custom data reader in your pre-WriteToServer benchmark. Since data readers are pull-based (i.e. lazy), it's likely that by excluding WriteToServer you're also excluding possible perf bottlenecks in the custom data reader. – Eamon Nerbonne, Mar 20, 2013 at 16:15
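For example, something along these lines inside the unit test forces the lazy parsing to actually run (the reader class is the sketch from the question; the path and delimiter are placeholders):

```csharp
using System;
using System.Data;
using System.Diagnostics;

// Benchmark sketch: drain the custom IDataReader without SqlBulkCopy so the
// pull-based parsing actually executes; otherwise the pre-WriteToServer
// benchmark measures almost nothing.
class ReaderBenchmark
{
    static void Main()
    {
        using (IDataReader reader = new DelimitedFileReader(@"C:\data\load.txt", '|'))
        {
            var sw = Stopwatch.StartNew();
            long rows = 0;
            var values = new object[reader.FieldCount];
            while (reader.Read())
            {
                reader.GetValues(values); // touch every field so no work is skipped
                rows++;
            }
            Console.WriteLine("{0} rows read in {1}", rows, sw.Elapsed);
        }
    }
}
```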
2 Answers
Why not use SSIS directly?
Anyway, if you stream from parsing straight into the IDataReader you're already on the right path. To optimize SqlBulkCopy itself you need to turn your focus to SQL Server. The key is minimally logged operations. You must read these MSDN articles:
If your target is a B-Tree (i.e. a table with a clustered index), unfortunately one of the most important tenets of performant bulk insert, namely the sorted-input rowset, cannot be declared. It is as simple as this: ADO.NET SqlClient does not have the equivalent of SSPROP_FASTLOADOPTIONS -> ORDER(Column) (OleDb). Since the engine does not know that the data is already sorted, it will add a Sort operator to the plan, which is not that bad except when it spills. To avoid spills, use a small batch size (~10k). See my original point: all these are just options and clicks to set in SSIS rather than digging through the OleDb MSDN spec...
If your data stream is unsorted to start with, or the destination is a heap, then my point above is moot.
However, achieving minimal logging is still a must for decent performance.
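On the client side there isn't much to set; a sketch of the relevant SqlBulkCopy knobs (connection string, table name and reader are placeholders):

```csharp
using System.Data.SqlClient;

// Placeholder names throughout; the two options are the point.
// TableLock requests the bulk-update lock that minimally logged loads into a
// heap need (the database must also be in SIMPLE or BULK_LOGGED recovery).
// A small BatchSize (~10k) keeps the Sort from spilling when the destination
// has a clustered index.
using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
{
    bulk.DestinationTableName = "dbo.TargetTable";
    bulk.BatchSize = 10000;
    bulk.BulkCopyTimeout = 0;     // disable the 30-second default timeout
    bulk.WriteToServer(reader);   // reader: the streaming IDataReader
}
```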
3 Comments
I was having an issue with SqlBulkCopy timing out. When using VarChar/NVarChar, performance is significantly reduced. I changed the data type to Text, and it processed instantly. Specifically, I had a record with two fields that were VarChar(999). I changed these fields' data type to Text and inserted a record with these two fields, each containing 100,000 characters. I'm not sure why the Text data type helps with performance; I suspect it has to do with how SQL Server stores the text in memory.